How to Create Lemmatized (French) Text for Topic Modeling
rss:link
https://dragonfly.hypotheses.org/648
rss:description
It would not, some years ago, have occurred to me that anyone would want to reduce literary texts to the following pitiful state: “me avoir tu faire un rapport bien sincère / ne déguiser tu rien de ce que avoir dire mon père / tout mon sens à moi-même en être encor charmer : / il estimer Rodrigue autant que...
content:encoded
<p><div id="attachment_659" style="width: 248px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2014/06/corneille21.jpg"><img aria-describedby="caption-attachment-659" loading="lazy" class="size-medium wp-image-659" src="http://dragonfly.hypotheses.org/files/2014/06/corneille21-238x300.jpg" alt="From Wikipedia; source: Bibliothèque nationale de France. http://commons.wikimedia.org/wiki/File:Gravure_Pierre_Corneille.jpg (public domain)" width="238" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2014/06/corneille21-238x300.jpg 238w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2014/06/corneille21-396x500.jpg 396w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2014/06/corneille21.jpg 476w" sizes="(max-width: 238px) 100vw, 238px" /></a><p id="caption-attachment-659" class="wp-caption-text">“Gravure représantant Pierre Corneille.” From Wikipedia; source: Bibliothèque nationale de France. http://commons.wikimedia.org/wiki/File:Gravure_Pierre_Corneille.jpg (public domain).</p></div></p>
<p>It would not, some years ago, have occurred to me that anyone would want to reduce literary texts to the following pitiful state: “me avoir tu faire un rapport bien sincère / ne déguiser tu rien de ce que avoir dire mon père / tout mon sens à moi-même en être encor charmer : / il estimer Rodrigue autant que vous la|le aimer.” In case you’re wondering what that is supposed to be, it is a deformed version of the first four lines of <a href="https://en.wikipedia.org/wiki/Le_Cid">Pierre Corneille’s famous 1637 tragicomedy “Le Cid”</a>, deformed in the sense of being a lemmatized version of the text. And if you now apply a stopword list, it gets even worse: “rapport sincère déguiser rien dire père tout sens charmer estimer aimer.” <span id="more-648"></span></p>
<p>I’m certainly not saying that this is the best form this text could take if you’re interested in reading this play (in that case, you’d do better to go, for instance, for the <a href="http://www.theatre-classique.fr/pages/programmes/edition.php?t=../documents/CORNEILLEP_CID.xml" target="_blank">Théâtre classique</a> version of the play). But if you would like to get a sense of the themes running through a collection of 600 plays spanning the years 1630 to 1780, then this is a very worthwhile representation of the text, focused entirely on the base forms of content-bearing words.</p>
<p>This post won’t manage to go all the way to the trends over time and genre which you can discover using topic modeling in this way. It will simply show how to create lemmatized text in a form that is useful as input for topic modeling with <a href="http://mallet.cs.umass.edu/topics.php" target="_blank">Mallet</a>. Basically, two steps are involved. The first is running your texts through <a href="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/" target="_blank">TreeTagger</a>, a tool which conducts <a href="https://en.wikipedia.org/wiki/Tokenization" target="_blank">tokenization</a>, <a href="https://en.wikipedia.org/wiki/Lemmatisation" target="_blank">lemmatization</a> and <a href="https://en.wikipedia.org/wiki/Pos-tagging" target="_blank">part-of-speech tagging</a> for you. The tool was developed almost 20 years ago by Helmut Schmid and is still one of the most solid options around, especially when you need models for languages other than English; besides French, TreeTagger also provides models for German, Spanish, Italian, Estonian, Swahili, Polish, Mongolian, and quite a few more. The second step is transforming the TreeTagger output into a format Mallet can usefully deal with, a task which can, for example, be accomplished using Python.</p>
<p>Using TreeTagger is quite straightforward. You can follow the installation instructions on the <a href="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/">website</a>. The website falls silent on how to actually use the tool, but your favorite search engine may come up with some suggestions (I was not particularly impressed with what I found in a quick search, however; maybe some of you know a good tutorial?). For very basic use, you just need to know how to call TreeTagger from the command line interface. At the simplest level, the syntax is as follows. First, you move to the directory where you installed TreeTagger; in my case:</p>
<pre>> cd /home/christof/programs/treetagger</pre>
<p>Then, you call the TreeTagger function appropriate to your language and indicate the (existing) inputfile, which should just be plain text, and the outputfile, which is to be created by TreeTagger:</p>
<pre>> cmd/tree-tagger-french-utf8 inputfile.txt > outputfile.txt</pre>
<p>Because you will most likely not want to do this with just one file, but several (in my case, the above-mentioned 600 plays), you need to write a little “for loop” for the command line which will iterate through all the files in a given folder (in my case, a folder called “input” and located in the TreeTagger directory) and apply the TreeTagger function to each of them. For the Linux bash, this looks as follows:</p>
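<pre>#!/bin/bash
# Run from the TreeTagger directory so that the relative path cmd/... resolves.
for file in /home/christof/programs/treetagger/input/*
do
    # Quote "$file" so that file names containing spaces are handled correctly.
    cmd/tree-tagger-french-utf8 "$file" > "$file.trt"
done</pre>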
<p>You can save this little command as a text file (called, for example, “treetagger-loop.sh”), make sure it is placed right in the TreeTagger directory, make it executable as a program, and then run it from the command line by writing:</p>
<pre>> bash treetagger-loop.sh</pre>
<p>Now the text in each of the files in the “input” folder will be tagged and lemmatized, and a new file with the (fantasy) extension “.trt” will be created for each of them.</p>
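<p>If you are not working in bash, the same loop can be sketched in Python. This is only a minimal sketch under a few assumptions: it is run from the TreeTagger directory, the tagger script lives at “cmd/tree-tagger-french-utf8” as above, and the function names <code>pending_inputs</code> and <code>tag_folder</code> are hypothetical names chosen for illustration. Unlike the bash loop, it skips files that already end in “.trt”, so that a second run does not tag TreeTagger’s own output.</p>

```python
import subprocess
from pathlib import Path


def pending_inputs(input_dir):
    """Return the files in input_dir that still need tagging.

    Files ending in ".trt" are skipped, because they are assumed to be
    TreeTagger output from an earlier run.
    """
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix != ".trt"
    )


def tag_folder(input_dir, tagger="cmd/tree-tagger-french-utf8"):
    """Run the tagger script on every pending file, writing <name>.trt."""
    for path in pending_inputs(input_dir):
        target = path.with_name(path.name + ".trt")
        with open(target, "wb") as out:
            # Same effect as: cmd/tree-tagger-french-utf8 file > file.trt
            subprocess.run([tagger, str(path)], stdout=out, check=True)
```

<p>Calling <code>tag_folder("input")</code> from the TreeTagger directory should then produce one “.trt” file per input file, provided TreeTagger is installed there.</p>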
<p>This is all well and good, but TreeTagger produces an output that is much richer than what you need for topic modeling with lemmatized text. The output looks as follows:</p>
<pre>Cette PRO:DEM ce
Philaminte NOM &lt;unknown&gt;
est VER:pres être
belle ADJ beau
sans PRP sans
doute NOM doute
? SENT ?</pre>
<p>You can clearly see how the first column holds the original word forms, the second column holds the POS-tags and the third column holds the lemmas. But this is not exactly what Mallet needs! This is where Python steps in. The little script I created for this can be found on <a href="https://github.com/clgs/toolbox/tree/master/trt2txt" target="_blank">GitHub</a> along with some sample files; note that it expects just the kind of input produced by TreeTagger, including the “*.trt” extension I invented. This script pulls just the lemmas from the third column of this table, skipping punctuation in the process, and for each text, writes the lemmas to a new plain text file. That is the kind of output I quoted at the beginning of this post. And this is just what you may want to use as input to Mallet. [Edit: In order to use the script, you will need to <a href="https://docs.python.org/3/using/index.html" target="_blank">have Python 3 installed</a>. The easiest way to run the script is to launch it with the “run” command from IDLE (which comes packaged with Python) or from an editor like <a href="http://www.geany.org/" target="_blank">Geany</a>. In one directory, there should be the “trt2txt.py” script itself as well as the output files from TreeTagger in a folder called “trt”. Create two subfolders, one called “csv” and one called “txt”, so the script can place the resulting files there. Now you’re ready to run the script and you will get files with lemmatized plain text.]</p>
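<p>To make the idea concrete, here is a minimal sketch of the extraction step. It is not the script from GitHub, just an illustration under a few assumptions: the columns in the “.trt” files are separated by tabs, punctuation carries one of the tags “PUN”, “PUN:cit” or “SENT”, and words that TreeTagger marks as unknown are kept in their original form. The function name <code>lemmas_from_trt</code> is hypothetical.</p>

```python
# POS tags assumed to mark punctuation in the French TreeTagger output.
PUNCTUATION_TAGS = {"PUN", "PUN:cit", "SENT"}


def lemmas_from_trt(trt_text):
    """Return the lemmas (third column) of TreeTagger output as a list."""
    lemmas = []
    for line in trt_text.splitlines():
        columns = line.split("\t")
        if len(columns) != 3:
            continue  # skip blank or malformed lines
        token, pos, lemma = columns
        if pos in PUNCTUATION_TAGS:
            continue  # skip punctuation
        # Assumption: keep the original word form when the lemma is unknown.
        if lemma == "<unknown>":
            lemma = token
        lemmas.append(lemma)
    return lemmas


sample = (
    "Cette\tPRO:DEM\tce\n"
    "Philaminte\tNOM\t<unknown>\n"
    "est\tVER:pres\têtre\n"
    "belle\tADJ\tbeau\n"
    "sans\tPRP\tsans\n"
    "doute\tNOM\tdoute\n"
    "?\tSENT\t?\n"
)
print(" ".join(lemmas_from_trt(sample)))  # ce Philaminte être beau sans doute
```

<p>Ambiguous lemmas such as “la|le” are passed through unchanged in this sketch; whether to keep, split or drop them is a decision you would have to make for your own corpus.</p>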
<p>Although this post is intended to stop here, let me just formulate a word of warning. There are some good arguments for using lemmatized texts as the basis for topic modeling. A practical argument is that it becomes much easier to manage the stopword list, because with one entry for a given lemma, you cover all the derived forms. A more important argument is that because you merge all forms of a given lemma into one, the semantic information contained in the lemma is not (somewhat arbitrarily?) spread out between the different word forms derived from the lemma, so that the same semantic information is assembled in one place, potentially making topic modeling more accurate. Also, some ambiguities may be resolved, like the one between “(je) sens” and “(les) sens”, i.e. between the verb “to feel” and the noun “the senses”.</p>
<p>But there are also some downsides. First of all, new ambiguities arise: for example, TreeTagger does not know whether “je suis” should be rendered as “suivre” (to follow) or “être” (to be). The same is true for “l'”, which could be “la” or “le”, something which appeared in the example from <em>Le Cid</em> but is less of a problem, of course, when you’re discarding articles anyway. And the topics change, too: one strong effect for a language like French, and I suppose for other highly inflected languages, is that all of a sudden, the verbs become very important in the topics. For example, the first 20 words of a nice topic like “criminal emotions”, strongly related to tragedy, are:</p>
<pre>crime voir venger punir croire traître cœur trahir lâche ingrat
haine âme coupable oser courroux colère accuser craindre souffrir foi</pre>
<p>Nothing wrong with that, of course. However, this in turn leads to some tough decisions that need to be made for the stopword list: sure, auxiliaries like “avoir” and “être” should be excluded, and maybe also “faire” and “aller”, but what about “savoir” and “pouvoir”, highly frequent and flexible verbs which, at the same time, have important semantics related to knowledge and power? Whatever the answer to this may be, this post was simply intended to provide you with the tools to get to the point where you need to make these kinds of decisions, and find out for yourselves.</p>
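<p>If you want to experiment with such decisions before handing the texts to Mallet, applying a stopword list to lemmatized text takes only a few lines of Python. The following is a minimal sketch; the tiny stopword set and the function name <code>remove_stopwords</code> are hypothetical and only serve to illustrate the principle that one entry per lemma covers all of its word forms.</p>

```python
# Hypothetical, deliberately tiny stopword list of lemmas.
# A real list would be much longer and tuned to the corpus.
STOPWORDS = {"me", "tu", "un", "bien", "avoir", "être", "faire", "aller"}


def remove_stopwords(lemmas, stopwords=STOPWORDS):
    """Keep only the lemmas that are not on the stopword list."""
    return [lemma for lemma in lemmas if lemma not in stopwords]


line = "me avoir tu faire un rapport bien sincère".split()
print(" ".join(remove_stopwords(line)))  # rapport sincère

# Trying out a stricter list that also drops "savoir" and "pouvoir":
stricter = STOPWORDS | {"savoir", "pouvoir"}
print(remove_stopwords(["savoir", "crime", "pouvoir"], stricter))  # ['crime']
```

<p>Comparing the topics obtained with and without “savoir” and “pouvoir” on the list is one way of finding out which choice suits your corpus.</p>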