"It would not, some years ago, have occurred to me that anyone would want to reduce literary texts to the following pitiful state: “me avoir tu faire un rapport bien sinc\u00E8re / ne d\u00E9guiser tu rien de ce que avoir dire mon p\u00E8re / tout mon sens \u00E0 moi-m\u00EAme en \u00EAtre encor charmer : / il estimer Rodrigue autant que..." . "Python" . "How to Create Lemmatized (French) Text for Topic Modeling" . "TreeTagger" . . "Tools" . "Christof Sch\u00F6ch" . "https://dragonfly.hypotheses.org/648" . "lemmatization" . "2014-05-31T19:40:53Z" . "My research" . "french literature" . "topic modeling" . "

“Gravure représantant Pierre Corneille.” From Wikipedia; source: Bibliothèque nationale de France. http://commons.wikimedia.org/wiki/File:Gravure_Pierre_Corneille.jpg (public domain).

It would not, some years ago, have occurred to me that anyone would want to reduce literary texts to the following pitiful state: “me avoir tu faire un rapport bien sincère / ne déguiser tu rien de ce que avoir dire mon père / tout mon sens à moi-même en être encor charmer : / il estimer Rodrigue autant que vous la|le aimer.” In case you’re wondering what that is supposed to be, it is a deformed version of the first four lines of Pierre Corneille’s famous 1637 tragicomedy “Le Cid”, deformed in the sense of being a lemmatized version of the text. And if you now apply a stopword list, it gets even worse: “rapport sincère déguiser rien dire père tout sens charmer estimer aimer.”

I’m certainly not saying that this is the best form this text could take if you’re interested in reading this play (in that case, you’d better go, for instance, for the Théâtre classique version of the play). But if you would like to get a sense of the themes running through a collection of 600 plays spanning the years 1630 to 1780, then this is a very worthwhile representation of the text, focused entirely on the base forms of content-bearing words.

This post won’t manage to go all the way to the trends over time and genre which you can discover using topic modeling in this way. It will simply show how to create lemmatized text in a form that is useful as input for topic modeling with Mallet. Basically, two steps are involved. The first is running your texts through TreeTagger, a tool which conducts tokenization, lemmatization and part-of-speech tagging for you. The tool was developed almost 20 years ago by Helmut Schmid and is still one of the most solid options around, especially when you need models for languages other than English; besides French, TreeTagger also provides models for German, Spanish, Italian, Estonian, Swahili, Polish, Mongolian, and quite a few more. The second step is transforming the TreeTagger output into a format Mallet can usefully deal with, a task which can, for example, be accomplished using Python.

Using TreeTagger is quite straightforward. You can follow the installation instructions on the website. The website falls silent on how to actually use the tool, but your favorite search engine may come up with some suggestions (I was not particularly impressed with what I found in a quick search, however; maybe some of you know a good tutorial?). For very basic use, you just need to know how to call TreeTagger from the command line interface. At the simplest level, the syntax is as follows. First, you move to the directory where you installed TreeTagger; in my case:

> cd /home/christof/programs/treetagger

Then, you call the TreeTagger function appropriate to your language and indicate the (existing) inputfile, which should just be plain text, and the outputfile, which is to be created by TreeTagger:

> cmd/tree-tagger-french-utf8 inputfile.txt > outputfile.txt

Because you will most likely not want to do this with just one file, but several (in my case, the above-mentioned 600 plays), you need to write a little “for loop” for the command line which will iterate through all the files in a given folder (in my case, a folder called “input” and located in the TreeTagger directory) and apply the TreeTagger function to each of them. For the Linux bash, this looks as follows:

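#!/bin/bash
# Tag every file in the "input" folder; run this from the TreeTagger directory.
for file in /home/christof/programs/treetagger/input/*
do
  # The quotes keep file names containing spaces intact.
  cmd/tree-tagger-french-utf8 "$file" > "$file.trt"
done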

You can save this little command as a text file (called, for example, “treetagger-loop.sh”), make sure it is placed right in the TreeTagger directory, make it executable as a program, and then run it from the command line by writing:

> bash treetagger-loop.sh

Now the text in each of the files in the “input” folder will be lemmatized, and a new file with the (fantasy) extension “.trt” will be created for each of them.
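If you would rather stay in Python for this step as well, the same loop can be sketched with the standard library. This is only an illustration, not part of the workflow described above: the function name `tag_folder` and the restriction to `*.txt` files are assumptions made for this sketch, and the tagger command is passed in as a list so that you can adapt it to your own installation.

```python
import subprocess
from pathlib import Path


def tag_folder(input_dir, tagger_cmd):
    """Run a tagger command on every .txt file in input_dir.

    tagger_cmd is a list such as ["cmd/tree-tagger-french-utf8"]; the path of
    each input file is appended to it. Whatever the tagger prints to standard
    output is written to a new file with ".trt" added to the input file name.
    """
    created = []
    for txt_path in sorted(Path(input_dir).glob("*.txt")):
        trt_path = txt_path.with_name(txt_path.name + ".trt")
        with open(trt_path, "wb") as out:
            # check=True raises an error if the tagger exits with a failure code
            subprocess.run(tagger_cmd + [str(txt_path)], stdout=out, check=True)
        created.append(trt_path)
    return created


# Example call, to be run from the TreeTagger directory:
# tag_folder("input", ["cmd/tree-tagger-french-utf8"])
```

Because only `*.txt` files are picked up, running the function a second time does not tag the “.trt” files again.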

This is all well and good, but TreeTagger produces an output that is much richer than what you need for topic modeling with lemmatized text. The output looks as follows:

Cette       PRO:DEM   ce
Philaminte  NOM       <unknown>
est         VER:pres  être
belle       ADJ       beau
sans        PRP       sans
doute       NOM       doute
?           SENT      ?

You can clearly see how the second column holds the POS tags and the third column holds the lemmas. But this is not exactly what Mallet needs! This is where Python steps in. The little script I created for this can be found on GitHub along with some sample files; note that it expects just the kind of input produced by TreeTagger, including the “*.trt” extension I invented. The script pulls just the lemmas from the third column of this table, skipping punctuation in the process, and, for each text, writes the lemmas to a new plain text file. That is the kind of output I quoted at the beginning of this post, and it is just what you may want to use as input to Mallet.

[Edit: In order to use the script, you will need to have Python 3 installed. The easiest way to run the script is to launch it with the “run” command from IDLE (which comes packaged with Python) or from an editor like Geany. In one directory, there should be the “trt2txt.py” script itself as well as the output files from TreeTagger in a folder called “trt”. Create two subfolders, one called “csv” and one called “txt”, so the script can place the resulting files there. Now you’re ready to run the script, and you will get files with lemmatized plain text.]
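To make the idea concrete, here is a minimal sketch of the extraction step. It is not the “trt2txt.py” script itself, only an illustration: the function name `extract_lemmas`, the list of punctuation tags to skip, and the fallback to the lowercased token when the lemma is “<unknown>” are all assumptions made for this sketch. It expects the three columns of each line to be separated by tabs, so check the tags and separators against your own TreeTagger output.

```python
def extract_lemmas(trt_text, skip_tags=("PUN", "PUN:cit", "SENT")):
    """Return the lemmas from TreeTagger output as a list of strings.

    Each input line is expected to hold token, POS tag and lemma, separated
    by tabs. Punctuation is skipped by its tag. When the lemma is "<unknown>",
    the lowercased token is used instead (a choice made for this sketch).
    """
    lemmas = []
    for line in trt_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip empty or malformed lines
        token, tag, lemma = (part.strip() for part in parts)
        if tag in skip_tags:
            continue
        if lemma == "<unknown>":
            lemma = token.lower()
        lemmas.append(lemma)
    return lemmas


# The sample output shown above, written with tab separators.
sample = (
    "Cette\tPRO:DEM\tce\n"
    "Philaminte\tNOM\t<unknown>\n"
    "est\tVER:pres\têtre\n"
    "belle\tADJ\tbeau\n"
    "sans\tPRP\tsans\n"
    "doute\tNOM\tdoute\n"
    "?\tSENT\t?\n"
)
print(" ".join(extract_lemmas(sample)))
```

For the sample, this prints “ce philaminte être beau sans doute” on a single line, which is the kind of plain text you would then write to one file per play.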

Although this post is intended to stop here, let me just formulate a word of warning. There are some good arguments for using lemmatized texts as the basis for topic modeling. A practical argument is that it becomes much easier to manage the stopword list, because with one entry for a given lemma, you cover all the derived forms. A more important argument is that, because you merge all forms of a given lemma into one, the semantic information contained in the lemma is not (somewhat arbitrarily?) spread out between the different word forms derived from the lemma. Instead, the same semantic information is assembled in one place, potentially making topic modeling more accurate. Also, some ambiguities may be resolved, like the one between “(je) sens” and “(les) sens”, i.e. between the verb “to feel” and the noun “the senses”.

But there are also some downsides. First of all, new ambiguities arise: for example, TreeTagger does not know whether “je suis” should be rendered as “suivre” (to follow) or “être” (to be). The same is true for “l'”, which could be “la” or “le”, something which appeared in the example from Le Cid but is less of a problem, of course, when you’re discarding articles anyway. And the topics change, too: one strong effect for a language like French, and I suppose for other highly inflected languages, is that all of a sudden, the verbs become very important in the topics. For example, the first 20 words of a nice topic like “criminal emotions”, strongly related to tragedy, are:

crime voir venger punir croire traître cœur trahir lâche ingrat haine âme coupable oser courroux colère accuser craindre souffrir foi

Nothing wrong with that, of course. However, this, in turn, leads to some tough decisions that need to be made for the stopword list: sure, auxiliaries like “avoir” and “être” should be excluded, and maybe also “faire” and “aller”, but what about “savoir” and “pouvoir”, highly frequent and flexible verbs which, at the same time, have important semantics related to knowledge and power? Whatever the answer to this may be, this post was simply intended to provide you with the tools to get to the point where you need to make these kinds of decisions, and find out for yourselves.
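One practical way to approach such decisions is to keep the stopword list as a parameter, so that runs with and without verbs like “savoir” and “pouvoir” can be compared. The following sketch filters a list of lemmas against such a list. It is only an illustration with invented names (`remove_stopwords`, the tiny example stoplist), and it makes one debatable choice for ambiguous lemmas like “suivre|être”: they are dropped as soon as any one of their alternatives is a stopword.

```python
def remove_stopwords(lemmas, stopwords):
    """Return the lemmas that are not covered by the stopword list.

    Ambiguous lemmas ("suivre|être", "la|le") are split on "|" and dropped if
    any alternative is a stopword; this rule is an assumption of the sketch.
    """
    stopwords = set(stopwords)
    kept = []
    for lemma in lemmas:
        alternatives = lemma.split("|")
        if any(alt in stopwords for alt in alternatives):
            continue
        kept.append(lemma)
    return kept


# Invented example: a few lemmas, two of them ambiguous.
lemmas = ["je", "suivre|être", "content", "vous", "la|le", "aimer"]
stoplist = {"je", "vous", "le", "la", "être"}
print(remove_stopwords(lemmas, stoplist))
```

With this stoplist, only “content” and “aimer” remain. Requiring that all alternatives be stopwords would be the more cautious rule; it would keep “suivre|être” in the text.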
