For reasons I can’t explain, I have had a keen interest in crime fiction for many years, especially French crime fiction written since roughly the 1950s. Some of my favorite authors are Léo Malet, Jean-Patrick Manchette, Sébastien Japrisot and Didier Daeninckx. It is no accident, then, that I was drawn to Mitzi Morris’ stylometric murder mystery Poetic Justice. Although I taught a class on the genre some years ago, gave a talk about “money and morality in French crime fiction” and dabbled with some relevant Wikipedia articles, my interest in crime fiction somehow never turned into an active research area of mine. Maybe there was just too much of it: think of Georges Simenon’s 75 Maigret novels, not to mention his more than 100 other works, or of Léo Malet’s 30 Nestor Burma novels and dozens of other crime fiction works, or of Boileau-Narcejac’s 40 novels published under that name. Sure, Balzac was even more productive, but there was only one Balzac! At some point it occurred to me that this was an advantage rather than a problem: crime fiction is the perfect playground for computational and quantitative methods of text analysis, simply because there is so much relatively homogeneous material to work with.
During the recent holiday season, I spent a few days putting together a nice little collection of French crime fiction of the twentieth century. It is really more of a testing ground than anything aspiring to complete or representative coverage, but the need to scan, OCR and clean up most of the texts really did not make anything else possible. In any case, the result of this work was a collection of sixty French crime fiction novels published between 1907 and 2010, with ten novels written by each of the authors represented: Maurice Leblanc, Gaston Leroux, Georges Simenon, Léo Malet, Jean-Patrick Manchette and Didier Daeninckx. The coverage is uneven and the texts still contain errors, but the collection should be good enough to see what is possible with it.
The first and very obvious thing to do when you are a “stylo” aficionado is to run a cluster analysis or principal component analysis to get a feel for your text collection.
Figure 1: Cluster Analysis for 60 French crime fiction novels (Eder’s Delta, 400 MFW, Ward clustering)
This is what I did, and the graph above simply confirms two relatively obvious things: the six authors included in the corpus all have very distinct styles, and it is therefore really not a problem for “Delta” to distinguish clearly between them. [Edit: Also, they fall into two larger chronological clusters, with Manchette and Daeninckx being distinctly more modern while Malet, although distinct from Leblanc, Leroux and Simenon, does not cluster directly with Manchette and Daeninckx.] However, there is also something strange going on: a few novels by Jean-Patrick Manchette appear surprisingly similar to Léo Malet, despite the fact that any reader of these two authors will tell you that they are quite different in style, and not just in style.
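To make the idea of a Delta distance a little more concrete, here is a minimal sketch in plain Python. Please note that it implements the classic Burrows's Delta (mean absolute difference of z-scored relative frequencies of the most frequent words), not the Eder's Delta variant used for Figure 1, and that it leaves out the Ward clustering step entirely. The function names and the toy texts are my own assumptions for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def most_frequent_words(token_lists, n):
    """Return the n most frequent words over the whole collection."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(token_lists, n_mfw):
    """Pairwise classic Burrows's Delta (needs at least two texts).

    Assumption: this is the textbook variant, not Eder's Delta.
    """
    vocabulary = most_frequent_words(token_lists, n_mfw)
    freqs = [relative_frequencies(t, vocabulary) for t in token_lists]
    # z-score every word's frequency across all texts
    columns = list(zip(*freqs))
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]
    z = [
        [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(row, mus, sigmas)]
        for row in freqs
    ]
    n = len(token_lists)
    # Delta = mean absolute difference between two texts' z-scores
    return [
        [mean(abs(a - b) for a, b in zip(z[i], z[j])) for j in range(n)]
        for i in range(n)
    ]


# Toy example with three tiny invented "texts"
texts = [
    "le chat et le chien".split(),
    "le chat et le chien".split(),
    "la nuit la nuit la rue".split(),
]
distances = burrows_delta(texts, n_mfw=3)
print(distances[0][1], distances[0][2] > 0)
```

The resulting distance matrix is what a hierarchical clustering routine would then take as input.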
A Principal Component Analysis done with just the three authors writing the largest part of their oeuvre after 1945, that is Malet, Manchette and Daeninckx, shows an interesting result.
Figure 2: Principal Component Analysis for 30 French crime fiction novels (800 MFW)
On the first principal component (horizontal axis in the graph above), Daeninckx and Manchette on the one hand, and Malet on the other hand, are relatively distinct, again with the exception of the three Manchette novels which had clustered with Malet in the Cluster Analysis graph above. But on the second principal component (vertical axis), not only are Manchette and Daeninckx clearly separated from each other while Malet’s texts hover around the neutral ground of this component; also, two of Manchette’s texts are far away from Malet in this dimension. The solution becomes clear when looking at the narrative perspective of the novels: the three Manchette outliers are first-person narratives, like almost all of the Malet novels, which makes them similar to each other on one dimension, but not on the other one. One mystery remains: the text labeled “MANJ_1971-AF”, which corresponds to Manchette’s L’Affaire N’Gustro, is very close to Malet in both dimensions. This needs some follow-up, but not for today. It does show that the slightly less aggressive dimensionality reduction of PCA can retain information that is relevant and tends to get lost with distance measures and Cluster Analysis.
The next thing to do, rather, was to go beyond stylometry in the narrow sense of authorship attribution and try out some more tools from my toolbox, especially Topic Modeling. These authors, I felt, differ not just in style but also in their themes. Also, the thematic differences, similarities, and relations between these authors may be of great interest in terms of a history of French crime fiction.
For Topic Modeling, which I do with Mallet, a little bit of preprocessing is helpful. For example, results really only get interesting when novels are split into shorter pieces. I have had pretty good results with the brief chapters crime fiction novels frequently have, and was happy to take advantage of my TEI-encoded versions with chapters marked up to easily create chapter-wise chunks. However, I then decided to go for even smaller pieces and split the plain text versions of my 60 novels into 10-line chunks using a little Python script I stitched together from various sources. Also, I needed to find a good regular expression for Mallet’s tokenizer and ended up using the following one, which seems to do a good job on French: “\p{L}[\p{L}\p{P}]*\p{L}”.
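The splitting step can be sketched roughly as follows. This is not the script I actually used, just a minimal illustration of the idea; the function names, the file naming scheme and the decision to skip blank lines are assumptions made for this example.

```python
def split_into_chunks(text, lines_per_chunk=10):
    """Split plain text into consecutive chunks of a fixed number of lines.

    Assumption: blank lines are dropped before counting, so that every
    chunk (except possibly the last) holds exactly lines_per_chunk lines.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [
        "\n".join(lines[start:start + lines_per_chunk])
        for start in range(0, len(lines), lines_per_chunk)
    ]


def chunk_filename(novel_id, index):
    """Hypothetical naming scheme: novel label plus a zero-padded counter."""
    return f"{novel_id}_{index:04d}.txt"


# Toy example: a 25-line "novel" yields chunks of 10, 10 and 5 lines
sample = "\n".join(f"ligne {n}" for n in range(1, 26))
for index, chunk in enumerate(split_into_chunks(sample)):
    print(chunk_filename("MANJ_1971-AF", index), len(chunk.splitlines()))
```

Keeping the novel label in each chunk's file name makes it possible to trace every chunk back to its novel and author later on.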
Running Mallet itself is easy and well-documented, and I did some iterations in order to improve my stopword list; in particular, I used topics related primarily to characters’ names to create a list of characters and then add that to the stopword list. It is really in the post-processing phase that things get tricky again. The first table shows the topics themselves with their ID, their weight in the entire text collection and the most important individual words associated with the topic.
Figure 3: Topics with their weight and some of the most important words; colored according to the weight of the topic.
With this collection as in other cases, it seems like the “best” or most readily interpretable topics are those with a medium weight: the ones with very high weights are relatively unspecific, and the ones with very low weights are overly specific. The most distinctive topics in this set were: drinking utensils, criminal or violent acts (surprise!), the face, writing and driving, but also money and landscape.
Now, this is nice and already quite impressive (although my feeling is there is a lot of room for improvement in the topics here). To go further from here and gain some insight into which topics are most strongly associated with each of the authors, I was able to rely on the wonderful tutorials about “Text Analysis with Topic Modeling” Allen Riddell has created with support from DARIAH-DE. (We call them “TAToM” for short and they will be published very soon.) One of the things Allen explains there is how to use Python to aggregate the topic distributions per text chunk back to a level like complete novels. Adapting his example a little bit, I managed to do this for my six authors. I then opened the resulting file with LibreOffice and added conditional formatting to the table to make sense of it. The following table shows the importance of each of the 30 topics in each of the six authors’ ten novels taken as a unit. The stronger the weight of a given topic in an author’s novels, the darker the cell is.
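To give an idea of what this aggregation step involves, here is a minimal sketch in plain Python. It is not Allen's code and not my adaptation of it; it assumes that the per-chunk topic weights have already been read into a list of (chunk name, weights) pairs, that each chunk name starts with an author code followed by an underscore (as in “MANJ_1971-AF”), and that a simple unweighted mean over chunks is an acceptable way to aggregate. The label “MALL_0000-XX” is a made-up placeholder.

```python
from collections import defaultdict


def aggregate_by_author(doc_topics):
    """Average per-chunk topic distributions over all chunks of each author.

    doc_topics: list of (chunk_name, [weight per topic]) pairs.
    Assumption: the author code is everything before the first underscore.
    """
    sums = {}
    counts = defaultdict(int)
    for chunk_name, weights in doc_topics:
        author = chunk_name.split("_", 1)[0]
        if author not in sums:
            sums[author] = [0.0] * len(weights)
        for topic_id, weight in enumerate(weights):
            sums[author][topic_id] += weight
        counts[author] += 1
    # Unweighted mean: every chunk counts equally
    return {
        author: [total / counts[author] for total in totals]
        for author, totals in sums.items()
    }


# Toy example with two topics and invented weights
doc_topics = [
    ("MANJ_1971-AF_0000", [0.5, 0.5]),
    ("MANJ_1971-AF_0001", [1.0, 0.0]),
    ("MALL_0000-XX_0000", [0.25, 0.75]),  # hypothetical label
]
for author, weights in aggregate_by_author(doc_topics).items():
    print(author, weights)
```

Splitting the chunk name at a different position (for example, keeping the full novel label) would aggregate per novel instead of per author.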
Figure 4: Topics by author with their aggregated weight (in blue), as well as the total weight of each topic (in red)
It becomes clear that each author has one or several signature topics, in a way (those with the highest score in any one author column, topic 20 for Simenon, for example). Also, it is clear that some topics are quite evenly distributed among all authors while for others, the weight varies greatly between authors (compare topic 9 and topic 10, for instance).
However, these results only really become interesting once we put them together with the first table, i.e. with the meaning of the topics. It turns out that one of the least surprising topics in this set of novels, “physical violence” (with words like: coup, tête, sang, bras, feu, corps), is not very evenly distributed but really only plays a major role in Jean-Patrick Manchette’s novels; the other novels, even Léo Malet’s, seem to tend less towards the pole of physical violence and more towards the pole of abstract reasoning. And Simenon’s signature topic, which is topic #20 according to this analysis, does not really have a very strong coherence: it combines family member words with expressions of doubt (fils, peut-être, presque, père, femme are among the topic words). More often than not, the “signature” topics are not actually very interesting topics. This needs some more work.
My bottom line for today is really that my little topic modeling workflow is functional, but that the results can be improved in many ways: better preprocessing (a better splitting function, for example; a more extensive list of stopwords; maybe even lemmatization of the texts) as well as more flexible postprocessing (a more immediate link between topic distributions per author and the actual topic words; or aggregation not in terms of authors, but in terms of individual novels) could yield interesting results. In any case, the potential of this kind of analysis for getting a new type of overview of the history of crime fiction in France seems quite clear to me.