This HTML5 document contains 20 embedded RDF statements represented using HTML+Microdata notation.

The embedded RDF content will be recognized by any processor of HTML5 Microdata.

Namespace Prefixes

Prefix IRI
n8 http://demo.openlinksw.com/about/id/http/dragonfly.hypotheses.org/91/
wdrs http://www.w3.org/2007/05/powder-s#
dc http://purl.org/dc/elements/1.1/
n2 https://dragonfly.hypotheses.org/
rss http://purl.org/rss/1.0/
n7 http://demo.openlinksw.com/about/id/http/www.dragonfly.hypotheses.org/
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
content http://purl.org/rss/1.0/modules/content/
xsdh http://www.w3.org/2001/XMLSchema#

Statements

Subject Item
n2:745
rdf:type
rss:item
dc:creator
Christof Schöch
wdrs:describedby
n7:91 n8:
dc:date
2015-01-04T16:35:03Z
dc:subject
crime fiction, Boileau-Narcejac, parody, Arsène Lupin, similarity of texts, style, My research, stylometry, pastiche, Maurice Leblanc
rss:title
Stylometry and pastiche. A case study from French crime fiction
rss:link
https://dragonfly.hypotheses.org/745
rss:description
When I enthusiastically present literary scholars with the surprising accuracy stylometric methods display in many cases of authorship attribution (given appropriate conditions, such as sufficient material, a certain homogeneity in the genre of the texts, and state-of-the-art distance measures), some come up with a clever question: What if one author parodies the style of another author? Will stylometric methods...
content:encoded
<p><div id="attachment_752" style="width: 160px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/Lupin.jpg"><img aria-describedby="caption-attachment-752" loading="lazy" class=" wp-image-752 " title="Arsène Lupin. Source: http://commons.wikimedia.org/wiki/File:Lupin01.jpg. Image in the public domain." src="http://dragonfly.hypotheses.org/files/2015/01/Lupin-233x300.jpg" alt="Arsène Lupin. Source: http://commons.wikimedia.org/wiki/File:Lupin01.jpg. Image in the public domain." width="150" height="200" /></a><p id="caption-attachment-752" class="wp-caption-text">Arsène Lupin. Source: http://commons.wikimedia.org/wiki/File:Lupin01.jpg. Image in the public domain.</p></div></p> <p>When I enthusiastically present literary scholars with the surprising accuracy stylometric methods display in many cases of authorship attribution (given appropriate conditions, such as sufficient material, a certain homogeneity in the genre of the texts, and state-of-the-art distance measures), some come up with a clever question: What if one author parodies the style of another author? Will stylometric methods be fooled? <span id="more-745"></span></p> <p>What if&#8230; indeed! Up to now, all I was able to reply to this was to acknowledge the interest of the question, refer to one or two relevant papers[<a href="https://dragonfly.hypotheses.org/745#footnote_0_745" id="identifier_0_745" class="footnote-link footnote-identifier-link" title="One on an Alice in Wonderland pastiche &mdash; Harold Somers, Fiona Tweedie: &ldquo;Authorship Attribution and Pastiche&rdquo;, in: Computers and the Humanities, 37.4, 2003, 407-429, paywalled at: http://link.springer.com/article/10.1023%2FA%3A1025786724466 &mdash; another one on detecting fraudulent authorship in online media &mdash; Sadia Afroz et al., &ldquo;Detecting Hoaxes, Frauds, and Deception in Writing Style&rdquo;, Proceedings of the 2012 IEEE Symposium on Security and Privacy, 2012, p. 
461-475, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6234430 &mdash; [edit: and a third one on Raymond Chandler imitations &mdash; Lee Sigelman and William Jacoby, &ldquo;The Not-So-Simple Art of Imitation: Pastiche, Literary Style, and Raymond Chandler&rdquo;, Computers and the Humanities 30.1, 1996, 11-28].">1</a>] and make an educated guess: If the author is very good, he or she can pull it off, but in most cases, I suspect that the author signal remains strong even in a pastiche, i.e. the imitation of the style or character of a work of art.[<a href="https://dragonfly.hypotheses.org/745#footnote_1_745" id="identifier_1_745" class="footnote-link footnote-identifier-link" title="See art. &ldquo;Pastiche&rdquo;, Wikipedia, http://en.wikipedia.org/wiki/Pastiche.">2</a>] But I had never come across a good test-case from French literature for an actual stylometric experiment.</p> <p>The other day, when expanding my collection of French crime fiction novels, I finally came across a suitable test-case. The authors Pierre Boileau and Thomas Narcejac, better known as <a title="Boileau-Narcejac" href="http://en.wikipedia.org/wiki/Boileau-Narcejac" target="_blank">Boileau-Narcejac</a>[<a href="https://dragonfly.hypotheses.org/745#footnote_2_745" id="identifier_2_745" class="footnote-link footnote-identifier-link" title="See also: Claude Mespl&egrave;de, &ldquo;Boileau-Narcejac&rdquo;, Dictionnaire des litt&eacute;ratures polici&egrave;res, Nantes: Joseph K, 2007, vol. 2, p. 410-411.">3</a>], have not only jointly written a large number of suspense-oriented crime fiction novels, from the 1950s to the late 1980s, with their collaboration being an interesting case for stylometry in itself. As admirers of early twentieth-century proponents of the genre, they have also produced, in the course of the 1970s, five pastiches of Maurice Leblanc&#8217;s novels featuring Arsène Lupin. 
The five novels are: <em>Le Secret d&#8217;Eunerville</em> (1973), <i>La Poudrière</i> (1974), <i>Le Second visage d&#8217;Arsène Lupin</i> (1975), <i>La Justice d&#8217;Arsène Lupin</i> (1977) and <i>Le Serment d&#8217;Arsène Lupin</i> (1979). These novels feature Maurice Leblanc&#8217;s famous &#8220;gentleman-cambrioleur&#8221; Arsène Lupin[<a href="https://dragonfly.hypotheses.org/745#footnote_3_745" id="identifier_3_745" class="footnote-link footnote-identifier-link" title="A legal agreement with the heir, Claude Leblanc, was necessary to allow Boileau-Narcejac to feature the character; O tempora, o mores!">4</a>] and lead him into new adventures in which he can be bold, save several people&#8217;s lives, meet beautiful women, and possibly steal (and/or return) some valuable jewelry or works of art in the process. In their brief prefatory note to <em>La Poudrière</em>, the authors explicitly designate their novels as &#8220;pastiches&#8221; and state the following (my translation): &#8220;There is a Leblanc style [écriture], whose movement it is not all that difficult to reproduce; but there is, in Leblanc, a creativity, a way of approaching the absence of verisimilitude with a natural elegance, which intimidated us a lot.&#8221;[<a href="https://dragonfly.hypotheses.org/745#footnote_4_745" id="identifier_4_745" class="footnote-link footnote-identifier-link" title="French original: &ldquo;Il y a une &eacute;criture Leblanc, dont il n&rsquo;est pas trop difficile de reproduire le mouvement ; mais il y a, chez Leblanc, une invention, une mani&egrave;re de c&ocirc;toyer l&rsquo;invraisemblance avec naturel et &eacute;l&eacute;gance, qui nous intimidait beaucoup.&rdquo;">5</a>] Basically, they say: it was easy to imitate the style, but hard to imitate the plot.</p> <p>My reading experience would rather suggest the opposite: it seems easier to imitate the characters and events of a type of novel than its style. 
So, to approach the question of style, what if we prepared a collection of texts to test whether these pastiches of Leblanc by Boileau-Narcejac (four of which I have) are more similar to Leblanc&#8217;s originals or to Boileau-Narcejac&#8217;s other crime fiction novels? Besides the contestants, such a collection should also contain crime fiction contemporary to Leblanc&#8217;s originals (such as novels by Gaston Leroux) as well as contemporary to Boileau-Narcejac&#8217;s pastiches and their other novels (such as novels by Georges Simenon, Léo Malet, Frédéric Dard, and Jean-Patrick Manchette). This makes for a collection of 123 crime fiction novels, written by seven different authors (Boileau-Narcejac counted as one) during a period of around seventy years (1908-1977). While I have all these texts at my disposal, unfortunately, I do not have any &#8216;normal&#8217; crime fiction by Boileau-Narcejac exactly contemporary to the pastiches, only a number of slightly earlier ones, so if their Leblanc pastiches turn out to be different from their &#8216;normal&#8217; novels, this could be due to chronology. Also, narrative perspective and its consequences for the frequencies of pronouns and verb forms may get in our way (French being a highly inflected language). But let&#8217;s see.</p> <p>First things first, I wanted to see what could be done with a well-established distance measure such as Eder&#8217;s Delta (a variant of Burrows&#8217; Delta which has served me well on French texts), using the <em>stylo for R</em> package.[<a href="https://dragonfly.hypotheses.org/745#footnote_5_745" id="identifier_5_745" class="footnote-link footnote-identifier-link" title="Maciej Eder, Mike Kestemont, and Jan Rybicki. (2013). &ldquo;Stylometry with R: a suite of tools&rdquo;. Digital Humanities 2013: Conference Abstracts. Lincoln: University of Nebraska-Lincoln, pp. 487-89. 
[pre-print].">6</a>] In order to minimize the influence of narrative perspective on the word frequency profiles of the texts, I used a custom list of stopwords including personal pronouns, possessive pronouns and some very common verbs in their first and third person form.[<a href="https://dragonfly.hypotheses.org/745#footnote_6_745" id="identifier_6_745" class="footnote-link footnote-identifier-link" title="Here is how to do this (explanation courtesy of Maciej): 1. Run stylo normally: &gt; results1 = stylo() 2. Assign the frequencies to a variable: &gt; frequencies1 = results1$frequencies.0.culling 3. Define a new variable with frequencies without stopwords: &gt; frequencies2 = delete.stop.words(frequencies1, stop.words=c(&quot;your&quot;,&quot;stop&quot;,&quot;words&quot;) ) 4. Run stylo again with culled table of frequencies: &gt; stylo(frequencies = frequencies2). That&rsquo;s it.">7</a>] In addition, I excluded the names &#8220;Arsène&#8221; and &#8220;Lupin&#8221;, in order not to let the simple mention of the hero&#8217;s name influence the assessment of stylistic similarity.[<a href="https://dragonfly.hypotheses.org/745#footnote_7_745" id="identifier_7_745" class="footnote-link footnote-identifier-link" title="Here is the list of stopwords: &ldquo;je&rdquo;, &ldquo;j&rdquo;, &ldquo;tu&rdquo;, &ldquo;il&rdquo;, &ldquo;nous&rdquo;, &ldquo;vous&rdquo;, &ldquo;elle&rdquo;, &ldquo;ils&rdquo;, &ldquo;elles&rdquo;, &ldquo;moi&rdquo;, &ldquo;lui&rdquo;, &ldquo;me&rdquo;,&rdquo;m&rdquo;,&rdquo;se&rdquo;,&rdquo;te&rdquo;,&rdquo;t&rdquo;, &ldquo;ma&rdquo;, &ldquo;sa&rdquo;, &ldquo;mes&rdquo;, &ldquo;son&rdquo;, &ldquo;mon&rdquo;, &ldquo;ses&rdquo;,&rdquo;votre&rdquo;, &ldquo;ai&rdquo;, &ldquo;a&rdquo;, &ldquo;suis&rdquo;, &ldquo;est&rdquo;, &ldquo;sont&rdquo;, &ldquo;ont&rdquo;, &ldquo;avais&rdquo;, &ldquo;avait&rdquo;, &ldquo;avaient&rdquo;, &ldquo;avez&rdquo;, &ldquo;&eacute;tais&rdquo;, &ldquo;&eacute;tait&rdquo;, &ldquo;&eacute;taient&rdquo;, &ldquo;fus&rdquo;, 
&ldquo;fut&rdquo;, &ldquo;fais&rdquo;, &ldquo;fait&rdquo;, &ldquo;font&rdquo;, &ldquo;fis&rdquo;, &ldquo;fit&rdquo;, &ldquo;sais&rdquo;, &ldquo;sait&rdquo;, &ldquo;&ccedil;a&rdquo;, &ldquo;c&rdquo;, &ldquo;dis&rdquo;, &ldquo;dit&rdquo;, &ldquo;disent&rdquo;, &ldquo;Ars&egrave;ne&rdquo;, &ldquo;Lupin&rdquo;.">8</a>] The pastiches by Boileau-Narcejac have their own &#8220;author&#8221; label (BoilNarcP), so they will be set apart visually in the results.</p> <p>As always, the possibilities for parameter setting are endless, and adding a custom-built list of stopwords to the mix does not help. But let&#8217;s start with a carefully cherry-picked if conservative set of parameters: the 400 most frequent words only, without the above-mentioned stopwords, culling of 20% applied to somewhat level the playing field, and using Eder&#8217;s Delta. The result is the following dendrogram:</p> <p><div id="attachment_761" style="width: 190px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012.jpg"><img aria-describedby="caption-attachment-761" loading="lazy" class="size-medium wp-image-761" src="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-180x300.jpg" alt="Dendrogram of stylistic similarities (stylo, Eder's Delta, 400 MFW, custom stoplist, 20% culling). Click to enlarge." 
width="180" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-180x300.jpg 180w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-300x500.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-624x1040.jpg 624w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012.jpg 1800w" sizes="(max-width: 180px) 100vw, 180px" /></a><p id="caption-attachment-761" class="wp-caption-text">Figure 1: Dendrogram of stylistic similarities (stylo, Eder&#8217;s Delta, 400 MFW, custom stoplist, 20% culling). Click to enlarge.</p></div></p> <p>As you can easily see from figure 1, the results are almost flawless: the 123 texts cluster into perfectly coherent, strongly-separated author-based groups (only exception: one novel by Malet which clusters with Dard). These combine into larger groups (from top to bottom): one made up of Boileau-Narcejac relatively close to Frédéric Dard as well as Jean-Patrick Manchette, with Léo Malet a bit more removed; a second one with Maurice Leblanc and Gaston Leroux, clearly based on chronology and shared narrative perspective. And finally, Georges Simenon on his own.</p> <p>The four pastiches of Maurice Leblanc&#8217;s Arsène Lupin, written by Boileau-Narcejac, which are shown in green, are right among the other texts by Boileau-Narcejac. Nothing indicates their special status. On the level of analysis used here, and with the parameters used, any potentially existing &#8220;Maurice Leblanc&#8221; style in these four novels remains invisible. 
It is not entirely clear to me whether this is a success or a failure, but it is certainly surprising!</p> <p><div id="attachment_762" style="width: 310px" class="wp-caption alignright"><a href="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011.jpg"><img aria-describedby="caption-attachment-762" loading="lazy" class="size-medium wp-image-762 " src="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-300x300.jpg" alt="Principal Component Analysis (same settings as above). Click to enlarge." width="300" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-300x300.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-150x150.jpg 150w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-500x500.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-624x624.jpg 624w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-762" class="wp-caption-text">Figure 2: Principal Component Analysis (same settings as above). Click to enlarge.</p></div></p> <p>A Principal Component Analysis with analogous settings reveals the same relationships, adding some nuance (see figure 2): Simenon is the only author removed from the others in the first dimension. Also, there is not just closeness, but overlap between Leroux and Leblanc. Manchette seems to have a significantly larger stylistic range compared to what the dendrogram suggests. And again, the pastiches are right in the middle of the other Boileau-Narcejac texts. 
Interestingly, the one text by Boileau-Narcejac approaching the Maurice Leblanc novels, further to the upper right, is not one of the pastiches, but the 1957 novel <em>Les Magiciennes</em>.</p> <p>Now, things actually get a little bit more mixed-up when using a longer wordlist: at some point, three novels by Malet join the Dard-cluster; and at some other point, three novels by Leblanc also join the Dard-cluster. The reasons for this, and the type of stylistic similarity involved, remain to be investigated. However, the two types of Boileau-Narcejac texts always cluster solidly together, without any intrusions from texts by other authors.</p> <p><div id="attachment_772" style="width: 146px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/final-results.jpg"><img aria-describedby="caption-attachment-772" loading="lazy" class=" wp-image-772 " src="http://dragonfly.hypotheses.org/files/2015/01/final-results.jpg" alt="Results from classification tasks (several algorithms). Click to enlarge." width="136" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/final-results.jpg 454w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/final-results-136x300.jpg 136w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/final-results-227x500.jpg 227w" sizes="(max-width: 136px) 100vw, 136px" /></a><p id="caption-attachment-772" class="wp-caption-text">Figure 3: Results from classification tasks (several algorithms). Click to enlarge.</p></div></p> <p>The Boileau-Narcejac pastiches&#8217; solid stylistic identification as authored by Boileau-Narcejac is also confirmed by various classification tasks performed using the classify() function in stylo for R, this time using a lot more features (the 5000 most frequent words) and no custom list of stopwords (the idea is that the algorithms will sort it all out). 
Again, while not perfectly unanimous in all cases, the results are rock solid for the pastiches (see figure 3).</p> <p>So, what exactly is going on? It is possible to approach these distant-reading results from a few more angles. For example, when trying to understand how exactly the normal novels and the pastiches by Boileau-Narcejac are similar, one could look at their word frequencies in comparison. Just for illustration, the figure below shows the word frequencies for the 50 most frequent words, not for each of the 123 novels, but as averages across the novels by one &#8220;author&#8221;: Boileau-Narcejac, their pastiches, Leblanc (the target of their pastiche) and Frédéric Dard.</p> <p><div id="attachment_764" style="width: 310px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/author-averages.jpg"><img aria-describedby="caption-attachment-764" loading="lazy" class="size-medium wp-image-764 " src="http://dragonfly.hypotheses.org/files/2015/01/author-averages-300x230.jpg" alt="Average word frequencies for four &quot;authors&quot; (first 50 most frequent words). Click to enlarge." width="300" height="230" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages-300x230.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages-500x384.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages-624x480.jpg 624w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages.jpg 1242w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-764" class="wp-caption-text">Average word frequencies for four &#8220;authors&#8221; (first 50 most frequent words). Click to enlarge.</p></div></p> <p>Of course, this is only a small portion of the 400 most frequent words used for the cluster analysis and principal component analysis above. 
But the issue becomes clear, I think: Sure, one can perfectly well find words for which the &#8216;normal&#8217; novels and the pastiches by Boileau-Narcejac seem to &#8220;stick together&#8221; while Leblanc and Dard deviate from them and/or from each other (e.g. for &#8220;pas&#8221;, &#8220;en&#8221;, &#8220;mais&#8221;, &#8220;bien&#8221;). But one can just as well find words for which the &#8216;normal&#8217; novels deviate from the pastiches, which seem closer to Leblanc (e.g. &#8220;à&#8221;, &#8220;dans&#8221;, &#8220;comme&#8221;). Each of these words merits an investigation into its stylistic and content-wise contribution to the four groups of novels. However, it is clear that only quantitative methods, namely distance measures, can add up all these subtle differences and similarities into a score.</p> <p>Distance measures, yes, but correlation tests could also be interesting to assess the similarity between these word frequency distributions. The following is a table of Pearson&#8217;s correlation tests on the average word frequencies for each of the 400 most frequent words across all novels by a given author (Boileau-Narcejac and their pastiches being, again, treated as two separate authors).</p> <p><div id="attachment_765" style="width: 310px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/cor-test-matrix.jpg"><img aria-describedby="caption-attachment-765" loading="lazy" class=" wp-image-765 " src="http://dragonfly.hypotheses.org/files/2015/01/cor-test-matrix-300x65.jpg" alt="Correlation table between authors (average word frequencies across all novels by one author). Click to enlarge." 
width="300" height="65" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix-300x65.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix-500x108.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix-624x135.jpg 624w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix.jpg 1553w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-765" class="wp-caption-text">Correlation table between authors (average word frequencies across all novels by one author). Click to enlarge.</p></div></p> <p>The table shows how much correlation there is between authors. Generally speaking, correlations are very strong, and differences are subtle (even with the confidence level set to 0.99). You could think of it as a &#8220;poor man&#8217;s distance matrix&#8221;. In any case, it shows, unsurprisingly by now, that Boileau-Narcejac&#8217;s Arsène Lupin pastiches are more similar to these authors&#8217; other novels (correlation score of 0.9938) than to the original Arsène Lupin novels by Maurice Leblanc (0.9871).</p> <p>So, what if an author tries to imitate the style of another author? Will stylometric tools be fooled, or not? In this test case, they have clearly not been fooled. In terms of quantitative stylistics, the Leblanc pastiches by Boileau-Narcejac are clearly written in the style of Boileau-Narcejac. 
As for the plot, which is unmistakably Leblanc-esque, that is another story.</p> <p>Notes</p> <ol class="footnotes"> <li id="footnote_0_745" class="footnote">One on an <em>Alice in Wonderland</em> pastiche &#8212; Harold Somers, Fiona Tweedie: &#8220;Authorship Attribution and Pastiche&#8221;, in: Computers and the Humanities, 37.4, 2003, 407-429, paywalled at: http://link.springer.com/article/10.1023%2FA%3A1025786724466 &#8212; another one on detecting fraudulent authorship in online media &#8212; Sadia Afroz et al., &#8220;Detecting Hoaxes, Frauds, and Deception in Writing Style&#8221;, Proceedings of the 2012 IEEE Symposium on Security and Privacy, 2012, p. 461-475, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6234430 &#8212; [edit: and a third one on Raymond Chandler imitations &#8212; Lee Sigelman and William Jacoby, &#8220;The Not-So-Simple Art of Imitation: Pastiche, Literary Style, and Raymond Chandler&#8221;, Computers and the Humanities 30.1, 1996, 11-28].</li> <li id="footnote_1_745" class="footnote"> See art. &#8220;Pastiche&#8221;, Wikipedia, <a title="Pastiche" href="http://en.wikipedia.org/wiki/Pastiche" target="_blank">http://en.wikipedia.org/wiki/Pastiche</a>.</li> <li id="footnote_2_745" class="footnote">See also: Claude Mesplède, &#8220;Boileau-Narcejac&#8221;, <cite>Dictionnaire des littératures policières</cite>, Nantes: Joseph K, 2007, vol. 2, p. 
410-411.</li> <li id="footnote_3_745" class="footnote">A legal agreement with the heir, Claude Leblanc, was necessary to allow Boileau-Narcejac to feature the character; O tempora, o mores!</li> <li id="footnote_4_745" class="footnote">French original: &#8220;Il y a une écriture Leblanc, dont il n&#8217;est pas trop difficile de reproduire le mouvement ; mais il y a, chez Leblanc, une invention, une manière de côtoyer l&#8217;invraisemblance avec naturel et élégance, qui nous intimidait beaucoup.&#8221;</li> <li id="footnote_5_745" class="footnote">Maciej Eder, Mike Kestemont, and Jan Rybicki. (2013). &#8220;Stylometry with R: a suite of tools&#8221;. <a href="http://dh2013.unl.edu/abstracts/" rel="nofollow"><i>Digital Humanities 2013: Conference Abstracts</i></a>. Lincoln: University of Nebraska-Lincoln, pp. 487-89. [<a href="https://sites.google.com/site/computationalstylistics/preprints/Eder-Kestemont-Rybicki_Stylometry_with_R.pdf?attredirects=0">pre-print</a>].</li> <li id="footnote_6_745" class="footnote">Here is how to do this (explanation courtesy of Maciej):<br /> 1. Run stylo normally:<br /> <code>&gt; results1 = stylo()</code><br /> 2. Assign the frequencies to a variable:<br /> <code>&gt; frequencies1 = results1$frequencies.0.culling</code><br /> 3. Define a new variable with frequencies without stopwords:<br /> <code>&gt; frequencies2 = delete.stop.words(frequencies1, stop.words=c("your","stop","words") )</code><br /> 4. 
Run stylo again with culled table of frequencies:<br /> <code>&gt; stylo(frequencies = frequencies2)</code>.<br /> That&#8217;s it.</li> <li id="footnote_7_745" class="footnote">Here is the list of stopwords: &#8220;je&#8221;, &#8220;j&#8221;, &#8220;tu&#8221;, &#8220;il&#8221;, &#8220;nous&#8221;, &#8220;vous&#8221;, &#8220;elle&#8221;, &#8220;ils&#8221;, &#8220;elles&#8221;, &#8220;moi&#8221;, &#8220;lui&#8221;, &#8220;me&#8221;,&#8221;m&#8221;,&#8221;se&#8221;,&#8221;te&#8221;,&#8221;t&#8221;, &#8220;ma&#8221;, &#8220;sa&#8221;, &#8220;mes&#8221;, &#8220;son&#8221;, &#8220;mon&#8221;, &#8220;ses&#8221;,&#8221;votre&#8221;, &#8220;ai&#8221;, &#8220;a&#8221;, &#8220;suis&#8221;, &#8220;est&#8221;, &#8220;sont&#8221;, &#8220;ont&#8221;, &#8220;avais&#8221;, &#8220;avait&#8221;, &#8220;avaient&#8221;, &#8220;avez&#8221;, &#8220;étais&#8221;, &#8220;était&#8221;, &#8220;étaient&#8221;, &#8220;fus&#8221;, &#8220;fut&#8221;, &#8220;fais&#8221;, &#8220;fait&#8221;, &#8220;font&#8221;, &#8220;fis&#8221;, &#8220;fit&#8221;, &#8220;sais&#8221;, &#8220;sait&#8221;, &#8220;ça&#8221;, &#8220;c&#8221;, &#8220;dis&#8221;, &#8220;dit&#8221;, &#8220;disent&#8221;, &#8220;Arsène&#8221;, &#8220;Lupin&#8221;.</li> </ol> <div id="attachment_752" style="width: 160px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/Lupin.jpg"><img aria-describedby="caption-attachment-752" loading="lazy" class=" wp-image-752 " title="Arsène Lupin. Source: http://commons.wikimedia.org/wiki/File:Lupin01.jpg. Image in the public domain." src="http://dragonfly.hypotheses.org/files/2015/01/Lupin-233x300.jpg" alt="Arsène Lupin. Source: http://commons.wikimedia.org/wiki/File:Lupin01.jpg. Image in the public domain." width="150" height="200" /></a><p id="caption-attachment-752" class="wp-caption-text">Arsène Lupin. Source: http://commons.wikimedia.org/wiki/File:Lupin01.jpg. 
Image in the public domain.</p></div> <p>When I enthusiastically present literary scholars with the surprising accuracy stylometric methods display in many cases of authorship attribution (given appropriate conditions, such as sufficient material, a certain homogeneity in the genre of the texts, and state-of-the art distance measures), some come up with a clever question: What if one author parodies the style of another author? Will stylometric methods be fooled? <span id="more-745"></span></p> <p>What if&#8230; indeed! Up to now, all I was able to reply to this was to acknowledge the interest of the question, refer to one or two relevant papers[<a href="https://dragonfly.hypotheses.org/745#footnote_0_745" id="identifier_0_745" class="footnote-link footnote-identifier-link" title="One on an Alice in Wonderland pastiche &mdash; Harold Somers, Fiona Tweedie: &ldquo;Authorship Attribution and Pastiche&rdquo;, in: Computers and the Humanities, 37.4, 2003, 407-429, paywalled at: http://link.springer.com/article/10.1023%2FA%3A1025786724466 &mdash; another one on detecting fraudulous authorship in online media &mdash; Sadia Afroz et al., &ldquo;Detecting Hoaxes, Frauds, and Deception in Writing Style&rdquo;, Proceedings of the 2012 IEEE Symposium on Security and Privacy, 2012, p. 461-475, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6234430 &mdash; [edit: and a third one on Raymond Chandler imitations &mdash; Lee Sigelman and William Jacoby, &ldquo;The Not-So-Simple Art of Imitation: Pastiche, Literary Style, and Raymond Chandler&rdquo;, Computers and the Humanities 30.1, 1996, 11-28].">1</a>] and make an educated guess: If the author is very good, he or she can pull it off, but in most cases, I suspect that the author signal remains strong even in a pastiche, i.e. 
the imitation of the style or character of a work of art.[<a href="https://dragonfly.hypotheses.org/745#footnote_1_745" id="identifier_1_745" class="footnote-link footnote-identifier-link" title="See art. &ldquo;Pastiche&rdquo;, Wikipedia, http://en.wikipedia.org/wiki/Pastiche.">2</a>] But I had never come across a good test-case from French literature for an actual stylometric experiment.</p> <p>The other day, when expanding my collection of French crime fiction novels, I finally came across a suitable test-case. The authors Pierre Boileau and Thomas Narcejac, better known as <a title="Boileau-Narcejac" href="http://en.wikipedia.org/wiki/Boileau-Narcejac" target="_blank">Boileau-Narcejac</a>[<a href="https://dragonfly.hypotheses.org/745#footnote_2_745" id="identifier_2_745" class="footnote-link footnote-identifier-link" title="See also: Claude Mespl&egrave;de, &ldquo;Boileau-Narcejac&rdquo;, Dictionnaire des litt&eacute;ratures polici&egrave;res, Nantes: Joseph K, 2007, vol. 2, p. 410-411.">3</a>], have not only jointly written a large number of suspense-oriented crime fiction novels, from the 1950s to the late 1980s, with their collaboration being an interesting case for stylometry in itself. As admirers of early twentieth-century proponents of the genre, they have also produced, in the course of the 1970s, five pastiches of Maurice Leblanc&#8217;s novels featuring Arsène Lupin. The five novels are: <em>Le Secret d&#8217;Eunerville</em> (1973), <i>La Poudrière </i>(1974), <i>Le Second visage d&#8217;Arsène Lupin </i>(1975), <i>La Justice d&#8217;Arsène Lupin</i> (1977) and <i>Le Serment d&#8217;Arsène Lupin</i> (1979). 
These novels feature Maurice Leblanc&#8217;s famous &#8220;gentleman-cambrioleur&#8221; Arsène Lupin[<a href="https://dragonfly.hypotheses.org/745#footnote_3_745" id="identifier_3_745" class="footnote-link footnote-identifier-link" title="A legal agreement with the heir, Claude Leblanc, was necessary to allow Boileau-Narcejac to feature the character; O tempora, o mores!">4</a>] and lead him to new adventures in which to be bold, save several people&#8217;s lives, meet beautiful women, and possibly steal (and/or restitute) some valuable jewlery or works of art in the process. In their brief prefatory note to <em>La Poudrière</em>, the authors explicitly designate their novels as &#8220;pastiches&#8221; and state the following (my translation): &#8220;There is a Leblanc style [écriture], whose mouvement it is not all that difficult to reproduce ; but there is, in Leblanc, a creativity, a way of approaching the absence of verisimilitude with a natural elegance, which intimidated us a lot.&#8221;[<a href="https://dragonfly.hypotheses.org/745#footnote_4_745" id="identifier_4_745" class="footnote-link footnote-identifier-link" title="French original: &ldquo;Il y a une &eacute;criture Leblanc, dont il n&rsquo;est pas trop difficile de reproduire le mouvement ; mais il y a, chez Leblanc, une invention, une mani&egrave;re de c&ocirc;toyer l&rsquo;invraisemblance avec naturel et &eacute;l&eacute;gance, qui nous intimidait beaucoup.&rdquo;">5</a>] Basically, they say: it was easy to imitate the style, but hard to imitate the plot.</p> <p>My reading experience would rather suggest the opposite: it seems easier to imitate the characters and events of a type of novel than its style. So, to approach the question of style, what if we prepared a collection of texts to test whether these pastiches of Leblanc by Boileau-Narcejac (four of which I have) are more similar either to Leblanc&#8217;s originals or to Boileau-Narcejac&#8217;s other crime fiction novels? 
Besides the contestants, such a collection should also contain crime fiction contemporary to Leblanc&#8217;s originals (such as novels by Gaston Leroux) as well as contemporary to Boileau-Narcejac&#8217;s pastiches and their other novels (such as novels by Georges Simenon, Léo Malet, Frédéric Dard, and Jean-Patrick Manchette). This makes for a collection of 123 crime fiction novels, written by seven different authors (Boileau-Narcejac counted as one) during a period of around seventy years (1908-1977). While I have all these texts at my disposal, unfortunately, I do not have any &#8216;normal&#8217; crime fiction by Boileau-Narcejac exactly contemporary to the pastiches, only a number of slightly earlier ones, so if their Leblanc pastiches turn out to be different from their &#8216;normal&#8217; novels, this could be due to chronology. Also, narrative perspective and its consequences for the frequencies of pronouns and verb forms may get in our way (French being a highly inflected language). But let&#8217;s see.</p> <p>First things first, I wanted to see what could be done with a well-established distance measure such as Eder&#8217;s Delta (a variant of Burrows&#8217; Delta which has served me well on French texts), using the <em>stylo for R</em> package.[<a href="https://dragonfly.hypotheses.org/745#footnote_5_745" id="identifier_5_745" class="footnote-link footnote-identifier-link" title="Maciej Eder, Mike Kestemont, and Jan Rybicki. (2013). &ldquo;Stylometry with R: a suite of tools&rdquo;. Digital Humanities 2013: Conference Abstracts. Lincoln: University of Nebraska-Lincoln, pp. 487-89. 
[pre-print].">6</a>] In order to minimize the influence of narrative perspective on the word frequency profiles of the texts, I used a custom list of stopwords including personal pronouns, possessive pronouns and some very common verbs in their first and third person form.[<a href="https://dragonfly.hypotheses.org/745#footnote_6_745" id="identifier_6_745" class="footnote-link footnote-identifier-link" title="Here is how to do this (explanation courtesy of Maciej): 1. Run stylo normally: &gt; results1 = stylo() 2. Assign the frequencies to a variable: &gt; frequencies1 = results1$frequencies.0.culling 3. Define a new variable with frequencies without stopwords: &gt; frequencies2 = delete.stop.words(frequencies1, stop.words=c(&quot;your&quot;,&quot;stop&quot;,&quot;words&quot;) ) 4. Run stylo again with culled table of frequencies: &gt; stylo(frequencies = frequencies2). That&rsquo;s it.">7</a>] In addition, I excluded the names &#8220;Arsène&#8221; and &#8220;Lupin&#8221;, in order not to let the simple mention of the hero&#8217;s name influence the assessment of stylistic similarity.[<a href="https://dragonfly.hypotheses.org/745#footnote_7_745" id="identifier_7_745" class="footnote-link footnote-identifier-link" title="Here is the list of stopwords: &ldquo;je&rdquo;, &ldquo;j&rdquo;, &ldquo;tu&rdquo;, &ldquo;il&rdquo;, &ldquo;nous&rdquo;, &ldquo;vous&rdquo;, &ldquo;elle&rdquo;, &ldquo;ils&rdquo;, &ldquo;elles&rdquo;, &ldquo;moi&rdquo;, &ldquo;lui&rdquo;, &ldquo;me&rdquo;,&rdquo;m&rdquo;,&rdquo;se&rdquo;,&rdquo;te&rdquo;,&rdquo;t&rdquo;, &ldquo;ma&rdquo;, &ldquo;sa&rdquo;, &ldquo;mes&rdquo;, &ldquo;son&rdquo;, &ldquo;mon&rdquo;, &ldquo;ses&rdquo;,&rdquo;votre&rdquo;, &ldquo;ai&rdquo;, &ldquo;a&rdquo;, &ldquo;suis&rdquo;, &ldquo;est&rdquo;, &ldquo;sont&rdquo;, &ldquo;ont&rdquo;, &ldquo;avais&rdquo;, &ldquo;avait&rdquo;, &ldquo;avaient&rdquo;, &ldquo;avez&rdquo;, &ldquo;&eacute;tais&rdquo;, &ldquo;&eacute;tait&rdquo;, &ldquo;&eacute;taient&rdquo;, &ldquo;fus&rdquo;, 
&ldquo;fut&rdquo;, &ldquo;fais&rdquo;, &ldquo;fait&rdquo;, &ldquo;font&rdquo;, &ldquo;fis&rdquo;, &ldquo;fit&rdquo;, &ldquo;sais&rdquo;, &ldquo;sait&rdquo;, &ldquo;&ccedil;a&rdquo;, &ldquo;c&rdquo;, &ldquo;dis&rdquo;, &ldquo;dit&rdquo;, &ldquo;disent&rdquo;, &ldquo;Ars&egrave;ne&rdquo;, &ldquo;Lupin&rdquo;.">8</a>] The pastiches by Boileau-Narcejac have their own &#8220;author&#8221; label (BoilNarcP), so they will be set apart visually in the results.</p> <p>As always, the possibilities for parameter setting are endless, and adding a custom-built list of stopwords to the mix does not help. But let&#8217;s start with a carefully cherry-picked if conservative set of parameters: the 400 most frequent words only, without the above-mentioned stopwords, culling of 20% applied to somewhat level the playing field, and using Eder&#8217;s Delta. The result is the following dendrogram:</p> <div id="attachment_761" style="width: 190px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012.jpg"><img aria-describedby="caption-attachment-761" loading="lazy" class="size-medium wp-image-761" src="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-180x300.jpg" alt="Dendrogram of stylistic similarities (stylo, Eder's Delta, 400 MFW, custom stoplist, 20% culling). Click to enlarge." 
width="180" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-180x300.jpg 180w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-300x500.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012-624x1040.jpg 624w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_CA_400_MFWs_Culled_20__Eders-Delta_0012.jpg 1800w" sizes="(max-width: 180px) 100vw, 180px" /></a><p id="caption-attachment-761" class="wp-caption-text">Figure 1: Dendrogram of stylistic similarities (stylo, Eder&#8217;s Delta, 400 MFW, custom stoplist, 20% culling). Click to enlarge.</p></div> <p>As you can easily see from figure 1, the results are almost flawless: the 123 texts cluster into perfectly coherent, strongly-separated author-based groups (only exception: one novel by Malet which clusters with Dard). These combine into larger groups (from top to bottom): one made up of Boileau-Narcejac relatively close to Frédéric Dard as well as Jean-Patrick Manchette, with Léo Malet a bit more removed; a second one with Maurice Leblanc and Gaston Leroux, clearly based on chronology and shared narrative perspective. And finally, Georges Simenon on his own.</p> <p>The four pastiches of Maurice Leblanc&#8217;s Arsène Lupin, written by Boileau-Narcejac, which are shown in green, are right among the other texts by Boileau-Narcejac. Nothing indicates their special status. On the level of analysis used here, and with the parameters used, any potentially existing &#8220;Maurice Leblanc&#8221; style in these four novels remains invisible. 
It is not entirely clear to me whether this is a success or a failure, but it is certainly surprising!</p> <div id="attachment_762" style="width: 310px" class="wp-caption alignright"><a href="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011.jpg"><img aria-describedby="caption-attachment-762" loading="lazy" class="size-medium wp-image-762 " src="http://dragonfly.hypotheses.org/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-300x300.jpg" alt="Principal Component Analysis (same settings as above). Click to enlarge." width="300" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-300x300.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-150x150.jpg 150w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-500x500.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/Lupins3_PCA_400_MFWs_Culled_20__PCA_0011-624x624.jpg 624w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-762" class="wp-caption-text">Figure 2: Principal Component Analysis (same settings as above). Click to enlarge.</p></div> <p>A Principal Component Analysis with analogous settings reveals the same relationships, adding some nuance (see figure 2): Simenon is the only author removed from the others in the first dimension. Also, there is not just closeness, but overlap between Leroux and Leblanc. Manchette seems to have a significantly larger stylistic range compared to what the dendrogram suggests. And again, the pastiches are right in the middle of the other Boileau-Narcejac texts. 
Interestingly, the one text by Boileau-Narcejac approaching the Maurice Leblanc novels, further to the upper right, is not one of the pastiches, but the 1957 novel <em>Les Magiciennes</em>.</p> <p>Now, things actually get a little bit more mixed-up when using a longer wordlist: at some point, three novels by Malet join the Dard-cluster; and at some other point, three novels by Leblanc also join the Dard-cluster. The reasons for this, and the type of stylistic similarity involved, remain to be investigated. However, the two types of Boileau-Narcejac texts always cluster solidly together, without any intrusions from texts by other authors.</p> <div id="attachment_772" style="width: 146px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/final-results.jpg"><img aria-describedby="caption-attachment-772" loading="lazy" class=" wp-image-772 " src="http://dragonfly.hypotheses.org/files/2015/01/final-results.jpg" alt="Results from classification tasks (several algorithms). Click to enlarge." width="136" height="300" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/final-results.jpg 454w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/final-results-136x300.jpg 136w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/final-results-227x500.jpg 227w" sizes="(max-width: 136px) 100vw, 136px" /></a><p id="caption-attachment-772" class="wp-caption-text">Figure 3: Results from classification tasks (several algorithms). Click to enlarge.</p></div> <p>The Boileau-Narcejac pastiches&#8217; solid stylistic identification as authored by Boileau-Narcejac is also confirmed by various classification tasks performed using the classify() function in stylo for R, this time using a lot more features (the 5000 most frequent words) and no custom list of stopwords (the idea is that the algorithms will sort it all out). 
Again, while not perfectly unanimous on all cases, the results are rock solid for the pastiches (see figure 3).</p> <p>So, what exactly is going on? It is possible to approach these distant-reading results from a few more angles. For example, when trying to understand how exactly the normal novels and the pastiches by Boileau-Narcejac are similar, one could look at their word frequencies in comparison. Just for illustration, the figure below shows the word frequencies for the 50 most frequent words, not for each of the 123 novels, but as averages across the novels by one &#8220;author&#8221;: Boileau-Narcejac, their pastiches, Leblanc (the target of their pastiche) and Frédéric Dard.</p> <div id="attachment_764" style="width: 310px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/author-averages.jpg"><img aria-describedby="caption-attachment-764" loading="lazy" class="size-medium wp-image-764 " src="http://dragonfly.hypotheses.org/files/2015/01/author-averages-300x230.jpg" alt="Average word frequencies for four &quot;authors&quot; (first 50 most frequent words). Click to enlarge." width="300" height="230" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages-300x230.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages-500x384.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages-624x480.jpg 624w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/author-averages.jpg 1242w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-764" class="wp-caption-text">Average word frequencies for four &#8220;authors&#8221; (first 50 most frequent words). Click to enlarge.</p></div> <p>Of course, this is only a small portion of the 400 most frequent words used for the cluster analysis and principal component analysis above. 
But the issue becomes clear, I think: Sure, one can perfectly well find words for which the &#8216;normal&#8217; novels and the pastiches by Boileau-Narcejac seem to &#8220;stick together&#8221; while Leblanc and Dard deviate from them and/or from each other (e.g. for &#8220;pas&#8221;, &#8220;en&#8221;, &#8220;mais&#8221;, &#8220;bien&#8221;). But one can just as well find words for which the &#8216;normal&#8217; novels deviate from the pastiches, which seem closer to Leblanc (e.g. &#8220;à&#8221;, &#8220;dans&#8221;, &#8220;comme&#8221;). Each of these words merits an investigation into its stylistic and content-wise contribution to the four groups of novels. However, it is clear that only quantitative methods, namely distance measures, can add up all these subtle differences and similarities into a score.</p> <p>Distance measures, yes, but correlation tests could also be interesting to assess the similarity between these word frequency distributions. The following is a table of Pearson&#8217;s correlation tests on the average word frequencies for each of the 400 most frequent words across all novels by a given author (Boileau-Narcejac and their pastiches being, again, treated as two separate authors).</p> <div id="attachment_765" style="width: 310px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2015/01/cor-test-matrix.jpg"><img aria-describedby="caption-attachment-765" loading="lazy" class=" wp-image-765 " src="http://dragonfly.hypotheses.org/files/2015/01/cor-test-matrix-300x65.jpg" alt="Correlation table between authors (average word frequencies across all novels by one author). Click to enlarge." 
width="300" height="65" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix-300x65.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix-500x108.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix-624x135.jpg 624w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2015/01/cor-test-matrix.jpg 1553w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-765" class="wp-caption-text">Correlation table between authors (average word frequencies across all novels by one author). Click to enlarge.</p></div> <p>The table shows how much correlation there is between authors. Generally speaking, correlations are very strong, and differences are subtle (even with the confidence level set to 0.99). You could think of it as a &#8220;poor man&#8217;s distance matrix&#8221;. In any case, it shows without surprise (by now) that Boileau-Narcejac&#8217;s Arsène Lupin pastiches are more similar to these authors&#8217; other novels (correlation score of 0.9938) than to the original Arsène Lupin novels by Maurice Leblanc (0.9871).</p> <p>So, what if an author tries to imitate the style of another author? Will stylometric tools be fooled, or not? In this test case, they have clearly not been fooled. In terms of quantitative stylistics, the Leblanc pastiches by Boileau-Narcejac are clearly written in the style of Boileau-Narcejac. 
As for the plot, which is unmistakably Leblanc&#8217;esque, that is another story.</p> Notes<ol class="footnotes"><li id="footnote_0_745" class="footnote">One on an <em>Alice in Wonderland</em> pastiche &#8212; Harold Somers, Fiona Tweedie: &#8220;Authorship Attribution and Pastiche&#8221;, in: Computers and the Humanities, 37.4, 2003, 407-429, paywalled at: http://link.springer.com/article/10.1023%2FA%3A1025786724466 &#8212; another one on detecting fraudulent authorship in online media &#8212; Sadia Afroz et al., &#8220;Detecting Hoaxes, Frauds, and Deception in Writing Style&#8221;, Proceedings of the 2012 IEEE Symposium on Security and Privacy, 2012, p. 461-475, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6234430 &#8212; [edit: and a third one on Raymond Chandler imitations &#8212; Lee Sigelman and William Jacoby, &#8220;The Not-So-Simple Art of Imitation: Pastiche, Literary Style, and Raymond Chandler&#8221;, Computers and the Humanities 30.1, 1996, 11-28].</li><li id="footnote_1_745" class="footnote"> See art. &#8220;Pastiche&#8221;, Wikipedia, <a title="Pastiche" href="http://en.wikipedia.org/wiki/Pastiche" target="_blank">http://en.wikipedia.org/wiki/Pastiche</a>.</li><li id="footnote_2_745" class="footnote">See also: Claude Mesplède, &#8220;Boileau-Narcejac&#8221;, <cite>Dictionnaire des littératures policières</cite>, Nantes: Joseph K, 2007, vol. 2, p. 
410-411.</li><li id="footnote_3_745" class="footnote">A legal agreement with the heir, Claude Leblanc, was necessary to allow Boileau-Narcejac to feature the character; O tempora, o mores!</li><li id="footnote_4_745" class="footnote">French original: &#8220;Il y a une écriture Leblanc, dont il n&#8217;est pas trop difficile de reproduire le mouvement ; mais il y a, chez Leblanc, une invention, une manière de côtoyer l&#8217;invraisemblance avec naturel et élégance, qui nous intimidait beaucoup.&#8221;</li><li id="footnote_5_745" class="footnote">Maciej Eder, Mike Kestemont, and Jan Rybicki. (2013). &#8220;Stylometry with R: a suite of tools&#8221;. <a href="http://dh2013.unl.edu/abstracts/" rel="nofollow"><i>Digital Humanities 2013: Conference Abstracts</i></a>. Lincoln: University of Nebraska-Lincoln, pp. 487-89. [<a href="https://sites.google.com/site/computationalstylistics/preprints/Eder-Kestemont-Rybicki_Stylometry_with_R.pdf?attredirects=0">pre-print</a>].</li><li id="footnote_6_745" class="footnote">Here is how to do this (explanation courtesy of Maciej):<br /> 1. Run stylo normally:<br /> <code>&gt; results1 = stylo()</code><br /> 2. Assign the frequencies to a variable:<br /> <code>&gt; frequencies1 = results1$frequencies.0.culling</code><br /> 3. Define a new variable with frequencies without stopwords:<br /> <code>&gt; frequencies2 = delete.stop.words(frequencies1, stop.words=c("your","stop","words") )</code><br /> 4. 
Run stylo again with the filtered table of frequencies:<br /> <code>&gt; stylo(frequencies = frequencies2)</code>.<br /> That&#8217;s it.</li><li id="footnote_7_745" class="footnote">Here is the list of stopwords: &#8220;je&#8221;, &#8220;j&#8221;, &#8220;tu&#8221;, &#8220;il&#8221;, &#8220;nous&#8221;, &#8220;vous&#8221;, &#8220;elle&#8221;, &#8220;ils&#8221;, &#8220;elles&#8221;, &#8220;moi&#8221;, &#8220;lui&#8221;, &#8220;me&#8221;, &#8220;m&#8221;, &#8220;se&#8221;, &#8220;te&#8221;, &#8220;t&#8221;, &#8220;ma&#8221;, &#8220;sa&#8221;, &#8220;mes&#8221;, &#8220;son&#8221;, &#8220;mon&#8221;, &#8220;ses&#8221;, &#8220;votre&#8221;, &#8220;ai&#8221;, &#8220;a&#8221;, &#8220;suis&#8221;, &#8220;est&#8221;, &#8220;sont&#8221;, &#8220;ont&#8221;, &#8220;avais&#8221;, &#8220;avait&#8221;, &#8220;avaient&#8221;, &#8220;avez&#8221;, &#8220;étais&#8221;, &#8220;était&#8221;, &#8220;étaient&#8221;, &#8220;fus&#8221;, &#8220;fut&#8221;, &#8220;fais&#8221;, &#8220;fait&#8221;, &#8220;font&#8221;, &#8220;fis&#8221;, &#8220;fit&#8221;, &#8220;sais&#8221;, &#8220;sait&#8221;, &#8220;ça&#8221;, &#8220;c&#8221;, &#8220;dis&#8221;, &#8220;dit&#8221;, &#8220;disent&#8221;, &#8220;Arsène&#8221;, &#8220;Lupin&#8221;.</li></ol>
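<p>As a rough illustration of the correlation comparison described in the post, the following R sketch averages per-novel word frequencies by &#8220;author&#8221; label and then computes Pearson correlations between the averaged profiles. It is only a sketch under stated assumptions: the frequency values, the novel labels and the variable names are invented for illustration and are not the data behind the table in the post. Only base R functions are used, and the confidence level of 0.99 mirrors the setting mentioned in the text.</p>

```r
# Toy relative frequencies (per 100 words) for five frequent French words.
# ALL numbers and novel labels are hypothetical, not the post's data.
freqs = rbind(
  BoilNarc_A  = c(de = 3.1, la = 2.4, pas = 1.4, dans = 0.7, comme = 0.3),
  BoilNarc_B  = c(de = 2.9, la = 2.6, pas = 1.2, dans = 0.9, comme = 0.3),
  BoilNarcP_A = c(de = 3.2, la = 2.3, pas = 1.1, dans = 1.0, comme = 0.4),
  BoilNarcP_B = c(de = 3.0, la = 2.5, pas = 1.3, dans = 0.8, comme = 0.3),
  Leblanc_A   = c(de = 3.7, la = 2.1, pas = 0.7, dans = 1.2, comme = 0.5),
  Leblanc_B   = c(de = 3.5, la = 2.3, pas = 0.9, dans = 1.0, comme = 0.5)
)

# One "author" label per row; the pastiches get their own label (BoilNarcP).
author = c("BoilNarc", "BoilNarc", "BoilNarcP", "BoilNarcP", "Leblanc", "Leblanc")

# Average the profiles of all novels sharing a label:
# the result has one row per word and one column per "author".
avg = sapply(split(as.data.frame(freqs), author), colMeans)

# Pairwise Pearson correlations between the averaged profiles.
cors = cor(avg, method = "pearson")
print(round(cors, 4))

# A single correlation test with the confidence level set to 0.99.
test = cor.test(avg[, "BoilNarcP"], avg[, "BoilNarc"], conf.level = 0.99)
print(test$estimate)
print(test$conf.int)
```

<p>With real data, <code>freqs</code> would presumably be a table of relative frequencies for the 400 most frequent words (one row per novel), for instance the frequency table returned by stylo; how exactly the table in the post was produced is not stated, so this mapping is an assumption.</p>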