content:encoded
| Belgian author Georges Simenon is probably most famous for his crime novels, in which police detective Maigret investigates serious crimes and solves them with intelligence, empathy, a team of inspectors and acquaintances, and, of course, his tobacco pipe as well as sandwiches and beers brought to his office. Simenon wrote 75 of these Maigret novels (and a number of Maigret novellas) over an impressive period stretching from 1930 to 1972. In addition to around 200 popular novels written under pseudonyms, which he started to write in the early 1920s, Simenon also wrote, during that same forty-year period, no fewer than 117 other novels which are devoted not to the investigation and elucidation of a crime, but to the psychological tensions and interior dramas which shape the lives of ordinary individuals. These novels are usually referred to as the “romans durs”. In some cases, these novels also involve crimes, but if they do, they do so entirely from the perspective of the perpetrator or victim of the crime, not from that of an investigator. Interestingly, while the Maigret novels no doubt made Simenon rich and famous, he held his psychological novels in higher esteem. Simenon is, finally, the author of an ensemble of 24 autobiographical writings, most (but not all) of which he recorded using a dictaphone later in life, in the 1970s and the early 1980s. These later works are usually referred to as the “Dictées” and are probably the least-read part of Simenon’s oeuvre.
A body of work of such dimensions appears ideally suited to digital, quantitative approaches, yet such studies have been rare. Earlier quantitative investigations into Simenon’s work (notably the contribution by François Richeaudeau in 1982) still relied on manual counting of phenomena. Fortunately, this is no longer necessary: almost all of these works have recently been published in digital editions by Omnibus (see their aptly named website dedicated to this endeavor, http://www.toutsimenon.com/), and we now have the tools, notably scripting languages such as Python or dedicated text analysis tools such as TXM, which make such analyses relatively accessible.
As part of a larger investigation into the work of Simenon, which is in turn part of a larger research project into literary genre stylistics using quantitative methods, I have recently been investigating the question of sentence length in the work of Georges Simenon compared to other French novelists. It is sometimes said (including by Simenon himself) that part of the reason why Simenon was so successful was that he used a very simple style, characterized by short sentences and relatively simple, concrete words. François Richeaudeau investigated the question of sentence length in Simenon’s writings in his study “Simenon : une écriture pas si simple qu’on le penserait” (Communication et langages 53.1, 1982). So it was a natural starting point to repeat that earlier study using a somewhat larger selection of works by Georges Simenon as well as a comparison corpus. Of course, sentence length is only one among many aspects of a writer’s style. However, it is a very important one, as it is strongly related to syntactic and stylistic complexity as well as to readability. In early stylometry, sentence length was used as a key feature for authorship attribution or chronological attribution. Also, it is of course a measure that is extremely easy to quantify.
What is a sentence, what is a word?
In order to investigate sentence length in a body of texts, the most fundamental question to answer is: how do we define a sentence, and how do we measure its length? One of the (many) good things about digital literary studies done quantitatively is that, rather than coming up with some definition which is then applied in an uncontrolled fashion, our definitions of such things as sentences and sentence length are embodied, so to speak, in the code used to implement the analyses. In the research reported here, a sentence is simply defined as a textual unit contained inside a paragraph which is delimited from the previous and/or next sentence either by strong punctuation (full stop, question mark or exclamation mark) or the beginning or end of the paragraph. In the Python code used for this, this looks as follows:
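import re

Paras = re.split("\n", Text)  # one paragraph per line of the text
for Para in Paras:
    if len(Para) > 2:  # skip empty or near-empty paragraphs
        Sents = re.split(r"[.?!]", Para)  # split at strong punctuation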
With the sentence defined in this manner, I decided to measure sentence length by counting the words contained in each sentence, regardless of the length (in characters) of the words themselves. Again, it is necessary to define what a word is, and for the research reported on here, a word is anything other than punctuation (which is therefore removed) that is separated from the surrounding words by whitespace, a short hyphen (-) or an apostrophe (’). This is not uncontroversial, of course, especially in a language such as French, which has quite a number of compound words such as “aujourd’hui” or “peut-être” which are separated into two words under this model. However, there are many more examples of French constructions where this makes sense, such as the ubiquitous “j’ai” or “l’amour/l’usine/l’orange” and similar forms, so that this decision appears reasonable. Also, of course, such a decision affects all texts equally, so that (assuming no systematic differences in the use of such constructions with apostrophe and hyphen), even if the absolute values do change, the relative values when comparing two ensembles should not be affected significantly. The following is a simplified version of the code used for counting the words in the manner described:
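Sent = re.sub(r"[.,!?:;«»]", " ", Sent)  # replace punctuation with spaces
Sent = Sent.strip()
# Split at non-word characters, apostrophes and hyphens; drop the empty
# strings that would otherwise be counted as words.
Words = [Word for Word in re.split(r"[\W'-]+", Sent) if Word]
LengthWords = len(Words)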
One thing to note about my approach: It would have been easy to calculate the average sentence length for a novel by counting the words and counting the sentences and simply dividing the number of words by the number of sentences. However, this approach would remove a lot of relevant information, namely, the spread (from shortest to longest sentence) and dispersion (measured as the standard deviation) of the sentence lengths in a given text, which are interesting indicators of what range of sentence lengths is used and how varied the sentence length is.
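To make the point about per-sentence information concrete, here is a minimal sketch that combines the two snippets above and keeps the individual sentence lengths before summarizing them. It is not the code from the repository: the function names `sentence_lengths` and `summarize`, the sample text, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions made for illustration.

```python
import re
import statistics


def sentence_lengths(text):
    """Return the length in words of every non-empty sentence in text."""
    lengths = []
    for para in text.split("\n"):
        if len(para) <= 2:  # skip empty or near-empty paragraphs
            continue
        for sent in re.split(r"[.?!]", para):
            # Split at non-word characters, apostrophes and hyphens.
            words = [w for w in re.split(r"[\W'-]+", sent) if w]
            if words:  # ignore fragments that contain no words
                lengths.append(len(words))
    return lengths


def summarize(lengths):
    """Summarize sentence lengths without discarding their variation."""
    return {
        "sentences": len(lengths),
        "words": sum(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Assumption: population standard deviation; the sample version
        # (statistics.stdev) would be an equally plausible choice.
        "stdev": statistics.pstdev(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
    }


# Invented sample text, used only to exercise the functions.
sample = "Il pleut. Maigret fume sa pipe, lentement !\nJ'ai faim. Peut-être demain ?"
print(summarize(sentence_lengths(sample)))
```

Keeping the list of lengths, rather than only two totals, is what makes the median, the standard deviation and the shortest and longest sentence available afterwards.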
In any case, equipped with this script, I analysed the sentence length (mean, median and standard deviation) for my collection of French texts and then visualized the results using the wonderful data visualization library Pygal to create an interactive time-series scatterplot of the data.
How does Simenon stack up against his contemporaries?
Currently, I am working with a collection of Simenon’s writings composed of 40 Maigret novels (10 for each of the four decades in which he published them), 40 psychological novels (spread out in the same way) as well as the complete set of 24 autobiographical writings (2 of which were written before 1970 and the remainder afterwards). In order to be able to put results in perspective, I have also used a collection of 480 French novels written during the 20th century and into the beginning of the 21st century, a collection which, although by no means representative, does include a wide range of recognized and lesser-known French authors.
The following is simply an account of some initial results concerning sentence length in these two collections of text. The main results are contained in the following two graphs. (For technical reasons, only static versions of the graphs can be displayed here. An interactive SVG-based version, which actually makes it easier to see things, is available from the GitHub project repository; for it to become interactive, you need to right-click the image there, not here, and select “view image”. Sorry about that.)
Figure 1: Average sentence length for 480 French novels from 1900 to 2015
The first graph (figure 1 above) shows the average sentence length for the 480 French novels published during the time between 1900 and 2015. Every red dot represents one novel, and its position on the x-axis indicates its date of publication while its position on the y-axis indicates the average sentence length, measured in the number of words, of that novel. It is easy to see that most of the novels have an average sentence length somewhere in the range between 10 and 20 words per sentence. It also appears from this graph that there is no real change over time. Rather, most novelists stick to that range, whether they write in the 1930s or the 1990s, although interestingly, they do so with particular discipline between 1930 and 1950, roughly.
However, the number of writers choosing to depart from this unwritten rule and use longer sentences does increase after 1970 or so. The most extreme case is a novel with an average sentence length of 137 words, Claude Simon’s Tramway, published in 2001, although this is as much a matter of punctuation habits as of sentence complexity. Earlier examples of novels with long sentences can be seen in the 1910s and 1920s, most of which are actually the different volumes of Marcel Proust’s À la recherche du temps perdu, notorious for its long sentences (and overall length). Taken together, these French novels have a mean sentence length of 15.7 words, with a range from 5 to 137. The shortest sentences on average, by the way, are used by Céline, with only slightly more than 5 words per sentence in some of his novels, although this is again, of course, heavily influenced by punctuation habits.
Figure 2: Average sentence length with an additional 104 works by Simenon
The second graph (figure 2 above) adds Simenon’s publications to the graph, in three groups: the Maigret novels (green), the psychological novels (yellow) and the autobiographical writings (blue). It becomes clear from this that Simenon does use rather short sentences in both of the novelistic genres he practised. At least, his mean of 12.1 words per sentence is noticeably lower than that of the novelists generally. However, it is also clear that he is in no way exceptional in this respect, as virtually all of Simenon’s novels stay safely in the range occupied by most other novelists as well, that is, between 10 and 20 words per sentence. The mean value of the French novels overall is, after all, influenced by the small number of novels with rather high sentence lengths. Finally, it becomes clear that for the two novelistic genres, there is hardly any chronological development inside Simenon’s work, although one could be tempted to see slightly higher values in the years around 1955 than before and after. This is not a very clear trend, however.
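The remark that a few novels with very long sentences pull the overall mean upwards can be illustrated with a small sketch. The values below are invented for illustration and are not taken from the corpus; only the upper extreme of 137 echoes the figure mentioned above.

```python
import statistics

# Hypothetical per-novel mean sentence lengths (in words), invented for
# illustration only; they are not values from the collection.
typical = [11, 12, 13, 14, 15, 16, 17, 18]
with_outliers = typical + [60, 137]  # two novels with very long sentences

for label, values in [("typical", typical), ("with outliers", with_outliers)]:
    print(label,
          "mean:", round(statistics.mean(values), 1),
          "median:", statistics.median(values))
```

In this toy example, the two extreme values move the mean far more than the median, which is why the median can be a more robust point of comparison when a collection contains a handful of outliers.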
Figure 3: Detailed view of the Simenon texts only.
The question remains whether there are systematic differences between the various genres practised by Simenon. Figure 3 shows just the Simenon texts in a close-up view of the above graph. The psychological novels do have a slightly higher mean sentence length than the Maigret novels (12.6 vs. 11.5), but this is barely significant and certainly no explanation for the fact that the Maigret novels have been so much more successful commercially. The autobiographical writings do have a clearly higher average sentence length, with a value of 18.3 words per sentence. There are several interesting hypotheses to be considered to explain this. What makes this question tricky is that Simenon stopped writing novels completely in the years when he wrote the largest part of this autobiographical material. This makes it hard, methodologically speaking, to decide whether the underlying reason for the autobiographical writings’ longer sentences is a question of genre or one of chronological development.
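Whether a difference between two groups of novels is more than chance variation could be checked, for instance, with a simple permutation test. The following is only a sketch of that general technique, not the procedure used for this post; the function name `permutation_test`, its parameters and the two lists of values are hypothetical and invented for illustration.

```python
import random


def permutation_test(group_a, group_b, rounds=10000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(rounds):
        # Reassign the values to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return hits / rounds


# Hypothetical per-novel mean sentence lengths, not values from the corpus.
maigret = [10.8, 11.2, 11.5, 11.9, 12.1, 11.4]
romans_durs = [12.0, 12.4, 12.9, 13.1, 12.3, 12.8]
print(permutation_test(maigret, romans_durs))
```

The returned value is the share of random reassignments that produce a difference at least as large as the observed one; a small share suggests that the difference is unlikely to be due to chance alone. It says nothing, of course, about whether the difference matters stylistically.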
One hypothesis, which considers this to be an effect of literary genre, would be able to explain why the novels have shorter sentences than the autobiographical pieces, for instance by pointing out that novels contain relatively high proportions of dialogue and direct speech and that direct speech is known to be built with shorter sentences than narrative. However, it is not clear whether Simenon’s autobiographical writings really do contain less dialogue than his novels (this is something to investigate using the methods for direct speech detection in the absence of clear typographical markers we recently presented at the DH conference 2016 in Krakow). More importantly, however, the two autobiographical texts written in the 1940s and the 1960s, respectively, do not participate in this trend. They have an average sentence length on the same level as many of the novels written during that time. So it may seem that this difference is more about chronology than about genre, but if this is the case, the later development is not prepared by an earlier trend towards longer sentences.
Another hypothesis, however, would relate the difference between the novels and the autobiographical writings neither to genre nor to time, but to the mode of writing. In fact, Simenon did not himself physically write the later autobiographical materials using a pen or typewriter; rather, he recorded them using a dictaphone and they were later transcribed. This could explain why the earlier autobiographical writings, which were written by Simenon himself, have shorter sentences, while the later ones, which were dictated, have longer sentences. Whether this is an effect of the writing medium, in Nietzsche’s sense, or an effect of the transcription habits of Simenon’s transcriber(s), remains an open question.
Final thoughts
For one, a note for the data-minded readers: As usual, the code and some of the data are available online in the CLiGS project repository. I am not able to make the full texts available, for obvious but unfortunate reasons of copyright. However, the repository includes the code used to analyze the texts as well as the table of results with some basic metadata as well as the values for number of sentences, number of words, mean and median sentence length, and standard deviation of sentence length. Maybe this can serve as a comparison point for other people’s investigations. Also, of course, it contains a lot more detailed information, with metadata and results for each novel, than what can be shown in a chart.
And to conclude: sentence length alone cannot, in any case, answer the question of why the Maigret novels were (and continue to be) so popular. The next steps from here will involve at least three further aspects of Simenon’s work. One is to investigate his use of concrete rather than abstract words, again comparing the three major genres he practised to the larger body of French twentieth-century novels in order to check whether he really used more concrete words than is usual. Another is to check his (and some critics’) claim that he used a rather limited vocabulary of simple words and, again, to compare this with other writers. Finally, I would like to analyze his use of themes, narrative motifs and descriptive vocabulary using topic modeling, again in a comparative perspective. Lots of exciting things to do!
[Edit, August 15, 2016:] Erik Champion points out to me, on Twitter, that the popularity of the Maigret novels may have been due to character, plot or setting as well, which I think is probably true but more difficult to measure. However, it made me think of another difference between the Maigret novels and the “romans durs”, which is that the latter frequently have a rather dark, sometimes depressing quality to them. It would be very tempting to investigate this difference using Sentiment Analysis. Another item on my list…
[Edit, September 9, 2016:] See also the follow-up to this post, about exploratory visualization and/or hypothesis testing.
|