In a conversation about my recent post on sentence length in Georges Simenon’s work, Fotis Jannidis said he thought the post was typical of quite a lot of recent work in digital literary studies in that it is exploratory rather than focused on hypothesis testing. I think this is true, and that it is a problem. Exploratory methods, especially when they rely on visualizations, are great for getting an idea of what is going on in a text collection and can be very helpful for discovering potential trends, patterns, correlations or groups in the data, as well as for discovering problems with data and metadata. But because interpreting visualizations involves a lot of guesswork, they are an invitation to overinterpretation and are probably less helpful for getting actual, robust results; “belastbar”, as we say in German, which roughly means: providing a solid foundation for arguments you may want to build on them.
In order to go beyond exploratory methods, we need to formulate existing or new hypotheses in such a way that they can be tested using the data at hand, and then go ahead and use statistical tests to decide whether those trends, patterns, correlations and groups we believe we see in the data are really there. Or, conversely, whether some absence of a trend, pattern, correlation or group is really an absence. So, the aim of this follow-up post is to go back to the Simenon post, find some potentially testable hypotheses, decide which statistical test is suitable to verify them, and then see what kinds of results we can get in this way.
Hypotheses
The following are some possibly useful hypotheses, formulated based on Richeaudeau’s work and the graphs in the original post. They are really quite straightforward:
- (h1) Simenon’s novels have shorter sentences than those written by contemporary authors.
- (h2) Simenon’s late autobiographical writings have longer sentences than his novels.
- (h3) Simenon’s earlier autobiographical texts do not have longer sentences than his novels.
- (h4) Simenon’s Maigret novels and his “romans durs” do not show a difference in sentence length.
These hypotheses reflect statements by Richeaudeau in the article which prompted me to do these sentence-length analyses in the first place, or are obvious candidates for testing, based on the plots in the first Simenon post.
Data
So, the next step is to decide which data to use for these four hypotheses. There are 127 texts by Simenon, 101 of which are novels, either Maigret novels or so-called romans durs, written between 1930 and 1972. In addition, there are 24 autobiographical writings written after 1974, plus two earlier autobiographical texts. Finally, there are 500 novels by authors other than Simenon, of which 165 were written between 1930 and 1980, the period of Simenon’s production considered here. Note that we are running into the first problem here: since the first Simenon post, I have slightly expanded my dataset to include 20 more Maigret novels as well as 20 more twentieth-century novels. So the dataset is not exactly comparable to the first run. Therefore, I include plots and statistical test results for each hypothesis below.
And what sentence-length data should be used? In the simplest case, we would only consider the mean sentence length per text, plus metadata on date of publication and type of text. In this case, and for each hypothesis, we have two distributions of values belonging to two groups of texts. However, it would also be possible to take into account information about the spread, the inter-quartile range or the standard deviation of the distribution for each novel, or even to use the complete sentence-length distribution per novel, which the sentence-length script extracts in order to calculate not only the mean but also the standard deviation. In this case, we would have a much larger number of data points to compare, which may make the tests of significance more sensitive. I will only use the first type of data here, but it would of course be interesting to use both kinds of data and compare the results.
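As a quick illustration, here is a minimal sketch (not the actual sentence-length script) of how such per-text statistics could be derived, assuming the sentences of a text have already been segmented, for example one sentence per line:

```python
# Minimal sketch: per-text sentence-length statistics, assuming the text has
# already been segmented into sentences.
import numpy as np

def sentence_length_stats(sentences):
    """Return the mean, standard deviation and full distribution of
    sentence lengths (in words) for one text."""
    lengths = np.array([len(sentence.split()) for sentence in sentences])
    return {"mean": lengths.mean(), "std": lengths.std(), "distribution": lengths}

# Example with two placeholder sentences:
print(sentence_length_stats(["Maigret alluma sa pipe.", "Il pleuvait sur le quai."]))
```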
Statistical test
Also, we need to decide which type of significance test is appropriate for testing the hypotheses. I’m not sure whether there is a good reason to assume that the distribution of mean sentence length should be normal, so we are on the safe side with a non-parametric test. Also, we are comparing two independent samples of different size, because we are looking at the mean sentence length in the Maigret novels vs. the romans durs, or in Simenon’s novels vs. the novels by contemporaries. Under these circumstances, the Mann-Whitney U-test appears to be appropriate.
Finally, we need to decide which variant of the test to use. For h1 and h2, the hypothesis includes a statement of the direction of the expected difference. For h3 and h4, the hypothesis only states that the two distributions are not different, and a significant difference in either direction would falsify the hypotheses. This will need to be reflected in the choice of a one-tailed or two-tailed test and in the nature of the null-hypothesis. Of course, the scipy implementation of the Mann-Whitney U-test for Python lets us choose the appropriate variant.
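For illustration, here is a minimal sketch of how the two variants can be called with scipy (the value lists are placeholders, not the actual data):

```python
# Minimal sketch of the one-tailed and two-tailed Mann-Whitney U-test in scipy.
from scipy.stats import mannwhitneyu

# Placeholder values standing in for per-text mean sentence lengths.
simenon = [11.2, 12.4, 10.9, 13.0, 11.8, 12.6]
contemporains = [13.5, 12.9, 14.8, 16.2, 12.2, 13.9]

# One-tailed variant for the directed hypotheses (h1, h2): Simenon's values
# are expected to be smaller than those of the comparison group.
stat, p_one_tailed = mannwhitneyu(simenon, contemporains, alternative="less")

# Two-tailed variant for h3 and h4: a difference in either direction counts.
stat, p_two_tailed = mannwhitneyu(simenon, contemporains, alternative="two-sided")

print(p_one_tailed, p_two_tailed)
```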
Results
So, let’s see what we can do! For the tests described here, I have rewritten the visualization script somewhat, integrating a slightly more flexible data selection procedure as well as a test for significant differences between any two subsets of the data selected for visualization. In this way, every visualization is accompanied by the results of the statistical test. The script also saves the mean value of each subset as well as the ratio between the two means, so that it becomes clear (not just visible from the plot) which direction the difference takes, if there is one, and how strong it is in itself (independently of the test statistic and p-value).
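The following is a rough reconstruction (not the actual script, and with my own function and column names) of what such a comparison step might look like:

```python
# Sketch of a comparison step: Mann-Whitney U-test plus the two means and
# their ratio for any two subsets of per-text mean sentence lengths.
import numpy as np
from scipy.stats import mannwhitneyu

def compare_subsets(group1, group2, alternative="two-sided"):
    statistic, p_value = mannwhitneyu(group1, group2, alternative=alternative)
    mean1, mean2 = np.mean(group1), np.mean(group2)
    return {"mw-statistic": statistic, "p-value": p_value,
            "mean1": round(mean1, 1), "mean2": round(mean2, 1),
            "ratio-1/2": round(mean1 / mean2, 2)}

# Usage with placeholder values:
print(compare_subsets([11.2, 12.4, 10.9, 13.0], [13.5, 12.9, 14.8, 16.2],
                      alternative="less"))
```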
(h1) “Simenon’s novels have shorter sentences than those written by contemporary authors”. For this, I filter the data to include only novels written between 1930 and 1972, so that they can clearly be considered contemporary with Simenon’s novels. The remaining novels are then split into those written by Simenon and those written by others. We can look at the plot first:
As in the previous post, this doesn’t look like a significant difference, even though some contemporary novels published after 1950 have a much higher mean sentence length than most other novels. However, let’s not jump to conclusions and have a good look at the statistics first:
| feature | category | group1 | group2 | mw-statistic | p-value | mean1 | mean2 | ratio-1/2 |
|---------|----------|--------|--------|--------------|---------|-------|-------|-----------|
| SLMean | subcorpus | simenon | contemporains | 9164 | 0.13 | 12.1 | 13.7 | 0.88 |
What we see here is that yes, there is a difference in the mean sentence length: Simenon’s sentences have a mean length of 12.1 words, while his contemporaries use sentences with a mean of 13.7 words. That is, Simenon’s sentences have, on average, only 88% of the mean length of his contemporaries’ sentences. However, given the available data, and when using a one-sided test (parameter alternative="less"), we also see from the p-value of the test that there is a 13 percent probability that such a difference could arise by chance, which is a lot more than the 5% or 1% we usually accept for a significant result. So, hypothesis rejected.
Or is it? Comparing Simenon to his exact contemporaries seems fair, but what if we expand the scope of the contemporary novels a bit and compare Simenon’s sentence length with that of novels published, say, between 1910 and 1992, that is, with a 20-year margin on either side of Simenon’s own production? We now compare Simenon’s novels to a much larger number of novels, 338 to be precise. Let’s see the plot first:
From the plot, it appears quite clearly that Simenon’s sentence length is on the same level as that of most of the other authors, within a range of approximately 8 to 16 words per sentence that the majority of the contemporaries also use. However, some contemporaries use shorter sentences, and quite a few, especially around 1920 and after 1955, also use sentences quite a bit longer than this range.
In fact, in the spirit of the hard-and-fast number game, I suppose the thing to do is to calculate the inter-quartile range for the contemporaries and see what proportion of Simenon’s novels lies in that range. Or, in fact, to look at various percentiles of both groups and compare them. This is the data:
| group | IQR | 0-percentile | 5-percentile | 25-percentile | 75-percentile | 95-percentile | 100-percentile |
|-------|-----|--------------|--------------|---------------|---------------|---------------|----------------|
| simenon | 2.5 | 9.0 | 9.6 | 10.7 | 13.3 | 15.3 | 18.8 |
| contemporains | 4.7 | 5.0 | 8.6 | 10.6 | 15.3 | 32.9 | 86.3 |
What can be seen here is that the 50% of the contemporary novels closest to the median (i.e. those between the 25th and 75th percentiles, covered by the inter-quartile range) lie in a range between 10.6 and 15.3 words per sentence. This can therefore be considered the range of typical mean sentence lengths. The same range goes from 10.7 to 13.3 for the Simenon novels, so his typical range lies inside the contemporaries’ typical range. Also, we can see that 90% of Simenon’s novels have a mean sentence length between 9.6 and 15.3 words, meaning that Simenon’s novels only go beyond the “typical” mean sentence length in exceptional cases (5%), but they do routinely go below that typical range. There are so many numbers to consider here that I think a graph would be nice! (For example, a boxplot with a lot of percentile indicators; or a histogram!)
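For reference, the percentile profile above takes only a few lines to compute; here is a minimal sketch (with placeholder data, not the actual corpus values):

```python
# Minimal sketch: inter-quartile range and selected percentiles of the
# per-text mean sentence lengths of one group.
import numpy as np

def percentile_profile(values):
    p0, p5, p25, p75, p95, p100 = np.percentile(values, [0, 5, 25, 75, 95, 100])
    return {"IQR": p75 - p25, "0%": p0, "5%": p5, "25%": p25,
            "75%": p75, "95%": p95, "100%": p100}

# Placeholder data, not the actual corpus values:
print(percentile_profile([9.0, 10.2, 11.5, 12.1, 12.8, 13.3, 15.3, 18.8]))
```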
(h2) “Simenon’s late autobiographical writings have longer sentences than his novels.” This one should be a no-brainer. We split the Simenon data into the novels, written before 1973, and the autobiographical writings, written after 1973. Here is the plot: This really looks like a completely clear case, and if the statistics don’t support this, there is a problem with the statistics. And indeed, all is well. Have a look:
| feature | category | group1 | group2 | mw-statistic | p-value | mean1 | mean2 | ratio-1/2 |
|---------|----------|--------|--------|--------------|---------|-------|-------|-----------|
| SLMean | genre | novel | essay | 129 | 6.41E-12 | 12.1 | 18.2 | 0.66 |
The difference in means is drastic, with the novels having only 66% of the mean sentence length of the autobiographical writings. And the p-value of the Mann-Whitney test is really, really tiny, which means such a huge difference would be extremely unlikely to arise by chance. Again, the test was performed with the parameter alternative="less", because we are testing a hypothesis about a specific direction of the difference, not just a difference either way. In any case, the result is reassuring, because it means the statistical test works as expected. Hypothesis not rejected (and therefore, most likely confirmed).
(h3) “Simenon’s earlier autobiographical texts do not have longer sentences than his novels.” This one is tricky, not only because we can expect the difference in sentence length between these two sets to be subtle, but most of all because there are only 2 items for the autobiographical texts written before 1970. So, sure we can plot this and calculate the statistics, but the latter will not be very meaningful. Here is the plot:
Clearly, the two early autobiographical texts have a higher mean sentence length than the novels do on average, but at the same time they are nothing special. Let’s look at the statistics anyway:
| feature | category | group1 | group2 | mw-statistic | p-value | mean1 | mean2 | ratio-1/2 |
|---------|----------|--------|--------|--------------|---------|-------|-------|-----------|
| SLMean | genre | novel | essay | 42 | 0.165 | 12.1 | 13.8 | 0.88 |
This is actually reassuring, although I wouldn’t bet anything on those numbers. While the difference in means is noticeable (in fact, on the same level as in some of the other comparisons), the small number of samples in the group of autobiographical texts means that the test is not very confident that this difference is significant, with a 16.5% probability that it arises by chance. It is interesting, actually, how similar ratios of means can correspond to very different test statistics, depending on the number of samples in each group and on the distribution of the values. Hypothesis probably rejected. In any case, let’s move on to more solid ground.
(h4) “Simenon’s Maigret novels and his romans durs do not show a difference in sentence length.” This time, only novels by Simenon are included, and they are then split into two groups according to whether they are Maigret novels or romans durs. This is the plot:
From this plot, I would guess that the Maigret novels do have a somewhat lower mean sentence length, overall, than the romans durs. I would even say this is more clearly visible here in this plot with appropriate axes than in the plot in the original post. But, what do the statistics say?
| feature | category | group1 | group2 | mw-statistic | p-value | mean1 | mean2 | ratio-1/2 |
|---------|----------|--------|--------|--------------|---------|-------|-------|-----------|
| SLMean | simenon-types | maigret | romans | 750 | 0.0007 | 11.4 | 12.7 | 0.90 |
Interestingly, the difference in means is quite similar to what it was for h1: with a mean of 11.4 words for the Maigret novels vs. a mean of 12.7 words for the romans durs, the Maigret novels have just 90% of the mean sentence length of the romans durs. However, because the data are distributed differently, the probability that such a difference could arise by chance is very slim, just 0.07%, so this difference is truly statistically significant. So, hypothesis rejected! The Maigret novels and the romans durs really are different in sentence length, something I did not expect, certainly not with this level of significance.
Fotis suggested an interesting two-part hypothesis to investigate the relation between the Maigret novels and romans durs more closely: while there may not be a difference between the two subgenres before 1950, a difference then does appear after 1950. If this were true, we could assume that Simenon started to differentiate the two main novelistic genres he practised only around 1950, for reasons it may then be interesting to investigate. Without going into the details, suffice it to say that if we test the difference in sentence length for each 12-year window between 1930 and 1972, in order to keep the time period and number of texts roughly equal, we get the following results:
| period | 1930-1942 | 1940-1952 | 1950-1962 | 1960-1972 |
|--------|-----------|-----------|-----------|-----------|
| p-value | 0.130 | 0.007 | 0.0005 | 0.197 |
This means there is no significant difference in mean sentence length between the Maigret novels and the romans durs in the early and late periods, but there is one in the two middle periods. Somehow, it seems, Simenon started to differentiate the two genres after practising them for some time but then gave up on this differentiation again in his last years of activity as a novelist. Curious! I would love to repeat this with an even larger part of his novels than I have right now, just to be sure.
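For what it’s worth, here is a minimal sketch (my own reconstruction, with hypothetical field names such as "year", "type" and "sl_mean") of how such a sliding-window comparison could be set up:

```python
# Sketch: two-tailed Mann-Whitney tests for Maigret novels vs. romans durs
# in overlapping 12-year windows. "texts" is assumed to be a list of dicts
# with the hypothetical keys "year", "type" and "sl_mean".
from scipy.stats import mannwhitneyu

def windowed_tests(texts, start=1930, end=1972, width=12, step=10):
    results = {}
    for window_start in range(start, end - width + 1, step):
        window_end = window_start + width
        subset = [t for t in texts if window_start <= t["year"] <= window_end]
        maigret = [t["sl_mean"] for t in subset if t["type"] == "maigret"]
        durs = [t["sl_mean"] for t in subset if t["type"] == "roman-dur"]
        if maigret and durs:  # skip windows where one group is empty
            _, p = mannwhitneyu(maigret, durs, alternative="two-sided")
            results[f"{window_start}-{window_end}"] = p
    return results
```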
Conclusions
This has been an exciting experiment. Not only because it was an occasion for me to vastly improve my code (there is still plenty of room for improvement, however). And not only because I now have a much more precise notion of where Simenon stands, in comparison to contemporary novelists and in terms of internal generic variation, with regard to his use of sentence length. And it has been exciting despite the fact that sentence length is only one of a half-dozen indicators I have extracted for this dataset.
Mostly, this has been exciting because it made me realize that, rather than replacing exploratory visualizations with pure hypothesis testing, the way forward may be to connect visualizations and statistical indicators more closely. In fact, I would plead for their close and systematic combination in the digital humanities and elsewhere. For example, we should routinely use boxplots to show distributions, not just with the median and the 25th and 75th percentiles, but with the 5th and 95th percentiles as well. And not just as lines, but with the actual precise numbers shown as well. This may mean those plots need to become more interactive, with additional information shown through clicks and mouse-overs. The same is true for scatterplots: we should routinely add a line for the mean of each data series with its value displayed, a trend line with its parameters shown, and the results of a statistical test for differences between any two data series. Ultimately, it may mean combining such an amplified scatterplot with those enriched boxplots, all in one appealing and information-rich visualization-with-statistics. I’m already thinking about another follow-up post with examples of this …
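As a small taste of what such an enriched plot might look like, here is a sketch (with placeholder data, not the corpus) of a boxplot whose whiskers mark the 5th and 95th percentiles and which prints the precise percentile values next to each box:

```python
# Sketch: boxplot with whiskers at the 5th and 95th percentiles and the
# precise percentile values printed next to each box.
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for per-text mean sentence lengths.
rng = np.random.default_rng(0)
groups = {"simenon": rng.normal(12, 2, 80), "contemporains": rng.normal(14, 4, 160)}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()), whis=(5, 95), labels=list(groups.keys()))
for i, values in enumerate(groups.values(), start=1):
    for q in (5, 25, 50, 75, 95):
        val = np.percentile(values, q)
        ax.annotate(f"{val:.1f}", xy=(i + 0.2, val), fontsize=7)
ax.set_ylabel("mean sentence length (words)")
plt.show()
```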
None of this is rocket science, so what’s stopping us?