"Boucher" . "Python" . "Moli\u00E8re" . "Everyone knows that verse and prose are different. Well, there are two exeptions: the first is Monsieur Jourdain in Moli\u00E8re’s play\u00A0Le bourgois gentilhomme, who only learns about that difference from his “ma\u00EEtre de philosophie”. The other exception is the computer, for which both prose and verse are simply and equally strings of characters. Now, if you have nice TEI-encoded texts,..." . "Tools" . "https://dragonfly.hypotheses.org/672" . "verse" . . "Jean-Nicholas Bouilly" . "length of lines" . "My research" . "text classification" . "Christof Sch\u00F6ch" . "
Illustration from “Le bourgeois gentilhomme”. Source: http://en.wikipedia.org/wiki/Le_Bourgeois_gentilhomme (public domain).

\n

Everyone knows that verse and prose are different. Well, there are two exceptions: the first is Monsieur Jourdain in Moli\u00E8re’s play\u00A0Le bourgeois gentilhomme, who only learns about that difference from his “ma\u00EEtre de philosophie”. The other exception is the computer, for which both prose and verse are simply and equally strings of characters.

\n

Now, if you have nice TEI-encoded texts, the difference between verse and prose will be marked up, with prose paragraphs being marked with “p” for paragraph and verse being marked with “l” for line. Also, you would be able to assume that most text in elements such as “stage” (for stage directions) or “note” (for editorial or authorial notes) will also be prose. In this case, the computer will be able to tell prose and verse apart, but only because someone made that distinction explicit beforehand.

\n

But what are we to do when we have a lot of plain text files and, far from knowing which sections are prose and which are verse, we don’t even know which files contain mostly verse and which contain mostly prose? One way, of course, is to simply look at the content of these files, but that gets a bit tedious after a few dozen files. So I would like to report on an attempt to solve this issue with the simplest means possible, without markup, without looking at the files, and without complicated heuristics. This is basically just a special kind of text classification task, and finding the right indicator for the target categories is key.

\n

My approach relies quite simply on measuring the length of lines in plays. Even very simple plain text files usually separate each line or each paragraph with a newline character. Of course, they also usually separate each speaker name in this way, thus creating a line for each speaker name, possibly with some stage directions added to it. Basically, I assume that the texts look something like this; the following is, obviously, a verse example (taken from Boucher, Champagne le coiffeur, 1662):

\n

CL\u00C9ANDRE.
\nGuillot, l’agr\u00E9able nouvelle
\nQue m’apportes-tu de ma belle !
\nQuelle adresse trouverons-nous ?
\nGUILLOT.
\nJ’en ai mille; vous moquez-vous ?
\nCL\u00C9ANDRE.
\nGuillot, il n’en faut trouver qu’une,
\nEt je r\u00E9ponds de ta fortune.

\n

Another almost random piece from a prose play (Jean-Nicholas Bouilly’s Descartes, 1763) looks as follows:

\n

DESCARTES, seul, assis devant son bureau.
\nQue l’\u00E9tude de la nature offre un champ vaste et imposant !…
\nComme elle \u00E9l\u00E8ve, comme elle agrandit l’\u00E2me.
\nDe quel feu je me sens embras\u00E9 !
\n\u00D4 toi, que je trouve partout, et ne puis d\u00E9finir, cause premi\u00E8re de tout ce qui existe, \u00CAtre \u00E9ternel, puisque tu m’as cr\u00E9\u00E9, je ne mourrai point sans avoir m\u00E9dit\u00E9 sur tes ouvrages, sans avoir cherch\u00E9 la v\u00E9rit\u00E9 parmi tous ces chef-d’oeuvres dont tu composas l’univers…
\nSoutiens mon courage, seconde mes projets ; et si tu permets que\u00A0 j’ajoute \u00E0 la perfection des hommes, que je contribue \u00E0 leur bonheur, je te rendrai gr\u00E2ce en mourant de m’avoir donn\u00E9 l’existance.
\nDESCARTES.
\nEntrez.
\nFLORINA.
\nC’est moi, Monsieur Descartes…
\nJe vous apporte votre th\u00E9.
\nDESCARTES.
\nAh !

\n

It is clear that the mean length of lines will not help us much here. But another approach to determine, or more technically speaking, to predict, whether a play is in verse or in prose, given the lengths of its lines, is quite obvious from these two examples: the lines in a verse play tend to be more homogeneous in length, because they are usually written in a given metre (in the French classical age, that would have been twelve-syllable alexandrins, in some rarer cases eight-syllable octosyllabes). By contrast, the length of the prose lines varies widely, from the long phrase in the body of the example above to the very short final “Ah!”. This degree of homogeneity of lengths of lines is nicely captured by standard deviation, so that calculating the standard deviation of the lengths of all lines of speeches in plays should allow us to correctly predict that all plays with a small standard deviation are written in verse and all plays with a large standard deviation are written in prose. And this actually works, as the following boxplot shows.

\n

Standard deviation of length of lines in 236 French verse and prose plays.

\n

In this case, I used data to which I had already added labels manually, in order to test whether the measure just described would be good at distinguishing prose and verse. It clearly is, because the two boxplots do not overlap, and a standard deviation of about 20 nicely separates the two. There isn’t a single outlier in this set of 236 plays from 1620 to 1780 that disobeys the rule, as it were. (By the way, the mean of length of lines really does not separate the plays well. And although the simple length of the plays measured in lines also works well, verse plays being significantly longer than prose plays when their length is measured in number of lines, this does not seem to be a measure that would be very robust. Short verse plays would inevitably create problems.)
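
\n

The measure can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the GitHub repository: the function names are hypothetical, the use of the population standard deviation is an assumption (the post does not say which variant was used), speaker lines are not filtered out, and the cut-off of 20 is simply the value read off the boxplot.

```python
import statistics

def line_lengths(text):
    # Length in characters of every non-empty line of a plain text play.
    return [len(line.strip()) for line in text.splitlines() if line.strip()]

def stdev_of_lengths(text):
    # Population standard deviation of the line lengths (an assumption;
    # the sample standard deviation would serve the same purpose).
    # Raises statistics.StatisticsError if the text has no non-empty line.
    return statistics.pstdev(line_lengths(text))

def predict_verse_or_prose(text, threshold=20.0):
    # Hypothetical rule: a small spread of line lengths suggests verse.
    return 'verse' if stdev_of_lengths(text) < threshold else 'prose'
```

Calling predict_verse_or_prose on the full text of a play then returns one of the two labels; the threshold would need to be re-checked on any other collection.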

\n

There are, however, more complicated cases than just such plays written either all in regular alexandrins or in prose lines of widely varying length. Many verse plays have some lines of prose in them, for instance when someone reads out a letter; and many prose plays have some verse in them, for instance when at the end of a comedy, everyone breaks out into a little song. And, to make matters worse, there are plays that are really an inextricable mix of verse and prose, either short sections of both alternating for most of the play, or the play being divided into several larger sections in verse or prose. Taking such plays into the test collection and using the same “standard deviation of length of lines” measure yields the following boxplot:

\n

Boxplots for standard deviation of length of lines for 300 plays in prose, verse, prose with some verse, verse with some prose, and mixed.

\n

Prose and verse continue to be separated quite nicely, and a little smattering of prose in verse or verse in prose makes a visible difference, especially in the case of verse with some prose, but does not really affect the reliable separation of the two main types (or inversely, is not enough to trigger a differentiating response from this measure). However, the problem is with the mixed plays (all the way to the left in the graph): they seem to be practically indistinguishable from the prose plays, with just a slightly lower median and largely overlapping interquartile ranges. Not separate enough, in any case, to solve this with a simple rule.

\n

What to do? Actually, another only slightly more complicated measure seemed promising. The standard deviation does not capture any information about the sequence of the lines. However, even if mixed plays have a standard deviation in their length of lines that is similar to prose plays, the sequential distribution should be different: prose plays will have short, medium-length and long lines all mingled together, while in mixed plays, the verse section should consist of lines of similar length for a little stretch. Based on this, the mean of the (absolute) differences in length between each line and the next could be an interesting measure. And it is, as the following boxplot shows:

\n

Boxplot showing the mean of absolute differences of adjacent lines in 300 French plays.

\n

Here, not only are the verse plays clearly separated from the rest, but the separation between the prose plays and the mixed plays is also markedly clearer: even though there is still considerable overlap, the medians are now very different and the interquartile ranges hardly overlap anymore.
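
\n

To see why this measure picks up something the standard deviation misses, here is a small sketch. The function name and the two sequences of line lengths are invented for illustration: both sequences contain the same values, so their standard deviation is identical, but in one of them similar lengths come in stretches (as in a mixed play) while in the other short and long lines alternate (as in prose).

```python
import statistics

def mean_abs_diff(lengths):
    # Mean of the absolute differences between each line length and the next.
    diffs = [abs(b - a) for a, b in zip(lengths, lengths[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Two invented sequences of line lengths with identical spread:
blocks = [10, 10, 10, 50, 50, 50]    # similar lengths grouped in stretches
mingled = [10, 50, 10, 50, 10, 50]   # short and long lines alternating

print(statistics.pstdev(blocks), statistics.pstdev(mingled))  # 20.0 20.0
print(mean_abs_diff(blocks), mean_abs_diff(mingled))          # 8.0 40.0
```

The standard deviation cannot tell the two sequences apart, whereas the mean of adjacent differences is five times larger for the alternating one.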

\n

So based on this measure, a prediction as to whether a play is written in prose, in verse or in a mix of both should be possible with a degree of accuracy well above the baseline. In the case of this dataset, the rule I ended up implementing was the following: if the mean of differences is lower than 18, predict verse; if the mean is between 18 and 38, predict mixed; and if the mean is higher than 38, predict prose. Lowering that “38” score would increase the number of true positives for prose plays, but also the number of false negatives for mixed plays, while raising the score would have the opposite effect. This is actually a textbook example of the trade-off between precision and recall, it seems to me.
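
\n

As a rough sketch, the rule and the precision/recall trade-off can be written down as follows. The function and parameter names are hypothetical, and assigning the exact boundary values 18 and 38 to the mixed category is an assumption, since the rule as stated leaves them open.

```python
def predict_form(mean_diff, verse_max=18.0, prose_min=38.0):
    # Three-way rule on the mean of adjacent line-length differences.
    if mean_diff < verse_max:
        return 'verse'
    if mean_diff > prose_min:
        return 'prose'
    return 'mixed'

def precision_recall(gold, predicted, label):
    # Precision and recall for one label, given parallel lists of
    # manually assigned (gold) and predicted labels.
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Re-running precision_recall for the prose label with different values of prose_min on a labelled collection shows directly how moving the cut-off trades one kind of error for the other.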

\n

Be that as it may, let me add a technical note. The counting of the number of characters in each line of a play and the calculation of such scores as mean, median, standard deviation, and even the calculation of the mean of the differences in length of sequential lines, as well as the implementation of such simple rule-based predictions, is what computers generally, and Python in particular, are really good at. As a consequence, you may find the code used for these calculations, as usual, on GitHub along with some sample files for illustration and testing. Making the boxplots, however, is much faster with R, especially because R reads labeled columns and implements conditions for grouping values out of the box, so I used that instead.

\n

Anyway, it all goes to show that since Monsieur Jourdain learns it in the course of Moli\u00E8re’s play, and the computer can also learn it, really everyone now knows the difference between verse and prose. And similarly to Monsieur Jourdain, the computer also knows the difference between them although it remains completely oblivious to matters like metre and rhyme, despite the fact that these are of course essential to verse from the point of view of literary studies and, indeed, readers. Is that good enough? Yes, if what you want to do is label a collection of texts too large for you to open each file and check yourself. No, in many other cases. But we’ll deal with that (and with the inevitable special cases) another time.

\n

***

\n

Addendum, June 8, 2014: In reaction to this post, Michael Hendry raised the question, on Twitter, whether this technique would work for plays mixing verse and prose and whether in French theater, there was as much variation in the proportions of verse and prose in dramatic texts as, for example, in Shakespeare: “Just to clarify: you did consider mixed prose and verse plays, but do any of them vary as much as Shakespeare’s?” (Michael Hendry,\u00A0Twitter, June 6, 2014).

\n

I did not know any numbers on this for French drama but I had of course realized that the mixed plays are a challenge to the simple measure used above. But since the ever excellent collection of plays provided by Th\u00E9\u00E2tre classique is in TEI and has all verse and prose passages as well as stage directions (and much more) meticulously marked up, it was not a big challenge to pull out that information from those 600 plays ranging approximately from 1630 to 1800. Here are the results (based on number of characters per type of text, displayed as stacked columns):

\n

Proportions of verse, prose and stage directions in 600 French plays 1630-1800

\n

As becomes clear from this graph, French drama of the time considered here knows many combinations of prose, verse and stage directions. The largest part, maybe half of the plays, are verse plays with no or very little prose mixed in; another quarter are prose plays with no or very little verse in them. The remaining quarter of plays have varying proportions of verse and prose, and there are even some plays with surprising proportions of stage directions, 75% in one extreme case (the very short play\u00A0Orph\u00E9e ou Arlequin aux enfers (1711) by Fuzelier).

\n

The next question raised by such data is whether there are any interesting chronological trends. The following graph shows the same 600 plays arranged in chronological order:

\n

Proportion of prose, verse and stage directions in 600 French plays, chronologically

\n

From this graph, it becomes apparent that there was a strong domination of verse until around 1690, followed by about a decade in which a lot of plays in prose were written, and then a long period during which both forms coexisted. It should be noted that these trends are not based on representative data, and that plays are not evenly spread out over the entire period, so all of this needs to be taken with a large grain of salt.

\n

To come back to the initial question raised by Michael Hendry: yes, there is a lot of variation in the proportions of verse and prose in French drama of the seventeenth and eighteenth centuries, and to assess this reliably, it is preferable, when possible, not to use the simple heuristics described in this post, but to exploit explicitly marked-up data about prose, verse and stage directions. Not always, however, do you have such nice data. (For cases where such data is available, the Python script called “verspers.py”, which relies on lxml for parsing the XML files, can be found in the GitHub repository.)
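
\n

For readers who want to try this on their own TEI files, here is a minimal sketch of the idea. It is not the verspers.py script: it uses the standard library module xml.etree.ElementTree instead of lxml so that it runs without extra installs, the function names are hypothetical, and it assumes that verse, prose and stage directions are encoded as l, p and stage elements inside a body element, with or without a namespace.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from TEI element names to the three text types.
CATEGORIES = {'l': 'verse', 'p': 'prose', 'stage': 'stage'}

def local_name(elem):
    # Drop a namespace prefix such as {http://www.tei-c.org/ns/1.0}.
    return elem.tag.rsplit('}', 1)[-1] if isinstance(elem.tag, str) else ''

def add_chars(elem, current, totals):
    # Count characters under the nearest enclosing l, p or stage element.
    category = CATEGORIES.get(local_name(elem), current)
    if category and elem.text:
        totals[category] += len(elem.text.strip())
    for child in elem:
        add_chars(child, category, totals)
        if category and child.tail:
            totals[category] += len(child.tail.strip())

def proportions(xml_string):
    # Share of characters per text type, looking only inside body elements
    # so that paragraphs in the TEI header are ignored.
    root = ET.fromstring(xml_string)
    totals = {'verse': 0, 'prose': 0, 'stage': 0}
    for elem in root.iter():
        if local_name(elem) == 'body':
            add_chars(elem, None, totals)
    grand_total = sum(totals.values())
    return {k: (v / grand_total if grand_total else 0.0)
            for k, v in totals.items()}
```

Speaker names and any other text outside the three element types are left out of the totals, so the three proportions always sum to one when the body contains any counted text.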

\n

***

\n

Addendum, June 9, 2014: This is just to report on a brief correlation test I ran to test the initial indicators based on various length-of-lines calculations against the percentages observed from the TEI data. Based on the data about 600 plays, the “Pearson’s product-moment correlation” test, cor.test in R, yields the following correlation scores between the percentage of prose and several measures:

\n
  • percentage of prose vs. mean of length of lines: 0.634
  • percentage of prose vs. standard deviation of length of lines: 0.870
  • percentage of prose vs. mean of difference in length of subsequent lines: 0.918
\n

The p-value is very low in all three cases, indicating very high significance of the correlations, but as expected, the correlation is clearly the strongest for the third measure, something which was apparent from the third boxplot but is quantified here with some more precision.
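
\n

The scores above were computed with cor.test in R. For those who prefer to stay in Python, the coefficient itself (without the p-value) can be sketched as follows; the function name is hypothetical and the code is an illustration of the formula, not what was actually run.

```python
import math

def pearson(xs, ys):
    # Pearson product-moment correlation of two equally long sequences,
    # for example percentage of prose and mean adjacent difference per play.
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length, two values or more')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if one of the sequences is constant.
    return cov / math.sqrt(var_x * var_y)
```

Values close to 1 mean that plays with more prose also tend to score higher on the measure in question.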

\n" . "Monsieur Jourdain" . "prose" . . "2014-06-06T17:16:40Z" . "lxml" . "standard deviation" . "boxplot" . "

\n" . . "mean" . "correlation" . "What\u2019s the simplest way of telling verse and prose apart?" .