Detecting Transpositions when Comparing Text Versions using CollateX
rss:link
https://dragonfly.hypotheses.org/954
rss:description
The aims of this post are to explain what collation is, why detecting transpositions is special, and how to accomplish it using CollateX. The example I will be using involves comparing two versions of the recent best-selling novel The Martian by Andy Weir, a novel whose publishing history Erik Ketzan and I are currently investigating. More specifically, I will illustrate...
content:encoded
<p>The aims of this post are to explain what collation is, why detecting transpositions is special, and how to accomplish it using CollateX. The example I will be using involves comparing two versions of the recent best-selling novel <a href="https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)"><em>The Martian</em> by Andy Weir</a>, a novel whose publishing history <a href="https://twitter.com/erikannotations">Erik Ketzan</a> and I are currently investigating. More specifically, I will illustrate the process using small fragments from that novel. Some sample data and code <a href="https://github.com/cligs/projects/tree/master/2016/martian">can be found on GitHub</a>.</p>
<p><strong>What is collation?</strong></p>
<p>Collation is an activity in the area of scholarly editing and textual studies; the term designates the comparison of two or more differing versions of the same text with the aim of identifying their differences. Whether this is done in order to reconstruct the relations between several copies of a medieval text, in order to better understand an author’s writing process when studying a modern manuscript, or in order to see how a self-published version of a novel differs from its professionally-edited version (as in our <em>The Martian</em> case), collation yields fascinating insights.</p>
<p>In the so-called “Gothenburg” model of collation, there are five steps: Tokenization, Normalization/Regularization, Alignment, Analysis/Feedback and Visualization, neatly summarized in <a href="http://collatex.net/doc/">the documentation to <em>CollateX</em></a>. Each has its importance and its particular challenges. Tokenization involves splitting the text into smaller segments, which are usually words (separated by whitespace characters) but can also be larger segments such as lines or sentences. Normalization and regularization are optional and involve, for instance, making all words lower-case or removing punctuation, procedures which simplify the following steps but rely on the assumption that differences in these areas are immaterial. In the alignment step, the two or more text versions are aligned; that is, identical portions of the text versions are identified and gaps are inserted into each version to make everything line up. In the analysis step, the relation between each gap (in one version) and the corresponding piece of text (in another version) is interpreted as either a deletion or an addition (or, more subtly, a contraction or expansion). A special aspect of analysis is identifying transpositions (more on which below), if any are present in the texts under scrutiny. Finally, the resulting representation of the textual variants can be visualized, either textually or in the form (and following the model) of a variant graph.</p>
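<p>As a rough illustration of the first two steps, here is a minimal Python sketch that tokenizes on whitespace and normalizes by lower-casing and stripping punctuation. The function names are made up for this example, and this is not how CollateX implements these steps internally; the two sentences are the versions of a passage from <em>The Martian</em> discussed further below.</p>

```python
import re

def tokenize(text):
    """Split a text into word tokens on whitespace."""
    return text.split()

def normalize(token):
    """Lower-case a token and strip punctuation from it."""
    return re.sub(r"[^\w\s]", "", token.lower())

version_1 = "So why not bring them all in?"
version_2 = "So I brought them all in. Why not?"

# Tokenize first, then normalize each token.
tokens_1 = [normalize(t) for t in tokenize(version_1)]
tokens_2 = [normalize(t) for t in tokenize(version_2)]
print(tokens_1)
print(tokens_2)
```

<p>After normalization, “Why” and “why” count as the same token, which is exactly the kind of assumption mentioned above: it only makes sense if differences in capitalization and punctuation are immaterial to your research question.</p>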
<p><strong>What tools are out there for it?</strong></p>
<p>Naturally, collation is nowadays done computationally, as it is a tedious and very error-prone process when done manually. There are quite a number of tools that do collation and one may distinguish two fundamental types of such tools. The first type to appear, historically, are so-called “diff tools” which have been developed for the comparison of two versions of programming code but can be used for other materials as well. Examples are <em>diff</em> (<a href="https://en.wikipedia.org/wiki/Hunt%E2%80%93McIlroy_algorithm">whose algorithm was developed by Hunt and McIlroy in the mid-1970s</a>) as well as the command-line frontend for diff, <em><a href="http://www.gnu.org/software/wdiff">wdiff</a></em> or the graphical frontend to <em>diff</em> called <em><a href="http://meldmerge.org/">meld</a></em>. This type of tool works best with short lines and is word-based in scope. The second type of tool is a more recent development, involving algorithms and associated tools which have been specifically developed with the requirements of textual scholars in mind (or which, interestingly, adapt algorithms developed for gene-sequence comparison in genetics to textual materials). Examples are Collate (developed by Peter Robinson) and its successor <em><a href="http://collatex.net">CollateX</a></em> as well as <em><a href="http://www.juxtasoftware.org/">Juxta</a></em>. In our research on <em>The Martian</em>, we are using wdiff and meld as well as Juxta and CollateX, each for different purposes. Finally, there are also tools more specifically serving to visualize textual variation, such as the immensely useful <em><a href="http://v-machine.org/">Versioning Machine</a></em> or the more recent and very impressive <em><a href="http://www.traviz.vizcovery.org/index.html">TRAViz</a></em>.</p>
<p><strong>Why CollateX is special: detection of transpositions</strong></p>
<p>What sets CollateX apart from the rest of the pack are probably two things: first, the tool ships with not just one but several different collation algorithms as well as with a number of output formats, so that scholars can decide which collation type and data model best suits their project and workflow requirements; and second, CollateX is able to recognize transpositions, that is, instances where a word or sequence of words remains the same between two versions but is moved to a new location in the text. To virtually all other tools, such a transposition just looks like a deletion in one place and an (unrelated) addition in another place. What is also nice about CollateX is that it comes in two versions, a Java-based core version with a command-line interface and a Python-based derived version which is particularly easy to integrate into a longer Python-based workflow. What CollateX does not provide is the user-friendly graphical user interface of Juxta, which even exists in a web-based version that can be freely used without even needing any installation, or the fancy interactive visualizations of TRAViz.</p>
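<p>To see what “a deletion in one place and an (unrelated) addition in another place” looks like in practice, the following sketch runs a generic sequence comparison from Python's standard library (<code>difflib</code>) on two already normalized token lists. This is only a stand-in for diff-style tools in general, not the algorithm of any particular tool named above, and the helper name <code>differences</code> is invented for this example.</p>

```python
from difflib import SequenceMatcher

def differences(a, b):
    """List every non-matching stretch between two token lists."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

m1 = "so why not bring them all in".split()
m2 = "so i brought them all in why not".split()

for tag, old, new in differences(m1, m2):
    print(tag, old, new)
```

<p>The comparison reports that “why not” disappears (as part of a replacement) near the start and that “why not” is inserted at the end, but nothing links the two events: the transposition as such remains invisible.</p>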
<p><strong>Requirements for the detection of transpositions</strong></p>
<p>What follows is a brief walk-through of collation with CollateX, specifically focused on the case where you would like to detect transpositions. It is essential to be aware of a few specific requirements of CollateX when wanting to conduct such an analysis. First, longer texts (such as entire novels) should be split into smaller sections (for example chapters), for performance reasons. Second, while CollateX accepts various input formats, the detection of transpositions requires the texts to be fed to CollateX in a specific, though very simple, <a href="https://en.wikipedia.org/wiki/JSON">JSON</a> format. Third, among the (currently) four collation algorithms provided by CollateX, only the “dekker” algorithm (developed by Ronald Dekker) is able to detect transpositions. And finally, among the several output formats, only the <a href="https://en.wikipedia.org/wiki/DOT_(graph_description_language)">dot/graphviz</a> as well as the <a href="https://en.wikipedia.org/wiki/GraphML">xml/graphml</a> formats have the capacity to also represent the transpositions.</p>
<p><strong>Preparing your texts: creating JSON<br />
</strong></p>
<p>This walk-through assumes your texts have already been prepared in the form of simple plain-text files, one for each chapter, each version in a separate file. Now, the first step is to produce the right kind of JSON for CollateX. If there are more than two sections to the text you would like to collate (and probably even if there is only one), it makes sense to generate this JSON automatically using a very simple script (see “make_json.py” on GitHub). There are a few simple details which need to be taken care of. First of all, because JSON uses quotation marks internally, we need to escape or transform single and double quotation marks, including apostrophes, appearing in our text. I have chosen to transform them to some very explicit pseudo-html because that will make it easy to transform them back if necessary. Also, CollateX works best when line breaks are removed, so these are transformed to some pseudo-html as well. In the script, this is simply done with a few regular expressions. The next step is to define the basic JSON structure and insert the two versions of each chapter in the appropriate place. This is accomplished by the following line, where everything but “Text1” and “Text2” corresponds to the required JSON structure:</p>
<pre>Json = "{ \"witnesses\" : [ { \"id\" : \"M1\", \"content\" : \"" + Text1 + "\" }, { \"id\" : \"M2\", \"content\" : \"" + Text2 + "\" } ] }"</pre>
<p>The rest of the script is simply about reading the files associated with each chapter and saving the resulting JSON files.</p>
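<p>As an alternative to concatenating strings by hand, the same structure can be produced with Python's built-in <code>json</code> module, which takes care of escaping quotation marks. The following is only a sketch of that alternative and not the code in “make_json.py”; the function name and the two example strings are invented for illustration.</p>

```python
import json

def make_collatex_json(text1, text2):
    """Build the two-witness JSON structure shown above as a string."""
    data = {
        "witnesses": [
            {"id": "M1", "content": text1},
            {"id": "M2", "content": text2},
        ]
    }
    # json.dumps escapes double quotes and line breaks inside the texts.
    return json.dumps(data, ensure_ascii=False)

example = make_collatex_json('He said "why not".', "It's fine.\nReally.")
print(example)
```

<p>Note that line breaks survive here as escaped characters inside the content, so if you want them removed or replaced by pseudo-html as described above, that still has to happen before calling the function.</p>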
<p><strong>Calling CollateX from the command line</strong></p>
<p>Assuming CollateX is already installed (see the documentation; the following assumes that the operating system is Linux/Unix-based), the next step is to call CollateX from the command line with the appropriate parameters. The basic command which is necessary for this is very simple. In the Terminal, you simply need to move to the folder in which CollateX is installed and there, run the following command:</p>
<pre>java -jar collatex-tools-1.7.1.jar --algorithm dekker --format dot --output json-gv/chapter1.gv json-gv/chapter1.json</pre>
<p>This basically tells the computer to use Java to run the CollateX “jar” file, using the “dekker” algorithm, producing output in the “dot/graphviz” format, writing the output to a file called “chapter1.gv” and doing all of this with the file called “chapter1.json”, which is in the folder “json-gv”. Again, of course, as soon as you have more than a couple of sections or chapters, it makes sense to have them all dealt with automatically, using a little loop in bash. That bash/shell script called “cx-dot-loop.sh” (also included on GitHub) looks as follows:</p>
<pre>#!/bin/bash
# Run CollateX on every JSON file in json-gv; the output name is the input name plus ".gv".
for file in json-gv/*.json
do
    java -jar collatex-tools-1.7.1.jar --algorithm dekker --format dot --output "$file.gv" "$file"
done</pre>
<p>Basically, all this script does is repeat exactly what I described above for each file ending in “.json” in the folder json-gv. So instead of repeating the command as many times as there are chapters, you simply call “bash cx-dot-loop.sh” and wait a little bit. As an alternative to the “dot” format chosen here, you could also select “graphml” as the output format (and adapt the file extension accordingly), which essentially contains the same information but in an XML format.</p>
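<p>If the rest of your workflow is in Python anyway, the same loop can be sketched there as well. The following assumes the jar file name and the folder used above and that Java is available; the function names are invented for this example. Unlike the shell loop, which appends “.gv” to the full file name, this version replaces the “.json” extension.</p>

```python
import subprocess
from pathlib import Path

JAR = "collatex-tools-1.7.1.jar"  # same jar file as in the command above

def build_command(json_file, output_format="dot", extension=".gv"):
    """Return the CollateX call for one JSON file as an argument list."""
    output_file = Path(json_file).with_suffix(extension)
    return ["java", "-jar", JAR, "--algorithm", "dekker",
            "--format", output_format,
            "--output", output_file.as_posix(), Path(json_file).as_posix()]

def collate_folder(folder="json-gv"):
    """Run CollateX once for every .json file in the folder."""
    for json_file in sorted(Path(folder).glob("*.json")):
        subprocess.run(build_command(json_file), check=True)

print(build_command("json-gv/chapter1.json"))
```

<p>Calling <code>collate_folder()</code> from the folder in which CollateX is installed then processes all chapters; for the graphml output mentioned above, pass <code>output_format="graphml"</code> together with an extension of your choice.</p>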
<p><strong>Inspecting the output files: where are the transpositions?</strong></p>
<p>The nice thing about the dot/graphviz format is that it can both be inspected as a text file and visualized as a graph, as this plain text format is rendered as a graph by the XDot viewer. In the dot/graphviz format, each continuous textual fragment is represented as a node with an identifier; both are listed together at the beginning of the file. This looks as follows:</p>
<pre>v84 [label = "Ares 4 will be landing at the Schiaparelli crater, which is about "];
v85 [label = "3,200 km "];
v86 [label = "3200 kilometers "];
v87 [label = "away from my location here in "];
v88 [label = "the "];
v89 [label = "Acidalia Planitia.\n"];</pre>
<p>Then, information about the sequence which these fragments follow in each of the versions is represented as links between nodes, with a label identifying the relevant version. For the nodes shown above, the edges look as follows (M1 and M2 identify the two versions of <em>The Martian</em>):</p>
<pre>v84 -> v85 [label = "M1"];
v84 -> v86 [label = "M2"];
v85 -> v87 [label = "M1"];
v86 -> v87 [label = "M2"];
v87 -> v89 [label = "M2"];
v87 -> v88 [label = "M1"];
v88 -> v89 [label = "M1"];</pre>
<p>Finally, there is a special type of link between nodes whose text remains identical but appears in a different position in the two (or more) versions of the text. This special type of link represents, you guessed it, the transpositions. They appear at the end of the file, for instance like so:</p>
<pre>v258 -> v266 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];
v273 -> v275 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];</pre>
<p>Now, this is human-readable only in a very limited sense, and is probably not meant to be. The biggest problem for humans is, of course, to bring the information about the text represented by each node together with the information about the link between each node. This is no less true for the transpositions than for any other link, so we can quickly see that there are two transpositions here, but not what they are. The graphml format essentially works the same, except that it is XML and therefore a bit more verbose. That’s why there is the graphical version, so let’s have a look at that!</p>
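<p>If you would rather stay with the text file for a moment, the two kinds of information can also be brought together with a few lines of Python. The sketch below assumes that the dot file looks like the samples shown above (one node or edge per line, transposition edges marked by the dotted style); the function name is invented, and the two node labels in the sample are made up for illustration, since the excerpt above does not show what v258 and v266 contain.</p>

```python
import re

# A node line: identifier followed directly by a label, as in the samples above.
NODE = re.compile(r'^\s*(v\d+)\s*\[\s*label\s*=\s*"(.*)"\s*\];')
# A transposition line: an edge between two nodes drawn in the dotted style.
TRANSPOSITION = re.compile(r'^\s*(v\d+)\s*->\s*(v\d+)\s*\[.*style\s*=\s*"dotted"')

def list_transpositions(dot_text):
    """Return the text of both nodes for every dotted (transposition) edge."""
    labels, pairs = {}, []
    for line in dot_text.splitlines():
        node = NODE.match(line)
        if node:
            labels[node.group(1)] = node.group(2)
            continue
        edge = TRANSPOSITION.match(line)
        if edge:
            pairs.append(edge.groups())
    # Fall back to the bare identifier if a node label was not found.
    return [(labels.get(a, a), labels.get(b, b)) for a, b in pairs]

sample = '''v258 [label = "why not "];
v266 [label = "Why not"];
v84 -> v85 [label = "M1"];
v258 -> v266 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];'''

print(list_transpositions(sample))
```

<p>This gives a quick list of what has been transposed, but not of the surrounding context, which is where the graph comes in.</p>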
<p><strong>Change the output parameters and look at your graphs!</strong></p>
<p>In order to view dot/graphviz files, you need an appropriate viewer. There are surprisingly few options for this, but the <em>XDot</em> viewer for Linux does the trick, if not very gracefully. Now another issue appears: it is surprisingly difficult to spot the transpositions in those graphs, which are very stretched out vertically. The following is a very small section of such a graph (the passage in Figure 1 corresponds roughly to the dot-file sample above):</p>
<div id="attachment_992" style="width: 510px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter2_example" rel="attachment wp-att-992"><img aria-describedby="caption-attachment-992" loading="lazy" class="wp-image-992 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter2_example-500x286.png" alt="chapter2_example" width="500" height="286" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example-500x286.png 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example-300x171.png 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example-768x439.png 768w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example.png 1389w" sizes="(max-width: 500px) 100vw, 500px" /></a><p id="caption-attachment-992" class="wp-caption-text">Figure 1: Small section of a variant graph based on CollateX dot/graphviz output.</p></div>
<p>This very neatly visualizes the textual flow of each version and the differences between them. It is read from top to bottom as follows: Just start reading at the top node and then follow the arrow marked with either one of the versions (M1 or M2) to read that version; or look at the alternative paths taken by each version to compare the two versions at any point in the text. And, for a slightly more interesting example, a small section from the beginning of chapter 2 of <em>The Martian</em>:</p>
<div id="attachment_960" style="width: 425px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter2_start" rel="attachment wp-att-960"><img aria-describedby="caption-attachment-960" loading="lazy" class="wp-image-960 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter2_start-415x500.png" alt="chapter2_start" width="415" height="500" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start-415x500.png 415w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start-249x300.png 249w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start-768x925.png 768w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start.png 831w" sizes="(max-width: 415px) 100vw, 415px" /></a><p id="caption-attachment-960" class="wp-caption-text">Figure 2: The beginning of chapter 2 of The Martian, visualized as a variant graph using CollateX and the dot/graphviz output.</p></div>
<p>This is already a somewhat cleaned-up version: the line breaks have been reintroduced so that the nodes are not stretched out too much horizontally, something which dramatically increases readability.</p>
<p>For a book chapter of any reasonable length, the resulting graph is very long and there is no way you can search for a transposition in it, as it is fundamentally just marked by a light-grey dotted line! In the output, I changed the properties of the transposition edge to be a blue, bold line to make spotting the transpositions a little bit easier. (And I’d recommend to the developers of CollateX to make something like this the default behaviour, even if transpositions could be considered secondary information; your users will love you for it!) So finally, here we have a transposition, even if it is a very simple one!</p>
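<p>One possible way to make this change automatically is a simple text replacement on the dot file before opening it in the viewer. The sketch below is an assumption about how this could be done, not necessarily how it was done for the figures here; the function name is invented, and the edge line is the one from the sample output above.</p>

```python
import re

def highlight_transpositions(dot_text):
    """Turn light-grey dotted edges into bold blue ones; leave other lines alone."""
    restyled = []
    for line in dot_text.splitlines():
        if "->" in line and 'style = "dotted"' in line:
            # Accept both spellings of the colour name.
            line = re.sub(r'color = "lightgr[ae]y"', 'color = "blue"', line)
            line = line.replace('style = "dotted"', 'style = "bold"')
        restyled.append(line)
    return "\n".join(restyled)

edge = 'v258 -> v266 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];'
print(highlight_transpositions(edge))
```

<p>Reading a “.gv” file, passing its content through this function and writing the result back is then enough to get the highlighted edges in the graph.</p>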
<div id="attachment_961" style="width: 348px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter4_transposition" rel="attachment wp-att-961"><img aria-describedby="caption-attachment-961" loading="lazy" class="wp-image-961 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter4_transposition-338x500.png" alt="chapter4_transposition" width="338" height="500" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter4_transposition-338x500.png 338w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter4_transposition-203x300.png 203w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter4_transposition.png 655w" sizes="(max-width: 338px) 100vw, 338px" /></a><p id="caption-attachment-961" class="wp-caption-text">Figure 3: A simple example of a transposition detected and visualized using CollateX, from chapter 4 of The Martian.</p></div>
<p>It is intuitively visible that the phrase “why not” appears early on in the sentence in the first version of <em>The Martian</em> (M1), which reads: “So why not bring them all in?” And the same little phrase also appears in the same context, but in a different position, in the second version: “So I brought them all in. Why not?” (M2). Interestingly, CollateX is flexible enough here to identify the transposition despite the fact that the two segments differ with regard to capitalization (and the text has not been transformed to lowercase in this analysis). Note also that CollateX identifies single words that have been transposed as well as longer sequences of text. I’m not sure how far back and ahead it checks, but the issues I encountered with performance when trying to collate an entire novel suggest it does look quite far.</p>
<p>This is just a very simple illustration of the principle, but even with just two versions as in the case of <em>The Martian</em>, there can be considerable complexity in the way many apparently small, even unimportant, local changes interact. And the collation graph produced by CollateX brings out this complexity rather nicely without overwhelming the reader, even though I do find it a bit confusing that the two versions switch sides, from left to right or vice versa, occasionally.</p>
<p>Simply for illustration, here is one more example of a transposition, this time from chapter 2 of <em>The Martian</em>, where the transposition concerns a larger segment of text and is part of a more complicated rewriting of the sentences involved.</p>
<div id="attachment_962" style="width: 510px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter2_transposition" rel="attachment wp-att-962"><img aria-describedby="caption-attachment-962" loading="lazy" class="wp-image-962 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter2_transposition-500x500.png" alt="chapter2_transposition" width="500" height="500" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-500x500.png 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-150x150.png 150w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-300x300.png 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-768x769.png 768w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-160x160.png 160w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-320x320.png 320w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition.png 1130w" sizes="(max-width: 500px) 100vw, 500px" /></a><p id="caption-attachment-962" class="wp-caption-text">Figure 4: Variant graph with a transposition, from chapter 2 of The Martian.</p></div>
<p>It would be a pity to miss out on such phenomena when collating several variants of a text, wouldn’t it? Although, come to think of it: There appears to be a second transposition here which the CollateX algorithm does not catch: that of “5 square meters”, which in version 1 appears right after the “dirt gods” section, whereas in version 2, it appears before that section, in the form “five square meters” (note the spelled-out number). Well, that’s a tricky one, I guess, and is a welcome reminder that algorithmic approaches never have 100% recall and precision…</p>
<p>To conclude: I hope this little post has been able not only to show you why it may be interesting to look at transpositions, but also to explain how you may go about detecting them in your own data using CollateX and why this tool is an excellent choice for collation and visualization of textual variants containing transpositions. Don’t miss out on transpositions next time you do a collation! And, of course, don’t miss out on our full paper on <em>The Martian</em> once it comes out.</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>The aims of this post are to explain what collation is, why detecting transpositions is special, and how to accomplish it using CollateX. The example I will be using involves comparing two versions of the recent best-selling novel <a href="https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)"><em>The Martian</em> by Andy Weir</a>, a novel whose publishing history <a href="https://twitter.com/erikannotations">Erik Ketzan</a> and me are currently investigating. More specifically, I will illustrate the process using small fragments from that novel. Some sample data and code <a href="https://github.com/cligs/projects/tree/master/2016/martian">can be found on GitHub</a>.</p>
<p><strong>What is collation?</strong></p>
<p>Collation is an activity in the area of scholarly editing and textual studies; the term designates the comparison of two or more differing versions of the same text with the aim of identifying their differences. Whether this is done in order to reconstruct the relations between several copies of a medieval text, in order to better understand an author’s writing process when studying a modern manuscript, or in order to see how a self-published version of a novel differs from its professionally-edited version (as in our <em>The Martian</em> case), collation yields fascinating insights.</p>
<p>In the so-called “Gothenburg” model of collation, there are five steps: Tokenization, Normalization/Regularization, Alignment, Analysis/Feedback and Visualization, neatly summarized in <a href="http://collatex.net/doc/">the documentation to <em>CollateX</em></a>. Each has its importance and its particular challenges. Tokenization involves splitting the text into smaller segments, which are usually words (separated by whitespace characters) but can also be larger segments such as lines or sentences. Normalization and regularization are optional and involve, for instance, making all words lower-case or removing punctuation, procedures which simplify the following steps but rely on the assumption that differences in these areas are immaterial. In the alignment step, the two or more text versions are aligned, that is identical portions of the text versions are identified and gaps are inserted into each version to make everything line up. In the analysis step, the relation between the aligned text, each gap (in one version) and each piece of corresponding text (in another version) are either considered to be a deletion or an addition (or, more subtly, a contraction or expansion). A special aspect of analysis is identifying transpositions (more on which below), if any are present in the texts under scrutiny. Finally, the resulting representation of the textual variants can be visualized, either textually or in the form (and following the model) of a variant graph.</p>
<p><strong>What tools are out there for it?</strong></p>
<p>Naturally, collation is nowadays done computationally, as it is a tedious and very error-prone process when done manually. There are quite a number of tools that do collation and one may distinguish two fundamental types of such tools. The first type to appear, historically, are so-called “diff tools” which have been developed for the comparison of two versions of programming code but can be used for other materials as well. Examples are <em>diff</em> (<a href="https://en.wikipedia.org/wiki/Hunt%E2%80%93McIlroy_algorithm">whose algorithm was developed by Hunt and McIlroy in the mid-1970s</a>) as well as the command-line fronted for diff, <em><a href="http://www.gnu.org/software/wdiff">wdiff</a></em> or the graphical frontend to <em>diff</em> called <em><a href="http://meldmerge.org/">meld</a></em>. This type of tools works best with short lines and is word-based in scope. The second type of tools is a more recent development, involving algorithms and associated tools which have been specifically developed with the requirements of textual scholars in mind (or which, interestingly, adapt algorithms developed for gene-sequence comparision in genetics to textual materials). Examples are Collate (developed by Peter Robinson) and its successor <em><a href="http://collatex.net">CollateX</a></em> as well as <em><a href="http://www.juxtasoftware.org/">Juxta</a></em>. In our research on <em>The Martian</em>, we are using wdiff and meld as well as Juxta and CollateX, each for different purposes. Finally, there are also tools more specifically serving to visualize textual variation, such as the immensely useful <em><a href="http://v-machine.org/">Versioning Machine</a></em> or the more recent and very impressive <em><a href="http://www.traviz.vizcovery.org/index.html">TRAViz</a></em>.</p>
<p><strong>Why CollateX is special: detection of transpositions</strong></p>
<p>What sets CollateX apart from the rest of the pack are probably two things: first, the tool ships with not just one but several different collation algorithms as well as with a number of output formats, so that scholars can decide which collation type and data model best suits their project and workflow requirements; and second, CollateX is able to recognize transpositions, that is instances where a word or sequence of words remains the same between two versions but is moved to a new location in the text. To virtually all other tools, such a transposition just looks like a deletion in one place and an (unrelated) addition in another place. What is also nice about CollateX is that it comes in two versions, a Java-based core version with a command-line interface and a Python-based derived version which it is particularly easy to integrate into a longer Python-based workflow. What CollateX does not provide is the user-friendly graphical user interface of Juxta, which even exists in a web-based version that can be freely used without even needing any installation, or the fancy interactive visualizations of TRAViz.</p>
<p><strong>Requirements for the detection of transpositions</strong></p>
<p>What follows is a brief walk-through of collation with CollateX, specifically focused on the case where you would like to detect transpositions. It is essential to be aware of a few specific requirements of CollateX when wanting to conduct such an analysis. First, longer texts (such as entire novels) should be split into smaller sections (for example chapters), for performance reasons. Second, while CollateX accepts various input formats, the detection of transpositions requires the texts to be fed to CollateX in a specific, though very simple, <a href="https://en.wikipedia.org/wiki/JSON">JSON</a> format. Third, among the (currently) four collation algorithms provided by CollateX, only the “dekker” algorithm (developed by Ronald Dekker) is able to detect transpositions. And finally, among the several output formats, only the <a href="https://en.wikipedia.org/wiki/DOT_(graph_description_language)">dot/graphviz</a> as well as the <a href="https://en.wikipedia.org/wiki/GraphML">xml/graphml</a> formats have the capacity to also represent the transpositions.</p>
<p><strong>Preparing your texts: creating JSON<br />
</strong></p>
<p>This walk-through assumes your texts have already been prepared in the form of simple plain-text files, one for each chapter, each version in a separate file. Now, the first step is to produce the right kind of JSON for CollateX. If there are more than two sections to the text you would like to collate (and probably even if there is only one), it makes sense to generate this JSON automatically using a very simple script (see “make_json.py” on GitHub). There are a few simple details which need to be taken care of. First of all, because JSON uses quotation marks internally, we need to escape or transform single and double quotation marks, including apostrophes, appearing in our text. I have chosen to transform them to some very explicit pseudo-html because that will make it easy to transform it back if necessary. Also, CollateX works best when line breaks are removed, so these are transformed to some pseudo-html as well. In the script, this is simply done with a few regular expressions. The next step is to define the basic JSON structure and insert the two versions of each chapter in the appropriate place. This is accomplished by the following line; where everything but “Text1” and “Text2” corresponds to the required JSON structure:</p>
<pre>Json = "{ \"witnesses\" : [ { \"id\" : \"M1\", \"content\" : \"" + Text1 + "\" }, { \"id\" : \"M2\", \"content\" : \"" + Text2 + "\" } ] }"</pre>
<p>The rest of the script is simply about reading the files associated with each chapter and saving the resulting JSON files.</p>
<p><strong>Calling CollateX from the command line</strong></p>
<p>Assuming CollateX is already installed (see the documentation; the following assumes that the operating system is Linux/Unix-based), the next step is to call CollateX from the command line with the appropriate parameters. The basic command which is necessary for this is very simple. In the Terminal, you simply need to move to the folder in which CollateX is installed and there, run the following command:</p>
<pre>java -jar collatex-tools-1.7.1.jar --algorithm dekker --format dot --output json-gv/chapter1.gv json-gv/chapter1.json</pre>
<p>This basically tells the computer to use Java to run the CollateX “jar” file, using the “dekker” algorithm, producing output in the “dot/graphviz” format, writing the output to a file called “chapter1.gv” and doing all of this with the file called “chapter1.json”, which is in the folder “json-gv”. Again, of course, as soon as you have more than a couple of sections or chapters, it makes sense to have them all dealt with automatically, using a little loop in bash. That bash/shell script called “cx-dot-loop.sh” (also included on GitHub) looks as follows:</p>
<pre>#!/bin/bash
# Run CollateX on every JSON file in json-gv/;
# the output for chapter1.json is written to chapter1.json.gv.
for file in json-gv/*.json
do
    java -jar collatex-tools-1.7.1.jar --algorithm dekker --format dot --output "$file.gv" "$file"
done</pre>
<p>Basically, all this script does is repeat exactly what I described above for each file ending in “.json” in the folder json-gv. So instead of repeating the command as many times as there are chapters, you simply call “bash cx-dot-loop.sh” and wait a little bit. As an alternative to the “dot” format chosen here, you could also select “graphml” as the output format (and adapt the file extension accordingly), which essentially contains the same information but in an XML format.</p>
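<p>If you prefer to stay in Python for the whole workflow, the same loop can be sketched with the standard library. The function names below are invented for this illustration; the jar name and the command-line options are the ones used in the commands above. By default the sketch only prints the commands (a dry run), so nothing happens until you ask it to actually call Java.</p>

```python
import subprocess
from pathlib import Path

JAR = "collatex-tools-1.7.1.jar"  # jar name as used in the commands above

def collatex_command(json_file, fmt="dot", extension=".gv"):
    """Build the CollateX call for one JSON file (hypothetical helper)."""
    json_file = Path(json_file)
    # chapter1.json -> chapter1.gv (or chapter1.graphml, etc.)
    output = json_file.with_suffix(extension)
    return ["java", "-jar", JAR, "--algorithm", "dekker",
            "--format", fmt, "--output", str(output), str(json_file)]

def collate_folder(folder="json-gv", dry_run=True):
    """Print, or with dry_run=False run, the command for each JSON file."""
    for json_file in sorted(Path(folder).glob("*.json")):
        command = collatex_command(json_file)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises an error if CollateX exits with a failure.
            subprocess.run(command, check=True)

if __name__ == "__main__":
    collate_folder()
```

<p>For graphml output, the call would become <code>collatex_command(path, fmt="graphml", extension=".graphml")</code>.</p>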
<p><strong>Inspecting the output files: where are the transpositions?</strong></p>
<p>The nice thing about the dot/graphviz format is that it can both be inspected as a text file and visualized as a graph, as this plain text format is rendered as a graph by the XDot viewer. In the dot/graphviz format, each continuous textual fragment is represented as a node with an identifier; both are listed together at the beginning of the file. This looks as follows:</p>
<pre>v84 [label = "Ares 4 will be landing at the Schiaparelli crater, which is about "];
v85 [label = "3,200 km "];
v86 [label = "3200 kilometers "];
v87 [label = "away from my location here in "];
v88 [label = "the "];
v89 [label = "Acidalia Planitia.\n "];</pre>
<p>Then, information about the sequence which these fragments follow in each of the versions is represented as links between nodes, with a label identifying the relevant version. For the nodes shown above, the edges look as follows (M1 and M2 identify the two versions of <em>The Martian</em>):</p>
<pre>v84 -> v85 [label = "M1"];
v84 -> v86 [label = "M2"];
v85 -> v87 [label = "M1"];
v86 -> v87 [label = "M2"];
v87 -> v89 [label = "M2"];
v87 -> v88 [label = "M1"];
v88 -> v89 [label = "M1"];</pre>
<p>Finally, there is a special type of link between nodes whose text is identical in the two or more versions but appears at a different position in each of them. This special type of link represents, you guessed it, the transpositions. They appear at the end of the file, for instance like so:</p>
<pre>v258 -> v266 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];
v273 -> v275 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];</pre>
<p>Now, this is human-readable only in a very limited sense, and is probably not meant to be. The biggest problem for humans is, of course, bringing the information about the text represented by each node together with the information about the links between the nodes. This is no less true for the transpositions than for any other link, so we can quickly see that there are two transpositions here, but not what they are. The graphml format essentially works the same way, except that it is XML and therefore a bit more verbose. That’s why there is the graphical version, so let’s have a look at that!</p>
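<p>Before turning to the graphs, note that the node texts and the transposition links can also be joined with a short script. The following is a rough sketch rather than anything CollateX provides: it assumes that the dot file looks like the samples above, with one node or edge per line, and that any edge without a “label” attribute is a transposition. The node identifiers and texts in the sample string are invented for illustration.</p>

```python
import re

# Patterns based on the dot samples shown above (an assumption about
# the file layout, not an official description of the format).
NODE = re.compile(r'^\s*(v\d+)\s*\[\s*label\s*=\s*"(.*)"\s*\];\s*$')
EDGE = re.compile(r'^\s*(v\d+)\s*->\s*(v\d+)\s*\[(.*)\];\s*$')

def find_transpositions(dot_text):
    """Return a list of (text, text) pairs, one per transposition edge."""
    labels, pairs = {}, []
    for line in dot_text.splitlines():
        node = NODE.match(line)
        edge = EDGE.match(line)
        if node:
            labels[node.group(1)] = node.group(2)
        elif edge and "label" not in edge.group(3):
            # Ordinary edges carry a witness label such as "M1";
            # edges without one are treated as transpositions.
            pairs.append((edge.group(1), edge.group(2)))
    # Fall back to the node identifier if a label is missing.
    return [(labels.get(a, a), labels.get(b, b)) for a, b in pairs]

# Invented miniature example in the style of the CollateX output.
sample = """
v1 [label = "So "];
v2 [label = "why not "];
v3 [label = "Why not"];
v1 -> v2 [label = "M1"];
v2 -> v3 [ color = "lightgray", style = "dotted" arrowhead = "none", arrowtail = "none" ];
"""

if __name__ == "__main__":
    for left, right in find_transpositions(sample):
        print(repr(left), "<->", repr(right))
```

<p>Applied to a real file, you would pass in the contents of, for example, one of the “.gv” files produced above. This gives a quick list of what has been transposed, although it says nothing about the surrounding context.</p>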
<p><strong>Change the output parameters and look at your graphs!</strong></p>
<p>In order to view dot/graphviz files, you need an appropriate viewer. There are surprisingly few options for this, but the <em>XDot</em> viewer for Linux does the trick, if not very gracefully. Now another issue appears: it is surprisingly difficult to spot the transpositions in those graphs, which are stretched out very far vertically. The following is a very small section of such a graph (the passage in Figure 1 corresponds roughly to the dot-file sample above):</p>
<p><div id="attachment_992" style="width: 510px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter2_example" rel="attachment wp-att-992"><img aria-describedby="caption-attachment-992" loading="lazy" class="wp-image-992 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter2_example-500x286.png" alt="chapter2_example" width="500" height="286" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example-500x286.png 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example-300x171.png 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example-768x439.png 768w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_example.png 1389w" sizes="(max-width: 500px) 100vw, 500px" /></a><p id="caption-attachment-992" class="wp-caption-text">Figure 1: Small section of a variant graph based on CollateX dot/graphviz output.</p></div></p>
<p>This very neatly visualizes the textual flow of each version and the differences between them. It is read from top to bottom as follows: Just start reading at the top node and then follow the arrow marked with either one of the versions (M1 or M2) to read that version; or look at the alternative paths taken by each version to compare the two versions at any point in the text. And, for a slightly more interesting example, a small section from the beginning of chapter 2 of <em>The Martian</em>:</p>
<p><div id="attachment_960" style="width: 425px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter2_start" rel="attachment wp-att-960"><img aria-describedby="caption-attachment-960" loading="lazy" class="wp-image-960 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter2_start-415x500.png" alt="chapter2_start" width="415" height="500" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start-415x500.png 415w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start-249x300.png 249w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start-768x925.png 768w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_start.png 831w" sizes="(max-width: 415px) 100vw, 415px" /></a><p id="caption-attachment-960" class="wp-caption-text">Figure 2: The beginning of chapter 2 of The Martian, visualized as a variant graph using CollateX and the dot/graphviz output.</p></div></p>
<p>This is already a somewhat cleaned-up version: the line breaks have been reintroduced so that the nodes are not stretched out too much horizontally, something which dramatically increases readability.</p>
<p>For a book chapter of any reasonable length, the resulting graph is very long indeed, and there is no way you can search for a transposition in it, as it is fundamentally just marked by a light-grey dotted line! In the output, I changed the properties of the transposition edge to be a blue, bold line to make spotting the transpositions a little bit easier. (And I’d recommend to the developers of CollateX to make something like this the default behaviour, even if transpositions could be considered secondary information; your users will love you for it!) So finally, here we have a transposition, even if it is a very simple one!</p>
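<p>Restyling the transposition edges does not have to be done by hand. Here is a minimal sketch of one way to do it, under the same assumption as before that an edge line without a “label” attribute is a transposition; the function name is hypothetical, and “blue” and “bold” are ordinary Graphviz colour and style values.</p>

```python
import re

# An edge whose attribute list contains no "label" is taken to be a
# transposition (an assumption based on the dot samples shown earlier).
TRANSPOSITION = re.compile(
    r'^(\s*v\d+\s*->\s*v\d+\s*)\[(?![^\]]*label)[^\]]*\];',
    re.MULTILINE,
)

def highlight_transpositions(dot_text, color="blue", style="bold"):
    """Return dot_text with all transposition edges restyled."""
    new_attrs = (f'[ color = "{color}", style = "{style}", '
                 'arrowhead = "none", arrowtail = "none" ];')
    # Keep the "vX -> vY " part and replace only the attribute list.
    return TRANSPOSITION.sub(lambda m: m.group(1) + new_attrs, dot_text)

if __name__ == "__main__":
    demo = ('v1 -> v2 [label = "M1"];\n'
            'v2 -> v3 [ color = "lightgray", style = "dotted" '
            'arrowhead = "none", arrowtail = "none" ];\n')
    print(highlight_transpositions(demo))
```

<p>Reading each “.gv” file, passing its contents through this function and writing the result back is then enough to make the transpositions stand out in the viewer.</p>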
<p><div id="attachment_961" style="width: 348px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter4_transposition" rel="attachment wp-att-961"><img aria-describedby="caption-attachment-961" loading="lazy" class="wp-image-961 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter4_transposition-338x500.png" alt="chapter4_transposition" width="338" height="500" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter4_transposition-338x500.png 338w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter4_transposition-203x300.png 203w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter4_transposition.png 655w" sizes="(max-width: 338px) 100vw, 338px" /></a><p id="caption-attachment-961" class="wp-caption-text">Figure 3: A simple example of a transposition detected and visualized using CollateX, from chapter 4 of The Martian.</p></div></p>
<p>It is intuitively visible that the phrase “why not” appears early on in the sentence in the first version of <em>The Martian</em> (M1), which reads: “So why not bring them all in?” And the same little phrase also appears in the same context, but in a different position, in the second version: “So I brought them all in. Why not?” (M2). Interestingly, CollateX is flexible enough here to identify the transposition despite the fact that the two segments differ with regard to capitalization (and the text has not been transformed to lowercase in this analysis). Note also that CollateX identifies single words that have been transposed as well as longer sequences of text. I’m not sure how far back and ahead it checks, but the issues I encountered with performance when trying to collate an entire novel suggest it does look quite far.</p>
<p>This is just a very simple illustration of the principle, but even with just two versions as in the case of <em>The Martian</em>, there can be considerable complexity in the way many apparently small, even unimportant, local changes interact. And the collation graph produced by CollateX brings out this complexity rather nicely without overwhelming the reader, even though I do find it a bit confusing that the two versions switch sides, from left to right or vice versa, occasionally.</p>
<p>Simply for illustration, one more example of a transposition, this time from chapter 2 of <em>The Martian</em>, where the transposition concerns a larger segment of text and is part of a more complicated rewriting of the sentences involved.</p>
<p><div id="attachment_962" style="width: 510px" class="wp-caption aligncenter"><a href="http://dragonfly.hypotheses.org/954/chapter2_transposition" rel="attachment wp-att-962"><img aria-describedby="caption-attachment-962" loading="lazy" class="wp-image-962 size-large" src="http://dragonfly.hypotheses.org/files/2016/08/chapter2_transposition-500x500.png" alt="chapter2_transposition" width="500" height="500" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-500x500.png 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-150x150.png 150w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-300x300.png 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-768x769.png 768w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-160x160.png 160w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition-320x320.png 320w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2016/08/chapter2_transposition.png 1130w" sizes="(max-width: 500px) 100vw, 500px" /></a><p id="caption-attachment-962" class="wp-caption-text">Figure 4: Variant graph with a transposition, from chapter 2 of The Martian.</p></div></p>
<p>It would be a pity to miss out on such phenomena when collating several variants of a text, wouldn’t it? Although, come to think of it: There appears to be a second transposition here which the CollateX algorithm does not catch: that of “5 square meters”, which in version 1 appears right after the “dirt gods” section, whereas in version 2, it appears before that section, in the form “five square meters” (note the spelled-out number). Well, that’s a tricky one, I guess, and is a welcome reminder that algorithmic approaches never have 100% recall and precision…</p>
<p>To conclude: I hope this little post has been able not only to show you why it may be interesting to look at transpositions, but also to explain how you may go about detecting them in your own data using CollateX and why this tool is an excellent choice for collation and visualization of textual variants containing transpositions. Don’t miss out on transpositions next time you do a collation! And, of course, don’t miss out on our full paper on <em>The Martian</em> once it comes out.</p>