What the perfect repository for text analysis looks like (to me)
rss:link
https://dragonfly.hypotheses.org/388
content:encoded
<p>The longer I work with various collections of literary texts, available in various formats and for use with various tools, the more I would like to have a nice repository which others and I could use to ingest, store, transform, update and extract text collections. So what would this repository look like? Basically, I’m describing a use case, and I would be very interested to hear from you out there, fellow text analysis practitioners, whether you have aspects to add to this, and how you are currently dealing with the issues outlined here.</p>
<p><span id="more-388"></span></p>
<div id="attachment_391" style="width: 310px" class="wp-caption alignleft"><a href="http://dragonfly.hypotheses.org/files/2013/07/massivereading.jpg"><img aria-describedby="caption-attachment-391" loading="lazy" class="size-medium wp-image-391 " src="http://dragonfly.hypotheses.org/files/2013/07/massivereading-300x274.jpg" alt="Massive reading! (Photo credit: &quot;Work with schools: after a book talk, showing boys gathered...&quot; by the New York Public Library, Flickr commons, http://www.flickr.com/photos/lselibrary/3925726691/.)" width="300" height="274" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2013/07/massivereading-300x274.jpg 300w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2013/07/massivereading-500x457.jpg 500w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2013/07/massivereading-328x300.jpg 328w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2013/07/massivereading.jpg 760w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-391" class="wp-caption-text">Massive Reading! (Photo credit: “Work with schools: after a book talk, showing boys gathered…” by the New York Public Library, Flickr commons, http://www.flickr.com/photos/lselibrary/3925726691/.)</p></div>
<p>Basically, I see the following steps in my “repository for text analysis use case”, roughly from beginning to end:</p>
<ol>
<ol>
<li>Ingest texts coming from various sources (Gutenberg, Wikisource, TextGrid, ABU, theatre-classique.fr, ebooksgratuits.com, you name it) and, hence, in various formats (txt, html, epub, doc, XML, TEI P4, TEI P5), into the repository.</li>
<li>Transform all of these text formats into a central “master” format, essentially a fairly simple implementation of TEI, and make them valid against a schema defining that implementation.</li>
<li>Add typological metadata to the TEI header, including things like: genre, sub-genre, author gender, literary epoch, narrative form, etc.</li>
<li>Create various derivative files from the master files, especially plain text files split into pieces in various ways (by size in kB or number of words, or by structural segments like chapters, paragraphs, scenes or acts).</li>
<li>Create such derivative files containing only specific parts of the master files, like only text from the “body”, or only “speeches”, or everything but quotations.</li>
<li>Create collections of derivative files based on the typological metadata, such as a collection of all crime fiction novels from the 20th century, or all comedies in verse written by women (not that many yet).</li>
<li>Use the typological metadata to flexibly generate filenames for the derivative files, for example “author_title” or “genre_year-title”.</li>
<li>The master and/or derivative files can be further annotated linguistically (tokenization, lemmatization, POS-tagging) and otherwise (named entity recognition, speakers, etc.).</li>
<li>The master files can be corrected, updated and versioned, and new sets of derivatives can be generated from the updated masters.</li>
<li>The collections of derivative files can be published for documentation, reproducibility and reuse, including reuse by external analysis services.</li>
<li>All data is stored safely and securely, and access can be more or less restricted so that collaborative work on the collections becomes possible.</li>
<li>And finally, of course, and because 12 is a nice number, use these collections of derivative text files for all kinds of computational text analysis.</li>
</ol>
</ol>
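<p>To make steps #4, #5 and #7 a little more concrete, here is a minimal Python sketch, not a description of any existing tool. It assumes a master file in the TEI P5 namespace; the sample document, the function names and the “{author}_{title}” pattern are all made up for illustration, and a real master file would need a complete header to be valid.</p>

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny hypothetical master file; the header is deliberately incomplete.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Le Menteur</title>
        <author>Corneille</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <front><p>Preface text that should be left out.</p></front>
    <body>
      <sp><speaker>DORANTE</speaker><l>First line of verse here</l></sp>
      <sp><speaker>CLITON</speaker><l>Second line of verse follows</l></sp>
    </body>
  </text>
</TEI>"""


def read_metadata(root):
    """Read author and title from the TEI header."""
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    author = root.findtext(".//tei:titleStmt/tei:author", default="", namespaces=TEI_NS)
    return {"author": author.strip(), "title": title.strip()}


def speech_words(root):
    """Step #5: keep only verse lines inside speeches in the body (no speaker names)."""
    words = []
    for line in root.iterfind(".//tei:body//tei:sp/tei:l", TEI_NS):
        words.extend("".join(line.itertext()).split())
    return words


def split_by_words(words, size):
    """Step #4: cut a list of words into plain-text chunks of at most `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def make_filename(metadata, pattern, index):
    """Step #7: build a filename from metadata using a pattern such as '{author}_{title}'."""
    safe = {key: re.sub(r"\W+", "-", value).strip("-").lower()
            for key, value in metadata.items()}
    return pattern.format(**safe) + f"_{index:03d}.txt"


root = ET.fromstring(SAMPLE_TEI)
metadata = read_metadata(root)
chunks = split_by_words(speech_words(root), size=4)
derivatives = {make_filename(metadata, "{author}_{title}", i): chunk
               for i, chunk in enumerate(chunks, start=1)}

for name, chunk in derivatives.items():
    print(name, "->", chunk)
```

<p>In this sketch the derivatives are only held in memory; writing them to disk, or splitting by chapters or acts instead of word counts, would follow the same pattern with a different selection path.</p>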
<p>Don’t we all need this in some way or another, when doing computational text analysis? Currently, I am doing these things with a wild, somewhat functional but ultimately clunky combination of a wide array of tools and services (I’m probably forgetting some):</p>
<ol>
<li>Calibre to create txt or rtf from epub and other formats (#1 above)</li>
<li>OxGarage to create TEI from various formats (#1 above)</li>
<li>A little bit of XSLT in jEdit or oXygen, to derive selective content from TEI files and write txt files (#4 and #5)</li>
<li>RegEx in jEdit or oXygen, to create or clean up TEI files (#1 and #3)</li>
<li>RelaxNG in jEdit or oXygen, for validation (#3)</li>
<li>Dropbox for storing everything and accessing it myself (#11)</li>
<li>“cat”, “split” and “rename” on the command line to split and merge text files and to rename the files of specific text collections (#7)</li>
<li>TreeTagger via an R wrapper, for linguistic annotation (#8)</li>
<li>“stylo” package for R, WEKA, Gephi, TXM, jEdit, for all kinds of queries and analyses (#12)</li>
<li>Doing things manually, for good measure and difficult cases (all steps, even #12: I do read stuff in the good old way, too.)</li>
</ol>
<p>Notice that in the first list, #9 (update with versioning), #10 (publish for reuse) and #11 (storage with access control) are not really catered for by this <em>bricolage</em> setup. More importantly, however, the workflow involves too many tools which are not really connected with each other, so that the process of creating a given derivative of a text cannot easily be repeated, either by myself or, just as importantly, by others.</p>
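<p>One way to think about repeatability is to write down the “recipe” for a derivative collection (selection criteria from #6, segmentation from #4, filename pattern from #7) as data and to store it with the output. The following Python sketch illustrates that idea only; the <code>Recipe</code> class, the field names and the catalogue records are invented placeholders, not part of TextGrid, DARIAH or any other existing service.</p>

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical description of how one derivative collection is built."""
    name: str
    criteria: dict                      # metadata values a text must match (#6)
    unit: str = "chapter"               # how master files are segmented (#4)
    filename_pattern: str = "{author}_{title}"  # naming scheme (#7)


# Invented placeholder records standing in for TEI header metadata (#3).
CATALOGUE = [
    {"id": "t001", "genre": "crime fiction", "century": 20},
    {"id": "t002", "genre": "comedy", "century": 17},
    {"id": "t003", "genre": "crime fiction", "century": 19},
    {"id": "t004", "genre": "crime fiction", "century": 20},
]


def select(catalogue, criteria):
    """Return the records whose metadata match every criterion exactly."""
    return [record for record in catalogue
            if all(record.get(key) == value for key, value in criteria.items())]


def manifest(recipe, selected):
    """Describe a collection so that others can rebuild and check it."""
    payload = {
        "recipe": asdict(recipe),
        "texts": sorted(record["id"] for record in selected),
    }
    # Fingerprint of recipe plus selection: it changes if either one changes.
    canonical = json.dumps(payload, sort_keys=True)
    payload["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return payload


recipe = Recipe(name="crime-20c",
                criteria={"genre": "crime fiction", "century": 20})
selected = select(CATALOGUE, recipe.criteria)
result = manifest(recipe, selected)

print(json.dumps(result, indent=2))
```

<p>Publishing such a manifest next to the derivative files would let someone else rerun the same selection and compare fingerprints. Note that the fingerprint here covers only the recipe and the list of text identifiers, not the content of the master files, so versioned masters (#9) would need their own identifiers.</p>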
<p>Currently, here in Würzburg, we are exploring how to provide for most if not all of these things with <a href="http://textgrid.de">TextGrid</a> and the <a href="http://de.dariah.eu">DARIAH</a> infrastructure. I’m personally curious to hear how others do this, and your answers would also be of great interest to us as we further define this use case. So, how do you do it? What is needed but not in the list? What works well for you? What doesn’t?</p>
<p>Of course, I have not really said what my perfect repository would <em>look like</em>: I’m not sure, but I do know that when I use it, I will be impressed and gratified by its elegance, flexibility and responsiveness. It will work beautifully.</p>