This HTML5 document contains 15 embedded RDF statements represented using HTML+Microdata notation.

The embedded RDF content will be recognized by any processor of HTML5 Microdata.

Namespace Prefixes

Prefix   IRI
n7       http://demo.openlinksw.com/about/id/http/dragonfly.hypotheses.org/91/
wdrs     http://www.w3.org/2007/05/powder-s#
dc       http://purl.org/dc/elements/1.1/
n2       https://dragonfly.hypotheses.org/
rss      http://purl.org/rss/1.0/
n6       http://demo.openlinksw.com/about/id/http/www.dragonfly.hypotheses.org/
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
content  http://purl.org/rss/1.0/modules/content/
xsdh     http://www.w3.org/2001/XMLSchema#

Statements

Subject Item
n2:574
rdf:type
rss:item
dc:creator
Christof Schöch
wdrs:describedby
n6:91 n7:
dc:date
2014-03-23T18:37:56Z
dc:subject
Regular Expressions My research HTML TEI html2tei Python
rss:title
Enrichment by Elimination, or: How to turn HTML into simple TEI using Python
rss:link
https://dragonfly.hypotheses.org/574
rss:description
There are lots of full text repositories of literary works out there, be it the venerable Project Gutenberg (founded in 1971, when the internet was just a few dozen computers), a pioneer like Gallica (with increasing amounts of plain text in the 90-95% correct OCR range), or a crowdsourced effort like Wikisource (with nifty quality indicators). Closer to my geographical...
content:encoded
<p><a href="http://dragonfly.hypotheses.org/files/2014/03/python.jpeg"><img loading="lazy" class="size-thumbnail wp-image-576 alignleft" src="http://dragonfly.hypotheses.org/files/2014/03/python-150x150.jpeg" alt="python" width="150" height="150" srcset="https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2014/03/python-150x150.jpeg 150w, https://f-origin.hypotheses.org/wp-content/blogs.dir/857/files/2014/03/python.jpeg 225w" sizes="(max-width: 150px) 100vw, 150px" /></a>There are lots of full text repositories of literary works out there, be it the venerable <a href="http://www.gutenberg.org/">Project Gutenberg</a> (founded in 1971, when the internet was just a few dozen computers), a pioneer like <a href="http://gallica.bnf.fr/">Gallica</a> (with increasing amounts of plain text in the 90-95% correct OCR range), or a crowdsourced effort like <a href="http://wikisource.org/wiki/Main_Page">Wikisource</a> (with nifty quality indicators). Closer to my geographical location are initiatives like TextGrid&#8217;s <a href="http://textgridrep.de/">Digitale Bibliothek</a> and the <a href="http://www.deutschestextarchiv.de/">Deutsches Textarchiv</a> (both very professional and academic projects). And closer to my academic interests in French and Spanish literature are platforms like <a href="http://abu.cnam.fr/">ABU</a> (offering French literary texts since 1993), <a href="http://athena.unige.ch/athena/">Athena</a> (my personal eye-opener, around since 1994), the <a href="http://www.cervantesvirtual.com/">Biblioteca Virtual Miguel de Cervantes</a> (a labyrinth with many treasures) and, more recently, <a href="http://www.ebooksgratuits.com">Ebooks libres et gratuits</a>. A nice meta-catalogue is <a href="http://www.noslivres.net/">Nos livres</a> with more than 7,000 entries to date.<span id="more-574"></span></p> <p>That&#8217;s plenty of sources for literary texts. 
But that&#8217;s also plenty of file formats in which you find them (and varying quality of the texts, I might add). Some do offer texts in <a href="http://www.tei-c.org/index.xml">TEI</a> (like TextGrid and Deutsches Textarchiv) and Project Gutenberg has been offering xHTML for a while. But the Biblioteca Virtual as well as Athena are using mostly HTML (and other less useful formats), and the vast collection of Ebooks gratuits is in EPUB or HTML. Also, one of the more useful output formats you get when doing OCR with FineReader is HTML (with quite a bit of control over what goes into it). EPUB itself is just a container format, and the textual content itself is marked up in xHTML, which you can get at using <a href="http://calibre-ebook.com/">Calibre</a>&#8217;s conversion feature (which has plenty of options).</p> <p>So the question is, how do we get from all this HTML to a format that plays nicely in quantitative text analysis? Wait a second, you say, can&#8217;t I just save the stuff as plain text and be ready to go? I think there is a point to be made for quick-and-easy solutions, and lots of tools for quantitative text analysis are very happy with plain text. But in the long run, I do believe that it makes a lot of sense to have a slightly more expressive master format to work from, even if you do generate plain text from it at some point. Enter TEI, of course, whose eminent usefulness for quantitative text analysis I have lauded before.</p> <p>Using TEI as my master format for storing my collections of text has a number of advantages: first of all, I can keep some metadata directly connected to my text; not just the author and title, but also things like the provenance of the digital text and the edition of the work it is based on. Second, I can keep texts like prefaces and notes associated with the text right there in my file, but can easily ignore them when I generate a derivative text file. 
And third, I can make the text&#8217;s structure explicit and use that to later selectively pull out parts of the text alone, without having to rely on complicated heuristics.</p> <p>So, how can we create TEI from HTML? One perfectly valid strategy, I believe, would be to turn it first into xHTML and then use <a href="http://www.tei-c.org/oxgarage/">Oxgarage</a>. However, that is two steps and you don&#8217;t have a lot of control over what ends up in your TEI file. Also, many EPUBs and HTML files found on the platforms mentioned before have all kinds of funny &#8220;span&#8221; elements and &#8220;class&#8221; attributes in them, which you don&#8217;t really need. What to do, then? Well, with a tiny bit of <a href="https://www.python.org/">Python</a> (thanks to Fotis for the jump-start) and a small amount of <a href="http://docs.python.org/3.3/library/re.html">Regular Expressions</a>, it becomes really easy to eliminate a lot of clutter from the HTML, add a little TEI header, and mark up the chapters and headings. This is how your files become enriched by elimination and transformation.</p> <p>The Python script I&#8217;m proposing here is in a really early stage of development (pre-alpha, at most). The script basically only does three things: It opens an HTML file, it applies a series of search-and-replace actions on the file content, and it then saves the result as an XML file. HTML can come in all kinds of forms, and even with novels alone (I have not even tried anything else), the result is not always perfect. However, it is already a big step forward, for myself, from doing this ad-hoc for every file and every aspect of it (I tried that for some hours, too). The next steps will be to make the script increasingly generic and to make it work on a whole set of files at a time. Maybe I will also have to branch it off into different versions for different types of texts (like, for instance, plays from FineReader vs. novels from Calibre). 
You may <a href="https://github.com/christofs/tei-texts">check out the script on GitHub</a>, along with a modest start for a collection of TEI-encoded French literary texts. As a little bonus, that proof-of-concept collection of 14 novels is also available as a <a href="https://dl.dropboxusercontent.com/u/1786135/ROMANS.txm">TXM binary corpus (165MB)</a>, already POS-tagged and ready to be loaded into <a href="http://textometrie.ens-lyon.fr/?lang=en">TXM</a>. Comments (here) and issues (there) are welcome, of course!</p> <p>[Edit, 28.4.2014: After writing <a title="Extracting selected text from XML files using lxml in Python" href="http://dragonfly.hypotheses.org/624">a companion script and blog post</a>, I have updated the code and <a href="https://github.com/clgs/toolbox">moved it to a new location</a>.]</p>
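<p>To make the three steps described above more concrete (open an HTML file, apply a series of search-and-replace actions, save the result as XML), here is a minimal sketch. It is not the script from the repository: the names html_to_tei, convert_file, CLEANUP_RULES and TEI_TEMPLATE, the particular cleanup rules and the wording of the header are all assumptions made for illustration, and real-world HTML will most likely need more rules than these.</p>

```python
import re
from xml.sax.saxutils import escape

# Hypothetical cleanup rules (pattern, replacement), applied in order.
CLEANUP_RULES = [
    (r"</?span[^>]*>", ""),             # drop span tags, keep the text inside
    (r"<(p|h[1-6])\s[^>]*>", r"<\1>"),  # strip attributes such as class
    (r"<br\s*/?>", "<lb/>"),            # HTML line break -> TEI lb element
    (r"&nbsp;", " "),                   # entity that plain XML does not define
]

# A very small TEI skeleton; the header wording is a placeholder.
TEI_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt><title>{title}</title><author>{author}</author></titleStmt>
<publicationStmt><p>Unpublished working copy.</p></publicationStmt>
<sourceDesc><p>Converted from HTML.</p></sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
{body}
</body>
</text>
</TEI>
"""


def html_to_tei(html, title, author):
    """Turn an HTML string into a simple TEI string (illustrative only)."""
    # Keep only the content of the body element, if there is one.
    match = re.search(r"<body[^>]*>(.*)</body>", html, flags=re.S | re.I)
    text = match.group(1) if match else html

    # Elimination: apply every search-and-replace rule in turn.
    for pattern, replacement in CLEANUP_RULES:
        text = re.sub(pattern, replacement, text, flags=re.I)

    # Split at headings: [before, head1, content1, head2, content2, ...]
    parts = re.split(r"<h[1-6]>(.*?)</h[1-6]>", text, flags=re.S | re.I)
    divs = []
    if parts[0].strip():
        # Anything before the first heading goes into an untyped div.
        divs.append("<div>\n" + parts[0].strip() + "\n</div>")
    for head, content in zip(parts[1::2], parts[2::2]):
        divs.append(
            '<div type="chapter">\n<head>' + head.strip() + "</head>\n"
            + content.strip() + "\n</div>"
        )

    return TEI_TEMPLATE.format(
        title=escape(title), author=escape(author), body="\n".join(divs)
    )


def convert_file(src_path, dst_path, title, author):
    """Open an HTML file, transform its content, save the result as XML."""
    with open(src_path, encoding="utf-8") as infile:
        html = infile.read()
    with open(dst_path, "w", encoding="utf-8") as outfile:
        outfile.write(html_to_tei(html, title, author))
```

<p>A call such as convert_file("novel.html", "novel.xml", "Some Title", "Some Author") (with made-up file names) would then write one TEI file per input file. Keeping the rules in a plain list is a deliberate choice in this sketch: supporting a different kind of source, say plays from FineReader rather than novels from Calibre, would mostly mean swapping in a different list.</p>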