
About: https://dragonfly.hypotheses.org/574


AttributesValues
type
Creator
  • Christof Schöch
described by
Date
  • 2014-03-23T18:37:56Z
Subject
  • HTML
  • Python
  • My research
  • Regular Expressions
  • TEI
  • html2tei
rss:title
  • Enrichment by Elimination, or: How to turn HTML into simple TEI using Python
rss:link
content:encoded
  • There are lots of full text repositories of literary works out there, be it the venerable Project Gutenberg (founded in 1971, when the internet was just a few dozen computers), a pioneer like Gallica (with increasing amounts of plain text in the 90-95% correct OCR range), or a crowdsourced effort like Wikisource (with nifty quality indicators). Closer to my geographical location are initiatives like TextGrid’s Digitale Bibliothek and the Deutsches Textarchiv (both very professional and academic projects). And closer to my academic interests in French and Spanish literature are platforms like ABU (offering French literary texts since 1993), Athena (my personal eye-opener, around since 1994), the Biblioteca Virtual Miguel de Cervantes (a labyrinth with many treasures) and, more recently, Ebooks libres et gratuits. A nice meta-catalogue is Nos livres with more than 7,000 entries to date.

    That’s plenty of sources for literary texts. But that’s also plenty of file formats in which you find them (and varying quality of the texts, I might add). Some do offer texts in TEI (like TextGrid and Deutsches Textarchiv), and Project Gutenberg has been offering XHTML for a while. But the Biblioteca Virtual as well as Athena mostly use HTML (and other less useful formats), and the vast collection of Ebooks gratuits is in EPUB or HTML. Also, one of the more useful output formats you get when doing OCR with FineReader is HTML (with quite a bit of control over what goes into it). EPUB itself is just a container format; the textual content inside it is marked up in XHTML, which you can get at using Calibre’s conversion feature (which has plenty of options).

    So the question is, how do we get from all this HTML to a format that plays nicely in quantitative text analysis? Wait a second, you say, can’t I just save the stuff as plain text and be ready to go? I think there is a point to be made for quick-and-easy solutions, and lots of tools for quantitative text analysis are very happy with plain text. But in the long run, I do believe that it makes a lot of sense to have a slightly more expressive master format to work from, even if you do generate plain text from it at some point. Enter TEI, of course, whose eminent usefulness for quantitative text analysis I have lauded before.

    Using TEI as the master format for storing my text collections has a number of advantages. First of all, I can keep some metadata directly connected to my text: not just the author and title, but also things like the provenance of the digital text and the edition of the work it is based on. Second, I can keep texts like prefaces and notes associated with the text right there in my file, but can easily ignore them when I generate a derivative text file. And third, I can make the text’s structure explicit and use that later to selectively pull out individual parts of the text, without having to rely on complicated heuristics.
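
To make the second and third points concrete, here is a minimal sketch of how a derivative plain text could be generated from a TEI master file with Python's standard library. The sample document, its title and author, and the function names `paragraph_text` and `body_paragraphs` are all invented for illustration; real files will need more care (for example, notes nested inside other inline elements are not filtered here).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Hypothetical miniature TEI file, invented for this example.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Novel</title><author>Jane Doe</author></titleStmt>
      <publicationStmt><p>Hypothetical example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front><div type="preface"><p>A preface we want to skip.</p></div></front>
    <body>
      <div type="chapter"><head>Chapter 1</head>
        <p>First paragraph<note>An editorial note.</note> of the story.</p>
      </div>
    </body>
  </text>
</TEI>"""


def paragraph_text(p):
    """Return the text of one <p>, leaving out direct <note> children."""
    parts = [p.text or ""]
    for child in p:
        if child.tag != f"{{{TEI_NS}}}note":
            parts.append("".join(child.itertext()))
        # The tail is the text that follows the child inside the paragraph.
        parts.append(child.tail or "")
    # Normalize runs of whitespace to single spaces.
    return " ".join("".join(parts).split())


def body_paragraphs(tei_string):
    """Return the paragraphs found in <body>, ignoring <front> and the header."""
    root = ET.fromstring(tei_string)
    return [paragraph_text(p)
            for p in root.iterfind("./tei:text/tei:body//tei:p", NS)]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_TEI)
    title = root.find("./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", NS)
    print(title.text)
    print("\n".join(body_paragraphs(SAMPLE_TEI)))
```

Because the preface sits in `<front>` and the note is its own element, leaving them out is a matter of choosing a path and skipping a tag, not of guessing where they begin and end in a plain text file.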

    So, how can we create TEI from HTML? One perfectly valid strategy, I believe, would be to turn it first into XHTML and then use Oxgarage. However, that is two steps and you don’t have a lot of control over what ends up in your TEI file. Also, many EPUBs and HTML files found on the platforms mentioned before have all kinds of funny “span” elements and “class” attributes in them, which you don’t really need. What to do, then? Well, with a tiny bit of Python (thanks to Fotis for the jump-start) and a handful of regular expressions, it becomes really easy to eliminate a lot of clutter from the HTML, add a little TEI header, and mark up the chapters and headings. This is how your files become enriched by elimination and transformation.
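
As an illustration of the kind of elimination meant here, the following is a small sketch using Python's `re` module. The three patterns and the function name `strip_clutter` are assumptions made for this example, not the rules from the actual script; regular expressions are also a blunt tool for HTML, so patterns like these would need adjusting for each source.

```python
import re


def strip_clutter(html):
    """Remove some common clutter from an HTML fragment (illustrative rules only)."""
    # Drop opening and closing <span> tags, keeping the text they wrap.
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Drop double-quoted class and style attributes from the remaining tags.
    # Caveat: this would also hit the same character sequence in running text.
    html = re.sub(r'\s+(?:class|style)\s*=\s*"[^"]*"', "", html,
                  flags=re.IGNORECASE)
    # Turn HTML headings (h1 to h6) into TEI <head> elements.
    html = re.sub(r"<h[1-6]\b[^>]*>(.*?)</h[1-6]>", r"<head>\1</head>", html,
                  flags=re.IGNORECASE | re.DOTALL)
    return html


if __name__ == "__main__":
    sample = ('<h2 class="chap">Chapitre I</h2>'
              '<p class="txt"><span class="c1">Il était</span> une fois.</p>')
    print(strip_clutter(sample))
```

The order of the substitutions matters little in this small case, but in a longer rule list it usually does, since each rule sees the output of the previous one.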

    The Python script I’m proposing here is in a really early stage of development (pre-alpha, at most). The script basically only does three things: it opens an HTML file, it applies a series of search-and-replace actions to the file content, and it then saves the result as an XML file. HTML can come in all kinds of forms, and even with novels alone (I have not even tried anything else), the result is not always perfect. However, it is already a big step forward, for myself, from doing this ad hoc for every file and every aspect of it (I tried that for some hours, too). The next steps will be to make the script increasingly generic and to make it work on a whole set of files at a time. Maybe I will also have to branch it off into different versions for different types of texts (like, for instance, plays from FineReader vs. novels from Calibre). You may check out the script on GitHub, along with a modest start for a collection of TEI-encoded French literary texts. As a little bonus, that proof-of-concept collection of 14 novels is also available as a TXM binary corpus (165MB), already POS-tagged and ready to be loaded into TXM. Comments (here) and issues (there) are welcome, of course!
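
The three steps described above (open, search and replace, save) can be sketched roughly as follows. This is not the script from the repository: the rule list `RULES`, the header template `TEI_TEMPLATE`, and the functions `html_to_tei` and `convert_file` are hypothetical names chosen for this example, the header is a bare minimum, and the whole text ends up in a single division instead of one division per chapter.

```python
import re
from pathlib import Path
from xml.sax.saxutils import escape

# Hypothetical search-and-replace rules, applied in this order.
RULES = [
    (r"<!DOCTYPE[^>]*>", ""),                         # drop the doctype
    (r"<head\b.*?</head>", ""),                       # drop the HTML <head> block
    (r"</?(?:html|body)\b[^>]*>", ""),                # drop the html/body wrappers
    (r"<h1\b[^>]*>(.*?)</h1>", r"<head>\1</head>"),   # h1 becomes a TEI <head>
]

# Minimal TEI skeleton; a real header would carry much more metadata.
TEI_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>{title}</title></titleStmt>
      <publicationStmt><p>Unpublished working file.</p></publicationStmt>
      <sourceDesc><p>Converted from HTML.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><div type="chapter">{body}</div></body></text>
</TEI>
"""


def html_to_tei(html, title):
    """Step 2: apply every rule in turn, then wrap the result in the skeleton."""
    for pattern, replacement in RULES:
        html = re.sub(pattern, replacement, html,
                      flags=re.IGNORECASE | re.DOTALL)
    return TEI_TEMPLATE.format(title=escape(title), body=html.strip())


def convert_file(source, target=None):
    """Steps 1 and 3: read an HTML file and save the result as an XML file."""
    source = Path(source)
    target = Path(target) if target else source.with_suffix(".xml")
    html = source.read_text(encoding="utf-8")
    target.write_text(html_to_tei(html, source.stem), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Running it on a whole set of files is then a short loop.
    for path in sorted(Path(".").glob("*.html")):
        print(convert_file(path))
```

Note that the output is only well-formed XML if the input HTML was itself well-formed after the replacements; unclosed tags or named entities such as `&nbsp;` would need rules of their own.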

    [Edit, 28.4.2014: After writing a companion script and blog post, I have updated the code and moved it to a new location.]
