
Source: https://dragonfly.hypotheses.org/915


Creator
  • Christof Schöch
Date
  • 2016-03-17T09:38:40Z
Subject
  • Python
  • names
  • fiction
  • Articles
  • OCR
  • narrative texts
  • spellcheck
  • text collection
  • text quality
Title
  • How good are our texts, really? Quality assurance for literary texts from various sources

    by Ulrike Henny and Christof Schöch — this post originally appeared on the CLiGS blog.

    Some weeks ago, we made our “New Year’s release” of text collections available. We publish the texts in the “CLiGS” group’s GitHub repository called “textbox” and archive each release on Zenodo, where they get a DOI. The texts are encoded in TEI with relatively detailed metadata. The collections are subsets of the texts we are using in our various research projects in computational genre stylistics and contain narrative texts from France, Spain and Latin America. The texts have been gathered from various sources, most notably among them Ebooks libres et gratuits and the Biblioteca Virtual Miguel de Cervantes.

    What’s the problem, or: what is text quality?

    We are proud of these collections, of course, but what we don’t know (or didn’t know until recently) is how reliable our texts really are. We do record, in the TEI header’s source description, where each digital text comes from and, whenever we can, which printed edition was used in establishing the digital text. And we know that Ebooks libres et gratuits, for example, publishes proofread texts. However, we don’t know for sure what types of errors there tend to be, whether all texts are equally reliable, or whether texts from one source have typical errors that texts from other sources don’t have. This is even more important when our sources provide images but no full text: even if errors are not already present in the source, structural and orthographic errors are very likely to be introduced by the OCR process on our part of the pipeline. So this is about checking the quality of texts obtained elsewhere as well as checking the results of our own working process and text treatment.

    The next question was: what falls under this category of textual reliability or quality? Certainly, the spelling of individual words comes to mind first, and it is paramount (especially for “bag-of-words” approaches). Here, an additional question is whether we want reliable texts in the sense of faithful reproductions of the source edition, or good texts in the sense of an absence of spelling mistakes by modern standards. But other aspects are important as well, of course, for example punctuation (it affects sentence segmentation, which is important for many stylistic measures, and is also interesting in itself). More fundamentally, it needs to be checked that the text is actually complete and does not leave out or mash up chapters, sections or paragraphs. Finally, the structural integrity and validity of the textual markup are also aspects of text quality.

    So, how can we test how well our texts conform to these kinds of expectations? Remember that we are gathering hundreds of digital texts and that our project is not really about producing such texts, as a digital editing project would be, but about using them. Therefore, we need semi-automatic procedures of quality assurance and cannot rely on thorough proofreading of each text. It is simply not our priority.[1] For checking the completeness of the texts, we have not found any other way than manually checking the table of contents of a digital facsimile against our hierarchy of chapter divisions. For the validity of the markup, we have of course our TEI schema and can automatically validate the TEI files. But what about the spelling?

    Our approach

    To check the text quality on the orthography and character level, we decided to combine a spellchecker with lists of “exception words” (i.e. additional legitimate words) like named entities, foreign words and other special cases that might not be covered by the spellchecker. The goal was to get a list of remaining errors and information about how often each error occurs. We would then correct “typical” and “easy” errors, that is, errors recurring frequently throughout the text collection which may easily be replaced automatically. The residual uncommon errors or those which are not easily replaceable would be left unchanged but documented together with the text collection in order to make the quality of the texts transparent.
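    To illustrate the basic idea, here is a minimal sketch (not the CLiGS script itself): a token only counts as an error if neither the dictionary nor any exception list accepts it, and the remaining errors are collected together with their frequencies. The tiny dictionary, name list and token sequence below are made up for the example.

    ```python
    from collections import Counter

    def count_errors(tokens, is_known_word, exception_words):
        """Count tokens that neither the dictionary nor the exception lists accept."""
        errors = Counter()
        for token in tokens:
            if is_known_word(token) or token in exception_words:
                continue
            errors[token] += 1
        return errors

    # Hypothetical toy data standing in for a real dictionary and a named-entity list.
    dictionary = {"le", "docteur", "arrive", "à"}
    named_entities = {"Tarascon"}
    tokens = ["le", "docteur", "arive", "à", "Tarascon", "arive"]
    print(count_errors(tokens, dictionary.__contains__, named_entities).most_common())
    # [('arive', 2)]
    ```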

    Fortunately, there are dictionaries for the languages relevant to us (French and Spanish) that are freely available and have been developed and used in other contexts, e.g. the dictionaries used by OpenOffice/LibreOffice for spellchecking. Equally fortunately, there are libraries for applying those dictionaries to a text collection from a script. We chose the Python module “pyenchant” to implement the spellchecking.
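    As a minimal illustration of how pyenchant is used (assuming a French MySpell/Hunspell dictionary is installed under the tag “fr_FR”; the exact behaviour depends on the dictionary in use):

    ```python
    import enchant

    # Load the installed French dictionary ("fr_FR" assumes a MySpell/Hunspell
    # French dictionary is available on the system).
    dic = enchant.Dict("fr_FR")

    print(dic.check("cœur"))     # expected True: standard spelling
    print(dic.check("coeur"))    # often False: many dictionaries flag the ASCII simplification
    print(dic.suggest("etait"))  # correction candidates, typically including "était"
    ```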

    Our own Python script, which is part of the CLiGS toolbox and is available at https://github.com/cligs/toolbox/blob/master/spellchecking.py, works as follows. It presupposes that:

    • the texts are available as a collection of plain text files. Because they do not otherwise contain any metadata, the file names are assumed to serve as identifiers for the texts.
    • a dictionary of the text language has been installed. The pyenchant module can deal with different kinds of dictionaries, e.g. MySpell, Aspell and Hunspell. The documentation of the pyenchant module explains how to install those on the various operating systems.
    • optionally, lists of exception words, e.g. named entities, have been prepared as plain text files
    • optionally, a tokenizer exists for the language in question. By default, the pyenchant module only comes with an English tokenizer. In our case, it worked quite well for Spanish as well, but not for French, because of the many apostrophized forms of the articles. We therefore extended the pyenchant module with a basic French tokenizer (a sketch follows this list). The pyenchant documentation explains how to create custom tokenizers.
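    A rough sketch of what such a French tokenizer can look like (this is an illustration, not the actual extension we added; it simply splits off elided articles and pronouns so that the remainder of the word is what gets spellchecked):

    ```python
    import re

    # Elided articles/pronouns such as l', d', j', n', qu' that precede the actual word.
    ELISION = re.compile(r"^(qu|[cdjlmnst])['’](?=\w)", re.IGNORECASE)
    WORD = re.compile(r"[\w'’-]+")

    def tokenize_fr(text):
        """Yield (word, offset) pairs in the style of pyenchant tokenizers."""
        for match in WORD.finditer(text):
            word, pos = match.group(), match.start()
            elision = ELISION.match(word)
            if elision:
                word = word[elision.end():]   # drop "l'" etc., keep the word itself
                pos += elision.end()
            if word:
                yield (word, pos)

    print(list(tokenize_fr("Il m'a dit que l'homme n'était pas là.")))
    # "m'a" yields "a", "l'homme" yields "homme", "n'était" yields "était", ...
    ```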

    The CLiGS spellchecker takes the path to the text collection, the language code and the (optional) path to exception word lists as input. It produces a CSV file as output, containing an overview of errors in the whole collection (errors are listed for each text file and the entries are sorted by the number of times an error occurs throughout the collection, in descending order).
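    The following is a condensed sketch of that workflow (the actual spellchecking.py in the CLiGS toolbox differs in its details; the folder, exception file and output file names below are hypothetical):

    ```python
    import csv
    import glob
    import os
    from collections import Counter, defaultdict

    import enchant
    from enchant.tokenize import get_tokenizer

    def check_collection(text_folder, lang, exception_files=(), out_csv="errors.csv"):
        """Spellcheck all *.txt files in text_folder and write a CSV of errors,
        sorted by how often each error occurs in the whole collection."""
        dic = enchant.Dict(lang)            # e.g. "fr_FR" or "es_ES"
        tokenizer = get_tokenizer("en_US")  # pyenchant only ships an English tokenizer
        exceptions = set()
        for path in exception_files:        # named entities, foreign words, abbreviations, ...
            with open(path, encoding="utf-8") as infile:
                exceptions.update(line.strip() for line in infile if line.strip())

        totals = Counter()                  # error -> count in the whole collection
        per_file = defaultdict(Counter)     # error -> counts per text file
        for path in glob.glob(os.path.join(text_folder, "*.txt")):
            name = os.path.basename(path)   # the file name serves as the text identifier
            with open(path, encoding="utf-8") as infile:
                for word, _pos in tokenizer(infile.read()):
                    if word in exceptions or dic.check(word):
                        continue
                    totals[word] += 1
                    per_file[word][name] += 1

        with open(out_csv, "w", newline="", encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            writer.writerow(["error", "total", "occurrences per file"])
            for word, total in totals.most_common():   # descending overall frequency
                files = "; ".join(f"{n}: {c}" for n, c in per_file[word].most_common())
                writer.writerow([word, total, files])

    # Hypothetical call: French texts in "texts/", one exception list with names.
    check_collection("texts/", "fr_FR", exception_files=["names.txt"])
    ```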

    The results: names, names, names, foreign words, …; also, genuine errors

    The results we obtained for the collection of French nineteenth-century novels were highly instructive. In our first iteration, just using the default spellchecker, there were around 8,000 different errors in 4.3 million words, with 270 of the errors occurring more than 100 times. Horrible, right?

    It turned out, however, that most of these “errors” were in fact names of people and places which are not part of the dictionary used by the spellchecker. Adding an existing list of named entities helped a lot, but many of the names used in our nineteenth-century novels were not included there. So we went through our error list and added named entities found there to the list of legitimate named entities. It was instructive that many of the named entities we added designate people and places from regions outside of France, especially Spain, Africa and North America. There are quite a few adventure novels in our set, after all.

    The next thing we noticed was that there are a lot of words in the list of errors which are words from other languages, mostly from Spanish, Italian and Latin. In addition, some colloquial and dialectal words appear, but decidedly fewer than could have been expected. So, we manually extracted the foreign words from our list of errors and established a list of legitimate foreign words as well. Things got considerably better.

    The other strategy that turned out to be instructive was to sort the errors by decreasing frequency in the collection. We realized that only a very small proportion of errors appear more than once in the collection. This is both good and bad news: Good news, because single errors should not cause too many problems with statistical measures of textual similarity, especially as it is easy to exclude such de facto “hapax legomena” from calculations. Good news, again, because fixing the few very frequent errors should help us get a much better error score for our texts with relatively little effort. Bad news, however, because fixing the remaining errors does look like a lot of work. In any case, many of those frequent errors turned out to be genuine spelling mistakes or historical spellings, mostly related to missing or erroneous accents (e.g. etait or piége) as well as to spelling simplifications (e.g. coeur instead of cœur), probably a legacy from attempts to provide ASCII-compatible texts. The last round of checking found 2,800 errors, only 24 of which appear more than 10 times and can easily be fixed. Much better.
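    A sketch of what fixing those few frequent errors can look like: whole-word replacements for recurring, unambiguous mistakes (the replacement table below is illustrative, not the list actually applied to the textbox):

    ```python
    import re

    # Illustrative mapping of frequent, unambiguous errors to their corrections.
    FREQUENT_FIXES = {
        "etait": "était",
        "piége": "piège",
        "coeur": "cœur",
    }

    def fix_frequent_errors(text, fixes=FREQUENT_FIXES):
        for wrong, right in fixes.items():
            # \b restricts the replacement to whole words
            text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
        return text

    print(fix_frequent_errors("Son coeur etait pris au piége."))
    # Son cœur était pris au piège.
    ```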

    The results for the Spanish-American texts were quite similar to those obtained for the French texts. Most of the words which were not recognized by the spellchecker were named entities. The abbreviation “Vd.” was classified as an error, so a list of acceptable abbreviations was added in addition to a collection-specific named-entity list. Interestingly, some region-specific and colloquial words stand out as frequent errors in individual texts, e.g. “milico” (militiaman) in the novel El Chacho by Eduardo Gutiérrez. Further common errors in this collection are words with diminutive suffixes, which are particularly widespread in Mexican texts.

    Yet another important group of errors in this collection were historical spellings. Those are very source-specific errors. Texts from the Biblioteca Virtual Miguel de Cervantes, for example, have already been modernized. But for other sources that use first or early nineteenth-century editions, the orthography has not always been updated. In addition, OCRed facsimile editions tend to have many historical spelling “errors”. Thus, this group of errors is spread very unevenly across the text collection, depending on where the texts come from and how they have been prepared.

    Foreign words (French, Italian, English, Latin, …) occur as well, but in most cases with just one or a few instances. To give a number: in the 24 Spanish-American novels which are part of the first textbox release, 5,197 single-instance errors were detected by the spellchecker (out of a total of 1,266,000 words, so 0.41%).

    The consequences: corrections and transparency

    Our take-away from this exercise concerns three things: One is that we now know which very frequent, genuine mistakes there are and that it is worth fixing them, something we will do for the upcoming release of the textbox (you can preview that release in our “next” branch). The second is that the remaining errors (which in many cases aren’t really errors at all, but legitimate though unusual terms) will have to stay in the texts, because it is too much work to identify, check and correct each of them. However, we are publishing the error analysis tables along with our text collections, so that anyone interested in this issue can see which texts contain which words of questionable spelling accuracy. And third, we realize that spell-checking historical literary texts is a special task, but also that our lists of legitimate words are probably highly collection-specific and will not translate easily to collections from another time and/or genre. However, that’s the best we can do with limited resources, and we think it is at least better than just trusting our sources and the OCR software.

    Notes

    1. Alexander Geyken et al. discuss this issue for the “Deutsches Textarchiv”, where distributed proofreading is the strategy of choice. See Geyken et al., “TEI und Textkorpora: Fehlerklassifikation und Qualitätskontrolle vor, während und nach der Texterfassung im Deutschen Textarchiv”, Forum Computerphilologie, 2009, http://computerphilologie.digital-humanities.de/jg09/geykenetal.pdf .