(Please note that a revised version of this post has been published in the Journal of Digital Humanities in December 2013. – The original post is a slightly edited version of a talk I gave at the European Summer School “Culture & Technology” at the University of Leipzig, Germany, on July 26, 2013.)
This talk is about data in the humanities. Most of my colleagues in literary and cultural studies would not necessarily speak of their objects of study as „data”. If you ask them what it is they are studying, they would rather speak of books, paintings and movies; of drama and crime fiction, of still lifes and action painting; of German expressionist movies and romantic comedy. They would mention Denis Diderot or Toni Morrison, Chardin or Jackson Pollock, Fritz Lang or Diane Keaton. Maybe they would talk about what they are studying as texts, images, and sounds. But rarely would they consider their objects of study to be „data”.
Sources of digital data for the humanities
However, in the humanities just as in other areas of research, we are increasingly dealing with „data”. With digitization efforts in the private and public sectors going on around the world, more and more data relevant to our fields of study exists, and, if the data has been licensed appropriately, it is available for research.[] The digital humanities aim to rise to the challenge and realize the potential of this data for humanistic inquiry. As Christine Borgman has shown in her book on Scholarship in the Digital Age, this is as much a theoretical, methodological and social issue as it is a technical issue.[]
Indeed, the existence of all this data raises a host of questions, some of which I would like to address today. For example:
- What is the relation between the data we have and our objects of study? – Does data replace books, paintings and movies? In what way can it be said to represent them?
- What difference does it make to analyze the digital representation or version of a novel or a painting instead of the printed book, the manuscript, or the original painting?
- What types of data are there in the humanities, and what difference does it make? – I will argue that one can distinguish two types of data, “big” data and “smart” data. What, then, does it mean to deal with big data, or smart data, in the humanities?
- What new ways of dealing with data do we need to adopt in the humanities? – How are big data and smart data dealt with in the process of scholarly knowledge generation, that is, when data is created, enriched, analyzed and interpreted?
1. What is data (in the humanities)?
So, what do we mean by “data” in the context of research in the humanities? First of all, let’s remember how data is generally defined. This is the definition from Wikipedia:
Data are values of qualitative or quantitative variables, belonging to a set of items. Data in computing (or data processing) are represented in a structure, often tabular (represented by rows and columns), as a tree (a set of nodes with parent-children relationship) or as a graph structure (a set of interconnected nodes).[]
This definition is already quite useful in our context. However, it fails to introduce some important distinctions. For instance, there is structured and unstructured data as well as semi-structured data. Structured data is typically held in a database in which all key/value pairs have identifiers and clear relations and which follows an explicit data model. Plain text is a typical example of unstructured data, in which the boundaries of individual items, the relations between items, and the meaning of items, are mostly implicit. Data held in XML files is an example of semi-structured data, which can be more or less strictly constrained by the absence or presence of a more or less precise schema.
Another important distinction is between data and metadata. Here, the term „data” refers to the part of a file or dataset which contains the actual representation of an object of enquiry, while the term „metadata” refers to data about that data: metadata explicitly describes selected aspects of a dataset, such as the time of its creation, or the way it was collected, or what entity external to the dataset it is supposed to represent.
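To make these distinctions concrete, here is a minimal sketch in Python with invented values: an unstructured, a structured and a semi-structured representation of the same kind of material, plus a small metadata record describing the dataset.

```python
# A minimal sketch with invented values, purely for illustration.

# Unstructured data: plain text; item boundaries and meaning remain implicit.
unstructured = "Madame Vauquer, née de Conflans, est une vieille femme ..."

# Structured data: explicit keys, values and relations, as in a database table.
structured = [
    {"id": 1, "author": "Balzac", "title": "Le Père Goriot", "year": 1835},
    {"id": 2, "author": "Diderot", "title": "Jacques le fataliste", "year": 1796},
]

# Semi-structured data: markup makes some structure explicit,
# but a schema may constrain it more or less strictly.
semi_structured = "<div type='chapter'><head>I</head><p>Madame Vauquer ...</p></div>"

# Metadata: data about the data, e.g. how and when the dataset was created.
metadata = {"created": "2013-07-26", "source": "hypothetical digitization project"}

print(structured[0]["title"], "-", metadata["created"])
```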
Independent of its type, any dataset relevant to research represents specific aspects of the object of scrutiny, be it in the natural sciences, the social sciences or the humanities. Data is not the object of study itself, but “stands in” for it in some way. Also, data is always a partial representation of the object of study. In some cases, however, it is our only window into the object of study. Still, this “disadvantage” of partial representation is small compared to the fact that digital data can be transformed, analyzed, and acted upon computationally.
Data in the humanities is a bit special: one could in fact argue that the text in a book or a manuscript, or the visual elements making up a painting, are data already. On the one hand, however, this is analog, non-discrete data, which cannot be analyzed or transformed computationally; and on the other hand, language and texts, just like paintings, are semiotic systems that have dimensions beyond the physically measurable, dimensions which depend on semantics and pragmatics, that is on meaning in context. For this latter reason particularly, speaking of “data” in the humanities is not unproblematic and has been challenged. Criticism has come from mainstream scholars who view “data”, and quantitative methods of analyzing it, with suspicion, because the apparent empiricism of data-driven research in the humanities seems at odds with principles of humanistic inquiry, such as context-dependent interpretation and the inevitable “situatedness” of the researchers and their aims.
Some practitioners of Digital Humanities, notably Johanna Drucker, have argued that the term “data” is actually inadequate. And indeed, the term’s etymology seems problematic in the context of the humanities: it comes from the Latin datum, which means “that which is given”. This means it carries with it the meaning of an observer-independent fact which cannot be challenged in itself. Johanna Drucker prefers to speak of “capta” instead of data, literally “that which has been captured or gathered”, underlining the idea that even the very act of capturing data in the first place is oriented by certain goals, done with specific instruments, and driven by a specific attention to a small part of what could have been captured given different goals and instruments. In other words, capturing data is not passively accepting what is given, but actively constructing what one is interested in.[]
My take on this is that rather than use a new term, we should redefine what we mean by data in the humanities in this sense. So, data in the humanities could be considered a digital, selective, machine-actionable construction of the object of humanistic inquiry. Whether we are historians using texts or other cultural artifacts as windows into another time or another culture, or whether we are literary scholars using knowledge of other times and cultures in order to determine the meaning of texts, digital data adds another layer of mediation to the equation. In this way data, just like the tools with which we manipulate it, adds complexity to the relation between researchers and their objects of study.
Basically, I would like to argue that there are two basic types of data in the humanities: big data and smart data. These two types of data can be described along two dimensions: the first dimension describes how structured, clean and explicit the data is; the second dimension describes how voluminous and how varied the data is. I suggest viewing big data, in a first approximation, as relatively unstructured, messy and implicit, relatively large in volume, and varied in form. Conversely, I suggest viewing smart data as semi-structured or structured, clean and explicit, as well as relatively small in volume and of limited heterogeneity. Although you could say that these are really just differences of degree, there are more fundamental differences between them when it comes to looking at how each of them is created or captured, modeled, enriched, and analyzed.
2. Smart data (in the humanities)
When we move from a book to a digitized version of the text it contains, we are not necessarily dealing with big or smart data right away. It may very well be small and simple, not to say „messy”, data in the beginning. This is probably the least useful type of data. So what do I mean by „smart data”?
First of all, I should mention that „smart data” is not an established or well-defined term. It is not very widespread and does not have a stable meaning. Smart data is data that is structured or semi-structured; it is explicit and enriched, because in addition to raw data it contains information in the form of annotations, enrichments and metadata. And smart data is „clean”, in the sense that imperfections of the process of capture or creation have been reduced as much as possible, within the limits of the specific aspect of the original object being represented. This also means that smart data is „small” in volume, because its creation involves human agency and demands time. The process of modeling the data is essential to small/smart data; its abstract structure can be defined with elaborate schemas or as predefined database structures.
A prototypical example of smart data are scholarly digital editions produced according to the guidelines of the Text Encoding Initiative. Technically, TEI documents are usually considered semi-structured: they generally follow a data model expressed in a schema, but such schemas allow for considerable flexibility. In addition to a very clean transcription of the text, such digital editions can make a lot of information explicit. First, TEI files contain not just the full text, but also metadata associated with the text (in the teiHeader section). Second, markup makes the structure of the text explicit: parts, chapters, headings and paragraphs, as well as page and line breaks, for example. Finally, many more types of information can be made explicit: person names in a novel or play, place names in letters or documents, links to other parts of the document and to external documents, and much more. Making all of these things explicit allows us to visualize them in specific ways and to index, count and analyze them computationally.
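To illustrate what „machine-actionable” means here, the following sketch uses Python's standard XML library to read the title from the teiHeader of a hypothetical TEI file (play.xml) and to count the explicitly tagged person names; the file name and its contents are assumptions for illustration only.

```python
# A minimal sketch: indexing and counting explicitly tagged elements in a TEI file.
from collections import Counter
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = ET.parse("play.xml")  # hypothetical TEI-encoded play
root = tree.getroot()

# Metadata lives in the teiHeader; here we simply read the title.
header = root.find("tei:teiHeader", TEI_NS)
title = header.find(".//tei:title", TEI_NS) if header is not None else None
print("Title:", title.text if title is not None else "unknown")

# Because person names are explicitly tagged as <persName>,
# they can simply be collected and counted.
names = [el.text.strip() for el in root.findall(".//tei:persName", TEI_NS) if el.text]
for name, count in Counter(names).most_common(10):
    print(name, count)
```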
But let's move on to another example of „smart data”. This data comes from a study of literary description in the eighteenth-century novel which I conducted some years ago.[] The aim was to identify all descriptive passages in a collection of thirty-two novels published between 1760 and 1800 and to find out how, from the standpoint of literary stylistics, descriptive writing „functioned” at that time. For this, I used a bibliographic reference management system as the frontend to a database of descriptive passages which I collected and tagged for dozens of features I considered relevant to the study. Using a database and adding information to the descriptive excerpts changed the way I interacted with the data, as opposed to manual annotation and note-taking: the keywords support an explicit, iterative process of modeling; hypotheses can be tested; and outliers cannot simply be ignored. Also, it would have been really difficult to deal with the 2,000 examples and their many tags without having them in a database. Because of the relatively large number of examples used for this study, recurring patterns of usage and trends over time and across sub-genres could be detected.[]
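The following toy example, with invented records rather than the actual study data, shows the kind of query that becomes trivial once the excerpts live in a database of tagged records: aggregating tag frequencies per decade.

```python
# A purely illustrative sketch: tagged excerpts as records, counted per decade.
from collections import Counter

excerpts = [
    {"novel": "A", "year": 1762, "tags": ["portrait", "pause", "narrator"]},
    {"novel": "B", "year": 1785, "tags": ["landscape", "character-perspective"]},
    {"novel": "C", "year": 1797, "tags": ["portrait", "character-perspective"]},
]

# Aggregate tag frequencies per decade: easy on a database of records,
# very hard to do reliably with manual note-taking alone.
by_decade = {}
for e in excerpts:
    decade = (e["year"] // 10) * 10
    by_decade.setdefault(decade, Counter()).update(e["tags"])

for decade in sorted(by_decade):
    print(decade, by_decade[decade].most_common(3))
```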
Despite these significant transformations of the way we work with texts when they are available as „data”, I believe the move from print culture to smart digital data is actually a rather small step compared to the steps required by big data (we’ll get to that in a minute). In the digital medium, we can also read texts, look at images, make annotations, and write down ideas and syntheses.
Now, this is all well and good: smart data as we find it in scholarly digital editions, in annotated linguistic corpora and in carefully curated image collections is immensely useful. However, there is an issue with smart data: it does not scale well. Although various aspects of creating smart data, such as carefully encoded TEI documents, can be at least partially automated, ultimately smart data depends on manual work by real people. Classifying descriptions and their contexts according to formal, semantic and narratological categories is not something computers can do just yet. This means that it is very time-consuming to create large volumes of smart data.
Of course, there are ways to deal with this, and I will get to that later in this talk. Machine Learning will no doubt be the key to these challenges. But what if we actually don’t really need smart data? What if having a lot of relatively unstructured, relatively messy data is just as useful and much easier to create? This kind of data is called „big data”, so let’s have a closer look at this alternative model of data.
3. Big data
Big data is the buzz-word of the decade. Everyone wants big data and big data technologies; big data experts are telling large corporations they won’t keep their competitive edge without big data. Areas as diverse as online marketing, stock exchange trading, health care, and political campaigns are driven by big data. The European Commission and the German Ministry of Education and Research hold „big data” conferences and fund big data research.
So, what does it mean for data to be „big”? In a recent best-selling book about the „Big Data Revolution”, the authors propose a non-technical, outcome-oriented definition of big data:
Big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.[]
Wikipedia gives a more technical definition of big data, and points to the challenges rather than the opportunities of big data:
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.[]
In fact, „big data” is usually defined, following a definition by Doug Laney, by the three „V”s: volume, velocity and variety.[] First, the idea that big data is defined by its (large) volume is seemingly the most obvious of the three „V”s. However, when does a large volume of data really become „big data”? What if the data you want to analyze simply exceeds the memory of your computer, forcing you to move processing to a grid computing system? What if it is too large to be efficiently accessed in a relational database, so that you need to use a different way of representing your data?
Second, the idea of the „velocity” of data really means two things: first, that data is constantly being generated, whether by sensors (in the natural sciences) or as a by-product of people's activities in a digital environment (in economics or the social sciences), creating a constant influx of new data. Second, that this flow of data is analyzed in real time, so the analysis has to be very quick and responsive. In turn, this makes it possible to react to the data immediately.
Finally, the idea of „variety” of big data means that heterogeneous sources and formats of data are being used together, taking advantage of the links and overlap between such heterogeneous datasets to allow all kinds of inferences. The challenges here lie particularly in the fact that all these various datasets cannot be integrated into one unified dataset. What the idea of „variety” also implies is a variety of ways these datasets are structured, or a relative lack of structure in the datasets.
Actually, „big data” is a moving target, because along with the increasing volume, velocity and variety of data, new technologies are constantly becoming more readily available. For example, so-called „NoSQL” databases are becoming more widespread and avoid some of the limitations that more traditional SQL databases have when it comes to heterogeneous or unstructured data. A solution like „Hadoop”, which allows the distributed but closely coordinated processing of huge volumes of data on hundreds or thousands of machines in a grid, is cutting-edge and fancy now, but will be mainstream very soon.
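The following toy word count, written in plain Python rather than Hadoop, merely illustrates the map/reduce principle behind such systems: the „map” step runs independently on each chunk of data (and could be distributed across many machines), while the „reduce” step merges the partial results.

```python
# A toy illustration of the map/reduce principle; not Hadoop itself.
from collections import Counter
from multiprocessing import Pool

def map_count(chunk_of_text):
    """Map step: count words in one chunk, independently of all other chunks."""
    return Counter(chunk_of_text.lower().split())

def reduce_counts(partial_counts):
    """Reduce step: merge the partial counts into one global result."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    chunks = ["To be or not to be", "that is the question", "to sleep perchance to dream"]
    with Pool() as pool:                       # the map step could run on many machines
        partials = pool.map(map_count, chunks)
    print(reduce_counts(partials).most_common(5))
```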
Big data in the humanities is not the same as big data in the natural sciences or in economics. In most cases, velocity does not play a key role in big humanities data right now. Also, the large „volume” is less usefully defined in the humanities by a shift from databases to distributed computing. Variety of formats, complexity or lack of structure does come into play, however. In fact, the distinctive mark of big data in the humanities seems to be a methodological shift rather than a primarily technological one. And it is a huge methodological shift. Paradoxically, the shift from small smart data to big data is much more radical, I would argue, than the shift from print to smart digital data was.
Moving from smart data to big data implies a shift from „close reading” to „distant reading” (in the words of Franco Moretti) or to „macroanalysis” (to use Matthew Jockers’ term).[]
Instead of reading a few selected texts, we analyze an entire collection of relevant textual data.
The first consequence of this is that instead of operating on the level of literary forms and conventions, of semantics and context, we operate with quantitative measures of low-level features, on the basis of statistics and probabilities. The second consequence is that instead of so-called „representative” texts or paintings, we can now study the entire set of texts or images relevant to a specific research question. Trends in literature can be observed across the entire literary production of a given time and given genre. Questions of representativeness, of canonization, of literary quality play a much smaller or at least a different role in this context.
Nineteenth-Century Novels (License: CC-BY)
If this sounds too good to be true, that is because it is. Despite massive digitization efforts by private and public actors, we are still far from having the complete record of human culture and creativity, even if we are looking just at text. For the British nineteenth-century novel, for example, the calculation goes as follows: an estimated 20,000 novels were published in the nineteenth century; of these, only around 6,000 are estimated to be still extant in the holdings of libraries and private collections worldwide. Of these 6,000 novels, however, only about half, that is about 3,500 novels, have been digitized in full-text mode and are available for research; that corresponds to less than twenty percent of the total production.[] This would certainly be a good sample size if it were a random sample; but of course it is not. So the 3,500 novels are still a small amount and not a number that actually resolves the issues of sampling, representativeness and canonization. Still, this is a huge improvement over the mainstream canon, which probably does not include more than one hundred novels for the nineteenth century, and which may be visualized as a tiny spot at the bottom of the slide.
This means that examples of „really big” data in the humanities are still relatively rare. Even Google Books is not quite there yet. Google has scanned more than 30 million books, as of April 2013, and continues to scan more. Compared to the estimated 130 million books ever published, this is a large part of the written human record. But of course, this is neither exhaustive, nor are books the only medium of print publication; newspapers, magazines and journals would need to be added to this. But what really counts, from my point of view, is less the volume than the methods used for analysis. And these methods can be successfully applied to smaller sets of data as well, and they imply precisely the methodological paradigm shift I mentioned, from close to distant reading.
I would like to give just one example from my own work, dealing with French drama from the seventeenth and eighteenth centuries and involving a maximum of 520 individual plays. That's not big data in the technical sense of the three „V”s, but it does require a radical shift from close-reading methods to quantitative, statistical analysis.
The basic question I have been addressing for the last year or so is how traditional categories of literary history, such as literary genres, forms and eras, relate to classifications made on the basis of the actual linguistic material. What comes out of this type of analysis, which can be pushed further in a variety of ways, is that there are indeed correlations between linguistic features on the one hand, and large categories from literary history on the other hand; but also, that these are not simple and stable correlations, but highly complex and varying ones.[]
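The following sketch shows the general shape of such an analysis, not the actual pipeline used in the study: the plays, genre labels and parameter choices are placeholders. It groups texts by their word frequencies (here with scikit-learn) and cross-tabulates the resulting clusters against traditional genre labels.

```python
# A minimal sketch: clustering plays by word frequencies and comparing the
# clusters to genre labels from literary history. Texts and labels are invented.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

plays = ["texte de la pièce un ...", "texte de la pièce deux ...", "texte de la pièce trois ..."]
genres = ["tragédie", "comédie", "tragédie"]   # traditional labels from literary history

# Low-level linguistic features: weighted frequencies of the most common words.
X = TfidfVectorizer(max_features=500).fit_transform(plays)

# Group the plays purely on the basis of these features.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate clusters against genre labels to see how far they correlate.
print(Counter(zip(genres, clusters)))
```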
Conclusion: towards smarter big data or bigger smart data
For most of this talk, I have been opposing big data and smart data. Indeed, big data tends to involve large volumes of raw, plain, somewhat messy text, whereas smart data tends to involve smaller volumes of carefully encoded, very clean text. Big data needs to be analyzed with methods from statistics, such as cluster analysis or principal component analysis, whereas smart data can be analyzed with specific tools that take advantage of structural, linguistic and contextual markup. Big data requires visualization even to start understanding its possible structure, whereas smart data makes its structures explicit. In big data applications, outliers and ambiguities are said to matter little, because they get smoothed over by the sheer quantity of information that is good enough, whereas smart data makes exceptions and ambiguities explicit and effectively reduces possible ambiguities.
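As a small illustration of the statistical, visualization-driven style of analysis mentioned above (not of any particular study), the following sketch projects a document-term matrix onto its first two principal components with scikit-learn and matplotlib; the texts are placeholders.

```python
# A minimal sketch: a first visual overview of a corpus via principal component analysis.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

texts = ["first messy text ...", "second messy text ...",
         "third messy text ...", "fourth messy text ..."]

X = CountVectorizer().fit_transform(texts).toarray()   # dense matrix for PCA
coords = PCA(n_components=2).fit_transform(X)           # reduce to two dimensions

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Corpus overview via principal component analysis")
plt.show()
```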
That said, I believe the most interesting challenge of the coming years, when it comes to dealing with data in the humanities, will be to move beyond this opposition of smart and big data. What we need is bigger smart data or smarter big data, and to create and use it, we need to adopt new methods. So, how can we enrich big data sufficiently to make more intelligent queries possible? How can we speed up the process of creating smart data so that we can produce larger volumes of it?
Basically, there are two possible ways to do this: one is automatic annotation, the other is crowdsourcing. Automation refers to various heuristics for discovering implicit units, structures, patterns and relations, and for making them explicit in the data. Crowdsourcing, on the other hand, relies on breaking a large task down into units so small that each of these little tasks can be performed in a distributed way by a large number of volunteers. Various strategies have been developed for breaking up the tasks, for creating incentive structures that motivate volunteers (such as „gamification” or „win-win” constellations), and for reintegrating the added information into the project.
In fact, automation and crowdsourcing will have to work hand in hand. This is what happens with OCR: better and better optical character recognition systems are still no match for manual double-keying by experts, especially when it comes to print published before 1800 or to handwriting. But state-of-the-art OCR, combined with algorithms that detect potential areas of error and with cleverly crowdsourced, distributed error-correction mechanisms such as the ones implemented by „reCAPTCHA”, goes a long way towards producing more reliable full text.
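As a hypothetical illustration of this division of labour (not of how reCAPTCHA or any particular system actually works), the following sketch flags suspicious OCR tokens with two crude heuristics, a word list and a digit check, so that only the flagged tokens would need to be routed to human volunteers for correction.

```python
# A minimal sketch: flag OCR tokens that probably need a human eye.
# The word list and heuristics are assumptions for illustration only.
import re

known_words = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def flag_suspicious(ocr_text):
    """Return tokens that are not in the word list or contain digits."""
    suspicious = []
    for token in ocr_text.split():
        word = token.strip(".,;:!?").lower()
        if not word:
            continue
        if word not in known_words or re.search(r"\d", word):
            suspicious.append(token)
    return suspicious

# Flagged tokens would be shown to volunteers, and the corrections
# written back into the full text.
print(flag_suspicious("The qvick brown f0x jumps ouer the lazy dog."))
```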
Similarly, automatic annotation, even of basic linguistic features and for well-researched languages, is still too faulty to be trusted blindly, at least from a „clean smart data” perspective. We will have to find ways of detecting potentially faulty linguistic annotation, of finding and motivating users to check such annotations, and of writing the corrections back into ever larger collections of clean, structured and well-annotated text.
The story of smart and big data (Licence: CC-BY)
To summarize the story of data in the humanities which I have been trying to tell, one could say that this story has several steps. The first step leads from the study of creative works in the form of books, paintings and movies to their study based on digital representations of these works; this is what digitization at its most basic level has brought about. The first of two possible steps from there leads to smart data, that is to data that has been carefully curated, structured and annotated in a way that makes explicit a lot of the information that is implicit in the „raw” and messy digitized artefacts. This happens, prototypically, in scholarly digital editions of texts or music scores. The second of the two possible steps from „raw digital data” leads to big data, simply by accumulating more and more data and letting the algorithms sort it all out, instead of cleaning it up by hand. The last step in this story is to reconcile, or rather to combine, the smart and the big data approaches.
In any case, the fact that we are all here at the Leipzig Summer School, some of you learning how to make data smart and use it, and some of you learning how to use big data in smart ways, shows that this kind of convergence is well under way.
Notes