For the last year or so, one of my major interests has been topic modeling, applied to a variety of French literary texts. I have been doing this as part of an exploration of the computational analysis of literary genres in our Würzburg junior researchers group. One of the interesting things to come out of all of this has been my realization of what a different animal “topics” can be when you are working with narrative fiction rather than expository prose. Topics are then not just abstract themes, but are related to personnel, to setting, to narrative motifs, to character descriptions, to objects, and to arts and media. Another thing to come out of this is a “topic modeling workflow” (tmw) for Python, which accepts texts in XML/TEI (or plain text) format, does some preprocessing, calls Mallet to do the topic modeling, does a bit of postprocessing and then creates a variety of visualisations (the code is available on GitHub, and a tutorial will follow here soon). Various iterations of this constantly evolving set of functions have served me well for several conference papers on French crime fiction from the nineteenth and twentieth centuries, on French Classical and Enlightenment drama, and on a comparison of French and Spanish drama from the Early Modern period.
One of the things I find challenging with this method is getting a good grasp of the overall structure of the model. Again, one of the things I have learned is that different strata of the model are interesting for different purposes: for example, the ten percent of topics with the highest probability in the text collection are highly relevant topics in that collection, but they are likely to be relatively vague and won’t differentiate texts well. Conversely, the ten percent of topics with the lowest probability in the collection tend to be highly specific, but will probably be found in only a very small number of texts by a few different authors. If you happen to be interested in one of these topics, this is a great way of finding relevant texts in a large collection. But if your focus is on genre distinctions, it is the topics with a medium probability in the collection that are most interesting, because genre, like these topics, is located precisely at this intermediate level between the whole collection and individual texts.
The question of the internal structure of a topic model is of course also related to the famous issue of how many topics you should choose to model. If that number is too low, you will probably miss out on a lot of interesting topics. If it is too high, you will end up with a larger number of interesting topics, but also with a lot of very similar ones. Some of my earlier attempts to get to grips with this were to create heatmaps of a large number of topics plotted against authors, or to create line plots of topic probabilities against time of publication. One of the more interesting things to come out of this was that when you plot a small number of relatively similar topics, they sometimes “take turns”, as it were. A group of authors at a given time will realise some general theme (like love or reasoning) in the form of a topic, and a later group of authors will also speak about this general theme, but realise it in a slightly different topic. Sometimes these topics share a few top words but contextualize them with a host of less important, but very different, words. In any case, seemingly similar topics are often subtly different and also have very different distributions. Certainly they are too interesting to be discarded.
However, this approach basically assumes that you already know which of the topics are similar or somehow related, so that plotting them together would be useful and interesting. But how do you discover similar topics automatically? The idea of performing some sort of clustering based on the probability scores of the words in each topic seemed like an interesting avenue. In a way, this is very similar to some of the well-established methods in stylometry: there, each text is treated as a word frequency vector; the text collection is represented as a high-dimensional vector space; distance measures are applied to the vectors in that space; and cluster analysis can be performed on the distance tables to see which texts are similar to each other, and may have been written by the same author. Now, if each topic is treated like a text, we can simply do the same with topics. And because, on a technical level, this is all very basic and well understood, there are packages and functions ready to be used for it, for example the scipy package for Python.
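To make the analogy concrete, here is a minimal sketch of that step (not the actual tmw code). It assumes, purely for illustration, that the topic-word scores have already been gathered into a pandas DataFrame with one row per topic and one column per word, with words absent from a topic set to zero:

```python
# Minimal sketch: treat each topic as a vector of word scores and compute
# pairwise distances between topics, exactly as stylometry does for texts.
# Assumption (for illustration): `topic_word_df` is a pandas DataFrame
# (topics x words) holding word probabilities, missing words filled with zero.
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def topic_distances(topic_word_df: pd.DataFrame, metric: str = "cosine") -> pd.DataFrame:
    """Return a symmetric topic-by-topic distance matrix."""
    condensed = pdist(topic_word_df.values, metric=metric)
    return pd.DataFrame(squareform(condensed),
                        index=topic_word_df.index,
                        columns=topic_word_df.index)
```

From such a distance matrix, the most similar pair of topics is simply the smallest off-diagonal entry; the hierarchical clustering below builds on the same condensed distances.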
With the rest of the workflow already in Python, it was an obvious step to add topic clustering to the mix, which is what I finally did today. Frankly, once the slightly inelegant part of turning the Mallet output into a nice clean dataframe is done, the rest is a breeze; everything is already available, at least if you are happy with a graph that is a bit rough around the edges. Here is an example of just such a graph, representing a topic model of 740 French novels, about one third of them crime fiction and the other two thirds various other kinds of novels. I ran tmw to obtain a topic model with 250 topics and then, based on the top-50 words and their probabilities in each of the 250 topics, produced the dendrogram below (the cosine distance measure and the complete linkage clustering algorithm were used, more on which soon).
Topic Clustering Dendrogram
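For the curious, here is a rough sketch of how such a dendrogram can be produced with scipy; this is not the actual tmw code, and it assumes the same hypothetical topics-by-words DataFrame as in the sketch above, but it uses the parameters mentioned in the text (cosine distance, complete linkage):

```python
# Rough sketch of the clustering and plotting step (not the actual tmw code).
# Assumption: `topic_word_df` is the topics x words DataFrame described above,
# e.g. 250 rows (topics) and one column per word in the combined top-50 lists.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

def plot_topic_dendrogram(topic_word_df, outfile="topic-dendrogram.png"):
    condensed = pdist(topic_word_df.values, metric="cosine")   # distance measure
    clustering = linkage(condensed, method="complete")         # linkage algorithm
    plt.figure(figsize=(60, 10))                               # 250 leaves need room
    dendrogram(clustering, labels=topic_word_df.index.tolist(), leaf_font_size=5)
    plt.ylabel("cosine distance (complete linkage)")
    plt.tight_layout()
    plt.savefig(outfile, dpi=300)
    plt.close()
```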
Now, 250 topics is quite a lot of data for a dendrogram, but there are some interesting things to find here. There are quite a number of subgroups which the algorithm distinguishes, most between two and five or six topics in size, with a few larger groups in between (each group is distinguished visually by its color; for a more detailed view, it is important to remember that the similarity of two topics is indicated by the point on the vertical axis where their lines are joined; the lower that point, the more similar the topics). Looking more closely at these groups, and looking up the respective wordles for each of the topics in a group, one can actually see that this method works quite well. Two topics which the dendrogram suggests are relatively similar, without being among the most similar pairs, are, for example, topics 181 and 248 (they are represented by the two yellow lines just right of the “50” in the legend of the graph; click to enlarge; and sorry about the huge graph). If you look at their two wordles, you see why:
Wordle for topics 181 and 248
These two topics are interesting because they are thematically very close without simply sharing their top one or two topic words. One is about comedians and clowns in a circus, the other about the various spaces and people in a theater house. But clearly, both are related. And they do share words, of course, such as “théâtre”, the top word in topic 248 and a bit less prominent in topic 181; at a lower level, they share “scène” and “spectacle”. So yes, this pairing is confirmed by a reader’s intuition. When choosing a group of topics which are joined at an even lower score, the connection is even more obvious (sometimes almost too obvious; or are the topics really so similar?). There seem to be few false positives, so the method seems to work.
Wonderful! However, this also raises a host of new questions, among them of course the old parameter issue: Which distance measure is appropriate for this type of data? Which clustering algorithm produces real insight instead of artifacts of the method? How many words should be taken into account? More specifically, which parameters allow the clustering to show topics that are really semantically similar, not just in their first one or two words, but more generally speaking? And how can we even assess this, when our perception of the topics “themselves” is always mediated by endless lists of words and numerical scores, or by tables and graphs trying to summarize them? This issue of parameters is particularly important because the data used here is inherently very sparse: if all went well when modeling, a great many topics do not share many of their top-10 or top-50 words.
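One small, purely heuristic way to start exploring this parameter space is sketched below: loop over a few distance measures and linkage methods and record the cophenetic correlation coefficient, i.e. how faithfully each resulting dendrogram preserves the original pairwise distances. This says nothing about semantic quality, and the parameter lists are only examples, but it can at least flag combinations that distort the distance structure badly:

```python
# Heuristic sketch: compare a few parameter combinations by their cophenetic
# correlation (agreement between dendrogram distances and original distances).
# The metric and method lists are examples, not recommendations.
from itertools import product
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

def explore_parameters(topic_word_df,
                       metrics=("cosine", "correlation", "euclidean"),
                       methods=("complete", "average", "weighted")):
    results = {}
    for metric, method in product(metrics, methods):
        condensed = pdist(topic_word_df.values, metric=metric)
        clustering = linkage(condensed, method=method)
        coph_corr, _ = cophenet(clustering, condensed)
        results[(metric, method)] = coph_corr
    return results
```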
A final question raised by this method is whether it can really tell us something about the quality of the topic model. What is it that we want in a topic model? Not too many topics which are too similar? Many different small groups of highly similar topics? Can we use this to compare models obtained using different parameters for modeling? Again, some of the things we learned in stylometry could be useful here, for instance using the Rand Index to assess cluster quality across topic models (a small sketch of this idea follows below). And there are of course many other ways of assessing (and quantifying) semantic similarity which could be contrasted with the approach used here. Clearly, this is just the start.
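As a hedged illustration of that Rand Index idea, the sketch below cuts two hierarchical clusterings into the same number of flat clusters and compares them with the adjusted Rand index from scikit-learn; it simply assumes that both clusterings are over the same set of items, which glosses over the harder question of how topics from different models should be aligned in the first place:

```python
# Hedged sketch: compare two clusterings with the adjusted Rand index.
# Assumption: both linkage matrices cluster the same items (e.g. the same
# topics under different clustering parameters); aligning topics from two
# *different* topic models is a separate, harder problem.
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import adjusted_rand_score

def compare_clusterings(linkage_a, linkage_b, n_clusters=40):
    labels_a = fcluster(linkage_a, t=n_clusters, criterion="maxclust")
    labels_b = fcluster(linkage_b, t=n_clusters, criterion="maxclust")
    return adjusted_rand_score(labels_a, labels_b)  # 1.0 = identical partitions
```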