Friday, October 17, 2014

Textual Corpora and the Digital Islamic Humanities, Day 1

I'm in Providence, at Brown, for a digital humanities and text corpora workshop geared towards people working in Islamic Studies fields.

A bit of a conference report follows the jump.

Normally I would not be totally comfortable posting a conference report like this because it's basically reproducing others' intellectual work, presented orally and possibly provisionally, in written and disseminated form. However, because video of the event is going to be posted online along with all of the PowerPoint slides, these presentations were made with an awareness that they were going to be disseminated online and so a brief digest does not strike me as a problem. With that said, what I am writing here represents the work of others, which I will cite appropriately.

The workshop convener, Elias Muhanna, began by introducing what he called "digital tools with little to no learning curve." These included several text databases and the aggregate dictionary page al-mawrid reader (which is apparently totally and completely in violation of every copyright law on the books). Then there were sources for collections of unsearchable PDFs (one of which is pretty much the only resource on this list that isn't violating copyright law in some way), sources for searchable digital libraries of classical Arabic texts (including al-jami' al-kabīr, a database whose special feature is mostly not functioning and mostly not being installable), and various databases used by computational linguists, which are the best bets for modernists looking for material.

With respect to all of these, the question of how the texts are entered is a bit of a mystery. Some are rekeyed from editions, some are scanned as PDFs, and some are run through OCR; and even though OCR can be up to 99% accurate, that still translates into a typo every hundred characters, which is not ideal. Regardless of the technology used to get these texts into these databases, copyright was raised again and again as an issue surrounding the use of these tools, and the current state of play appears to be somewhere between the Wild West and don't-ask-don't-tell.
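The arithmetic behind that worry is easy to sketch (the characters-per-page figure is my own assumption, not one given at the workshop):

```python
# At 99% character-level accuracy, errors still accumulate fast.
# Assume roughly 1,800 characters on a typical printed page.
chars_per_page = 1800
accuracy = 0.99
typos_per_page = round(chars_per_page * (1 - accuracy))
print(typos_per_page)  # about 18 typos on every single page
```

Multiply that across a multi-volume chronicle and "99% accurate" starts to sound a lot less reassuring.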

A few sample searches were run to demonstrate what these tools might be used for — occurrences of the phrase allahu a'lam to gauge epistemological humility (I'm not totally sure about the reliability of the one to gauge the other, but never mind) and an Arabic proverb I did not know previously about mongoose farts (fasā baynahum al-ẓaribān) to illustrate a search to determine how a saying might be used, whether purely for grammatical or sociologically illustrative purposes (as this one apparently is) or whether it occurs within a wider discourse.


These text collections were a segue into Maxim Romanov's presentation on the difference between text collections and text corpora and the desiderata for the creation of the latter.

Text collections are what already exist. They are characterized by the following traits:

   Reproduce books (technically DBs, but they don’t function as DBs)
   Book/Source is divided into meaningless units of data, such as “pages”
   Limited, ideologically biased (Shamela is open but its BOK format is obscure)
   Not customizable  (users cannot add content)
   Limited search options
   Search results are impossible to handle (have to have your own system on top of the library system)
   No advanced way for analyzing results (no graphing, mapping)
   No ability to update metadata

Textual corpora are what we need to be creating. They are characterized by the following traits: 

   Adapted for research purposes (open organization format)
   Book/Source is divided into meaningful units of data (such as “biographies” for a biographical collection, “events” for chronicles, “hadith reports” for hadith collections)
   Open and fully customizable
   Complex searches (with filtering options)
   Results can be saved for later processing (multiple versions, annotations, links)
   Visualizations of results
   Easy to update metadata


Elli Mylonas gave an introduction to the idea of textual markup, which was the piece that was the most general and most theoretical of the day. She raised a number of interesting issues.

One was the question of how archivable digital data can be, and she made the case that XML files are not quite as good as acid-free paper in a box, but are basically the digital equivalent. XML is a standard language and it is text-based, and so it should be readable on future technologies, whatever they might be.

She then made the case that text markup is a form of textual interpretation; and when somebody asked a question that was predicated on his being okay with the status quo in which some people do programming and some people analyze texts, she replied that marking up a text for XML really forces you to think more carefully about both the structure and the content of the text; it's not an either-or proposition. This is not a case where science is trying to impose itself upon the humanities (ahem, quantum medievalism) but rather supplement it methodologically.
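As a toy illustration of that point (an example of my own invention, not one from the talk): even in a tiny TEI-flavored snippet, deciding which strings get tagged as names of people or places is already an act of interpretation, and that machine-readable structure is what makes the text queryable afterwards.

```python
import xml.etree.ElementTree as ET

# An invented, TEI-flavored snippet: the tags encode an interpretation
# (this string is a person's name, that one a place name).
snippet = """<p>
  <persName>Mary Shelley</persName> drafted the novel near
  <placeName>Geneva</placeName>.
</p>"""

root = ET.fromstring(snippet)
print([el.text for el in root.iter("persName")])   # ['Mary Shelley']
print([el.text for el in root.iter("placeName")])  # ['Geneva']
```

The person doing the tagging had to decide that "Mary Shelley" is a person and "Geneva" a place — trivial here, but exactly the kind of call that provokes the graduate-student arguments she described.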

One important distinction is between markup and markdown. The latter is a lighter-weight, plain-text rendering of, well, text, that allows it to be exported into a variety of schemas. (I think?) Markdown is less rule-bound and more idiosyncratic, which means that it is less labor-intensive but potentially less widely useful in the absence of a really robust tagging scheme.

She showed a few examples of marked-up texts, including the Shelley-Godwin archive, which has Mary Shelley's notebooks marked up to show where her handwriting occurs and where her husband Percy's does, as a way of trying to put to rest the question of who really wrote Frankenstein; a Brown project on the paleographic inscriptions of the Levant that, she told us, provokes an argument between every new graduate student worker and the PI over how to classify the religious assignation of the inscriptions (see? interpretation!); and the Old Bailey Online, in which you can search court records by crime and punishment.

The one difficulty for Islamic Studies is that XML comes out of the European printing and typesetting tradition and is therefore not natively suited to Arabic and other right-to-left languages.

Maxim Romanov then gave a practical introduction to one element of text markup, namely regular expressions, a way of creating customized searches within digitized text.

He shared two web sites with some basic instructions and options to practice.

One example of a regular expression is this: if I wanted to find all the possible transliterations of قذافي (the surname of the deposed Libyan dictator) in a searchable text corpus, I would type [QGK]a(dh?)+a{1,2}f(i|y) as my search term. This would look for any word that began with Q, G, or K, then had an a, then a d possibly followed by an h (with that combination possibly repeated), then one or two a's, then an f, and finally either an i or a y. There was much practicing and many exercises, and that's really all I have to say about that. (Except that this cartoon and this one suddenly make a lot more sense.)
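To sanity-check my notes, here is a minimal Python sketch of that search (the list of candidate spellings is my own invention, purely for illustration):

```python
import re

# The pattern as I noted it down: common Latin transliterations
# of the surname, e.g. Qadhafi, Gaddafi, Kadafi.
pattern = re.compile(r"[QGK]a(dh?)+a{1,2}f(i|y)")

# Invented candidates, just to exercise the pattern.
candidates = ["Qadhafi", "Gaddafi", "Kadafi", "Qadhdhaafi", "Gadafy", "Tripoli"]
print([name for name in candidates if pattern.search(name)])
# prints every spelling except "Tripoli"
```

The repeated group (dh?)+ is what lets one pattern absorb "d", "dd", "dh", and "dhdh" alike, which is most of the transliteration chaos right there.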

Finally, a representative from Brill came to give a brief presentation, and mostly people gave him a hard time about cost and open access. I felt a bit sorry for the man, but not at all sorry for the company that put him in that position. I mean, really. What was Brill expecting from a DH workshop?

More tomorrow, beginning bright and early.


  1. Welcome to down the rabbit hole!

  2. Well, I'm still peering in, I think, but I'll be down it and playing chess with flamingoes soon enough.