Saturday, October 18, 2014

Textual Corpora and the Digital Islamic Humanities, Day 2


Following up on the Qaddafi-hunt by regular expression of day 1 of the workshop on digital Islamic humanities, here is Maxim Romanov, demonstrating a regular expression to search for terms that describe years within a text corpus that hasn't been subjected to Buckwalter transliteration but is rather in the original Arabic script.
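
For the curious, a pattern along these lines (my reconstruction, not Maxim's actual expression) can hunt for year formulas directly in Arabic script. It assumes, for simplicity, that years appear as the word سنة followed by digits; a production pattern would also have to alternate over spelled-out number words, which is how most classical texts actually write dates:

```python
import re

# Toy reconstruction of a year-hunting regex over unromanized Arabic text.
# Assumes years appear as the word سنة ("year") followed by Arabic-Indic
# or ASCII digits; real classical texts usually spell numbers out in words.
year_pattern = re.compile(r'سنة\s+([٠-٩0-9]+)')

sample = 'ثم توفي سنة ٣١٠ رحمه الله'
for match in year_pattern.finditer(sample):
    print(match.group(1))  # prints: ٣١٠
```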

More below the jump:
Three major topics were covered on day 2.

Scripting. Maxim Romanov gave a basic overview of and introduction to scripting and the automation of repetitive tasks, such as downloading thousands of things from the web, converting text formats, and conducting complex searches (by regular expressions).
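
As a concrete (and entirely hypothetical) illustration of the downloading case: a dozen lines of Python can fetch a numbered run of corpus files that would take days to click through by hand. The URL here is a placeholder, not a real corpus site:

```python
import urllib.request
import urllib.error

# Fetch a numbered run of plain-text files; the base URL is a placeholder.
BASE = 'https://example.org/corpus/text_{:04d}.txt'

for i in range(1, 1001):
    try:
        with urllib.request.urlopen(BASE.format(i)) as resp:
            text = resp.read().decode('utf-8')
    except urllib.error.URLError:
        continue  # skip gaps in the numbering or unreachable files
    with open(f'text_{i:04d}.txt', 'w', encoding='utf-8') as out:
        out.write(text)
```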

This is a screengrab from the sample timeline we built.
This wasn't my entry but pretty well captured the Zeitgeist of the workshop.
The preferred scripting language among this crowd of presenters was Python, in no small measure because it is named after Monty Python, but also because it is very straightforward. Maxim illustrated some of the possibilities with Python by walking us through one of his research questions, which concerned the chronological coverage of certain historical sources; in other words, how much attention do certain time periods get relative to others? He demonstrated the methods he used for capturing date information from a very large amount of text by automating specific queries with scripts, and then processing the data so it could be output as an easily readable graph. Conference organizer Elias Muhanna emphasized that this was an example of how digital and computational methodologies are not replacements for analysis; they demand quite a lot of good, old-fashioned philological hard-nosedness, but offer different tools for exploring and expressing it. This is a way of simply speeding up and scaling up what we are already doing.
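
To make the workflow concrete, here is a toy sketch of the general approach (my gloss, not Maxim's actual code): extract year mentions, bucket them into fifty-year periods, and print counts ready for graphing. It reuses the simplifying assumption from above that years appear as سنة plus digits:

```python
import re
from collections import Counter

year_re = re.compile(r'سنة\s+([0-9]{1,4})')  # simplifying assumption: digits

def coverage(texts, bin_size=50):
    """Count year mentions per bin_size-year period across a corpus."""
    counts = Counter()
    for text in texts:
        for match in year_re.finditer(text):
            year = int(match.group(1))
            counts[(year // bin_size) * bin_size] += 1
    return counts

corpus = ['ولد سنة 150 وتوفي سنة 204', 'وكان ذلك سنة 158']
for start, n in sorted(coverage(corpus).items()):
    print(f'{start}-{start + 49}: {n}')  # e.g. 150-199: 2
```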

We then had a brief presentation from one of the researchers on the Early Islamic Empire at Work project, who showed us how his team is creating search tools for their corpus; these will be made publicly available in December as the Jedli toolbox and will include various types of color-codeable, checklist- and keyword-based searching. One of the major takeaways from this presentation was the idea that by being able to edit open-source code and program things, it's possible to build upon existing work to make things do specifically what any given researcher wants them to.
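
Jedli itself wasn't yet public at the time of writing, so purely as a sketch of the principle behind checklist-style searching (none of this is Jedli's actual code or interface): run a checklist of keywords over each text and report which ones hit.

```python
import re

# Toy checklist search: flag which keywords from a checklist occur in each
# text, the basic building block behind checklist-based corpus search tools.
def checklist_search(texts, keywords):
    hits = {}
    for name, text in texts.items():
        hits[name] = [kw for kw in keywords if re.search(kw, text)]
    return hits

texts = {'doc1': 'ذكر الخليفة أمر الوزير', 'doc2': 'قدم التاجر من البصرة'}
checklist = ['الخليفة', 'الوزير', 'التاجر']
print(checklist_search(texts, checklist))
# {'doc1': ['الخليفة', 'الوزير'], 'doc2': ['التاجر']}
```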

This raised the question of citation, which, judging by the comments made in response to the question (which I asked), seems to be a total wild west. One of the participants with quite a lot of programming experience said that citing someone else's code would be like citing a recipe when you make dinner for friends, and other participants and presenters said that if you were using something really extraordinary from somebody else's project, you might mention that. However, Elli Mylonas disagreed, arguing that correct citation of existing work is one of the ways that the digital humanities can gain traction within the academy as legitimate scholarship that counts at moments like tenure review, rather than languishing in the same manner as the catalogues and indices that we all rely upon but don't view as having been built by proper "scholars." I would tend to think she's right.

Timelines. Then Elli Mylonas introduced us to various timeline programs. Like yesterday, her presentation was grounded in the theory, the wherefores, and the big issues behind DH. She started out with the assertion that "timelines lie": any timeline looks objective but is, in fact, encoding a historical argument made by the researcher who compiled and presented it. (I think this has an interesting parallel with narrative, footnoteless or minimally-footnoted writing such as A Mediterranean Society (which has loads of footnotes but leaves a lot out, too), which in effect encodes a massive amount of historical argumentation within something that simply reads as text.)

Important things to look for in choosing a timeline program are: the ability to represent spans of time rather than just single points, the exportability of data, and the ability of the program to handle negative dates (again, encoding an argument about the notions of temporality and the potentiality of time). A free timeline-generating app is Timeline JS, which works with Google spreadsheets. That is the one that we tested out as a group. We also looked at Tiki-Toki, which is gorgeous but requires a paid subscription. (Definitely worth looking into whether one's institution has an institutional subscription.)

This is the spreadsheet behind the timeline shown above. There is also a column to the right that allows the inclusion of tagging information that separates items into specific meaningful bands or categories, as represented by the stripes at the bottom of the timeline screenshot.
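
For reference, a hedged sketch of what such a spreadsheet reduces to as data. The column names follow the published TimelineJS template (Year, End Year, Headline, Text, Group), with the Group column doing the banding described above, though the exact set should be checked against the current template:

```python
import csv

# Rows for a TimelineJS-style spreadsheet; the 'Group' column sorts events
# into the colored bands at the bottom of the rendered timeline.
rows = [
    {'Year': '1258', 'Headline': 'Fall of Baghdad',
     'Text': 'Mongol armies take the Abbasid capital.', 'Group': 'political'},
    {'Year': '1325', 'End Year': '1354', 'Headline': 'Travels of Ibn Battuta',
     'Text': 'From Morocco to China and back.', 'Group': 'travel'},
]

fields = ['Year', 'End Year', 'Headline', 'Text', 'Group']
with open('timeline.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fields, restval='')
    writer.writeheader()
    writer.writerows(rows)
```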

Maxim Romanov suggested that this might be a useful tool for something like revisiting the chronology in Marshall Hodgson's Venture of Islam. 

Finally, we looked at Orbis, Stanford's geospatial model of the Roman world, which models travel through the Roman empire in terms of time and cost. This is a feasible project because of the wealth of data and the relative uniformity of roads, resources, and prices within the Roman empire; it would have to be modified to deal with most of Islamicate history. (This brings to mind the question of the extent to which Genizah sources, as a fairly coherent(ish) corpus, can be used to extrapolate for the rest of the Islamic world rather than just the Jewish communities within it; if so, that might be a feasible data set for this kind of processing. Really not my problem, though.) This was a perfect segue into the final topic of the day.

Geographic information systems. This piece was presented by Bruce Boucek, a social sciences librarian at Brown who trained as a cartographer. He gave an overview of data sources and potential questions and problems, and then Maxim Romanov gave a final demonstration of how geographic imaging can be used to interrogate medieval geographic descriptions and maps.

By aligning the latitude and longitude information from a modern map with the cities marked on a medieval one (or simply by making the coordinates conform on a less contemporary modern map of unknown projection or questionable scale) and observing the distortion of the medieval map when it was made to conform to the modern one, we began to see what kind of view of the world the mapmaker, in this case Muqaddasi, held. What was he making closer and farther away than it really was? What kind of schematic does that yield?
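
The underlying operation is georeferencing: fit a transform from control points on the old map to modern coordinates, then read the leftover error as the mapmaker's distortion. Here is a minimal least-squares sketch of that idea (an illustration, not the actual tool used in the demo; the pixel and coordinate values are made up):

```python
import numpy as np

# Control points: pixel positions of four cities on the old map, and the
# same cities' modern longitude/latitude (all values are placeholders).
pixels = np.array([[120, 340], [410, 300], [260, 90], [180, 150]], float)
lonlat = np.array([[44.4, 33.3], [51.4, 35.7], [39.8, 21.4], [36.3, 33.5]], float)

# Fit the 3x2 affine matrix A minimizing |[x, y, 1] @ A - lonlat|^2.
X = np.hstack([pixels, np.ones((len(pixels), 1))])
A, *_ = np.linalg.lstsq(X, lonlat, rcond=None)

# Residuals: how far each city lands from its true position under the best
# affine fit, i.e. the distortion the cartographer built into the map.
print(lonlat - X @ A)
```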

Capturing latitude and longitude info on a 19th-c map.
Aligning the coordinates onto a contemporary satellite map.

An example with more extreme distortion, where cities on a medieval map are lined up against the geographical coordinates of the same cities on the satellite map. Such a rendering demonstrates that the cartographer was more interested in relationships between cities beyond simple distance, and it raises questions about what his specific priorities in mapmaking were.

***

And that's that. Video and a link library should be up online at the workshop web site, and one of my colleagues storified all of the tweets from the conference. I'll probably write another post or two in the coming week reflecting on how I might begin using some of these tools and methods as I finish up the book and start work on a second project.
