Wednesday, February 06, 2013

First Contact


It seems like I've had many versions of this conversation in the last few months, as new projects begin to ramp up:
Client: I want to do something cool to publish my work.

Developer: OK. Tell me what you'd like to do.

Client: Um. I need you to to tell me what's possible, so I can tell you what I want.

Developer: We can do pretty much anything. I need you to tell me what you want so I can figure out how to make it.
Almost every introductory meeting with a client/customer starts out this way. There's a kind of negotiation period where we figure out how to speak each other's language, often by drawing crude pictures. We look at things and decide how to describe them in a way we both understand. We wave our hands in the air and sometimes get annoyed that the other person is being so dense.

It's crucially important not to short-circuit this process though. You and your client likely have vastly different understandings of what can be done, how hard it is to do what needs to be done, and even whether it's worth doing. The initial negotiation sets the tone for the rest of the relationship. If you hurry through it, and let things progress while there are still major misunderstandings in the air, Bad Things will certainly happen. Like:
Client: This isn't what I wanted at all!

Developer: But I built exactly what you asked for!

Thursday, April 26, 2012

You did _what_? Form-based XML editing

My most recent project has been building an editing system for APIS (the Advanced Papyrological Information System—I know, I know) data as part of the Papyrological Editor (PE). APIS records are now TEI files. They started out using a text-based format that was MARC-inspired (fields, not structure), but I converted them to TEI a while back to bring them in line with the other papyri.info data. HGV records (also TEI) already had an editing system in the PE, so it seemed like the task of adding an APIS editor, which would be a simpler beast, wouldn't be that hard, just a matter of extending the existing one. In fact, it was an awful struggle.

Partly this is my fault. The PE is a Rails application, running on JRuby, and my Rails knowledge was pretty rusty. I loved it when I was working on it full time a few years ago, but coming back to it now, and working on an application someone else built, I found it obtuse and mentally exhausting. I had a hard time keeping in my head the different places business logic might be found, and figuring out where to graft on the pieces of the new editor was a continual fight against coder's block. The HGV editor relies on a YAML-based mapping file that documents where the editable information in an HGV document is to be found. This mapping is used to build a data structure that is used by the View to populate an editing form, and upon form submission, the process is reversed.

It's not at all unlike what XForms does, and in fact I was repeatedly saying to myself "Why didn't they just use XForms?" I got annoyed enough that I took some time to look at what it would take to just plug XForms into the application and use that for the APIS editor. The reluctant conclusion I came to was that there just aren't any XForms frameworks out there that I could do this with. And the XForms tutorials I was looking at didn't deal with data that looked like mine at all. TEI is flexible, but not all that regular, and every example I saw dealt with very regular XML data. Moreover, the only implementation I could find that wasn't a server-side framework (and I wasn't going to install another web service just to handle one form) is XSLTForms. The latter is impressive, but relies on in-browser XSLT transforms, which is fine if you have control of the whole page, but inconvenient for me, because I've already got a framework producing the shell of my pages. I just wanted something that would plug into what I already had. A bit sadder but wiser, I decided the team who built the HGV editor had done what they had to given what they had to work with.

Then, about a month ago, I got sick. Like bedridden sick. I actually took 3 days off work, which is probably the most sick leave I've taken since my children were born. While I was lying in bed waiting for the antibiotics to kick in, I got bored and thought I'd have a crack at rewriting the APIS editor. In Javascript. There was already a perfectly good pure-XML editing view, where you can open your document, make changes, and save it back (having it validated etc. on the server), so why not build something that could load the XML into a form in the browser, and then rebuild the XML with the form's contents and post it back? Doesn't sound that hard. And so that's what I did. I should say, this seemed like a potentially very silly thing to do, adding another, non-standard, editing method that duplicates the functionality of an existing standard to a system that already has a working method. I'd spent a lot of time on figuring out how to refactor the existing editor because that seemed like the Right Thing To Do. Being sick, bored, and off the clock gave me the scope to play around a bit.

Let me talk a bit more about the constraints of the project. Data is stored in XML files in a Git repository. This means that not only do I want my edits to be sending back good XML, I want it to look as much like the XML I got in the first place, with the same formatting, indentation, etc., so that Git can make sensible judgements about what has actually changed. I might want some data extracted from the XML to be processed a bit before it's put into the form, and vice versa. For example, I have dates with @when or @notBefore/@notAfter attributes that have values in the form YYYY-MM-DD, but I want my form to have separate Year, Month, and Day fields. Mixed (though only lightly mixed) content is possible in some elements. I need my form to be able to insert elements that may not be there at all in the source. I need it to deal with repeating data. I need it to deal with structured data (i.e. an XML fragment containing several pieces of data should be treated as a unit). And of course, it needs to be able to put data into the XML document in such a way that it will validate, so the order of inserted elements matters. Moreover, I need to build the form the way the framework expects it to be built, as a Rails View.

So the tool needs to know about an instance document:

  1. how to get data out of the XML, maybe running it through a conversion function on the way out
  2. how to map that data to one or a group of form elements, so there needs to be a naming convention
  3. how to deal with repeating elements or groups of elements
  4. how to structure the data coming out of the form, possibly running it through a function, and being able to deal with optional sub-structures
  5. how to put the form data into the correct place in the XML, so some knowledge of the content model is necessary
  6. how to add missing ancestor elements (maybe with attributes) for new form data
Basically, it needs to be able to deal with XML as it occurs in the wild: nasty, brutish, and (one hopes) short.

Form-based XML editing is one of those things that sounds fairly easy, but is in fact fraught with complications. It's easy to get data out of XML (XPath!), and it's easy to manipulate an XML DOM, removing and adding elements, attributes, and text nodes. But it's actually quite hard to get everything right, and to make the XML's formatting stay consistent. In my next post, I'll talk about how I did it.

Tuesday, March 20, 2012

How to Apologize

The latest regrettable spasm of sexism in the programming world played out this afternoon, as a company called Sqoot's announcement of a hackathon caused said event to implode before it ever began by including the infuriating and insensitive line under "perks" [Update: just to be clear, in the context of the original page, it was clear that the presence of women serving beer was one of the perks for attendees]:
Women: Need another beer? Let one of our friendly (female) event staff get that for you.
Gag. Sqoot fairly quickly realized they had walked into a buzzsaw, as lots of people called them on it, and their sponsors started pulling their support. It's rather nice to see that kind of quick, public reaction. Cloudmine's blog post about it particularly impressed me. Squoot issued an apology fairly swiftly, which I quote below:
Sqoot is hosting an API Jam in Boston at the end of March. One of the perks we (not our sponsors) listed on the event page was:

“Women: Need another beer? Let one of our friendly (female) event staff get that for you.”

While we thought this was a fun, harmless comment poking fun at the fact that hack-a-thons are typically male-dominated, others were offended. That was not our intention and thus we changed it.

We’re really sorry,

Avand & Mo
This didn't do much for a lot of people, but it got me thinking about apologies in tech in general, since they are actually crucial moments in the interaction between you and your customers/audience. When I worked at Lulu, Bob Young used to say that whenever you screw up, it's actually a tremendous opportunity to win a customer's loyalty by making it right. This applies both to small screwups (a customer's order never made it) and large ones (you did something that made lots of people mad). It strikes me that in this day and age, when the "non-apology" has become so frequent, people may actually not realize when it isn't appropriate to use conditional or evasive language in apologies. It's one thing if you're worried about being sued and can't admit culpability, or if you're someone like Rush Limbaugh, who's presumably concerned about appearing to back down in front of his audience. But if you're actually intent on repairing the damage done to your relationship with your customer or your audience, you need to be able to apologize properly.

So what are the elements of a good apology?

  1. I hear you.
  2. I am truly sorry.
  3. (semi-optional, depending on what happened) This is what went wrong.
  4. I am doing x to make sure this doesn't happen again and y to make it right with you.
  5. Thank you. I appreciate the feedback.
#1 is crucial. The person or group you're addressing has to know that you've heard their complaint and understand it. Apologies that lack this element sound cold and disconnected. And this is the main problem with Sqoot's "others were offended."  They aren't speaking to the people they offended. This is just guaranteed to further piss people off.

#2 should be unconditional. Not "I'm sorry if you were offended." Indeed, if you find yourself pushing the focus onto the people whom you pissed off at all, you may be sliding into non-apology territory. This isn't about them—they're mad because you made them mad. Note that a good apology is not defensive, and does not attempt to shift the blame, even if that blame belongs to an employee whom you've just fired.  If you did that, it's part of #4, the "how I'm fixing it" part, not the "I'm sorry" part. Don't try to save face in a genuine apology. Indicating that you meant no harm is fine, but if you're apologizing, it means you caused harm regardless of your intent.

#3 is a bit more tricky. People want to know how this could have happened, but it doesn't do to dwell on it too much, and this is another mistake Sqoot makes. They probably shouldn't quote the line that made everyone mad (it will make the readers mad all over again). It would have been enough to say they put something stupid and sexist into an event page which they now regret. On the other hand, you do have to acknowledge what happened and not look like you're trying to dodge it. So don't go into excruciating detail about what went wrong with a customer's order, for example. "I'm afraid you found a bug in our shopping cart" is probably enough detail. Sqoot's apology does this really badly: they explain exactly what they did, how it happened (we thought it was funny, because we're aware that these things are mostly male), and then contrast the "others" (who lack their sense of humor) who were offended. Explaining how you messed up does not mean defending yourself, and defending yourself in an apology must be handled delicately, or you look like an ass.

#4 Fix it, if you can. "We're refunding your order immediately and giving you a coupon", "I shall be entering rehab tomorrow morning", "We're donating $$ to x charity".

#5 Reconnect. When you screw up, people are paying very close attention to you, and it's an opportunity to show that you're a stand-up company/organization/person. You stand to win greater loyalty and affection by handling the problem effectively. The people who are complaining (assuming they are correct, of course) are helping you by showing you where you're wrong, or at least showing you a different perspective. Squoot "signing" their apology is actually good, in this case, because it indicates the founders (I presume that's who they are) are taking responsibility. It's too bad they flubbed the middle bit.

Sunday, March 04, 2012

A spot of mansplaining

This is bit of rambling, responding to Bethany Nowviskie's terrific "Don't circle the wagons", itself a response to Miriam Posner's "Some things to think about before you exhort everyone to code".

I'm a middle-aged, white, male programmer, so that's where I'm coming from. I can't help any of that, but doubtless it colors my perspective.

First, I have to say that the idea of coding being associated with prestige (as it seems now to be in DH) is rather foreign to my experience, but the rise of the brogrammer is probably a sign that in general it's not such a marginal activity anymore. These guys would probably have gone and gotten MBAs instead in years past.

DH is slightly uncomfortable territory for programmers, as I've written in the past, at least it is for people like me who mostly program rather than academics who write code to accomplish their ends. I speak in generalities, and there are good local exceptions, but we don't get adequate (or often any) professional development funding, we don't get research time, we don't get credit for academic stuff we may do (like writing papers, presenting at conferences, etc.), we don't get to lead projects, and our jobs are very often contingent on funding. All this in exchange for about a 30% pay cut over what we could be earning in industry. There are compensations of course: incredibly interesting work, really great people to work with, and often more flexible working conditions. That's worth putting up with a lot. I have a wife and young kids, and I'm rather fond of them, so being able to work from home and having decent working hours and vacation time is a major bonus for me.

None of that in any way accounts for the gender imbalance that Miriam is highlighting, though it does perhaps work as a general disincentive (in academia) to learn to code too well. I'd also say that there is nothing that can make you feel abysmally stupid quite like the discipline of programming. Errors as small as a single character (added or omitted) can make a program fail in any number of weird ways. I am frequently faced with problems that I don't know ahead of time I'll be able to solve. Ostensibly hard problems may be dead simple to fix, while simple tasks may end up taking days. It keeps you humble. But I would say that if you're the lone woman sitting in a class/lecture/lab, and you're feeling lost, you're not the only one, and it has nothing at all to do with your gender.

As Bethany cautions, please, please don't circle the wagons. It's my contention that most of the offensive things about programmer culture are not intentional nor deeply ingrained but are actually artifacts of the gender/race imbalance.

[As an aside, I was interested in Miriam's remarks about codeacademy. I started working through it with my daughter a couple of weeks ago, and she was finding it incredibly frustrating. It does not fit her learning style at all. She, like me, needs to know why she has to type this thing. She finds being told to just type var myName="Grace" without any context stupid. In the end we gave up and started learning how to build web pages, and I'll reintroduce Javascript in that context.]

Programmer culture is exclusionary though. Undergraduate CS programs have "weed out" courses. I've actually taken a couple of these, and the first one did weed me out—it managed to be hard and extremely boring at the same time. I only came back to it years later. This gets at an aspect of programmer culture though, a sort of "are you good enough?" ethic. It's not without foundation—a lot of people who self-identify as programmers can't program—but it also means that when you start to work with a new group, there's often a kind of ritual pissing contest where you have to prove your worth before you're treated as an equal. This kind of thing is irritating enough on it's own, and it's easy to imagine it taking on sexist or racist overtones.

Programming also tends to squeeze out older folks. Actual age discrimination does happen, but it's also because staying current means almost totally rebooting your skillset every few years. The Pragmatic Programmer book recommends learning a new language every year, and this is crucial advice (my own record is more like one every 18 months or so, but that's been enough so far). If you let yourself get comfortable and coast, or go into management, your skills are going to be close to worthless in about 5 years. And, while your experience definitely gives you an edge, you're not guaranteed to have the best solution to any given problem. The 22-year-old who read something on Hacker News last night might have found an answer that totally blows your received wisdom out of the water.

[As another aside, the speed at which skills go stale means that any organization that doesn't invest in professional development for their programmers is being seriously stupid. Or they expect their devs to leave and be replaced every few years.]

The upshot is that the population of programmers not only skews male, it skews young. Put a bunch of young men together, particularly in small groups that are under a lot of pressure (in a startup, for example), and you get the sorts of tribal behaviors that make women and anyone over 35 uncomfortable. There's not just misogyny, there's hostility towards people who might want to have a life outside of work (e.g. people who have spouses and kids and like spending time with them). And this is both a cause of sexual/racial/age imbalance and a result. It's a self-reinforcing cycle.

But there isn't really one monolithic "coder culture", there are lots of them. Every company and institution has its own culture. Teams have cultures. Basically, any grouping of human beings is going to develop its own set of values and ways of doing things. It's what people do naturally. The leaders of these groups have a lot to do with how those cultures develop, but they aren't necessarily in any position to remedy imbalances.

Once you're in a position to hire people, you realize that hiring good developers is hard. In any pool of applicants, you're going to have a bunch of totally unsuitable people, a few maybes, and if you're lucky, one or two gems (or people you think might become a gem with a little polishing). Are any of these going to be women? Not unless you're really lucky, because the weeding out has already done its work. So once you're in a position to decide who gets hired, it's too late to redress any imbalance. The imbalance is not because leaders don't want to hire women, it's already there.

The only answer I can see is to get a lot more women into programming. If the CS curricula can't do it, maybe new modes like DH can. From what I've seen the gender balance in DH, while still not great, is a lot less ruinous than in programming in general, and a CS degree is far from the only road into a programming career (I don't have one). I think the cycle can be broken. I don't think there's a lot of deeply ingrained misogyny or racism in coding culture. Rather, it's a boys club because it happens to be mostly boys. If there were more women it would adjust. And I don't think that (mostly) it would put up much resistance to an influx of women. So circling the wagons is the exact opposite of what needs to be done.

Monday, November 14, 2011

TEI in other formats; part the second: Theory

In my first post on this subject, I poked a bit at how one might represent TEI in HTML without discarding the text model from the TEI document. Now I want to talk a bit more about that model, and the theory behind it. I may at the end say irritable things about Theory as well, if you can stand it until the end. I humbly beg the reader's pardon.

Let's look at the same document I talked about last time: http://papyri.info/ddbdp/p.ryl;2;74/source (see also http://papyri.info/ddbdp/p.ryl;2;74/). We can visualize the document structure using Graphviz and a spot of XSLT (click for high-res):

tree structure of a TEI document

It's a fairly flat tree. As an XML document, it has to be a tree, of course, and TEI leverages this built-in "tree-ness" to express concepts like "this text is part of a paragraph" (i.e. it has a tei:p element as its ancestor). In line 1, for example, we find

<supplied reason="lost">Μάρ</supplied>κος
meaning the first three letters of the name Markos have been lost due to damage suffered by the papyrus the text was written on, and the editor of the text has supplied them. The fact that the letters "Μάρ" are contained by the supplied element, or, more properly that the text node containing those letters is a child of the supplied element, means that those letters have been supplied. In other words, the parent-child relationship is given additional semantics by TEI. We already have some problems here: the child of supplied is itself part of a word, "Markos", and that word is broken up by the supplied element. Only the fact that no white space intervenes between the end of the supplied element and the following text lets us know that this is a word. It's even worse if you look at the tree version, which is, incidentally, how the document will be interpreted by a computer after it has been parsed:


There's no obvious connection here between the first and second halves of the name. And in fact, if we hadn't taken steps to prevent it, any program the processed the document might reformat it so that "Mar" and "kos" were no longer connected. We could solve this problem by adding more elements. As the joke goes, "XML is like violence. If it isn't working, you're not using it enough." We could explicitly mark all the words, using a "w" element, thus: 
<w><supplied reason="lost">Μάρ</supplied>κος</w>
or

which would solve any potential problems with words getting split up, because we could always fix the problem—we would know what all the words are. We could even attach useful metadata, like the lemma (the dictionary headword) of the word in question. We don't do this for a couple of reasons. First, because we don't need to. We can work around the splitting-up of words by markup. Second, because it complicates the document and makes it harder for human editors to deal with, and third, because it introduces new chances for overlap. Overlap is the Enemy, as far as XML is concerned. The more containers you have, the greater the chances one container will need to start outside another, but finish inside (or vice versa). Consider that there's no reason at all a region of supplied text shouldn't start in the middle of one word and end in the middle of another. Look at lines 5-6 for example:
                           ... οὐ̣[χ ἱκανὸν εἶ-]
[ναι εἰ]ς
A supplied section begins in the middle of the third word from the end of line five, and continues for the rest of the line. The last word is itself broken and continues on the following line, the beginning of which is also supplied, that section ending in the middle of the second word on line six. This is a mess that would only be compounded if we wanted to mark off words.

This may all seem like a numbing level of detail, but it is on these details that theories of text are tested. The text model here cares about editorial observations on and interventions in the text, and those are what it attempts to capture. It cares much less about the structure of the text itself—note that the text is contained in a tei:ab, an element designed for delineating a block of text without saying anything about its nature as a block (unlike tei:p, for example). Visible features like columns, or text continued on multiple papyri, or multiple texts on the same papyrus would be marked by tei:divs. This is in keeping with papyrology's focus on the materiality of the text. What the editor sees, and what they make of it is more important than the construction of a coherent narrative from the text—something that is often impossible in any case. Making that set of tasks as easy as possible is therefore the focus of the text model we use.

What I'm trying to get at here is that there is Theory at work here (a host of theories in fact), having to do with a way to model texts, and that that set of theories are mapped onto data structures (TEI, XML, the tree) using a set of conventions, and taking advantage of some of the strengths of the data structures available. Those data structures have weaknesses too, and where we hit those, we have to make choices about how to serve our theories best with the tools we have. There is no end of work to be done at this level, of joining theory to practice, and a great deal of that work involves hacking, experimenting with code and data. It is from this realization, I think, that the "more hack, less yack" ethic of THATCamp emerged. And it is at this level, this intersection, this interface, that scholar-programmer types (like me) spend a lot of our time. And we do get a bit impatient with people who don't, or can't, or won't engage at the same level, especially if they then produce critiques of what we're doing.

As it happens, I do think that DH tends to be under-theorized, but by that I don't mean it needs more Foucault. Because it is largely project-driven, and the people who are able to reason about the lower-level modeling and interface questions are mostly paid only to get code into production, important decisions and theories are left implicit in code and in the shapes of the data, and aren't brought out into the light of day and yacked about as they should be.

Thursday, November 10, 2011

TEI in other formats; part the first: HTML

There has been a fair amount of discussion of late about TEI either having a standard HTML5 representation or even moving entirely to an HTML5 format. I want to do a little thinking "out loud" about how that might work.


Let's start with a fairly standard EpiDoc document (EpiDoc being a set of guidelines for using TEI to mark up ancient documents). http://papyri.info/ddbdp/p.ryl;2;74/source (see http://papyri.info/ddbdp/p.ryl;2;74/ for an HTML version with more info) is a fairly typical example of EpiDoc used to mark up a papyrus document. The document structure is fairly flat, but with a number of editorial interventions, all marked up. Line 12, below, shows supplied, unclear, and gap tags


<lb n="12"/><gap reason="lost" quantity="16" unit="character"/> <supplied reason="lost"> Π</supplied><unclear>ε</unclear>ρὶ Θή<supplied reason="lost">βας καὶ Ἑ</supplied><unclear>ρ</unclear>μωνθ<supplied reason="lost">ίτ </supplied><gap reason="lost" extent="unknown" unit="character"/>

So how might we take this line and translate it to HTML? First, we have an <lb> tag, which at first glance would seem to map quite readily onto the HTML <br> tag, but if we look at the TEI Guidelines page for lb (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html), we see a large number of possible attributes that don't necessarily convert well. In practice, all I usually see on a line break tag in TEI is an @n and maybe an @xml:id attribute. HTML doesn't really have a general-purpose attribute like @n, but @class or @title might serve. On <lb>, @n is often used to provide line numbers, so @title seems logical.


Now <gap reason="lost" quantity="16" unit="character"/> is a bit more of a puzzler. First, HTML's semantics don't extend at all to the recording of attributes of a text being transcribed, so nothing like the gap element exists. We'll have to use a general-purpose inline element (span seems obvious) and figure out how to represent the attribute values. TEI has no lack of attributes, and these don't naturally map to HTML at all in most cases. If we're going to keep TEI's attributes, we'll have to represent them as child elements.  We'll want to identify both the original TEI element and wrap its attributes and maybe its content too, so let's assume we'll use the @class attribute with a couple of fake "namespaces", "teie-" for TEI element names, "teia-" for attribute names, and "teig-" to identify attributes and wrap element contents (the latter might be overkill, but seems sensible as a way to control whitespace). We can assume a stylesheet with a span.teig-attribute selector that sets display:none.


<span class="tei-gap">
  <span class="teig-attribute teia-reason">lost</span>
  <span class="teig-attribute teia-quantity">16</span>
  <span class="teig-attribute teia-unit">character</span>
</span>


Like HTML, TEI has three structural models for elements: block, inline, and milestone. Block elements assume a "block" of text, that is, they create a visually distinct chunk of text. Divs, paragraphs, tables, and similar elements are block level. Inline elements contain content, but don't create a separate block. Examples are span in HTML, or hi in TEI. Milestones are empty elements like lb or br. TEI has several of these, and HTML, which has "generic" elements of the block and inline varieties (div and span) lacks a generic empty element. Hence the need to represent tei:gap as a span.


tei:supplied is clearly an inline element, and we can do something similar to the example above, using span:


<span class="tei-supplied">
  <span class="teig-attribute teia-reason">lost</span>
  <span class="teig-content">Π</span>
</span>


and likewise with unclear:


<span class="tei-unclear">
  <span class="teig-content">ε</span>
</span>


Now, doing this using generic HTML elements and styling/hiding them with CSS could be considered bad behavior. It's certainly frowned upon in the CSS 2.1 spec (see the note at http://www.w3.org/TR/CSS2/selector.html#class-html). I don't honestly see another way to do it though, because, although RDFa has been suggested as a vehicle for porting TEI to HTML, there is no ontology for TEI, so no good way to say "HTML element p is the same as TEI element p, here". Even granting the possibility of saying that, it doesn't help with the attribute problem. And we're still left with the problem of presentation: what will my HTML look like in a browser? It must be said that my messing about above won't produce anything like the desired effect, which for line 12 is something like:
[- ca.16 - Π]ε̣ρὶ Θή[βας καὶ Ἑ]ρ̣μωνθ[ίτ -ca.?- ] 
I could certainly make it so, probably with a combination of CSS and JavaScript, but what have I gained by doing so? I'll have traded one paradigm, XML + XSLT, for another, HTML + CSS + JavaScript. I'll have lost the ability to validate my markup, though I'll still be able to transform it to other formats. I should be able to round-trip it to TEI and back, so perhaps I could solve the validation problem that way. But is anything about this better than TEI XML? I don't think so…

I suspect I'm missing the point here, and that what the proponents of TEI in HTML are really after is a radically curtailed (or re-thought) version of TEI that does map more comfortably to HTML. The somewhat Baroque complexity of TEI leads the casual observer to wish for something simpler immediately, and can provoke occasional dismay even in experienced users. I certainly sympathize with the wish for a simpler architecture, but text modeling is a complex problem, and simple solutions to complex problems are hard to engineer.

Tuesday, June 28, 2011

Humanities Data Curation

Last Thursday, I attended the excellent Humanities Data Curation Summit, organized by Allen Renear, Trevor Muñoz, Katherine L. Walter, and Julia Flanders. I'm still processing the day, which included a breakout session with Allen, Elli Mylonas, and Michael Sperberg-McQueen, who are some of my favorite people in DH.

What I started thinking about today was that we'd skipped definitions at the beginning—there was a joke that Allen, as a philosopher, could have spent all day on that task. But in doing so, we elided the question of what is data in the humanities, and what is different about it from science or social science data.

Humanities data are not usually static, collected data like instrument readings or survey results. They are things like marked up texts, uncorrected OCR, images in need of annotation, etc. Humanities datasets can almost always be improved upon. "Curation" for them is not simply preservation, access, and forward migration. It means enabling interested communities to work with the data and make it better. Community interaction needs to be factored into the data's curation lifecycle.

I feel a blog post coming on about how the Integrating Digital Papyrology / papyri.info project does this...