Wasting my time...

One of the most irksome aspects of working in computational biology is how frustrating it can be to analyze other people's data (OPD) [1]. By OPD, I don't mean quickie files generated for personal use; rather, I'm talking about datasets ostensibly provided so that other folks can build upon or, at the very least, replicate published work. I'm talking about anything from supplementary material included with papers and/or software to big taxpayer-funded public databases.

Here's a typical scenario: I need to combine two or more pieces of data, such as a list of human disease-associated variants identified in a study with some database of previously published variant associations. Conveniently, both datasets use the same format for identifying variants, which means this should boil down to finding the overlap between a particular column in each of the tables. This shouldn't take more than five minutes, right?
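
In an ideal world, the whole thing is a one-line join. A minimal sketch of what I mean, with made-up file and column names (here, pandas and a shared 'variant_id' column), looks something like this:

```python
import pandas as pd

# Hypothetical inputs: a study's variant table and a reference database,
# both keyed on the same variant identifier (e.g., an rsID).
study = pd.read_csv("study_variants.tsv", sep="\t")
database = pd.read_csv("variant_database.tsv", sep="\t")

# The five-minute version: an inner join on the shared identifier column.
merged = study.merge(database, on="variant_id", how="inner")

# Quick sanity check: which study variants didn't match anything?
missing = study[~study["variant_id"].isin(database["variant_id"])]
print(f"{len(missing)} of {len(study)} study variants not found in the database")
```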

Unfortunately, I quickly notice that some proportion of variants aren't being found in the database, even though the referenced origin of said variants is in there. Fifteen minutes of searching reveals that many of these are just typos; the others I'll have to check in more detail. I decide that I'd better write a script that cross-references the references [2] against the variants to catch any further mistakes, but this ends up spitting out a lot of garbage. Some time later, I realize that one of the tables doesn't stick to a consistent referencing style [3], so I can either go through the column and fix the entries manually, or try to write a script that handles all possibilities. A few hours later, I've finally got the association working, minus a dozen or so oddball cases that I'll have to go through one-by-one, only to find out that much of the numeric data I wanted to extract in the first place is coded as 'free text'. Now I'll need to write more code to extract the values I want. However, it's now 7 pm, and this will have to wait until tomorrow.
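
For illustration only (the citation formats and helper names here are made up), the sort of cleanup code I end up writing looks roughly like this: one function to coerce inconsistent referencing styles into a single key, and another to dig numeric values out of free-text fields:

```python
import re

def normalize_reference(ref):
    """Collapse citations like 'Smith 2012', 'Smith et al. 2012', or
    '[first name] 2012' into a single 'lastname_yyyy' key (best effort)."""
    ref = ref.strip().lower()
    ref = re.sub(r"\bet al\.?", "", ref)  # drop 'et al.'
    match = re.search(r"([a-z\-']+)[^0-9]*((?:19|20)\d{2})", ref)
    if not match:
        return ref  # leave oddball cases for manual review
    return f"{match.group(1)}_{match.group(2)}"

def extract_first_number(text):
    """Pull the first numeric value out of a free-text field,
    e.g. 'OR = 1.8 (95% CI 1.2-2.6)' -> 1.8."""
    match = re.search(r"[-+]?\d*\.?\d+", text)
    return float(match.group()) if match else None

print(normalize_reference("Smith et al. 2012"))           # smith_2012
print(extract_first_number("OR = 1.8 (95% CI 1.2-2.6)"))  # 1.8
```

Of course, by the time you've handled every referencing style someone dreamed up, the 'five-minute' task has eaten an afternoon.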

I've encountered this sort of problem many, many times when working with scientific data. Why are we so tolerant of poorly formatted, error-ridden, undocumented datasets? Or, perhaps a more appropriate question is why don't scientists have more respect for each other's time? Is it more reasonable for the dataset generator to spend a little bit of time checking the reliability of their data programmatically, or for each person who downloads the data to waste hours (or days) working through typos and errors?

I get it: after spending months writing and rewriting a manuscript, rarely do you feel like spending a lot of time polishing the supplementary materials. Mistakes happen simply because you're in a rush to get a draft out the door. On the more cynical side, I have also been told that making it easier for people to use my data simply isn't worth my time. Neither of these considerations explains errors found in public databases, however.

I don't have a solution to the problem, but I'm pretty sure that the root cause is one of incentives: that is to say, there are few professional incentives for making it easier for your colleagues (competitors) to replicate and/or build upon your work. Perhaps we need a culture shift towards teaching better values to students or, more realistically, we need journals to actually require that data follow minimal standards, perhaps including a requirement that mistakes in supplementary tables be fixed when they're pointed out by downstream users.


[1] Who's down with OPD? Very few folks, I'm afraid.

[2] Cross-referencing has always struck me as the lamest, most overused 'nerd word' on TV. I cross-reference all the time, but I think this is the first time I've actually referred to it as such.

[3] e.g., [First author's first name] YYYY. I wish I was making this up.