The sustainability of bioinformatics niche solutions

For about a week or so our Drosophila transcription factor database FlyTF was broken: the CSS needed for rendering the site lives on a separate machine which suffered from some obscure Apache problem. Not a big deal, one might think. The problem is IT support. For Julie from the InterMine team and our systems administrator Paul, it took approximately twenty minutes to identify the issue and come up with a fix. But: It’s not their job.

FlyTF started out as a simple tabulated text file, loaded into MySQL, and served to the world through a rather simplistic Perl-based web site. I was still a postdoc, and after months of curating the scientific literature, writing my first CGI script was a welcome change and a great training opportunity. And people didn’t care that it was simple (and in a way still is): The two FlyTF papers in 2006 and 2010 are both cited more than 60 times according to Google Scholar, and with more than 6,500 unique users and 12,000 visits, I don’t think we’re doing too badly. But, although I feel my work is highly appreciated by the community, those numbers are nothing in comparison to the user base of, say, Ensembl (to stay within the European theatre). FlyTF is a niche product. It was born and kept alive by a mix of personal pride and geekery, and is likely not worth the while in terms of an investment from the funding bodies. Even though FlyTF’s body is slowly dying, its soul will live on! I’m glad that we started collaborating with FlyBase back in 2010 and that most of our annotations are now part of the official FB Gene Ontology assignments. I’m now serving as an advisor to FlyBase, and although I’m probably one of their greatest critics in the way they serve the data (don’t get me started on QueryBuilder), I’m very grateful that gene grouping-specific annotations (such as: all TFs) are something that they can dedicate curation time to. (That probably also addresses some people’s question whether it is likely for me to release another version of FlyTF…)

General tools for the mass market like Ensembl (representative for many of the outgrowths coming from the EBI) or the tools and databases from the NCBI are essential to progress in modern biological science and there is no way around them. There won’t be a single bench scientist these days working in molecular biology who hasn’t used one or the other of them – and that must be a great many, looking at PubMed’s million new articles last year. Building these tools is important and, at least for now, it seems the funding bodies recognise the need for sustained investments into bioinformatics infrastructure. For example, the European ELIXIR programme “unites Europe’s leading life science organisations in managing and safeguarding the massive amounts of data being generated every day by publicly funded research”. That being said, it is funny that many authors of niche databases were contacted and invited to describe their needs when ELIXIR was in its early planning stage (and the number of emails I received!), but the announcement of their first call was far less advertised. My intuition is that their money is likely for the big guys and the mass market, but I could be wrong.

That being said: I don’t mind building tools for a niche market. There’s a lot of quality research going on that relies on careful data curation, databases and bioinformatics niche solutions. A few examples from my own field (regulatory genome informatics in the fly) are the databases FlyReg, RedFly and ORegAnno. Probably inadequately summarised in one sentence, they collect information from the literature about transcription factor binding sites and other gene regulatory elements, and provide them in a standardised format primarily to the bioinformatics community to benchmark their computational methods. This doesn’t sound like a big deal, but these data are extremely useful and roughly 90 citations for FlyReg (since 2004/05), more than 100 for RedFly (three publications since 2006) and about 60 for ORegAnno (since 2007/08) speak for themselves. So whenever you use a tool to work out if a piece of genome is regulating a gene (and you’re working on Drosophila), you’re indirectly benefitting from these databases. There is a multiplier effect, and just because even the most important of those resources has ‘just’ had 100 citations, it doesn’t mean they’re not crucial. But this is where the sustainability issue comes in. While Casey Bergman’s DNase I footprint database was likely the first database of its kind, its strength lay in the curation, not the presentation. And, I hate to say this, because he probably hears it all too often, he’s had vision: Casey didn’t want to spend his precious research time fiddling with Apache problems like I did myself the other day, so he passed his data on to RedFly and ORegAnno. ORegAnno has a wider scope than just fruit flies, and I’ve never quite followed what happened to it, but their last news update happened in 2007, so it may be that they passed away and passed on their data to Wyeth Wasserman’s Pazar.
The latter is based in Canada, and I think they’ve recognised an important trend with GenomeCanada, which is where I assume some of the money for Pazar comes from: a proper IT team supports the work. That brings me back to RedFly. It’s the strongest resource with respect to Drosophila, as it captures and allows filtering of data that are largely irrelevant to non-fly researchers. It has expert Drosophila data curators (the PI is one of them), and good IT support. Through a collaboration on biomedical ontologies and their use in semantic web applications and computational biology, David Osumi-Sutherland from FlyBase and Mark Halfon at Buffalo got me involved in a grant application to a large US funding body. It was a good proposal. Probably not stellar, but good. Bottom line from the otherwise alright-ish reviews: Who’s really going to need that? Hhhmm. Did I mention the multiplier effect?

So why do I bother writing about this? Because I’m hopeless. And: I want to propose a new funding model for niche databases. Just recently I received a polite rejection letter from the BBSRC for a failed application for one of their TRDF2 grants. It was without doubt one of the strongest applications I had ever written. With my collaborator Gos Micklem it brought together, at no additional cost, state-of-the-art utilisation of an existing soft- and hardware infrastructure (a highly customised instance of their InterMine system) and the curation of all data relevant to the community of Drosophila tracheal system researchers (this is where we asked for a curator and designer of the UI). (I was tempted to paste in the entire proposal here because it really kicks ass, but I may want to send it out for a different call, so bear with me). The application was supported with strong letters from most of the European tracheal system researchers. It’s noteworthy that they’ve just started to utilise truly genome-wide approaches and genomics, so their anxiety about the data load and how to make sense of it is significant. All really good. But, so I heard through the grapevine, the BBSRC is not likely to support a resource for a community that (a) is only 20 or so groups strong and (b) with the exception of 2-3 groups is based outside the UK. So, alright, we’re a niche. Point taken. And I’m not going to start again with the multiplier effect, because I can do the maths, and the effect is even stronger if we start off with a larger community. So what to do? I believe our resource (which we had dubbed ‘TrachealDB’) would still be a great addition to our research weaponry and I’m going to show a few possible use cases at one of our next meetings for tracheal system researchers. So, maybe, if all of my peers could chip in £5k, there may still be a way for ‘TrachealDB’ to kick off. I don’t really believe in it; I’m not the sort of guru people would easily follow.
But maybe we have to accept that not all good things in life are free, and that may include small niche databases if we find them useful. I would be neither surprised nor offended to find a ‘Donate by PayPal’ button next time I go to RedFly.
