A few weeks ago I was invited to attend a workshop along the lines of “high performance computing and bioinformatics” by people from The Genome Analysis Centre (yes, that’s TGAC). I agreed to join, thinking I was going to learn how to run jobs on their top-notch resources, in contrast to the mere 480 cores we have in the Cambridge Systems Biology Centre.
I was wrong.
Day 1. Today I met with the heads of the compute resources of the Crick Institute, TGAC, the Pittsburgh Supercomputing Center and their Irish counterpart (about whom we joked that he oversees one core per member of the Irish population, roughly 4 million). Their core interest was: How can we bring the goodness of HPC to the field of bioinformatics? Now that’s interesting: There seems to be a disconnect.
Although many universities (including the University of Cambridge) now have access to immense compute power, it’s not really being picked up by the computational biology and bioinformatics communities. Why is that?
Today we identified three core issues.
(1) Funding. (A) Some funding bodies make it very difficult to spend money on “services”. In their backward-facing attitude, they would rather see scientists spend money on hardware, software and a systems administrator than on access to university-internal and external resources. (B) This seems to be changing, but often these issues have real practical impact: While I can (after a tender process!) spend thousands of pounds on actual computers, the paperwork associated with this only allows for the classical way of obtaining a purchase order number, ordering the product, receiving a bill, and ultimately having the university pay for it. Signing up for any commercial cloud resource, on the other hand, requires a credit card number. These processes are not compatible! (C) As cloud computing is relatively new (at least for academics), we can guesstimate the cost of a computer, the software, and the systems administrator, but we still need to learn to translate our needs into hours of CPU time at a guaranteed level of service. Here is where academic HPC and cloud providers can help us.
(2) Training. This was a major discussion point, given the different communities for whom HPC could be relevant. In the end, we agreed that basic data plumbing is a major issue. Users with no concept of distributed file systems, and with no idea which jobs can be subdivided and which ones need to be run as a monolithic whole, are hard to reach. However, once people have reached a basic understanding of the Unix command line, pipelines and (ideally) some programming, we can help them. The primary entry point seems to be documentation. People on Twitter and in the room independently noted that HPC documentation is the worst in IT. This is something we are going to tackle tomorrow.
(3) Technical issues. This was interesting. While bioinformaticians feared their need for storage might pose an issue, someone with an HPC background noted that it “just takes another 30% of funds” to store all of the previously generated data. Every year. It hasn’t quite emerged which other technical problems need to be faced, but we’re going to address those tomorrow.
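To make point (1C) a bit more concrete: translating a workload into hours of CPU time is mostly multiplication. Here is a minimal sketch in Python. All numbers (job count, cores per job, run time, price per core hour) are made-up placeholders for illustration, not figures from the workshop or from any real provider.

```python
def core_hours(n_jobs, cores_per_job, wall_hours_per_job):
    """Total core hours = jobs x cores per job x wall-clock hours per job."""
    return n_jobs * cores_per_job * wall_hours_per_job


def estimated_cost(total_core_hours, price_per_core_hour):
    """Rough cost estimate; ignores storage, data transfer and queueing."""
    return total_core_hours * price_per_core_hour


if __name__ == "__main__":
    # Hypothetical project: 200 alignment jobs, 16 cores each, 6 hours each.
    total = core_hours(n_jobs=200, cores_per_job=16, wall_hours_per_job=6)

    # Hypothetical price per core hour, in whatever currency the provider bills.
    cost = estimated_cost(total, price_per_core_hour=0.05)

    print(f"core hours needed: {total}")
    print(f"estimated cost:    {cost:.2f}")
```

The hard part is of course not the arithmetic but knowing the inputs, i.e. how many cores a job can actually use and how long it runs. That is exactly the kind of estimate HPC and cloud providers could help us with.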
On the positive side, Phil Blood from Pittsburgh was contacted by Data Carpentry/Software Carpentry to think about the delivery of an HPC course. With many of these problems sorted out (e.g. who should be the target audience to make the biggest impact, what prerequisites should there be for course attendees [basic Unix skills]), we can next look at exactly which skills should be taught in a bioinformatics for HPC course.
Day 2. We revisited a few discussion points of the first day. We noted again that ‘general computational thinking’ is a major prerequisite for bioinformatics and HPC, but we have neither the resources nor the reach to introduce this as a topic for undergraduate education.
Training from our side can only be provided at the PhD or postdoctoral level. With an increasing number of wet-bench biologists learning how to use the command line, however, we need to manage the transition of researchers who are primarily used to GUI-based tools to become competent HPC users, i.e. support them with basic Unix skills and a conceptual understanding of HPC environments.
We discussed the technical infrastructure that could be used to teach HPC, including sandbox systems (T-infrastructure), web-embedded command lines (e.g. as available for Docker), virtual environments (VMs, emulating containers), as well as courses and online communities.
What should a bioinformatician learn to leverage an HPC environment? The following slides contain some of our discussion content: