High-performance computing for bioinformatics

A few weeks ago I was invited by people from The Genome Analysis Centre (yes, that’s TGAC) to attend a workshop along the lines of “high performance computing and bioinformatics”. I agreed to join, thinking I was going to learn how to run jobs on their top-notch resources, in contrast to the mere 480 cores we have in the Cambridge Systems Biology Centre.

I was wrong.

Day 1. Today I met with the heads of the compute resources of the Crick Institute, TGAC, the Pittsburgh Supercomputing Center and their Irish counterpart (about whom we joked that he oversees roughly a core per member of the Irish population, some 4 million). Their core interest was: How can we bring the goodness of HPC to the field of bioinformatics? Now that’s interesting: There seems to be a disconnect.

Although many universities (including the University of Cambridge) now have access to immense compute power, it’s not really being picked up by the computational biology and bioinformatics communities. Why is that?

Today we identified three core issues.

(1) Funding. (A) Some funding bodies make it very difficult to spend money on “services”. In their backwards-facing attitude, they would rather see scientists spend money on hardware, software and a systems administrator than on access to university-internal and external resources. (B) This seems to be changing, but the issues often have real practical impact: while I can (after a tender process!) spend thousands of pounds on actual computers, the associated paperwork only allows for the classical route of obtaining a purchase order number, ordering the product, receiving an invoice and, ultimately, having the university pay for it. Signing up for any commercial cloud resource, on the other hand, requires a credit card number. These processes are not compatible! (C) Cloud computing is still relatively new (at least for academics): while we can guesstimate the cost of a computer, the software and the systems administrator, we still need to learn to translate our needs into hours of CPU time at a guaranteed level of service. This is where academic HPC and cloud providers can help us.

(2) Training. This was a major discussion point, given the different communities for whom HPC could be relevant. In the end, we agreed that basic data plumbing is a major issue. Users with no concept of distributed file systems, and no idea of which jobs can be subdivided and which need to run as a monolithic whole, are hard to reach (a small sketch after point (3) illustrates the distinction). However, once people have a basic understanding of the Unix command line, pipelines and (ideally) some programming, we can help them. The primary entry point seems to be documentation. People on Twitter and in the room independently noted that HPC documentation is the worst in IT. This is something we are going to tackle tomorrow.

(3) Technical issues. This was interesting. While bioinformaticians feared their storage needs might pose an issue, someone with an HPC background noted that it “just takes another 30% of funds” to store all of the previously generated data. Every year. It hasn’t quite emerged which other technical problems need to be faced, but we’re going to address those tomorrow.
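To make the point from (2) about subdividable versus monolithic jobs a bit more concrete, here is a minimal Python sketch; nothing in it comes from the workshop, and the file name, per-record task and worker count are invented for the example. A per-record GC count over a FASTA file is embarrassingly parallel and is exactly the kind of work a scheduler can spread over many cores, whereas a job whose steps all depend on shared global state has to run as one monolithic whole.

```python
"""Toy illustration (not from the workshop): an 'embarrassingly parallel'
bioinformatics task -- GC content per FASTA record -- where every record can
be processed independently, in contrast to a monolithic job."""
from multiprocessing import Pool


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def gc_content(record):
    """GC fraction of one record; each record is an independent unit of work."""
    header, seq = record
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return header, (gc / len(seq) if seq else 0.0)


if __name__ == "__main__":
    records = list(read_fasta("reads.fasta"))   # hypothetical input file
    with Pool(processes=8) as pool:             # one worker per core
        for header, gc in pool.map(gc_content, records):
            print(f"{header}\t{gc:.3f}")
```

On a cluster, the same decomposition would typically be expressed as an array of independent jobs rather than a local process pool, but the reasoning about what can be split is identical.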

On the positive side, Phil Blood from Pittsburgh was contacted by Data Carpentry/Software Carpentry to think about the delivery of an HPC course. With many of these questions sorted out (e.g. who the target audience should be to make the biggest impact, and what prerequisites course attendees should have [basic Unix skills]), we can next look at which skills exactly should be taught in a bioinformatics-for-HPC course.

Day 2. We revisited a few discussion points from the first day. We noted again that ‘general computational thinking’ is a major prerequisite for bioinformatics and HPC, but we have neither the resources nor the reach to introduce this as a topic in undergraduate education.

Training from our side can only be provided at the PhD or postdoctoral level. With an increasing number of wet-bench biologists learning how to use the command line, we do, however, need to manage the transition of researchers who are primarily used to GUI-based tools into competent HPC users, i.e. support them with basic Unix skills and a conceptual understanding of HPC environments.

We discussed the technical infrastructure that could be used to teach HPC, including sandbox systems (T-infrastructure), web-embedded command lines (e.g. as available for Docker), virtual environments (VMs and containers), as well as courses and online communities.

What should a bioinformatician learn to leverage an HPC environment? The following slides contain some of our discussion content:


New article on arXiv: Scaling of PWM scores

I’m happy to announce that our second-year PhD student Xiaoyan has deposited her first article on arXiv prior to publication.

arXiv:1503.04992: Reliable scaling of Position Weight Matrices for binding strength comparisons between transcription factors

What started as a nuisance that many bioinformaticians face every day turned into an exciting question for Xiaoyan: How can we compare scores from different PWMs? While it is clear that a PWM score is related, through a scaling factor called lambda, to the affinity of a transcription factor for DNA, it’s not easy to determine the actual value of that parameter. Experimental methods to do so are complex and expensive, and theoretical methods to approximate lambda are usually based on a multitude of noisy genomic data.

In this paper, we introduce two simple methods to derive lambda. The two are based on different assumptions, but produce very similar parameter ranges. While the first method focusses on the top hits that are deemed ‘reliable’ (on the basis of genome statistics), the second takes the calculated residence time of the factor on the DNA into account. The latter method is particularly useful in cases where two alternative PWMs exist for the same protein, and lambda serves to scale their scores so that they’re comparable.
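For readers who want to see what “scaling by lambda” means in practice, here is a toy Python sketch. It shows only the standard Boltzmann-style weighting exp(score/lambda) that makes scores comparable once a value of lambda has been chosen; it is not either of the estimation methods from the paper, and the PWM, the sites and the lambda value are invented for the example.

```python
"""Toy sketch: turning PWM scores into relative binding weights once a
scaling factor lambda is known. The PWM below (rows A, C, G, T; four
positions) and the lambda value are made up for illustration."""
import numpy as np

PWM = np.array([
    [ 1.2, -0.5, -1.0,  0.8],   # A
    [-0.3,  1.5, -0.8, -1.2],   # C
    [-1.1, -0.9,  1.7, -0.4],   # G
    [ 0.1, -1.4, -0.6,  0.9],   # T
])
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pwm_score(site):
    """Sum of per-position log-odds scores for a site of PWM length."""
    return sum(PWM[BASES[base], pos] for pos, base in enumerate(site))


def relative_affinity(site, lam):
    """Boltzmann-style weight exp(score / lambda); comparable across PWMs
    once each PWM has been assigned its own lambda."""
    return float(np.exp(pwm_score(site) / lam))


lam = 0.7   # a value estimated by one of the paper's methods would go here
for site in ["ACGT", "AAAA", "ACGA"]:
    print(site, round(pwm_score(site), 2), round(relative_affinity(site, lam), 2))
```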

New publication

Our paper “Homotypic clusters of transcription factor binding sites: A model system for understanding the physical mechanics of gene expression.” has been out for more than half a year, but I had forgotten to put up a brief post about it – probably because it was more of a review than the result of hard original graft. However, friends in the field have already picked up on our thoughts, and the work has been cited by the Gordan and Hannenhalli labs. So here goes:

Homotypic clusters of transcription factor binding sites are common in regulatory contexts where a dosage-dependent transcriptional response is desired. The number, orientation and spacing of binding sites determine how the element will respond. While this is a known and experimentally often-observed phenomenon, the paper provides the physical reasoning for why homotypic clusters show their distinctive behaviour.
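As a toy illustration of the dosage dependence, and nothing like the physical treatment in the paper: if you assume n identical and independent sites with the same dissociation constant, the expected number of occupied sites in a cluster rises with both transcription factor concentration and the number of sites, which is the graded response described above. All numbers below are invented.

```python
"""Toy model of dosage dependence in a homotypic cluster: n identical,
independent binding sites, each following a simple binding isotherm.
This ignores orientation, spacing and cooperativity entirely."""


def site_occupancy(conc, kd):
    """Fractional occupancy of a single site at TF concentration conc."""
    return conc / (conc + kd)


def expected_bound(conc, kd, n_sites):
    """Expected number of occupied sites in a cluster of n independent sites."""
    return n_sites * site_occupancy(conc, kd)


KD = 50.0                                   # arbitrary units
for conc in (10, 50, 200):
    for n in (1, 4, 8):
        print(f"[TF]={conc:>3}  n={n}  expected bound sites={expected_bound(conc, KD, n):.2f}")
```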

If you’re interested, go check it out at the Computational and Structural Biotechnology Journal.

Software Sustainability Fellowship

I’m very proud to announce that I’ve been elected one of 19 Fellows of the Software Sustainability Institute (SSI). The SSI is a virtual institute funded by the EPSRC with the mission of supporting scientific software. Fellows from all over the UK, both from academia and industry, advocate best practices in software development, open source, open data and education.

I’m aiming to bring the experience with data integration methods (e.g. ontologies) and machine learning that we have successfully used in the context of biological research to the people who are developing the Internet of Things. I recently attended the Thingmonk conference and presented a talk, “What the IoT should learn from the life sciences”, with my SSI hat on.

I’m following in the big footsteps of two former Cambridge SSI Fellows, Laurent Gatto and Stephen Eglen.

New publication

Our manuscript “Estimating binding properties of transcription factors from genome-wide binding profiles” has now been accepted in Nucleic Acids Research. For more scientific background, see the blog post about the version pre-published on arXiv.

It was a long, painful uphill battle to get this piece published. But it was also one of the (unfortunately rare!) examples of how a manuscript can really be improved through several iterations of peer review. We would like to thank the anonymous reviewers for their continuous support throughout the entire process, and for spotting a few conceptual problems and miscalculations (which eventually even uncovered a problematic piece of code). They were very critical, but always fair!

Dealing with the loss of a group member

Our friend and colleague Edward Kinrade suddenly died last month. He had been instrumental in the growth of the Adryan group over six years, taking care of all lab-management duties and the training of students at the wet bench.

I wrote a small series of blog posts on how we dealt with the situation: 1/ How to communicate bad news to the group. 2/ Disaster management in the lab. 3/ When cloud services bite you in the arse. Maybe someone will find them useful one day.

New publication

About four years in the making, our paper “A combination of computational and experimental approaches identifies DNA sequence constraints associated with target site binding specificity of the transcription factor CSL” (the title, I know!!!) has finally been published. Please find the PDF here.

This has been an exciting journey. About four years ago I stared at a sequence motif that had come out of motif inference from a CSL/Su(H) ChIP dataset I analysed in collaboration with Sarah Bray’s group. There were a few good reasons to believe it was legit, but Sarah wasn’t convinced at all, not in the light of a crystal structure that Rhett Kovall had published in 2004. That paper discussed why the canonical motif was such a good fit for CSL, but it didn’t offer any entry point for why other sequences might be bound as well. A couple of hours with my favourite crystallographer Ilka Müller convinced me that CSL might well prefer its canonical motif CGTGGGAA, but she also suggested that the space of possible sequences “tolerated” by the protein might be significantly larger. I presented these thoughts at a conference that was also attended by Bobby Glen, and we soon decided to team up to tackle the question more systematically using the tools of computational chemistry. Bobby and I proposed this to Apple for an ARTS Award, which brought us some $30k of hardware, and the rest is history…

In the meantime, Sarah Bray took over the experimental validation of our computational results, and because in vivo behaviour often doesn’t tell you much about what the protein really does, Rhett Kovall himself joined forces and provided additional biophysical measurements of CSL/Su(H) binding to DNA.

A truly international and interdisciplinary team. I’m proud of having been a member of it. And well done, Eddie and Rob from my CSBC team.

Another big donation from ARM

The Game of Life is an example of a cellular automaton, a conceptual framework inspired by the interaction of living cells in a confined space over time. Computer science students learn about Conway’s classic game because it teaches them about Turing completeness; biology students may encounter it as an introduction to discrete stochastic simulations.
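For anyone who has never seen the rules in code, here is a minimal NumPy sketch of Conway’s update step on a toroidal grid. It has nothing to do with the microcontroller implementation described below; the glider starting pattern and all names are my own.

```python
"""Minimal Game-of-Life step on a wrapping (toroidal) grid using NumPy."""
import numpy as np


def step(grid):
    """One generation: a live cell survives with 2-3 live neighbours,
    a dead cell becomes alive with exactly 3 live neighbours."""
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(int)


# A glider on a 10x10 board, advanced four generations
grid = np.zeros((10, 10), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):
    grid = step(grid)
print(grid)
```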

We’ve previously used a Game-of-Life implementation on KL25Z microcontrollers (another donation from the ARM University Programme) to provide a more haptic way for students to explore the properties of the game world. Each controller represents a cell whose survival is dependent on the neighbouring devices. However, for technical reasons the implementation relies on a serial ring bus to enable communication between the boards, which is very distinct from biological systems where all cells are in immediate contact with their neighbours.

We discussed this with the ARM University Programme, who suggested the use of wireless radio technology for the communication between our “cells”. Bluetooth Low Energy (BLE) is such an approach, and the Nordic nRF51822 mbed Kit is a low-power Cortex-M0 controller with BLE on board. After a few emails with ARM, we’re now in a position to explore the properties of concurrent dynamic systems and compare them to the behaviour of living cells: a great learning experience for technical people and life scientists alike!

Last bits of Boris’ PhD published!

Better late than never, one might think. Once upon a time, back in 2001 when I was in Tien Hsu’s lab at the Medical University of South Carolina, I conducted a two-hybrid experiment with the fly von Hippel-Lindau (VHL) tumour suppressor. Amongst a few other candidates, we found the nm23 tumour suppressor as an interaction partner for VHL. We published a very nice paper on nm23 function in tracheal development, but never quite revealed where we had found the inspiration to choose exactly that gene to work on. Now, in this semi-review article, we are finally coming clean and show, some 13 years later, that VHL and nm23 do stuff together. Unfortunately, it is pay-per-view in Naunyn-Schmiedeberg’s Archives of Pharmacology: Interaction between Nm23 and the tumor suppressor VHL.

New article on arXiv: TF copy number estimation

We are very happy to announce the submission of another article to arXiv prior to publication in a peer-reviewed open access journal:

arXiv:1404.5544: Estimating transcription factor abundance and specificity from genome-wide binding profiles

This is a rather curious story, as we had actually been working on something else! If you’re interested in determining the number of bound TF molecules, or in how to translate PWM scores into actual binding profiles, please have a look, send us comments and suggestions, tweet it, write about it on your blog (but let us know so we can engage in a fruitful scientific discussion!).

In brief, challenged by the sometimes lengthy compute times of our GRiP framework, Radu Zabet derived an analytical solution for the prediction of genomic occupancy on the basis of TF binding site preference (in the form of a PWM), DNA accessibility (from genome-wide footprinting data), TF copy number and a scaling factor that translates the PWM score into a specificity. However, when benchmarking the method against actual ChIP-seq data, we realised that the same analytical model also allows the inference of TF copy number and specificity if the other two genome-wide measurements (occupancy and DNA accessibility) are given.

This is now an alternative entry point for studies into TF copy numbers, something that our group are also interested in addressing experimentally using quantitative mass-spec.
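For the technically curious, here is a deliberately crude Python sketch of the general idea, not the analytical model from the manuscript: a forward model that distributes N molecules over sites in proportion to accessibility times a Boltzmann-weighted PWM score, and a brute-force grid search that recovers N and lambda from a simulated occupancy profile. All numbers are invented, and the toy ignores site saturation and everything else that makes the real problem hard.

```python
"""Toy sketch of inferring TF copy number N and specificity lambda from an
occupancy profile, given PWM scores and accessibility. This is NOT the
analytical model of the manuscript; it only illustrates the inversion idea."""
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 2.0, size=1000)     # hypothetical PWM scores per site
a = rng.uniform(0.2, 1.0, size=1000)    # hypothetical DNA accessibility per site


def predicted_occupancy(N, lam):
    """Distribute N molecules over sites proportionally to a_i * exp(w_i / lam)."""
    weights = a * np.exp(w / lam)
    return N * weights / weights.sum()


# Simulate an 'observed' profile with known parameters, then try to recover them.
observed = predicted_occupancy(500, 1.5) + rng.normal(0.0, 0.01, size=w.size)

best_N, best_lam = min(
    ((N, lam) for N in range(100, 1001, 100) for lam in np.arange(0.5, 3.01, 0.25)),
    key=lambda p: float(np.sum((predicted_occupancy(*p) - observed) ** 2)),
)
print("recovered copy number:", best_N, " recovered lambda:", float(best_lam))
```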