This talk will describe the bioinformatics use cases, challenges and experiences of two leading research institutions: the Francis Crick Institute and Cambridge University.
Adam Huffman will describe how the Francis Crick Institute creates HPC clusters on OpenStack for genomics and scales them to 5,000 cores. He will report on the experience of setting up virtual clusters with batch schedulers on OpenStack to provide an HPC environment for life sciences users. These users build complex pipelines composed of many tools that operate on multi-terabyte datasets and have historically run on centrally provided bare-metal clusters. Adam will describe how the problems of reliably constructing such clusters were overcome with OpenStack, and the challenges of achieving high performance on OpenStack with clusters of this size.
Paul Calleja and Wojciech Turek will describe how Cambridge University is building an HPC bioinformatics platform on OpenStack infrastructure. The performance of this software stack depends on an IO subsystem optimised for the data access patterns characteristic of HPC and of current bioinformatics workloads in genomics. Current high-throughput technologies such as Next-Generation Sequencing (NGS) produce unprecedented volumes of data in genomics and clinical projects, with many projects producing petabytes of data. Most existing bioinformatics solutions have problems scaling to and dealing efficiently with current data volumes, making it hard to store, analyse, share and visualise the data. Cambridge’s approach focuses on solutions that deliver low-latency, high-throughput access to storage.
Attendees of this session will learn how to create a functioning virtual HPC cluster and scale it up to 5,000 cores while maintaining good performance. Attendees will also learn about:
- strategies to avoid cattle turning into pets
- complications with restricted-access datasets
- rescuing instances affected by unreliable underlying storage
- apparent differences in reliability between filesystems used in compute node instances
- the need to engage actively with upstream software projects in order to address bugs and missing functionality
- cultural issues for users
- user expectations
- provisioning of complex genomics software pipelines