Scalable machine learning of complex models on extreme data will be an important industrial application of exascale computers. In this project, we take the example of predicting compound bioactivity for the pharmaceutical industry, an important sector for Europe for employment, income, and solving the problems of an ageing society. Small-scale approaches to machine learning have already been trialled and show great promise to reduce empirical testing costs by acting as a virtual screen that filters out tests unlikely to work. However, it is not yet possible to use all available data to build the best possible models, as the algorithms (and their implementations) capable of learning such models do not scale to input data of this size and heterogeneity. There are further challenges as well, including imbalanced data, confidence estimation, data standards, model quality and feature diversity.
The ExCAPE project aims to solve these problems by producing state-of-the-art scalable algorithms and implementations suitable for running on future exascale machines. These approaches will scale programs for complex pharmaceutical workloads to input data sets at industry scale. The programs will be targeted at exascale platforms using a mix of HPC programming techniques, advanced platform simulation for tuning, and suitable accelerators.
The Pharmaceutical Industry
- One of the EU’s top-performing technology sectors, the industry provides 700,000 jobs and produces goods with a retail value of €235,016 million (2013, EFPIA).
- The average cost of developing a drug is €930 million ($1.2 billion), and it takes 9 to 13 years for a compound to reach the patient.
- Failure rates of compounds in clinical trials: approximately 54% due to a lack of efficacy, 9% due to toxicity.
- Chemogenomics is an extension of the already established single-target QSAR, which aims to build a holistic model of compound bioactivity across a range of targets, thus enabling multitarget use cases such as phenotypic deconvolution and hit-series triaging.
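The multitarget idea above can be sketched as one binary bioactivity model per protein target, all trained over a shared compound descriptor matrix, so that a single query compound yields an activity profile across targets. The toy nearest-centroid learner, the data, and all names below are illustrative assumptions, not the ExCAPE implementation:

```python
# Hypothetical chemogenomics-style sketch: one binary classifier per protein
# target over a shared compound descriptor matrix (all names/data made up).

def centroid(rows):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_target(X, y):
    """Fit one toy binary model: centroids of active (1) and inactive (0) compounds."""
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])
    return pos, neg

def predict_target(model, x):
    """Predict active if the compound lies closer to the active centroid."""
    pos, neg = model
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist(pos) < dist(neg) else 0

def fit_multitarget(X, Y):
    """Y maps each target name to its label vector; returns one model per target."""
    return {t: fit_target(X, y) for t, y in Y.items()}

# Toy data: 4 compounds x 3 descriptor features, 2 hypothetical protein targets.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 1.0], [0.1, 0.9, 1.0]]
Y = {"target_A": [1, 1, 0, 0], "target_B": [0, 0, 1, 1]}

models = fit_multitarget(X, Y)
# Activity profile of one query compound across all targets.
profile = {t: predict_target(m, [0.95, 0.05, 0.0]) for t, m in models.items()}
print(profile)  # -> {'target_A': 1, 'target_B': 0}
```

Real chemogenomics models share information between related targets rather than training each one independently; this independent-per-target version is only the simplest baseline of the idea.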
Machine Learning Challenges
- Diversity of feature space: fragment- or pharmacophore-based descriptors, graphs, 3D properties of a compound, or even biological fingerprints.
- Chemogenomics model quality: the state of the art in building models for high-dimensional multi-label and multi-class data needs improving. In addition, insight into feature significance and confidence estimation for the prediction would make models more usable.
- There is a lack of common data standards to encode empirical information.
- Biological data is complex and noisy.
- Large and imbalanced data sets: input sets can number tens of millions of training points, with a heavy skew towards inactives.
- Scalability: Making software that will scale to the huge parallelism on future machines is known to be a difficult task.
- Efficiency: To keep the energy usage acceptable, future machines are likely to use specialised accelerators, which require specific programming techniques.
- Development burden: Scalability and efficiency are hard to achieve, which in turn pushes up the difficulty of writing software, especially for algorithm experts who may not be so familiar with HPC. This burden needs to be reduced.
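One standard way to counter the heavy skew towards inactives mentioned above is to reweight training examples inversely to class frequency, so a rare active contributes as much to the loss as the many inactives combined. The sketch below implements the common "balanced" heuristic, weight = n_samples / (n_classes * class_count); the label data is a made-up illustration, not ExCAPE data:

```python
# Minimal sketch of the 'balanced' class-weight heuristic for imbalanced
# bioactivity labels: weight(c) = n_samples / (n_classes * count(c)).
from collections import Counter

def balanced_weights(labels):
    """Return a per-class weight inversely proportional to class frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Toy screen: 1 active (label 1) among 10 measurements.
labels = [0] * 9 + [1]
w = balanced_weights(labels)
print(w)  # actives weighted ~9x more heavily than inactives
```

With these weights, a weighted loss such as `sum(w[y] * loss(y, yhat))` treats both classes as equally important despite the 9:1 imbalance.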