Data scientist's toolbox for computational biology

Computational scientists at a large pharma company in Boston look for patterns in chemical and biological data related to the projects they support. They go beyond answering well-defined questions and set out to identify similarities among toxicity profiles, metabolic cascade interference, and even author names in the published literature that are statistically linked to sets of related chemical compounds.

Big data in the life sciences can be surprisingly close at hand.

Pattern identification in very large datasets requires high-performance computational algorithms implemented to run on the appropriate compute resources. Saber Informatics was brought in to create a data scientist's toolbox for computational biology: a set of computational tools that would run in the company's high-performance computing environment. As it was described to us, the computational group could already perform the calculations one at a time in a "manual" mode and needed to automate the process so that complex workflows could be set up to run reliably in a massively parallel fashion.

A new and key requirement was for the toolbox to also handle "big data": many millions of datapoints in each dataset.

Working with the Client's computational sciences group, we identified the statistical algorithms to be implemented as interconnecting modules with a standard input-output parameter interface. We decided to use R as the implementation technology. In our initial testing, the existing implementations of these algorithms consistently ran out of memory within a few minutes of starting a calculation. To overcome this limitation, Saber Informatics designed and developed novel, high-performance functional modules capable of handling datasets that had previously been too large to process.

Collaborative design, high-quality implementation.

We held regular review meetings with the Client throughout the project. Detailed user documentation was delivered as a formal document with every development iteration. The toolbox was extensively stress-tested and benchmarked in the Client's high-performance compute environment using realistic datasets.

Our Client now has the computational tools to take their research to the next level. In their own words, "surprisingly the calculations don't crash," and they can now easily set up multi-step HPC workflows to tackle computational problems that were previously out of reach. Combining efficient memory management with code reliability allowed us to deliver a software product that our Client will use to conduct cutting-edge research in the life sciences.
