What they didn't tell you about machine learning

ML and AI are now talked about so much in life sciences R&D that it even makes sense to include them as keywords in a project proposal to improve its chances of getting funded. Trade magazines publish success stories. Conference talks promise new insights into hard-to-crack problems. Vendors offer specialized hardware that can do image analysis at scale. Be honest: have you thought about applying some of those new approaches to your own work?

We at Saber Informatics suggest that you should! However, our recent work with a client organization that encourages innovation in a very level-headed way taught us that some key points must be considered when embarking on ML projects. If you address all of them, you will be well on your way to a successful real-world implementation. It turns out that some seemingly mundane checks and balances should be put in place even before you start choosing a predictive model.

Data

Let's start with the data that will be used to train models. It probably shouldn't come as a surprise that your team will spend over half of their effort on procuring such data. It is not enough to just download it from somewhere. You will have to examine its structure, distribution and anomalies and, even before that, its trustworthiness. What about the value qualifiers (e.g. >5.0 or <5.0): are they correct? We ran into unexpected issues with qualifiers missing from some of our data (they had been replaced with question marks years ago by a software error somewhere during registration).
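
A check for damaged qualifiers like the ones described above can be automated early on. Below is a minimal sketch in Python, assuming the values arrive as strings such as ">5.0", "<5.0", "5.0" or "?5.0". The names QualifiedValue, parse_qualified and audit are hypothetical, and the accepted format is an assumption that would need adjusting to your own repository.

```python
import re
from typing import NamedTuple, Optional

# Assumed format: an optional qualifier followed by a plain decimal number.
# A "?" in the qualifier position is treated as a sign of a corrupted record.
_PATTERN = re.compile(r"^\s*(?P<qual><=|>=|[<>=?])?\s*(?P<num>-?\d+(?:\.\d+)?)\s*$")


class QualifiedValue(NamedTuple):
    qualifier: str  # "=", "<", ">", "<=", ">=" or "?" (original qualifier lost)
    value: float
    suspect: bool   # True when the qualifier cannot be trusted


def parse_qualified(raw: str) -> Optional[QualifiedValue]:
    """Split a raw string into qualifier and number; return None if unparseable."""
    match = _PATTERN.match(raw)
    if match is None:
        return None
    qualifier = match.group("qual") or "="  # no qualifier means an exact value
    return QualifiedValue(qualifier, float(match.group("num")), qualifier == "?")


def audit(values):
    """Collect the raw strings that need a conversation with the data owners."""
    parsed = [parse_qualified(v) for v in values]
    return {
        "unparseable": [v for v, p in zip(values, parsed) if p is None],
        "suspect": [v for v, p in zip(values, parsed) if p is not None and p.suspect],
    }


if __name__ == "__main__":
    print(audit(["5.0", ">5.0", "?3.2", "n/a"]))
```

The audit deliberately reports problems instead of guessing at the lost qualifier: deciding what a "?" used to mean is a question for the people who registered the data.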

Be prepared to talk to the SME colleagues who uploaded the data to the corporate repository. It is crucial to get some of their time. They will suggest adding or removing data because they understand it better than anyone else. You will very likely hear sensible explanations, but you will still need to fill the gaps. Procuring data for ML is a learning experience in itself: you will be on the receiving end of any inconsistencies or rushed registrations that happened years ago. Don't get discouraged - it is a noble job because once you clean the dataset, it will be in high demand across the organization long after your team completes its project. High-quality data is worth a lot.

Versioning

After the cleanup work bears fruit, you will start building your first models. It would be rare for a first version to be good enough for production, so most likely you will iterate. Create versions. Keeping track of what went into each model version (data, columns, model type, parameters) is a challenge that truly needs to be seen to be appreciated. Do not underestimate the care required to manage model versions over the long term, especially once your models go into production. Once your colleagues start to use and trust your model predictions, they will want to know which version they used three months ago and exactly why its predictions differ slightly from what they get today.
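
One way to make such questions answerable is to store a small manifest next to every trained model. The sketch below, in Python, records the four ingredients named above and derives a short identifier from them. It is an illustration under stated assumptions, not a full model registry: ModelVersion, fingerprint_data and diff_versions are hypothetical names, the training rows are assumed to be JSON-serializable dictionaries, and "random_forest" is just an example model type.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def fingerprint_data(rows) -> str:
    """Hash the training rows so the exact dataset can be recognized later."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


@dataclass
class ModelVersion:
    data_sha256: str   # fingerprint of the training data
    columns: tuple     # input columns, in the order the model expects
    model_type: str    # free-text label, e.g. "random_forest"
    parameters: dict = field(default_factory=dict)

    def version_id(self) -> str:
        """Short identifier that changes whenever any ingredient changes."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def diff_versions(old: ModelVersion, new: ModelVersion) -> dict:
    """Return the fields that differ as {name: (old_value, new_value)}."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


if __name__ == "__main__":
    rows = [{"compound": "A", "activity": 5.0}]
    v1 = ModelVersion(fingerprint_data(rows), ("activity",), "random_forest",
                      {"n_estimators": 100})
    v2 = ModelVersion(fingerprint_data(rows), ("activity",), "random_forest",
                      {"n_estimators": 200})
    print(v1.version_id(), v2.version_id())
    print(diff_versions(v1, v2))
```

If the identifier is logged with every prediction, the question "which version did I use three months ago?" becomes a lookup, and diff_versions shows what changed between then and now.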

Delivery Infrastructure

Last but not least, there is the challenge of infrastructure. Scaling up is a pleasant chore because you feel appreciated and in demand (otherwise why scale?). Nevertheless, designing a robust infrastructure to deliver predictions to end users across your organization should be given due consideration. You will have to make good friends with IT.

Team

If you are still reading, you probably will not be easily discouraged in your ML/AI projects and must have already given the idea some thought. You are ready to start planning for steady project progression (as opposed to just delivering an enthusiastic talk at a conference). Based on our experience with ML projects, consider bringing people with diverse technical skills onto your team. Be sure to get a commitment of time from the SMEs who understand and own the data. Reach out to your IT (or research infrastructure) department early - it will make the eventual rollout easier (and a lot more likely to happen).

We Can Help

Feel free to call or email us at Saber Informatics to start a conversation: as active practitioners, we bring extensive practical experience in the field to the table. We will help you avoid the mistakes we have already made. We honor strict client confidentiality and bring an industry insider's perspective. We will listen and help.
