What they didn't tell you about machine learning

ML and AI are talked about so much in life sciences R&D these days that it even makes sense to include them as keywords in a project proposal to improve its chances of getting funded. Trade magazines publish success stories. Conference talks promise new insights into hard-to-crack problems. Vendors offer specialized hardware that can do image analysis at scale. Be honest: have you thought about applying some of those new approaches to your own work?

We at Saber Informatics suggest that you should! However, our recent work with a client organization that encourages innovation in a level-headed way taught us that some key points must be considered when embarking on ML projects. If you address all of them, you will be well on your way to a successful real-world implementation. It turns out that some seemingly mundane checks and balances should be put in place even before you start looking for a predictive model to use.

Data

Let's start with the data that will be used to train models. It shouldn't come as a surprise that your team will spend over half of their effort on procuring such data. It is not enough to just download it from somewhere. It will have to be examined: its structure, distribution, anomalies. Even before that, its trustworthiness. What about those value qualifiers (e.g. >5.0 or <5.0): are they correct? We ran into unexpected issues with qualifiers missing from some of our data (a software error somewhere during registration had replaced them with question marks years ago).
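The qualifier check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: we assume measurements arrive as strings such as ">5.0", "<5.0", "5.0" or "?5.0", and the names Measurement and parse_measurement are hypothetical, not part of any library. The idea is to keep the qualifier separate from the number and to flag anything that needs a second look by a colleague who knows the data.

```python
import re
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    qualifier: str          # "=", "<", ">", "<=", ">=", or "?" when unknown
    value: Optional[float]  # None when no number could be read
    needs_review: bool      # True when a person should check the record


# Optional qualifier followed by a plain or scientific-notation number.
# The accepted formats are an assumption about how the data was registered.
_PATTERN = re.compile(
    r"^\s*(<=|>=|<|>|=|\?)?\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_measurement(raw: str) -> Measurement:
    """Split a raw string into qualifier and value, flagging doubtful records."""
    match = _PATTERN.match(raw)
    if match is None:
        # Nothing recognizable: keep the record but send it for review.
        return Measurement("?", None, True)
    qualifier = match.group(1) or "="  # no qualifier means an exact value
    value = float(match.group(2))
    # A "?" qualifier means the original qualifier was lost.
    return Measurement(qualifier, value, qualifier == "?")
```

Running every raw value through such a function and counting the records with needs_review set gives a quick estimate of how much of the dataset has to be discussed with the people who registered it.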

Be prepared to talk to your SME colleagues who uploaded the data to the corporate repository. It is crucial to get some of their time. They will suggest adding or removing data because they understand it better than anyone else. You will very likely hear sensible explanations, but you will still need to fill the gaps. Procuring data for ML is a learning experience in itself: you will be on the receiving end of any inconsistencies or rushed registrations that happened years ago. Don't get discouraged - it is a noble job, because once you clean the dataset, it will be in high demand across the organization long after your team completes its project. High-quality data is worth a lot.

Versioning

After the cleanup work bears fruit, you will start building your first models. It would be rare for a first version to be good enough for production, so most likely you will iterate. Create versions. Organizing model versions (data, columns, model type, parameters) is a challenge that truly needs to be seen to be appreciated. Do not underestimate the care required to manage model versions over the long term, especially once your models go into production. Once your colleagues start to use and trust your model predictions, they will want to know which version they used three months ago and exactly why its predictions differ slightly from what they get today.
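One way to keep track of the four ingredients named above (data, columns, model type, parameters) is to store them together in a small record and derive a short fingerprint from it. The sketch below is a minimal illustration using only the Python standard library, not a description of any particular tool; ModelVersion, hash_bytes and diff_versions are hypothetical names introduced here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def hash_bytes(data: bytes) -> str:
    """SHA-256 of the raw training data, so any change in the data is visible."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelVersion:
    data_sha256: str   # hash of the exact training dataset
    columns: tuple     # names of the columns used as model inputs
    model_type: str    # free-text label for the kind of model
    parameters: dict = field(default_factory=dict)  # model settings

    def fingerprint(self) -> str:
        """Short, stable identifier derived from everything that defines the model."""
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def diff_versions(old: ModelVersion, new: ModelVersion) -> dict:
    """Return only the fields that differ, as (old value, new value) pairs."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

If each prediction is stored together with the fingerprint of the model that produced it, then answering "which version did I use three months ago, and what changed since?" comes down to looking up two records and calling diff_versions on them.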

Delivery Infrastructure

Last but not least, there is the challenge of infrastructure. Scaling up is a pleasant chore because you feel appreciated and in demand (otherwise why scale?). Nevertheless, designing a robust infrastructure to deliver predictions to end users across your organization should be given due consideration. You will have to make good friends with IT.

Team

If you are still reading, you are probably not easily discouraged in your ML/AI projects and must have already given the idea some thought. You are ready to start planning for steady project progression (as opposed to just delivering an enthusiastic talk at a conference). Based on our experience with ML projects, consider bringing people with diverse technical skills onto your team, and be sure to get a commitment of time from the SMEs who understand and own the data. Reach out to the IT (or research infrastructure) department early - it will make the eventual rollout easier (and a lot more likely to happen).

We Can Help

Feel free to call or email us at Saber Informatics to start a conversation: as active practitioners, we bring extensive practical experience in the field and an industry insider's perspective. We will help you avoid the mistakes we have already made. We honor strict client confidentiality. We will listen and help.

About Us

Saber Informatics is a US data science consultancy founded in 2012.

Our focus is on pharmaceutical R&D, specifically data preparation for ML/AI initiatives.

  info@saberinformatics.com
