Dealing with Poorly Formatted Clinical Research Data

In this project we successfully set up and maintained an automated capture process for poorly formatted clinical research files arriving from the offsite facilities of our client, a large US pharmaceutical company.

Incoming clinical research data requires QC

Much of the clinical research data in this Saber Informatics project arrived for analysis and interpretation in loosely formatted files, generated as part of roughly a hundred ongoing clinical studies run by our Client. While it was a single project for us, it covered the results of many active studies, each with its own team and study lead. A significant part of the data was cumulative, with daily or weekly additions sent as separate files. With biomarker data spread across sizeable tables, manual analytics was out of the question. The information had to be plotted in Spotfire, but because of the loose formatting, gaps, and errors, it was nearly impossible to import these files in their raw state. It was obvious to everyone involved that some post-processing was needed.
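
To make the kind of clean-up concrete, here is a minimal Python sketch of normalizing one loosely formatted table before import. The column name, the pandas-based approach, and the specific fixes are our illustrative assumptions, not the client's actual schema or tooling.

    import pandas as pd

    def normalize_biomarker_file(path):
        # Read a loosely formatted CSV; blank lines are skipped by default.
        df = pd.read_csv(path, skip_blank_lines=True)
        # Collapse inconsistent header variants ("Subject ID", "subject_id", ...).
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        # Coerce a numeric column (name is a hypothetical example); malformed
        # cells become NaN instead of silently breaking the import downstream.
        df["concentration"] = pd.to_numeric(df["concentration"], errors="coerce")
        # Drop rows that are entirely empty, a common artifact of manual editing.
        return df.dropna(how="all")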

Rigorous stepwise pipelines for data ingestion

While PIs were asked to adopt standardized reporting once processing problems reached a certain level, standardization would take time, and even then not everything could be templated. Some form of automated clean-up and formatting had to be put in place.

We designed and put in place multiple automated stepwise data processing pipelines, one per study type. They were similar in outline but differed in their specific processing steps. During the design phase, particular attention was paid to ensuring that every cleanup and import step had a documented purpose. In the testing phase we worked side by side with the Client's domain experts in each study type to confirm that the pipelines were working as intended. In the end, 98% of the ongoing study data was auto-processed, while the remaining 2% was flagged by algorithmically generated alerts and forwarded for manual processing.
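
As a hedged sketch of the stepwise idea, the Python below runs one ordered list of named steps per study type and routes anything that fails to a human with an actionable alert. The step functions and the alerting channel are illustrative placeholders, not the client's actual implementation.

    import logging
    from typing import Callable, List, Optional, Tuple

    import pandas as pd

    log = logging.getLogger("ingestion")

    # One ordered list of (step name, step function) per study type.
    Step = Tuple[str, Callable[[pd.DataFrame], pd.DataFrame]]

    def send_alert(study_id: str, message: str) -> None:
        # Stand-in for the real alerting channel (email, ticketing, etc.).
        log.warning("study %s routed to manual processing: %s", study_id, message)

    def run_pipeline(df: pd.DataFrame, steps: List[Step], study_id: str) -> Optional[pd.DataFrame]:
        for name, step in steps:
            try:
                df = step(df)
            except Exception as exc:
                # Name the failing step so the alert is actionable.
                send_alert(study_id, f"step '{name}' failed: {exc}")
                return None
        return df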

Traceability is a required component of data quality assurance

Intermediate output was generated at important junctures of the process so that our Client's scientists could confidently trace any issue or data gap upstream through the processing steps. How would they request a correction in a study if they could not articulate what exactly was wrong or missing? Making the process actionable was key, since each study had rigid timelines and budgets. Errors had to be communicated quickly and clearly if corrections were expected in a timely manner; it is much easier to correct something in an ongoing study than to reopen a completed one. The timing of feedback within a study was therefore critical.
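
A minimal sketch of the traceability idea, assuming the pipeline shape from the previous example: persist a snapshot after every step so a scientist can walk an issue back to the exact step that introduced it. The directory layout and file naming are assumptions for illustration.

    from pathlib import Path

    def run_with_trace(df, steps, trace_dir):
        # df is a pandas DataFrame; steps is the ordered (name, function)
        # list from the pipeline sketch above.
        trace_dir = Path(trace_dir)
        trace_dir.mkdir(parents=True, exist_ok=True)
        for i, (name, step) in enumerate(steps, start=1):
            df = step(df)
            # Intermediate snapshot, named by order and step for easy tracing.
            df.to_csv(trace_dir / f"{i:02d}_{name}.csv", index=False)
        return df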

Looking back

The project continued as more study types were introduced. There were also changes in the informatics IT department: restructuring, and new overseas IT partners brought in to support informatics. Through all of this, the data processing workflows kept humming along for several years. The project left us with an important question: what makes data workflows survive change? We discuss our answers in our blog article here.

We Can Help

Feel free to call or email us at Saber Informatics to start a conversation: we bring extensive practical experience as active practitioners in the field, and we will help you avoid the mistakes we have already made ourselves. We honor strict client confidentiality and bring an industry insider's perspective. We will listen and help.

About Us

Saber Informatics is a US data science consultancy founded in 2012.

Our focus is on pharmaceutical R&D, specifically data preparation for ML/AI initiatives.

  info@saberinformatics.com
