This is Part 1 of a three-part series. Part 2 is "Why a Recent FDA Brief on Clinical Cancer Research Is Really About Much More Than Cancer Research", and Part 3 is "The 100 Largest Phase I Studies on ClinicalTrials.Gov".

Anyone who regularly uses the ClinicalTrials.gov database (CT.g) becomes familiar with the challenges of navigating the site and analyzing its data. The interface is confusing, and the sheer volume of data is overwhelming. We've been exploring the data on CT.g a lot lately, and stopped to ask ourselves about its quality. But again, the volume of data makes this a challenge. To simplify the problem, we break the CT.g data into two buckets: trial registration data and trial result data. As of April 4th, 2016, there are 211,988 studies in the database with registration data, yet only 20,819 studies (just under 10%) have complete sets of results.

Turning to the registration side, it is commonly assumed that this information is all correct. Yet when we dug into the data on CT.g, we found over 10,000 immediate errors due to poorly entered information. Even data that maps well has errors. For example, the country USA is entered in numerous different ways: USA, U.S.A., US, United States of America, plus a few typos thrown in for good measure. The team at CT.g reviews all studies being registered and makes every attempt to ensure the data is correct. This can cause annoying delays for those registering a study, but the rest of us, who use the data in other ways, should be grateful for the team's diligence.
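To make the country example concrete, here is a minimal sketch of how such variants might be collapsed to a single value. It is not how CT.g or our own pipeline does it: the function name `normalize_country`, the alias table, the extra "united states" spelling, and the 0.85 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
import re

# Hypothetical alias table. "usa", "us" and "united states of america" come from
# the spellings mentioned above; "united states" is an added assumption.
US_ALIASES = ("usa", "us", "united states", "united states of america")


def normalize_country(raw: str) -> str:
    """Map known spellings of the United States to one canonical name.

    Anything that is not recognized is returned unchanged apart from
    surrounding whitespace.
    """
    # Drop periods and collapse whitespace so "U.S.A." and " usa " compare equal.
    key = re.sub(r"\s+", " ", raw.replace(".", "")).strip().lower()
    if key in US_ALIASES:
        return "United States"
    # Only attempt typo matching on longer strings; short codes such as "uk"
    # are too easy to confuse with "us". The cutoff is an assumed value.
    if len(key) > 5 and difflib.get_close_matches(key, US_ALIASES, n=1, cutoff=0.85):
        return "United States"
    return raw.strip()
```

With this sketch, "U.S.A.", "US" and a typo such as "Unted States of America" all map to "United States", while "Canada" passes through untouched. A real cleanup would build the alias table from the values actually found in the data and review the fuzzy matches by hand.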

The expansion requirements of FDAAA took effect and were implemented in CT.g in December 2007. It could therefore be argued that the information on CT.g prior to that date is less reliable and could probably be excluded when evaluating trends, statistics and analytics on CT.g. Certainly there are more errors in this earlier data.

The challenge in cleaning and reviewing the data is knowing whether a data point is an error or a true event. An example is in the Phase I dataset below. In the graph, the red bars indicate anticipated subject enrollment and the black bars indicate actual subject enrollment. In 2006 there is a huge spike in the average number of subjects forecast for Phase I studies. Since the actual numbers are very consistent and the variance is confined to one year (possibly also 2012), the forecasted numbers likely contain some errors. Is this just a typo or an interesting underlying data set?
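One simple way to surface years like this for manual review is to compare each year's average anticipated enrollment against the median of all the yearly averages. The sketch below assumes the data has already been reduced to (year, anticipated enrollment) pairs; the function name `flag_suspect_years` and the threshold factor of 3 are hypothetical choices, not part of any CT.g tooling.

```python
from collections import defaultdict
from statistics import mean, median


def flag_suspect_years(records, factor=3.0):
    """Return the years whose mean anticipated enrollment looks out of line.

    records: iterable of (year, anticipated_enrollment) pairs.
    factor:  a year is flagged when its mean exceeds `factor` times the
             median of all yearly means (assumed threshold).
    """
    by_year = defaultdict(list)
    for year, enrollment in records:
        by_year[year].append(enrollment)

    # Average anticipated enrollment per year, as in the graph described above.
    yearly_mean = {year: mean(values) for year, values in by_year.items()}

    # The median is used as the baseline because a single extreme year
    # barely moves it, unlike an overall mean.
    baseline = median(yearly_mean.values())
    return sorted(year for year, avg in yearly_mean.items() if avg > factor * baseline)
```

A flagged year is only a prompt to look closer. Sorting that year's studies by anticipated enrollment would show whether one or two mistyped values are driving the spike or whether the forecasts were genuinely larger.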

We will be exploring CT.g data and its quality in further detail in upcoming blog posts. Stay tuned for insights into the growing size of Phase I data.