Public datasets like ClinicalTrials.gov (CT.g) contain dirty data, yet they still offer compelling insights. How do we make sure we are thorough when analyzing public clinical trial data? I’ll walk you through an example analysis to show how to identify errors, narrow datasets, and produce more thorough answers to research questions.

When we work on analysis projects, whether for clients or for blog content, we use our database of over 225,000 trials. Each trial has over 20 datapoints, so we are looking at over 4.5 million pieces of information to tell a story.

When working with data at such volume, especially data that comes from human input, as is the case with public trial records, it’s not a matter of if there are errors but where the errors are. For that reason, we look at data with skepticism, and we filter and sort it on many different dimensions until we're confident we know where the errors lie.

A quick point to inject here: with large enough datasets you don’t need to identify every error. Doing so can be an unnecessary use of time (though at BrackenData we do clean up errors when we add to our database, and have cleaned up over 10,000 errors to date). Eliminate the big ones. With enough data, the few errors and outliers that remain have little effect, and you will draw the same conclusions.
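As a rough illustration of why the big errors matter more than the small ones, here is a minimal Python sketch. The enrollment counts are made up for this example and are not real trial records, and comparing the mean against the median is my own choice of illustration, not a description of how BrackenData cleans its data.

```python
from statistics import mean, median

# Made-up enrollment counts for illustration only; not real trial records.
clean = [120, 85, 240, 60, 310, 150, 95, 200, 175, 130]

# The same sample with one large data-entry error added.
dirty = clean + [3_000_000]

# A robust summary (the median) barely moves when one bad value is added.
print("median:", median(clean), "->", median(dirty))

# A sensitive summary (the mean) is dominated by the single bad value.
print("mean:  ", mean(clean), "->", round(mean(dirty), 1))
```

One very large error overwhelms the mean, which is why it is worth removing. The median shifts only slightly, which suggests that smaller leftover errors are unlikely to change the overall picture.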

Each analysis starts with a key question. A popular one we explored in the past is “Who enrolls more subjects: industry sponsors or academia?” To provide a thorough enough answer, we need to answer the question with more questions. Who’s asking the question and why? What data do I have that is most relevant? Which pieces of the data I am looking at are irrelevant?

As an example of the analysis process, I will answer the question “What is the largest clinical trial to date in 2016?”


The Short Answer

The largest submitted trial to date in 2016 is titled “Linkage of Medicaid Enrollment Information to Surveillance, Epidemiology and End Results Data”. The trial is sponsored by the National Cancer Institute (NCI) and is enrolling 3,217,145 subjects.

I found this by using TrialFinder, BrackenData’s search and visualization tool. Another option is to use the advanced search on CT.g to look at trials submitted since Jan 1, 2016, but there isn't a quick way to sort by enrollment size.
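If you have exported trial records to work with locally, the date filter and the sort by enrollment size can be sketched in a few lines of Python. This is only a sketch: the record layout, the field names (`first_submitted`, `enrollment`), and the placeholder trials are assumptions for illustration and do not reflect the TrialFinder or CT.g schema.

```python
from datetime import date

# Placeholder records: titles, dates, and counts are invented for illustration.
trials = [
    {"title": "Trial A", "first_submitted": date(2015, 11, 3), "enrollment": 9000},
    {"title": "Trial B", "first_submitted": date(2016, 2, 14), "enrollment": 450},
    {"title": "Trial C", "first_submitted": date(2016, 5, 9), "enrollment": 12000},
    {"title": "Trial D", "first_submitted": date(2016, 1, 1), "enrollment": 75},
]


def largest_since(records, cutoff, top_n=3):
    """Return up to top_n records submitted on or after cutoff, largest enrollment first."""
    recent = [r for r in records if r["first_submitted"] >= cutoff]
    return sorted(recent, key=lambda r: r["enrollment"], reverse=True)[:top_n]


top = largest_since(trials, date(2016, 1, 1))
for r in top:
    print(r["title"], r["enrollment"])
```

Looking at the top few results rather than only the first makes it easier to spot a suspicious value sitting at the head of the list.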

A deeper look shows that the enrollment numbers for this study are listed as “anticipated” and not actual enrollment numbers. This is an observational study and its phase is labeled as “N/A”.

At this point you and I are probably asking the same question: Over 3 million subjects!? Immediately, we are both skeptical about the data.

To get a more thorough answer we need to ask ourselves more questions and refine our original research question.


Refining the Question

I can’t say with certainty that an anticipated enrollment size of over 3 million is incorrect, but it is likely an error. At the least, it’s a number that invites skepticism. If I were performing a broader analysis on this dataset for a client, this is an outlier I would want to remove. I’d also want to find others like it and take those out.

Beyond its size, what’s the giveaway that it doesn’t belong in our analysis? The enrollment number is labeled “anticipated”. On CT.g, investigators can submit “anticipated” or “actual” enrollment numbers, and they sometimes forget to update them after the trial is complete. If we want concrete data, let’s include only studies with actual enrollment numbers.
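Keeping only actual enrollment numbers is a one-line filter once the records are in hand. In the sketch below, the field names are assumptions, and the two records simply restate figures quoted in this post. The Baxalta trial is treated as having actual enrollment because the post reports it as complete and includes it under that criterion.

```python
# Field names are assumptions; the two records restate figures quoted in this post.
trials = [
    {"sponsor": "National Cancer Institute (NCI)",
     "enrollment": 3_217_145, "enrollment_type": "anticipated"},
    {"sponsor": "Baxalta US",
     "enrollment": 603, "enrollment_type": "actual"},
]


def with_actual_enrollment(records):
    """Keep only records whose enrollment figure is reported as actual."""
    return [r for r in records
            if r["enrollment_type"].strip().lower() == "actual"]


kept = with_actual_enrollment(trials)
print([r["sponsor"] for r in kept])
```

Normalizing the label with `strip()` and `lower()` is a small guard against the inconsistent capitalization and spacing that human-entered fields tend to have.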

Let’s also ask ourselves more questions. Who’s asking the question? What data do we have that’s relevant?

The question is for the sake of our blog readers, who are mainly concerned with industry-sponsored trials and not observational data. Because our audience is educated on the subject, we can agree that lumping all phases together isn’t indicative of much. A Phase 3 trial is almost always going to be bigger than its Phase 1 counterpart (although look at our previous post about Phase 1 studies exceeding 800 subjects).

Therefore, let’s refine our key research question: what is the largest industry-sponsored interventional trial in 2016 to date, by phase?


The Better Answer

The largest Phase 1 trial with our criteria is sponsored by Baxalta US and titled “Assessment of Pharmacokinetics and Safety of M923 Administered Via Auto-injector or Prefilled Syringe, in Healthy Subjects”. Its status is complete, and the study involved 603 subjects. Read more about it here. And yes, they enrolled 603 subjects in 6 months! We checked with the study director, a friend of BrackenData.

The largest Phase 2 trial with our criteria is a Novartis-sponsored trial originating in Germany titled “Determination of Accuracy in Measurement of Total Immunoglobulin E Using a Test Device in Atopic Subjects”. This one is also marked complete, and involved 193 subjects. Read more about it here.

The largest Phase 3 trial with our criteria originated in China and is titled “A Phase III Clinical Trial of An Inactivated Quadrivalent Influenza Vaccine in Healthy Subjects Aged 3 Years and Older”. It was sponsored by Jiangsu Province Centers for Disease Control and Prevention and enrolled 3,664 subjects. And yes, another very fast recruiting study! Read more here.

There’s a Phase 3 trial in the USA of similar size that’s worth noting, which you can read about here.

Finally, the largest Phase 4 trial in 2016 to date is an interesting one titled “Incidence of Lactose Intolerance Among Self-Reported Lactose Intolerant People” which was sponsored by a2 Milk Company. This study involved 600 subjects. It’s marked completed, but the study results have yet to be posted. Read about it here.



Instead of answering our key research question with the first item that matched our criteria we:

  1. Questioned whether the item was relevant
  2. Identified what about it was irrelevant
  3. Removed every study we deemed irrelevant from the dataset (anything marked Phase N/A or with anticipated enrollment)
  4. Split the data into contextual buckets (phase)
  5. Provided an answer to our research question for each contextual bucket
  6. Included notable results that didn’t directly answer the question (such as we did with Phase 3)
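Steps 3 through 5 above can be sketched in Python as a filter followed by a group-by-phase and a maximum per group. The field names are assumptions, and the records restate the trials named in this post. Marking the four phased trials as "actual" enrollment is also an assumption, based on the criteria the post says it applied. Step 6 is left out because the post gives no figure for the similar-sized US Phase 3 trial.

```python
from collections import defaultdict

# Field names are assumptions. Figures restate the trials named in this post.
trials = [
    {"sponsor": "National Cancer Institute (NCI)", "phase": "N/A",
     "enrollment": 3_217_145, "enrollment_type": "anticipated"},
    {"sponsor": "Baxalta US", "phase": "Phase 1",
     "enrollment": 603, "enrollment_type": "actual"},
    {"sponsor": "Novartis", "phase": "Phase 2",
     "enrollment": 193, "enrollment_type": "actual"},
    {"sponsor": "Jiangsu Province Centers for Disease Control and Prevention",
     "phase": "Phase 3", "enrollment": 3_664, "enrollment_type": "actual"},
    {"sponsor": "a2 Milk Company", "phase": "Phase 4",
     "enrollment": 600, "enrollment_type": "actual"},
]


def is_relevant(record):
    """Step 3: drop anything marked Phase N/A or with anticipated enrollment."""
    return record["phase"] != "N/A" and record["enrollment_type"] == "actual"


def largest_by_phase(records):
    """Steps 4 and 5: bucket relevant records by phase, then take the largest in each."""
    buckets = defaultdict(list)
    for record in records:
        if is_relevant(record):
            buckets[record["phase"]].append(record)
    return {phase: max(group, key=lambda r: r["enrollment"])
            for phase, group in sorted(buckets.items())}


answer = largest_by_phase(trials)
for phase, record in answer.items():
    print(phase, record["sponsor"], record["enrollment"])
```

Keeping the relevance rule in its own function makes it easy to tighten the criteria later, for example by adding a check on sponsor type or study type, without touching the grouping logic.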

As a result, our audience has more meaningful insight into the research we conducted and finds more credibility in the response.

Do you have a question or analysis you’d like us to explore in a blog post? We’d be happy to do it. Leave us a comment with your idea.
