Researchers at Mass General Brigham found that artificial intelligence software like ChatGPT could help speed up the screening process to find patients eligible for clinical trials—though they cautioned that additional safeguards would be necessary.
Using a version of OpenAI’s GPT-4 program through Microsoft’s Azure cloud service, the researchers found that a tailored generative AI application was able to quickly comb through patient notes within electronic medical records and accurately identify those with heart failure who met the criteria for a study.
That trial, named COPILOT-HF, aims to determine if a virtual clinic approach can help remotely steer more patients onto guideline-recommended medication regimens for heart failure. The researchers designed a set of 13 prompts to help the program determine if a person could be enrolled based on their medical chart data.
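One way to picture the prompt-per-criterion idea is the minimal Python sketch below. It is not the researchers' application. The criterion names and questions are hypothetical placeholders, since the study's 13 prompts are not reproduced here, and `ask_model` stands in for whatever call sends a question and a patient's notes to a language model. The sketch also assumes that a patient qualifies only when every inclusion criterion is met and no exclusion criterion applies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str          # yes/no question posed about the chart notes
    required_answer: bool  # True for an inclusion criterion, False for an exclusion


# Hypothetical criteria for illustration only; not the study's actual prompts.
CRITERIA: List[Criterion] = [
    Criterion("symptomatic_heart_failure",
              "Does the patient have symptomatic heart failure?", True),
    Criterion("on_hospice_care",
              "Is the patient currently receiving hospice care?", False),
]

# Assumed interface: takes a question and the notes, returns a yes/no answer.
AskModel = Callable[[str, List[str]], bool]


def screen_patient(notes: List[str], ask_model: AskModel,
                   criteria: List[Criterion] = CRITERIA) -> Tuple[bool, Dict[str, bool]]:
    """Ask one question per criterion and combine the answers."""
    answers = {c.name: ask_model(c.question, notes) for c in criteria}
    eligible = all(answers[c.name] == c.required_answer for c in criteria)
    return eligible, answers


def keyword_stub(question: str, notes: List[str]) -> bool:
    """Toy stand-in for a model call: a plain keyword match, no AI involved."""
    text = " ".join(notes).lower()
    keyword = "hospice" if "hospice" in question.lower() else "heart failure"
    return keyword in text


if __name__ == "__main__":
    print(screen_patient(["Follow-up visit for heart failure."], keyword_stub))
```

Keeping the per-criterion answers alongside the final verdict makes it easier for a human reviewer to see which question drove a rejection.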
When tested against a set of 1,894 patients, with an average of 120 written notes apiece, the AI program was between 97.9% and 100% accurate when measured against an expert clinician’s conclusions. By comparison, manual reviews by trained-but-unlicensed human study staff delivered an accuracy rate between 91.7% and 100%, according to the study.
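Accuracy here means agreement with the expert clinician's answer. A minimal Python sketch of that calculation follows; the function name and the example labels are made up for illustration and are not data from the study.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of cases where a screener's answer matches the reference answer."""
    if not reference or len(predictions) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


if __name__ == "__main__":
    # Made-up answers for five patients on a single criterion.
    screener = [True, False, True, True, False]
    clinician = [True, False, True, False, False]
    print(accuracy(screener, clinician))
```

The same function could score either the AI program or the study staff, since both are compared with the same clinician reference.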
The researchers also pegged the cost of the AI’s review at an average of 11 cents per patient. That is orders of magnitude below traditional manual screening, where annual costs can reach into the tens of thousands of dollars and vary significantly by study type and phase.
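To give a sense of scale, the short Python sketch below multiplies the reported average cost by the size of the test set. Both figures come from the article, but applying the average to the full 1,894-patient set is an assumption made here for illustration, and the function name is hypothetical.

```python
AI_COST_PER_PATIENT_USD = 0.11  # average reported by the researchers
PATIENTS_IN_TEST_SET = 1894     # size of the test set in the study


def total_screening_cost(n_patients: int, cost_per_patient: float) -> float:
    """Estimated total cost in dollars, rounded to the nearest cent."""
    return round(n_patients * cost_per_patient, 2)


if __name__ == "__main__":
    total = total_screening_cost(PATIENTS_IN_TEST_SET, AI_COST_PER_PATIENT_USD)
    print(f"${total:.2f}")  # $208.34
```

Under that assumption, screening the whole test set would cost a little over $200.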
“Screening of participants is one of the most time-consuming, labor-intensive, and error-prone tasks in a clinical trial,” said co-lead author Ozan Unlu, a clinical informatics fellow at Mass General Brigham. The study’s results were published earlier this week in the New England Journal of Medicine’s AI-focused publication.
However, the researchers noted that employing AI can bring risks, such as introducing or reinforcing ethnic or racial biases or missing nuances within doctors’ notes. They said that once the technology is built into routine operations, its use should be closely monitored, with humans double-checking its output.
“We saw that large language models hold the potential to fundamentally improve clinical trial screening,” said co-senior author Samuel Aronson, executive director of IT and AI solutions for Mass General Brigham Personalized Medicine. “Now the difficult work begins to determine how to integrate this capability into real-world trial workflows in a manner that simultaneously delivers improved effectiveness, safety and equity.”