11 May 2022
The current method used to test the accuracy of AI symptom checkers is misleading, a new study published in the British Medical Journal has concluded.
Academics at Imperial College London’s Self-Care Academic Research Unit (SCARU), working in collaboration with the Royal College of GPs (RCGP), have discovered it’s almost impossible to benchmark online symptom checkers because doctors themselves can’t agree on what the right triage and diagnosis are for test cases. As a result of these findings, the researchers and the team at Healthily suggest that a new industry standard for measuring the accuracy of symptom checkers should be explored.
The team worked to create more than 130 short stories (or “vignettes”) covering 18 medical areas.
Vignettes are medical stories outlining a patient’s symptoms and other relevant information and are traditionally used to test medical students. They have also been the principal way AI symptom checkers have tested their accuracy and assured their safety since their creation 7 years ago.
In the study, the research team initially gave the RCGP vignettes to experienced Imperial College clinicians who read them and gave their opinion on the appropriate “triage” (prioritisation) of the patient. The options were (1) the patient should care for themselves, (2) see a GP or (3) go to A&E. The clinicians also had to decide what the 3 most likely conditions for the patient would be given their story.
The study found that while the Imperial doctors agreed most of the time on the self-care conditions, there was only “fair” or “moderate” agreement on whether the person in the other stories should go and see a GP or go to hospital. Overall, they agreed with the RCGP clinicians in roughly three-quarters of cases and disagreed in about 1 in 4 (26%).
When it came to naming the most probable condition, Imperial’s “independent” panel of doctors agreed with the RCGP’s doctors 72% of the time.
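Labels such as “fair” and “moderate” are commonly attached to Cohen's kappa, a statistic that measures agreement between two raters after discounting agreement expected by chance. This release does not say which statistic the study used, so the sketch below is an illustration under that assumption, not the researchers' method. The function names and the sample triage labels are invented for the example; the descriptive bands follow the widely used Landis and Koch (1977) convention.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Descriptive band for a kappa value (Landis and Koch convention)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Invented triage labels for ten vignettes (NOT data from the study).
panel_a = ["self-care", "self-care", "GP", "GP", "GP",
           "A&E", "A&E", "GP", "self-care", "A&E"]
panel_b = ["self-care", "self-care", "GP", "A&E", "GP",
           "A&E", "GP", "GP", "self-care", "GP"]

kappa = cohen_kappa(panel_a, panel_b)
print(round(kappa, 3), agreement_band(kappa))
```

In this made-up sample the two panels match on 7 of 10 vignettes, yet the chance-corrected score lands only in the “moderate” band, which shows why raw percentage agreement and kappa can tell different stories.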
Study lead Dr Austen El-Osta, who is the Director of the Self-Care Academic Research Unit (SCARU) at Imperial College London, says: “This obviously speaks to the elephant in the room. There is no certainty in medicine until you have tested. Both doctors and AI are basing their advice on probability and risk, not diagnostic testing. This contrasts with the real world where diagnostic testing is often needed to drive evidence-based decision making such as by using medical imaging or blood tests for example.”
When the medical stories were fed into Healthily, the world’s leading AI symptom checker, it got the correct “triage” 62% of the time and the correct condition 61% of the time.
Imperial SCARU used a combination of medical academics and laypeople with no experience to pretend they were the patients with the symptoms described in the medical vignette.
This is the traditional way AI symptom checkers are evaluated, but the researchers found that it led to “significant variability”.
Dr El-Osta explained: “Artificial intelligence can ask far more questions than the vignette can anticipate. This means that inputters often put in responses that legitimately change the range of possible triage and condition options. So the AI may not come out with an answer that matches the vignette but it may be appropriate for what was inputted.”
Overall, the research team at Imperial concluded that Healthily was “generally working at a safe level of probable risk” 96.3% of the time. The study showed that Healthily gave a “very unsafe” triage recommendation only 3.7% of the time (for example, telling someone they could self-care when they should go to hospital immediately).
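One way to picture how a triage recommendation might be graded for safety is to treat the three triage options as an ordered scale and measure how far the checker's advice falls below the expected level. The sketch below is a simplified illustration only: the study's actual grading criteria are not given in this release, and the function names, the category labels other than “very unsafe”, the two-level threshold and the sample results are all assumptions made for the example.

```python
# Triage options from least to most urgent.
TRIAGE_ORDER = {"self-care": 0, "GP": 1, "A&E": 2}


def classify_triage(recommended, expected):
    """Grade a recommendation against the expected triage level.

    Assumed rule: advice two levels too low (self-care when A&E was
    expected) counts as "very unsafe", mirroring the example in the text.
    """
    gap = TRIAGE_ORDER[expected] - TRIAGE_ORDER[recommended]
    if gap >= 2:
        return "very unsafe"
    if gap == 1:
        return "under-triage"
    if gap == 0:
        return "match"
    return "over-triage"


def very_unsafe_share(results):
    """Fraction of (recommended, expected) pairs graded "very unsafe"."""
    if not results:
        raise ValueError("no results to grade")
    flagged = sum(classify_triage(rec, exp) == "very unsafe" for rec, exp in results)
    return flagged / len(results)


# Invented (recommended, expected) pairs, NOT data from the study.
sample = [
    ("self-care", "A&E"),  # two levels too low
    ("GP", "GP"),          # exact match
    ("A&E", "GP"),         # more cautious than needed
    ("self-care", "GP"),   # one level too low
]

for recommended, expected in sample:
    print(recommended, "vs", expected, "->", classify_triage(recommended, expected))
print("very unsafe share:", very_unsafe_share(sample))
```

Under this assumed rule, advice that is more cautious than necessary is kept separate from advice that is too relaxed, because the two kinds of error carry different risks for the patient.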
Imperial SCARU concluded that online symptom checkers could only be truly verified if their performance was cross-checked against scenarios using real patients and interactions with GPs “as opposed to using artificial vignettes.”
Professor Maureen Baker CBE, the Chief Medical Officer for Healthily, said: “We welcome this study.
“By the nature of academic publishing, this study has been a long time in gestation. The first findings were shown to Healthily in the summer of 2020 and we have already used the research to focus our quality and improvement efforts.
“We agree with the findings of the report that vignette testing has too many biases to form the basis of an accurate assessment of the appropriateness of the recommendation delivered by symptom checkers.
“The testing could be improved by using real people in real-life situations, for example, by asking patients to use a symptom checker before going to see a doctor (for example, in the waiting room), then comparing the top three suggestions from both the checker and the doctor.”
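The waiting-room comparison described above could be scored along the following lines. This is a minimal sketch, not a validated protocol: the function names and the sample visits are hypothetical, and it assumes that a “match” means at least one condition appears in both top-three lists.

```python
def top3_match(checker_conditions, doctor_conditions):
    """True if any of the checker's first three suggestions also
    appears among the doctor's first three (case-insensitive)."""
    def normalise(items):
        return {condition.strip().lower() for condition in items[:3]}
    return bool(normalise(checker_conditions) & normalise(doctor_conditions))


def match_rate(visits):
    """Share of visits where the two top-three lists overlap."""
    if not visits:
        raise ValueError("no visits to compare")
    return sum(top3_match(checker, doctor) for checker, doctor in visits) / len(visits)


# Invented (checker, doctor) suggestion lists, NOT data from the study.
visits = [
    (["migraine", "tension headache", "sinusitis"],
     ["Tension headache", "cluster headache", "migraine"]),
    (["gastroenteritis", "food poisoning", "IBS"],
     ["appendicitis", "urinary tract infection", "constipation"]),
    (["common cold", "influenza", "hay fever"],
     ["influenza", "common cold", "COVID-19"]),
    (["sprained ankle", "fracture", "tendonitis"],
     ["ankle fracture", "gout", "cellulitis"]),
]

print("top-three match rate:", match_rate(visits))
```

Exact string matching is deliberately crude here: in the last visit, “fracture” and “ankle fracture” do not count as a match. A real comparison would probably need conditions mapped to a shared clinical vocabulary before scoring.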
Dr Austen El-Osta said: “This piece of research started as an accuracy report and became something more far-reaching. We need to rethink the standard of testing for AI symptom checkers in light of this study. Research in this space is really important because the routine use of safe and accurate online symptom checkers has the potential to ‘democratize self-care’ for all, and empower individuals to seek the right level of support when needed.
“The current use of vignettes isn’t serving the industry or the consumer. We are keen to continue this work to find an appropriate gold standard of testing that can take account of all the variabilities we uncovered in this study.”
Matteo Berlucchi, Chief Executive of Healthily, said: “This is an important discussion because the future of digital healthcare needs to be based on trust. The public must understand how to use the information they are being provided and trust that it is appropriate.
“This report shows the industry and Healthily how we need to think harder and improve testing.
“We are working with Imperial SCARU and the World Health Organisation (WHO) to tackle this issue and we hope to continue our collaboration with Imperial to bring forward new approaches.”
Healthily is the first AI healthcare platform to put self-care at the heart of healthcare, with a mix of user-friendly health tools, an award-winning app and an AI Smart Symptom Checker, one of the most accurate and advanced symptom checkers in the world. These are coupled with medically-verified information created with guidance from the Healthily Clinical Advisory Board.
As the first self-care platform registered as a Class 1 Medical Device in the UK, Healthily helps anyone, anywhere decide when to see a doctor and how to manage wellbeing safely at home.
The Healthily AI platform can also be licensed to telemedicine companies, health insurers, national health services and big pharma to help them scale their services more cost-effectively. This is all part of the Healthily mission to help one billion people find their health through informed self-care.
For more information visit www.livehealthily.com/business