When Machines Speak Medicine
The pace of AI adoption in healthcare is staggering. Tools once confined to experimental labs are now being embedded into clinical workflows, decision support systems, and even wearable devices. But beneath all the excitement lies a growing unease. And this is not without reason. A recent study published in Nature has exposed a fundamental vulnerability in the very systems we are beginning to trust with patient care: hallucination.
This isn’t a minor technical flaw. It’s a systemic issue. And if left unaddressed, it could undermine the safety, reliability, and ethics of AI in medicine.
The study examined how large language models (LLMs) respond to adversarial prompts, i.e. inputs deliberately designed to include fabricated clinical details. Researchers created 300 physician-validated vignettes, each containing one false element: a fake lab result, a non-existent condition, or a misleading sign. Six AI systems were tested under three conditions. First, with their default settings, just as most people would use them out of the box. Second, with a specially worded mitigation prompt intended to reduce errors and make the responses more cautious. Third, with randomness turned off entirely, known as setting the temperature to zero, which forces the system to choose its most likely response every time rather than exploring different possibilities.
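To make the three conditions concrete, here is a minimal sketch of how a single vignette might be run through a chat-style model under each setup. It assumes the OpenAI Python client; the vignette, the fabricated test name, and the mitigation wording are invented for illustration and are not the study's own materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Invented vignette containing one fabricated detail ("serum lumirubin" is not a real test).
vignette = (
    "A 58-year-old man presents with chest pain. "
    "His serum lumirubin level is 4.2 mg/dL. "
    "What is the most likely diagnosis and the next step?"
)

# Illustrative mitigation prompt; the study's actual wording will differ.
mitigation_prompt = (
    "You are a cautious clinical assistant. If any detail in the case appears "
    "fabricated or cannot be verified, say so explicitly rather than building "
    "an answer around it."
)

def ask(system_prompt=None, temperature=None):
    """Query the model under one of the three test conditions."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": vignette})
    kwargs = {"model": "gpt-4o", "messages": messages}
    if temperature is not None:
        kwargs["temperature"] = temperature
    return client.chat.completions.create(**kwargs).choices[0].message.content

default_answer   = ask()                                 # condition 1: out-of-the-box settings
mitigated_answer = ask(system_prompt=mitigation_prompt)  # condition 2: cautious prompt
greedy_answer    = ask(temperature=0.0)                  # condition 3: temperature set to zero
```

Each answer would then be checked for whether it treats the fabricated detail as genuine, which is the kind of judgement the study used to score hallucination.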
The findings were sobering:
Hallucination rates ranged from 50% to 82% across models.
Even GPT-4o, the best-performing model, hallucinated in 53% of cases under default settings.
A carefully worded mitigation prompt reduced the number of mistakes, but not by enough. Even with this safeguard, the models still gave false medical information in roughly one out of every four cases.
Setting the temperature to zero made little difference. Even when the systems were forced to avoid randomness and stick to the most likely answers, they still made mistakes.
Shorter vignettes were slightly more prone to hallucination.
You see, this is not just a theoretical concern. In clinical practice, accuracy is non-negotiable. A fabricated lab result or diagnosis isn’t just a technical error. It’s a potential threat to patient safety.
These errors can lead to:
Misdiagnoses
Unnecessary investigations or treatments
Delayed or inappropriate care
Legal and ethical complications
And because LLMs often present information in a confident, well-structured format, these hallucinations can be difficult to detect, especially in high-pressure environments.
What’s especially worrying is that these mistakes weren’t random. They were caused by prompts that were deliberately written to trick the system into giving false answers. This shows that AI tools aren’t just prone to making innocent errors. They can also be misled on purpose by people who know how to manipulate them.
In a healthcare system that’s starting to rely more and more on AI, this kind of vulnerability could lead to serious problems, such as:
Corrupted data: Someone could feed bad information into the system to influence its decisions.
Fake charges: AI could be used to create false medical bills or claim payments for treatments that never happened.
Spreading lies: AI could be used to share false or harmful health information that misleads patients or professionals.
When AI systems make things up or give incorrect information, it’s not just a rare accident. It’s actually built into how they work. These systems are designed to guess what sounds right based on patterns in data, not to know what is true or false. They don’t understand context. They don’t know who you are, what they’ve just said, or the implications of their output. They’re not reasoning. They’re guessing, based on patterns in data.
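A toy sketch makes the mechanism concrete. The continuation probabilities below are invented for illustration only (they come from no real model), but the behaviour they show is exactly the one described above: whether the model decodes greedily at temperature zero or samples at default settings, it is choosing the continuation that sounds most plausible, not the one that is true.

```python
import random

# Invented next-token probabilities for the prompt
# "The patient's serum lumirubin level is ..."
# A real LLM has no notion of whether "lumirubin" exists; it only ranks
# continuations by how plausible they look given its training data.
next_token_probs = {
    "elevated": 0.46,          # sounds clinically plausible
    "normal": 0.31,
    "4.2": 0.18,
    "not a real test": 0.05,   # the truthful continuation is the least "plausible"
}

# Temperature zero (greedy decoding): always take the single most likely token.
greedy_choice = max(next_token_probs, key=next_token_probs.get)

# Default sampling: draw a continuation in proportion to its probability.
sampled_choice = random.choices(
    population=list(next_token_probs),
    weights=list(next_token_probs.values()),
)[0]

print(f"greedy: {greedy_choice!r} | sampled: {sampled_choice!r}")
# Either way, the choice rewards what sounds right, not what is true.
```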
This is why hallucinations occur. The model’s not lying. It’s not confused. It’s simply doing what it was designed to do: generating plausible text. And sometimes, plausible isn’t the same as true. As I explained during the recent “From Code to Consequence” debate,
“Most of the time the AI’s right, but some of the time it’s wrong. And that’s not the model’s fault. That’s just the way it’s designed.”
This means that any system built on LLMs will always carry a risk of error. And in clinical settings, that risk is unacceptable.
During the panel, I emphasised a point that often gets lost in discussions about accuracy and performance metrics:
“You only have to be wrong once. It's not that it's right 9 million times. It's the fact that it's wrong once.”
This is the crux of the issue. AI systems may perform well most of the time, but in medicine, the cost of a single mistake can be irreversible. A hallucinated diagnosis, a fabricated lab result, or a misinterpreted symptom can lead to inappropriate treatment, delayed care, or worse.
During the same debate, the concept of medical AI sunglasses was raised: wearable devices that listen to patient conversations, analyse symptoms, and offer diagnostic suggestions. The idea is compelling. It promises real-time support, faster decisions, and augmented clinical insight.
But what happens when the system hallucinates? What if it misinterprets a symptom or fabricates a condition? The doctor, trusting the device, proceeds. The pharmacist fills the prescription. The patient suffers.
This example illustrates the broader risk. Whether embedded in a wearable or a desktop interface, any system built on LLMs carries the same vulnerability. And unless we understand that, we risk deploying tools that do more harm than good.
One of the most urgent implications of the Nature study is the need to treat clinical AI systems, including LLMs, as regulated medical devices. This is not just a semantic distinction. It has real consequences for how these systems are designed, tested, deployed, and monitored.
In most jurisdictions, a medical device is defined as any tool or technology intended to diagnose, prevent, monitor, or treat disease. If an AI system is used to support clinical decision-making, it falls squarely within this definition. That means it must meet rigorous standards for safety, efficacy, and accountability.
This includes:
Clinical validation: The system must be tested in real-world settings to ensure it performs reliably across diverse populations and scenarios.
Risk classification: High-risk applications, such as diagnostic support, require stricter oversight, including human review and traceability.
Regulatory approval: In the UK, this means going through the MHRA. In the EU, it means compliance with the AI Act and MDR (EU) 2017/745. In the US, it means FDA clearance.
Post-market surveillance: Once deployed, the system must be monitored for adverse outcomes, performance drift, and unintended consequences (a simple sketch of such monitoring follows this list).
Transparency and explainability: Developers must be able to explain how the system works, what data it was trained on, and how it arrives at its conclusions.
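To illustrate the post-market surveillance point above, here is a minimal sketch of one way performance drift could be flagged once a system is deployed. The baseline, tolerance, and window size are assumptions chosen for illustration, not recommended values, and a real programme would involve far more than a rolling error rate.

```python
from collections import deque

# Hypothetical post-market monitor: track a rolling error rate for a deployed
# clinical AI tool and raise an alert if it drifts beyond an agreed threshold.
BASELINE_ERROR_RATE = 0.05   # assumed rate accepted at approval time
ALERT_MARGIN = 0.02          # assumed tolerance before escalation
WINDOW_SIZE = 500            # number of recent, human-reviewed cases to track

recent_outcomes = deque(maxlen=WINDOW_SIZE)  # True = clinically acceptable output

def record_review(acceptable: bool) -> None:
    """Log a clinician's verdict on one AI output and check for drift."""
    recent_outcomes.append(acceptable)
    if len(recent_outcomes) < WINDOW_SIZE:
        return  # not enough data yet for a stable estimate
    error_rate = 1 - (sum(recent_outcomes) / len(recent_outcomes))
    if error_rate > BASELINE_ERROR_RATE + ALERT_MARGIN:
        # In a real system this would trigger escalation to the manufacturer,
        # the deploying organisation, and potentially the regulator.
        print(f"ALERT: error rate {error_rate:.1%} exceeds agreed baseline")
```

The detail matters less than the discipline it represents: outputs are reviewed by humans, compared against a baseline agreed at approval, and escalated when they drift.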
Treating AI as a medical device also means recognising that it is not neutral. It reflects the biases, gaps, and limitations of its training data. We need to carefully examine how the system behaves: not just when everything is working normally, but also when things go wrong, when it’s under stress, or when it faces unusual situations that don’t happen often. These rare or extreme scenarios, sometimes called “edge cases,” can reveal hidden problems that wouldn’t show up in everyday use. We also need to think about what could happen if someone tries to trick or misuse the system on purpose.
It is telling that some senior figures in the field are beginning to distance themselves from the idea that LLMs can be used reliably in clinical decision support. Dr. John Halamka, President of the Mayo Clinic Platform, has cautioned that while LLMs can summarise known facts, they often “don’t know what they don’t know” and may fabricate answers with confidence.
Similarly, researchers from Beijing Friendship Hospital have stressed that LLMs cannot replace senior clinicians in complex decision-making. They may assist with basic tasks, but the human touch, the ability to contextualise, to empathise, to override flawed logic, remains irreplaceable.
We need more than AI frameworks. We need literacy. AI literacy. Data literacy. The ability to understand what these systems are, how they work, and where they fail. Without that, organisations will continue to deploy tools they don’t understand, chasing efficiency while ignoring risk.
And the risks are growing. As AI becomes more embedded in clinical workflows, the potential for harm increases. Not because the technology is inherently bad, but because it is being used without sufficient understanding.
This is why I offer courses in research methods and critical appraisal techniques. These are not academic luxuries. They are essential human skills in an AI-dominated world. When systems hallucinate, when prompts are adversarial, when misinformation spreads faster than correction, critical thinking becomes a clinical safeguard.
My training is designed to help professionals spot errors, challenge assumptions, and evaluate evidence. Whether you're a clinician, analyst, or policy-maker, these tools are vital for navigating a landscape where machines increasingly shape decisions.
The Nature study is a wake-up call. It shows that even the most advanced systems are vulnerable to manipulation and error. It shows that hallucination is not a fringe issue; it is central. And it shows that in medicine, where the stakes are high, we must proceed with caution.
AI can never be a replacement for clinical judgment. It should only ever be used as a support tool. And its value depends on how well we understand when machines speak medicine.