Dr. Isaac Kohane, a computer scientist and physician at Harvard, collaborated with two colleagues to put GPT-4, the newest artificial intelligence model from OpenAI, through its paces in a medical setting.
“I’m surprised to say: better than many physicians I’ve watched,” he writes in “The AI Revolution in Medicine,” a new book co-written with independent journalist Carey Goldberg and Microsoft vice president of research Peter Lee. (The authors say that neither Microsoft nor OpenAI required editorial review of the book, even though Microsoft has invested billions of dollars in OpenAI.)
“Just like me,” he said, the chatbot can identify rare illnesses.
According to the book, GPT-4, which was released to paying subscribers in March 2023, answers US medical licensing exam questions correctly more than 90% of the time. It scores substantially better than the earlier models GPT-3 and GPT-3.5, and even better than some licensed doctors.
GPT-4 is more than just a strong test taker and fact finder, though. It is also an excellent translator: it can translate discharge instructions for a patient who speaks Portuguese, and condense dense technical jargon into something sixth graders can understand.
GPT-4 can also give doctors helpful suggestions about bedside manner, offering tips on how to talk to patients about their conditions in compassionate, clear language, and it can read lengthy reports or studies and summarize them in the blink of an eye, as the authors demonstrate with vivid examples. The technology can even explain its reasoning through problems in a way that seems to require some degree of human-style intelligence.
But if you ask GPT-4 how it does all of this, it will likely tell you that its intelligence is still “limited to patterns in data and does not entail actual understanding or intentionality.” That is what GPT-4 told the book’s authors when they asked whether it could engage in causal reasoning. Yet despite these limitations, as Kohane shows in the book, GPT-4 can closely mimic how doctors diagnose illnesses.
GPT-4’s ability to diagnose like a doctor
In the book, Kohane runs a clinical thought experiment with GPT-4 based on a real case involving a newborn infant he treated some years earlier. After he gave the bot a few key details from a physical exam, along with some information from an ultrasound and hormone levels, it correctly diagnosed congenital adrenal hyperplasia, a 1-in-100,000 condition, “just as I would, with all my years of study and experience,” Kohane wrote.
The doctor was impressed as well as terrified.
“On the one hand, I was having a sophisticated medical conversation with a computational process,” he wrote. “On the other hand, the anxious realization that millions of families would soon have access to this impressive medical expertise, and I couldn’t figure out how we could guarantee or certify that GPT-4’s advice would be safe or effective was equally mind blowing.”
GPT-4 is not always correct – and it lacks an ethical compass
GPT-4 isn’t always reliable, and the book is full of examples of its errors. These range from simple clerical mistakes, such as misstating a BMI the bot had correctly computed moments earlier, to arithmetic and logic blunders, such as incorrectly “solving” a Sudoku puzzle or forgetting to square a variable in an equation. The errors are often subtle, and the system tends to insist it is right even when challenged. It isn’t hard to imagine how a misplaced number or a miscalculated weight could lead to serious mistakes in prescribing or diagnosis.
GPT-4, like earlier GPT models, can also “hallucinate” – a technical term for when an AI makes up responses or ignores instructions.
When the book’s authors asked about this issue, GPT-4 said, “I have no intention of deceiving or misleading anyone, but I occasionally make mistakes or make conclusions based on inadequate or erroneous data. I also lack the clinical judgment and ethical accountability of a human doctor or nurse.”
One cross-check the authors suggest in the book is starting a new session with GPT-4 and having it “read over” and “verify” its own work with a “fresh pair of eyes.” This method can sometimes expose errors, though GPT-4 is reluctant to admit when it is wrong. Another way to catch mistakes is to ask the bot to show its work so you can verify it human-style.
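For readers who want to try this themselves, the "fresh pair of eyes" idea can be scripted. The sketch below is a hypothetical illustration, not code from the book: the `verify_in_fresh_session` function and its prompt wording are my own, and `query_model` is a stand-in for whatever chat API you use (each call representing a brand-new session with no carried-over conversation state).

```python
# Sketch of the cross-check the authors describe: hand GPT-4's earlier
# answer to a fresh session and ask it to verify the work. The function
# name and prompt text are illustrative assumptions, not from the book.

def verify_in_fresh_session(question: str, answer: str, query_model) -> str:
    """Ask a fresh model session to review an earlier answer.

    query_model: a callable that takes a prompt string and returns the
    model's reply. Calling it once, with no prior messages, approximates
    the "fresh pair of eyes" because no earlier context is carried over.
    """
    prompt = (
        "You are reviewing another assistant's work with fresh eyes.\n"
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Check the answer step by step, show your work, and state "
        "clearly whether you agree or what the error is."
    )
    return query_model(prompt)


# Usage with a stubbed model; a real script would call a chat API here.
def stub_model(prompt: str) -> str:
    return "Disagree: the BMI appears to have been transcribed incorrectly."

review = verify_in_fresh_session(
    "What is the patient's BMI?",
    "BMI is 32.1",
    stub_model,
)
print(review)
```

As the authors note, this only sometimes works: the second session can repeat the first session's mistake, so a human check of the shown work remains essential.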
GPT-4 has the potential to free up precious time and resources in the clinic, allowing clinicians to be more present with patients “instead of their computer screens,” the authors write. But, they say, “we must push ourselves to imagine a world in which machines become smarter and smarter, eventually exceeding human ability in practically every dimension. And then we must consider how we want that world to work.”