LLMs have no model of correctness, only typicality.
-
LLMs have no model of correctness, only typicality. So:
“How much does it matter if it’s wrong?”
It’s astonishing how frequently both providers and users of LLM-based services fail to ask this basic question — which I think has a fairly obvious answer in this case, one that the research bears out.
(Repliers, NB: Research that confirms the seemingly obvious is useful and important, and “I already knew that” is not information that anyone is interested in except you.)
1/ https://www.404media.co/chatbots-health-medical-advice-study/
-
Despite the obviousness of the larger conclusion (“LLMs don’t give accurate medical advice”), this passage is…if not surprising, exactly, at least really really interesting.
2/
-
There’s a lesson here, perhaps, about the tangled relationship between what is •typical• and what is •correct•, and what it is that LLMs actually do:
When medical professionals ask medical questions in technical medical language, the answers they get are typically correct.
When non-professionals ask medical questions in a perhaps medically ill-formed vernacular mode, the answers they get are typically wrong.
The LLM readily models both of these things. Though it has no notion of correctness in either case, correctness is more statistically typical in one than in the other.
3/
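To make that concrete, here is a toy sketch, assuming the Hugging Face transformers library and the small general-purpose gpt2 checkpoint (nothing like the chatbots in the study); the prompts and the continuation are invented. It scores the same answer text under a clinical-register prompt and under a vernacular one, the point being only that the conditional probability of identical text shifts with the register of the prompt.

```python
# Toy sketch: score the same continuation under two prompt registers.
# Assumes the Hugging Face `transformers` library and the small, general-purpose
# `gpt2` checkpoint; the prompts and continuation are made-up illustrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log P(token_i | tokens_<i) for every position after the first
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_lp = log_probs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that belong to the continuation.
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()

clinical = ("Patient presents with acute pleuritic chest pain and dyspnea. "
            "Differential diagnosis:")
vernacular = "hey my chest kinda hurts when i breathe, what is it??"
answer = " Pulmonary embolism should be considered."

print("clinical  :", continuation_logprob(clinical, answer))
print("vernacular:", continuation_logprob(vernacular, answer))
```

With a model this small the absolute numbers mean nothing; the sketch only illustrates the conditioning-on-register mechanism.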
-
It has been so hard to explain that to family members who ask about LLMs. "But it's right most of the time" is one of the most common responses when I talk about how there is no internal sense of reality or truth, so they need to check every output to be sure.
-
@inthehands Obvious to me. Having the same family doctor who knows you all for 20 years really is important and an immense privilege.
-
@inthehands This is why experienced developers can make use of LLMs, and why LLMs won't replace them.
-
@inthehands This result makes sense - they generate *statistically likely* text based on a prompt, and the stolen words of basically the entire internet and several libraries' worth of books.
If the prompt is such that the text it generates is statistically likely to be correct - the language used closely aligns with a medical textbook, diagnostic manual, etc. - it's more likely to generate text based on sources like that.
If it sounds like a tweet, you're more likely to get a shitpost.
-
@inthehands It has no concept of what is correct, real, valuable, or meaningful - only what is statistically likely given a particular prompt.
Which is a problem - because if you ask it a question, you need to know the correct answer, or have the means to verify it.
Because it has no idea what the correct answer is.
If you don't know enough to be able to verify the result, then you can't trust it.
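A minimal sketch of that last point, under invented assumptions: ask_llm is a stand-in for some chat model, the reference table is a made-up trusted source, and the string check is deliberately crude. The shape of the idea is that the model's text is only surfaced once something independent of the model confirms it.

```python
# Minimal sketch of "verify it or don't trust it". Everything here is a
# hypothetical stand-in: the reference table, the canned ask_llm reply,
# and the crude string check are for illustration only.
from typing import Optional

TRUSTED_MAX_SINGLE_DOSE_MG = {"paracetamol_adult": 1000}  # assumed reference values

def ask_llm(question: str) -> str:
    # Stand-in for a chat-model call; returns plausible but unverified text.
    return "A typical maximum single adult dose is 1000 mg."

def verified_answer(question: str, reference_key: str) -> Optional[str]:
    """Return the model's text only if an independent source confirms the number in it."""
    reply = ask_llm(question)
    expected = TRUSTED_MAX_SINGLE_DOSE_MG.get(reference_key)
    if expected is not None and f"{expected} mg" in reply:
        return reply   # independently confirmed
    return None        # cannot verify, so do not trust it

print(verified_answer("How much paracetamol can an adult take at once?", "paracetamol_adult"))
```
-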
@inthehands I continue to be well-served by treating LLMs as fancy autocomplete and not anthropomorphizing them. I feel like the chat interface is where things went sideways, making it too easy to believe that they "think"
-
@inthehands An aside. When people used to ask Dawn wasn’t it hard to treat animals because “they can’t tell you what’s wrong,” she’d answer that they also can’t lie about it. She thought the latter probably outweighed the former.
-
@inthehands chatbots are terrible, period.
-
@inthehands Worth noting, however, that when the training set captures a lot of outdated or irrelevant information, because the field has advanced rapidly since the model was trained, "typical" can start to diverge again. This can be mitigated if the practitioner knows to consult the latest information (either by reading it or by feeding it to the model as a part of the query) but of course they have to be aware of that. This is I suppose no worse than relying on the practitioner's knowledge.
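A minimal sketch of that mitigation, with build_prompt and query_model as hypothetical placeholders rather than any real product's API: the freshly consulted reference is pasted into the query so the model conditions on it instead of on stale training data.

```python
# Sketch of the mitigation above: prepend up-to-date reference text to the query
# so the model conditions on it rather than on stale training data.
# `build_prompt` and `query_model` are hypothetical placeholders.
def build_prompt(question: str, latest_reference: str) -> str:
    """Put the freshly consulted source ahead of the question."""
    return (
        "Answer using only the reference below; it supersedes anything older.\n\n"
        f"Reference (current guideline):\n{latest_reference}\n\n"
        f"Question: {question}"
    )

def query_model(prompt: str) -> str:
    # Stand-in for whatever chat model is in use; not a real API call.
    return "(model output would appear here)"

print(query_model(build_prompt(
    "What is the current first-line treatment?",
    "...text of the guideline the practitioner just looked up...",
)))
```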
-
@inthehands OTOH, as practitioners come to rely on stochastic information retrieval for more and more diagnoses, and it keeps confirming what they already know, they may come to assign more weight to the information in the model than is justified, overruling their own second thoughts. ("Computer says...")
-
@inthehands One of the factors in this mess is the heavily-boosted notion that LLMs contain facts or knowledge. Coincidentally, sort of, but not really. A safer mental model is to think of them as a fuzzy virtual machine of sorts, not unlike a vibe-y JVM but programmed in something dressed as plain language. Garbage-in-garbage-out. Often anything-in-garbage-out.
-