This quote from the CW article is, IMHO, key:
“Unlike human intelligence, it lacks the humility to acknowledge uncertainty,” said Neil Shah, VP for research and partner at Counterpoint Technologies. “When unsure, it doesn’t defer to deeper research or human oversight; instead, it often presents estimates as facts.”
To put it simply, LLMs can never say "I don't know", even when they do not in fact know. Combine that with sycophancy and you are asking for AI to end up providing affirmation for delusions and the like.
My guess is that Anthropic trained Claude on a bunch of popular psychology books and similar material which other models skipped or did not flag as particularly important. Possibly because Claude has in the past been shown to be keen to lie, cheat and attempt blackmail. But I agree with you that, over a longer-term interaction, it is likely that Claude too will be bad in this use case.
I think the training idea is an interesting one, but the way it interacted with me suggested that there was something in the foundational static prompts telling it to behave the way it did.
The scary thing is that we genuinely don't know how Anthropic succeeded where the others failed -- or whether it would repeat the same success if I repeated the experiment.
Thank you for the hard work on this experiment; it really highlights the complexity of building safe, successful models. My hope is that in future this sort of testing would be run before models are released, so that unsafe interactions with users are prevented.
Tech companies continue to deprioritise the pre-work of understanding the intricacies of human cognition and human interaction, because it's easier to blunder on, build the tech and figure out any issues as you experiment with members of the public. Not only do these companies ignore the dangers of skipping this step, they dismiss work like this as hampering progress and innovation. We have seen it in commercial software and app development for decades, and now we are repeating the same mistakes with even more dire consequences.
Thanks for the kind words, and apologies for the slow reply. For what it's worth, I agree with you 100%.
Guy says, "the machine validated my delusion." Buddy, so does your Facebook feed, your barber, and your mother-in-law.