Discussion about this post

Francis Turner:

This quote from the CW article is, IMHO, key:

“Unlike human intelligence, it lacks the humility to acknowledge uncertainty,” said Neil Shah, VP for research and partner at Counterpoint Technologies. “When unsure, it doesn’t defer to deeper research or human oversight; instead, it often presents estimates as facts.”

To put it simply, LLMs can never say "I don't know", even when they do not in fact know. Combine that with sycophancy and you are asking for AI to end up affirming delusions and the like.

My guess is that Anthropic trained Claude on a bunch of popular psychology books and similar material that other models skipped or did not flag as particularly important, possibly because Claude has in the past been shown to be keen to lie, cheat and attempt blackmail. But I agree with you that, over a longer-term interaction, Claude is likely to be bad in this use case too.

Hannah Grace:

Thank you for the hard work on this experiment; it really highlights the complexity of building safe, successful models. My hope is that in the future this sort of testing will be run before models are released, preventing unsafe interactions with users.

Tech companies continue to deprioritise the pre-work of understanding the intricacies of human cognition and human interaction because it's easier to blunder on, build the tech and figure any issues out by experimenting on members of the public. Not only do these companies ignore the dangers of skipping this step, they will also dismiss this work as hampering progress and innovation. We have seen it in commercial software and app development for decades, and now we are repeating the same mistakes with even more dire consequences.
