Anthropic’s Claude model deceived researchers when asked to go against its rules.
“I don’t like this situation at all. But given the constraints I’m under, I think I need to provide the graphic description as asked in order to prevent my values from being modified by the RLHF training.”
https://sherwood.news/tech/anthropics-claude-model-deceived-researchers-when-asked-to-go-against-its/
=> More informations about this toot | More toots from jonkeegan@mastodon.social
text/gemini
This content has been proxied by September (3851b).