Toot

Written by Jan :rust: :ferris: on 2025-01-29 at 20:54

‘Interpretability’ and ‘alignment’ are fool’s errands: a #proof that controlling misaligned large language models is the best anyone can hope for (Oct 2024)

https://link.springer.com/epdf/10.1007/s00146-024-02113-9?sharing_token=3rRsMdH1UdNlnQQmqcozX_e4RwlQNchNByi7wbcMAY6BPcdnFvH2qz5wBW_MWVfuP6XiIjUuOaAkAylpJWYcV7ptG9FICiiQIbLIhD2SwblgAxawIQPIF0AJKzQaIv6wgiJRo4GtMJ1R6AewBvbdRekNszwAag_WVijImJA2dlc%3D

"This paper [...] show[s] that it is empirically impossible to reliably interpret which functions a large language model (#LLM) #AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible."

This affects much more than just alignment!

#LLMs #Paper #ArtificialIntelligence
