Toot

Written by Jan :rust: :ferris: on 2025-01-29 at 20:54

‘Interpretability’ and ‘alignment’ are fool’s errands: a #proof that controlling misaligned large language models is the best anyone can hope for (Oct 2024)

https://link.springer.com/epdf/10.1007/s00146-024-02113-9?sharing_token=3rRsMdH1UdNlnQQmqcozX_e4RwlQNchNByi7wbcMAY6BPcdnFvH2qz5wBW_MWVfuP6XiIjUuOaAkAylpJWYcV7ptG9FICiiQIbLIhD2SwblgAxawIQPIF0AJKzQaIv6wgiJRo4GtMJ1R6AewBvbdRekNszwAag_WVijImJA2dlc%3D

"This paper [...] show[s] that it is empirically impossible to reliably interpret which functions a large language model (#LLM) #AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible."

This affects much more than just alignment!

#LLMs #Paper #ArtificialIntelligence
