Despite these problems, I feel that chain-of-thought LLMs are getting close to being able to solve this kind of problem. The particular issues that came up here could be addressed either by tool use during the chain of thought to actually look at the API, or by more training, in this case specifically on the Agda standard library. OpenAI have announced some impressive benchmarks for their new o3 model, so maybe that will be the first to succeed on this question.