There's a standard question I ask to test LLMs: produce an Agda proof that there are infinitely many prime numbers. None of them have succeeded. Here is the result for DeepSeek's R1 model. R1 is a "chain of thought" model, on par with OpenAI's o1. However, unlike OpenAI, DeepSeek makes the chain of thought publicly available, giving us some insight into the process it follows to get the answer. For reference, a sketch of the statement being asked for appears below.
=> More information about this toot | More toots from aws@mathstodon.xyz
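Here is roughly how the statement can be set up. This is my own sketch, written against Agda builtins only so that it assumes nothing about the standard library's API, and it gives only the statement, not the proof the models are asked to produce. Constructively, "infinitely many primes" is phrased as: above every n there is a prime.

```agda
-- A hedged sketch of the statement only (not the proof).  Everything here
-- is defined from Agda builtins, so no standard-library names are assumed.
module PrimesStatementSketch where

open import Agda.Builtin.Nat      using (Nat; zero; suc; _*_)
open import Agda.Builtin.Equality using (_≡_)
open import Agda.Builtin.Sigma    using (Σ)

-- disjoint sum, defined locally
data _⊎_ (A B : Set) : Set where
  inj₁ : A → A ⊎ B
  inj₂ : B → A ⊎ B

-- non-dependent pairs as a special case of Σ
_×_ : Set → Set → Set
A × B = Σ A (λ _ → B)

-- the usual inductive ordering on the naturals
data _≤_ : Nat → Nat → Set where
  z≤n : ∀ {n} → zero ≤ n
  s≤s : ∀ {m n} → m ≤ n → suc m ≤ suc n

-- d divides n when n is a multiple of d
_∣_ : Nat → Nat → Set
d ∣ n = Σ Nat (λ q → n ≡ q * d)

-- a prime is at least 2 and has no divisors other than 1 and itself
Prime : Nat → Set
Prime p = (2 ≤ p) × (∀ d → d ∣ p → (d ≡ 1) ⊎ (d ≡ p))

-- "infinitely many primes", stated constructively:
-- above every n there is a prime
InfinitelyManyPrimes : Set
InfinitelyManyPrimes = ∀ (n : Nat) → Σ Nat (λ p → (n ≤ p) × Prime p)
```

The standard library has its own definitions of primality and divisibility that a real attempt would use instead, but the shape of the goal is the same.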
It starts off well, coming up with a valid plan and starting to fill in the details. It's clear that this was not written by a person with a completely solid understanding of basic number theory, Agda, or constructive mathematics, but it is a sufficiently close approximation that it seems promising.
=> View attached media | View attached media
=> More information about this toot | More toots from aws@mathstodon.xyz
It struggles a lot trying to show that every number ≥ 2 has a prime factor. It observes that it can do this by finding the smallest factor > 1, and it hallucinates a function in the Agda standard library that does just that.
It uses language suggestive of someone who is looking up information, particularly "Yes, here's the relevant part." Although R1 has an internet search tool, it is not being used during the chain of thought. The function definitions are just what it imagines might be in the Agda standard library; they are mostly accurate, but not in this case. A sketch of the kind of helper it would actually have to write follows below.
=> View attached media | View attached media
=> More information about this toot | More toots from aws@mathstodon.xyz
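Nothing like that function ships with the standard library, so a proof along these lines has to define the search itself. Below is a minimal, fuel-bounded sketch of my own (the names least-factor and search are mine, and only Agda builtins are used). The genuinely hard part, which this sketch leaves out, is proving that the result really is prime and really divides the input.

```agda
-- My own fuel-bounded sketch of the kind of helper such a proof needs;
-- nothing like it is provided by the standard library.  Builtins only.
module LeastFactorSketch where

open import Agda.Builtin.Nat  using (Nat; zero; suc; mod-helper)
open import Agda.Builtin.Bool using (Bool; true; false)

-- n mod (suc m), via the builtin helper, so the divisor is manifestly non-zero
_mod_ : Nat → Nat → Nat
n mod zero    = zero                  -- junk case, never reached below
n mod (suc m) = mod-helper 0 m n m

isZero : Nat → Bool
isZero zero    = true
isZero (suc _) = false

-- try candidates c, c+1, ... until one divides n; the fuel argument makes
-- the recursion structural, so the termination checker accepts it
search : (fuel c n : Nat) → Nat
search zero       c n = n
search (suc fuel) c n with isZero (n mod c)
... | true  = c
... | false = search fuel (suc c) n

-- least factor ≥ 2 of n; for n ≥ 2 this is a prime factor of n
least-factor : Nat → Nat
least-factor n = search n 2 n
```

The fuel trick sidesteps a well-founded-induction argument; in a real proof the search would also need to return evidence about its result, not just a number.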
The hallucinated function minimal-divisor appears in the final answer, so, feigning ignorance, I ask it where it is defined. This time it is even more suggestive of someone looking up information, with "Let me check the Agda standard library documentation" and "Wait, no. Let me double-check." Once again the web search function is not being used during the chain of thought.
The final answer contains a specific Agda version and a valid link to a module in the standard library source code, which does not have the claimed function.
=> View attached media | View attached media
=> More information about this toot | More toots from aws@mathstodon.xyz
Finally, I make sure the web search tool is turned on and explicitly ask it to use it to give me the source code for this function. It uses the search tool first, and then hands the information over to the chain of thought mode, which correctly observes that the function is not in the search results.
However, based purely on its own earlier output, it convinces itself that the function is in the library regardless.
=> View attached media | View attached media
=> More information about this toot | More toots from aws@mathstodon.xyz
Despite these problems, I feel that chain-of-thought LLMs are getting close to being able to solve this kind of problem. The particular issues that come up here could be solved either by tool use during the chain of thought to actually look at the API, or by more training, specifically on the Agda standard library in this case. OpenAI have announced some impressive benchmarks for their new o3 model, so maybe that will be the first to succeed on this question.
=> More information about this toot | More toots from aws@mathstodon.xyz
@aws Yeah, I think the methodology will help a lot. I've seen conflicting results (might be theory vs. practice) about how they are impacted by hallucinations. The longer chain of thought means it's more likely for one to sneak in, and oftentimes this can doom the chain when they get stuck on a line of thinking.
=> More information about this toot | More toots from jasonc@types.pl
@aws This seems like a useful benchmark, but I only began learning to write proofs a couple of months ago, when I got early access to my discrete maths course. Also, I don't know how to code in Agda, and I have at least 4 other languages that I need to get proficient at asap. It seems like a fun language to math around in, but someone recommended Haskell for Project Euler, so idk which I would learn next after my required ones.
But I do love to benchmark models, and uncommon benchmarks are better.
=> More information about this toot | More toots from sodakaidotcom@mathstodon.xyz