Ancestors

Written by Andrew W Swan on 2025-01-26 at 15:59

There's a standard question I ask to test LLMs: produce an Agda proof that there are infinitely many prime numbers. None of them have succeeded. Here is the result for DeepSeek's R1 model. R1 is a "chain of thought" model, on par with OpenAI's o1. However, unlike OpenAI, DeepSeek makes the chain of thought publicly available, giving us some insight into the process it follows to reach its answer.
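For reference, the mathematical content of the question is Euclid's classical argument. What the models are asked to formalise in Agda looks roughly like the following sketch, written here in Python for brevity; the function names are my own illustrations, not taken from any library:

```python
import math

def smallest_factor(n):
    """Return the smallest factor d > 1 of n >= 2.

    Such a d always exists (n divides itself) and is necessarily
    prime: any proper factor of d would be a smaller factor of n.
    """
    d = 2
    while n % d != 0:
        d += 1
    return d

def new_prime(primes):
    """Euclid's argument: given any finite list of primes, the
    smallest factor of (their product) + 1 is a prime not in the
    list, since each listed prime divides the product and so leaves
    remainder 1 when dividing product + 1."""
    p = smallest_factor(math.prod(primes) + 1)
    assert all(p != q for q in primes)
    return p
```

For example, `new_prime([2, 3, 5])` returns 31, while `new_prime([2, 3, 5, 7, 11, 13])` returns 59, the smallest factor of 30031. Note that the new prime need not be the product plus one itself; the hard step to formalise constructively is exactly the existence of `smallest_factor`.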

=> More information about this toot | More toots from aws@mathstodon.xyz

Written by Andrew W Swan on 2025-01-26 at 16:02

It starts off well, coming up with a valid plan and starting to fill out the details. It's clear that this was not written by a person with a completely solid understanding of basic number theory, Agda, or constructive mathematics, but it is a sufficiently close approximation that it seems promising.

=> View attached media | View attached media

=> More information about this toot | More toots from aws@mathstodon.xyz

Written by Andrew W Swan on 2025-01-26 at 16:05

It struggles a lot trying to show that every number ≥ 2 has a prime factor. It observes that it can do this by finding the smallest factor > 1, and hallucinates a function in the Agda standard library that does just that.
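The lemma the model is after is easy to state computationally: the smallest factor greater than 1 of any n ≥ 2 is prime. A sketch in Python, as a hedged illustration only; `minimal_divisor` echoes the name the model hallucinated, and nothing below comes from the Agda standard library:

```python
def minimal_divisor(n):
    """The behaviour the hallucinated Agda function would need:
    the smallest d > 1 dividing n, for n >= 2. (The name is the
    model's invention; no such function exists in the library.)"""
    return next(d for d in range(2, n + 1) if n % d == 0)

def is_prime(n):
    """Naive trial-division check, used only to test the lemma."""
    return n >= 2 and all(n % d != 0 for d in range(2, n))

# Lemma: minimal_divisor(n) is always prime. If it had a factor e
# with 1 < e < minimal_divisor(n), then e would also divide n,
# contradicting minimality.
```

Checking the lemma over a range, e.g. `all(is_prime(minimal_divisor(n)) for n in range(2, 1000))`, holds. Formalising the "contradicts minimality" step in Agda is where a library lemma would help, and where the model invented one instead.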

It uses language suggestive of someone who is looking up information, particularly "Yes, here's the relevant part." Although R1 has an internet search tool, it is not being used during the chain of thought. The function definitions are simply what it imagines might be in the Agda standard library; its guesses are mostly accurate, but not in this case.

=> View attached media | View attached media

=> More information about this toot | More toots from aws@mathstodon.xyz

Toot

Written by Andrew W Swan on 2025-01-26 at 16:08

The hallucinated function minimal-divisor appears in the final answer, so, feigning ignorance, I ask it where it is defined. This time it is even more suggestive of someone looking up information, with "Let me check the Agda standard library documentation" and "Wait, no. Let me double-check." Once again the web search function is not being used during the chain of thought.

The final answer contains a specific Agda version and a valid link to a module in the standard library source code, a module that does not in fact contain the claimed function.

=> View attached media | View attached media

=> More information about this toot | More toots from aws@mathstodon.xyz

Descendants

Written by Andrew W Swan on 2025-01-26 at 16:10

Finally, I make sure the web search tool is turned on and explicitly ask it to use the tool to give me the source code for this function. It uses the search tool first, then hands the information over to the chain-of-thought mode, which correctly observes that the function is not in the search results.

However, based purely on its own earlier output, it convinces itself that the function is in the library regardless.

=> View attached media | View attached media

=> More information about this toot | More toots from aws@mathstodon.xyz

Written by Andrew W Swan on 2025-01-26 at 16:10

Despite these problems, I feel that chain-of-thought LLMs are getting close to being able to solve this kind of problem. The particular issues that come up here could be solved either by tool use during the chain of thought to actually look at the API, or by more training, in this case specifically on the Agda standard library. OpenAI have announced some impressive benchmarks for their new o3 model, so maybe that will be the first to succeed on this question.

=> More information about this toot | More toots from aws@mathstodon.xyz

Written by Jason Carr on 2025-01-26 at 16:20

@aws Yeah, I think the methodology will help a lot. I've seen conflicting results (it might be theory vs. practice) about how they are impacted by hallucinations. The longer chain of thought means it's more likely for one to sneak in, and this can often doom the chain when they get stuck on a line of thinking.

=> More information about this toot | More toots from jasonc@types.pl

Written by sodakaidotcom on 2025-01-26 at 21:55

@aws This seems like a useful benchmark, but I only began learning to write proofs a couple of months ago when I got early access to my discrete maths course. Also, I don't know how to code in Agda, and I have at least 4 other languages that I need to get proficient at asap. It seems like a fun language to math around in, but someone recommended Haskell for Project Euler, so idk which I would learn next after my required ones.

But I do love to benchmark models and uncommon benchmarks are better

=> More information about this toot | More toots from sodakaidotcom@mathstodon.xyz

Proxy Information
Original URL
gemini://mastogem.picasoft.net/thread/113895520642375240