Toots for aws@mathstodon.xyz account

Written by Andrew W Swan on 2025-01-29 at 09:35

I think that reinforcement learning specifically in go would improve its strength. Even with this I wouldn't expect it to get particularly strong - this is a very inefficient way to make a go AI. It might be worthwhile anyway, because LLMs do have one big advantage over current go AIs: every move comes with a natural language explanation, with even more detail in the chain of thought.

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-29 at 09:35

It gave the impression of someone who had read a lot about go but never played a game, so didn't really understand what the words actually meant. It knew that corner enclosures are good, but not how to use them effectively. Looking again at the lower right, it consistently believed that it was strong there because of the Q4-Q6 enclosure. When asked to sum up its view of the game at the end, it stated that the enclosure was part of a moyo and that it was a weakness of white's to play in that area, which is very far from correct in this case.

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-29 at 09:34

It feels a lot like R1 is not able to visualise the board: for example, it seemed completely blind to the "hole" in its position at R6, thinking nothing of letting white push through there. Maybe this is something that a multimodal model would be better at.

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-29 at 09:34

One more DeepSeek R1 thread (I promise this is the last from me for now). I got R1 to play a game of go against GnuGo, an old and, by today's standards, very weak go AI. I played until I started hitting the rate limit on R1 - 40 moves. That was enough to see that R1 is not very good at go. https://online-go.com/game/71821573

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-26 at 16:10

Despite these problems, I feel that chain-of-thought LLMs are getting close to being able to solve this kind of problem. The particular issues that came up here could be solved either by tool use during the chain of thought to actually look at the API, or by more training - in this case, specifically on the Agda standard library. OpenAI have announced some impressive benchmarks for their new o3 model, so maybe that will be the first to succeed on this question.

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-26 at 16:10

Finally, I make sure the web search tool is turned on and explicitly ask it to use that tool to give me the source code for this function. It uses the search tool first, and then hands the information over to the chain-of-thought mode, which correctly observes that the function is not in the search results.

However, based purely on its own earlier output, it convinces itself that the function is in the library regardless.

=> View attached media | View attached media

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-26 at 16:08

The hallucinated function minimal-divisor appears in the final answer, so, feigning ignorance, I ask it where it is defined. This time the language is even more suggestive of someone looking up information, with "Let me check the Agda standard library documentation" and "Wait, no. Let me double-check." Once again, the web search function is not being used during the chain of thought.

The final answer contains a specific Agda version and a valid link to a module in the standard library source code - a module which does not, in fact, have the claimed function.

=> View attached media | View attached media

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-26 at 16:05

It struggles a lot trying to show that every number ≥ 2 has a prime factor. It observes that it can do this by finding the smallest factor > 1, and hallucinates a function in the Agda standard library that does just that.

It uses language suggestive of someone who is looking up information, particularly "Yes, here's the relevant part." Although R1 has an internet search tool, it is not being used during the chain of thought. The function definitions are just what it imagines might be in the Agda standard library - mostly accurate, but not in this case.
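
To make this concrete, here is roughly the kind of signature it invented. This is my own sketch, not R1's output, and the point is precisely that no function called minimal-divisor exists in the standard library:

```
module MinimalDivisorSketch where

open import Data.Nat using (ℕ; _≤_)
open import Data.Nat.Divisibility using (_∣_)
open import Data.Nat.Primality using (Prime)
open import Data.Product using (∃; _×_)

-- Hypothetical helper in the spirit of what R1 imagined: the smallest
-- factor > 1 of any n ≥ 2, which is necessarily prime. (Minimality is
-- omitted from the type; only divisibility and primality are recorded.)
-- Postulated here - anyone actually using it would have to write it.
postulate
  minimal-divisor : ∀ (n : ℕ) → 2 ≤ n → ∃ λ d → (d ∣ n) × Prime d
```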

=> View attached media | View attached media

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-26 at 16:02

It starts off well, coming up with a valid plan and starting to fill in the details. It's clear that this was not written by a person with a completely solid understanding of basic number theory, Agda, or constructive mathematics, but it is a sufficiently close approximation that it seems promising.
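
For readers who don't know it, the standard constructive plan is Euclid's argument: any prime factor of n! + 1 must be greater than n. A minimal Agda sketch of that skeleton, with the two key lemmas postulated rather than proved (the names are mine, not R1's):

```
module EuclidSketch where

open import Data.Nat using (ℕ; zero; suc; _+_; _*_; _<_; _≤_)
open import Data.Nat.Divisibility using (_∣_)
open import Data.Nat.Primality using (Prime)
open import Data.Product using (∃; _×_)

fact : ℕ → ℕ
fact zero    = 1
fact (suc n) = suc n * fact n

-- Filling in these two postulates is where the real work lies.
postulate
  -- every number ≥ 2 has a prime factor
  prime-factor : ∀ n → 2 ≤ n → ∃ λ p → Prime p × (p ∣ n)
  -- a prime factor of n! + 1 exceeds n, since any prime p ≤ n divides n!
  euclid-step : ∀ n p → Prime p → p ∣ (fact n + 1) → n < p
```

Given these, a prime factor of fact n + 1 is a prime above n, which is exactly the unbounded form of the theorem.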

=> View attached media | View attached media

=> More information about this toot | View the thread

Written by Andrew W Swan on 2025-01-26 at 15:59

There's a standard question I ask to test LLMs: produce an Agda proof that there are infinitely many prime numbers. None of them have succeeded. Here is the result for DeepSeek's R1 model. R1 is a "chain of thought" model, on par with OpenAI's o1. However, unlike OpenAI, DeepSeek makes the chain of thought publicly available, giving us some insight into the process it follows to get to the answer.
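
For concreteness, the statement I have in mind is constructive and looks something like this - a minimal sketch of my own, with the proof itself left as a postulate to be filled in:

```
module InfinitePrimes where

open import Data.Nat using (ℕ; _<_)
open import Data.Nat.Primality using (Prime)
open import Data.Product using (∃; _×_)

-- "There are infinitely many primes", phrased constructively: for
-- every bound n there is a prime strictly above it. The test is to
-- replace this postulate with an actual proof.
postulate
  primes-unbounded : ∀ (n : ℕ) → ∃ λ p → (n < p) × Prime p
```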

=> More information about this toot | View the thread

=> This profile with reblogs | Go to aws@mathstodon.xyz account
