“We reproduced #DeepSeek R1-Zero in the CountDown game, and it just works
Through RL, the 3B base LM develops #SelfVerification and #search abilities all on its own
You can experience the Ahah moment yourself for < $30” — #JiayiPan
Beginning to see replications of DeepSeek … “learns to allocate more thinking time to a problem by re-evaluating its approach” … this is described as the “aha moment”.
#AI / #ReinforcementLearning https://github.com/Jiayi-Pan/TinyZero / https://x.com/jiayi_pirate/status/1882839370505621655?s=46 / https://youtube.com/watch?v=e659KrxxN5w
Toot by peterrenshaw@ioc.exchange