
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".

Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.

Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.

https://a.qoid.us/20250325/

[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1



Update: I tried a different version of the prompt and it's doing really well! Well, so far it's gotten out of its house and into Professor Oak's lab, which is not so impressive compared to ClaudePlaysPokemon, but it's a lot more than Gemini 2.0 was able to do with the same prompt.



