
Have a look at the post - it explains how it works. There are two models: a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model. The former produces a latent vector, which is then interpreted by the latter to drive the motors.
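A minimal sketch of that two-rate structure (all names and shapes here are illustrative, not the actual system): a slow ~8 Hz model refreshes a latent goal vector, and a fast 200 Hz policy consumes whichever latent is freshest to emit motor commands.

```python
SLOW_HZ = 8      # ~7-9 Hz 7B vision-language model
FAST_HZ = 200    # 200 Hz 80M visuomotor model

def slow_vlm(image):
    # placeholder: encode the scene into a latent goal vector
    return [0.0] * 512

def fast_policy(latent, proprio):
    # placeholder: map latent + proprioception to motor commands
    return [0.0] * 12

def control_loop(steps):
    latent = None
    commands = []
    per_slow = FAST_HZ // SLOW_HZ  # fast ticks per slow update (~25)
    for t in range(steps):
        if t % per_slow == 0:
            latent = slow_vlm(image=None)  # refresh latent at the slow rate
        commands.append(fast_policy(latent, proprio=None))
    return commands
```

In a real system the two loops would run concurrently, with the fast loop reading the latest latent rather than blocking on the slow one.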


> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.

huh. An interesting approach. I wonder if something like this could be used for other things as well, like "computer use": the same concept of a "large" model handling the goals and a "small" model handling clicking and so on at much higher rates. That would be useful for games and the like.


This is typical in real-time applications: a supervisor estimates which operating region the system is currently in and then invokes the corresponding set of lower-level algorithms.
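That supervisor pattern can be sketched like so (a hypothetical example in the spirit of gain scheduling; the controllers and threshold are made up):

```python
def controller_near(err):
    # fine-grained control close to the setpoint
    return 0.5 * err

def controller_far(err):
    # aggressive control far from the setpoint
    return 2.0 * err

def supervisor(err):
    # guess which region the system is in, pick that region's controller
    return controller_near if abs(err) < 1.0 else controller_far

def step(err):
    # one control tick: dispatch to the selected low-level algorithm
    return supervisor(err)(err)
```

The supervisor runs at a low rate and only switches regions; the low-level controllers do the fast work.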



