> but ain't no way we run models meant to run on 8 nvidia A100 on our smartphones in the next 5 years

When I learned about neural networks, the general advice at the time was "you'll only need one hidden layer, with somewhere between the number of your input and output neurons". That was more than 5 years ago, but my point is that both the approach and the architecture change over time. I would not bet on what we won't have in 5 years.