That's basically how we do it as humans, too. We may have an overall theme in mind for the song, but usually you're zoomed in on a few seconds of music and adding light effects to it.
I had not, thanks! Using FFT for this has been around for a long time, but combining it with transformers could produce interesting new results.
It's interesting to me that you can have something like this that is "hard to build" but "easy to verify" - humans are really good at telling if something is "off" about the visualization.