I wonder if you could break the sequences down into segments (parts), so the AI doesn't have to know how to control the LEDs directly and can instead assemble pre-built segments in accordance with the music.
That's basically how humans do it. We may have an overall idea for a theme across the song, but usually we're zoomed into a few seconds of music, adding light effects to it.
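Roughly what I'm picturing, as a minimal Python sketch (every name here is made up, and real segments would carry actual frame data): the model only emits (start time, segment name) cues, and a dumb composer turns those into a timeline.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    name: str
    duration_s: float  # how long the effect runs
    frames: list = field(default_factory=list)  # pre-rendered LED frames, omitted here

# Hypothetical segment library: pre-built effects the model can reference
# by name, so it never touches raw LED channels.
SEGMENT_LIBRARY = {
    "fade":   Segment("fade", 4.0),
    "chase":  Segment("chase", 2.0),
    "strobe": Segment("strobe", 0.5),
}

def compose_show(cues):
    """Assemble a show from (start_time_s, segment_name) cues.

    `cues` is what the AI would emit after analyzing the music:
    a timeline of segment picks rather than raw LED commands.
    """
    timeline = []
    for start, name in sorted(cues):
        seg = SEGMENT_LIBRARY[name]
        timeline.append((start, start + seg.duration_s, seg.name))
    return timeline

print(compose_show([(4.0, "chase"), (0.0, "fade"), (6.0, "strobe")]))
# [(0.0, 4.0, 'fade'), (4.0, 6.0, 'chase'), (6.0, 6.5, 'strobe')]
```

The nice side effect is that the model's output space becomes small and discrete (segment IDs and timestamps), which is much easier to learn than per-LED control.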
I had not, thanks! Using FFT for this has been around for a long time, but combining it with transformers could yield interesting new results.
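For anyone curious what the FFT side looks like, here's a sketch of a spectral-flux onset curve using a short-time FFT in plain numpy (this is just one common way to do it, and the function name is mine): peaks in the output line up with beats and transients, which is exactly the kind of token sequence you could feed into a transformer.

```python
import numpy as np

def onset_strength(samples, frame=2048, hop=512):
    """Spectral-flux onset strength via short-time FFT.

    Returns one value per hop: how much the magnitude spectrum grew
    since the previous frame. Peaks correspond to beats/transients.
    """
    window = np.hanning(frame)
    n_frames = 1 + (len(samples) - frame) // hop
    prev = None
    flux = []
    for i in range(n_frames):
        chunk = samples[i * hop : i * hop + frame] * window
        mag = np.abs(np.fft.rfft(chunk))
        if prev is not None:
            # Half-wave rectified difference: only count energy increases.
            flux.append(np.maximum(mag - prev, 0.0).sum())
        prev = mag
    return np.asarray(flux)

# Quick check on a synthetic click track (4 clicks/sec at 22050 Hz):
sr = 22050
clicks = np.zeros(sr * 2)
clicks[:: sr // 4] = 1.0
print(onset_strength(clicks).round(2))
```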
It's interesting to me that something like this can be "hard to build" but "easy to verify": humans are really good at telling when something is "off" about the visualization.