Music Generation AI Models

ipsum2 · on Feb 9, 2025

I wonder if this article is AI generated.

> Vocal Synthesis: This allows one to generate new audio that sounds like someone singing. One can write lyrics, as well as melody, and have the AI generate an audio that can match it. You could even specify how you want the voice to sound like. Google has also presented models capable of vocal synthesis, such as googlesingsong.

Google's SingSong paper does the exact opposite. Given human vocals, it produces a musical accompaniment.

mdp2021 · on Feb 9, 2025

Given that Google is mentioned "out of the blue", that «also» seems to indicate that what was mistaken is '«vocal»': [You can have vocal synthesis given music as an input, and] Google has also presented models capable of _music_ synthesis [given vocals as an input], such as googlesingsong

peab · on Feb 9, 2025

Oh, good catch. SingSong should be in the infilling section. There's a model from a Chinese lab that does vocal synthesis, but I forget the name of it!

chaosprint · on Feb 9, 2025

I got into AI music back in 2017, kind of sparked by AlphaGo. Started by looking at machine listening stuff, like Nick Collins' work. Always been really curious about AI doing music live coding.

In 2019, I built this thing called RaveForce [github.com/chaosprint/RaveForce]. It was a fun project.

Back then, GANsynth was a big deal, looked amazing. But the sound quality… felt a bit lossy, you know? And MIDI generation, well, didn't really feel like "music generation" to me.

Now, I'm thinking about these things differently. Maybe the sound quality thing is like MP3 at first, then it becomes "good enough" – like a "retina moment" for audio? Diffusion models seem to be pushing this idea too. And MIDI, if used the right way, could be a really powerful tool.

Vocal synthesis and conversion are super cool. Feels like plugins, but next level. Really useful.

But what I really want to see is AI understanding music from the ground up. Like, a robot learning how synth parameters work. Then we can do 8bit music like the DRL breakthrough. Not just training on tons of copyrighted music, making variations, and selling it, which is very cheap.

pier25 · on Feb 9, 2025

Are there models that generate MIDI instead of audio?

IMO this would be much more useful.

vunderba · on Feb 9, 2025

MuseNet by OpenAI used to allow you to do this - but OpenAI took it down over a year ago.

https://openai.com/index/musenet

Also, Synfire is a somewhat difficult to grok DAW designed around algorithmically generating midi motif as building blocks for longer pieces.

https://www.youtube.com/watch?v=OrtJjEiWBtI

It's not particularly well-known but it's been around for many years.

kadushka · on Feb 9, 2025

https://www.aiva.ai generates MIDI and provides editing UI.

verst · on Feb 9, 2025

Lots. For example, there are dozens of models that specifically have been trained on Bach MIDIs to generate new Bach style compositions. However, the generated MIDIs definitely do not sound like Bach :)

I'd link to some specific examples (easy to Google or search on GitHub) but I can't recall which models were more successful than others.

vunderba · on Feb 9, 2025

Almost nobody remembers it, but if you go back far enough, there was a Sid Meier game on the 3DO that algorithmically generated music in the style of Bach called (appropriately enough) CPU Bach.

https://www.youtube.com/watch?v=nJkPWSKuTHI

verst · on Feb 9, 2025

That's awesome! First time I've seen this. And coincidentally until today I had never even heard of the 3DO console. (I myself grew up on Amiga 500)

Having taken a class on Bach style composition in college - I think a rules engine with a random seed would certainly be much more successful at generating Bach style compositions than any neural network-based model ever will be.

vunderba · on Feb 9, 2025

I agree especially given how logically Bach structures his contrapuntal stuff. I also took a class on counterpoint and the professor had the great idea of using Gradus Ad Parnassum as our textbook. Very rewarding class but there's far more approachable books on counterpoint these days!
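To make the "rules engine with a random seed" idea concrete, here is a minimal sketch. It is not how CPU Bach works and it covers only a tiny, simplified subset of first-species counterpoint rules. The cantus firmus, the rule set, and every name in it (`CANTUS`, `candidates`, `allowed_step`, `compose`) are assumptions made up for illustration. It writes one note above each cantus note, allows only consonant intervals, forbids parallel fifths and octaves, limits leaps to a fourth, and uses a seeded shuffle with backtracking to pick among the legal options.

```python
import random

# Hypothetical example cantus firmus in D Dorian, as MIDI note numbers
# (D F E D G F A G F E D).
CANTUS = [62, 65, 64, 62, 67, 65, 69, 67, 65, 64, 62]
WHITE_KEYS = {0, 2, 4, 5, 7, 9, 11}        # pitch classes of the mode
CONSONANT = {3, 4, 7, 8, 9, 12, 15, 16}    # allowed intervals above the cantus (semitones)
PERFECT = {7, 12}                          # perfect fifth and octave


def candidates(cf_note, position, last_index):
    """Notes above cf_note that satisfy the vertical (harmonic) rules."""
    options = []
    for interval in sorted(CONSONANT):
        note = cf_note + interval
        if note % 12 not in WHITE_KEYS:
            continue                       # stay inside the mode
        if position == 0 and interval not in PERFECT:
            continue                       # open on a perfect consonance
        if position == last_index and interval != 12:
            continue                       # close on an octave
        options.append(note)
    return options


def allowed_step(prev_cf, prev_cp, cf, cp):
    """Check the rules that relate two consecutive note pairs."""
    if abs(cp - prev_cp) > 5:
        return False                       # no leap larger than a perfect fourth
    interval, prev_interval = cp - cf, prev_cp - prev_cf
    if interval in PERFECT and interval == prev_interval:
        return False                       # no parallel fifths or octaves
    return True


def compose(cantus, seed):
    """Seeded backtracking search for a counterpoint line above the cantus."""
    rng = random.Random(seed)
    last_index = len(cantus) - 1

    def extend(line):
        pos = len(line)
        if pos == len(cantus):
            return line
        options = candidates(cantus[pos], pos, last_index)
        rng.shuffle(options)               # the seed decides which legal option is tried first
        for note in options:
            if pos == 0 or allowed_step(cantus[pos - 1], line[-1], cantus[pos], note):
                result = extend(line + [note])
                if result is not None:
                    return result
        return None                        # dead end, let the caller backtrack

    return extend([])


line = compose(CANTUS, seed=42)
print(line)
```

The same seed always gives the same line, and a different seed may give a different one. Every output obeys the rules by construction, which is the appeal of this approach for a style as rule-bound as counterpoint.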

verst · on Feb 9, 2025

Now I'm going down the rabbit hole of using a 3DO emulator (Opera) and running the CPU Bach ROM. :)

And here is an interesting patent that Sid Meier and Jeff Briggs filed for their work on C.P.U. Bach: System for real-time music composition and synthesis https://patents.google.com/patent/US5496962A/en

verst · on Feb 9, 2025

Update: Got it running with RetroArch 64 using the 3DO Company Opera core. Found the necessary BIOS to use here: https://github.com/trapexit/3do-bios

I'll leave the ROM search up to whoever is interested :)

jug · on Feb 10, 2025

Cool! IIRC, his game Alpha Centauri also had procedurally generated music.

tolciho · on Feb 9, 2025

Uh, "do not sound like Bach"? That's a regression from what David Cope was doing a few decades ago now.

verst · on Feb 10, 2025

Neural Networks aren't the best solution for every task contrary to popular belief. For Bach in particular I'm sure lots of pre-NN work is much better.

disqard · on Feb 10, 2025

Only one mention of David Cope on HN. Oh well, at least somebody remembers...

adarob · on Feb 10, 2025

http://magenta.withgoogle.com has quite a few

anigbrowl · on Feb 9, 2025

This. Generating audio en masse is everything that's wrong with LLMs, and people trying to use them this way demonstrate a fundamental misunderstanding of music. The whole attraction of music is separate generators in temporary harmony, whether rhythmic, tonal, or timbral. Generating premixed streams of audio ('mixed' implying more than one voice or instrument) completely misses the point of how music is constructed in the first place. Anyone advocating this approach is not worth listening to.

peab · on Feb 9, 2025

From the artist perspective, this is correct.

But there are lots of applications for music which parallel the applications of ai generated images - things that are more commercial in nature. The media is functional, for use cases such as commercials, or social media type videos, where people just need something for the ambiance and don't want to deal with copyright or anything like that.

mdp2021 · on Feb 9, 2025

I am not sure that the internal process could not work through conceiving «temporary harmony[...] rhythmic, tonal, timbral [etc.]».

Furthermore, the sound itself is crucial, so perfect calibration of a perfect sound is definitely a part of what can clearly be sought (when you do not want to leave that to a secondary human process in the workflow).

ganoushoreilly · on Feb 9, 2025

While I mostly agree with you, we know that music is defined by the listener. Who are we to discern what is or isn't music? Do you have the same opinion of text or code generated by or with the assistance of AI?

anigbrowl · on Feb 10, 2025

I think LLMs are great for summary or pastiche text. They do OK with poetry, but obviously this is a pastiche of existing poetry by necessity, since it can't be rooted in human experience (unless you want poems about being a computer, but that's not what most people have in mind). I think they're great for code, although within rather tight limitations in my experience.

The problem with LLMs for music (as currently implemented, not inherently) is people keep training them on complete tracks. They're very obviously being trained by people who are not musicians.

mdp2021 · on Feb 9, 2025

The poster presents criticism against an architectural model.

> Who are we to discern what is or isn't music?

Hopefully, people with good judgement, potentially capable of evaluating products.

The poster is clearly meaning "good music".

> Do you have the same opinion of text or code generated by or with the assistance of

There you go: the same way we note that some NN generated text is missing crucial qualities (e.g. intelligence), or that some NN generated images are missing crucial qualities (e.g. direction), you can surely admit the possibility that some NN generated sound may be missing relevant crucial qualities to the vetting of a good critic.

ganoushoreilly · on Feb 9, 2025

What is Good music though? That's the whole point. Plenty of people listen to stuff I would consider weird and non music, but to them it is.

mdp2021 · on Feb 9, 2025

Well, if they call it "good music" because "they like it", that does not form a theory of music; whereas if they call it "good music" because they recognize it as an expression of good artistic form, and they are of promising judgement, then their theory could be translated into a generative architecture.

ganoushoreilly · on Feb 9, 2025

It's up to the listener to apply whatever semantics they need to as justification. There is no purity test for music. The theory is just that theory.

mdp2021 · on Feb 10, 2025

> The theory is just that theory.

Well no, Feyerabend let himself be called an "anarchist" but clearly there is a "more scientific" and "less scientific" - they cannot give you a lecturing appointment at the LSE or elsewhere to just shrug.

> the listener ... as justification

As justification to what? A producer makes products for different markets: people may sell bars of sugar with appetizers and synthetic flavours, that does not make the product remotely similar to healthy food.

ganoushoreilly · on Feb 10, 2025

Dismissing certain musical forms as lacking artistic validity because they don’t adhere to a predefined theory ignores the cultural, emotional, and contextual factors that shape artistic expression. Just as culinary traditions differ across regions and personal tastes vary, music exists in diverse forms that may not conform to classical or academic frameworks but still hold meaning and value for listeners. And while food has measurable nutritional values and health impacts, even the sugary items you allude to, like candy, still provide some nutritional value, even if it is overwhelmed by the non-nutritional ingredients.

anigbrowl · on Feb 10, 2025

You keep arguing about validity but you're missing the point of how music is made. I'm not talking about 'only humans can make music because they're alive', it's that music is fundamentally combinatorial, even if you're combining recordings of construction equipment to make industrial noise music.

If you're generating the entire thing at once rather than stems or note data, you just have an elevator music generator which inexorably tends toward the lowest common denominator.

ganoushoreilly · on Feb 11, 2025

"you just have an elevator music generator"

No one argued that one isn't of higher or lower quality. They're both music, as is evident by your choice of words. Processed foods are foods, not great for you, but they're still foods.

mdp2021 · on Feb 11, 2025

And,

> The [original] poster is clearly meaning "good music".

xvector · on Feb 9, 2025

I don't really care about those fancy music theory terms.

All that really matters is whether users like what the generator generates

bongodongobob · on Feb 9, 2025

I almost never use MIDI, and beyond chord charts, none of the musicians I know write scores. No one is preventing you from creating in the way you like, get off your high horse. Do whatever makes you happy.

anigbrowl · on Feb 10, 2025

Nobody mentioned MIDI or scores except you. I don't know what you imagine you are responding to.

TheAceOfHearts · on Feb 9, 2025

One obvious area of improvement will be allowing you to tweak specific sections of an AI generated song. I was recently playing around with Suno, and while the results with their latest models are really impressive, sometimes you just want a little bit more control over specific sections of a track. To give a concrete example: I used deepseek-r1 to generate lyrics for a song about assabiyyah, and then used Suno to generate the track [0]. The result was mostly fine, but it pronounced assabiyyah as ah-sa-BI-yah instead of ah-sah-BEE-yah. A relatively minor nitpick.

[0] https://suno.com/song/0caf26e0-073e-4480-91c4-71ae79ec0497

peab · on Feb 9, 2025

Yes. I anticipate that the open source models will pave the way for that, just like we have in painting with stable diffusion.

Fundamentally, a song can be represented as a 2d image without any loss

rubyn00bie · on Feb 9, 2025

Could you elaborate on this? I’m genuinely curious about how one would do that.

aradox66 · on Feb 9, 2025

Not OP but there are a few ways I can imagine this being true:

- the song file stored in binary, printed out line by line

- the sheet music for the song, ie instructions for recreating it

In AI/ML world we're usually thinking about encoding into a series of high dimensional vectors, not sure off the bat how to represent that as a 2d image
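A minimal sketch of the first idea, assuming 16-bit PCM samples and an arbitrary row width. The helper names `to_image` and `from_image` are hypothetical. It only shows that raw samples can be laid out as a 2D grid of pixel values and recovered exactly. It says nothing about how any actual model encodes audio.

```python
import math


def to_image(samples, width):
    """Lay out signed 16-bit samples row by row as pixel values in 0..65535."""
    pixels = [s + 32768 for s in samples]      # shift signed range to unsigned
    pad = (-len(pixels)) % width               # zeros needed to fill the last row
    pixels += [0] * pad
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return rows, len(samples)                  # keep the true length to undo the padding


def from_image(rows, n_samples):
    """Flatten the grid and shift back to signed samples, dropping the padding."""
    flat = [p for row in rows for p in row]
    return [p - 32768 for p in flat[:n_samples]]


# A short 440 Hz tone at an assumed 8 kHz sample rate.
samples = [round(20000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(1000)]
rows, n = to_image(samples, width=32)
assert from_image(rows, n) == samples          # the round trip is exact
```

A grid like this is lossless but not very meaningful to look at. Representations that look more like pictures, such as spectrograms, are only lossless if the phase is kept along with the magnitude.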

o_____________o · on Feb 9, 2025

Suno has select region editing now

vunderba · on Feb 9, 2025

From the article:

> Stem Splitting: This allows one to take an existing song, and split the audio into distinct tracks, such as vocals, guitar, drums and bass. Demucs by Meta is an AI model for stem splitting.

+1 for Demucs (free and open source).

Our band went back and used Demucs-GUI on a bunch of our really old pre-DAW stuff - all we had was the final WAVs and it did a really good job splitting out drums, piano, bass, vocals, etc. with the htdemucs_6s model. There was some slight bleed between some of the stems but other than that it was seamless.

https://github.com/CarlGao4/Demucs-Gui
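For anyone who prefers scripting this over the GUI, here is a rough sketch that wraps the `demucs` command-line tool. It assumes Demucs is installed and on the PATH. The helper names are made up, and the output layout (`separated/<model>/<track name>/<stem>.wav`) is an assumption based on the Demucs README, so check it against your installed version.

```python
import subprocess
from pathlib import Path


def demucs_command(track, model="htdemucs_6s", out_dir="separated"):
    """Build the argument list: -n selects the model, -o the output folder."""
    return ["demucs", "-n", model, "-o", str(out_dir), str(track)]


def expected_stems(track, model="htdemucs_6s", out_dir="separated"):
    """Paths where the stems are assumed to land (layout is an assumption)."""
    names = ["drums", "bass", "other", "vocals"]
    if model == "htdemucs_6s":
        names += ["guitar", "piano"]           # the 6-stem model adds these two
    folder = Path(out_dir) / model / Path(track).stem
    return [folder / f"{name}.wav" for name in names]


def split(track, model="htdemucs_6s", out_dir="separated"):
    """Run the separation and return the assumed stem paths."""
    subprocess.run(demucs_command(track, model, out_dir), check=True)
    return expected_stems(track, model, out_dir)


# Nothing runs until split() is called, e.g. split("old_mix.wav").
print(demucs_command("old_mix.wav"))
```

Passing `model="htdemucs"` gives the 4-stem variant.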

verst · on Feb 9, 2025

I have used the htdemucs_6s a bunch, but I prefer the 4 stem model. The dedicated guitar and piano stems are usually full of really bad artifacts in the 6s model. It's still useful if you want to use it to transcribe the part to sheet music however. Just not useful to me in music production or as a backing track.

My primary use is for creating backing tracks I can play piano / keyboard along with (just for fun in my home). Most of the time I'll just use the 4s model and will keep drums, bass and vocals.

vunderba · on Feb 9, 2025

Yeah I could see that. We had better luck with the 6-stem, maybe it's because we had both rhythm and lead guitar in the mixes, but the 4-stem version didn't work as well for us.

verst · on Feb 9, 2025

It probably also depends on the channel separation for the individual instruments in the final mix and any effects applied. A stereo chorus effect on one of the instruments can really interfere with the separation from other instruments from what I can tell.

Piano (or various keys), organ and some guitars (with effects) have a lot of frequency overlap. The model struggles there.

xvector · on Feb 9, 2025

In the future we may have music gen models that dynamically generate a soundtrack to our life, based off of ongoing events, emotions, etc. as well as our preferences.

If this happens, main character syndrome may get a bit worse :)

vunderba · on Feb 9, 2025

Slightly related, iMuse was an early example of an interactive music engine that mixed and matched audio to what was happening on-screen in a game.

https://en.wikipedia.org/wiki/IMUSE

echelon · on Feb 9, 2025

> code is now being written with the help of LLMs, and almost all graphic design uses photoshop.

AI models are tools, and engineers and artists should use them to do more per unit time.

Text prompted final results are lame and boring, but complex workflows orchestrated by domain practitioners are incredible.

We're entering an era where small teams will have big reach. Small studio movies will rival Pixar, electronic musicians will be able to conquer any genre, and indie game studios will take on AAA game releases.

The problem will be discovery. There will be a long tail of content that caters to diverse audiences, but not everyone will make it.

bayindirh · on Feb 9, 2025

> Small studio movies will rival Pixar...

If you think Pixar is Pixar solely because they have an in-house software stack, you're missing the forest for a small shrub.

echelon · on Feb 9, 2025

They're Pixar because these movies require hundreds of millions of dollars to make.

Good writing and good directing don't need hundreds of millions of dollars.

bayindirh · on Feb 9, 2025

Nope, they're Pixar because they pay an insane amount of attention to detail. From every hair strand to every mimic. One can always notice something so minute but so powerful on every re-watch.

That's what costs millions of dollars.

Yes, they have insane technology behind them, but that's not what enables what they do. Humans enable it. Without the human touch, that technology is just a glorified tech demo.

We're still keen to underestimate what a human adds to the process. We became insane in the pursuit of efficiency.

echelon · on Feb 9, 2025

I wholeheartedly disagree. Pixar does not have a monopoly on attention to detail. They're flush with cash and their leadership has decent taste.

There are so many creators putting in intense work, and doing it on low budgets. You can't claim these folks don't have attention to detail. Check out A24, low and mid budget films, or independent films and you'll see a wide assortment of highly meticulous storytellers.

Pixar, on the other hand, isn't low or mid budget:

    Toy Story - $30 Million

    A Bug’s Life - $120 Million

    Toy Story 2 - $90 Million

    Monsters, Inc. - $115 Million

    Finding Nemo - $94 Million

    The Incredibles - $92 Million

    Cars - $120 Million

    Ratatouille - $150 Million

    WALL-E - $180 Million

    Up - $175 Million

    Toy Story 3 - $200 Million

    Cars 2 - $200 Million

    Brave - $185 Million

    Monsters University - $200 Million

    Inside Out - $175 Million

    The Good Dinosaur - $200 Million

    Finding Dory - $200 Million

    Cars 3 - $175 Million

    Coco - $175 Million

    Incredibles 2 - $200 Million

    Toy Story 4 - $200 Million

    Onward - $175 Million

    Soul - $150 Million

    Luca - Unknown but probably around $150 Million

    Turning Red - $175 Million

    Lightyear - $200 Million

For that amount of money, they had better pay attention to detail.

Miyazaki is doing way more with much less.

Voices of a Distant Star was one person -- Shinkai. That's the kind of thing we'll see more and more of. Small creators reaching audiences and building studios. Gooseworx, psychicpebbles, Vivienne Medrano. That's the algorithm of tomorrow.

AI, as a tool, makes this more possible. One of the first people to do it successfully was Joel Haver, and he's just the first of many to come.

peab · on Feb 9, 2025

Yes well said. Distribution networks are hard to disrupt

ysofunny · on Feb 9, 2025

I think the problem is already discovery.

I disagree that engineers and artists should do more per unit time. Like we need more content per second....

....as if art and real inspiration would ever follow the chaotic beat of human progress

intalentive · on Feb 9, 2025

AI tools can also emulate analog signal processors like guitar amps (e.g. NeuralDSP). I made an emulation of a popular studio EQ that sounds great.

r33b33 · on Feb 10, 2025

Are there any music generation models that work with sheet music or produce sheet music outputs that are actually good?