I appreciate that you've analyzed GPT3's inability to rhyme well. And maybe some of that comes from BPE. But saying that "GPT3 _cannot_ rhyme" is a strong and unsupportable statement. What, really, would the difference be between memorizing a rhyming dictionary and being able to apply it, vs "actually" learning to rhyme? Because GPT3 can certainly do the former, so why can't it do the latter?
Now if you ran an experiment comparing an MLM (or any LM) on rhyming tasks with different encodings, then you could certainly make a statement like "BPE is worse at rhyming than other encodings" and it would be scientifically supportable. And that very well might be true. But your extreme conclusion is not supportable.
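To make the BPE concern concrete, here's a toy sketch. The vocabulary and greedy longest-match tokenizer below are invented for illustration (real BPE learns its merges from corpus statistics, and this is not GPT3's actual vocabulary); the point is just that two words which rhyme as character strings can end up with no token-level overlap at all.

```python
# Toy illustration of how subword tokenization can hide a rhyme.
# The vocabulary and greedy longest-match rule are invented for this
# example; real BPE derives merges from corpus frequency statistics.
VOCAB = {"out", "gra", "wab", "be",
         "a", "b", "e", "g", "o", "r", "t", "u", "w"}

def tokenize(word, vocab=VOCAB):
    """Greedy longest-match segmentation, a crude stand-in for BPE."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# As character strings, the two words share the rhyming suffix "-abe":
assert "wabe".endswith("abe") and "outgrabe".endswith("abe")

# As token sequences, that shared suffix disappears:
print(tokenize("wabe"))      # ['wab', 'e']
print(tokenize("outgrabe"))  # ['out', 'gra', 'be']
```

A model that only ever sees `['wab', 'e']` and `['out', 'gra', 'be']` gets no surface cue that these words rhyme, which is the mechanism by which BPE *might* degrade rhyming. Whether it actually does, and by how much, is exactly what the encoding-comparison experiment above would measure.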
What would the difference be? That's very obvious - just think about any kind of comic or light verse!
A rhyme dictionary would still not replicate human rhyming capabilities. Think about neologisms or misspellings. A model can memorize every single entry in the rhyming dictionary (and let's say this somehow cashes out as apparent rhyming proficiency in being able to for every word recall an entry in the rhyme dictionary of valid other words), but it would not be able to write something like "Jabberwocky" inventing a bunch of new words or phrases or names which rhyme. (How would it know to rhyme "wabe" and "outgrabe" when they appear in no dictionaries - because they were just invented?) A model which has "actually" learned to rhyme would be able to take new words (not necessarily invented by it, but possibly invented by humans after it was trained, or invented on the spot for a prompt, or part of a new fictional work like worldbuilding) and rhyme them appropriately. A model which has memorized a rhyming dictionary would not.
I'm genuinely confused why you take such an extreme position on this issue. You seem like you understand some things about how neural networks operate. So I'd assume you understand their ability to generalize from training examples to new situations they've never literally seen before - what is commonly referred to as "generalization" in ML, which is really the key concept in the entire field. But for some reason you've decided this simply can't apply to rhyming for the world's most advanced language model. Your choice, buddy.