That is a big factor in the accuracy of speech-to-text engines, but I don't think training data is a problem for text-to-speech (which is what the OP was complaining about).
I strongly disagree! Text-to-speech needs the same kind of data as speech-to-text: a well-annotated collection of raw, single-speaker recordings from a variety of speakers, with accompanying text labels.
It is very, very difficult to find a large, well-curated dataset of speech with accompanying text labels. TIMIT is the gold standard of speech recognition despite its simplicity, and it costs ~$150(!) just to get access. Switchboard is another famous one that is bigger, but it is only partially labeled and still very hard to get. CMU has an openly available small dataset called AN4, but it is only 64MB of raw recordings (a lot for 1991, when it was recorded, but small today).
You could use YouTube and its automated captions, but the text is shoddy at best and definitely too unreliable to feed to an algorithm without significant cleansing.
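To give a flavor of the cleansing involved, here is a minimal sketch (the function name and the example line are my own, not from any real caption pipeline): stripping non-speech markers like "[Music]", dropping punctuation, and normalizing case and whitespace. A real pipeline would do far more, e.g. handling timestamps, disfluencies, and recognizer confidence scores.

```python
import re

def clean_caption(line):
    """Normalize one auto-caption line into label-friendly text.

    This is a toy illustration, not a production cleanser.
    """
    line = re.sub(r"\[[^\]]*\]", " ", line)   # drop [Music], [Applause], ...
    line = re.sub(r"[^a-zA-Z' ]", " ", line)  # keep only letters/apostrophes
    line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
    return line.lower()

cleaned = clean_caption("[Music] So, uh... WELCOME back!!")
```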
Honestly, U.S. Congressional speeches may be the best "open" option IMO - good transcriptions, a variety of speakers, and TONS of data. If you could sync up the transcriptions with recordings of C-SPAN it could work very well, though I have not seen a dataset that puts this together. This still only covers English!
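Syncing transcripts with recordings is essentially a forced-alignment problem. A common core of such aligners is dynamic time warping (DTW); the sketch below shows the idea on 1-D toy sequences standing in for acoustic features (real systems align MFCC frames from the audio against features predicted from the transcript, and the sequence values here are made up for illustration).

```python
def dtw(seq_a, seq_b):
    """Return the minimal alignment cost and the warping path
    between two 1-D feature sequences (toy stand-ins for frames)."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame in b
                                 cost[i][j - 1],      # skip a frame in a
                                 cost[i - 1][j - 1])  # match frames
    # backtrack to recover which frames line up with which
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return cost[n][m], path[::-1]

audio_like = [0.0, 0.1, 0.9, 1.0, 0.2]  # stand-in for audio-derived frames
text_like = [0.0, 1.0, 0.0]             # stand-in for transcript-derived frames
total_cost, alignment = dtw(audio_like, text_like)
```

The warping path tells you which audio frames correspond to which transcript positions, which is exactly the mapping you would need to cut C-SPAN audio into labeled utterances.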
Companies understand that data is valuable - that is why they take the rights to it in every EULA! Well curated internal datasets will be the KFC/Coke secret recipes of the "big data" age.
I can't upvote this enough. There are very few companies that are masters of this game, because it is extremely difficult to build. Every stage is a painful process, and there are many moving parts. To begin with, you need very high quality recordings; successful companies have voice directors whose job is to record voices at the most optimal settings (and to make sure the speaker speaks in a neutral, dispassionate tone). Then there is the manual process of segmentation and phoneme alignment, which requires a lot of human effort. Then you run the code, which takes quite some time. Then you study the quality of the synthesis and fix issues manually (for example, if you are using concatenative synthesis, you may have to fix an offending phoneme alignment based on the feedback), and iterate. Unless you have a team, I wouldn't venture into this; I'd rather use one of the third-party leaders.
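The concatenative approach described above can be sketched in a few lines: synthesis is just looking up pre-recorded waveform units per phoneme and splicing them, which is why a single misaligned unit boundary audibly breaks a word and has to be fixed by hand. The unit inventory and its sample values below are fabricated toy data, not a real voice database.

```python
# Hypothetical unit inventory: phoneme -> recorded waveform samples (toy data)
UNIT_INVENTORY = {
    "HH": [0.0, 0.2, 0.1],
    "EH": [0.5, 0.6, 0.5, 0.4],
    "L":  [0.3, 0.2],
    "OW": [0.7, 0.8, 0.7, 0.6, 0.5],
}

def synthesize(phonemes, crossfade=1):
    """Concatenate units, averaging `crossfade` samples at each join
    to soften the audible discontinuity at unit boundaries."""
    out = []
    for ph in phonemes:
        unit = UNIT_INVENTORY[ph]
        if out and crossfade:
            n = min(crossfade, len(out), len(unit))
            for k in range(n):  # blend the overlapping samples at the seam
                out[-n + k] = (out[-n + k] + unit[k]) / 2
            unit = unit[n:]
        out.extend(unit)
    return out

wave = synthesize(["HH", "EH", "L", "OW"])  # roughly "hello"
```

Everything upstream of this lookup (recording, segmentation, phoneme alignment) determines whether the unit boundaries fall in the right places, which is where the manual iteration described above goes.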
I recently discovered vocalid.org thanks to a TED video[1]. They work to provide people with speech impairments with unique synthetic voices, thanks to voice donors. According to their process, a voice donor needs to read 2-3 hours of text to allow them to build a custom synthetic voice.
I don't know if they have any plan to open their data, but if someone could find a way to increase the number of voice donors while improving an open dataset, it would be interesting. Just a thought - there are most certainly many more issues/options in this field.