One thing on my mind lately is the availability of training data. Rumor has it that a significant amount of the content OpenAI trained on was publicly available, e.g. Reddit. Some people say that, having trained on Reddit's data, GPT-4 can impersonate most of the distinctive voices you'd find there.
Google has Gmail. It has our search history. It has Google Groups. It has Google Scholar. Didn't they also digitize every book in the Library of Congress, or something like that? The LLM that could be built on their insanely rich data is truly scary to contemplate.