TL;DR: Once trained, AI will be able to generate new data comparable in quality to its training data, thereby rendering that training data absolutely worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.
You don’t need to use all of the output for training if you can separate out the good parts. OpenAI reportedly paid for RLHF to do this filtering (and now gets it for free from users); Anthropic is developing RLAIF to achieve the same.
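One common way to "separate the good parts" is best-of-n rejection sampling: generate several candidate outputs and keep only the one a reward model scores highest. A minimal sketch below, where `generate_candidates` and `reward` are stubs standing in for a real LLM and a real RLHF-trained reward model:

```python
def generate_candidates(prompt, n=4):
    # Stub generator: real code would sample n completions from an LLM.
    return [f"{prompt} -> draft {i}" for i in range(n)]

def reward(text):
    # Stub reward model: real code would score the text with an
    # RLHF-trained model. Here we just read the trailing draft index.
    return int(text.rsplit(" ", 1)[-1])

def best_of_n(prompt, n=4):
    # Keep only the highest-scoring candidate for the training set.
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=reward)

print(best_of_n("Explain RLHF"))
```

The same loop works whether the reward signal comes from humans (RLHF) or from another model judging against a set of principles (RLAIF); only the `reward` function changes.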
Look into WizardLM. The researchers who trained it basically gave ChatGPT a bunch of algorithm-defined prompts, scraped the chat logs, and used them to train another model. Here is a link to their paper describing the process in detail.
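The loop they describe can be sketched in a few lines: repeatedly rewrite seed instructions with evolution rules, send the evolved prompts to a teacher model, and keep the (prompt, answer) pairs as new training data. This is a toy illustration, not WizardLM's actual code; `teacher()` is a stand-in for a real ChatGPT API call, and the rewrite templates are invented examples:

```python
import random

# Hypothetical rewrite rules in the spirit of WizardLM's prompt evolution.
EVOLUTIONS = [
    "Add one concrete constraint to: {}",
    "Rewrite to require multi-step reasoning: {}",
    "Make this rarer and more specific: {}",
]

def evolve(instruction, rng):
    # Apply one randomly chosen rewrite rule to the instruction.
    return rng.choice(EVOLUTIONS).format(instruction)

def teacher(prompt):
    # Stub for the teacher model (e.g. ChatGPT) that answers each prompt.
    return f"answer to: {prompt}"

def build_dataset(seeds, rounds=2, rng=None):
    # Evolve the prompt pool for several rounds, collecting
    # (prompt, answer) pairs to train a student model on.
    rng = rng or random.Random(0)
    pool = list(seeds)
    pairs = []
    for _ in range(rounds):
        pool = [evolve(p, rng) for p in pool]
        pairs += [(p, teacher(p)) for p in pool]
    return pairs

data = build_dataset(["Sort a list in Python"])
print(len(data), "training pairs")
```

The key point for the thread: the "new" training data here is manufactured from an existing model's outputs, not bought from a data holder.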
Ah, that makes sense. So new data is being added, just in a different form.