Article: Data Protectionism Is Self-Defeating

SquishyPillow@burggit.moe · 1 year ago

SquishyPillow@burggit.moe · edited 1 year ago

I believe AI-generated data can be used to train AI further, provided it is curated. The curation itself can be covertly crowdsourced: deploy LLM bots on social media, then keep only the generated messages that receive the most likes/upvotes/whatever as training data.
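To make the selection step concrete, here is a rough sketch of what "keep only the most-upvoted messages" could look like. Everything in it is made up for illustration: the `Post` record, the `curate` function, and the `keep_fraction` and `min_upvotes` thresholds are my own assumptions, not part of any real pipeline, and a real setup would also need deduplication and spam filtering.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical record for one bot-generated message and its score."""
    text: str
    upvotes: int


def curate(posts, keep_fraction=0.25, min_upvotes=1):
    """Return the texts of the top-scoring posts, best first.

    keep_fraction and min_upvotes are illustrative knobs, not tuned values.
    """
    # Drop anything that did not clear the minimum score.
    eligible = [p for p in posts if p.upvotes >= min_upvotes]
    # Rank the remainder by score, highest first.
    ranked = sorted(eligible, key=lambda p: p.upvotes, reverse=True)
    # Keep the top fraction, but at least one post when any are eligible.
    keep = max(1, round(len(ranked) * keep_fraction)) if ranked else 0
    return [p.text for p in ranked[:keep]]


if __name__ == "__main__":
    sample = [
        Post("a", 10),
        Post("b", 0),
        Post("c", 3),
        Post("d", 7),
        Post("e", 1),
    ]
    print(curate(sample, keep_fraction=0.5))
```

The surviving texts would then be appended to the training set. Upvotes are a noisy proxy for quality, so the thresholds would need tuning per platform.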

I should also mention that curated synthetic data has already shown some success. WizardLM is trained on the evol-instruct dataset, a synthetic dataset generated by ChatGPT. You can read more about how the dataset and model were created here. If you want to evaluate WizardLM yourself, the model is available in GGML format in various sizes here.