TL;DR: Once trained, AI will be able to generate new data comparable in quality to its own training data, thereby rendering any training data absolutely worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.

  • rinkan 輪姦@burggit.moe · 1 year ago

    The author’s larger point about the non-viability of data cartels may be correct, but the claim that AIs can be trained on their own output seems wrong. If an AI is giving incorrect output, adding that output to the training data will just reinforce the error, not correct it.

    • RA2lover@burggit.moe · 1 year ago

      You don’t need to use all the output for training if you can separate out the good parts. “OpenAI” reportedly used paid RLHF for this (and is now using free feedback); Anthropic is trying to develop RLAIF to achieve the same.
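
      As a toy sketch of that “separate the good parts” step (the reward scorer below is a made-up stand-in, nothing like OpenAI’s or Anthropic’s actual pipelines):

      ```python
      # Keep only generations a reward model scores highly, in the spirit of
      # RLHF/RLAIF-style filtering. toy_reward is an invented stand-in scorer.

      def toy_reward(text: str) -> float:
          """Stand-in reward: pretend longer answers that end cleanly score higher."""
          return min(len(text) / 100, 1.0) * (1.0 if text.endswith(".") else 0.5)

      def filter_generations(generations: list[str], threshold: float = 0.6) -> list[str]:
          """Keep only outputs the (stand-in) reward model rates above threshold."""
          return [g for g in generations if toy_reward(g) >= threshold]

      outputs = [
          "Short.",
          "A longer, complete answer that actually explains the reasoning step by step.",
      ]
      print(filter_generations(outputs))  # only the high-scoring output survives
      ```

      The kept outputs get folded back into the training set, while the low-scoring ones are discarded instead of reinforcing the model’s errors.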

    • SquishyPillow@burggit.moeOP · 1 year ago

      If AI-generated data is curated, I believe it can be used for further AI training. Curation itself can be covertly crowdsourced by deploying LLM bots on social media and keeping only the generated messages that receive the most likes/upvotes/whatever for training.
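
      A toy version of that filter (the message/upvote layout here is made up for illustration):

      ```python
      # Toy upvote-based curation: bot-generated messages are kept for training
      # only if the crowd liked them. The data structure is hypothetical.

      messages = [
          {"text": "Generated reply A", "upvotes": 42},
          {"text": "Generated reply B", "upvotes": 1},
          {"text": "Generated reply C", "upvotes": 17},
      ]

      MIN_UPVOTES = 10  # arbitrary cutoff; a real deployment would have to fight vote noise

      training_set = [m["text"] for m in messages if m["upvotes"] >= MIN_UPVOTES]
      print(training_set)  # ['Generated reply A', 'Generated reply C']
      ```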

      I should also mention that synthetic-data curation has already been shown to work to some degree. WizardLM is trained on the evol-instruct dataset, a synthetic dataset generated by ChatGPT. You can read more about how the dataset and model were created here. And if you want to evaluate WizardLM itself, the model is available in GGML format in various sizes here.
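
      Roughly, the evol-instruct loop looks like the sketch below (my paraphrase of the recipe; ask_llm is a placeholder for whatever chat-completion call you use, and the real pipeline also filters out failed evolutions):

      ```python
      # Minimal sketch of the evol-instruct idea behind WizardLM: repeatedly ask
      # a strong model to rewrite an instruction into a harder one, then answer it.

      EVOLVE_PROMPT = (
          "Rewrite the following instruction to make it more complex, e.g. by "
          "adding constraints or requiring deeper reasoning, without changing "
          "its topic:\n\n{instruction}"
      )

      def ask_llm(prompt: str) -> str:
          raise NotImplementedError("plug in your ChatGPT or local-model call here")

      def evolve(seed_instructions: list[str], rounds: int = 3) -> list[tuple[str, str]]:
          """Return (instruction, response) pairs of increasing difficulty."""
          dataset = []
          for instruction in seed_instructions:
              for _ in range(rounds):
                  instruction = ask_llm(EVOLVE_PROMPT.format(instruction=instruction))
                  dataset.append((instruction, ask_llm(instruction)))
          return dataset
      ```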

  • SmolSlime@burggit.moe · 1 year ago

    Will there be a point where additional training data won’t improve AI any further? 🤔

    • SquishyPillow@burggit.moeOP · 1 year ago

      There is a point where more data yields diminishing returns and might even backfire. ChatGPT has likely already reached this point and will not improve without changes to the model architecture.

      Also, additional data may bias the usefulness of a generative model towards specific use cases. Fine-tuning an LLM on nothing but Python code will make it better at generating Python code, for example, but won’t improve its ability to do ERP or other story-driven tasks.
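
      For intuition on the diminishing returns: empirical scaling laws model loss as a power law in dataset size plus an irreducible floor, so each extra order of magnitude of data buys less. A toy illustration (the constants are invented, not fitted to any real model):

      ```python
      # Toy power-law loss curve L(D) = E + B / D**beta; all constants invented.

      E, B, beta = 1.7, 400.0, 0.3  # irreducible loss, scale, data exponent

      def loss(tokens: float) -> float:
          return E + B / tokens ** beta

      for tokens in [1e9, 1e10, 1e11, 1e12]:
          print(f"{tokens:.0e} tokens -> loss {loss(tokens):.3f}")
      # 2.498, 2.100, 1.900, 1.800: each 10x of data helps less near the floor E
      ```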

      • soulnull@burggit.moeM · 1 year ago

        Can confirm. It seems counterintuitive, but more data needs more resources and more indexing, and leaves more room for errors.

        In my experiments with RVC, I’ve tried all sorts of dataset sizes, and my 2-hour datasets take forever and produce subpar results. 5-15 minutes’ worth of speech data is the sweet spot. No amount of training seems to fix the big sets; overtraining is counterproductive, and the model just can’t seem to figure out what to do with all of that data.
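
        If you want to sanity-check where a dataset lands before burning hours of training, a quick duration tally does it (paths made up; stdlib wave module, so .wav files only):

        ```python
        # Total up a speech dataset's duration against the 5-15 minute sweet spot.
        import wave
        from pathlib import Path

        total_seconds = 0.0
        for path in Path("dataset/").glob("*.wav"):
            with wave.open(str(path), "rb") as f:
                total_seconds += f.getnframes() / f.getframerate()

        print(f"total speech: {total_seconds / 60:.1f} min")  # aim for ~5-15, not hours
        ```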

        Granted, different models have different strengths and will certainly give different results. But how many times have you been researching something and found conflicting pieces of information? If it’s 1 out of 10 pieces of data, that’s easy enough to handle, but in a larger dataset that becomes 10 out of 100 conflicting pieces. It’s still 10%, but now there are 10 pieces of data the model has to figure out how to interpret, even if the other 90 agree with each other. Just like us, it can reach a point where there’s simply too much information to deal with.

        Definitely a point of diminishing returns.