TL;DR: Once trained, AI will be able to generate new data comparable in quality to its training data, thereby rendering any training data worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.

  • rinkan 輪姦@burggit.moe
    1 year ago

The author’s larger point about the non-viability of data cartels may be correct, but the claim that AIs can be trained on their own output seems wrong. If an AI gives incorrect output, adding that output back into the training data will only reinforce the error, not correct it.

    • RA2lover@burggit.moe
      1 year ago

You don’t need to use all the output for training if you can separate out the good parts. “OpenAI” reportedly paid for RLHF to do this (and now collects it for free from users); Anthropic is developing RLAIF to achieve the same thing.
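The filtering idea above can be sketched in a few lines: generate many candidate outputs, score each with a reward signal, and keep only the best for further training. This is a toy illustration, not OpenAI's or Anthropic's actual pipeline; `generate_samples` and `reward` are hypothetical stand-ins for an LLM sampler and a preference-trained reward model.

```python
def generate_samples(prompt, n):
    # Toy generator: a real system would sample an LLM n times.
    return [f"{prompt} -> answer {i}" for i in range(n)]

def reward(sample):
    # Toy scoring rule: real RLHF uses a model trained on human
    # preference labels; RLAIF replaces the humans with an AI judge.
    return len(sample)

def curate(prompt, n=8, keep=2):
    # Keep only the top-scoring outputs — the "good parts" —
    # as candidates for the next round of training data.
    samples = generate_samples(prompt, n)
    ranked = sorted(samples, key=reward, reverse=True)
    return ranked[:keep]

print(curate("2+2"))
```

The point of the sketch is only the shape of the loop: sample wide, score, discard most of it.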

• SquishyPillow@burggit.moe (OP)
      1 year ago

If AI-generated data is curated, I believe it can be used for further training. Curation itself could be covertly crowdsourced by deploying LLM bots on social media and keeping only the generated messages that receive the most likes/upvotes/whatever as training data.
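That crowdsourced curation could look something like this: post bot-generated messages, record engagement, and keep only the top fraction by upvotes. The post records and field names here are hypothetical, purely to illustrate the selection step.

```python
# Hypothetical engagement records for bot-generated posts.
posts = [
    {"text": "generated message A", "upvotes": 42},
    {"text": "generated message B", "upvotes": 3},
    {"text": "generated message C", "upvotes": 17},
]

def top_fraction(posts, frac=0.5):
    # Rank posts by upvotes and keep the top `frac` of them
    # (at least one) as curated training examples.
    ranked = sorted(posts, key=lambda p: p["upvotes"], reverse=True)
    cutoff = max(1, int(len(ranked) * frac))
    return [p["text"] for p in ranked[:cutoff]]

training_data = top_fraction(posts)
print(training_data)
```

Here the upvote count plays the role of a free, crowd-supplied reward signal.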

I should also mention that synthetic data curation has already been shown to work to some degree. WizardLM is trained on the evol-instruct dataset, a synthetic dataset generated by ChatGPT. You can read more about how the dataset and model were created here. And if you want to evaluate WizardLM yourself, the model is available in GGML format in various sizes here.
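For a rough sense of the Evol-Instruct idea behind that dataset: take a seed instruction and repeatedly ask an LLM to rewrite it into a harder variant, collecting every version as synthetic training data. `ask_llm` below is a hypothetical placeholder for a ChatGPT API call, not the actual WizardLM pipeline.

```python
def ask_llm(prompt):
    # Placeholder: a real implementation would call a chat model
    # with an "increase the complexity" rewriting prompt.
    return prompt.split(":")[-1].strip() + " (made harder)"

def evolve(instruction, rounds=3):
    # Each round rewrites the previous instruction into a more
    # complex variant; all versions go into the dataset.
    dataset = [instruction]
    for _ in range(rounds):
        instruction = ask_llm(
            "Rewrite this instruction to be more complex: " + instruction
        )
        dataset.append(instruction)
    return dataset

print(evolve("List three prime numbers."))
```

The curation step (filtering out degenerate rewrites) would sit on top of this loop in a real pipeline.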