• FaceDeer@kbin.social

    That article doesn’t show what you think it shows. There was a lot of discussion of it when it first came out, and the examples of overfitting the researchers managed to dig up were extreme edge cases that took a huge amount of effort to find. So that people don’t have to follow a Reddit link, from the top comment:

    They identified images that were likely to be overtrained, then generated 175 million images to find cases where overtraining ended up duplicating an image.

    We find 94 images are extracted. […] [We] find that a further 13 (for a total of 109 images) are near-copies of training examples

    They’re purposefully trying to generate copies of training images using sophisticated techniques to do so, and even then fewer than one in a million of their generated images is a near copy.
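
    Just to spell that rate out, here’s a quick sanity check in Python, using only the numbers from the quotes above:

    ```python
    # Numbers as quoted from the paper
    extracted = 94            # directly extracted images
    near_copies = 13          # additional near-copies (109 total)
    generated = 175_000_000   # images generated during the attack

    rate = (extracted + near_copies) / generated
    print(f"{rate:.2e}")                          # ~6.23e-07
    print(f"{rate * 1_000_000:.2f} per million")  # ~0.62 per million
    ```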

    And that’s on an older version of Stable Diffusion trained on only 160 million images. They actually generated more images than were used to train the model.

    Overfitting is an error state. Nobody wants the model to memorize any of its input data, so training sets are deduplicated as thoroughly as possible to prevent it. They had to run this research on an early Stable Diffusion model that was already obsolete when they did the work, because modern Stable Diffusion models have been refined enough to avoid the problem.
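
    For illustration, a minimal sketch of what that deduplication can look like. This version only catches byte-identical files, and the helper name, directory layout, and use of SHA-256 are assumptions for the example; real pipelines go further with perceptual hashes or embedding similarity to catch near-duplicates:

    ```python
    import hashlib
    from pathlib import Path

    def dedupe_exact(image_dir: str) -> list[Path]:
        """Keep one copy of each byte-identical image (hypothetical helper).

        Production dedup uses perceptual or embedding-based similarity,
        not just exact byte matches like this sketch does.
        """
        seen: set[str] = set()
        kept: list[Path] = []
        for path in sorted(Path(image_dir).glob("*.jpg")):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest not in seen:  # first time we see this exact content
                seen.add(digest)
                kept.append(path)
        return kept
    ```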