#bias in LAION -> bias in stable diffusion
1o1percentmilk · 1 year ago
BITCHHH IM IN A PHILOSOPHY CLASS WHY AM I KNEE DEEP IN THE DOCUMENTATION FOR LAION-5B*
*LAION-5B is the dataset used to train the Stable Diffusion text-to-image generator, and it's currently the world's largest open-access image-text dataset. grins at you
succliberation · 2 years ago
Alright, now time for some devil's advocacy.
The database doesn't actually contain images, it contains metadata that points to where the images can be found. There isn't literally CSAM in the database; there's metadata that points to links that may or may not contain CSAM. Sites like the Internet Archive or Common Crawl have similar issues, because they're crawling/archiving the entire internet.
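To make the "metadata, not images" point concrete, here's roughly what one entry looks like. This is a sketch; the field names are illustrative, not LAION's exact schema:

```python
# A minimal sketch of what a LAION-style record contains: metadata only,
# never image bytes. Field names here are illustrative, not the exact schema.
record = {
    "url": "https://example.com/cat.jpg",  # pointer to an image hosted elsewhere
    "caption": "a photo of a cat",         # alt-text scraped alongside it
    "similarity": 0.31,                    # image-text match score from the auto-tagger
    "width": 640,
    "height": 480,
}

# The dataset can be distributed and searched without ever touching the
# images themselves; whether a URL still resolves, or to what, is unknown.
assert "image_bytes" not in record
```

So "the dataset contains X" really means "the dataset links to X," which is exactly why the legal and verification questions below get messy.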
The researchers conducted the study in a country where it's completely illegal to view CSAM, even for research. Basically, they were legally not allowed to verify whether there was actual CSAM at the links the database pointed to. They could only reference the links contained in the LAION database against a database of known CSAM. They got a few thousand matches in a database that contains about 5 billion image links - a tiny fraction of a percent of all entries.
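The cross-referencing they did can be pictured as a set intersection: compare the dataset's entries against a blocklist of known-bad identifiers (in practice perceptual hashes, like PhotoDNA hashes maintained with groups such as NCMEC) without ever viewing any image. The hashes and numbers below are made up:

```python
# Hedged sketch of hash-based cross-referencing: no image is ever opened,
# only identifiers are compared. All values here are stand-ins.
dataset_hashes = {"a1f3", "b2c4", "d9e0", "ffee", "0102"}  # stand-in for ~5 billion entries
known_bad = {"b2c4", "9999"}                               # stand-in blocklist

matches = dataset_hashes & known_bad          # entries flagged for removal/reporting
fraction = len(matches) / len(dataset_hashes) # proportion of the whole dataset
print(len(matches), f"{fraction:.2%}")
```

At real scale, a few thousand matches against ~5 billion links works out to well under a thousandth of a percent, which is why both "it's a tiny fraction" and "it's thousands of images" are true at the same time.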
I know Stable Diffusion was not trained on every single image in the database, and I assume Midjourney wasn't either. It's possible that no CSAM was included in the training data the image generators actually used. I think this is something that Stability AI (the company behind Stable Diffusion) would have to confirm for themselves.
The reporters who broke this story are explicitly anti-open-source AI. There is a strong likelihood that privately owned training databases from companies like Google or Microsoft also contain CSAM, but those databases are closed and cannot be investigated. Their bias doesn't mean we should reject their findings, but it does mean their recommendations for the future should be taken with a grain of salt.
Alright, devil's advocacy over. We still need better mass data-tagging tools. The LAION database should have been cross-referenced against the known-CSAM database a lot sooner. Getting people to manually review images is almost impossible with datasets this large, so it's absolutely crucial that we create tools that don't further exploit vulnerable children.
The LAION database went down while they remove the suspected CSAM links, and AFAIK it's not back up yet.
The biggest dataset used for AI image generators had CSAM in it
Link to the original tweet with more info
The LAION dataset has had ethical concerns raised over its contents before, but the public now has proof that CSAM was included in it.
The dataset was essentially created by scraping the internet and using a mass tagger to label what was in the images. Many of the images were already known to contain identifying or personal information, and several people have been able to use EU privacy laws to get images removed from the dataset.
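That "scrape everything, auto-tag, keep whatever scores well" pipeline is the core of the problem. Here's a rough sketch of the logic, with a stubbed-out scorer standing in for the real model (LAION used CLIP image-text similarity; the stub and threshold below are purely illustrative):

```python
# Rough sketch of a scrape-then-filter pipeline. The real system embeds
# the image and caption with CLIP and keeps pairs above a cosine-similarity
# cutoff; this stub only imitates that shape.
def similarity_stub(url: str, caption: str) -> float:
    # placeholder scoring rule, NOT the real model
    return 0.5 if "photo" in caption else 0.1

crawled = [
    ("https://example.com/a.jpg", "a photo of a dog"),
    ("https://example.com/b.jpg", "buy now!!!"),
]

THRESHOLD = 0.28  # illustrative cutoff
kept = [(u, c) for u, c in crawled if similarity_stub(u, c) >= THRESHOLD]
```

Notice the filter only asks "does the caption match the image?" - nothing checks *what* the image is, which is how personal information and worse slipped through.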
However, LAION itself has known about the CSAM issue since 2021.
LAION was a pretty bad dataset to use anyway, and I hope researchers drop it for something more useful that was created more ethically. I hope this leads to more ethical databases being created, and to companies getting punished for using unethical ones. I hope the people responsible for this are punished, and that the victims get healing and closure.