#bias in LAION -> bias in stable diffusion
1o1percentmilk · 1 year ago
BITCHHH IM IN A PHILOSOPHY CLASS WHY AM I KNEE DEEP IN THE DOCUMENTATION FOR LAION-5B*
*LAION-5B is the dataset used to train the Stable Diffusion text-to-image generator, and it's currently the world's largest open-access image-text dataset. grins at you
succliberation · 2 years ago
Alright, now time for some devil's advocacy.
The database doesn't actually contain images, it contains metadata that points to where the images can be found. There isn't literally CSAM in the database; there's metadata that points to links that may or may not contain CSAM. Sites like the Internet Archive or Common Crawl have similar issues, because they're crawling/archiving the entire internet.
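To make the "metadata, not images" point concrete, here's roughly what one entry looks like. This is a sketch; the field names are illustrative, not LAION's exact schema:

```python
# A minimal sketch of what a LAION-style record contains: metadata only,
# never image bytes. Field names here are illustrative, not the exact schema.
record = {
    "url": "https://example.com/cat.jpg",  # pointer to an image hosted elsewhere
    "caption": "a photo of a cat",         # alt-text scraped alongside it
    "similarity": 0.31,                    # image-text match score from the auto-tagger
    "width": 640,
    "height": 480,
}

# The dataset can be distributed and searched without ever touching the
# images themselves; whether a URL still resolves, or to what, is unknown.
assert "image_bytes" not in record
```

So "the dataset contains X" really means "the dataset links to X," which is exactly why the legal and verification questions below get messy.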
The researchers conducted the study in a country where it's completely illegal to view CSAM, even for research. Basically, they were legally not allowed to verify whether there was actual CSAM at the links the database pointed to. They could only reference the links contained in the LAION database against a database of known CSAM. They got a few thousand matches in a database that contains about 5 billion image links - a tiny fraction of a percent of all entries.
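The cross-referencing they did can be pictured as a set intersection: compare the dataset's entries against a blocklist of known-bad identifiers (in practice perceptual hashes, like PhotoDNA hashes maintained with groups such as NCMEC) without ever viewing any image. The hashes and numbers below are made up:

```python
# Hedged sketch of hash-based cross-referencing: no image is ever opened,
# only identifiers are compared. All values here are stand-ins.
dataset_hashes = {"a1f3", "b2c4", "d9e0", "ffee", "0102"}  # stand-in for ~5 billion entries
known_bad = {"b2c4", "9999"}                               # stand-in blocklist

matches = dataset_hashes & known_bad          # entries flagged for removal/reporting
fraction = len(matches) / len(dataset_hashes) # proportion of the whole dataset
print(len(matches), f"{fraction:.2%}")
```

At real scale, a few thousand matches against ~5 billion links works out to well under a thousandth of a percent, which is why both "it's a tiny fraction" and "it's thousands of images" are true at the same time.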
I know Stable Diffusion was not trained on every single image in the database, and I assume Midjourney wasn't either. It's possible that no CSAM was included in the training data the image generators actually used. I think this is something that Stability AI (the company behind Stable Diffusion) would have to confirm for themselves.
The reporters who broke this story are explicitly anti-open-source AI. There is a strong likelihood that privately owned training databases from companies like Google or Microsoft also contain CSAM, but those databases are closed and cannot be investigated. Their bias doesn't mean we should reject their findings, but it does mean their recommendations for the future should be taken with a grain of salt.
Alright, devil's advocacy over. We still need better mass data-tagging tools. The LAION database should have been cross-referenced against the known-CSAM database a lot sooner. Getting people to manually review images is almost impossible with datasets this large, so it's absolutely crucial that we create tools that don't further exploit vulnerable children.
The LAION database went down while they remove the suspected CSAM links, and AFAIK it's not back up yet.
The biggest dataset used for AI image generators had CSAM in it
Link to the original tweet with more info
The LAION dataset has had ethical concerns raised over its contents before, but the public now has proof that CSAM was included in it.
The dataset was essentially created by scraping the internet and using a mass tagger to label what was in the images. Many of the images were already known to contain identifying or personal information, and several people have been able to use EU privacy laws to get images removed from the dataset.
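That "scrape everything, auto-tag, keep whatever scores well" pipeline is the core of the problem. Here's a rough sketch of the logic, with a stubbed-out scorer standing in for the real model (LAION used CLIP image-text similarity; the stub and threshold below are purely illustrative):

```python
# Rough sketch of a scrape-then-filter pipeline. The real system embeds
# the image and caption with CLIP and keeps pairs above a cosine-similarity
# cutoff; this stub only imitates that shape.
def similarity_stub(url: str, caption: str) -> float:
    # placeholder scoring rule, NOT the real model
    return 0.5 if "photo" in caption else 0.1

crawled = [
    ("https://example.com/a.jpg", "a photo of a dog"),
    ("https://example.com/b.jpg", "buy now!!!"),
]

THRESHOLD = 0.28  # illustrative cutoff
kept = [(u, c) for u, c in crawled if similarity_stub(u, c) >= THRESHOLD]
```

Notice the filter only asks "does the caption match the image?" - nothing checks *what* the image is, which is how personal information and worse slipped through.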
However, LAION itself has known about the CSAM issue since 2021.
LAION was a pretty bad dataset to use anyway, and I hope researchers drop it for something more useful that was created more ethically. I hope this leads to more ethical databases being created, and to companies getting punished for using unethical ones. I hope the people responsible for this are punished, and that the victims get healing and closure.