#LLM dataset
phlebaswrites · 2 months ago
Note
Will filing a DMCA takedown mean that the jackass behind the theft will see my legal name and contact info?
I'm not a lawyer so I can't say for sure, but I think it's likely.
For starters, the takedown notice will go to the company so they'll definitely see your details.
nyuuzyou (the person claiming ownership of the dataset into which they've processed all our unlocked works on AO3) has already clearly indicated that they believe they're in the right, and they're willing to fight against the takedown notices - they filed a counter notice to say as much right after OTW filed the first takedown notice with huggingface (the website to which nyuuzyou uploaded the dataset).
They also tried to upload the dataset to two other websites (thankfully, it has since been removed).
Given that, it's possible (though I can't judge how likely) that these takedown notices might end up in a court of law somewhere, and in such a case nyuuzyou will definitely have access to them - and all of our IRL names.
This is one of the hazards of DMCA takedown notices: they leave fanwork creators to choose between protecting our creations and connecting our IRL and fannish identities, at the risk of doxing. It is also why I've been careful not to say that we must all file takedown notices - in fact, I think that anyone who is in a vulnerable situation most emphatically MUST NOT.
Let me be clear.
DO NOT DO THIS IF IT WILL HURT YOU.
Instead, leave that up to fans like myself who have less to lose and are willing to take that risk.
Right now, what we are doing is engaging in both a legal fight and something of a public awareness campaign.
The huggingface site that is currently hosting this dataset is one facet of Hugging Face, Inc., a well-known French-American company based in New York City that works in the machine learning space. I can't imagine that they want to be known as bad-faith actors who host databases full of stolen material. They are a private company right now, but if their founders ever want to go public (and make a lot of money selling their shares) they would prefer not to be the subject of bad press. It's worth noting that they might already be preparing for an IPO, since their stock seems to be available for purchase on the NASDAQ private market and they raised $235 million in their Series D funding round. This is a company potentially valued at $4.5 billion - they have bigger fish to fry than a bunch of members of the public conducting the legal equivalent of a DDoS on them.
Because that's effectively what we're doing - we are snowing them under with takedown notices that have to be individually replied to and dealt with. We are trying to convince huggingface that deleting the dataset nyuuzyou uploaded is the easier and less problematic option than legally defending nyuuzyou's right to post it.
The other thing that we're doing is making a public anti-AI stand.
We are telling the LLM / Gen AI community that AO3 is not the soft target it might look like - they might be able to crawl the site against site rules and community standards but if they post their datasets publicly for street cred (and that's exactly what nyuuzyou is doing) then we will act to protect ourselves.
The status of fanwork as a legally valid creative pursuit - to be protected and cherished like any other - is a long campaign, and one that the OTW was founded on. When @astolat first proposed AO3, it was the next step in a fight that had been ongoing for years.
I'd been a fan for over a decade before AO3 was founded and I personally don't intend to see it fall to this new wave of assaults.
Though it is interesting to be on this end of a takedown notice for once in my life! 🤣
11 notes · View notes
phlebaswrites · 2 months ago
Text
They got all of my fics too.
I've filed a DMCA takedown notice, but the person who put this dataset together filed a counter notice and said - 13 days ago - that they expect the dataset to be restored in 10-14 business days.
I'd like for that not to happen, so if anyone wants my help putting together a DMCA takedown notice about their own fics being in this dataset, please let me know.
AO3 has been scraped, once again.
As of the time of this post, AO3 has been scraped by yet another shady individual looking to make a quick buck off the backs of hardworking hobby writers. This Reddit post here has all the details and the most current information. In short, if your fic URL ends in a number between 1 and 63,200,000 (inclusive), AND is not archive locked, your fic has been scraped and added to this database.
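For anyone unsure how to check, the ID comparison above can be sketched in a few lines of Python (the URL pattern is the standard AO3 `/works/<id>` form; treat this as a quick sketch, not an official tool):

```python
import re

# AO3 work URLs look like https://archiveofourown.org/works/12345678
WORK_ID = re.compile(r"/works/(\d+)")

def work_id(url):
    """Pull the numeric work ID out of an AO3 URL, or None if absent."""
    m = WORK_ID.search(url)
    return int(m.group(1)) if m else None

def possibly_scraped(url, cutoff=63_200_000):
    """True if the work ID falls in the reported scraped range (1..cutoff).
    Note: archive-locked works were NOT scraped regardless of ID."""
    wid = work_id(url)
    return wid is not None and 1 <= wid <= cutoff

print(possibly_scraped("https://archiveofourown.org/works/1234567"))   # True
print(possibly_scraped("https://archiveofourown.org/works/99999999"))  # False
```

Again: even if the ID is in range, an archive-locked fic was not included.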
I have been trying to hold off on archive locking my fics for as long as possible, and I've managed to get by unscathed up to now. Unfortunately, my luck has run out and I am archive locking all of my current and future stories. I'm sorry to my lovelies who read and comment without an account; I love you all. But I have to do what is best for me and my work. Thank you for your understanding.
36K notes · View notes
aromantic-ghost-menace · 2 months ago
Text
So... I think I've just finished writing a 7.7 or 7.8K-word chapter in 2-3 days.
What the fuck possessed me? Did I finally manage to build up a working writing routine for me?
Did suffering the fanfic writer curse in advance really pay off and work?
Anyways, am currently editing my current stp truth lies AU chapter "The drowned Cage". Will post it archive-locked then. Maybe put up a publicly available encrypted/enciphered version of the fic once I've got Maddening Shackles fully written down and posted.
But breakfast first! Food.
6 notes · View notes
stirdrawsandreblaws · 1 month ago
Text
just lost the post with the lookup tool but. lol. apparently my fanfic writing got scraped for some gen ai model training
they really have no standards
1 note · View note
aromantic-ghost-menace · 2 months ago
Text
I neither use AI nor do I want my work to be used in a dataset for AI training, which is why I started to upload encoded public versions of my fics with a few different ciphers, as inconsistent as possible while still humanly decodable.
I strongly encourage everyone reading this to do the same - not because it will ensure that their fic won't be scraped again or that the cipher won't be broken, but because the more enciphered fics and works there are out there, the more work it takes for those scrapers to assemble the datasets they then turn around and sell.
Don't let them steal your work and earn off of it!
a writing competition i was going to participate in again this year has announced that they now allow AI generated content to be submitted
their reasoning being that "we couldn't ban it even if we wanted to, every writer already uses it anyway"
"Every writer"?
come on
63K notes · View notes
roadmaperp-software · 1 year ago
Text
Hugging face
🤗Hugging Face has revolutionized the world of AI, offering a powerful platform that hosts a robust open-source ecosystem for natural language processing and machine learning models. With approximately 450,000 open-source models and 90,000 datasets, Hugging Face is riding high as the go-to solution for integrating specialized AI & ML tasks into your application. A perfect place for AI/ML professionals to leverage its resources and host their AI models to train and take them to the next level!!
1 note · View note
volixia669 · 1 year ago
Text
I'm gonna have to pop some bubbles.
While a cool idea, AI translating ASL will work less well than Google Translate, and, well, tbh, I had a whole spiel but it was based on translating English into ASL. Which at least would be for the benefit of deaf people, though long story short: it doesn't work. Too complex, and these tools NEVER take deaf culture into consideration when translating.
As for ASL into English... So it's for people who speak ASL but can't "speak", so that hearing people can understand? So it's for the benefit of abled people?
Also like. Some deaf people CAN speak. Some deaf people have personal reasons for not wanting to. Either way, this sort of AI will mess up. Hell, the "thank you" shown? COMPLETELY different from another form of thank you. And there's multiple forms of "I love you" depending on dialect.
Because yeah, there's dialects, and these machine translators do awful with any type of sentence structure, which btw is NOT the same in ASL as it is in spoken English.
You know what's been used since we've had the written word, so that those who can't/won't speak can communicate? Writing.
Something that is way faster than stumbling through a bazillion misunderstandings because an AI doesn't know deaf black people in the US have their own dialect or because the person signing learned from a Canadian group.
If abled people don't LIKE having to wait for someone to write out what they're saying then they can go learn ASL from a deaf person instead of relying on a tech that just makes things harder for disabled people.
A computer science student named Priyanjali Gupta, studying in her third year at Vellore Institute of Technology, has developed an AI-based model that can translate sign language into English.
70K notes · View notes
vlruso · 2 years ago
Text
Meet SynthIA (Synthetic Intelligent Agent) 7B-v1.3: A Mistral-7B-v0.1 Model Trained on Orca Style Datasets
🚀 Exciting news! Introducing SynthIA (Synthetic Intelligent Agent) 7B-v1.3: A Mistral-7B-v0.1 Model Trained on Orca Style Datasets. 🐬🤖 SynthIA-7B-v1.3 is a robust and flexible large language model with 7 billion parameters, ideal for text creation, translation, generating content, and answering questions. It's perfect for researchers, educators, and businesses. 💪 Discover how SynthIA-7B-v1.3 can take your language-related tasks to a whole new level! Check out the blog post below for more information. 👇 [Read the blog post here](https://ift.tt/W5VNZE1) #linguisticintelligence #textgeneration #languagemodels #AI #SynthIA7Bv1-3 #OrcaStyleDatasets List of Useful Links: AI Scrum Bot - ask about AI scrum and agile Our Telegram @itinai Twitter -  @itinaicom
0 notes
river-taxbird · 2 years ago
Text
There is no such thing as AI.
How to help the non technical and less online people in your life navigate the latest techbro grift.
I've seen other people say stuff to this effect but it's worth reiterating. Today in class, my professor was talking about a news article where a celebrity's likeness was used in an AI image without their permission. Then she mentioned a guest lecture about how AI is going to help finance professionals. Then I pointed out that those two things aren't really related.
The term AI is being used to obfuscate details about multiple semi-related technologies.
Traditionally in sci-fi, AI means artificial general intelligence like Data from star trek, or the terminator. This, I shouldn't need to say, doesn't exist. Techbros use the term AI to trick investors into funding their projects. It's largely a grift.
What is the term AI being used to obfuscate?
If you want to help the less online and less tech literate people in your life navigate the hype around AI, the best way to do it is to encourage them to change their language around AI topics.
By calling these technologies what they really are, and encouraging the people around us to know the real names, we can help lift the veil, kill the hype, and keep people safe from scams. Here are some starting points, which I am just pulling from Wikipedia. I'd highly encourage you to do your own research.
Machine learning (ML): an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines "discover" their "own" algorithms, without needing to be explicitly told what to do by any human-developed algorithms. (This is the basis of most of the technology people call AI.)
Language model (LM or LLM): a probabilistic model of a natural language that can generate probabilities for a series of words, based on the text corpora in one or multiple languages it was trained on. (This would be your ChatGPT.)
Generative adversarial network (GAN): a class of machine learning framework and a prominent approach to generative AI. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss. (This is the source of some AI images and deepfakes.)
Diffusion Models: Models that generate the probability distribution of a given dataset. In image generation, a neural network is trained to denoise images with added gaussian noise by learning to remove the noise. After the training is complete, it can then be used for image generation by starting with a random noise image and denoise that. (This is the more common technology behind AI images, including Dall-E and Stable Diffusion. I added this one to the post after as it was brought to my attention it is now more common than GANs.)
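To make the "probabilistic model of a natural language" part concrete, here's a toy sketch in plain Python (tiny made-up corpus, everything hypothetical): the simplest possible language model, which just counts which word follows which. Real LLMs use neural networks trained on enormous corpora, but the "predict the next word from probabilities" framing is the same idea.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ran".split()

# Count, for each word, which words follow it in the corpus.
follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1

def next_word_probs(word):
    """Turn raw follow-counts into a probability distribution."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

print(next_word_probs("the"))  # e.g. "cat" is more likely than "mat"
```

A neural language model does the same job, just with learned weights instead of a lookup table, which is what lets it generalize beyond word pairs it has literally seen.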
I know these terms are more technical, but they are also more accurate, and they can easily be explained in a way non-technical people can understand. The grifters are using language to give this technology its power, so we can use language to take its power away and let people see it for what it really is.
12K notes · View notes
txttletale · 1 year ago
Note
are there any critiques of AI art or maybe AI in general that you would agree with?
AI art makes it a lot easier to make bad art on a mass production scale which absolutely floods art platforms (sucks). LLMs make it a lot easier to make content slop on a mass production scale which absolutely floods search results (sucks and with much worse consequences). both will be integrated into production pipelines in ways that put people out of jobs or justify lower pay for existing jobs. most AI-produced stuff is bad. the loudest and most emphatic boosters of this shit are soulless venture capital guys with an obvious and profound disdain for the concept of art or creative expression. the current wave of hype around it means that machine learning is being incorporated into workflows and places where it provides no benefit and in fact makes services and production meaningfully worse. it is genuinely terrifying to see people looking to chatGPT for personal and professional advice. the process of training AIs and labelling datasets involves profound exploitation of workers in the global south. the ability of AI tech to automate biases while erasing accountability is chilling. seems unwise to put a lot of our technological eggs in a completely opaque black box basket (mixing my metaphors a bit with that one). bing ai wont let me generate 'tesla CEO meat mistake' because it hates fun
6K notes · View notes
lesser-vissir · 9 months ago
Text
I hear what you're saying but an LLM built exclusively off of a single company's copyrighted material, even one as big as Disney, isn't feasible and the result would be a terrible AI.
the AI technology that will be used to replace workers is not stable diffusion. and it's not chat gpt, and it's not midjourney! it will be in-house models made by the likes of adobe and disney that is trained entirely on their OWN, COPYRIGHTED MATERIAL and no amount of getting mad at communists online for generating anime girls is going to do anything to stop that. GLOBAL COMMUNISM NOW
415 notes · View notes
centrally-unplanned · 6 days ago
Text
What on earth is Marginal Revolution doing linking this tweet about LLM mental health safety:
Tumblr media
No one has any numbers on this; there is no dataset of "self-harm incidents caused by TV" because that isn't a coherent concept. He just fucking says it, too, downstream:
Tumblr media
Literally this lmao
Tumblr media
Anyway MR I am worried for you buddy please, talk to your friends, you can come back into the light! It doesn't have to be this way!
50 notes · View notes
snarp · 6 months ago
Text
Seemingly-counterintuitive thing about "AI" output: these things are much better at identifying and recreating emotion/tone than they are at facts/reasoning. The LLMs typically know whether a story is "funny" or "depressing," whether a picture of a dog is "cute" or "scary," when it's appropriate to use a puke emoji, and so on. They do not know that the Russian sleep experiment and the Dashcon baby are fake.
Makes sense! People are more likely to lie or be wrong about reality than about their own feelings about it. And the datasets are, ultimately, just stuff people say.
73 notes · View notes
neurospring · 5 months ago
Text
History and Basics of Language Models: How Transformers Changed AI Forever - and Led to Neuro-sama
I have seen a lot of misunderstandings and myths about Neuro-sama's language model. I have decided to write a short post, going into the history of and current state of large language models and providing some explanation about how they work, and how Neuro-sama works! To begin, let's start with some history.
Before the beginning
Before the language models we are used to today, models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) were used for natural language processing, but they had a lot of limitations. Both of these architectures process words sequentially, meaning they read text one word at a time in order. This made them struggle with long sentences: they could almost forget the beginning by the time they reached the end.
Another major limitation was computational efficiency. Since RNNs and LSTMs process text one step at a time, they can't take full advantage of modern parallel computing hardware like GPUs. All these fundamental limitations meant that these models could never be nearly as smart as today's models.
The beginning of modern language models
In 2017, a paper titled "Attention is All You Need" introduced the transformer architecture. It was received positively for its innovation, but no one truly knew just how important it was going to be. This paper is what made modern language models possible.
The transformer's key innovation was the attention mechanism, which allows the model to focus on the most relevant parts of a text. Instead of processing words sequentially, transformers process all words at once, capturing relationships between words no matter how far apart they are in the text. This change made models faster, and better at understanding context.
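For the curious, the attention computation described above can be sketched in plain Python (toy sizes, made-up numbers, and no learned weight matrices; a real transformer first projects the inputs through learned query/key/value layers):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every
    other position at once, with no sequential processing."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Relevance score of this position against every position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)           # attention weights, sum to 1
        # Output is a weighted mix of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 "words", 2-dim embeddings
y = attention(x, x, x)                    # self-attention: Q = K = V
print(len(y), len(y[0]))                  # 3 2
```

Because each position's scores against all other positions are independent, all of this can run in parallel on a GPU, which is exactly the efficiency win over RNNs described above.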
The full potential of transformers became clearer over the next few years as researchers scaled them up.
The Scale of Modern Language Models
A major factor in an LLM's performance is the number of parameters - which are like the model's "neurons" that store learned information. The more parameters, the more powerful the model can be. The first GPT (generative pre-trained transformer) model, GPT-1, was released in 2018 and had 117 million parameters. It was small and not very capable - but a good proof of concept. GPT-2 (2019) had 1.5 billion parameters - which was a huge leap in quality, but it was still really dumb compared to the models we are used to today. GPT-3 (2020) had 175 billion parameters, and it was really the first model that felt actually kinda smart. This model required 4.6 million dollars for training, in compute expenses alone.
Recently, models have become more efficient: smaller models can achieve similar performance to bigger models from the past. This efficiency means that smarter and smarter models can run on consumer hardware. However, training costs still remain high.
How Are Language Models Trained?
Pre-training: The model is trained on a massive dataset to predict the next token. A token is a piece of text a language model can process; it can be a word, word fragment, or character. Even training relatively small models with a few billion parameters requires trillions of tokens and a lot of computational resources, which cost millions of dollars.
Post-training, including fine-tuning: After pre-training, the model can be customized for specific tasks, like answering questions, writing code, casual conversation, etc. Certain post-training methods can help improve the model's alignment with certain values or update its knowledge of specific domains. This requires far less data and computational power compared to pre-training.
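As a toy illustration of the pre-training objective above (the token IDs here are made up): the training target is simply the input sequence shifted left by one, so position i learns to predict token i+1.

```python
tokens = [17, 4, 92, 8, 301]   # hypothetical token IDs for one text chunk

inputs = tokens[:-1]           # what the model sees:  [17, 4, 92, 8]
targets = tokens[1:]           # what it must predict: [4, 92, 8, 301]

for x, y in zip(inputs, targets):
    print(f"given {x} -> predict {y}")
```

Repeat this over trillions of tokens and you have, in essence, the whole pre-training recipe; everything else is scale and architecture.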
The Cost of Training Large Language Models
Pre-training models over a certain size requires vast amounts of computational power and high-quality data. While advancements in efficiency have made it possible to get better performance with smaller models, models can still require millions of dollars to train, even if they have far fewer parameters than GPT-3.
The Rise of Open-Source Language Models
Many language models are closed-source, meaning you can't download or run them locally. For example, ChatGPT models from OpenAI and Claude models from Anthropic are all closed-source.
However, some companies release a number of their models as open-source, allowing anyone to download, run, and modify them.
While the larger models cannot be run on consumer hardware, smaller open-source models can be used on high-end consumer PCs.
An advantage of smaller models is that they have lower latency, meaning they can generate responses much faster. They are not as powerful as the largest closed-source models, but their accessibility and speed make them highly useful for some applications.
So What is Neuro-sama?
Basically no details are shared about the model by Vedal, and I will only share what can be confidently concluded - nothing that would reveal any sort of "trade secret". What can be known is that Neuro-sama would not exist without open-source large language models. Vedal can't train a model from scratch, but what Vedal can do - and can confidently be assumed to have done - is post-train an open-source model. Post-training a model on additional data can change the way the model acts and can add some new knowledge; however, the core intelligence of Neuro-sama comes from the base model she was built on.
Since huge models can't be run on consumer hardware and would be prohibitively expensive to run through an API, we can also say that Neuro-sama is a smaller model - which has the disadvantage of being less powerful and more limited, but the advantage of low latency. Latency and cost are always going to pose some pretty strict limitations, but because LLMs just keep getting more efficient and better hardware is becoming more available, Neuro can be expected to become smarter and smarter in the future.
To end, I have to at least mention that Neuro-sama is more than just her language model, though the language model is all we have talked about in this post. She can be looked at as a system of different parts: her TTS, her VTuber avatar, her vision model, her long-term memory, even her Minecraft AI, and so on, all come together to make Neuro-sama.
Wrapping up - Thanks for Reading!
This post was meant to provide a brief introduction to language models, covering some history and explaining how Neuro-sama can work. Of course, this post is just scratching the surface, but hopefully it gave you a clearer understanding about how language models function and their history!
33 notes · View notes
kenyatta · 3 months ago
Text
Thom Wolf, co-founder of Hugging Face, wrote this the other day about how current AI is just "a country of yes men on servers":
History is filled with geniuses struggling during their studies. Edison was called "addled" by his teacher. Barbara McClintock got criticized for "weird thinking" before winning a Nobel Prize. Einstein failed his first attempt at the ETH Zurich entrance exam. And the list goes on.
The main mistake people usually make is thinking Newton or Einstein were just scaled-up good students, that a genius comes to life when you linearly extrapolate a top-10% student.
This perspective misses the most crucial aspect of science: the skill to ask the right questions and to challenge even what one has learned. A real science breakthrough is Copernicus proposing, against all the knowledge of his day - in ML terms, we would say "despite all his training dataset" - that the earth may orbit the sun rather than the other way around.
To create an Einstein in a data center, we don't just need a system that knows all the answers, but rather one that can ask questions nobody else has thought of or dared to ask. One that writes 'What if everyone is wrong about this?' when all textbooks, experts, and common knowledge suggest otherwise.
“In my opinion this is one of the reasons LLMs, while they already have all of humanity's knowledge in memory, haven't generated any new knowledge by connecting previously unrelated facts. They're mostly doing "manifold filling" at the moment - filling in the interpolation gaps between what humans already know, somehow treating knowledge as an intangible fabric of reality.”
30 notes · View notes