# Twitter data scraper
iwebdatascrape · 2 years ago
Text
Twitter Data Scraping Services - Twitter Data Collection Services
Scrape data like profile handles, follower counts, etc., using our Twitter data scraping services. Our Twitter data collection services are available across the USA, UK, etc.
Know more:
0 notes
webscreenscraping00 · 2 years ago
Text
Tumblr media
Twitter is amongst the most tracked social networking sites these days. It is one of the best resources for extracting Twitter followers, and Twitter web scraping can find followers in huge numbers. We can connect with a lot of people using Facebook, LinkedIn, and Twitter, but we first need to get them into a list or circle. That's not an easy job, and that is why we provide Twitter Data Scraping services. Web Screen Scraping provides the best Twitter data scraping services to scrape or extract data from Twitter. Get all your requirements fulfilled with our Twitter Data Scraping services.
0 notes
callmearcturus · 1 year ago
Text
I said this elsewhere but
not to be That Guy but I don't really see the point of moving platforms anymore.
There is nowhere we can hide on the internet from the silicon valley bros. There just isn't. Patreon is VC-funded and could announce tomorrow that oh of course they've been partnered with Midjourney for months already. Twitter actively scrapes everything for AI learning. And even if you trusted the other big players like FB/IG to tell the truth about shit, people are going to use these platforms for datasets anyway. They'll just do it quietly and hope no one notices.
And places like cohost or whatever-- honestly, if it makes you feel safer/better, go for it, but I don't think cohost has the sway or capital to build the type of legal team you need to fight against scrapers. Hell, you wanna retreat into private discords? Discord wants in on AI too.
Everyone big is already dealing in AI, and everyone small doesn't even have a seat at the table. In my opinion, we are all collectively holding out for Brussels or any of the many court cases to do something about this shit, because it's no longer a thing we can just hide from.
I'm going to keep my writing on the AO3 because they are the odd case of having an actual legal team in place for this shit. For artists, I have nothing but sympathy. I suggest glazing and nightshading literally everything you post.
But beyond that, I'm unsure what we can do. This is a matter for legislation. Silicon Valley doesn't care if we all go to cohost, and even less scrupulous data-crawlers will just grab our shit from there too.
So I'll be here.
3K notes · View notes
actosoluions · 2 years ago
Text
How to Scrape Tweets Data by Location Using Python and snscrape?
Tumblr media
In this blog, we will take a comprehensive look at the snscrape Python wrapper and its functionality, with a specific focus on using it to search for tweets based on location. We will also delve into why the wrapper may not always perform as expected. Let's dive in.
snscrape is a remarkable Python library that enables users to scrape tweets from Twitter without the need for personal API keys. With its lightning-fast performance, it can retrieve thousands of tweets within seconds. Moreover, snscrape offers powerful search capabilities, allowing for highly customizable queries. While the documentation for scraping tweets by location is currently limited, this blog aims to comprehensively introduce this topic. Let's delve into the details:
Introduction to Snscrape: Snscrape is a feature-rich Python library that simplifies scraping tweets from Twitter. Unlike traditional methods that require API keys, snscrape bypasses this requirement, making it accessible to users without prior authorization. Its speed and efficiency make it an ideal choice for various applications, from research and analysis to data collection.
The Power of Location-Based Tweet Scraping: Location-based tweet scraping allows users to filter tweets based on geographical coordinates or place names. This functionality is handy for conducting location-specific analyses, monitoring regional trends, or extracting data relevant to specific areas. By leveraging Snscrape's capabilities, users can gain valuable insights from tweets originating in their desired locations.
Exploring Snscrape's Location-Based Search Tools: Snscrape provides several powerful tools for conducting location-based tweet searches. Users can effectively narrow their search results to tweets from a particular location by utilizing specific parameters and syntax. This includes defining the search query, specifying the geographical coordinates or place names, setting search limits, and configuring the desired output format. Understanding and correctly using these tools is crucial for successful location-based tweet scraping.
Overcoming Documentation Gaps: While snscrape is a powerful library, its documentation on scraping tweets by location is currently limited. This article will provide a comprehensive introduction to the topic to bridge this gap, covering the necessary syntax, parameters, and strategies for effective location-based searches. Following the step-by-step guidelines, users can overcome the lack of documentation and successfully utilize snscrape for their location-specific scraping needs.
Best Practices and Tips: Alongside exploring Snscrape's location-based scraping capabilities, this article will also offer best practices and tips for maximizing the efficiency and reliability of your scraping tasks. This includes handling rate limits, implementing error-handling mechanisms, ensuring data consistency, and staying updated with any changes or updates in Snscrape's functionality.
Introduction to snscrape Using Python
In this blog, we'll use the development version of snscrape, which can be installed with:
pip install git+https://github.com/JustAnotherArchivist/snscrape.git
Note: this needs Python 3.8 or later.
Some familiarity with the Pandas module is also needed.
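(The original post's code screenshots are not recoverable here; in their place, the following is a minimal sketch of the basic snscrape-plus-Pandas workflow the article describes, assuming the development build. Note that recent snscrape versions expose the tweet text as rawContent rather than content.)

```python
import pandas as pd
import snscrape.modules.twitter as sntwitter

# A simple keyword query; any Twitter advanced-search syntax works here
query = "python"
tweets = []

# get_items() is a generator, so enumerate() lets us cap how many tweets we pull
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 100:  # stop after 100 tweets
        break
    # on newer snscrape builds, use tweet.rawContent instead of tweet.content
    tweets.append([tweet.date, tweet.user.username, tweet.content])

# Load the results into a Pandas DataFrame for analysis
df = pd.DataFrame(tweets, columns=["date", "username", "content"])
print(df.head())
```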
We encourage you to explore and experiment with the various features of snscrape to better understand its capabilities. Additionally, you can refer to the mentioned article for more in-depth information on the subject. Later in this blog, we will delve deeper into the user field and its significance in tweet scraping. By gaining a deeper understanding of these concepts, you can harness the full potential of snscrape for your scraping tasks.
Advanced Search Features
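(The screenshot itself isn't recoverable; below is a sketch reconstructed from the description that follows, not the article's exact code.)

```python
import snscrape.modules.twitter as sntwitter

# near: and within: are Twitter advanced-search operators
query = "pizza near:Los Angeles within:10km"

# Print the content of the first ten matching tweets
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 10:
        break
    print(tweet.content)  # rawContent on newer snscrape builds
```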
In this code snippet, we define the search query as "pizza near:Los Angeles within:10km", which specifies that we want to search for tweets containing the word "pizza" near Los Angeles within a radius of 10 km. The TwitterSearchScraper object is created with the search query, and then we iterate over the retrieved tweets and print their content.
Feel free to adjust the search query and radius per your specific requirements.
To compare the results, we can perform an inner merge on the two DataFrames:
common_rows = df_coord.merge(df_city, how='inner')
If that returns 50, for example, then the two DataFrames share all of the same rows.
What precisely is this place or location?
When determining the location of tweets on Twitter, there are two primary sources: the geo-tag associated with a specific tweet and the user's location mentioned in their profile. However, it's important to note that only a small percentage of tweets (approximately 1-2%) are geo-tagged, making it an unreliable metric for location-based searches. On the other hand, many users include a location in their profile, but it's worth noting that these locations can be arbitrary and inaccurate. Some users provide helpful information like "London, England," while others might use humorous or irrelevant descriptions like "My Parents' Basement."
Despite the limited availability and potential inaccuracies of geo-tagged tweets and user profile locations, Twitter employs algorithms as part of its advanced search functionality to interpret a user's location based on their profile. This means that when you look for tweets through coordinates or city names, the search results will include tweets geotagged from the location and tweets posted by users who have that location (or a location nearby) mentioned in their profile.
Tumblr media
To illustrate the usage of location-based searching on Twitter, let's consider an example. Suppose we perform a search for tweets near "London." Here are two examples of tweets that were found using different methods:
The first tweet is geo-tagged, which means it contains specific geographic coordinates indicating its location. In this case, the tweet was found because of its geo-tag, regardless of whether the user has a location mentioned in their profile or not.
The following tweet isn't geo-tagged, which means that it doesn't have explicit geographic coordinates associated with it. However, it was still included in the search results because the user has given a location in their profile that matches or is closely associated with London.
When performing a location-based search on Twitter, you can come across tweets that are either geo-tagged or have users with matching or relevant locations mentioned in their profiles. This allows for a more comprehensive search, capturing tweets from specific geographic locations and users who have declared their association with those locations.
Get Location From Scraped Tweets
If you're using snscrape to scrape tweets and want to extract the user's location from the scraped data, you can do so by following these steps. In the example below, we scrape 50 tweets within a 10km radius of Los Angeles, store the data in a DataFrame, and then create a new column to capture the user's location.
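(As above, the screenshots aren't recoverable; this sketch follows the description, using illustrative coordinates for downtown Los Angeles.)

```python
import pandas as pd
import snscrape.modules.twitter as sntwitter

# geocode:<lat>,<lon>,<radius> restricts results to a circular area;
# the coordinates below are approximately downtown Los Angeles
query = "geocode:34.052235,-118.243683,10km"

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 50:  # scrape 50 tweets
        break
    # keep the whole user object so we can pull profile fields later
    rows.append([tweet.date, tweet.user, tweet.content])

df = pd.DataFrame(rows, columns=["date", "user", "content"])

# New column capturing the free-text location from each user's profile
df["user_location"] = df["user"].apply(lambda user: user.location)
print(df[["user", "user_location"]].head())
```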
If It Doesn’t Work According to Your Expectations
The use of the near: and geocode: tags in Twitter's advanced search can sometimes yield inconsistent results, especially when searching for specific towns, villages, or countries. For instance, while searching for tweets near Lewisham, the results may show tweets from a completely different location, such as Hobart, Australia, which is over 17,000 km away.
To ensure more accurate results when scraping tweets by location using snscrape, it is recommended to use the geocode: tag with longitude and latitude coordinates, along with a specified radius, to narrow down the search area. This approach will provide more reliable and precise results based on the available data and features.
Conclusion
In conclusion, the snscrape Python module is a valuable tool for conducting specific and powerful searches on Twitter. Twitter has made significant efforts to convert user input locations into real places, enabling easy searching by name or coordinates. By leveraging its capabilities, users can extract relevant information from tweets based on various criteria.
For research, analysis, or other purposes, snscrape empowers users to extract valuable insights from Twitter data. Tweets serve as a valuable source of information. When combined with the capabilities of snscrape, even individuals with limited experience in Data Science or subject knowledge can undertake exciting projects.
Happy scraping!
For more details, you can contact Actowiz Solutions anytime! Call us for all your mobile app scraping and web scraping services requirements.
Source: https://www.actowizsolutions.com/how-to-scrape-tweets-data-by-location-using-python-and-snscrape.php
0 notes
bitbybitwrites · 1 month ago
Text
ShadowDragon sells a tool called SocialNet that streamlines the process of pulling public data from various sites, apps, and services. Marketing material available online says SocialNet can “follow the breadcrumbs of your target’s digital life and find hidden correlations in your research.” In one promotional video, ShadowDragon says users can enter “an email, an alias, a name, a phone number, a variety of different things, and immediately have information on your target. We can see interests, we can see who friends are, pictures, videos.”
The leaked list of targeted sites includes ones from major tech companies, communication tools, sites focused around certain hobbies and interests, payment services, social networks, and more. The 30 companies the Mozilla Foundation is asking to block ShadowDragon scrapers are Amazon, Apple, BabyCentre, BlueSky, Discord, Duolingo, Etsy, Meta's Facebook and Instagram, FlightAware, Github, Glassdoor, GoFundMe, Google, LinkedIn, Nextdoor, OnlyFans, Pinterest, Reddit, Snapchat, Strava, Substack, TikTok, Tinder, TripAdvisor, Twitch, Twitter, WhatsApp, Xbox, Yelp, and YouTube.
433 notes · View notes
jamingbenn · 4 months ago
Text
year in review - hockey rpf on ao3
Tumblr media
hello!! the annual ao3 year in review had some friends and i thinking - wouldn't it be cool if we had a hockey rpf specific version of that. so i went ahead and collated the data below!!
i start with a broad overview, then dive deeper into the 3 most popular ships this year (with one bonus!)
if any images appear blurry, click on them to expand and they should become clear!
₊˚⊹♡ . ݁₊ ⊹ . ݁˖ . ݁𐙚 ‧₊˚ ⋅. ݁
before we jump in, some key things to highlight:
- CREDIT TO: the webscraping part of my code heavily utilized the ao3 wrapped google colab code, as lovingly created by @kyucultures on twitter, as the main skeleton. i tweaked a couple of things but having it as a reference saved me a LOT of time and effort as a first time web scraper!!! thank you stranger <3
- please do NOT, under ANY circumstances, share any part of this collation on any other website. please do not screenshot or repost to twitter, tiktok, or any other public social platform. thank u!!! T_T
- but do feel free to send requests to my inbox! if you want more info on a specific ship, tag, or you have a cool idea or wanna see a correlation between two variables, reach out and i should be able to take a look. if you want to take a deeper dive into a specific trope not mentioned here/chapter count/word counts/fic tags/ship tags/ratings/etc, shoot me an ask!
˚  .   ˚ .      . ✦     ˚     . ★⋆. ࿐࿔
with that all said and done... let's dive into hockey_rpf_2024_wrapped_insanity.ipynb
BIG PICTURE OVERVIEW
i scraped a total of 4266 fanfics that dated themselves as published or finished in the year 2024. of these 4000 odd fanfics, the most popular ships were:
Tumblr media
Note: "Minor or Background Relationship(s)" clocked in at #9 with 91 fics, but I removed it as it was always a secondary tag and added no information to the chart. I did not discern between primary ship and secondary ship(s) either!
breaking down the 5 most popular ships over the course of the year, we see:
Tumblr media
super interesting to see that HUGE jump for mattdrai in june/july for the stanley cup final. the general lull in the offseason is cool to see as well.
as for the most popular tags in all 2024 hockey rpf fic...
Tumblr media
weee like our fluff. and our established relationships. and a little H/C never hurt no one.
i got curious here about which AUs were the most popular, so i filtered down for that. note that i only regex'd for tags that specifically start with "Alternate Universe - ", so A/B/O and some other stuff won't appear here!
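for anyone curious, a rough sketch of what that filtering can look like (illustrative only, not the exact code used; this assumes each fic's tags ended up as a list in a pandas column):

```python
import pandas as pd

# toy stand-in for the scraped data: one row per fic, tags as a list
df = pd.DataFrame({
    "tags": [
        ["Fluff", "Alternate Universe - Coffee Shops & Cafés"],
        ["Angst", "Alternate Universe - College/University"],
        ["Pining", "Alternate Universe - Coffee Shops & Cafés"],
    ]
})

# flatten the tag lists, then regex for tags starting with "Alternate Universe - "
au_tags = (
    df["tags"]
    .explode()
    .loc[lambda s: s.str.match(r"^Alternate Universe - ", na=False)]
)
print(au_tags.value_counts())
```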
Tumblr media
idk it was cool to me.
also, here's a quick breakdown of the ratings % for works this year:
Tumblr media
and as for the word counts, i pulled up a box plot of the top 20 most popular ships to see how the fic length distribution differed amongst ships:
Tumblr media
mattdrai-ers you have some DEDICATION omg. respect
now for the ship by ship break down!!
₊ . ݁ ݁ . ⊹ ࣪ ˖͙͘͡★ ⊹ .
#1 MATTDRAI
most popular ship this year. peaked in june/july with the scf. so what do u people like to write about?
Tumblr media
fun fun fun. i love that the scf is tagged there like yes actually she is also a main character
₊ . ݁ ݁ . ⊹ ࣪ ˖͙͘͡★ ⊹ .
#2 SIDGENO
(my babies) top tags for this ship are:
Tumblr media
folks, we are a/b/o fiends and we cannot lie. thank you to all the selfless authors for feeding us good a/b/o fic this year. i hope to join your ranks soon.
(also: MPREG. omega sidney crosby. alpha geno. listen, the people have spoken, and like, i am listening.)
₊ . ݁ ݁ . ⊹ ࣪ ˖͙͘͡★ ⊹ .
#3 NICOJACK
top tags!!
Tumblr media
it seems nice and cozy over there... room for one more?
₊ . ݁ ݁ . ⊹ ࣪ ˖͙͘͡★ ⊹ .
BONUS: JDTZ.
i wasnt gonna plot this but @marcandreyuri asked me if i could take a look and the results are so compelling i must include it. are yall ok. do u need a hug
Tumblr media
top tags being h/c, angst, angst, TRADES, pining, open endings... T_T katie said its a "torture vortex" and i must concur
₊ . ݁ ݁ . ⊹ ࣪ ˖͙͘͡★ ⊹ .
BONUS BONUS: ALPHA/BETA/OMEGA
as an a/b/o enthusiast myself i got curious as to what the most popular ships were within that tag. if you want me to take a look about this for any other tag lmk, but for a/b/o, as expected, SID GENO ON TOP BABY!:
Tumblr media
thats all for now!!! if you have anything else you are interested in seeing the data for, send me an ask and i'll see if i can get it to ya!
448 notes · View notes
bitterkarella · 1 year ago
Text
Midnight Pals: Hackin'
King: i can't believe elon's grok is pretending i'm friends with him
King: i need to stop that AI before everyone believes it!
King: i've got to hire a hacker
King: franz, you've got to help me
Franz Kafka: what? me?
Barker: steve, no

Kafka: i'm not a hacker
King: oh i thought franz was a hacker
Barker: what gave you THAT impression?
King: you know, with the cat ear headphones and the striped thigh socks
Barker: no steve that's something ENTIRELY different
Kafka: n-no it isn't, on second thought yes I'm totally a hacker

Kafka: it means i'm a hacker, nothing else
Barker: sure franz
Kafka: it does! it totally means i'm a hacker!
Barker: franz, go play with your blahaj plush, the adults are talking here

Barker: you know who you need? you need william gibson
Barker: the best hacker money can buy
King: william gibson? how do i contact him?
Barker: you don't
Barker: he'll contact you

King: can you really hack grok, william?
William Gibson: [wearing black duster and fingerless black gloves] my hacker name is shadow gigabyte
King: oh sorry
Gibson: can i hack grok? listen kid i was cyberbyting the megabyte mainframe when you were just rebooting your motherboard mouse data bandwidth modem email
King: wow!

Gibson: my CPU is a neural net processer, a learning computer
King: wow he really sounds like he knows what he's talking about!
King: that definitely sounds like hacker talk to me
Gibson: CD Rom
Gibson: internet
Joe Hill: dad can i talk to you for a second
King: not now joe daddy's hiring a hacker

Gibson: [wildly slapping keyboard] i'll re-index the mega bit blaster cyber codex
Gibson: [wildly slapping keyboard] now we'll cybersecurity the lock box data center
King: hey what happens if you push that button?
Gibson: what the-- no!! [klaxons sound]
King: what's that mean?
Gibson: shit
Gibson: we've got company

Gibson: sentient cyber virus electronic guard cyberbots
Gibson: real high tech
Gibson: state of the art in bio-tech wetware neural-data scrapers
Gibson: [putting on sunglasses with red laser scope] and they ain't friendly

King: what are we going to do?!
Gibson: kid, you keep your hands to yourself unless you wanna become roadkill on the information super highway!!!
Gibson: hold on to your CPU (central processing unit)!!!

Gibson: [wildly slapping keyboard] gotta reconfigure the darkweb logistics for ethernet wavetech
Gibson: [wildly slapping keyboard] upload the memory downloader for dumpware backup
Gibson: [wildly slapping keyboard] uncodify the cyberpatch modifer aaaaand
Gibson: i'm in

King: wow, you hacked twitter?? how did you do it?
Gibson: the greatest hackers never reveal their secrets
[earlier]
Gibson: [wearing fake mustache] hey elon its me catturd
Gibson: could you give me your password?
Elon Musk: sure it's "picklerick420"!
537 notes · View notes
nsomniacsdream · 1 year ago
Text
Tumblr, in my estimation, cannot be a place that is profitable, because the aims of the userbase can be described as somewhere between fuck around and fuck off. No one comes here for anything other than shitposting. Companies don't try to find your tumblr, to my knowledge, so it's the last safe place on the internet to just say stupid shit and learn from it, instead of becoming unemployable.
Tumblr would be a really good buy for like, Archive.org. Someone who doesn't have to worry about 'profit', they just have to keep the lights on. Do moderation by roundtable: when someone submits a support request, like "I'm being harassed", the proof they provide is sent to ten random active bloggers (unrelated to any involved parties), and their decision is actionable. Provide the tools for self determination, instead of a black box that doesn't seem to be working for anyone. It's cheaper, and fairer to the community in general. I dunno, it's probably not perfect, but it's better than 'just doing nothing until the person who keeps complaining gives us a reason to ban them, that makes the problem go away'. Honestly, Matt whatever should just donate tumblr to them, call it a charitable donation, claim it on his taxes. It's a sinking ship every minute you're trying to extract value from it. Its real current value is almost certainly hovering around "the change in people's couches of like 20 households".
Tumblr has a place on the internet, an important place, but it'll never be a need that is profitable. And Tumblr's history and reputation kind of prevent it from ever being changed into something different that would be as profitable as whoever currently owns it would want. I suppose you could burn it all to the ground, wipe the servers and start a twitter clone. But it'll just be one more on a field that's so oversaturated it's not worth trying. I'm not sure why people keep buying tumblr, it's a fantastic creative community, but its products can't be sold, and the userbase is poor and has little to no interest in paying for 'upgrades'. So you could sell everything to AI scrapers, or data miners, but you'll lose the entire userbase and no one's gonna come in to fill the gaps left. It's a quick and messy death.
137 notes · View notes
tapiocats · 11 months ago
Text
To everyone, but especially artists : Instagram has fully leaned into the AI craze and has now scraped every existing account to train their new Meta AI.
In response, artists are migrating en masse to a new social network called Cara.app. It's a mix of Instagram/Bluesky/Twitter, and is focused on artists, with a complete ban on AI-generated content. They also try as much as possible to block scrapers from using the site's data, and have a partnership with Glaze that lets you protect your art posts.
You can also choose what you see in your feed based on percentages of following/recommended content, and so far it has been incredible for discovering new artists!
Like many, I am trying it out right now. I really enjoy the community, and hope this great project keeps on living !
Join me there, at https://cara.app/tapiocats 💕
I'm not leaving Tumblr btw !
Just trying out an Instagram alternative 😄
43 notes · View notes
snickerdoodlles · 2 years ago
Text
pulling out a section from this post (a very basic breakdown of generative AI) for easier reading;
AO3 and Generative AI
There are unfortunately some massive misunderstandings in regards to AO3 being included in LLM training datasets. This post was semi-prompted by the ‘Knot in my name’ AO3 tag (for those of you who haven’t heard of it, it’s supposed to be a fandom anti-AI event where AO3 writers help “further pollute” AI with Omegaverse), so let’s take a moment to address AO3 in conjunction with AI. We’ll start with the biggest misconception:
1. AO3 wasn’t used to train generative AI.
Or at least not any more than any other internet website. AO3 was not deliberately scraped to be used as LLM training data.
The AO3 moderators found traces of the Common Crawl web crawler in their servers. The Common Crawl is an open data repository of raw web page data, metadata extracts, and text extracts collected from 10+ years of web crawling. Its collective data is measured in petabytes. (As a note, it also only features samples of the available pages on a given domain in its datasets, because its data is freely released under fair use and this is part of how they navigate copyright.) LLM developers use it and similar web crawls like Google's C4 to bulk up the overall amount of pre-training data.
AO3 is big to an individual user, but it’s actually a small website when it comes to the amount of data used to pre-train LLMs. It’s also just a bad candidate for training data. As a comparison example, Wikipedia is often used as high quality training data because it’s a knowledge corpus and its moderators put a lot of work into maintaining a consistent quality across its web pages. AO3 is just a repository for all fanfic -- it doesn’t have any of that quality maintenance nor any knowledge density. Just in terms of practicality, even if people could get around the copyright issues, the sheer amount of work that would go into curating and labeling AO3’s data (or even a part of it) to make it useful for the fine-tuning stages most likely outstrips any potential usage.
Speaking of copyright, AO3 is a terrible candidate for training data just based on that. Even if people (incorrectly) think fanfic doesn't hold copyright, there are plenty of books and texts that are public domain that can be found in online libraries that make for much better training data (or rather, there is a higher consistency in quality for them that would make them more appealing than fic for people specifically targeting written story data). And for any scrapers who don't care about legalities or copyright, they're going to target published works instead. Meta is in fact currently getting sued for including published books from a shadow library in its training data (note, this case is not in regards to any copyrighted material that might've been caught in the Common Crawl data, it's regarding a book repository of published books that was scraped specifically to bring in some higher quality data for the first training stage). In a similar case, there's an anonymous group suing Microsoft, GitHub, and OpenAI for training their LLMs on open source code.
Getting back to my point, AO3 is just not desirable training data. It’s not big enough to be worth scraping for pre-training data, it’s not curated enough to be considered for high quality data, and its data comes with copyright issues to boot. If LLM creators are saying there was no active pursuit in using AO3 to train generative AI, then there was (99% likelihood) no active pursuit in using AO3 to train generative AI.
AO3 has some preventative measures against being included in future Common Crawl datasets, which may or may not work, but there’s no way to remove any previously scraped data from that data corpus. And as a note for anyone locking their AO3 fics: that might potentially help against future AO3 scrapes, but it is rather moot if you post the same fic in full to other platforms like ffn, twitter, tumblr, etc. that have zero preventative measures against data scraping.
2. A/B/O is not polluting generative AI
…I’m going to be real, I have no idea what people expected to prove by asking AI to write Omegaverse fic. At the very least, people know A/B/O fics are not exclusive to AO3, right? The genre isn’t even exclusive to fandom -- it started in fandom, sure, but it expanded to general erotica years ago. It’s all over social media. It has multiple Wikipedia pages.
More to the point though, omegaverse would only be “polluting” AI if LLMs were spewing omegaverse concepts unprompted or like…associated knots with dicks more than rope or something. But people asking AI to write omegaverse and AI then writing omegaverse for them is just AI giving people exactly what they asked for. And…I hate to point this out, but LLMs writing for a niche the LLM trainers didn’t deliberately train the LLMs on is generally considered to be a good thing to the people who develop LLMs. The capability to fill niches developers didn’t even know existed increases LLMs’ marketability. If I were a betting man, what fandom probably saw as a GOTCHA moment, AI people probably saw as a good sign of LLMs’ future potential.
3. Individuals cannot affect LLM training datasets.
So back to the fandom event, with the stated goal of sabotaging AI scrapers via omegaverse fic.
…It’s not going to do anything.
Let’s add some numbers to this to help put things into perspective:
LLaMA’s 65 billion parameter model was trained on 1.4 trillion tokens. Of that 1.4 trillion tokens, about 67% of the training data was from the Common Crawl (roughly ~3 terabytes of data).
3 terabytes is 3,000,000,000 kilobytes.
That’s 3 billion kilobytes.
According to a news article I saw, there have been ~450k words total published for this campaign (*this was while it was going on, that number has probably changed, but you're about to see why that still doesn't matter). So, roughly speaking, ~450k words of text is ~1,012 KB (I'm going off the document size of a plain text doc for a fic whose word count is ~440k).
So 1,012 out of 3,000,000,000.
Aka 0.000034%.
And that 0.000034% of 3 billion kilobytes is only 2/3s of the data for the first stage of training.
And not to beat a dead horse, but 0.000034% is still grossly overestimating the potential impact of posting A/B/O fic. Remember, only parts of AO3 would get scraped for Common Crawl datasets. Which are also huge! The October 2022 Common Crawl dataset is 380 tebibytes. The April 2021 dataset is 320 tebibytes. The 3 terabytes of Common Crawl data used to train LLaMA was randomly selected data totaling less than 1% of one full dataset. Not to mention, LLaMA's training dataset is currently on the (much) larger side compared to most LLM training datasets.
I also feel the need to point out again that AO3 is trying to prevent any Common Crawl scraping in the future, which would include protection for these new stories (several of which are also locked!).
Omegaverse just isn’t going to do anything to AI. Individual fics are going to do even less. Even if all of AO3 suddenly became omegaverse, it’s just not prominent enough to influence anything in regards to LLMs. You cannot affect training datasets in any meaningful way doing this. And while this might seem really disappointing, this is actually a good thing.
Remember that anything an individual can do to LLMs, the person you hate most can do the same. If it were possible for fandom to corrupt AI with omegaverse, fascists, bigots, and just straight up internet trolls could pollute it with hate speech and worse. AI already carries a lot of biases even while developers are actively trying to flatten that out, it’s good that organized groups can’t corrupt that deliberately.
101 notes · View notes
itsevanffsbutspam · 7 months ago
Text
hey! please remove your adblocker. please pay for premium. please pay extra to remove ads. please agree to these new and updated terms of service. sorry, it won't work until you agree. please disable your tracker prevention. please allow cookies. please allow cookies. did you mean allow third party vendors? it's against our terms of service to use adblocker. we saw you googled someone, do you want to follow them on twitter? please agree to our new terms of service. if you want to opt out of our ai, please navigate through four layer deep sub-menus to find the toggle. our new terms of service work retroactively. sorry, you didn't opt out before we activated our scraper. you agreed to our terms of service. want to delete your data? please buy our data protection and deletion service. please remove your adblocker. please disable your tracker prevention. please give us your payment data. subscribe now, only 300 a year with ads! please pay extra to remove ads. please pay us to continue using this free tool. it's just inflation. please don't use a vpn, it's against our terms of service. if you must use a vpn, please pay for our vpn and antivirus services. please enter your payment data to make use of this free trial. please remove your adblocker. please request your data before deactivating your account. processing your request may take up to 120 business years. it is currently not possible to remove your payment data from our service. it is against our terms of service to do that.
AGREE maybe later
9 notes · View notes
mariacallous · 1 year ago
Text
While the finer points of running a social media business can be debated, one basic truth is that they all run on attention. Tech leaders are incentivized to grow their user bases so there are more people looking at more ads for more time. It’s just good business.
As the owner of Twitter, Elon Musk presumably shared that goal. But he claimed he hadn't bought Twitter to make money. This freed him up to focus on other passions: stopping rival tech companies from scraping Twitter's data without permission—even if it meant losing eyeballs on ads.
Data-scraping was a known problem at Twitter. "Scraping was the open secret of Twitter data access. We knew about it. It was fine," Yoel Roth wrote on the Twitter alternative Bluesky. AI firms in particular were notorious for gobbling up huge swaths of text to train large language models. Now that those firms were worth a lot of money, the situation was far from fine, in Musk's opinion.
In November 2022, OpenAI debuted ChatGPT, a chatbot that could generate convincingly human text. By January 2023, the app had over 100 million users, making it the fastest-growing consumer app of all time. Three months later, OpenAI secured another round of funding that closed at an astounding valuation of $29 billion, more than Twitter was worth, by Musk's estimation.
OpenAI was a sore subject for Musk, who'd been one of the original founders and a major donor before stepping down in 2018 over disagreements with the other founders. After ChatGPT launched, Musk made no secret of the fact that he disagreed with the guardrails that OpenAI put on the chatbot to stop it from relaying dangerous or insensitive information. "The danger of training AI to be woke—in other words, lie—is deadly," Musk said on December 16, 2022. He was toying with starting a competitor.
Near the end of June 2023, Musk launched a two-part offensive to stop data scrapers, first directing Twitter employees to temporarily block “logged out view.” The change would mean that only people with Twitter accounts could view tweets.
“Logged out view” had a complicated history at Twitter. It was rumored to have played a part in the Arab Spring, allowing dissidents to view tweets without having to create a Twitter account and risk compromising their anonymity. But it was also an easy access point for people who wanted to scrape Twitter data.
Once Twitter made the change, Google was temporarily blocked from crawling Twitter and serving up relevant tweets in search results—a move that could negatively impact Twitter’s traffic. “We’re aware that our ability to crawl Twitter.com has been limited, affecting our ability to display tweets and pages from the site in search results,” Google spokesperson Lara Levin told The Verge. “Websites have control over whether crawlers can access their content.” As engineers discussed possible workarounds on Slack, one wrote: “Surely this was expected when that decision was made?”
Then engineers detected an “explosion of logged in requests,” according to internal Slack messages, indicating that data scrapers had simply logged in to Twitter to continue scraping. Musk ordered the change to be reversed.
On July 1, 2023, Musk launched part two of the offensive. Suddenly, if a user scrolled for just a few minutes, an error message popped up. “Sorry, you are rate limited,” the message read. “Please wait a few moments then try again.”
Rate limiting is a strategy that tech companies use to constrain network traffic by putting a cap on the number of times a user can perform a specific action within a given time frame (a mouthful, I know). It's often used to stop bad actors from trying to hack into people's accounts. If a user tries an incorrect password too many times, they see an error message and are told to come back later. The cost of doing this to someone who has forgotten their password is low (most people stay logged in), while the benefit to users is very high (it prevents many people's accounts from getting compromised).
Except, that wasn't what Musk had done. The rate limit that he ordered Twitter to roll out on July 1 was an API limit, meaning Twitter had capped the number of times users could refresh Twitter to look for new tweets and see ads. Rather than constrain users from performing a specific action, Twitter had limited all user actions. "I realize these are draconian rules," a Twitter engineer wrote on Slack. "They are temporary. We will reevaluate the situation tomorrow."
At first, Blue subscribers could see 6,000 posts a day, while nonsubscribers could see 600 (enough for just a few minutes of scrolling), and new nonsubscriber accounts could see just 300. As people started hitting the limits, #TwitterDown started trending on, well, Twitter. "This sucks dude you gotta 10X each of these numbers," wrote user @tszzl.
The impact quickly became obvious. Companies that used Twitter direct messages as a customer service tool were unable to communicate with clients. Major creators were blocked from promoting tweets, putting Musk's wish to stop data scrapers at odds with his initiative to make Twitter more creator-friendly. And Twitter's own trust and safety team was suddenly stopped from seeing violative tweets.
Engineers posted frantic updates in Slack. "FYI some large creators complaining because rate limit affecting paid subscription posts," one said.
Christopher Stanley, the head of information security, wrote with dismay that rate limits could apply to people refreshing the app to get news about a mass shooting or a major weather event. "The idea here is to stop scrapers, not prevent people from obtaining safety information," he wrote. Twitter soon raised the limits to 10,000 (for Blue subscribers), 1,000 (for nonsubscribers), and 500 (for new nonsubscribers). Now, 13 percent of all unverified users were hitting the rate limit.
Users were outraged. If Musk wanted to stop scrapers, surely there were better ways than just cutting off access to the service for everyone on Twitter.
"Musk has destroyed Twitter's value & worth," wrote attorney Mark S. Zaid. "Hubris + no pushback - customer empathy - data = a great way to light billions on fire," wrote former Twitter product manager Esther Crawford, her loyalties finally reversed.
Musk retweeted a joke from a parody account: “The reason I set a ‘View Limit’ is because we are all Twitter addicts and need to go outside.”
Aside from Musk, the one person who seemed genuinely excited about the changes was Evan Jones, a product manager on Twitter Blue. For months, he'd been sending executives updates regarding the anemic signup rates. Now, Blue subscriptions were skyrocketing. In May, Twitter had 535,000 Blue subscribers. At $8 per month, this was about $4.2 million a month in subscription revenue. By early July, there were 829,391 subscribers—a jump of about $2.4 million in revenue, not accounting for App Store fees.
"Blue signups still cookin," he wrote on Slack above a screenshot of the signup dashboard.
Jones's team capitalized on the moment, rolling out a prompt to upsell users who'd hit the rate limit and encouraging them to subscribe to Twitter Blue. In July, this prompt drove 1.7 percent of the Blue subscriptions from accounts that were more than 30 days old and 17 percent of the Blue subscriptions from accounts that were less than 30 days old.
Twitter CEO Linda Yaccarino was notably absent from the conversation until July 4, when she shared a Twitter blog post addressing the rate limiting fiasco, perhaps deliberately burying the news on a national holiday.
“To ensure the authenticity of our user base we must take extreme measures to remove spam and bots from our platform,” it read. “That’s why we temporarily limited usage so we could detect and eliminate bots and other bad actors that are harming the platform. Any advance notice on these actions would have allowed bad actors to alter their behavior to evade detection.” The company also claimed the “effects on advertising have been minimal.”
If Yaccarino's role was to cover for Musk's antics, she was doing an excellent job. Twitter rolled back the limits shortly after her announcement. On July 12, Musk debuted a generative AI company called xAI, which he promised would develop a language model that wouldn't be politically correct. "I think our AI can give answers that people may find controversial even though they are actually true," he said on Twitter Spaces.
Unlike the rival AI firms he was trying to block, Musk said xAI would likely train on Twitter’s data.
"The goal of xAI is to understand the true nature of the universe," the company said grandly in its mission statement, echoing Musk's first, disastrous town hall at Twitter. "We will share more information over the next couple of weeks and months."
In November 2023, xAI launched a chatbot called Grok that lacked the guardrails of tools like ChatGPT. Musk hyped the release by posting a screenshot of the chatbot giving him a recipe for cocaine. The company didn't appear close to understanding the nature of the universe, but perhaps that's coming.
Excerpt adapted from Extremely Hardcore: Inside Elon Musk’s Twitter by Zoë Schiffer. Published by arrangement with Portfolio Books, a division of Penguin Random House LLC. Copyright © 2024 by Zoë Schiffer.
20 notes · View notes
azspot · 2 years ago
Quote
In recent months, the signs and portents have been accumulating with increasing speed. Google is trying to kill the 10 blue links. Twitter is being abandoned to bots and blue ticks. There’s the junkification of Amazon and the enshittification of TikTok. Layoffs are gutting online media. A job posting looking for an “AI editor” expects “output of 200 to 250 articles per week.” ChatGPT is being used to generate whole spam sites. Etsy is flooded with “AI-generated junk.” Chatbots cite one another in a misinformation ouroboros. LinkedIn is using AI to stimulate tired users. Snapchat and Instagram hope bots will talk to you when your friends don’t. Redditors are staging blackouts. Stack Overflow mods are on strike. The Internet Archive is fighting off data scrapers, and “AI is tearing Wikipedia apart.” The old web is dying, and the new web struggles to be born.
AI is killing the old web, and the new web struggles to be born
67 notes · View notes
yadivagirl · 2 years ago
Text
I linked this as a gift so hopefully people can read it. Below is the part about AO3.
At Archive of Our Own, a fan fiction database with more than 11 million stories, writers have increasingly pressured the site to ban data-scraping and A.I.-generated stories.
In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on Archive of Our Own, dozens of writers rose up in arms. They blocked their stories and wrote subversive content to mislead the A.I. scrapers. They also pushed Archive of Our Own’s leaders to stop allowing A.I.-generated content.
Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at University of Tulsa College of Law, said the site had a policy of “maximum inclusivity” and did not want to be in the position of discerning which stories were written with A.I.
11 notes · View notes
mghostship · 1 year ago
Text
Haunted.
Fuck you with the data mining. Left twitter for this wreck of a haven just to be met with more shit. May this greedy fucking site burn for selling our data and ass kissing fucking scrapers. Rot you bitches!!!
2 notes · View notes
slechterick · 2 years ago
Text
tumblr never fixing their search functionality while twitter and reddit are struggling to combat data scrapers looking to train language models has to be the most "won by doing nothing" yet in the new social media wars
5 notes · View notes