#is web scraping legal
catchexperts · 3 months ago
Text
Web Scraping 101: Everything You Need to Know in 2025
🕸️ What Is Web Scraping? An Introduction
Web scraping—also referred to as web data extraction—is the process of collecting structured information from websites using automated scripts or tools. Initially driven by simple scripts, it has now evolved into a core component of modern data strategies for competitive research, price monitoring, SEO, market intelligence, and more.
If you’re wondering “What is the introduction of web scraping?” — it’s this: the ability to turn unstructured web content into organized datasets businesses can use to make smarter, faster decisions.
💡 What Is Web Scraping Used For?
Businesses and developers alike use web scraping to:
Monitor competitors’ pricing and SEO rankings
Extract leads from directories or online marketplaces
Track product listings, reviews, and inventory
Aggregate news, blogs, and social content for trend analysis
Fuel AI models with large datasets from the open web
Whether it’s web scraping using Python, browser-based tools, or cloud APIs, the use cases are growing fast across marketing, research, and automation.
🔍 Examples of Web Scraping in Action
What is an example of web scraping?
A real estate firm scrapes listing data (price, location, features) from property websites to build a market dashboard.
An eCommerce brand scrapes competitor prices daily to adjust its own pricing in real time.
A SaaS company uses BeautifulSoup in Python to extract product reviews and social proof for sentiment analysis.
For many, web scraping is the first step in automating decision-making and building data pipelines for BI platforms.
⚖️ Is Web Scraping Legal?
Yes—if done ethically and responsibly. While scraping public data is legal in many jurisdictions, scraping private, gated, or copyrighted content can lead to violations.
To stay compliant:
Respect robots.txt rules
Avoid scraping personal or sensitive data
Prefer API access where possible
Follow website terms of service
If you’re wondering “Is web scraping legal?”—the answer lies in how you scrape and what you scrape.
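One practical way to honor the robots.txt point above is to check it programmatically before fetching anything. Below is a minimal sketch using Python's standard-library urllib.robotparser; the site URL and the user-agent name are placeholders, not real endpoints:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and bot name -- swap in your own before using.
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # downloads and parses the robots.txt file

page = "https://example.com/products/page-1"
if robots.can_fetch("my-scraper-bot", page):
    print("robots.txt allows fetching", page)
else:
    print("robots.txt disallows fetching", page)
```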
🧠 Web Scraping with Python: Tools & Libraries
What is web scraping in Python? Python is the most popular language for scraping because of its ease of use and strong ecosystem.
Popular Python libraries for web scraping include:
BeautifulSoup – simple and effective for HTML parsing
Requests – handles HTTP requests
Selenium – ideal for dynamic JavaScript-heavy pages
Scrapy – robust framework for large-scale scraping projects
Puppeteer (via Node.js) – for advanced browser emulation
These tools are often used in tutorials like “Web scraping using Python BeautifulSoup” or “Python web scraping library for beginners.”
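To make the Requests + BeautifulSoup combination concrete, here is a minimal scraper sketch. The URL, the User-Agent string, and the CSS selector are placeholder assumptions; a real target page will need its own selectors:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and headers -- adjust for a site you are allowed to scrape.
url = "https://example.com/products"
headers = {"User-Agent": "my-scraper-bot/0.1 (contact: you@example.com)"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# The selector below is an assumption about the page's markup.
for title in soup.select("h2.product-title"):
    print(title.get_text(strip=True))
```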
⚙️ DIY vs. Managed Web Scraping
You can choose between:
DIY scraping: Full control, requires dev resources
Managed scraping: Outsourced to experts, ideal for scale or non-technical teams
Use managed scraping services for large-scale needs, or build Python-based scrapers for targeted projects using frameworks and libraries mentioned above.
🚧 Challenges in Web Scraping (and How to Overcome Them)
Modern websites often include:
JavaScript rendering
CAPTCHA protection
Rate limiting and dynamic loading
To overcome these challenges:
Use rotating proxies
Implement headless browsers like Selenium
Leverage AI-powered scraping for content variation and structure detection
Deploy scrapers on cloud platforms using containers (e.g., Docker + AWS)
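For the JavaScript-rendering problem in particular, a headless browser lets the page run its scripts before you read the HTML. A minimal Selenium sketch, assuming Chrome and a matching driver are available (the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    html = driver.page_source          # DOM content after JavaScript has run
    print(len(html), "characters of rendered HTML")
finally:
    driver.quit()  # always release the browser process
```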
🔐 Ethical and Legal Best Practices
Scraping must balance business innovation with user privacy and legal integrity. Ethical scraping includes:
Minimal server load
Clear attribution
Honoring opt-out mechanisms
This ensures long-term scalability and compliance for enterprise-grade web scraping systems.
🔮 The Future of Web Scraping
As demand for real-time analytics and AI training data grows, scraping is becoming:
Smarter (AI-enhanced)
Faster (real-time extraction)
Scalable (cloud-native deployments)
From developers using BeautifulSoup or Scrapy, to businesses leveraging API-fed dashboards, web scraping is central to turning online information into strategic insights.
📘 Summary: Web Scraping 101 in 2025
Web scraping in 2025 is the automated collection of website data, widely used for SEO monitoring, price tracking, lead generation, and competitive research. It relies on powerful tools like BeautifulSoup, Selenium, and Scrapy, especially within Python environments. While scraping publicly available data is generally legal, it's crucial to follow website terms of service and ethical guidelines to avoid compliance issues. Despite challenges like dynamic content and anti-scraping defenses, the use of AI and cloud-based infrastructure is making web scraping smarter, faster, and more scalable than ever—transforming it into a cornerstone of modern data strategies.
🔗 Want to Build or Scale Your AI-Powered Scraping Strategy?
Whether you're exploring AI-driven tools, training models on web data, or integrating smart automation into your data workflows—AI is transforming how web scraping works at scale.
👉 Find AI Agencies specialized in intelligent web scraping on Catch Experts.
themorningnewsinformer · 18 days ago
Text
Cloudflare AI Bot Blocker: A Game-Changer for Web Publishers
Introduction The digital publishing world is fighting back against unauthorized AI data scraping. With the launch of the Cloudflare AI bot blocker, over a million websites—including media giants like Sky News and Buzzfeed—can now block AI bots from collecting content without consent. This transformative tool gives creators the control they’ve long demanded over their digital work. Why Is…
actowizsolutions0 · 4 months ago
Text
Learn best practices for automated data scraping to avoid blocks and extract valuable insights efficiently. Optimize your web scraping strategies today!
lasnoticiasdevesko-blog · 9 months ago
Link
🔍 Learn about techniques like Selenium, Puppeteer, APIs, and legal and ethical web scraping, all explained in a simple and practical way! 💻 #WebScraping #APIs #SEO #Automatización #Tecnología #DesarrolloWeb
mostlysignssomeportents · 1 year ago
Text
Copyright takedowns are a cautionary tale that few are heeding
On July 14, I'm giving the closing keynote for the fifteenth HACKERS ON PLANET EARTH, in QUEENS, NY. Happy Bastille Day! On July 20, I'm appearing in CHICAGO at Exile in Bookville.
We're living through one of those moments when millions of people become suddenly and overwhelmingly interested in fair use, one of the subtlest and worst-understood aspects of copyright law. It's not a subject you can master by skimming a Wikipedia article!
I've been talking about fair use with laypeople for more than 20 years. I've met so many people who possess the unshakable, serene confidence of the truly wrong, like the people who think fair use means you can take x words from a book, or y seconds from a song and it will always be fair, while anything more will never be.
Or the people who think that if you violate any of the four factors, your use can't be fair – or the people who think that if you fail all of the four factors, you must be infringing (people, the Supreme Court is calling and they want to tell you about the Betamax!).
You might think that you can never quote a song lyric in a book without infringing copyright, or that you must clear every musical sample. You might be rock solid certain that scraping the web to train an AI is infringing. If you hold those beliefs, you do not understand the "fact intensive" nature of fair use.
But you can learn! It's actually a really cool and interesting and gnarly subject, and it's a favorite of copyright scholars, who have really fascinating disagreements and discussions about the subject. These discussions often key off of the controversies of the moment, but inevitably they implicate earlier fights about everything from the piano roll to 2 Live Crew to antiracist retellings of Gone With the Wind.
One of the most interesting discussions of fair use you can ask for took place in 2019, when the NYU Engelberg Center on Innovation Law & Policy held a symposium called "Proving IP." One of the panels featured dueling musicologists debating the merits of the Blurred Lines case. That case marked a turning point in music copyright, with the Marvin Gaye estate successfully suing Robin Thicke and Pharrell Williams for copying the "vibe" of Gaye's "Got to Give it Up."
Naturally, this discussion featured clips from both songs as the experts – joined by some of America's top copyright scholars – delved into the legal reasoning and future consequences of the case. It would be literally impossible to discuss this case without those clips.
And that's where the problems start: as soon as the symposium was uploaded to Youtube, it was flagged and removed by Content ID, Google's $100,000,000 copyright enforcement system. This initial takedown was fully automated, which is how Content ID works: rightsholders upload audio to claim it, and then Content ID removes other videos where that audio appears (rightsholders can also specify that videos with matching clips be demonetized, or that the ad revenue from those videos be diverted to the rightsholders).
But Content ID has a safety valve: an uploader whose video has been incorrectly flagged can challenge the takedown. The case is then punted to the rightsholder, who has to manually renew or drop their claim. In the case of this symposium, the rightsholder was Universal Music Group, the largest record company in the world. UMG's personnel reviewed the video and did not drop the claim.
99.99% of the time, that's where the story would end, for many reasons. First of all, most people don't understand fair use well enough to contest the judgment of a cosmically vast, unimaginably rich monopolist who wants to censor their video. Just as importantly, though, is that Content ID is a Byzantine system that is nearly as complex as fair use, but it's an entirely private affair, created and adjudicated by another galactic-scale monopolist (Google).
Google's copyright enforcement system is a cod-legal regime with all the downsides of the law, and a few wrinkles of its own (for example, it's a system without lawyers – just corporate experts doing battle with laypeople). And a single mis-step can result in your video being deleted or your account being permanently deleted, along with every video you've ever posted. For people who make their living on audiovisual content, losing your Youtube account is an extinction-level event:
https://www.eff.org/wp/unfiltered-how-youtubes-content-id-discourages-fair-use-and-dictates-what-we-see-online
So for the average Youtuber, Content ID is a kind of Kafka-as-a-Service system that is always avoided and never investigated. But the Engelberg Center isn’t your average Youtuber: they boast some of the country’s top copyright experts, specializing in exactly the questions Youtube’s Content ID is supposed to be adjudicating.
So naturally, they challenged the takedown – only to have UMG double down. This is par for the course with UMG: they are infamous for refusing to consider fair use in takedown requests. Their stance is so unreasonable that a court actually found them guilty of violating the DMCA's provision against fraudulent takedowns:
https://www.eff.org/cases/lenz-v-universal
But the DMCA's takedown system is part of the real law, while Content ID is a fake law, created and overseen by a tech monopolist, not a court. So the fate of the Blurred Lines discussion turned on the Engelberg Center's ability to navigate both the law and the n-dimensional topology of Content ID's takedown flowchart.
It took more than a year, but eventually, Engelberg prevailed.
Until they didn't.
If Content ID was a person, it would be a baby, specifically, a baby under 18 months old – that is, before the development of "object permanence." Until our 18th month (or so), we lack the ability to reason about things we can't see – this is the period when small babies find peek-a-boo amazing. Object permanence is the ability to understand things that aren't in your immediate field of vision.
Content ID has no object permanence. Despite the fact that the Engelberg Blurred Lines panel was the most involved fair use question the system was ever called upon to parse, it managed to repeatedly forget that it had decided that the panel could stay up. Over and over since that initial determination, Content ID has taken down the video of the panel, forcing Engelberg to go through the whole process again.
But that's just for starters, because Youtube isn't the only place where a copyright enforcement bot is making billions of unsupervised, unaccountable decisions about what audiovisual material you're allowed to access.
Spotify is yet another monopolist, with a justifiable reputation for being extremely hostile to artists' interests, thanks in large part to the role that UMG and the other major record labels played in designing its business rules:
https://pluralistic.net/2022/09/12/streaming-doesnt-pay/#stunt-publishing
Spotify has spent hundreds of millions of dollars trying to capture the podcasting market, in the hopes of converting one of the last truly open digital publishing systems into a product under its control:
https://pluralistic.net/2023/01/27/enshittification-resistance/#ummauerter-garten-nein
Thankfully, that campaign has failed – but millions of people have (unwisely) ditched their open podcatchers in favor of Spotify's pre-enshittified app, so everyone with a podcast now must target Spotify for distribution if they hope to reach those captive users.
Guess who has a podcast? The Engelberg Center.
Naturally, Engelberg's podcast includes the audio of that Blurred Lines panel, and that audio includes samples from both "Blurred Lines" and "Got To Give It Up."
So – naturally – UMG keeps taking down the podcast.
Spotify has its own answer to Content ID, and incredibly, it's even worse and harder to navigate than Google's pretend legal system. As Engelberg describes in its latest post, UMG and Spotify have colluded to ensure that this now-classic discussion of fair use will never be able to take advantage of fair use itself:
https://www.nyuengelberg.org/news/how-explaining-copyright-broke-the-spotify-copyright-system/
Remember, this is the best case scenario for arguing about fair use with a monopolist like UMG, Google, or Spotify. As Engelberg puts it:
The Engelberg Center had an extraordinarily high level of interest in pursuing this issue, and legal confidence in our position that would have cost an average podcaster tens of thousands of dollars to develop. That cannot be what is required to challenge the removal of a podcast episode.
Automated takedown systems are the tech industry's answer to the "notice-and-takedown" system that was invented to broker a peace between copyright law and the internet, starting with the US's 1998 Digital Millennium Copyright Act. The DMCA implements (and exceeds) a pair of 1996 UN treaties, the WIPO Copyright Treaty and the Performances and Phonograms Treaty, and most countries in the world have some version of notice-and-takedown.
Big corporate rightsholders claim that notice-and-takedown is a gift to the tech sector, one that allows tech companies to get away with copyright infringement. They want a "strict liability" regime, where any platform that allows a user to post something infringing is liable for that infringement, to the tune of $150,000 in statutory damages.
Of course, there's no way for a platform to know a priori whether something a user posts infringes on someone's copyright. There is no registry of everything that is copyrighted, and of course, fair use means that there are lots of ways to legally reproduce someone's work without their permission (or even when they object). Even if every person who ever has trained or ever will train as a copyright lawyer worked 24/7 for just one online platform to evaluate every tweet, video, audio clip and image for copyright infringement, they wouldn't be able to touch even 1% of what gets posted to that platform.
The "compromise" that the entertainment industry wants is automated takedown – a system like Content ID, where rightsholders register their copyrights and platforms block anything that matches the registry. This "filternet" proposal became law in the EU in 2019 with Article 17 of the Digital Single Market Directive:
https://www.eff.org/deeplinks/2018/09/today-europe-lost-internet-now-we-fight-back
This was the most controversial directive in EU history, and – as experts warned at the time – there is no way to implement it without violating the GDPR, Europe's privacy law, so now it's stuck in limbo:
https://www.eff.org/deeplinks/2022/05/eus-copyright-directive-still-about-filters-eus-top-court-limits-its-use
As critics pointed out during the EU debate, there are so many problems with filternets. For one thing, these copyright filters are very expensive: remember that Google has spent $100m on Content ID alone, and that only does a fraction of what filternet advocates demand. Building the filternet would cost so much that only the biggest tech monopolists could afford it, which is to say, filternets are a legal requirement to keep the tech monopolists in business and prevent smaller, better platforms from ever coming into existence.
Filternets are also incapable of telling the difference between similar files. This is especially problematic for classical musicians, who routinely find their work blocked or demonetized by Sony Music, which claims performances of all the most important classical music compositions:
https://pluralistic.net/2021/05/08/copyfraud/#beethoven-just-wrote-music
Content ID can't tell the difference between your performance of "The Goldberg Variations" and Glenn Gould's. For classical musicians, the best case scenario is to have their online wages stolen by Sony, who fraudulently claim copyright to their recordings. The worst case scenario is that their video is blocked, their channel deleted, and their names blacklisted from ever opening another account on one of the monopoly platforms.
But when it comes to free expression, the role that notice-and-takedown and filternets play in the creative industries is really a sideshow. In creating a system of no-evidence-required takedowns, with no real consequences for fraudulent takedowns, these systems are a huge gift to the world's worst criminals. For example, "reputation management" companies help convicted rapists, murderers, and even war criminals purge the internet of true accounts of their crimes by claiming copyright over them:
https://pluralistic.net/2021/04/23/reputation-laundry/#dark-ops
Remember how during the covid lockdowns, scumbags marketed junk devices by claiming that they'd protect you from the virus? Their products remained online, while the detailed scientific articles warning people about the fraud were speedily removed through false copyright claims:
https://pluralistic.net/2021/10/18/labor-shortage-discourse-time/#copyfraud
Copyfraud – making false copyright claims – is an extremely safe crime to commit, and it's not just quack covid remedy peddlers and war criminals who avail themselves of it. Tech giants like Adobe do not hesitate to abuse the takedown system, even when that means exposing millions of people to spyware:
https://pluralistic.net/2021/10/13/theres-an-app-for-that/#gnash
Dirty cops play loud, copyrighted music during confrontations with the public, in the hopes that this will trigger copyright filters on services like Youtube and Instagram and block videos of their misbehavior:
https://pluralistic.net/2021/02/10/duke-sucks/#bhpd
But even if you solved all these problems with filternets and takedown, this system would still choke on fair use and other copyright exceptions. These are "fact intensive" questions that the world's top experts struggle with (as anyone who watches the Blurred Lines panel can see). There's no way we can get software to accurately determine when a use is or isn't fair.
That's a question that the entertainment industry itself is increasingly conflicted about. The Blurred Lines judgment opened the floodgates to a new kind of copyright troll – grifters who sued the record labels and their biggest stars for taking the "vibe" of songs that no one ever heard of. Musicians like Ed Sheeran have been sued for millions of dollars over these alleged infringements. These suits caused the record industry to (ahem) change its tune on fair use, insisting that fair use should be broadly interpreted to protect people who made things that were similar to existing works. The labels understood that if "vibe rights" became accepted law, they'd end up in the kind of hell that the rest of us enter when we try to post things online – where anything they produce can trigger takedowns, long legal battles, and millions in liability:
https://pluralistic.net/2022/04/08/oh-why/#two-notes-and-running
But the music industry remains deeply conflicted over fair use. Take the curious case of Katy Perry's song "Dark Horse," which attracted a multimillion-dollar suit from an obscure Christian rapper who claimed that a brief phrase in "Dark Horse" was impermissibly similar to his song "A Joyful Noise."
Perry and her publisher, Warner Chappell, lost the suit and were ordered to pay $2.8m. While they subsequently won an appeal, this definitely put the cold grue up Warner Chappell's back. They could see a long future of similar suits launched by treasure hunters hoping for a quick settlement.
But here's where it gets unbelievably weird and darkly funny. A Youtuber named Adam Neely made a wildly successful viral video about the suit, taking Perry's side and defending her song. As part of that video, Neely included a few seconds' worth of "A Joyful Noise," the song that Perry was accused of copying.
In court, Warner Chappell had argued that "A Joyful Noise" was not similar to Perry's "Dark Horse." But when Warner had Google remove Neely's video, they claimed that the sample from "Joyful Noise" was actually taken from "Dark Horse." Incredibly, they maintained this position through multiple appeals through the Content ID system:
https://pluralistic.net/2020/03/05/warner-chappell-copyfraud/#warnerchappell
In other words, they maintained that the song that they'd told the court was totally dissimilar to their own was so indistinguishable from their own song that they couldn't tell the difference!
Now, this question of vibes, similarity and fair use has only gotten more intense since the takedown of Neely's video. Just this week, the RIAA sued several AI companies, claiming that the songs the AI shits out are infringingly similar to tracks in their catalog:
https://www.rollingstone.com/music/music-news/record-labels-sue-music-generators-suno-and-udio-1235042056/
Even before "Blurred Lines," this was a difficult fair use question to answer, with lots of chewy nuances. Just ask George Harrison:
https://en.wikipedia.org/wiki/My_Sweet_Lord
But as the Engelberg panel's cohort of dueling musicologists and renowned copyright experts proved, this question only gets harder as time goes by. If you listen to that panel (if you can listen to that panel), you'll be hard pressed to come away with any certainty about the questions in this latest lawsuit.
The notice-and-takedown system is what's known as an "intermediary liability" rule. Platforms are "intermediaries" in that they connect end users with each other and with businesses. Ebay and Etsy and Amazon connect buyers and sellers; Facebook and Google and Tiktok connect performers, advertisers and publishers with audiences and so on.
For copyright, notice-and-takedown gives platforms a "safe harbor." A platform doesn't have to remove material after an allegation of infringement, but if they don't, they're jointly liable for any future judgment. In other words, Youtube isn't required to take down the Engelberg Blurred Lines panel, but if UMG sues Engelberg and wins a judgment, Google will also have to pay out.
During the adoption of the 1996 WIPO treaties and the 1998 US DMCA, this safe harbor rule was characterized as a balance between the rights of the public to publish online and the interest of rightsholders whose material might be infringed upon. The idea was that things that were likely to be infringing would be immediately removed once the platform received a notification, but that platforms would ignore spurious or obviously fraudulent takedowns.
That's not how it worked out. Whether it's Sony Music claiming to own your performance of "Fur Elise" or a war criminal claiming authorship over a newspaper story about his crimes, platforms nuke first and ask questions never. Why not? If they ignore a takedown and get it wrong, they suffer dire consequences ($150,000 per claim). But if they take action on a dodgy claim, there are no consequences. Of course they're just going to delete anything they're asked to delete.
This is how platforms always handle liability, and that's a lesson that we really should have internalized by now. After all, the DMCA is the second-most famous intermediary liability system for the internet – the most (in)famous is Section 230 of the Communications Decency Act.
This is a 27-word law that says that platforms are not liable for civil damages arising from their users' speech. Now, this is a US law, and in the US, there aren't many civil damages from speech to begin with. The First Amendment makes it very hard to get a libel judgment, and even when these judgments are secured, damages are typically limited to "actual damages" – generally a low sum. Most of the worst online speech is actually not illegal: hate speech, misinformation and disinformation are all covered by the First Amendment.
Notwithstanding the First Amendment, there are categories of speech that US law criminalizes: actual threats of violence, criminal harassment, and committing certain kinds of legal, medical, election or financial fraud. These are all exempted from Section 230, which only provides immunity for civil suits, not criminal acts.
What Section 230 really protects platforms from is being named to unwinnable nuisance suits by unscrupulous parties who are betting that the platforms would rather remove legal speech that they object to than go to court. A generation of copyfraudsters have proved that this is a very safe bet:
https://www.techdirt.com/2020/06/23/hello-youve-been-referred-here-because-youre-wrong-about-section-230-communications-decency-act/
In other words, if you made a #MeToo accusation, or if you were a gig worker using an online forum to organize a union, or if you were blowing the whistle on your employer's toxic waste leaks, or if you were any other under-resourced person being bullied by a wealthy, powerful person or organization, that organization could shut you up by threatening to sue the platform that hosted your speech. The platform would immediately cave. But those same rich and powerful people would have access to the lawyers and back-channels that would prevent you from doing the same to them – that's why Sony can get your Brahms recital taken down, but you can't turn around and do the same to them.
This is true of every intermediary liability system, and it's been true since the earliest days of the internet, and it keeps getting proven to be true. Six years ago, Trump signed SESTA/FOSTA, a law that allowed platforms to be held civilly liable by survivors of sex trafficking. At the time, advocates claimed that this would only affect "sexual slavery" and would not impact consensual sex-work.
But from the start, and ever since, SESTA/FOSTA has primarily targeted consensual sex-work, to the immediate, lasting, and profound detriment of sex workers:
https://hackinghustling.org/what-is-sesta-fosta/
SESTA/FOSTA killed the "bad date" forums where sex workers circulated the details of violent and unstable clients, killed the online booking sites that allowed sex workers to screen their clients, and killed the payment processors that let sex workers avoid holding unsafe amounts of cash:
https://www.eff.org/deeplinks/2022/09/fight-overturn-fosta-unconstitutional-internet-censorship-law-continues
SESTA/FOSTA made voluntary sex work more dangerous – and also made life harder for law enforcement efforts to target sex trafficking:
https://hackinghustling.org/erased-the-impact-of-fosta-sesta-2020/
Despite half a decade of SESTA/FOSTA, despite 15 years of filternets, despite a quarter century of notice-and-takedown, people continue to insist that getting rid of safe harbors will punish Big Tech and make life better for everyday internet users.
As of now, it seems likely that Section 230 will be dead by the end of 2025, even if there is nothing in place to replace it:
https://energycommerce.house.gov/posts/bipartisan-energy-and-commerce-leaders-announce-legislative-hearing-on-sunsetting-section-230
This isn't the win that some people think it is. By making platforms responsible for screening the content their users post, we create a system that only the largest tech monopolies can survive, and only then by removing or blocking anything that threatens or displeases the wealthy and powerful.
Filternets are not precision-guided takedown machines; they're indiscriminate cluster-bombs that destroy anything in the vicinity of illegal speech – including (and especially) the best-informed, most informative discussions of how these systems go wrong, and how that blocks the complaints of the powerless, the marginalized, and the abused.
Support me this summer on the Clarion Write-A-Thon and help raise money for the Clarion Science Fiction and Fantasy Writers' Workshop!
If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
https://pluralistic.net/2024/06/27/nuke-first/#ask-questions-never
Image: EFF https://www.eff.org/files/banner_library/yt-fu-1b.png
CC BY 3.0 https://creativecommons.org/licenses/by/3.0/deed.en
death-at-20k-volts · 2 months ago
Text
On the subject of AI...
Okay so, I have been seeing more and more stuff related to AI-generated art recently so I’m gonna make my stance clear:
I am strongly against generative AI. I do not condone its usage personally, professionally, or in any other context.
More serious take under the cut, I am passionate about this subject:
So, first things first, I’ll get my qualifications out of the way: BSc (Hons) Computer Science with a specialty in Artificial Intelligence systems and Data Security and Governance. I wrote my thesis, and did multiple R&D-style papers, on the subject. On the lower end I also have (I think the equivalent is an associate’s?) qualifications in art and IT systems. I’m not normally the type to pull the ‘well actually 🤓☝️’ card, but I'm laying some groundwork here to establish that I am heavily involved in the fields this subject relates to, both academically and professionally.
So what is 'AI' in this context?
Nowadays when someone says ‘AI’, they’re most likely talking about Generative Artificial Intelligence – it’s a subtype of AI system that is used, primarily, to produce images, text, videos, and other media formats (thus, generative). 
By this point, we’ve all heard of the likes of ChatGPT, Midjourney, etc – you get the idea. These are generative AI systems used to create the above mentioned content types. 
Now, you might be inclined to think things such as:
‘Well, isn’t that a good thing? Creating stuff just got a whole lot easier!’ 
‘I struggle to draw [for xyz reason], so this is a really useful tool’
‘I’m not an artist, so it’s nice to be able to have something that makes things how I want them to look’
No, it’s not a good thing, and I’ll tell you exactly why.
-------------------------------------------------
What makes genAI so bad?
There’s a few reasons that slate AI as condemnable, and I’ll do my best to cover them here as concisely as I reasonably can. Some of these issues are, admittedly, hypothetical in nature – the fact of the matter is, this is a technology that has come to rise faster than people and legislature (law) can even keep up with. 
Stealing Is Bad, M’kay?
Now you’re probably thinking, hold on, where does theft come into this? So, allow me to explain.
Generative AI systems are able to output the things that they do because first and foremost, they’re ‘trained’: fed lots and lots of data, so that when it’s queried with specific parameters, the result is media generated to specification. Most people understand this bit – I mean, a lot of us have screwed around with ChatGPT once or twice. I won't lie and say I haven't, because I have. Mainly for research purposes, but still. (The above is a massive simplification of the matter, because I ain't here to teach you at a university level)
Now, give some thought to where exactly that training data comes from. 
Typically, this data is sourced from the web; droves of information are systematically scraped from just about every publicly available domain available on the internet, whether that be photographs someone took, art, music, writing…the list goes on. Now, I’ll underline the core of this issue nice and clearly so you get the point I’m making:
It’s not your work.
Nor does it belong to the people responsible for these systems; untold numbers of people have had their content - potentially personal content, copyrighted content - taken and used for data training. Think about it – one person having their stuff stolen and reused is bad, right? Now imagine you’ve got a whole bunch of someones who are having their stuff taken, likely without them even knowing about it, and well – that’s, obviously, very bad. Which sets up a great segue into the next point:
Potential Legislation Issues
For the sake of readability, I’ll try not to dive too deep into legalese here. In short – because of the inherent nature of genAI (that is, the taking-and-using of potentially private and licensed material), there may come a time where this poses a very real legal issue in terms of usage rights.
At the time of writing, legislation hasn’t caught up – there aren't any ratified laws that state how, and where, big AI systems such as ChatGPT can and cannot source training data. Many arguments could be made that the scope and nature of these systems practically divorces generated content from its source material, however many do not agree with this sentiment; in fact, there have been some instances of people seeking legal action due to perceived copyright infringement and material reuse without fair compensation.
It might not be in violation of laws on paper right now, but it certainly violates the spirit of these laws – laws that are designed to protect the works of creatives the world over. 
AI Is Trash, And It’s Getting Trashier
Woah woah woah, I thought this was a factual document, not an opinion piece!
Fair. I’d be a liar if I said it wasn’t partly rooted in opinion, but here’s the fact: genAI is, objectively, getting worse. I could get really technical with the why portion, but I’m not rewriting my thesis here, so I’ll put it as simply as possible:
AI gets trained on Internet Stuff. AI is dubiously correct at best because of how it aggregates data (that is, from everywhere, even the factually-incorrect places)
People use AI to make stuff. They take this stuff at face value, and they don’t sanity check it against actual trusted sources of information (or a dictionary. Or an anatomy textbook)
People put that stuff back on the internet, be it in the form of images, written statements, "artwork", etc
Loop back to step 1
In the field of Artificial Intelligence this is sometimes called a runaway feedback loop: it’s the mother of all feedback loops that results in aggregated information getting more and more horrifically incorrect, inaccurate, and poorly put-together over time. Everything from facts to grammar, to that poor anime character’s sixth and seventh fingers – nothing gets spared, because there comes a point where these systems are being trained on their own outputs.
I somewhat affectionately refer to this as ‘informational inbreeding’; it is becoming the pug of the digital landscape, bug eyes and all.
Now I will note, runaway feedback loops are typically referencing algorithmic bias - but if I'm being honest, it's an apt descriptor for what's happening here too.
This trend will, inevitably, continue to get worse over time; the prevalence of AI generated media is so commonplace now that it’s unavoidable – that these systems are going to be eating their own tails until they break. 
-------------------------------------------------
But I can’t draw/write! What am I supposed to do?
The age-old struggle – myself and many others sympathize, we really do. Maybe you struggle to come up with ideas, or to put your thoughts to paper cohesively, or drawing and writing is just something you’ve never really taken the time to develop before, but you’re really eager to make a start for yourself.
Maybe, like many of us including myself, you have disabilities that limit your mobility, dexterity, cognition, etc. Not your fault, obviously – it can make stuff difficult! It really can! And it can be really demoralizing to feel as though you're limited or being held back by something you can't help.
Here’s the thing, though:
It’s not an excuse, and it won’t make you a good artist.
The very artists you may or may not look up to got as good as they did by practicing. We all started somewhere, and being honest, that somewhere is something we’d cringe at if we had to look back at it for more than five minutes. I know I do. But in the context of a genAI-dominated internet nowadays, it's still something wonderfully human.
There are also many, many artists across history and time with all manner of disabilities, from chronic pain to paralysis, who still create. No two disabilities are the same, a fact I am well aware of, but there is ample proof that sheer human tenacity is a powerful tool in and of itself.
Or, put more bluntly and somewhat callously: you are not a unique case. You are not in some special category that justifies this particular brand of laziness, and your difficulties and struggles aren't license to take things that aren't yours.
The only way you’re going to create successfully? Is by actually creating things yourself. ‘Asking ChatGPT’ to spit out a writing piece for you is not writing, and you are not a writer for doing so. Using Midjourney or whatever to generate you a picture does not make you an artist. You are only doing yourself a disservice by relying on these tools.
I'll probably add more to this in time, thoughts are hard and I'm tired.
snickerdoodlles · 2 years ago
Text
pulling out a section from this post (a very basic breakdown of generative AI) for easier reading;
AO3 and Generative AI
There are unfortunately some massive misunderstandings in regards to AO3 being included in LLM training datasets. This post was semi-prompted by the ‘Knot in my name’ AO3 tag (for those of you who haven’t heard of it, it’s supposed to be a fandom anti-AI event where AO3 writers help “further pollute” AI with Omegaverse), so let’s take a moment to address AO3 in conjunction with AI. We’ll start with the biggest misconception:
1. AO3 wasn’t used to train generative AI.
Or at least not anymore than any other internet website. AO3 was not deliberately scraped to be used as LLM training data.
The AO3 moderators found traces of the Common Crawl web worm in their servers. The Common Crawl is an open data repository of raw web page data, metadata extracts and text extracts collected from 10+ years of web crawling. Its collective data is measured in petabytes. (As a note, it also only features samples of the available pages on a given domain in its datasets, because its data is freely released under fair use and this is part of how they navigate copyright.) LLM developers use it and similar web crawls like Google’s C4 to bulk up the overall amount of pre-training data.
AO3 is big to an individual user, but it’s actually a small website when it comes to the amount of data used to pre-train LLMs. It’s also just a bad candidate for training data. As a comparison example, Wikipedia is often used as high quality training data because it’s a knowledge corpus and its moderators put a lot of work into maintaining a consistent quality across its web pages. AO3 is just a repository for all fanfic -- it doesn’t have any of that quality maintenance nor any knowledge density. Just in terms of practicality, even if people could get around the copyright issues, the sheer amount of work that would go into curating and labeling AO3’s data (or even a part of it) to make it useful for the fine-tuning stages most likely outstrips any potential usage.
Speaking of copyright, AO3 is a terrible candidate for training data just based on that. Even if people (incorrectly) think fanfic doesn’t hold copyright, there are plenty of books and texts that are public domain that can be found in online libraries that make for much better training data (or rather, there is a higher consistency in quality for them that would make them more appealing than fic for people specifically targeting written story data). And for any scrapers who don’t care about legalities or copyright, they’re going to target published works instead. Meta is in fact currently getting sued for including published books from a shadow library in its training data (note, this case is not in regards to any copyrighted material that might’ve been caught in the Common Crawl data, it’s regarding a book repository of published books that was scraped specifically to bring in some higher quality data for the first training stage). In a similar case, there’s an anonymous group suing Microsoft, GitHub, and OpenAI for training their LLMs on open source code.
Getting back to my point, AO3 is just not desirable training data. It’s not big enough to be worth scraping for pre-training data, it’s not curated enough to be considered for high quality data, and its data comes with copyright issues to boot. If LLM creators are saying there was no active pursuit in using AO3 to train generative AI, then there was (99% likelihood) no active pursuit in using AO3 to train generative AI.
AO3 has some preventative measures against being included in future Common Crawl datasets, which may or may not work, but there’s no way to remove any previously scraped data from that data corpus. And as a note for anyone locking their AO3 fics: that might potentially help against future AO3 scrapes, but it is rather moot if you post the same fic in full to other platforms like ffn, twitter, tumblr, etc. that have zero preventative measures against data scraping.
2. A/B/O is not polluting generative AI
…I’m going to be real, I have no idea what people expected to prove by asking AI to write Omegaverse fic. At the very least, people know A/B/O fics are not exclusive to AO3, right? The genre isn’t even exclusive to fandom -- it started in fandom, sure, but it expanded to general erotica years ago. It’s all over social media. It has multiple Wikipedia pages.
More to the point though, omegaverse would only be “polluting” AI if LLMs were spewing omegaverse concepts unprompted or like…associated knots with dicks more than rope or something. But people asking AI to write omegaverse and AI then writing omegaverse for them is just AI giving people exactly what they asked for. And…I hate to point this out, but LLMs writing for a niche the LLM trainers didn’t deliberately train the LLMs on is generally considered to be a good thing to the people who develop LLMs. The capability to fill niches developers didn’t even know existed increases LLMs’ marketability. If I were a betting man, what fandom probably saw as a GOTCHA moment, AI people probably saw as a good sign of LLMs’ future potential.
3. Individuals cannot affect LLM training datasets.
So back to the fandom event, with the stated goal of sabotaging AI scrapers via omegaverse fic.
…It’s not going to do anything.
Let’s add some numbers to this to help put things into perspective:
LLaMA’s 65 billion parameter model was trained on 1.4 trillion tokens. Of that 1.4 trillion tokens, about 67% of the training data was from the Common Crawl (roughly ~3 terabytes of data).
3 terabytes is 3,000,000,000 kilobytes.
That’s 3 billion kilobytes.
According to a news article I saw, there has been ~450k words total published for this campaign (*this was while it was going on, that number has probably changed, but you’re about to see why that still doesn’t matter). So, roughly speaking, ~450k of text is ~1012 KB (I’m going off the document size of a plain text doc for a fic whose word count is ~440k).
So 1,012 out of 3,000,000,000.
Aka 0.000034%.
And that 0.000034% of 3 billion kilobytes is only 2/3s of the data for the first stage of training.
And not to beat a dead horse, but 0.000034% is still grossly overestimating the potential impact of posting A/B/O fic. Remember, only parts of AO3 would get scraped for Common Crawl datasets. Which are also huge! The October 2022 Common Crawl dataset is 380 tebibytes. The April 2021 dataset is 320 tebibytes. The 3 terabytes of Common Crawl data used to train LLaMA was randomly selected data that totaled to less than 1% of one full dataset. Not to mention, LLaMA’s training dataset is currently on the (much) larger size as compared to most LLM training datasets.
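If you want to sanity-check that percentage yourself, the arithmetic is just one division, using the post's own rough kilobyte estimates:

```python
campaign_kb = 1_012              # ~450k words saved as plain text, per the estimate above
common_crawl_kb = 3_000_000_000  # ~3 TB of Common Crawl data used to pre-train LLaMA

share = campaign_kb / common_crawl_kb
print(f"{share:.6%}")  # prints 0.000034%
```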
I also feel the need to point out again that AO3 is trying to prevent any Common Crawl scraping in the future, which would include protection for these new stories (several of which are also locked!).
Omegaverse just isn’t going to do anything to AI. Individual fics are going to do even less. Even if all of AO3 suddenly became omegaverse, it’s just not prominent enough to influence anything in regards to LLMs. You cannot affect training datasets in any meaningful way doing this. And while this might seem really disappointing, this is actually a good thing.
Remember that anything an individual can do to LLMs, the person you hate most can do the same. If it were possible for fandom to corrupt AI with omegaverse, fascists, bigots, and just straight up internet trolls could pollute it with hate speech and worse. AI already carries a lot of biases even while developers are actively trying to flatten that out, it’s good that organized groups can’t corrupt that deliberately.
beardedmrbean · 30 days ago
Text
LONDON (AP) — Music streaming service Deezer said Friday that it will start flagging albums with AI-generated songs, part of its fight against streaming fraudsters.
Deezer, based in Paris, is grappling with a surge in music on its platform created using artificial intelligence tools it says are being wielded to earn royalties fraudulently.
The app will display an on-screen label warning about “AI-generated content" and notify listeners that some tracks on an album were created with song generators.
Deezer is a small player in music streaming, which is dominated by Spotify, Amazon and Apple, but the company said AI-generated music is an “industry-wide issue.” It's committed to “safeguarding the rights of artists and songwriters at a time where copyright law is being put into question in favor of training AI models," CEO Alexis Lanternier said in a press release.
Deezer's move underscores the disruption caused by generative AI systems, which are trained on the contents of the internet including text, images and audio available online. AI companies are facing a slew of lawsuits challenging their practice of scraping the web for such training data without paying for it.
According to an AI song detection tool that Deezer rolled out this year, 18% of songs uploaded to its platform each day, or about 20,000 tracks, are now completely AI generated. Just three months earlier, that number was 10%, Lanternier said in a recent interview.
AI has many benefits but it also "creates a lot of questions" for the music industry, Lanternier told The Associated Press. Using AI to make music is fine as long as there's an artist behind it but the problem arises when anyone, or even a bot, can use it to make music, he said.
Music fraudsters “create tons of songs. They upload, they try to get on playlists or recommendations, and as a result they gather royalties,” he said.
Musicians can't upload music directly to Deezer or rival platforms like Spotify or Apple Music. Music labels or digital distribution platforms can do it for artists they have contracts with, while anyone else can use a “self service” distribution company.
Fully AI-generated music still accounts for only about 0.5% of total streams on Deezer. But the company said it's “evident" that fraud is “the primary purpose" for these songs because it suspects that as many as seven in 10 listens of an AI song are done by streaming "farms" or bots, instead of humans.
Any AI songs used for “stream manipulation” will be cut off from royalty payments, Deezer said.
AI has been a hot topic in the music industry, with debates swirling around its creative possibilities as well as concerns about its legality.
Two of the most popular AI song generators, Suno and Udio, are being sued by record companies for copyright infringement, and face allegations they exploited recorded works of artists from Chuck Berry to Mariah Carey.
Gema, a German royalty-collection group, is suing Suno in a similar case filed in Munich, accusing the service of generating songs that are “confusingly similar” to original versions by artists it represents, including “Forever Young” by Alphaville, “Daddy Cool” by Boney M and Lou Bega's “Mambo No. 5.”
Major record labels are reportedly negotiating with Suno and Udio for compensation, according to news reports earlier this month.
To detect songs for tagging, Lanternier says Deezer uses the same generators used to create songs to analyze their output.
“We identify patterns because the song creates such a complex signal. There is lots of information in the song,” Lanternier said.
The AI music generators seem to be unable to produce songs without subtle but recognizable patterns, which change constantly.
“So you have to update your tool every day," Lanternier said. "So we keep generating songs to learn, to teach our algorithm. So we’re fighting AI with AI.”
Fraudsters can earn big money through streaming. Lanternier pointed to a criminal case last year in the U.S., which authorities said was the first ever involving artificially inflated music streaming. Prosecutors charged a man with wire fraud conspiracy, accusing him of generating hundreds of thousands of AI songs and using bots to automatically stream them billions of times, earning at least $10 million.
mariacallous · 11 months ago
Text
Less than three months after Apple quietly debuted a tool for publishers to opt out of its AI training, a number of prominent news outlets and social platforms have taken the company up on it.
WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they’ve become a conflict zone over intellectual property and the future of the web.
This new tool, Applebot-Extended, is an extension to Apple’s web-crawling bot that specifically lets website owners tell Apple not to use their data for AI training. (Apple calls this “controlling data usage” in a blog post explaining how it works.) The original Applebot, announced in 2015, initially crawled the internet to power Apple’s search products like Siri and Spotlight. Recently, though, Applebot’s purpose has expanded: The data it collects can also be used to train the foundational models Apple created for its AI efforts.
Applebot-Extended is a way to respect publishers' rights, says Apple spokesperson Nadine Haija. It doesn’t actually stop the original Applebot from crawling the website—which would then impact how that website’s content appeared in Apple search products—but instead prevents that data from being used to train Apple's large language models and other generative AI projects. It is, in essence, a bot to customize how another bot works.
Publishers can block Applebot-Extended by updating a text file on their websites known as the Robots Exclusion Protocol, or robots.txt. This file has governed how bots go about scraping the web for decades—and like the bots themselves, it is now at the center of a larger fight over how AI gets trained. Many publishers have already updated their robots.txt files to block AI bots from OpenAI, Anthropic, and other major AI players.
Robots.txt allows website owners to block or permit bots on a case-by-case basis. While there’s no legal obligation for bots to adhere to what the text file says, compliance is a long-standing norm. (A norm that is sometimes ignored: Earlier this year, a WIRED investigation revealed that the AI startup Perplexity was ignoring robots.txt and surreptitiously scraping websites.)
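For reference, blocking a single crawler this way takes only a couple of lines in robots.txt. The snippet below is a sketch based on the standard Robots Exclusion Protocol syntax and the bot name described above; publishers should confirm the exact user-agent token against Apple's own documentation:

```
# Sketch: disallow Apple's AI-training crawler site-wide.
User-agent: Applebot-Extended
Disallow: /
```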
Applebot-Extended is so new that relatively few websites block it yet. Ontario, Canada–based AI-detection startup Originality AI analyzed a sampling of 1,000 high-traffic websites last week and found that approximately 7 percent—predominantly news and media outlets—were blocking Applebot-Extended. This week, the AI agent watchdog service Dark Visitors ran its own analysis of another sampling of 1,000 high-traffic websites, finding that approximately 6 percent had the bot blocked. Taken together, these efforts suggest that the vast majority of website owners either don’t object to Apple’s AI training practices or are simply unaware of the option to block Applebot-Extended.
In a separate analysis conducted this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed (294 of 1,167 primarily English-language, US-based publications) are blocking Applebot-Extended. In comparison, Welsh found that 53 percent of the news websites in his sample block OpenAI’s bot. Google introduced its own AI-specific bot, Google-Extended, last September; it’s blocked by nearly 43 percent of those sites, a sign that Applebot-Extended may still be under the radar. As Welsh tells WIRED, though, the number has been “gradually moving” upward since he started looking.
Welsh has an ongoing project monitoring how news outlets approach major AI agents. “A bit of a divide has emerged among news publishers about whether or not they want to block these bots,” he says. “I don't have the answer to why every news organization made its decision. Obviously, we can read about many of them making licensing deals, where they're being paid in exchange for letting the bots in—maybe that's a factor.”
Last year, The New York Times reported that Apple was attempting to strike AI deals with publishers. Since then, competitors like OpenAI and Perplexity have announced partnerships with a variety of news outlets, social platforms, and other popular websites. “A lot of the largest publishers in the world are clearly taking a strategic approach,” says Originality AI founder Jon Gillham. “I think in some cases, there's a business strategy involved—like, withholding the data until a partnership agreement is in place.”
There is some evidence supporting Gillham’s theory. For example, Condé Nast websites used to block OpenAI’s web crawlers. After the company announced a partnership with OpenAI last week, it unblocked the company’s bots. (Condé Nast declined to comment on the record for this story.) Meanwhile, Buzzfeed spokesperson Juliana Clifton told WIRED that the company, which currently blocks Applebot-Extended, puts every AI web-crawling bot it can identify on its block list unless its owner has entered into a partnership—typically paid—with the company, which also owns the Huffington Post.
Because robots.txt needs to be edited manually, and there are so many new AI agents debuting, it can be difficult to keep an up-to-date block list. “People just don’t know what to block,” says Dark Visitors founder Gavin King. Dark Visitors offers a freemium service that automatically updates a client site’s robots.txt, and King says publishers make up a big portion of his clients because of copyright concerns.
Robots.txt might seem like the arcane territory of webmasters—but given its outsize importance to digital publishers in the AI age, it is now the domain of media executives. WIRED has learned that two CEOs from major media companies directly decide which bots to block.
Some outlets have explicitly noted that they block AI scraping tools because they do not currently have partnerships with their owners. “We’re blocking Applebot-Extended across all of Vox Media’s properties, as we have done with many other AI scraping tools when we don’t have a commercial agreement with the other party,” says Lauren Starke, Vox Media’s senior vice president of communications. “We believe in protecting the value of our published work.”
Others will only describe their reasoning in vague—but blunt!—terms. “The team determined, at this point in time, there was no value in allowing Applebot-Extended access to our content,” says Gannett chief communications officer Lark-Marie Antón.
Meanwhile, The New York Times, which is suing OpenAI over copyright infringement, is critical of the opt-out nature of Applebot-Extended and its ilk. “As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission,” says NYT director of external communications Charlie Stadtlander, noting that the Times will keep adding unauthorized bots to its block list as it finds them. “Importantly, copyright law still applies whether or not technical blocking measures are in place. Theft of copyrighted material is not something content owners need to opt out of.”
It’s unclear whether Apple is any closer to closing deals with publishers. If or when it does, though, the consequences of any data licensing or sharing arrangements may be visible in robots.txt files even before they are publicly announced.
“I find it fascinating that one of the most consequential technologies of our era is being developed, and the battle for its training data is playing out on this really obscure text file, in public for us all to see,” says Gillham.
hkkingofshades · 1 year ago
Text
Tumblr's new policy, and updates going forward
Yeah, I bet we all saw this coming, huh.
So, given tungl dot hell(tm)'s new deal with midjourney, I think pretty much all artists on tumblr are, well, not having a great time. Like deviantart, tumblr has provided a way to opt out from having your blog content scraped, but like deviantart, it's a little unclear what has already been shared before the opt-out went into place, and how much they'll actually work to stop machine trawlers from trawling opted-out blogs.
I'll put the tl;dr up front:
King of Shades will not be leaving Tumblr, but due to the new policy, I won't be posting full pages here anymore.
There's no point in taking down all the pages I've already posted. Deleting them from my page won't delete subsequent reblogs, and there's a pretty high chance that tumblr has already scraped them. (haveibeentrained.com seems to think I haven't been yet, at least. I don't think I really have a big enough following for that to happen, although I don't want to jinx it...) But I certainly won't be posting the full-size pages here anymore.
Instead, I think I'll go the Trying Human route and post a little preview of the update (possibly heavily watermarked; my computer can't run glaze/nightshade, unfortunately), so you guys will still get notifications, but you'll have to visit the ComicFury main website in order to read it. I'm very sorry for the inconvenience (although I will say that I think it's a much better reading experience over there)!
Speaking of which:
I have never and will never ask for any kind of compensation (other than your wonderful feedback, which I've just been absolutely blown away by) for doing this. Even putting legality aside, that's not why I'm here! However, if you've enjoyed this comic, ever thought that you might be willing to tip me on ko-fi if I had one, or even just want to continue having an internet that isn't entirely a corporate wasteland, I ask that you consider donating to ComicFury instead.
ComicFury is a relic of the old, good internet: it's been around for at least 15 years, and it's all hosted and managed by one guy (Kyo). Aside from his team of volunteer moderators, everything on this website is done by one person with a passion for supporting artists. I've chatted with him a little, and he's a great dude! Most of his operating costs are paid for out of pocket, and the site is currently hurting a little bit because it doesn't run ads, it doesn't have subscriptions or paywalled content, it doesn't have any corporate interference or monetization of any kind outside of his Patreon. And—perhaps most relevantly for this post:
I will cut right to the chase, we have decided not to allow AI-art based webcomics on the site. [...] As for our reasoning, there are obviously ethical concerns regarding the source images of most commonly used AI image generators (namely them just being scraped off the internet without anyone's permission). But even beyond that, another concern is that due to the extremely low effort involved, webcomics of this nature could just over time completely drown out in numbers art by passionate people who put a lot of time into it, which would be a real shame. So we asked ourselves what would be better for the community, and we agreed that banning it would probably be the better thing overall.
—Kyo has been quite firm that he will not allow AI art to be posted to or scraped from any ComicFury domain. While this isn't a protection against huge web trawls or people putting someone's art in individually—there's not a lot anyone can do about that yet, even with tools like glaze and nightshade—it's a little peace of mind that the art posted there won't abruptly be sold en masse to the highest bidder.
The Patreon starts at $2/month, and Kyo has said that he doesn't mind people pledging for a short time and then dipping if they can't afford an ongoing subscription. If and only if this is something you can afford, and you want to continue seeing independent webcomics including King of Shades, please consider donating!
The Patreon is here. There's not much in the way of reward tiers, especially if you're not a member, but I posit that the real reward is being able to read free webcomics done by real humans as labors of love, without being advertised to or sold as the product. And also maybe the friends we made along the way. Or something.
Once again, there is no pressure, and no shame if you're not willing or able to give money. But if you've ever thought you might be willing to tip me for what I do, consider passing it along to the guy who makes it possible instead.
Thank you for your time!
P.S. Page 64 is coming, I promise! Recent developments kind of kneecapped my motivation for making online art 🙃
hardwiredweird · 2 years ago
Note
Genuine question: how does AI destroy the environment?
I'm totally with you on not using AI and that it harms artists but I've not heard the environment angle before.
I think a lot of people aren't aware of that because they don't think about how AI works.
While there are builds that can run on a local machine and even on phones, the way most people use generative machine learning systems like SD and ChatGPT is through a client (often a web client) where the actual processing takes place on remote servers. Those servers need power and cooling.
The processing power needed to generate millions and billions of images and text replies a day is immense, which means the power draw is horrendous. And all those servers need to be cooled. Often with water.
I think you see where there might be a severe environmental impact just from these factors alone. And for what?
Is "haha, funny image of my blorbo" really worth all that?
And then we're not even touching on the whole metric shitton of other ethical issues like the racism, sexism, ableism and misogyny presented in the output or the medical data that's been scraped and put into the datasets or the fact that people are producing p*rn of real people and children or the potential legal ramifications of not being able to use photo evidence in legal proceedings anymore or....
You know.
AI just all-round bad.
But there's still people justifying the use of it because "I'm just doing it for fun."
simpatel · 3 months ago
Text
Enhance Decision-Making with OpenTable Reviews Data Scraping
How to Enhance Decision-Making With OpenTable Reviews Data Scraping Service?
Introduction
In the restaurant industry, customer feedback is a valuable resource for making informed decisions. Platforms like OpenTable provide extensive reviews from diners, offering insights into customer preferences, satisfaction levels, and areas for improvement. However, manually analyzing this data can be time-consuming and inefficient. This is where an OpenTable Restaurant Reviews Data Scraping Service becomes indispensable. By leveraging automated data collection tools, businesses can gain actionable insights to enhance decision-making, improve customer experience, and stay ahead in the competitive restaurant industry.
Understanding the Importance of OpenTable Reviews
OpenTable is one of the leading platforms for restaurant reservations, offering a rich repository of customer feedback through reviews. These reviews provide a glimpse into customer satisfaction, food quality, ambiance, and service. Utilizing an OpenTable Reviews Data Scraping Service allows businesses to:
Identify Trends: Discover patterns in customer preferences, popular dishes, or common complaints.
Monitor Competitors: Gain insights into what competitors are doing well and where they’re falling short.
Enhance Customer Experience: Use feedback to tailor services, menus, and ambiance to customer needs.
Drive Data-Driven Decisions: Base decisions on reliable data rather than assumptions or limited samples.
How to Scrape OpenTable Reviews Data Effectively
To extract valuable insights, businesses need a robust strategy to Scrape OpenTable Reviews Data. Here are the key steps:
1. Define Your Objectives
Before starting, identify your goals. Are you looking to analyze overall customer satisfaction, compare your restaurant with competitors, or track specific KPIs like service speed or menu variety? Defining objectives will streamline the scraping process.
2. Choose the Right Tools
Several OpenTable Reviews Data Web Scraping Tools are available to simplify the extraction process. Look for tools that:
Handle large datasets efficiently.
Provide APIs for seamless integration.
Offer customization options to target specific data points like ratings, comments, or timestamps.
3. Implement APIs for Seamless Access
Using an OpenTable Website Reviews Data Scraping API can make the process more efficient. APIs allow businesses to extract data programmatically, ensuring accuracy and saving time.
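For illustration, programmatic access might look like the following Python sketch. The endpoint, key, parameters, and response fields are placeholders for whatever reviews-scraping API you use, not a documented OpenTable or Datazivot interface:

```python
import requests

# Placeholder endpoint and credentials for a managed reviews-scraping API.
API_URL = "https://api.example-scraper.com/v1/opentable/reviews"
API_KEY = "your-api-key"

params = {
    "restaurant_id": "12345",  # placeholder restaurant identifier
    "page": 1,
    "per_page": 100,
}
response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params=params,
    timeout=30,
)
response.raise_for_status()

# Iterate over whatever review objects the API returns.
for review in response.json().get("reviews", []):
    print(review.get("rating"), str(review.get("comment", ""))[:80])
```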
4. Ensure Compliance
When engaging in OpenTable Restaurant reviews data scraping, it’s crucial to adhere to ethical and legal guidelines. Always review the platform’s terms of service to avoid potential violations.
5. Clean and Organize Data
Raw data often requires cleaning to remove duplicates, incomplete entries, or irrelevant information. Organizing the data into structured formats like CSV or JSON ensures easy analysis.
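A minimal cleaning pass might look like the pandas sketch below; the input file and column names (reviewer, rating, comment, date) are assumptions about how the raw export is structured:

```python
import pandas as pd

# Load the raw scrape; file name and columns are assumed for illustration.
df = pd.read_json("opentable_reviews_raw.json")

# Remove duplicate and incomplete entries.
df = df.drop_duplicates(subset=["reviewer", "date", "comment"])
df = df.dropna(subset=["rating", "comment"])

# Normalize types so downstream analysis is consistent.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.dropna(subset=["rating", "date"])

# Persist in structured formats for analysis or BI tools.
df.to_csv("opentable_reviews_clean.csv", index=False)
df.to_json("opentable_reviews_clean.json", orient="records", date_format="iso")
```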
Applications of OpenTable Reviews Data Scraping Service
1. Customer Sentiment Analysis
Analyzing customer sentiments from reviews helps businesses understand how diners perceive their restaurants. Tools to Extract OpenTable Reservation Reviews Data provide insights into recurring themes like food quality, ambiance, or service efficiency.
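As a rough sketch of how that analysis can start, the snippet below scores each review with NLTK's VADER sentiment model, assuming the cleaned CSV and column names from the previous section:

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

df = pd.read_csv("opentable_reviews_clean.csv")  # assumed output of the cleaning step
analyzer = SentimentIntensityAnalyzer()

# VADER's compound score runs from -1 (very negative) to +1 (very positive).
df["sentiment"] = df["comment"].astype(str).apply(
    lambda text: analyzer.polarity_scores(text)["compound"]
)

print("Average sentiment:", round(df["sentiment"].mean(), 3))
print("Most negative reviews:")
print(df.nsmallest(5, "sentiment")[["rating", "comment"]])
```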
2. Competitive Benchmarking
By performing Web Scraping OpenTable Reviews Data for competitors, businesses can identify areas where they excel or lag. This benchmarking helps in setting realistic goals and refining strategies.
3. Menu Optimization
Using OpenTable Restaurant Menu Reviews Data Extraction, restaurants can identify which dishes resonate most with customers. Similarly, feedback on less popular items can guide menu adjustments.
4. Marketing Strategy Development
Insights from OpenTable App Reviews Data Collection can inform marketing campaigns. For instance, positive reviews highlighting unique dishes or exceptional service can be used as testimonials in advertisements.
5. Operational Improvements
Feedback on slow service, crowded seating, or unclean environments can be addressed promptly. The data extracted via Restaurant Reviews Data Scraping Service ensures that no critical issue goes unnoticed.
Benefits of Using OpenTable Reviews Data Scraping Service
1. Automation
Automated tools reduce the time and effort required to collect and analyze data. Businesses can focus on strategic actions rather than manual data gathering.
2. Scalability
An OpenTable Reviews Data Scraping Service can handle extensive datasets, enabling businesses to analyze reviews from multiple locations or competitors simultaneously.
3. Accuracy
Advanced scraping tools ensure high accuracy, extracting only relevant and error-free data. This reliability is crucial for making informed decisions.
4. Real-Time Insights
With tools like an OpenTable Website Reviews Data Scraping API, businesses can access real-time data, staying updated on customer feedback and market trends.
5. Cost-Effectiveness
Investing in a professional Restaurant Reviews Data Scraping Service is more economical than hiring a dedicated team for manual data collection and analysis.
Overcoming Challenges in OpenTable Reviews Data Scraping Service
While scraping OpenTable reviews offers significant benefits, it’s not without challenges. Here are common issues and how to address them:
1. CAPTCHA and Bot Detection
Many websites, including OpenTable, implement CAPTCHA and other bot detection mechanisms. Using advanced tools with CAPTCHA-solving capabilities ensures uninterrupted data extraction.
2. Dynamic Content
Dynamic websites often load reviews through JavaScript, making scraping more complex. Employing tools designed for JavaScript-heavy sites can overcome this challenge.
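A hedged Selenium sketch for such pages is below; the URL and CSS selector are placeholders, since real review markup varies and any scraping must respect the site's terms:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # run without opening a browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/restaurant/reviews")  # placeholder URL
    # Wait until JavaScript has rendered the review elements.
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.review"))
    )
    for element in driver.find_elements(By.CSS_SELECTOR, "div.review"):
        print(element.text[:100])  # first 100 characters of each review block
finally:
    driver.quit()
```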
3. Data Volume
Handling large datasets can be resource-intensive. Opting for scalable solutions ensures efficiency in OpenTable Reviews Data Web Scraping Tools.
4. Legal Compliance
To avoid legal issues, ensure that your scraping activities comply with OpenTable’s terms of service and relevant data protection laws.
Future Trends in OpenTable Reviews Data Scraping Service
1. AI-Powered Analysis
Integrating AI with OpenTable Reviews Data Web Scraping Tools enables deeper insights through natural language processing and sentiment analysis.
2. Predictive Analytics
Using scraped data to predict customer behavior, seasonal trends, or emerging preferences will become a key focus.
3. Integration with CRM Systems
Seamless integration of scraped data with customer relationship management (CRM) systems will help businesses personalize customer experiences.
Conclusion
An OpenTable Reviews Data Scraping Service is an invaluable tool for restaurants aiming to make data-driven decisions. By leveraging insights from Scrape OpenTable Reviews Data, businesses can enhance customer experiences, refine their operations, and gain a competitive edge. With the right tools and strategies, the possibilities are endless.
For businesses seeking reliable solutions, Datazivot offers comprehensive services tailored to your needs. Contact us today to unlock the full potential of OpenTable Reviews Data Scraping Service and transform your decision-making process!
Source : https://www.datazivot.com/open-table-reviews-data-scraping-service.php
lasnoticiasdevesko-blog · 9 months ago
Link
🔍 Learn about techniques like Selenium, Puppeteer, APIs, and legal, ethical web scraping, all explained in a simple, practical way! 💻 #WebScraping #APIs #SEO #Automation #Technology #WebDevelopment
mostlysignssomeportents · 1 year ago
Text
Humans are not perfectly vigilant
I'm on tour with my new, nationally bestselling novel The Bezzle! Catch me in BOSTON with Randall "XKCD" Munroe (Apr 11), then PROVIDENCE (Apr 12), and beyond!
Here's a fun AI story: a security researcher noticed that large companies' AI-authored source-code repeatedly referenced a nonexistent library (an AI "hallucination"), so he created a (defanged) malicious library with that name and uploaded it, and thousands of developers automatically downloaded and incorporated it as they compiled the code:
https://www.theregister.com/2024/03/28/ai_bots_hallucinate_software_packages/
These "hallucinations" are a stubbornly persistent feature of large language models, because these models only give the illusion of understanding; in reality, they are just sophisticated forms of autocomplete, drawing on huge databases to make shrewd (but reliably fallible) guesses about which word comes next:
https://dl.acm.org/doi/10.1145/3442188.3445922
Guessing the next word without understanding the meaning of the resulting sentence makes unsupervised LLMs unsuitable for high-stakes tasks. The whole AI bubble is based on convincing investors that one or more of the following is true:
There are low-stakes, high-value tasks that will recoup the massive costs of AI training and operation;
There are high-stakes, high-value tasks that can be made cheaper by adding an AI to a human operator;
Adding more training data to an AI will make it stop hallucinating, so that it can take over high-stakes, high-value tasks without a "human in the loop."
These are dubious propositions. There's a universe of low-stakes, low-value tasks – political disinformation, spam, fraud, academic cheating, nonconsensual porn, dialog for video-game NPCs – but none of them seem likely to generate enough revenue for AI companies to justify the billions spent on models, nor the trillions in valuation attributed to AI companies:
https://locusmag.com/2023/12/commentary-cory-doctorow-what-kind-of-bubble-is-ai/
The proposition that increasing training data will decrease hallucinations is hotly contested among AI practitioners. I confess that I don't know enough about AI to evaluate opposing sides' claims, but even if you stipulate that adding lots of human-generated training data will make the software a better guesser, there's a serious problem. All those low-value, low-stakes applications are flooding the internet with botshit. After all, the one thing AI is unarguably very good at is producing bullshit at scale. As the web becomes an anaerobic lagoon for botshit, the quantum of human-generated "content" in any internet core sample is dwindling to homeopathic levels:
https://pluralistic.net/2024/03/14/inhuman-centipede/#enshittibottification
This means that adding another order of magnitude more training data to AI won't just add massive computational expense – the data will be many orders of magnitude more expensive to acquire, even without factoring in the additional liability arising from new legal theories about scraping:
https://pluralistic.net/2023/09/17/how-to-think-about-scraping/
That leaves us with "humans in the loop" – the idea that an AI's business model is selling software to businesses that will pair it with human operators who will closely scrutinize the code's guesses. There's a version of this that sounds plausible – the one in which the human operator is in charge, and the AI acts as an eternally vigilant "sanity check" on the human's activities.
For example, my car has a system that notices when I activate my blinker while there's another car in my blind-spot. I'm pretty consistent about checking my blind spot, but I'm also a fallible human and there've been a couple times where the alert saved me from making a potentially dangerous maneuver. As disciplined as I am, I'm also sometimes forgetful about turning off lights, or waking up in time for work, or remembering someone's phone number (or birthday). I like having an automated system that does the robotically perfect trick of never forgetting something important.
There's a name for this in automation circles: a "centaur." I'm the human head, and I've fused with a powerful robot body that supports me, doing things that humans are innately bad at.
That's the good kind of automation, and we all benefit from it. But it only takes a small twist to turn this good automation into a nightmare. I'm speaking here of the reverse-centaur: automation in which the computer is in charge, bossing a human around so it can get its job done. Think of Amazon warehouse workers, who wear haptic bracelets and are continuously observed by AI cameras as autonomous shelves shuttle in front of them and demand that they pick and pack items at a pace that destroys their bodies and drives them mad:
https://pluralistic.net/2022/04/17/revenge-of-the-chickenized-reverse-centaurs/
Automation centaurs are great: they relieve humans of drudgework and let them focus on the creative and satisfying parts of their jobs. That's how AI-assisted coding is pitched: rather than looking up tricky syntax and other tedious programming tasks, an AI "co-pilot" is billed as freeing up its human "pilot" to focus on the creative puzzle-solving that makes coding so satisfying.
But an hallucinating AI is a terrible co-pilot. It's just good enough to get the job done much of the time, but it also sneakily inserts booby-traps that are statistically guaranteed to look as plausible as the good code (that's what a next-word-guessing program does: guesses the statistically most likely word).
This turns AI-"assisted" coders into reverse centaurs. The AI can churn out code at superhuman speed, and you, the human in the loop, must maintain perfect vigilance and attention as you review that code, spotting the cleverly disguised hooks for malicious code that the AI can't be prevented from inserting into its code. As "Lena" writes, "code review [is] difficult relative to writing new code":
https://twitter.com/qntm/status/1773779967521780169
Why is that? "Passively reading someone else's code just doesn't engage my brain in the same way. It's harder to do properly":
https://twitter.com/qntm/status/1773780355708764665
There's a name for this phenomenon: "automation blindness." Humans are just not equipped for eternal vigilance. We get good at spotting patterns that occur frequently – so good that we miss the anomalies. That's why TSA agents are so good at spotting harmless shampoo bottles on X-rays, even as they miss nearly every gun and bomb that a red team smuggles through their checkpoints:
https://pluralistic.net/2023/08/23/automation-blindness/#humans-in-the-loop
"Lena"'s thread points out that this is as true for AI-assisted driving as it is for AI-assisted coding: "self-driving cars replace the experience of driving with the experience of being a driving instructor":
https://twitter.com/qntm/status/1773841546753831283
In other words, they turn you into a reverse-centaur. Whereas my blind-spot double-checking robot allows me to make maneuvers at human speed and points out the things I've missed, a "supervised" self-driving car makes maneuvers at a computer's frantic pace, and demands that its human supervisor tirelessly and perfectly assesses each of those maneuvers. No wonder Cruise's murderous "self-driving" taxis replaced each low-waged driver with 1.5 high-waged technical robot supervisors:
https://pluralistic.net/2024/01/11/robots-stole-my-jerb/#computer-says-no
AI radiology programs are said to be able to spot cancerous masses that human radiologists miss. A centaur-based AI-assisted radiology program would keep the same number of radiologists in the field, but they would get less done: every time they assessed an X-ray, the AI would give them a second opinion. If the human and the AI disagreed, the human would go back and re-assess the X-ray. We'd get better radiology, at a higher price (the price of the AI software, plus the additional hours the radiologist would work).
But back to making the AI bubble pay off: for AI to pay off, the human in the loop has to reduce the costs of the business buying an AI. No one who invests in an AI company believes that their returns will come from business customers who agree to increase their costs. The AI can't do your job, but the AI salesman can convince your boss to fire you and replace you with an AI anyway – that pitch is the most successful form of AI disinformation in the world.
An AI that "hallucinates" bad advice to fliers can't replace human customer service reps, but airlines are firing reps and replacing them with chatbots:
https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know
An AI that "hallucinates" bad legal advice to New Yorkers can't replace city services, but Mayor Adams still tells New Yorkers to get their legal advice from his chatbots:
https://arstechnica.com/ai/2024/03/nycs-government-chatbot-is-lying-about-city-laws-and-regulations/
The only reason bosses want to buy robots is to fire humans and lower their costs. That's why "AI art" is such a pisser. There are plenty of harmless ways to automate art production with software – everything from a "healing brush" in Photoshop to deepfake tools that let a video-editor alter the eye-lines of all the extras in a scene to shift the focus. A graphic novelist who models a room in The Sims and then moves the camera around to get traceable geometry for different angles is a centaur – they are genuinely offloading some finicky drudgework onto a robot that is perfectly attentive and vigilant.
But the pitch from "AI art" companies is "fire your graphic artists and replace them with botshit." They're pitching a world where the robots get to do all the creative stuff (badly) and humans have to work at robotic pace, with robotic vigilance, in order to catch the mistakes that the robots make at superhuman speed.
Reverse centaurism is brutal. That's not news: Charlie Chaplin documented the problems of reverse centaurs nearly 100 years ago:
https://en.wikipedia.org/wiki/Modern_Times_(film)
As ever, the problem with a gadget isn't what it does: it's who it does it for and who it does it to. There are plenty of benefits from being a centaur – lots of ways that automation can help workers. But the only path to AI profitability lies in reverse centaurs, automation that turns the human in the loop into the crumple-zone for a robot:
https://estsjournal.org/index.php/ests/article/view/260
If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
https://pluralistic.net/2024/04/01/human-in-the-loop/#monkey-in-the-middle
Image: Cryteria (modified) https://commons.wikimedia.org/wiki/File:HAL9000.svg
CC BY 3.0 https://creativecommons.org/licenses/by/3.0/deed.en
--
Jorge Royan (modified) https://commons.wikimedia.org/wiki/File:Munich_-_Two_boys_playing_in_a_park_-_7328.jpg
CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/deed.en
--
Noah Wulf (modified) https://commons.m.wikimedia.org/wiki/File:Thunderbirds_at_Attention_Next_to_Thunderbird_1_-_Aviation_Nation_2019.jpg
CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/deed.en
thechanelmuse · 1 year ago
Text
My Book Review
"If you're not paying for it, you're the product."
Your Face Belongs to Us is a terrifying yet interesting journey through the world of invasive surveillance, artificial intelligence, facial recognition, and biometric data collection by way of the birth and rise of a company called Clearview AI — a software used by law enforcement and government agencies in the US yet banned in various countries. A database of 75 million images per day.
The writing is easy flowing investigative journalism, but the information (as expected) is...chile 👀. Lawsuits and court cases to boot. This book reads somewhat like one of my favorite books of all-time, How Music Got Free by Stephen Witt (my review's here), in which it delves into the history from birth to present while learning the key players along the way.
Here's an excerpt that keeps you seated for this wild ride:
“I was in a hotel room in Switzerland, six months pregnant, when I got the email. It was the end of a long day and I was tired but the email gave me a jolt. My source had unearthed a legal memo marked “Privileged & Confidential” in which a lawyer for Clearview had said that the company had scraped billions of photos from the public web, including social media sites such as Facebook, Instagram, and LinkedIn, to create a revolutionary app. Give Clearview a photo of a random person on the street, and it would spit back all the places on the internet where it had spotted their face, potentially revealing not just their name but other personal details about their life. The company was selling this superpower to police departments around the country but trying to keep its existence a secret.”
librarianrafia · 1 year ago
Text
"These "hallucinations" are a stubbornly persistent feature of large language models, because these models only give the illusion of understanding; in reality, they are just sophisticated forms of autocomplete, drawing on huge databases to make shrewd (but reliably fallible) guesses about which word comes next:
Guessing the next word without understanding the meaning of the resulting sentence makes unsupervised LLMs unsuitable for high-stakes tasks. The whole AI bubble is based on convincing investors that one or more of the following is true:
I. There are low-stakes, high-value tasks that will recoup the massive costs of AI training and operation;
II. There are high-stakes, high-value tasks that can be made cheaper by adding an AI to a human operator;
III. Adding more training data to an AI will make it stop hallucinating, so that it can take over high-stakes, high-value tasks without a "human in the loop."
These are dubious propositions. There's a universe of low-stakes, low-value tasks – political disinformation, spam, fraud, academic cheating, nonconsensual porn, dialog for video-game NPCs – but none of them seem likely to generate enough revenue for AI companies to justify the billions spent on models, nor the trillions in valuation attributed to AI companies:
https://locusmag.com/2023/12/commentary-cory-doctorow-what-kind-of-bubble-is-ai/
The proposition that increasing training data will decrease hallucinations is hotly contested among AI practitioners. I confess that I don't know enough about AI to evaluate opposing sides' claims, but even if you stipulate that adding lots of human-generated training data will make the software a better guesser, there's a serious problem. All those low-value, low-stakes applications are flooding the internet with botshit. After all, the one thing AI is unarguably very good at is producing bullshit at scale. As the web becomes an anaerobic lagoon for botshit, the quantum of human-generated "content" in any internet core sample is dwindling to homeopathic levels:
This means that adding another order of magnitude more training data to AI won't just add massive computational expense – the data will be many orders of magnitude more expensive to acquire, even without factoring in the additional liability arising from new legal theories about scraping: