#Facebook Scraping Tool | Explore Tumblr posts and blogs

mostlysignssomeportents · 1 year ago

AI “art” and uncanniness

TOMORROW (May 14), I'm on a livecast about AI AND ENSHITTIFICATION with TIM O'REILLY; on WEDNESDAY (May 15), I'm in NORTH HOLLYWOOD for a screening of STEPHANIE KELTON'S FINDING THE MONEY; FRIDAY (May 17), I'm at the INTERNET ARCHIVE in SAN FRANCISCO to keynote the 10th anniversary of the AUTHORS ALLIANCE.

When it comes to AI art (or "art"), it's hard to find a nuanced position that respects creative workers' labor rights, free expression, copyright law's vital exceptions and limitations, and aesthetics.

I am, on balance, opposed to AI art, but there are some important caveats to that position. For starters, I think it's unequivocally wrong – as a matter of law – to say that scraping works and training a model with them infringes copyright. This isn't a moral position (I'll get to that in a second), but rather a technical one.

Break down the steps of training a model and it quickly becomes apparent why it's technically wrong to call this a copyright infringement. First, the act of making transient copies of works – even billions of works – is unequivocally fair use. Unless you think search engines and the Internet Archive shouldn't exist, then you should support scraping at scale:

https://pluralistic.net/2023/09/17/how-to-think-about-scraping/

And unless you think that Facebook should be allowed to use the law to block projects like Ad Observer, which gathers samples of paid political disinformation, then you should support scraping at scale, even when the site being scraped objects (at least sometimes):

https://pluralistic.net/2021/08/06/get-you-coming-and-going/#potemkin-research-program

After making transient copies of lots of works, the next step in AI training is to subject them to mathematical analysis. Again, this isn't a copyright violation.

Making quantitative observations about works is a longstanding, respected and important tool for criticism, analysis, archiving and new acts of creation. Measuring the steady contraction of the vocabulary in successive Agatha Christie novels turns out to offer a fascinating window into her dementia:

https://www.theguardian.com/books/2009/apr/03/agatha-christie-alzheimers-research
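To make "quantitative observations about works" concrete, here is a minimal sketch of one such measurement: counting distinct words in a fixed-size sample of text. The function name `vocabulary_stats`, the 1,000-word window and the two sample strings are all invented for illustration; this is not the method used in the Christie research, just the general kind of counting it relies on.

```python
import re


def vocabulary_stats(text, window=1000):
    """Return (total_words, distinct_words, ratio) for the first `window` words.

    A fixed window keeps the comparison fair: longer texts would otherwise
    show a lower ratio just because common words repeat more often.
    """
    words = re.findall(r"[a-z']+", text.lower())[:window]
    if not words:
        return 0, 0, 0.0
    distinct = len(set(words))
    return len(words), distinct, distinct / len(words)


# Hypothetical samples, standing in for an early and a late novel.
early = "the quick brown fox jumps over the lazy dog near the quiet river bank"
late = "the thing and the thing and the other thing went to the thing"

for label, sample in [("early", early), ("late", late)]:
    total, distinct, ratio = vocabulary_stats(sample)
    print(f"{label}: {distinct} distinct of {total} words (ratio {ratio:.2f})")
```

Nothing here copies or republishes the text being measured; the output is a handful of numbers about it.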

Programmatic analysis of scraped online speech is also critical to the burgeoning formal analyses of the language spoken by minorities, producing a vibrant account of the rigorous grammar of dialects that have long been dismissed as "slang":

https://www.researchgate.net/publication/373950278_Lexicogrammatical_Analysis_on_African-American_Vernacular_English_Spoken_by_African-Amecian_You-Tubers

Since 1988, UCL Survey of English Language has maintained its "International Corpus of English," and scholars have plumbed its depth to draw important conclusions about the wide variety of Englishes spoken around the world, especially in postcolonial English-speaking countries:

https://www.ucl.ac.uk/english-usage/projects/ice.htm

The final step in training a model is publishing the conclusions of the quantitative analysis of the temporarily copied documents as software code. Code itself is a form of expressive speech – and that expressivity is key to the fight for privacy, because the fact that code is speech limits how governments can censor software:

https://www.eff.org/deeplinks/2015/04/remembering-case-established-code-speech/

Are models infringing? Well, they certainly can be. In some cases, it's clear that models "memorized" some of the data in their training set, making the fair use, transient copy into an infringing, permanent one. That's generally considered to be the result of a programming error, and it could certainly be prevented (say, by comparing the model to the training data and removing any memorizations that appear).

Not every seeming act of memorization is a memorization, though. While specific models vary widely, the amount of data from each training item retained by the model is very small. For example, Midjourney retains about one byte of information from each image in its training data. If we're talking about a typical low-resolution web image of, say, 300KB, that would be one three-hundred-thousandth (about 0.00033%) of the original image.

Typically in copyright discussions, when one work contains 0.00033% of another work, we don't even raise the question of fair use. Rather, we dismiss the use as de minimis (short for de minimis non curat lex or "The law does not concern itself with trifles"):

https://en.wikipedia.org/wiki/De_minimis

Busting someone who takes 0.00033% of your work for copyright infringement is like swearing out a trespassing complaint against someone because the edge of their shoe touched one blade of grass on your lawn.
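The one-byte-per-image ratio is easy to check by hand. This is a small sketch of the arithmetic, expressed both as a fraction and as a percentage; it assumes the image is exactly 300,000 bytes and that exactly one byte is retained, both of which are round figures from the paragraph above rather than measurements.

```python
from fractions import Fraction

bytes_retained = 1          # assumption: roughly one byte retained per image
image_size_bytes = 300_000  # assumption: a 300KB image taken as 300,000 bytes

share = Fraction(bytes_retained, image_size_bytes)
share_as_decimal = float(share)
share_as_percent = share_as_decimal * 100  # a percentage is the fraction times 100

print(f"fraction: 1/{share.denominator}")
print(f"decimal:  {share_as_decimal:.7f}")
print(f"percent:  {share_as_percent:.5f}%")
```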

But some works or elements of work appear many times online. For example, the Getty Images watermark appears on millions of similar images of people standing on red carpets and runways, so a model that takes even an infinitesimal sample of each one of those works might still end up being able to produce a whole, recognizable Getty Images watermark.

The same is true for wire-service articles or other widely syndicated texts: there might be dozens or even hundreds of copies of these works in training data, resulting in the memorization of long passages from them.

This might be infringing (we're getting into some gnarly, unprecedented territory here), but again, even if it is, it wouldn't be a big hardship for model makers to post-process their models by comparing them to the training set, deleting any inadvertent memorizations. Even if the resulting model had zero memorizations, this would do nothing to alleviate the (legitimate) concerns of creative workers about the creation and use of these models.
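One rough way to picture the "compare against the training set" step for text is to look for long runs of words that a model's output shares verbatim with a training document. The sketch below does that with overlapping five-word sequences. The names `shingles` and `overlap_share` and the five-word threshold are assumptions made for illustration, and checking outputs is only a proxy for inspecting the model itself; real memorization audits are more involved.

```python
def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_share(output, training_texts, n=5):
    """Share of the output's n-word sequences that appear verbatim in training text.

    0.0 means no shared run of n words; 1.0 means every run was seen in training.
    Outputs shorter than n words have no sequences to compare and score 0.0.
    """
    out = shingles(output, n)
    if not out:
        return 0.0
    seen = set()
    for doc in training_texts:
        seen |= shingles(doc, n)
    return len(out & seen) / len(out)


# Hypothetical data: one syndicated sentence and two candidate outputs.
training = ["officials said on tuesday that the bridge would reopen next month"]
copied = "officials said on tuesday that the bridge would reopen next month"
fresh = "a new footpath along the river opened to walkers this weekend"

print(overlap_share(copied, training))
print(overlap_share(fresh, training))
```

A high score flags an output for a closer look; it does not by itself settle whether anything infringing happened.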

So here's the first nuance in the AI art debate: as a technical matter, training a model isn't a copyright infringement. Creative workers who hope that they can use copyright law to prevent AI from changing the creative labor market are likely to be very disappointed in court:

https://www.hollywoodreporter.com/business/business-news/sarah-silverman-lawsuit-ai-meta-1235669403/

But copyright law isn't a fixed, eternal entity. We write new copyright laws all the time. If current copyright law doesn't prevent the creation of models, what about a future copyright law?

Well, sure, that's a possibility. The first thing to consider is the possible collateral damage of such a law. The legal space for scraping enables a wide range of scholarly, archival, organizational and critical purposes. We'd have to be very careful not to inadvertently ban, say, the scraping of a politician's campaign website, lest we enable liars to run for office and renege on their promises, while they insist that they never made those promises in the first place. We wouldn't want to abolish search engines, or stop creators from scraping their own work off sites that are going away or changing their terms of service.

Now, onto quantitative analysis: counting words and measuring pixels are not activities that you should need permission to perform, with or without a computer, even if the person whose words or pixels you're counting doesn't want you to. You should be able to look as hard as you want at the pixels in Kate Middleton's family photos, or track the rise and fall of the Oxford comma, and you shouldn't need anyone's permission to do so.

Finally, there's publishing the model. There are plenty of published mathematical analyses of large corpuses that are useful and unobjectionable. I love me a good Google n-gram:

https://books.google.com/ngrams/graph?content=fantods%2C+heebie-jeebies&year_start=1800&year_end=2019&corpus=en-2019&smoothing=3
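For a sense of what sits behind a chart like that, here is a minimal sketch of the underlying count: how often a word appears, as a share of all words, in the text from each year. The function `relative_frequency` and the tiny two-year corpus are made up for illustration; Google's viewer works on a vastly larger corpus, handles multi-word phrases and applies smoothing, none of which this sketch attempts.

```python
from collections import Counter


def relative_frequency(term, texts_by_year):
    """Map each year to the share of that year's words equal to `term`."""
    result = {}
    for year, text in sorted(texts_by_year.items()):
        words = text.lower().split()
        counts = Counter(words)
        # Counter returns 0 for words it has never seen.
        result[year] = counts[term] / len(words) if words else 0.0
    return result


# Hypothetical corpus: one short "book" per year.
corpus = {
    1900: "the sight gave him the fantods",
    2000: "the noise gave her the heebie-jeebies",
}

for word in ("fantods", "heebie-jeebies"):
    print(word, relative_frequency(word, corpus))
```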

And large language models fill all kinds of important niches, like the Human Rights Data Analysis Group's LLM-based work helping the Innocence Project New Orleans extract data from wrongful conviction case files:

https://hrdag.org/tech-notes/large-language-models-IPNO.html

So that's nuance number two: if we decide to make a new copyright law, we'll need to be very sure that we don't accidentally crush these beneficial activities that don't undermine artistic labor markets.

This brings me to the most important point: passing a new copyright law that requires permission to train an AI won't help creative workers get paid or protect our jobs.

Getty Images pays photographers the least it can get away with. Publishers' contracts have transformed by inches into miles-long, ghastly rights grabs that take everything from writers, but still shift legal risks onto them:

https://pluralistic.net/2022/06/19/reasonable-agreement/

Publishers like the New York Times bitterly oppose their writers' unions:

https://actionnetwork.org/letters/new-york-times-stop-union-busting

These large corporations already control the copyrights to gigantic amounts of training data, and they have means, motive and opportunity to license these works for training a model in order to pay us less, and they are engaged in this activity right now:

https://www.nytimes.com/2023/12/22/technology/apple-ai-news-publishers.html

Big games studios are already acting as though there was a copyright in training data, and requiring their voice actors to begin every recording session with words to the effect of, "I hereby grant permission to train an AI with my voice" and if you don't like it, you can hit the bricks:

https://www.vice.com/en/article/5d37za/voice-actors-sign-away-rights-to-artificial-intelligence

If you're a creative worker hoping to pay your bills, it doesn't matter whether your wages are eroded by a model produced without paying your employer for the right to do so, or whether your employer got to double dip by selling your work to an AI company to train a model, and then used that model to fire you or erode your wages:

https://pluralistic.net/2023/02/09/ai-monkeys-paw/#bullied-schoolkids

Individual creative workers rarely have any bargaining leverage over the corporations that license our copyrights. That's why copyright's 40-year expansion (in duration, scope, statutory damages) has resulted in larger, more profitable entertainment companies, and lower payments – in real terms and as a share of the income generated by their work – for creative workers.

As Rebecca Giblin and I write in our book Chokepoint Capitalism, giving creative workers more rights to bargain with against giant corporations that control access to our audiences is like giving your bullied schoolkid extra lunch money – it's just a roundabout way of transferring that money to the bullies:

https://pluralistic.net/2022/08/21/what-is-chokepoint-capitalism/

There's an historical precedent for this struggle – the fight over music sampling. 40 years ago, it wasn't clear whether sampling required a copyright license, and early hip-hop artists took samples without permission, the way a horn player might drop a couple bars of a well-known song into a solo.

Many artists were rightfully furious over this. The "heritage acts" (the music industry's euphemism for "Black people") who were most sampled had been given very bad deals and had seen very little of the fortunes generated by their creative labor. Many of them were desperately poor, despite having made millions for their labels. When other musicians started making money off that work, they got mad.

In the decades that followed, the system for sampling changed, partly through court cases and partly through the commercial terms set by the Big Three labels: Sony, Warner and Universal, who control 70% of all music recordings. Today, you generally can't sample without signing up to one of the Big Three (they are reluctant to deal with indies), and that means taking their standard deal, which is very bad, and also signs away your right to control your samples.

So a musician who wants to sample has to sign the bad terms offered by a Big Three label, and then hand $500 out of their advance to one of those Big Three labels for the sample license. That $500 typically doesn't go to another artist – it goes to the label, who share it around their executives and investors. This is a system that makes every artist poorer.

But it gets worse. Putting a price on samples changes the kind of music that can be economically viable. If you wanted to clear all the samples on an album like Public Enemy's "It Takes a Nation of Millions To Hold Us Back," or the Beastie Boys' "Paul's Boutique," you'd have to sell every CD for $150, just to break even:

https://memex.craphound.com/2011/07/08/creative-license-how-the-hell-did-sampling-get-so-screwed-up-and-what-the-hell-do-we-do-about-it/

Sampling licenses don't just make every artist financially worse off, they also prevent the creation of music of the sort that millions of people enjoy. But it gets even worse. Some older, sample-heavy music can't be cleared. Most of De La Soul's catalog wasn't available for 15 years, and even though some of their seminal music came back in March 2023, the band's frontman Trugoy the Dove didn't live to see it – he died in February 2023:

https://www.vulture.com/2023/02/de-la-soul-trugoy-the-dove-dead-at-54.html

This is the third nuance: even if we can craft a model-banning copyright system that doesn't catch a lot of dolphins in its tuna net, it could still make artists worse off.

Back when sampling started, it wasn't clear whether it would ever be considered artistically important. Early sampling was crude and experimental. Musicians who trained for years to master an instrument were dismissive of the idea that clicking a mouse was "making music." Today, most of us don't question the idea that sampling can produce meaningful art – even musicians who believe in licensing samples.

Having lived through that era, I'm prepared to believe that maybe I'll look back on AI "art" and say, "damn, I can't believe I never thought that could be real art."

But I wouldn't give odds on it.

I don't like AI art. I find it anodyne, boring. As Henry Farrell writes, it's uncanny, and not in a good way:

https://www.programmablemutter.com/p/large-language-models-are-uncanny

Farrell likens the work produced by AIs to the movement of a Ouija board's planchette, something that "seems to have a life of its own, even though its motion is a collective side-effect of the motions of the people whose fingers lightly rest on top of it." This is "spooky-action-at-a-close-up," transforming "collective inputs … into apparently quite specific outputs that are not the intended creation of any conscious mind."

Look, art is irrational in the sense that it speaks to us at some non-rational, or sub-rational level. Caring about the tribulations of imaginary people or being fascinated by pictures of things that don't exist (or that aren't even recognizable) doesn't make any sense. There's a way in which all art is like an optical illusion for our cognition, an imaginary thing that captures us the way a real thing might.

But art is amazing. Making art and experiencing art makes us feel big, numinous, irreducible emotions. Making art keeps me sane. Experiencing art is a precondition for all the joy in my life. Having spent most of my life as a working artist, I've come to the conclusion that the reason for this is that art transmits an approximation of some big, numinous irreducible emotion from an artist's mind to our own. That's it: that's why art is amazing.

AI doesn't have a mind. It doesn't have an intention. The aesthetic choices made by AI aren't choices, they're averages. As Farrell writes, "LLM art sometimes seems to communicate a message, as art does, but it is unclear where that message comes from, or what it means. If it has any meaning at all, it is a meaning that does not stem from organizing intention" (emphasis mine).

Farrell cites Mark Fisher's The Weird and the Eerie, which defines "weird" in easy to understand terms ("that which does not belong") but really grapples with "eerie."

For Fisher, eeriness is "when there is something present where there should be nothing, or there is nothing present when there should be something." AI art produces the seeming of intention without intending anything. It appears to be an agent, but it has no agency. It's eerie.

Fisher talks about capitalism as eerie. Capital is "conjured out of nothing" but "exerts more influence than any allegedly substantial entity." The "invisible hand" shapes our lives more than any person. The invisible hand is fucking eerie. Capitalism is a system in which insubstantial non-things – corporations – appear to act with intention, often at odds with the intentions of the human beings carrying out those actions.

So will AI art ever be art? I don't know. There's a long tradition of using random or irrational or impersonal inputs as the starting point for human acts of artistic creativity. Think of divination:

https://pluralistic.net/2022/07/31/divination/

Or Brian Eno's Oblique Strategies:

http://stoney.sb.org/eno/oblique.html

I love making my little collages for this blog, though I wouldn't call them important art. Nevertheless, piecing together bits of other people's work can make fantastic, important work of historical note:

https://www.johnheartfield.com/John-Heartfield-Exhibition/john-heartfield-art/famous-anti-fascist-art/heartfield-posters-aiz

Even though painstakingly cutting out tiny elements from others' images can be a meditative and educational experience, I don't think that using tiny scissors or the lasso tool is what defines the "art" in collage. If you can automate some of this process, it could still be art.

Here's what I do know. Creating an individual bargainable copyright over training will not improve the material conditions of artists' lives – all it will do is change the relative shares of the value we create, shifting some of that value from tech companies that hate us and want us to starve to entertainment companies that hate us and want us to starve.

As an artist, I'm foursquare against anything that stands in the way of making art. As an artistic worker, I'm entirely committed to things that help workers get a fair share of the money their work creates, feed their families and pay their rent.

I think today's AI art is bad, and I think tomorrow's AI art will probably be bad, but even if you disagree (with either proposition), I hope you'll agree that we should be focused on making sure art is legal to make and that artists get paid for it.

Just because copyright won't fix the creative labor market, it doesn't follow that nothing will. If we're worried about labor issues, we can look to labor law to improve our conditions. That's what the Hollywood writers did, in their groundbreaking 2023 strike:

https://pluralistic.net/2023/10/01/how-the-writers-guild-sunk-ais-ship/

Now, the writers had an advantage: they are able to engage in "sectoral bargaining," where a union bargains with all the major employers at once. That's illegal in nearly every other kind of labor market. But if we're willing to entertain the possibility of getting a new copyright law passed (that won't make artists better off), why not the possibility of passing a new labor law (that will)? Sure, our bosses won't lobby alongside of us for more labor protection, the way they would for more copyright (think for a moment about what that says about who benefits from copyright versus labor law expansion).

But all workers benefit from expanded labor protection. Rather than going to Congress alongside our bosses from the studios and labels and publishers to demand more copyright, we could go to Congress alongside every kind of worker, from fast-food cashiers to publishing assistants to truck drivers to demand the right to sectoral bargaining. That's a hell of a coalition.

And if we do want to tinker with copyright to change the way training works, let's look at collective licensing, which can't be bargained away, rather than individual rights that can be confiscated at the entrance to our publisher, label or studio's offices. These collective licenses have been a huge success in protecting creative workers:

https://pluralistic.net/2023/02/26/united-we-stand/

Then there's copyright's wildest wild card: The US Copyright Office has repeatedly stated that works made by AIs aren't eligible for copyright, which is the exclusive purview of works of human authorship. This has been affirmed by courts:

https://pluralistic.net/2023/08/20/everything-made-by-an-ai-is-in-the-public-domain/

Neither AI companies nor entertainment companies will pay creative workers if they don't have to. But for any company contemplating selling an AI-generated work, the fact that it is born in the public domain presents a substantial hurdle, because anyone else is free to take that work and sell it or give it away.

Whether or not AI "art" will ever be good art isn't what our bosses are thinking about when they pay for AI licenses: rather, they are calculating that they have so much market power that they can sell whatever slop the AI makes, and pay less for the AI license than they would pay for a human artist's work. As is the case in every industry, AI can't do an artist's job, but an AI salesman can convince an artist's boss to fire the creative worker and replace them with AI:

https://pluralistic.net/2024/01/29/pay-no-attention/#to-the-little-man-behind-the-curtain

They don't care if it's slop – they just care about their bottom line. A studio executive who cancels a widely anticipated film prior to its release to get a tax credit isn't thinking about artistic integrity. They care about one thing: money. The fact that AI works can be freely copied, sold or given away may not mean much to a creative worker who actually makes their own art, but I assure you, it's the only thing that matters to our bosses.

If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

https://pluralistic.net/2024/05/13/spooky-action-at-a-close-up/#invisible-hand

#pluralistic #ai art #eerie #ai #weird #henry farrell #copyright #copyfight #creative labor markets #what is art #ideomotor response #mark fisher #invisible hand #uncanniness #prompting

272 notes

mariacallous · 4 months ago

In late January, a warning spread through the London-based Facebook group Are We Dating the Same Guy?—but this post wasn’t about a bad date or a cheating ex. A connected network of male-dominated Telegram groups had surfaced, sharing and circulating nonconsensual intimate images of women. Their justification? Retaliation.

On January 23, users in the AWDTSG Facebook group began warning about hidden Telegram groups. Screenshots and TikTok videos surfaced, revealing public Telegram channels where users were sharing nonconsensual intimate images. Further investigation by WIRED identified additional channels linked to the network. By scraping thousands of messages from these groups, WIRED was able to analyze their content and the patterns of abuse.

AWDTSG, a sprawling web of over 150 regional forums across Facebook alone, with roughly 3 million members worldwide, was designed by Paola Sanchez in 2022 in New York as a space for women to share warnings about predatory men. But its rapid growth made it a target. Critics argue that the format allows unverified accusations to spiral. Some men have responded with at least three defamation lawsuits filed in recent years against members, administrators, and even Meta, Facebook’s parent company. Others took a different route: organized digital harassment.

Primarily using Telegram group data made available through Telemetr.io, a Telegram analytics tool, WIRED analyzed more than 3,500 messages from a Telegram group linked to a larger misogynistic revenge network. Over 24 hours, WIRED observed users systematically tracking, doxing, and degrading women from AWDTSG, circulating nonconsensual images, phone numbers, usernames, and location data.

From January 26 to 27, the chats became a breeding ground for misogynistic, racist, sexual digital abuse of women, with women of color bearing the brunt of the targeted harassment and abuse. Thousands of users encouraged each other to share nonconsensual intimate images, often referred to as “revenge porn,” and requested and circulated women’s phone numbers, usernames, locations, and other personal identifiers.

As women from AWDTSG began infiltrating the Telegram group, at least one user grew suspicious: “These lot just tryna get back at us for exposing them.”

When women on Facebook tried to alert others of the risk of doxing and leaks of their intimate content, AWDTSG moderators removed their posts. (The group’s moderators did not respond to multiple requests for comment.) Meanwhile, men who had previously coordinated through their own Facebook groups like “Are We Dating the Same Girl” shifted their operations in late January to Telegram's more permissive environment. Their message was clear: If they can do it, so can we.

"In the eyes of some of these men, this is a necessary act of defense against a kind of hostile feminism that they believe is out to ruin their lives," says Carl Miller, cofounder of the Center for the Analysis of Social Media and host of the podcast Kill List.

The dozen Telegram groups that WIRED has identified are part of a broader digital ecosystem often referred to as the manosphere, an online network of forums, influencers, and communities that perpetuate misogynistic ideologies.

“Highly isolated online spaces start reinforcing their own worldviews, pulling further and further from the mainstream, and in doing so, legitimizing things that would be unthinkable offline,” Miller says. “Eventually, what was once unthinkable becomes the norm.”

This cycle of reinforcement plays out across multiple platforms. Facebook forums act as the first point of contact, TikTok amplifies the rhetoric in publicly available videos, and Telegram is used to enable illicit activity. The result? A self-sustaining network of harassment that thrives on digital anonymity.

TikTok amplified discussions around the Telegram groups. WIRED reviewed 12 videos in which creators, of all genders, discussed, joked about, or berated the Telegram groups. In the comments section of these videos, users shared invitation links to public and private groups and some public channels on Telegram, making them accessible to a wider audience. While TikTok was not the primary platform for harassment, discussions about the Telegram groups spread there, and in some cases users explicitly acknowledged their illegality.

TikTok tells WIRED that its Community Guidelines prohibit image-based sexual abuse, sexual harassment, and nonconsensual sexual acts, and that violations result in removals and possible account bans. They also stated that TikTok removes links directing people to content that violates its policies and that it continues to invest in Trust and Safety operations.

Intentionally or not, the algorithms powering social media platforms like Facebook can amplify misogynistic content. Hate-driven engagement fuels growth, pulling new users into these communities through viral trends, suggested content, and comment-section recruitment.

As people took notice on Facebook and TikTok and started reporting the Telegram groups, the groups didn’t disappear; they simply rebranded. Reactionary groups quickly emerged, signaling that members knew they were being watched but had no intention of stopping. Inside, messages revealed a clear awareness of the risks: Users knew they were breaking the law. They just didn’t care, according to chat logs reviewed by WIRED. To absolve themselves, one user wrote, “I do not condone im [simply] here to regulate rules,” while another shared a link to a statement that said: “I am here for only entertainment purposes only and I don’t support any illegal activities.”

Meta did not respond to a request for comment.

Messages from the Telegram group WIRED analyzed show that some chats became hyper-localized, dividing London into four regions to make harassment even more targeted. Members casually sought access to other city-based groups: “Who’s got brum link?” and “Manny link tho?”—British slang referring to Birmingham and Manchester. They weren’t just looking for gossip. “Any info from west?” one user asked, while another requested, “What’s her @?”— hunting for a woman’s social media handle, a first step to tracking her online activity.

The chat logs further reveal how women were discussed as commodities. “She a freak, I’ll give her that,” one user wrote. Another added, “Beautiful. Hide her from me.” Others encouraged sharing explicit material: “Sharing is caring, don’t be greedy.”

Members also bragged about sexual exploits, using coded language to reference encounters in specific locations, and spread degrading, racial abuse, predominantly targeting Black women.

Once a woman was mentioned, her privacy was permanently compromised. Users frequently shared social media handles, which led other members to contact her—soliciting intimate images or sending disparaging texts.

Anonymity can be a protective tool for women navigating online harassment. But it can also be embraced by bad actors who use the same structures to evade accountability.

"It’s ironic," Miller says. "The very privacy structures that women use to protect themselves are being turned against them."

The rise of unmoderated spaces like the abusive Telegram groups makes it nearly impossible to trace perpetrators, exposing a systemic failure in law enforcement and regulation. Without clear jurisdiction or oversight, platforms are able to sidestep accountability.

Sophie Mortimer, manager of the UK-based Revenge Porn Helpline, warned that Telegram has become one of the biggest threats to online safety. She says that the UK charity’s reports to Telegram of nonconsensual intimate image abuse are ignored. “We would consider them to be noncompliant to our requests,” she says. Telegram, however, says it received only “about 10 piece of content” from the Revenge Porn Helpline, “all of which were removed.” Mortimer has not yet responded to WIRED’s questions about the veracity of Telegram’s claims.

Despite recent updates to the UK’s Online Safety Act, legal enforcement against online abuse remains weak. An October 2024 report from the UK-based charity The Cyber Helpline shows that cybercrime victims face significant barriers in reporting abuse, and justice for online crimes is seven times less likely than for offline crimes.

"There’s still this long-standing idea that cybercrime doesn’t have real consequences," says Charlotte Hooper, head of operations of The Cyber Helpline, which helps support victims of cybercrime. "But if you look at victim studies, cybercrime is just as—if not more—psychologically damaging than physical crime."

A Telegram spokesperson tells WIRED that its moderators use “custom AI and machine learning tools” to remove content that violates the platform's rules, “including nonconsensual pornography and doxing.”

“As a result of Telegram's proactive moderation and response to reports, moderators remove millions of pieces of harmful content each day,” the spokesperson says.

Hooper says that survivors of digital harassment often change jobs, move cities, or even retreat from public life due to the trauma of being targeted online. The systemic failure to recognize these cases as serious crimes allows perpetrators to continue operating with impunity.

Yet, as these networks grow more interwoven, social media companies have failed to adequately address gaps in moderation.

Telegram, despite its estimated 950 million monthly active users worldwide, claims it’s too small to qualify as a “Very Large Online Platform” under the European Union’s Digital Services Act, allowing it to sidestep certain regulatory scrutiny. “Telegram takes its responsibilities under the DSA seriously and is in constant communication with the European Commission,” a company spokesperson said.

In the UK, several civil society groups have expressed concern about the use of large private Telegram groups, which allow up to 200,000 members. These groups exploit a loophole by operating under the guise of “private” communication to circumvent legal requirements for removing illegal content, including nonconsensual intimate images.

Without stronger regulation, online abuse will continue to evolve, adapting to new platforms and evading scrutiny.

The digital spaces meant to safeguard privacy are now incubating its most invasive violations. These networks aren’t just growing—they’re adapting, spreading across platforms, and learning how to evade accountability.

57 notes · View notes

amalgamasreal · 4 months ago

Text

Updated Personal Infosec Post

It's been a while since I've made one of these posts (part deux), but I figure with all that's going on in the world it's time to make another one and get some stuff out there for people. A lot of the information I'm going to go over you can find here:

https://www.privacyguides.org/en/tools/

So if you'd like to just click the link and ignore the rest of the post, that's fine; I strongly recommend checking out the Privacy Guides.

Browsers:

There are a number to go with, but for this post I'm going to recommend Firefox. I know that the Privacy Guides lists Brave and Safari as possible options, but Brave is Chromium-based and Safari has ties to Apple. Mullvad is also an option, but that's for your more experienced users so I'll leave that up to them to work out.

Browser Extensions:

uBlock Origin: Content blocker that blocks ads, trackers, and fingerprinting scripts. Notable for being the only ad blocker that still works on YouTube.

Privacy Badger: Content blocker that specifically blocks trackers and fingerprinting scripts. This one will catch things that uBlock doesn't catch but does not work for ads.

Facebook Container: "but I don't have facebook" you might say. Doesn't matter, Meta/Facebook still has trackers out there in EVERYTHING and this containerizes them off away from everything else.

Bitwarden: Password vaulting software, don't trust the password saving features of your browsers, this has multiple layers of security to prevent your passwords from being stolen.

ClearURLs: Allows you to copy and paste URLs without any trackers attached to them.
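To give a rough idea of what "URLs without trackers" means in practice, here's a minimal Python sketch that drops a few well-known tracking parameters from a link. The parameter list (`utm_*`, `fbclid`, `gclid`) and the function name `strip_tracking` are my own assumptions for illustration; this is not how ClearURLs itself is implemented, and the real extension uses a much larger rule set.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed, non-exhaustive examples of tracking parameters (not ClearURLs' actual rules).
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_tracking(url: str) -> str:
    """Return the URL with the assumed tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_NAMES and not key.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the parameters we decided to keep.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("https://example.com/post?id=42&utm_source=newsletter&fbclid=abc123"))
# prints: https://example.com/post?id=42
```

The point is just that the tracking bits live in the query string and can be removed without breaking the link; the extension does this for you automatically.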

VPN:

Note: VPN software doesn't make you anonymous, no matter what your favorite YouTuber tells you, but it does make it harder for your data to be tracked and it makes it less open for whatever public network you're presently connected to.

Mozilla VPN: If you get the annual subscription it's ~$60/year and it comes with an extension that you can install into Firefox.

Mullvad VPN: Is a fast and inexpensive VPN with a serious focus on transparency and security. They have been in operation since 2009. Mullvad is based in Sweden and offers a 30-day money-back guarantee for payment methods that allow it.

Email Provider:

Note: By now you've probably realized that Gmail, Outlook, and basically all of the major "free" e-mail service providers are scraping your e-mail data to use for ad data. There are more secure services that can get you away from that, but if you'd like the same storage levels you have on Gmail/Outlook.com you'll need to pay.

Tuta: Secure, end-to-end encrypted, been around a very long time, and offers a free option up to 1 GB.

Mailbox.org: Is an email service with a focus on being secure, ad-free, and privately powered by 100% eco-friendly energy. They have been in operation since 2014. Mailbox.org is based in Berlin, Germany. Accounts start with up to 2GB storage, which can be upgraded as needed.

Email Client:

Thunderbird: a free, open-source, cross-platform email, newsgroup, news feed, and chat (XMPP, IRC, Matrix) client developed by the Thunderbird community, and previously by the Mozilla Foundation.

FairMail (Android Only): minimal, open-source email app which uses open standards (IMAP, SMTP, OpenPGP), has several out of the box privacy features, and minimizes data and battery usage.

Cloud Storage:

Tresorit: Encrypted cloud storage owned by the national postal service of Switzerland. Received MULTIPLE awards for their security stats.

Peergos: decentralized and open-source, allows for you to set up your own cloud storage, but will require a certain level of expertise.

Microsoft Office Replacements:

LibreOffice: free and open-source, updates regularly, and has the majority of the same functions as base level Microsoft Office.

OnlyOffice: cloud-based, free

FreeOffice: Personal licenses are free, probably the closest to a full office suite replacement.

Chat Clients:

Note: As you've heard SMS and even WhatsApp and some other popular chat clients are basically open season right now. These are a couple of options to replace those.

Note2: Signal has had some reports of security flaws, the service it was built on was originally built for the US Government, and it is based within the CONUS thus is susceptible to US subpoenas. Take that as you will.

Signal: Provides IM and calling securely and encrypted, has multiple layers of data hardening to prevent intrusion and exfil of data.

Molly (Android OS only): Alternative client to Signal. Routes communications through the Tor network.

Briar: Encrypted IM client that connects to other clients through the Tor network, can also chat via wifi or bluetooth.

SimpleX: Truly anonymous account creation, fully encrypted end to end, available for Android and iOS.

Now for the last bit, I know that the majority of people are on Windows or macOS, but if you can get on Linux I would strongly recommend it. Pop!_OS, Ubuntu, and Mint are super easy distros to use and install. They all have very easy to follow instructions on how to install them on your PC, and if you'd like to just test them out all you need is a thumb drive to boot off of to run in demo mode. For more advanced users, the more secure distributions are: Whonix, Tails (Live USB only), and Qubes OS.

On a personal note I use Arch Linux, but I WOULD NOT recommend this be anyone's first distro as it requires at least a base level understanding of Linux and liberal use of the Arch Linux Wiki. If you game through Steam, their Proton compatibility layer works wonders: I'm presently playing a major studio game that released in 2024 with no Linux support, and once I got my drivers installed it's looked great. There are some learning curves to get around, but the benefit of the Linux community is that there are always people out there willing to help. I hope some of this information helps you. Look out for yourself, it's starting to look scarier than normal out there.

#infosec #personal information #personal infosec #info sec #firefox #mullvad #vpn #vpn service #linux #linux tails #pop_os #ubuntu #linux mint #long post #whonix #qubes os #arch linux

81 notes · View notes

d2071art · 7 months ago

Text

NO AI

TL;DR: almost all social platforms are stealing your art and using it to train generative AI (or selling your content to AI developers); please beware and do something. Or don’t, if you’re okay with this.

Which platforms are NOT safe to use for sharing your art:

Facebook, Instagram and all Meta products and platforms (although if you live in the EU, you can forbid Meta to use your content for AI training)

Reddit (sold out all its content to OpenAI)

Twitter

Bluesky (it has no protection from AI scraping and you can’t opt out from 3rd party data / content collection yet)

DeviantArt, Flickr and literally every stock image platform (some didn’t bother to protect their content from scraping, some sold it out to AI developers)

Here’s WHAT YOU CAN DO:

1. Just say no:

Block all 3rd party data collection: you can do this here on Tumblr (here’s how); all other platforms are merely taking suggestions, tbh

Use Cara (they can’t stop illegal scraping yet, but they are currently working with Glaze to build in ‘AI poisoning’, so… fingers crossed)

2. Use art style masking tools:

Glaze: you can a) download the app and run it locally or b) use Glaze’s free web service, all you need to do is register. This one is a fav of mine, ‘cause, unlike all the other tools, it doesn’t require any coding skills (also it is 100% non-commercial and was developed by a bunch of enthusiasts at the University of Chicago)

Anti-DreamBooth: free code; it was originally developed to protect personal photos from being used for forging deepfakes, but it works for art too

Mist: free code for Windows; if you use macOS or don’t have a powerful enough GPU, you can run Mist on Google’s Colab Notebook

(art style masking tools change some pixels in digital images so that AI models can’t process them properly; the changes are almost invisible, so they don’t affect your audience’s perception)

3. Use ‘AI poisoning’ tools

Nightshade: free code for Windows 10/11 and macOS; you’ll need a GPU/CPU and a bunch of machine learning libraries to use it though.

4. Stay safe and fuck all this corporate shit.

#no AI #no ai art

75 notes · View notes

mlleclaudine · 10 months ago

Text

Dreamlike Impasto Paintings Evoke Artist’s Childhood Memories of Rural Life

by Emma Taggart - My Modern Met, August 14, 2024

Artist Anastasia Trusova uses bright acrylics to create colorful, nature-inspired impasto paintings that look like something from a dream. From rolling hills and meadows to winding rivers and serene lakes, each psychedelic scene offers a modern twist on classic impressionist paintings, reminiscent of the works of Claude Monet and Vincent van Gogh.

Trusova grew up in a small town in Russia, where she remembers the simplicity of rural life. “We didn’t have much, like everyone else back then,” she recalls, “but we were surrounded by abundant nature—forests, lakes, and swamps.”

Her passion for art began in childhood, leading her to study design at university. After graduating, Trusova spent eight years in China working as a shoe designer. Eventually, she moved to Belgium to join her husband. Now a mother of three, Trusova is fully dedicated to her painting practice, having developed a unique style she calls “textured graphic impressionism.”

By applying layers of thickly applied acrylic paint to her canvas, Trusova is able to capture nature’s abundant textures. She says, “I want to show the variability of nature, the beauty of the moment, as I see it.” Flowers and leaves are brought to life with thick daubs of pigment, while swirling clouds are formed by skillfully swiping paint across the canvas with a textured scraping tool.

Trusova seeks to capture and preserve her childhood memories of rural life in her art, hoping that future generations will also come to appreciate the beauty of nature through her work.

“Watching young people leave for big cities in search of opportunities, leaving behind quiet streets and abandoned homes, is a sad reality of our time,” she writes on Instagram. “When I return to these familiar places, now as a parent with my children, I feel a strange mix of joy and sadness. Joy from the memories of my childhood spent here, and sadness from the realization that my children will likely never experience this.”

The artist continues, “This painting is an attempt to preserve those memories and pass them on to the next generation. May they remember their roots, even if their lives are far from these places.”

Anastasia Trusova: Website | Facebook | Instagram

#art #impasto painting #Anastasia Trusova #My Modern Met #August 2024 #long post

100 notes · View notes

justforbooks · 3 days ago

Text

Internet users advised to change passwords after 16bn logins exposed

Hacked credentials could give cybercriminals access to Facebook, Meta and Google accounts among others

Internet users have been told to change their passwords and upgrade their digital security after researchers claimed to have revealed the scale of sensitive information – 16bn login records – potentially available to cybercriminals.

Researchers at Cybernews, an online tech publication, said they had found 30 datasets stuffed with credentials harvested from malicious software known as “infostealers” and leaks.

The researchers said the datasets were exposed “only briefly” but amounted to 16bn login records, with an unspecified number of overlapping records – meaning it is difficult to say definitively how many accounts or people have been exposed.

Cybernews said the credentials could open access to services including Facebook, Apple and Google – although there had been no “centralised data breach” at those companies.

Bob Diachenko, the Ukrainian cybersecurity specialist behind the research, said the datasets had become temporarily available after being poorly stored on remote servers – before being removed again. Diachenko said he was able to download the files and would aim to contact individuals and companies that had been exposed.

“It will take some time of course because it is an enormous amount of data,” he said.

Diachenko said the information he had seen in infostealer logs included login URLs to Apple, Facebook and Google login pages. Apple and Facebook’s parent, Meta, have been contacted for comment.

A Google spokesperson said the data reported by Cybernews did not stem from a Google data breach – and recommended people use tools like Google’s password manager to protect their accounts.

Internet users are also able to check if their email has been compromised in a data breach by using the website haveibeenpwned.com. Cybernews said the information seen in the datasets followed a “clear structure: URL, followed by login details and a password”.

Diachenko said the data appeared to be “85% infostealers” and about 15% from historical data breaches such as a leak suffered by LinkedIn.

Experts said the research underlined the need to update passwords regularly and adopt tough security measures such as multifactor authentication – or combining a password with another form of verification such as a code texted from a phone. Other recommended measures include passkeys, a password-free method championed by Google and Facebook’s owner, Meta.

“While you’d be right to be startled at the huge volume of data exposed in this leak it’s important to note that there is no new threat here: this data will have already likely have been in circulation,” said Peter Mackenzie, the director of incident response and readiness at the cybersecurity firm Sophos.

Mackenzie said the research underlined the scale of data that can be accessed by online criminals.

“What we are understanding is the depth of information available to cybercriminals.” He added: “It is an important reminder to everyone to take proactive steps to update passwords, use a password manager and employ multifactor authentication to avoid credential issues in the future.”

Toby Lewis, the global head of threat analysis at the cybersecurity firm Darktrace, said the data flagged in the research is hard to verify but infostealers – the malware reportedly behind the data theft – are “very much real and in use by bad actors”.

He said: “They don’t access a user’s account but instead scrape information from their browser cookies and metadata. If you’re following good practice of using password managers, turning on two-factor authentication and checking suspicious logins, this isn’t something you should be greatly worried about.”

Cybernews said none of the datasets have been reported previously barring one revealed in May with 184m records. It described the datasets as a “blueprint for mass exploitation” including “account takeover, identity theft, and highly targeted phishing”.

The researchers added: “The only silver lining here is that all of the datasets were exposed only briefly: long enough for researchers to uncover them, but not long enough to find who was controlling vast amounts of data.”

Alan Woodward, a professor of cybersecurity at Surrey University, said the news was a reminder to carry out “password spring cleaning”. He added: “The fact that everything seems to be breached eventually is why there is such a big push for zero trust security measures.”

Daily inspiration. Discover more photos at Just for Books…?

#just for books #Cybercrime #Hacking #Internet

5 notes · View notes

uboaappears · 1 year ago

Text

youtube

New video on the prodigal channel! This is an edited stream from my Twitch, in which I went on a whole adventure full of side quests in order to opt out of Meta (Facebook, Instagram, Threads) using all of my posts to train its generative AI model.

I go over the opting out process as well as talking more generally about the ethical problems with the current use of generative AI, especially when it comes to this kind of data scraping. I also discuss some tools that can be used to protect your art, such as Glaze, Nightshade and Sanative.AI.

I personally think it's fair game to use ChatGPT for this because this is, frankly, a pain in my behind. Also, pitting AIs against each other is funny. But what do you think?

#meta ai #artists against ai #stop ai #stop ai art #art youtube #art youtube channel #shroomy videos #shroomy's artivism #Youtube

8 notes · View notes

abivanceconnect · 4 months ago

Text

Social Media and Privacy Concerns!!! What You Need to Know???

In a world that is becoming more digital by the day, social media has become part of our day-to-day lives. From sharing personal updates to networking with professionals, social media sites like Facebook, Instagram, and Twitter have changed the way we communicate. However, concerns over privacy have also grown, with users wondering what happens to their personal information. If you use social media often, it is important to be aware of these privacy risks. In this article, we will outline the main issues and the steps you need to take to protect your online data privacy. (Related: Top 10 Pros and Cons of Social media)

1. How Social Media Platforms Scrape Your Data

The majority of social media platforms scrape plenty of user information, including your:

✅ Name, email address, and phone number

✅ Location and web browsing history

✅ Interests derived from your likes, comments, and search history

Although this enhances the user experience as well as advertising, it raises serious privacy issues. (Read more about social media pros and cons here)

2. Risks of Excessive Sharing of Personal Information

Many users unknowingly expose themselves to security risks through excessive sharing of personal information. Posting details of your daily routine, location, or personal life can lead to:

⚠️ Identity theft

⚠️ Stalking and harassment

⚠️ Cyber fraud

This is why you need to adjust your privacy settings and be careful about what you post on the internet. (Read this article to understand how social media affects users.)

3. The Role of Third-Party Apps in Data Breaches

Did you register for a site with Google or Facebook? Handy, maybe, but in doing so you're granting apps access to your data, normally more than is necessary. Some high-profile privacy scandals, Cambridge Analytica being one example, have shown how social media information can be leveraged for politics and advertising. To minimize danger:

👍 Regularly check app permissions

👍 Don't sign up for multiple accounts where you don't need to

👍 Use strong passwords and two-factor authentication

To get an in-depth overview of social media's impact on security, read this detailed guide.

4. How Social Media Algorithms Follow You

You may not realize this, but social media algorithms are tracking you everywhere. From the posts you like to how long you watch a video, sites monitor it all through AI-driven algorithms that learn from behavior and build personalized feeds. Though this can drive user engagement, it also:

⚠️ Forms filter bubbles that limit different perspectives

⚠️ Increases data exposure in case of hacks

⚠️ Raises ethical concerns around online surveillance

Understanding the advantages and disadvantages of social media will help you make an informed decision. (Find out more about it here)

5. Maintaining Your Privacy: Real-Life Tips

To protect your personal data on social media:

✅ Update privacy settings to limit sharing of data

✅ Be cautious when accepting friend requests from unknown people

✅ Think before you post: consider that anything shared online can be seen by others

✅ Use encrypted messaging apps for sensitive conversations

These small habits can take you a long way in protecting your online existence. (For more detailed information, read this article)

Final Thoughts

Social media is a powerful tool that connects people, companies, and communities. There are privacy concerns, though, and you need to be clever about how your data is being utilized. Being careful about what you share, adjusting privacy settings, and using security best practices can enable you to enjoy the benefits of social media while being safe online. Interested in learning more about how social media influences us? Check out our detailed article on the advantages and disadvantages of social media and the measures to be taken to stay safe on social media.

#social media #online privacy #privacymatters #data privacy #digital privacy #hacking #identity theft #data breach #socialmediaprosandcons #social media safety #cyber security #social security

2 notes · View notes

pomegrnteseed · 1 year ago

Text

artificial intelligence is not whimsical magic, it's theft

AI is to art and creativity what the Dementor's Kiss is to wix: extraction of the soul

Artificial intelligence technologies work like this:

Developer creates an algorithm that's really good at searching for patterns and following commands

Developer creates a training dataset for the technology to begin identifying patterns - this dataset is HUGE, so big that every individual datapoint (word/phrase/image etc) cannot be checked for error or problem

Developer releases AI platform

User asks the platform for a result, giving some specific parameters, often by inputting example data (e.g. images)

The algorithms run, searching through the databank for strong matches in pattern recognition, piecing together what it has learned so far to create a seemingly novel response

The result is presented to the user as "new" "generated" content, but it's just an amalgamation of existing works and words that is persuasively "human-like" (because the result has been harvested from humans' hard work!)

The training dataset that the developers feed the tool oftentimes amounts to theft.

Developers are increasingly being found to scrape the internet, or even licensed art or published books - despite copyright licensing! - to train the machine.

AI does not make something out of nothing (a bit like whichever magical Law it is, Gamp's maybe? idk charms were never my main focus in HP lore). AI pulls from the resources it has been given - the STOLEN WORDS AND IMAGES - and mashes them together in ways that meet the request given by the user. It looks whimsical, but it's actually incredibly problematic.

Unregulated as they are now, AI technologies are stealing the creative ideas, the hearts and souls of art in all forms, and reducing it to pattern recognition.

On top of that, the training datasets that the technologies are given initially are often incredibly biased, leading to them replicating racist, misogynistic, and otherwise oppressive stereotypes in their results. We've already seen the "pale male" bias uncovered in the research by Dr Timnit Gebru and her colleagues. Dr Gebru has also been vocal about the ethical implications of AI in terms of the ecological costs of these softwares. This brilliant article by MIT Technology Review breaks down Dr Gebru's paper that saw her fired from Google, the main arguments of which are:

the ecological and financial costs are unsustainable

the training datasets are too large and so cannot be properly regulated for biases

research opportunity costs (AI looks impressive, but it doesn't actually understand language, so it can be misleading/misdirecting for researchers)

AI models can be convincing, but this can lead to overreliance/too much trust in their accuracy and validity

So, artificial intelligence technologies are embroiled in numerous ethical issues that are far from resolved, even beyond the very real, very important, very concerning issues of plagiarism.

In fandom terms, this comes to be even more problematic when chat bots are created to talk with characters, like the recently discussed High Reeve Draco Malfoy chatbot that has some Facebook Groups in a flurry.

Transformative fiction is tricky in terms of what is ethical/fair transformation of transformative works. I will argue, though, that those hemmings and hawings are moot since Sen removed Manacled from ao3 because she is creating an original fiction story for publication after securing a book deal (which is awesome and I'm very excited to support them in that!).

Moreover, the ethical problems redouble when we take into consideration that feeding Manacled to an artificial intelligence chatbot technology means that reproductions and repackagings of Sen's work are out of their hands entirely. That data cannot be recovered, it will never be erased from the machine. And so when others use the machine, the possible word combinations, particular phrasings, etc will all be input for analysis, reforming and reproduction for other users.

I don't think people understand the gravity of the situation around data control (or, more specifically, the lack of control we have of the data we input into these technologies). Those words are no longer our own the second we type them into the text box on "generative" AI platforms. We cannot get those ideas or words back to call our own. We cannot guarantee that someone else won't use the platform to write something and then use it elsewhere, claiming it's their own when it is in fact ours.

There are serious implications and fundamental (somewhat philosophical, but also very real and extremely urgent) questions about ownership of art in this digital age, the heart of creativity, and what constitutes original work with these technologies being used to assist idea creation or even entire image/text generation.

TLDR - stop using artificial intelligence technologies to engage with fandom. use the endless creative palaces of your minds and take up roleplaying with your pals to explore real-time interactions (roleplay in fandom is a legit thing, there are plenty of fandoms that do RP; this is your chance to do the same for the niche dhr fandoms you're invested in).

Signed, a very tired digital technologies scholar who would like you all to engage critically with digital data privacy, protection, and ethics, please.

3 notes · View notes

crownsoft · 3 months ago

Text

How to Automatically Scrape Facebook Groups

Facebook is a very popular and well-known communication and social networking platform, with users spanning many countries and regions around the world. People are constantly sharing their lives on it. Precisely because of this, many marketers choose this platform for their marketing, customer acquisition, and promotional activities.

On such a comprehensive platform, what kinds of marketing methods are more efficient and more likely to improve conversion rates? This brings us to a specific type of marketing operation called group marketing. Groups are not only highly populated and centralized places but also the main hubs for people with similar traits. As long as you choose the right groups for your marketing, your efforts will be more efficient and result in higher conversion rates.

When it comes to group marketing, we can conduct post marketing, a faster method where you can make posts visible to all group members. Additionally, there’s bulk messaging marketing, which involves sending direct marketing messages to individual users within a group.

However, before executing these marketing operations, there’s a more pressing issue to resolve: How can we acquire enough groups for marketing?

Manually adding groups is very time-consuming and labor-intensive, and there’s a high chance of ending up in irrelevant groups. To solve this issue, there is currently only one effective solution: using tools like the Crownsoft Facebook Group Auto Scraping Marketing Tools, which can easily handle this task for you.

This tool has a highly powerful group scraping feature. In this software, you only need to input relevant group keywords (you can choose multiple keywords), and it will quickly complete the group collection process for you. This makes group marketing very convenient and efficient.

Moreover, this software allows you to directly set certain criteria for the groups, such as the number of members, group verification methods, and types of scraping conditions. These options help you find the groups that are most suitable for your marketing needs, enabling smoother and more effective group marketing operations.

Crownsoft Facebook Group Auto Scraping Marketing Tools supports logging into multiple Facebook accounts simultaneously. It allows you to collect group addresses based on keywords, send bulk messages to groups, batch-add recommended friends, send messages to recommenders, collect individual Facebook users, send bulk private messages, and post bulk comments on pages. Additionally, it features a customer service management function for interacting with followers, providing quick reply templates, and automatically translating chat records.

#Crownsoft Facebook Group Auto Scraping Marketing Tools #Facebook Group Auto Scraping Marketing Tools #facebook #facebook marketing

0 notes

thrina-thrina-on-the-wall · 2 years ago

Text

update on the facebook story situation. Last night I shared a picture of my attempt to make coconut milk without the right scraping tool. Two people reacted.

One: my new friend from New Zealand, the one I'm starting a long distance relationship with

Two: the student who im pretty sure is interested in me. Fortunately it's up to me whether I'm in a love triangle or not and I say no to that

#Mr new Zealand has suggested we do some Bible study and I'm taking that as a good sign.#I'm not in love I don't know if I even want to be but the relationship is good #As for Benny (the student) he's a good person but I really hope he gets the hint that Im not interested.#He's been very proactive in making sure I get his assignments and I'm pretty sure it's an excuse to talk to me #Trouble is I like talking to people

4 notes · View notes

mariacallous · 10 months ago

Text

Less than three months after Apple quietly debuted a tool for publishers to opt out of its AI training, a number of prominent news outlets and social platforms have taken the company up on it.

WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they’ve become a conflict zone over intellectual property and the future of the web.

This new tool, Applebot-Extended, is an extension to Apple’s web-crawling bot that specifically lets website owners tell Apple not to use their data for AI training. (Apple calls this “controlling data usage” in a blog post explaining how it works.) The original Applebot, announced in 2015, initially crawled the internet to power Apple’s search products like Siri and Spotlight. Recently, though, Applebot’s purpose has expanded: The data it collects can also be used to train the foundational models Apple created for its AI efforts.

Applebot-Extended is a way to respect publishers' rights, says Apple spokesperson Nadine Haija. It doesn’t actually stop the original Applebot from crawling the website—which would then impact how that website’s content appeared in Apple search products—but instead prevents that data from being used to train Apple's large language models and other generative AI projects. It is, in essence, a bot to customize how another bot works.

Publishers can block Applebot-Extended by updating a text file on their websites known as the Robots Exclusion Protocol, or robots.txt. This file has governed how bots go about scraping the web for decades—and like the bots themselves, it is now at the center of a larger fight over how AI gets trained. Many publishers have already updated their robots.txt files to block AI bots from OpenAI, Anthropic, and other major AI players.

Robots.txt allows website owners to block or permit bots on a case-by-case basis. While there’s no legal obligation for bots to adhere to what the text file says, compliance is a long-standing norm. (A norm that is sometimes ignored: Earlier this year, a WIRED investigation revealed that the AI startup Perplexity was ignoring robots.txt and surreptitiously scraping websites.)
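To make the case-by-case blocking concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the helper name `is_allowed` are illustrative assumptions, not taken from any real publisher's site, and the sketch only shows how the rules are read, not how Apple's crawlers actually behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: opt out of Applebot-Extended,
# leave every other crawler (including the original Applebot) unrestricted.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # placeholder URL
    for agent in ("Applebot-Extended", "Applebot"):
        print(agent, "allowed" if is_allowed(agent, page) else "blocked")
```

Nothing in the file enforces these answers; a crawler has to choose to consult them, which is why compliance is a norm rather than a guarantee.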

Applebot-Extended is so new that relatively few websites block it yet. Ontario, Canada–based AI-detection startup Originality AI analyzed a sampling of 1,000 high-traffic websites last week and found that approximately 7 percent, predominantly news and media outlets, were blocking Applebot-Extended. This week, the AI agent watchdog service Dark Visitors ran its own analysis of another sampling of 1,000 high-traffic websites, finding that approximately 6 percent had the bot blocked. Taken together, these efforts suggest that the vast majority of website owners either don’t object to Apple’s AI training practices or are simply unaware of the option to block Applebot-Extended.

In a separate analysis conducted this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed (294 of 1,167 primarily English-language, US-based publications) are blocking Applebot-Extended. In comparison, Welsh found that 53 percent of the news websites in his sample block OpenAI’s bot. Google introduced its own AI-specific bot, Google-Extended, last September; it’s blocked by nearly 43 percent of those sites, a sign that Applebot-Extended may still be under the radar. As Welsh tells WIRED, though, the number has been “gradually moving” upward since he started looking.

Welsh has an ongoing project monitoring how news outlets approach major AI agents. “A bit of a divide has emerged among news publishers about whether or not they want to block these bots,” he says. “I don't have the answer to why every news organization made its decision. Obviously, we can read about many of them making licensing deals, where they're being paid in exchange for letting the bots in—maybe that's a factor.”

Last year, The New York Times reported that Apple was attempting to strike AI deals with publishers. Since then, competitors like OpenAI and Perplexity have announced partnerships with a variety of news outlets, social platforms, and other popular websites. “A lot of the largest publishers in the world are clearly taking a strategic approach,” says Originality AI founder Jon Gillham. “I think in some cases, there's a business strategy involved—like, withholding the data until a partnership agreement is in place.”

There is some evidence supporting Gillham’s theory. For example, Condé Nast websites used to block OpenAI’s web crawlers. After the company announced a partnership with OpenAI last week, it unblocked the company’s bots. (Condé Nast declined to comment on the record for this story.) Meanwhile, Buzzfeed spokesperson Juliana Clifton told WIRED that the company, which currently blocks Applebot-Extended, puts every AI web-crawling bot it can identify on its block list unless its owner has entered into a partnership—typically paid—with the company, which also owns the Huffington Post.

Because robots.txt needs to be edited manually, and there are so many new AI agents debuting, it can be difficult to keep an up-to-date block list. “People just don’t know what to block,” says Dark Visitors founder Gavin King. Dark Visitors offers a freemium service that automatically updates a client site’s robots.txt, and King says publishers make up a big portion of his clients because of copyright concerns.
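As a rough illustration of what automating that upkeep could look like, the sketch below regenerates a robots.txt from a list of agent names. The function name `build_robots_txt` and the sample list are hypothetical, and this is not a description of how Dark Visitors' service works; a real block list would need to be checked against each bot operator's current documentation.

```python
def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows each listed agent and allows all others."""
    # De-duplicate and sort so regenerating the file gives a stable result.
    sections = [
        f"User-agent: {agent}\nDisallow: /"
        for agent in sorted(set(blocked_agents))
    ]
    # Wildcard group: any crawler not named above stays unrestricted.
    sections.append("User-agent: *\nAllow: /")
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    # Illustrative list only; these names are assumptions for the example.
    sample_block_list = ["GPTBot", "Applebot-Extended", "Google-Extended"]
    print(build_robots_txt(sample_block_list))
```

Keeping the list in one place and regenerating the file means adding a newly announced bot is a one-line change instead of a manual edit.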

Robots.txt might seem like the arcane territory of webmasters—but given its outsize importance to digital publishers in the AI age, it is now the domain of media executives. WIRED has learned that two CEOs from major media companies directly decide which bots to block.

Some outlets have explicitly noted that they block AI scraping tools because they do not currently have partnerships with their owners. “We’re blocking Applebot-Extended across all of Vox Media’s properties, as we have done with many other AI scraping tools when we don’t have a commercial agreement with the other party,” says Lauren Starke, Vox Media’s senior vice president of communications. “We believe in protecting the value of our published work.”

Others will only describe their reasoning in vague—but blunt!—terms. “The team determined, at this point in time, there was no value in allowing Applebot-Extended access to our content,” says Gannett chief communications officer Lark-Marie Antón.

Meanwhile, The New York Times, which is suing OpenAI over copyright infringement, is critical of the opt-out nature of Applebot-Extended and its ilk. “As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission,” says NYT director of external communications Charlie Stadtlander, noting that the Times will keep adding unauthorized bots to its block list as it finds them. “Importantly, copyright law still applies whether or not technical blocking measures are in place. Theft of copyrighted material is not something content owners need to opt out of.”

It’s unclear whether Apple is any closer to closing deals with publishers. If or when it does, though, the consequences of any data licensing or sharing arrangements may be visible in robots.txt files even before they are publicly announced.

“I find it fascinating that one of the most consequential technologies of our era is being developed, and the battle for its training data is playing out on this really obscure text file, in public for us all to see,” says Gillham.

11 notes · View notes

amalgamasreal · 7 months ago

Text

On Personal InfoSec

It's been a while since I've made one of these posts, but with all that's going on in the world I figure it's time for another one to get some stuff out there for people. A lot of the information I'm going to go over you can find here:

So if you'd like to just click the link and ignore the rest of the post that's fine, I strongly recommend checking out the Privacy Guides.

Browsers:

There are a number of options, but for this post I'm going to recommend Firefox. I know that the Privacy Guides lists Brave and Safari as possible options, but Brave is Chromium-based and Safari has ties to Apple. Mullvad is also an option, but that's for more experienced users, so I'll leave that up to them to work out.

Browser Extensions:

uBlock Origin: Content blocker that blocks ads, trackers, and fingerprinting scripts. Notable for being the only ad blocker that still works on YouTube.

Privacy Badger: Content blocker that specifically blocks trackers and fingerprinting scripts. This one will catch things that uBlock doesn't catch but does not work for ads.

Facebook Container: "but I don't have facebook" you might say. Doesn't matter, Meta/Facebook still has trackers out there in EVERYTHING and this containerizes them off away from everything else.

Bitwarden: Password vaulting software, don't trust the password saving features of your browsers, this has multiple layers of security to prevent your passwords from being stolen.

ClearURLs: Allows you to copy and paste URLs without any trackers attached to them.

VPN:

Note: VPN software doesn't make you anonymous, no matter what your favorite youtuber tells you, but it does make it harder for your data to be tracked and it makes it less open for whatever network you're presently connected to.

Mozilla VPN: If you get the annual subscription it's ~$60/year and it comes with an extension that you can install into Firefox.

Proton VPN: Serves easily the largest number of countries, can take cash payments, and does offer port forwarding.

Email Provider:

Note: By now you've probably realized that Gmail, Outlook, and basically all of the major "free" e-mail service providers are scraping your e-mail data to use for ad data. There are more secure services that can get you away from that but if you'd like the same storage levels you have on Gmail/Outlook.com you'll need to pay.

Proton Mail: Secure, end-to-end encrypted, and fairly easy to set up and use. Offers a free option up to 1 GB.

Tuta: Secure, end-to-end encrypted, been around a very long time, and offers a free option up to 1 GB.

Email Client:

Thunderbird if you're on Windows or Linux

Apple Mail if you're on macOS

Cloud Storage:

Proton Drive: Encrypted cloud storage from the same people as Proton Mail.

Tresorit: Encrypted cloud storage owned by the national postal service of Switzerland. Received MULTIPLE awards for their security stats.

Peergos: decentralized and open-source, allows for you to set up your own cloud storage, but will require a certain level of expertise.

Microsoft Office Replacements:

LibreOffice: free and open-source, updates regularly, and has the majority of the same functions as base level Microsoft Office.

OnlyOffice: cloud-based, free, and open source.

Chat Clients:

Note: As you've heard SMS and even WhatsApp and some other popular chat clients are basically open season right now. These are a couple of options to replace those.

Signal: Provides IM and calling securely and encrypted, has multiple layers of data hardening to prevent intrusion and exfil of data.

Molly (Android OS only): Alternative client to Signal. Can route communications through the Tor network.

Briar: Encrypted IM client that connects to other clients through the Tor network, can also chat via Wi-Fi or Bluetooth.

If you game through Steam, their Proton compatibility layer works wonders. I'm presently playing a major studio game that released in 2024 with no Linux support, and once I got my drivers installed it's looked great. There are some learning curves to get around, but the benefit of the Linux community is that there are always people out there willing to help.

I hope some of this information helps you and look out for yourself, it's starting to look scarier than normal out there.

#information security #infosec #computer security #computer infosec #personal infosec #browsers #internet browser #email #instant messaging #cloud storage #linux #pop os #linux mint #ubuntu #firefox #firefox extensions #long post

67 notes · View notes