#robots.txt file use for?
ralfmaximus · 1 year ago
Text
Tumblr media
To understand what's going on here, know these things:
OpenAI is the company that makes ChatGPT
A spider is a kind of bot that autonomously crawls the web and sucks up web pages
robots.txt is a standard text file that most web sites use to inform spiders whether or not they have permission to crawl the site; basically a No Trespassing sign for robots
OpenAI's spider is ignoring robots.txt (very rude!)
the web.sp.am site is a research honeypot created to trap ill-behaved spiders, consisting of billions of nonsense garbage pages that look like real content to a dumb robot
OpenAI is training its newest ChatGPT model using this incredibly lame content, having consumed over 3 million pages and counting...
It's absurd and horrifying at the same time.
16K notes · View notes
changes · 2 years ago
Text
Friday, July 28th, 2023
🌟 New
We’ve updated the text for the blog setting that said it would “hide your blog from search results”. Unfortunately, we’ve never been able to guarantee hiding content from search crawlers, unless they play nice with the standard prevention measures of robots.txt and noindex. With this in mind, we’ve changed the text of that setting to be more accurate, insofar as we discourage them, but cannot prevent search indexing. If you want to completely isolate your blog from the outside internet and require only logged in folks to see your blog, then that’s the separate “Hide [blog] from people without an account” setting, which does prevent search engines from indexing your blog.
When creating a poll on the web, you can now have 12 poll options instead of 10. Wow.
For folks using the Android app, if you get a push notification that a blog you’re subscribed to has a new post, that push will take you to the post itself, instead of the blog view.
For those of you seeing the new desktop website layout, we’ve eased up the spacing between columns a bit to hopefully make things feel less cramped. Thanks to everyone who sent in feedback about this! We’re still triaging more feedback as the experiment continues.
🛠 Fixed
While experimenting with new dashboard tab configuration options, we accidentally broke dashboard tabs that had been enabled via Tumblr Labs, like the Blog Subs tab. We’ve rolled back that change to fix those tabs.
We’ve fixed more problems with how we choose what content goes into blogs’ RSS feeds. This time we’ve fixed a few issues with how answer post content is shown as RSS items.
We’ve also fixed some layout issues with the new desktop website navigation, especially glitches caused when resizing the browser window.
Fixed a visual glitch in the new activity redesign experiment on web that was making unread activity items difficult to read in some color palettes.
Fixed a bug in Safari that was preventing mature content from being blurred properly.
When using Tumblr on a mobile phone browser, the hamburger menu icon will now have an indicator when you have an unread ask or submission in your Inbox.
🚧 Ongoing
Nothing to report here today.
🌱 Upcoming
We hear it’s crab day tomorrow on Tumblr. 🦀
We’re working on adding the ability to reply to posts as a sideblog! We’re just getting started, so it may be a little while before we run an experiment with it.
Experiencing an issue? File a Support Request and we’ll get back to you as soon as we can!
Want to share your feedback about something? Check out our Work in Progress blog and start a discussion with the community.
854 notes · View notes
smellslikebot · 1 year ago
Text
"how do I keep my art from being scraped for AI from now on?"
if you post images online, there's no 100% guaranteed way to prevent this, and anything you've already posted has most likely been scraped already, so removing or editing existing content won't undo that. you might be fighting this on a bigger scale as a matter of data privacy and workers' rights, but you might also be looking for smaller, more immediate actions to take.
...so I made this list! I can't vouch for the effectiveness of all of these, but I wanted to compile as many options as possible so you can decide what's best for you.
Discouraging data scraping and "opting out"
robots.txt - This is a file placed in a website's root directory to "ask" web crawlers not to access certain parts of a site. If you have your own website, you can edit this yourself, or you can check which crawlers a site disallows by adding /robots.txt at the end of the URL. This article has instructions for blocking some bots that scrape data for AI (and there's a small example at the end of this list).
HTML metadata - DeviantArt (i know) has proposed the "noai" and "noimageai" meta tags for opting images out of machine learning datasets, while Mojeek proposed "noml". To use all three, you'd put the following in your webpages' headers:
<meta name="robots" content="noai, noimageai, noml">
Have I Been Trained? - A tool by Spawning to search for images in the LAION-5B and LAION-400M datasets and opt your images and web domain out of future model training. Spawning claims that Stability AI and Hugging Face have agreed to respect these opt-outs. Try searching for usernames!
Kudurru - A tool by Spawning (currently a Wordpress plugin) in closed beta that purportedly blocks/redirects AI scrapers from your website. I don't know much about how this one works.
ai.txt - Similar to robots.txt. A new type of permissions file for AI training proposed by Spawning.
ArtShield Watermarker - Web-based tool to add Stable Diffusion's "invisible watermark" to images, which may cause an image to be recognized as AI-generated and excluded from data scraping and/or model training. Source available on GitHub. Doesn't seem to have updated/posted on social media since last year.
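As a quick illustration, a robots.txt that asks some of the known AI training crawlers to stay away could look like the lines below. Treat it as a sketch, not a complete blocklist-- user-agent tokens change and new bots appear all the time, so check an up-to-date list like the article linked above.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Again, this only does anything for crawlers that choose to respect robots.txt.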
Image processing... things
these are popular now, but there seems to be some confusion regarding the goal of these tools; these aren't meant to "kill" AI art, and they won't affect existing models. they won't magically guarantee full protection, so you probably shouldn't loudly announce that you're using them to try to bait AI users into responding
Glaze - UChicago's tool to add "adversarial noise" to art to disrupt style mimicry. Devs recommend glazing pictures last. Runs on Windows and Mac (Nvidia GPU required)
WebGlaze - Free browser-based Glaze service for those who can't run Glaze locally. Request an invite by following their instructions.
Mist - Another adversarial noise tool, by Psyker Group. Runs on Windows and Linux (Nvidia GPU required) or on web with a Google Colab Notebook.
Nightshade - UChicago's tool to distort AI's recognition of features and "poison" datasets, with the goal of making it inconvenient to use images scraped without consent. The guide recommends that you do not disclose whether your art is nightshaded. Nightshade chooses a tag that's relevant to your image. You should use this word in the image's caption/alt text when you post the image online. This means the alt text will accurately describe what's in the image-- there is no reason to ever write false/mismatched alt text!!! Runs on Windows and Mac (Nvidia GPU required)
Sanative AI - Web-based "anti-AI watermark"-- maybe comparable to Glaze and Mist. I can't find much about this one except that they won a "Responsible AI Challenge" hosted by Mozilla last year.
Just Add A Regular Watermark - It doesn't take a lot of processing power to add a watermark, so why not? Try adding complexities like warping, changes in color/opacity, and blurring to make it more annoying for an AI (or human) to remove. You could even try testing your watermark against an AI watermark remover. (the privacy policy claims that they don't keep or otherwise use your images, but use your own judgment)
given that energy consumption was the focus of some AI art criticism, I'm not sure if the benefits of these GPU-intensive tools outweigh the cost, and I'd like to know more about that. in any case, I thought that people writing alt text/image descriptions more often would've been a neat side effect of Nightshade being used, so I hope to see more of that in the future, at least!
246 notes · View notes
jesterwaves · 5 days ago
Text
Making a personal website
Why do it?
Having a website is a great creative outlet, and gives you way more control over your space than social media. You are in full control of the content you host on your site, and, if you ever need to migrate to a new host for it, you won’t have to worry about losing a bunch of stuff (for the most part)
Make a page that's just a bunch of pictures of wizards! Turn it into an ARG! Use it as a portfolio! Make it dedicated to your OC Verse(s)! The world's your oyster! HTML and CSS may seem like a lot at first but it's honestly not very hard to learn!
You don't need to be an expert to have a good looking website!
Sections:
Where to Host + File Hosting
Actually Making a Website
What to write your code IN
Keeping Your Site Accessible
Preventing Scraping with a robots.txt
Etiquette and Useful Terms
Where to Host
There are a few places around the net you can find, but for a personal, fully customizable site you’ll want to avoid commercial places like Squarespace. Squarespace is aimed at people who don’t want to make a site from scratch and are specifically looking at something professional for a portfolio or business. You won't have the rights to the code!
Neocities is the biggest name in the indie web space right now, but Nekoweb has gained some attention lately. You can even use both as a mirror of one another, and if you ever need to move hosts, you can download all your files from either of them.
Differences Between Neocities and Nekoweb:
Neocities offers 1 GB of storage
Nekoweb offers 500 MB of storage (half of Neocities)
Nekoweb does NOT restrict what file types you can host
Neocities restricts file types for non-supporters. Most files are fine; you'll probably only run into issues with video or audio files (but those eat up a lot of space anyway...) Full list here
IMO Neocities is also just more beginner friendly
NOTE: nekoweb has a robots.txt on their server by default; neocities does not, but AFAIK new sites are now given a robots.txt whose allowed/disallowed bots they can set themselves. There has been some misinfo about this: this is not neocities giving your data to ai, it's really just the default state of the internet, unfortunately. Either way, you can set up a robots.txt yourself to say whatever you want!
Alternate File Hosting
It’s best to host everything you can on the same host as your site, but if you're limited by space or file type, you can host things somewhere else.
Make sure to use something dedicated to hosting files, otherwise your links may end up breaking (so don't use discord). I use file garden, which I have liked, though it's slow sometimes. I know others around neocities who have used catbox.moe… but those links always break for me, for some reason.
If you don’t mind hosting on youtube or soundcloud, there are ways you can embed those players onto your site as well!
I host audio and my art gallery on file garden; everything else is directly on my site and it only takes up 2.5% of my 1 GB of space!
Nekoweb and Neocities aren't meant to be used as file hosts, so don't try to use your neocities as "extra storage" for your nekoweb site, or vice versa.
The Actual “Making a Website” Part
For designing your website, I recommend browsing around personal sites on neocities and nekoweb for inspiration before drawing something out. If you don’t want to design a site yourself, there are plenty of templates, including the classic sadgrl.online site generator (and this guide to tweaking it).
Neocities Guides for Absolute Beginners (if you've never used any html, this is a good starting place)
Making a layout from start to finish (if you know what an html tag is, this should be fine for you; it's what I used!)
Making your website responsive (I swear it’s so easy to make your website mobile accessible unless you’re doing something totally crazy with it)
Sadgrl's other guides
The Mozilla and W3Schools documentation are useful resources, but may be confusing to you at first. I myself learned basic HTML, CSS, and Javascript ages ago on Khan Academy, but as Khan Academy started using AI at some point, I have no idea how those hold up.
Take it in pieces, you’ll get a hang of it!
Relative Links
You don't need to link your full URL to link an image; you can link files relatively. For example, if I have a page in my main directory, and an image in a folder titled "images" within that directory, I can just link it like this: "/images/image.png" You may or may not need the slash in the beginning, depending on your host. For neocities, I typically don't.
But what if you have a page in a folder and want to access a link in the main directory? Just add two dots for each folder you want to move backwards from: "../image.png" (1 folder backwards) "../../image.png" (2 folders backwards)
Avoid using relative links on your "not_found.html" page, because that page displays anytime a user tries to access a page that doesn't exist, and it will attempt to retrieve links from whatever the user typed into the bar. eg, if a user typed in "your-url/folder/page", it will treat relative links as though it is in that folder.
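To put the above together with a made-up file layout (all the file and folder names here are just placeholders):

<!-- in /index.html, linking to an image in the images folder -->
<img src="images/image.png" alt="">
<!-- in /blog/post.html, linking to an image sitting in the root directory -->
<img src="../image.png" alt="">
<!-- a root-relative path (leading slash) works the same from any folder -->
<img src="/images/image.png" alt="">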
What to Write Your Code IN
If you make a folder for all your website files and organize it in the same way as it is on your website host, you can open your html files in your browser offline and preview how they work and function.
Codepen is a great free code editor for html, css, and js specifically, which also allows a live preview of your site.
I've tried Dreamweaver and it's super buggy (and definitely not worth the price). I know some people use Visual Studio Code but I've never tried it myself.
Keeping Your Site Accessible
Many websites on the indie web right now are, unfortunately, accessibility nightmares… But it’s actually not that hard to make your website more accessible without sacrificing your artistic intent
Semantic tags are tags that describe the meaning and structure of your content rather than just how it looks, which helps screen readers interpret it. You should also be careful not to use tags for something other than their intended purpose. Here’s a guide to semantic tags. (There's a small example of semantic tags and alt text at the end of this list.)
Alt text can describe elements to screen readers, but for decorative content like dividers, it's unnecessary. To let a screen reader just pass over them, set the alt property to an empty string ("")
Alt text furthermore should be descriptive but concise. Focus on the most important details and meaning/purpose of the image, not all the little details. Descriptions should also be objective, not subjective.
Color Contrast: text with low contrast against the background may be difficult or even impossible for some people to read. You can check color contrast using firefox’s developer tools, or through this website.
Flashing imagery and bright colors should, at the very least, be warned against. There is a way to use Javascript to freeze gifs, but it’s a bit complicated
Many people make their index page list content warnings so people can prepare themselves ahead of time, or turn back if content on the site may be harmful to them.
These are just the major things I’ve run into myself, but I’m still learning how to make my pages more accessible. For more info on things you can do to make your site more accessible check out these resources.
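Here's the small example mentioned above, showing semantic tags plus descriptive and empty alt text (the element names are standard HTML; the content is just filler):

<main>
  <article>
    <h1>My wizard gallery</h1>
    <img src="wizard.png" alt="A wizard in a star-covered robe holding a glowing staff">
    <!-- purely decorative divider: empty alt so screen readers skip it -->
    <img src="divider.png" alt="">
  </article>
</main>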
Prevent Scraping with a Robots.txt
This is not a foolproof method; in fact, bad actors will scrape your files anyway. All a “robots.txt” does is politely request that robots don’t scrape your site for anything… It’s up to the programmers to make their robots LISTEN. Here’s an article that has a blocklist for a bunch of the major bots.
I know this may be demoralizing, but unfortunately the only way you can “protect” your files against ai is to never share them. But, ai can never replace the way you feel about your work or the desire other people have to connect with it. AI can only ever produce a stale, easily digestible imitation… Basically, I know it's scary right now, but keep making your stuff. Do what you can to protect it…. But please don’t let ai stop your spirit!
Etiquette and Useful Terms
88x31 buttons were a staple of the old web, so many people make buttons for their site so other people can link to it!
Hotlinking refers to linking a file from someone else’s site to your own. This isn’t a big problem for big websites like tumblr or twitter, but hotlinking a file from someone’s personal site uses THEIR bandwidth anytime someone loads YOUR site and is frowned upon. This is only applicable to FILES on someone’s page, just linking to their page is fine!
i had an example but tumblr thought it was actual code...
Most browsers allow you to look at the source code for a website by right clicking and choosing "view source". This is a great way to learn how people do certain things... but they may not take kindly if you copy their code. Use it as a guide; don't copy huge chunks of code unless they have said it's okay to.
A webring is a collection of websites with some shared trait/topic that link to each other so that it forms a ring (i.e: Website 1 <--> Website 2 <--> Website 3 <--> Website 1). Web listings and web cliques are similar concepts; it’s basically like joining a club.
An RSS feed is basically like a “following” tab…but for the whole internet (well…any site that has an RSS feed). That way, people who don’t have a neocities or nekoweb (or other) account can get updated whenever your site does. To subscribe to an RSS feed, you’ll need a feed reader, which you can find as an extension for whatever browser you use. As for making an RSS feed, here’s a simple guide (and there's a bare-bones example at the end of this list).
Javascript’s pretty complicated and I just look up what I want to know and learn from there, so I'm not confident enough to give you help with it. But I had to learn that scripts are very picky about where you declare them: a script that touches an element on the page needs to run after that element exists, so try putting it at the end of the body (or use the defer attribute). If they aren't working, try moving them around.
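Here's the bare-bones RSS example mentioned above-- a hand-written RSS 2.0 feed with placeholder titles and URLs, just to show the shape of the file:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>My Site</title>
    <link>https://example.com/</link>
    <description>Updates from my site</description>
    <item>
      <title>New art page!</title>
      <link>https://example.com/art/new-page.html</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>

Every time you update your site, you add a new <item> (newest first, by convention) and feed readers will pick it up.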
I'm not an expert, so apologies if I've said anything wrong/confusing. These are resources I found useful or WISH I had when I started. Happy coding!
14 notes · View notes
probablyasocialecologist · 1 year ago
Text
There has been a real backlash to AI’s companies’ mass scraping of the internet to train their tools that can be measured by the number of website owners specifically blocking AI company scraper bots, according to a new analysis by researchers at the Data Provenance Initiative, a group of academics from MIT and universities around the world.  The analysis, published Friday, is called “Consent in Crisis: The Rapid Decline of the AI Data Commons,” and has found that, in the last year, “there has been a rapid crescendo of data restrictions from web sources” restricting web scraper bots (sometimes called “user agents”) from training on their websites. Specifically, about 5 percent of the 14,000 websites analyzed had modified their robots.txt file to block AI scrapers. That may not seem like a lot, but 28 percent of the “most actively maintained, critical sources,” meaning websites that are regularly updated and are not dormant, have restricted AI scraping in the last year. An analysis of these sites’ terms of service found that, in addition to robots.txt restrictions, many sites also have added AI scraping restrictions to their terms of service documents in the last year.
[...]
The study, led by Shayne Longpre of MIT and done in conjunction with a few dozen researchers at the Data Provenance Initiative, called this change an “emerging crisis” not just for commercial AI companies like OpenAI and Perplexity, but for researchers hoping to train AI for academic purposes. The New York Times said this shows that the data used to train AI is “disappearing fast.”
23 July 2024
86 notes · View notes
itsjunetime · 1 year ago
Text
i built a little crate for tower-based rust web servers the other day, if anyone’s interested. it’s called tower-no-ai, and it adds a layer to your server which can redirect all “AI” scraping bots to a URL of your choice (such as a 10gb file that hetzner uses for speed testing - see below). here’s how you can use it:
Tumblr media
it also provides a function to serve a robots.txt file which will disallow all of these same bots from scraping your site (but obviously, it’s up to these bots to respect your robots.txt, which is why the redirect layer is more recommended). because it’s built for tower, it should work with all crates which support tower (such as axum, warp, tonic, etc.). there’s also an option to add a pseudo-random query parameter to the end of each redirect response so that these bots (probably) won’t automatically cache the response of the url you redirect them to and instead just re-fetch it every time (to maximize time/resource wasting of these shitty bots).
you can view (and star, if you would so like :)) the repo here:
60 notes · View notes
nixcraft · 2 years ago
Text
Are you a content creator or a blog author who generates unique, high-quality content for a living? Have you noticed that generative AI platforms like OpenAI or CCBot use your content to train their algorithms without your consent? Don’t worry! You can block these AI crawlers from accessing your website or blog by using the robots.txt file.
Tumblr media
Web developers should know how to add OpenAI, Google, and Common Crawl bots to their robots.txt to block (more like politely ask) generative AI from stealing content and profiting from it.
-> Read more: How to block AI Crawler Bots using robots.txt file
73 notes · View notes
mariacallous · 11 months ago
Text
Less than three months after Apple quietly debuted a tool for publishers to opt out of its AI training, a number of prominent news outlets and social platforms have taken the company up on it.
WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they’ve become a conflict zone over intellectual property and the future of the web.
This new tool, Applebot-Extended, is an extension to Apple’s web-crawling bot that specifically lets website owners tell Apple not to use their data for AI training. (Apple calls this “controlling data usage” in a blog post explaining how it works.) The original Applebot, announced in 2015, initially crawled the internet to power Apple’s search products like Siri and Spotlight. Recently, though, Applebot’s purpose has expanded: The data it collects can also be used to train the foundational models Apple created for its AI efforts.
Applebot-Extended is a way to respect publishers' rights, says Apple spokesperson Nadine Haija. It doesn’t actually stop the original Applebot from crawling the website—which would then impact how that website’s content appeared in Apple search products—but instead prevents that data from being used to train Apple's large language models and other generative AI projects. It is, in essence, a bot to customize how another bot works.
Publishers can block Applebot-Extended by updating a text file on their websites known as the Robots Exclusion Protocol, or robots.txt. This file has governed how bots go about scraping the web for decades—and like the bots themselves, it is now at the center of a larger fight over how AI gets trained. Many publishers have already updated their robots.txt files to block AI bots from OpenAI, Anthropic, and other major AI players.
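For reference, opting out looks like any other robots.txt rule; a minimal sketch, using the user-agent token Apple documents for this purpose:

User-agent: Applebot-Extended
Disallow: /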
Robots.txt allows website owners to block or permit bots on a case-by-case basis. While there’s no legal obligation for bots to adhere to what the text file says, compliance is a long-standing norm. (A norm that is sometimes ignored: Earlier this year, a WIRED investigation revealed that the AI startup Perplexity was ignoring robots.txt and surreptitiously scraping websites.)
Applebot-Extended is so new that relatively few websites block it yet. Ontario, Canada–based AI-detection startup Originality AI analyzed a sampling of 1,000 high-traffic websites last week and found that approximately 7 percent—predominantly news and media outlets—were blocking Applebot-Extended. This week, the AI agent watchdog service Dark Visitors ran its own analysis of another sampling of 1,000 high-traffic websites, finding that approximately 6 percent had the bot blocked. Taken together, these efforts suggest that the vast majority of website owners either don’t object to Apple’s AI training practices or are simply unaware of the option to block Applebot-Extended.
In a separate analysis conducted this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed (294 of 1,167 primarily English-language, US-based publications) are blocking Applebot-Extended. In comparison, Welsh found that 53 percent of the news websites in his sample block OpenAI’s bot. Google introduced its own AI-specific bot, Google-Extended, last September; it’s blocked by nearly 43 percent of those sites, a sign that Applebot-Extended may still be under the radar. As Welsh tells WIRED, though, the number has been “gradually moving” upward since he started looking.
Welsh has an ongoing project monitoring how news outlets approach major AI agents. “A bit of a divide has emerged among news publishers about whether or not they want to block these bots,” he says. “I don't have the answer to why every news organization made its decision. Obviously, we can read about many of them making licensing deals, where they're being paid in exchange for letting the bots in—maybe that's a factor.”
Last year, The New York Times reported that Apple was attempting to strike AI deals with publishers. Since then, competitors like OpenAI and Perplexity have announced partnerships with a variety of news outlets, social platforms, and other popular websites. “A lot of the largest publishers in the world are clearly taking a strategic approach,” says Originality AI founder Jon Gillham. “I think in some cases, there's a business strategy involved—like, withholding the data until a partnership agreement is in place.”
There is some evidence supporting Gillham’s theory. For example, Condé Nast websites used to block OpenAI’s web crawlers. After the company announced a partnership with OpenAI last week, it unblocked the company’s bots. (Condé Nast declined to comment on the record for this story.) Meanwhile, Buzzfeed spokesperson Juliana Clifton told WIRED that the company, which currently blocks Applebot-Extended, puts every AI web-crawling bot it can identify on its block list unless its owner has entered into a partnership—typically paid—with the company, which also owns the Huffington Post.
Because robots.txt needs to be edited manually, and there are so many new AI agents debuting, it can be difficult to keep an up-to-date block list. “People just don’t know what to block,” says Dark Visitors founder Gavin King. Dark Visitors offers a freemium service that automatically updates a client site’s robots.txt, and King says publishers make up a big portion of his clients because of copyright concerns.
Robots.txt might seem like the arcane territory of webmasters—but given its outsize importance to digital publishers in the AI age, it is now the domain of media executives. WIRED has learned that two CEOs from major media companies directly decide which bots to block.
Some outlets have explicitly noted that they block AI scraping tools because they do not currently have partnerships with their owners. “We’re blocking Applebot-Extended across all of Vox Media’s properties, as we have done with many other AI scraping tools when we don’t have a commercial agreement with the other party,” says Lauren Starke, Vox Media’s senior vice president of communications. “We believe in protecting the value of our published work.”
Others will only describe their reasoning in vague—but blunt!—terms. “The team determined, at this point in time, there was no value in allowing Applebot-Extended access to our content,” says Gannett chief communications officer Lark-Marie Antón.
Meanwhile, The New York Times, which is suing OpenAI over copyright infringement, is critical of the opt-out nature of Applebot-Extended and its ilk. “As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission,” says NYT director of external communications Charlie Stadtlander, noting that the Times will keep adding unauthorized bots to its block list as it finds them. “Importantly, copyright law still applies whether or not technical blocking measures are in place. Theft of copyrighted material is not something content owners need to opt out of.”
It’s unclear whether Apple is any closer to closing deals with publishers. If or when it does, though, the consequences of any data licensing or sharing arrangements may be visible in robots.txt files even before they are publicly announced.
“I find it fascinating that one of the most consequential technologies of our era is being developed, and the battle for its training data is playing out on this really obscure text file, in public for us all to see,” says Gillham.
11 notes · View notes
paradoxcase · 1 month ago
Text
I'd like some people to participate in a quick LLM experiment - specifically, to get some idea as to what degree LLMs are actually ignoring things like robots.txt and using content they don't have permission to use.
Think of an OC that belongs to you, that you have created some non-professional internet-based content for - maybe you wrote a story with them in it and published it on AO3, or you talked about them on your blog, or whatever, but it has to be content that can be seen by logged-out users. Use Google to make sure they don't share a name with a real person or a well-known character belonging to someone else.
Pick an LLM of your choice and ask it "Who is <your OC's name>?" Don't provide any context. Don't give it any other information, or ask any follow-up questions. If you're worried about the ethicality of this, sending a single prompt is not going to be making anyone money, and the main energy-consumption associated with LLMs is in training them, so unless you are training your own model, you are not going to be contributing much to climate change and this request probably doesn't use much more energy than say, posting to tumblr.
Now check if what you've posted about this OC is actually searchable on Google.
Now answer:
robots.txt is a file that most websites have, which tells internet bots what parts of the website they can and cannot access. It was originally established for search engine webcrawlers. It's an honor system, so there's nothing that actually prevents bots from just ignoring it. I've seen a lot of unsourced claims that LLM bots are ignoring robots.txt, but Wikipedia claims that GPTBot at least does respect robots.txt, at least at the current time. Some sites use robots.txt to forbid GPTBot and other LLM bots specifically while allowing other bots, but I think it's probably safe to say that if your content does not appear on Google, it has probably been forbidden to all bots using robots.txt.
I did this experiment myself, using an OC that I've posted a lot of content for on Dreamwidth, none of which has been indexed by Google, and the LLM did not know who they were. I'd be interested to hear about results from people posting on other sites.
2 notes · View notes
jadeseadragon · 3 months ago
Text
"There are 3 things a visual artist can do to protect their work. First, dump using Meta and X, who are scraping the images that are loaded to their apps. Second, on your artist's portfolio website, install a Robots.txt and ai.txt file that blocks the AI scraping bots. Please note that Adobe Portfolio websites don't give you this option to block AI bot scraping. Third, use the app "glaze" to protect your work. This app embeds data into and image that renders you image unusable to AI. If anyone needs more info on the Robots.txt, AI.txt setup or the app Glaze, please feel free to reach out to me." [E.R. Flynn]
2 notes · View notes
vague-humanoid · 1 year ago
Text
Recent discussions on Reddit are no longer showing up in non-Google search engine results. The absence is the result of updates to Reddit’s Content Policy that ban crawling its site without agreeing to Reddit’s rules, which bar using Reddit content for AI training without Reddit’s explicit consent.
As reported by 404 Media, using "site:reddit.com" on non-Google search engines, including Bing, DuckDuckGo, and Mojeek, brings up minimal or no Reddit results from the past week. Ars Technica made searches on these and other search engines and can confirm the findings. Brave, for example, brings up a few Reddit results sometimes (examples here and here) but not nearly as many as what appears on Google when using identical queries. A standout is Kagi, which is a paid-for engine that pays Google for some of its search index and still shows recent Reddit results.
As 404 Media noted, Reddit's Robots Exclusion Protocol (robots.txt file) blocks bots from scraping the site. The protocol also states, "Reddit believes in an open Internet, but not the misuse of public content." Reddit has approved scrapers from the Internet Archive and some research-focused entities.
Reddit announced changes to its robots.txt file on June 25. Ahead of the changes, it said it had "seen an uptick in obviously commercial entities who scrape Reddit and argue that they are not bound by our terms or policies. Worse, they hide behind robots.txt and say that they can use Reddit content for any use case they want."
Last month, Reddit said that any "good-faith actor" could reach out to Reddit to try to work with the company, linking to an online form. However, Colin Hayhurst, Mojeek's CEO, told me via email that he reached out to Reddit after he was blocked but that Reddit "did not respond to many messages and emails." He noted that since 404 Media's report, Reddit CEO Steve Huffman has reached out.
7 notes · View notes
simnostalgia · 2 years ago
Text
I've recently learned how to scrape websites that require a login. This took a lot of work and seemed to have very little documentation online so I decided to go ahead and write my own tutorial on how to do it.
Tumblr media
We're using HTTrack as I think that Cyotek does basically the same thing but it's just more complicated. Plus, I'm more familiar with HTTrack and I like the way it works.
Tumblr media
So first thing you'll do is give your project a name. This name is what the file that stores your scrape information will be called. If you need to come back to this later, you'll find that file.
Also, be sure to pick a good file-location for your scrape. It's a pain to have to restart a scrape (even if it's not from scratch) because you ran out of room on a drive. I have a secondary drive, so I'll put my scrape data there.
Tumblr media
Next you'll put in your WEBSITE NAME and you'll hit "SET OPTIONS..."
Tumblr media
This is where things get a little bit complicated. So when the window pops up you'll hit 'browser ID' in the tabs menu up top. You'll see this screen.
What you're doing here is giving the program the cookies that you're using to log in. You'll need two things. You'll need your cookie and the ID of your browser. To do this you'll need to go to the website you plan to scrape and log in.
Tumblr media
Once you're logged in press F12. You'll see a page pop up at the bottom of your screen on Firefox. I believe that for chrome it's on the side. I'll be using Firefox for this demonstration but everything is located in basically the same place so if you don't have Firefox don't worry.
So you'll need to click on some link within the website. You should see the area below get populated by items. Click on one, then click "Headers" and scroll down until you see the cookie and browser ID. Just copy those and put them into the corresponding text boxes in HTTrack! Be sure to add "Cookie: " before you paste your cookie text. Also make sure you have ONE space between the colon and the cookie.
Tumblr media Tumblr media
Next we're going to make two stops and make sure that we hit a few more smaller options before we add the rule set. First, we'll make a stop at LINKS and click GET NON-HTML LINKS and next we'll go and find the page where we turn on "TOLERANT REQUESTS", turn on "ACCEPT COOKIES" and select "DO NOT FOLLOW ROBOTS.TXT"
This will make sure that you're not overloading the servers, that you're getting everything from the scrape and NOT just pages, and that you're not following the website's indexing rules meant for bots like Googlebot. Basically you want to get the pages that the website tells Google not to index!
Okay, last section. This part is a little difficult so be sure to read carefully!
Tumblr media
So when I first started trying to do this, I kept having an issue where I kept getting logged out. I worked for hours until I realized that it's because the scraper was clicking "log out' to scrape the information and logging itself out! I tried to exclude the link by simply adding it to an exclude list but then I realized that wasn't enough.
So instead, I decided to only download certain files.  So I'm going to show you how to do that. First I want to show you the two buttons over to the side. These will help you add rules. However, once you get good at this you'll be able to write your own by hand or copy and past a rule set that you like from a text file. That's what I did!
Tumblr media
Here is my pre-written rule set. Basically this just tells the downloader that I want ALL images, I want any item that includes the following keyword, and the -* means that I want NOTHING ELSE. The 'attach' means that I'll get all .zip files and images that are attached since the website that I'm scraping has attachments with the word 'attach' in the URL.
It would probably be a good time to look at your website and find out what key words are important if you haven't already. You can base your rule set off of mine if you want!
WARNING: It is VERY important that you add -* at the END of the list or else it will basically ignore ALL of your rules. And anything added AFTER it will ALSO be ignored.
Tumblr media
Good to go!
Tumblr media
And you're scraping! I was using INSIMADULT as my test.
There are a few notes to keep in mind: This may take up to several days. You'll want to leave your computer on. Also, if you need to restart a scrape from a saved file, it still has to re-verify ALL of those links that it already downloaded. It's faster than starting from scratch but it still takes a while. It's better to just let it do its thing all in one go.
Also, if you need to cancel a scrape but want all the data that is in the process of being added already then ONLY press cancel ONCE. If you press it twice it keeps all the temp files. Like I said, it's better to let it do its thing but if you need to stop it, only press cancel once. That way it can finish up the URLs already scanned before it closes.
40 notes · View notes
webtechnicaltips · 4 months ago
Text
youtube
In this comprehensive guide, learn what a robots.txt file is, its use cases, and how to create one effectively. We'll explore the RankMath Robots.txt Tester Tool and discuss the best SEO practices associated with robots.txt files. Additionally, we'll analyze live examples to provide a clear understanding of its implementation.
2 notes · View notes
sa6566 · 1 year ago
Text
What is the best way to optimize my website for search engines?
Optimizing Your Website for Search Engines:
Keyword Research and Planning
Identify relevant keywords and phrases for your content
Use tools like Google Keyword Planner, Ahrefs, or SEMrush to find keywords
Plan content around target keywords
On-Page Optimization
Title Tags: Write unique, descriptive titles for each page
Meta Descriptions: Write compelling, keyword-rich summaries for each page
Header Tags: Organize content with H1, H2, H3, etc. headers
Content Optimization: Use keywords naturally, aim for 1-2% density
URL Structure: Use clean, descriptive URLs with target keywords
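As a rough sketch, those on-page elements might look like this in a page's HTML (the wording and keyword are placeholders):

<title>Blue Garden Widgets | Example Co</title>
<meta name="description" content="Durable blue garden widgets with free shipping and a 2-year warranty.">
<h1>Blue Garden Widgets</h1>
<h2>Why choose our widgets?</h2>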
Technical Optimization
Page Speed: Ensure fast loading times (under 3 seconds)
Mobile-Friendliness: Ensure responsive design for mobile devices
SSL Encryption: Install an SSL certificate for secure browsing
XML Sitemap: Create and submit a sitemap to Google Search Console
Robots.txt: Optimize crawling and indexing with a robots.txt file
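A bare-bones robots.txt for this setup might allow all crawlers and point them at the sitemap (the domain is a placeholder):

User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml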
Content Creation and Marketing
High-Quality Content: Create informative, engaging, and valuable content
Content Marketing: Share content on social media, blogs, and guest posts
Internal Linking: Link to relevant pages on your website
Image Optimization: Use descriptive alt tags and file names
Link Building and Local SEO
Backlinks: Earn high-quality backlinks from authoritative sources
Local SEO: Claim and optimize Google My Business listing
NAP Consistency: Ensure consistent name, address, and phone number across the web
Analytics and Tracking
Google Analytics: Install and track website analytics
Google Search Console: Monitor search engine rankings and traffic
Track Keyword Rankings: Monitor target keyword rankings
Tumblr media
8 notes · View notes
rohan980 · 1 year ago
Text
What is robots.txt and what is it used for?
Robots.txt is a text file that website owners create to instruct web robots (also known as web crawlers or spiders) how to crawl pages on their website. It is a part of the Robots Exclusion Protocol (REP), which is a standard used by websites to communicate with web crawlers.
The robots.txt file typically resides in the root directory of a website and contains directives that specify which parts of the website should not be accessed by web crawlers. These directives can include instructions to allow or disallow crawling of specific directories, pages, or types of content.
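For instance, a simple robots.txt using these directives might look like the following (the paths are only illustrative):

User-agent: *
Disallow: /private/
Disallow: /drafts/old-page.html
Allow: /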
Webmasters use robots.txt for various purposes, including:
Controlling Access: Website owners can use robots.txt to control which parts of their site are accessible to search engine crawlers. For example, they may want to prevent crawlers from indexing certain pages or directories that contain sensitive information or duplicate content.
Crawl Efficiency: By specifying which pages or directories should not be crawled, webmasters can help search engines focus their crawling efforts on the most important and relevant content on the site. This can improve crawl efficiency and ensure that search engines index the most valuable content.
Preserving Bandwidth: Crawlers consume server resources and bandwidth when accessing a website. By restricting access to certain parts of the site, webmasters can reduce the load on their servers and conserve bandwidth.
Privacy: Robots.txt can be used to prevent search engines from indexing pages that contain private or confidential information that should not be made publicly accessible.
It's important to note that while robots.txt can effectively instruct compliant web crawlers, it does not serve as a security measure. Malicious bots or those that do not adhere to the Robots Exclusion Protocol may still access content prohibited by the robots.txt file. Therefore, sensitive or confidential information should not solely rely on robots.txt for protection.
Click here for best technical SEO service
9 notes · View notes