attadatabatta
is this thing on?
6 posts
attadatabatta · 5 months ago
So this is happening:
https://www.reuters.com/business/autos-transportation/trump-transition-recommends-scrapping-car-crash-reporting-requirement-opposed-by-2024-12-13/
Here’s this. Let’s save it all.
0 notes
attadatabatta · 6 months ago
Thanks to Adam Kinzinger for the reminder that this is publicly available:
0 notes
attadatabatta · 6 months ago
And a map, from the same source!
Courtesy of Chad Topaz on BlueSky: a list of all of ICE’s field offices and detention facilities:
2 notes
attadatabatta · 6 months ago
Courtesy of Chad Topaz on BlueSky: a list of all of ICE’s field offices and detention facilities:
2 notes
attadatabatta · 6 months ago
A Quick and Dirty Approach to Data Rescue (updated slightly 12/8)
For valuable information, in bulk, that might disappear from the internet for no good reason.
Let me say first that there are better ways to do this. If you're working with an organized team of people, or if you understand web and database architecture, or can code, this is not how you want to go about it. (I am NOT at present aware of any large-scale projects going on. If you are, please spread the word.)
On the other hand, if you're solo and in a hurry - if you can hear the locusts coming, or you think maybe you can - and you know how to use a browser, this is better than nothing. All you'll need is:
a browser and internet access
a spreadsheet program (like Excel or Google Sheets)
a text editor (something as simple as Notepad is fine)
If you've got those, you can start feeding content into the Internet Archive's Wayback Machine in minutes. (I am likewise NOT aware of any large-scale alternative repository that the general public can add to. If you are, again, please spread the word.)
Step 1: Collect a batch of URLs for content you want to save. There are a few ways to do this. 
— Google site:<whatever it is> sitemap and you may turn up either the sitemap itself or a robots.txt document that will tell you where the sitemap(s) is/are. Government sites don't tend to expose these (a directory page formatted for humans to read is not the same thing as a sitemap), but if you're lucky enough to find one, just clean it up in your text editor of choice and you'll have your list of URLs. (If you'd rather let a script do the cleanup, see the sketch after this list.)
— Google site:<whatever it is> <relevant term> (like, say, climate), paste the results into column A of your spreadsheet one page at a time, and sort that column alphabetically. All of the URLs will be clumped together in the Hs, since they all start with "http". Delete the other rows. There's your list. (It may be worth creating a throwaway Google login just so you can alter your settings. Opting for 100 results per page instead of 20 will speed this up.)
— Use a free webcrawler to compile a list for you. Xenu's Link Sleuth (and yes, the name, I know, but focus) is a free download that will start from a URL you select and compile a list of URLs linked by that page and by any subpages. Its express purpose is to find broken links, but it works fine for this, though there are two things to be aware of, neither of which is really its fault. One, it can't crawl anything that has been set up to block crawlers, and some federal pages are in fact set up that way. Two, as fast as it is compared to a person, crawling a large site or subsection thereof will still take a long time (over a day for a big site is entirely possible), and you won't get your list until it's done. So if you're concerned about an imminent threat to a specific set of pages or files, do something about those more quickly, and have this going in the background. It's very simple to use, though. Once it's installed and open, go to File > Check URL..., enter your starting point in the top text field, make sure Check external links is on, and click OK. When it's done, there'll be a pop-up asking if you want a report. You want a report. Click yes (and click Cancel if you get a popup about FTP stuff after that) and you'll get a new browser tab (for a local temporary file, don't worry) displaying it. The URLs you want are in the List of valid URLs you can submit to a search engine: section.
— If you're dealing with a relatively small page or site, you can collect the URLs manually by right clicking every promising-looking link, selecting Copy Link Location or the equivalent, and pasting the results into your text editor, one per row. This is a pain in the ass, however. 
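If you do turn up a sitemap (or a robots.txt that points to one) and don't feel like cleaning it up by hand, a short script can pull the URLs out for you. Here's a rough sketch in Python, standard library only; the example.gov address and the urls.txt filename are placeholders of mine, and it doesn't handle sitemap index files that just point to more sitemaps, so treat it as a starting point rather than a finished tool.

import re
import urllib.request

def fetch(url):
    # Plain GET request; swap in the site you're actually working on.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# 1. Read robots.txt and pull out any "Sitemap:" lines.
robots = fetch("https://www.example.gov/robots.txt")  # placeholder domain
sitemap_urls = re.findall(r"(?im)^sitemap:\s*(\S+)", robots)

# 2. Read each sitemap and collect the <loc> entries, which are the page URLs.
page_urls = []
for sm in sitemap_urls:
    xml = fetch(sm)
    page_urls.extend(re.findall(r"<loc>\s*(.*?)\s*</loc>", xml))

# 3. Write them out one per line, ready to paste into column A.
with open("urls.txt", "w") as f:
    f.write("\n".join(page_urls))
print(len(page_urls), "URLs written to urls.txt")

Paste the contents of urls.txt into column A and carry on from Step 2.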
Step 2: Paste your list of URLs into a spreadsheet, all in column A. If you used the search results method you already have this. 
Step 3: Paste the following string into cell B1, with no spaces in, before, or after it:
http://web.archive.org/save/
Step 4: Enter the following formula in cell C1:
=CONCATENATE($B$1,A1)
Step 5: Drag that formula all the way down column C, for as many rows as you have entries in column A. Column C should then be full of entries that look like http://web.archive.org/save/https://somethingsite.gov/rainfall/
What's the point of this? The point is that the command to save a page to the Wayback Machine can be communicated as part of a URL, so entering any of your column C entries in a browser will upload the page in question to the Internet Archive. Now you just need to open all of these links in quick succession. Don't do this manually either.
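If you'd rather skip the spreadsheet, the same prefixing takes a couple of lines of Python. This is just the CONCATENATE step in script form; it assumes your URLs are sitting in a urls.txt file (my name, one URL per line) and writes the results to save_urls.txt.

SAVE_PREFIX = "http://web.archive.org/save/"

# Read the plain URLs, skipping any blank lines.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Same thing the spreadsheet formula does: glue the save prefix onto each URL.
save_urls = [SAVE_PREFIX + u for u in urls]

with open("save_urls.txt", "w") as f:
    f.write("\n".join(save_urls))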
Step 6: Go to webfx.com/tools/http-status-tool/
This isn't a unique service; if you find an alternative that allows you to put in a good-size batch of URLs at once, that should work just as well or perhaps better. The "real" purpose of this site and those like it is to check whether batches of URLs are actually resolving correctly, as opposed to running into a 404 or some other kind of error. But it does this by, as you might have guessed, opening all of the URLs automatically and invisibly and telling you what happened.
Step 7: Paste a chunk (in the tens, for this site) of your column C URLs into the field here and click the Check button. You'll fairly quickly get your results: anything with a green 200 badge has been uploaded to the Archive successfully. Anything else hasn’t.
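If the status-checker site is down, or you'd rather not feed it thousands of URLs a few dozen at a time, you can do the batch-opening yourself. A minimal sketch, assuming the save_urls.txt file from the snippet above; the ten-second pause is a guess on my part, there because the Wayback Machine will start refusing requests if you hit it too hard, so slower is safer.

import time
import urllib.request

with open("save_urls.txt") as f:
    save_urls = [line.strip() for line in f if line.strip()]

for url in save_urls:
    try:
        with urllib.request.urlopen(url, timeout=120) as resp:
            status = resp.status  # 200 means the same thing as the green badge above
    except Exception as err:
        status = err  # anything else means that page didn't get saved; retry it later
    print(status, url)
    time.sleep(10)  # be gentle; the Archive throttles clients that hammer it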
As for what to archive? Anything that looks useful. 
Be proactive when reading the news. An informative article? Sure. Any government reports the article mentions? Those too. PDFs can be uploaded just fine, and are a good bet on a government site; they're probably charts, forms, official publications, presentation slideshows, or transcripts. Excel files are an even better bet. Neither legacy media nor social media have shown themselves especially trustworthy recently. And the foxes are so excited to be renovating the henhouse.
146 notes
attadatabatta · 6 months ago
A Quick and Dirty Approach to Data Rescue (updated 11/24)
Before the Trump administration or its allies take it down
Let me say first that there are better ways to do this. If you're working with an organized team of people, or if you understand web and database architecture, or can code, this is not how you want to go about it. (I am NOT at present aware of any large-scale projects going on. If you are, please spread the word.)
On the other hand, if you're solo and in a hurry - Trump’s administration was known for deleting things and they’re likely to be much worse this time around - and you know how to use a browser, this is better than nothing. All you'll need is:
a browser and internet access
a spreadsheet program (like Excel or Google Sheets)
a text editor (something as simple as Notepad is fine)
If you've got those, you can start feeding content into the Internet Archive's Wayback Machine in minutes. (I am likewise NOT aware of any large-scale alternative repository that the general public can add to. If you are, again, please spread the word.)
Step 1: Collect a batch of URLs for content you want to save. There are a few ways to do this. 
— Google site:<whatever it is> sitemap and you may turn up either the sitemap itself or a robots.txt document that will tell you where the sitemap(s) is/are. Government sites don't tend to expose these (a directory page formatted for humans to read is not the same thing as a sitemap) but if you're lucky enough to find one, just clean it up in your text editor of choice and you'll have your list of URLs.
— Google site:<whatever it is> <relevant term> (like, say, climate), paste the results into column A of your spreadsheet one page at a time, and sort that column alphabetically. All of the URLs will be clumped together in the Hs, since they all start with "http". Delete the other rows. There's your list. (It may be worth creating a throwaway Google login just so you can alter your settings. Opting for 100 results per page instead of 20 will speed this up.)
— Use a free webcrawler to compile a list for you. Xenu's Link Sleuth (and yes, the name, I know, but focus) is a free download that will start from a URL you select and compile a list of URLs linked by that page and by any subpages. Its express purpose is to find broken links, but it works fine for this, though there are two things to be aware of, neither of which is really its fault. One, it can't crawl anything that has been set up to block crawlers, and some federal pages are in fact set up that way. Two, as fast as it is compared to a person, crawling a large site or subsection thereof will still take a long time (over a day for a big site is entirely possible), and you won't get your list until it's done. So if you're concerned about an imminent threat to a specific set of pages or files, do something about those more quickly, and have this going in the background. It's very simple to use, though. Once it's installed and open, go to File > Check URL..., enter your starting point in the top text field, make sure Check external links is on, and click OK. When it's done, there'll be a pop-up asking if you want a report. You want a report. Click yes (and click Cancel if you get a popup about FTP stuff after that) and you'll get a new browser tab (for a local temporary file, don't worry) displaying it. The URLs you want are in the List of valid URLs you can submit to a search engine: section.
— If you're dealing with a relatively small page or site, you can collect the URLs manually by right clicking every promising-looking link, selecting Copy Link Location or the equivalent, and pasting the results into your text editor, one per row. This is a pain in the ass, however. 
Step 2: Paste your list of URLs into a spreadsheet, all in column A. If you used the search results method you already have this. 
Step 3: Paste the following string into cell B1, with no spaces in, before, or after it:
http://web.archive.org/save/
Step 4: Enter the following formula in cell C1:
=CONCATENATE($B$1,A1)
Step 5: Drag that formula all the way down column C, for as many rows as you have entries in column A. Column C should then be full of entries that look like http://web.archive.org/save/https://somethingsite.gov/rainfall/
What's the point of this? The point is that the command to save a page to the Wayback Machine can be communicated as part of a URL, so entering any of your column C entries in a browser will upload the page in question to the Internet Archive. Now you just need to open all of these links in quick succession. Don't do this manually either.
Step 6: Go to webfx.com/tools/http-status-tool/
This isn't a unique service; if you find an alternative that allows you to put in a good-size batch of URLs at once, that should work just as well or perhaps better. The "real" purpose of this site and those like it is to check whether batches of URLs are actually resolving correctly, as opposed to running into a 404 or some other kind of error. But it does this by, as you might have guessed, opening all of the URLs automatically and invisibly and telling you what happened.
Step 7: Paste a chunk (in the tens, for this site) of your column C URLs into the field here and click the Check button. You'll fairly quickly get your results: anything with a green 200 badge has been uploaded to the Archive successfully. Anything else hasn’t.
As for what to archive? Anything that looks useful. 
Be proactive when reading the news. An informative article? Sure. Any government reports the article mentions? Those too. PDFs can be uploaded just fine, and are a good bet on a government site; they're probably charts, forms, official publications, presentation slideshows, or transcripts. Excel files are an even better bet. Neither legacy media nor social media have shown themselves especially trustworthy recently; if you figure Trump and his friends might want the record gone, save it.
100 notes