documentcloud · 8 months
Scrape Everything
The DOJ maintains a website where they’ve continued to publish court filings and other documents related to the Capitol breach on January 6, 2021. The search options there are limited, however. We were curious about things like relationships between defendants, and we wanted both to preserve the documents and to see new ones as they are added to the repository. We used DocumentCloud’s Scraper Add-On to gather all those documents into a project on DocumentCloud. Try it yourself by following our step-by-step Guide.
Pull all the PDFs off of a web page and upload them to DocumentCloud. Our Scraper Add-On can be scheduled to run at regular intervals, and will go up to two levels deep from any starting page. https://www.documentcloud.org/app?q=%2B#add-ons/MuckRock/documentcloud-scraper-addon
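For readers curious what "pull all the PDFs off of a web page" involves, here is a minimal Python sketch of the link-finding step. It is not the Scraper Add-On's actual code: the names `PdfLinkParser` and `find_pdf_links`, the sample HTML, and the example.org URLs are all hypothetical, and it only parses a page you have already fetched (no crawling, scheduling, or uploading).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose path ends in .pdf (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            if absolute not in self.pdf_links:
                self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return de-duplicated PDF links found in an HTML string, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Made-up page content, for illustration only.
sample_html = """
<a href="/files/indictment.pdf">Indictment</a>
<a href="https://example.org/docs/plea.PDF?download=1">Plea</a>
<a href="/about">About</a>
"""
links = find_pdf_links(sample_html, "https://example.org/cases/")
```

Going "two levels deep" would mean repeating the same step on the non-PDF links found on the starting page, then once more on the pages those lead to.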
documentcloud · 8 months
Are expensive OCR tools really more effective than Free and Open Source options? We tested them to find out.
documentcloud · 11 months
Indictments, at your service:
documentcloud · 11 months
The Mailchimp link was stripped, which is not cool! Luckily, we can copy and paste:
Need to redact documents? You have tools (and options) with DocumentCloud
You often want to make sure sensitive information is not inadvertently made available while working with or sharing documents with others. DocumentCloud has several ways to help.

Redaction tool - The simplest method involves using DocumentCloud's "Redact" tool. Users will navigate to the page where the data is located, click "Redact" in the sidebar, and then draw a box over the area in question. This information is then scrubbed from the document and a new version of it is made available in the app. Please keep in mind, once you've chosen to go forward with redacting anything from the document, it will be gone permanently. If you want to preserve the original document, make sure you have made a copy of it for yourself. Pro tip: Just type the keyboard shortcut ‘R’ while on a document, and the redaction tool will fire up for you.

Bad Redactions - We've previously highlighted Gateway Grant recipient Fiquem Sabendo's use of this functionality to identify where redactions have failed and provide a chance to remove the information. It also generates a report showing you what data was there before for reference. To use it, go to “Add-Ons” and then “Browse All Add-Ons” and search for “Bad Redactions.” Then activate it and it will show up in your Add-Ons menu.

PII Detector - Once activated, this Add-On will attempt to detect personally identifiable information in selected documents, including social security numbers, zip codes, email and physical addresses, phone numbers, etc. It will then generate a report, giving you the opportunity to decide what needs to be redacted from the document. Follow the same steps as Bad Redactions, but look for PII Detector instead.

These and other functionalities were covered in a piece written for the Global Investigative Journalism Network this past April recapping our NICAR 23 session on Add-Ons.

Do you have any examples of how you've been able to leverage any of these tools? Or, perhaps, nominations for MuckRock's FOIA redaction hall of shame? Reach out to our product outreach manager by email and share your story with us. We may highlight it in a future email or video session!
Register now for our August 8 DocumentCloud 101 session!
Registration is now open for the second of our summer series of virtual group DocumentCloud 101 sessions. It is set for Tuesday, August 8, at 2 p.m. Eastern/11 a.m. Pacific. This is another opportunity to showcase how DocumentCloud can help you tackle your transparency projects and highlight any recent changes or updates to the platform. Sign up for the August 8 session and share this email or registration link with others you think may be interested. Don't hesitate to reach out if you have any additional questions.
Luz Toledo on █████, █████ and how he uses DocumentCloud's redaction tools.
documentcloud · 1 year
MuckRock is hiring our first Engagement Journalist! We're looking for someone who is adept at connecting with communities and audiences through newsletters, social media and projects that bring readers into active involvement with the work that we do. The person who ends up in this role will run this newsletter as well as our social media accounts. They will work closely across our teams to make sure our reporting, data and tools drive transparency and change in new and exciting ways. Learn more about the position and apply.
documentcloud · 1 year
DocumentCloud’s most powerful feature has always been our users. Every day, that community pushes the boundaries of what can be done with documents, from solo journo-coders extracting data on deadline to the Documenters platform rethinking how to make public meetings more public.
To help drive that community’s impact and collaboration, today we’re launching Add-Ons, an easy way for anyone to launch, maintain, and share new capabilities right within DocumentCloud, ranging from exporting notes to applying machine learning techniques.
To get started, all you need to do is log in to DocumentCloud, select some documents, and then pick an Add-On. It will start running in the background, and then notify you of its progress. Add-Ons can also optionally send you an email, generate files for you, or be configured to integrate with a wide range of external tools, such as Slack, cloud-hosted APIs, or a range of open source packages.
(read a whole lot more)
In addition to the Hello World Add-On template that demonstrates basic functionality, we have a few Add-Ons live now that can also serve as a base for you to fork and build on:
Regex Extractor: Lets you define a Regex string to pull out specified text matches into a spreadsheet across a selection of documents.
PDF Export: Helps you get your PDFs out of DocumentCloud, adding the selected documents into a Zip file that’s then displayed to you.
Note Export: Extracts all the notes on selected documents and saves them as text files you can download.
Bulk Edit: Lets you update metadata on many documents at once.
SideKick Document Classification: Makes it easy to train a machine learning model to classify documents by an arbitrary type, such as identifying if a document is likely to be an email, a resident complaint, or other categories of records.
Notification Alerts: A simple example of a scheduled automation that lets you adjust a search query and have DocumentCloud alert you if any new results match it.
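To give a feel for what an Add-On like Regex Extractor does, here is a rough Python sketch of the core idea: run one pattern across several documents' text and collect the matches as spreadsheet rows. This is an illustration, not the Add-On's real code. The function names, the sample documents, and the docket-number pattern are all made up, and the text is supplied inline rather than pulled from DocumentCloud.

```python
import csv
import io
import re


def extract_matches(documents, pattern):
    """Return (document title, matched text) rows for every match of pattern."""
    regex = re.compile(pattern)
    rows = []
    for title, text in documents.items():
        for match in regex.finditer(text):
            rows.append((title, match.group(0)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document", "match"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical document text, standing in for a selection of documents.
documents = {
    "complaint.txt": "Case No. 21-cr-0042 was joined with 21-cr-0107.",
    "memo.txt": "No docket numbers here.",
}
rows = extract_matches(documents, r"\d{2}-cr-\d{4}")
csv_text = rows_to_csv(rows)
print(csv_text)
```

Documents with no matches simply contribute no rows, so the output stays a flat list you can sort or filter in any spreadsheet tool.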
Currently, running Add-Ons from the web interface requires submitting them through a review process, which gives our team a chance to check in. You can already run Add-Ons from the command line or as scheduled GitHub Actions, and in the coming months we’ll add the ability to import and run your own Add-Ons directly from the web interface, with no review process.
If you have written an Add-On or have an existing DocumentCloud script or other document analysis tooling you’d like to share as an Add-On for everyone to use, fill out this submission form and our team will follow up with you.
documentcloud · 1 year
Over the past year the DocumentCloud team has been working hard to create better solutions for journalists and the public at large to share, analyze, annotate and, ultimately, publish source documents to the web. In the past six months we have seen great growth in DocumentCloud’s capabilities including the ability to de-index documents, some API changes, new translations of the DocumentCloud website, several new Add-Ons and new paid features.
For previous site improvements, check out all of MuckRock’s release notes, and if you’d like updates emailed to you — along with ways to help contribute to the site’s development yourself — subscribe to our developer newsletter.
documentcloud · 1 year
There were more than 4,000 families living in FEMA-provided housing (typically trailers) in November 2022. FEMA updates this number weekly in their Daily Operations Briefings, but the Briefings are only available as PDFs of slide decks shared to an email list. The Data Liberation Project is compiling those PDFs into a searchable repository and extracting the housing counts into a public, structured data set. We’ll look at how Singer-Vine is approaching the project and how you’ll be able to use the tools he is building to develop your own data sets from useful numbers trapped in PDFs.
Join us on January 31 for a conversation with The Data Liberation Project
We're starting our 2023 virtual conversations by chatting with Jeremy Singer-Vine, director of The Data Liberation Project and the voice behind Data is Plural, "a weekly newsletter of useful/curious datasets." Singer-Vine will be chatting with us about his journey, his vision for The Data Liberation Project and his experiences developing an RSS Document Fetcher Add-On for DocumentCloud as part of the Gateway Grant effort he is currently undertaking: creating a real-time archive and public dataset of FEMA housing reports.
RSVP to Join Us
MuckRock’s Gateway Grants support projects that preserve critical document collections
The Data Liberation Project was one of four initial recipients of MuckRock’s Gateway Grants Program. With support from the Filecoin Foundation for the Distributed Web, MuckRock provides grants to explore ways to leverage technology to preserve access to significantly consequential document collections.
Each grantee has been awarded $10,000 and technical assistance for their projects. The tools they build are open source and available for use by all DocumentCloud users in the form of new capabilities via Add-Ons. DocumentCloud users benefit from the growing library of new functionalities. Any user can write and share a DocumentCloud extension with Add-Ons.
Learn more about our initial grantees.
documentcloud · 1 year
OpenNews has opened up scholarships for conference travel, with a deadline of January 25:
These scholarships are for journalists who work in data or code. This program is one way we can help people in those roles build skills and connections to make their newsrooms more equitable and just.
This program is designed to do two things: help you learn something outside of your current comfort zone, and strengthen your support network of colleagues and peers.