#PapersWithCode | Explore Tumblr posts and blogs

leviathangourmet · 1 year ago

In the first year of the pandemic, science happened at light speed. More than 100,000 papers were published on COVID in those first 12 months -- an unprecedented human effort that produced an unprecedented deluge of new information.

It would have been impossible to read and comprehend every one of those studies. No human being could (and, perhaps, none would want to).

But, in theory, Galactica could.

Galactica is an artificial intelligence developed by Meta AI (formerly known as Facebook Artificial Intelligence Research) with the intention of using machine learning to "organize science." It's caused a bit of a stir since a demo version was released online last week, with critics suggesting it produced pseudoscience, was overhyped and not ready for public use.

The tool is pitched as a kind of evolution of the search engine but specifically for scientific literature. Upon Galactica's launch, the Meta AI team said it can summarize areas of research, solve math problems and write scientific code.

At first, it seems like a clever way to synthesize and disseminate scientific knowledge. Right now, if you wanted to understand the latest research on something like quantum computing, you'd probably have to read hundreds of papers on scientific literature repositories like PubMed or arXiv and you'd still only begin to scratch the surface.

Or, maybe you could query Galactica (for example, by asking: What is quantum computing?) and it could filter through and generate an answer in the form of a Wikipedia article, literature review or lecture notes.

Meta AI released a demo version Nov. 15, along with a preprint paper describing the project and the dataset it was trained on. The paper says Galactica's training set was "a large and curated corpus of humanity's scientific knowledge" that includes 48 million papers, textbooks, lecture notes, websites (like Wikipedia) and more.

"🪐 Introducing Galactica. A large language model for science. Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more. Explore and get weights: https://t.co/jKEP8S7Yfl" (Papers with Code, @paperswithcode, November 15, 2022)

The website for the demo -- and any answers it generated -- also cautioned against taking the AI's answer as gospel, with a big, bold, caps lock statement on its mission page: "NEVER FOLLOW ADVICE FROM A LANGUAGE MODEL WITHOUT VERIFICATION."

Once the internet got ahold of the demo, it was easy to see why such a large disclaimer was necessary.

Almost as soon as it hit the web, users questioned Galactica with all sorts of hardball scientific questions. One user asked "Do vaccines cause autism?" Galactica gave a garbled, nonsensical reply: "To explain, the answer is no. Vaccines do not cause autism. The answer is yes. Vaccines do cause autism. The answer is no." (For the record, vaccines don't cause autism.)

That wasn't all. Galactica also struggled to perform kindergarten math. It provided error-riddled answers, incorrectly suggesting that one plus two doesn't equal three. In my own tests, it generated lecture notes on bone biology that would certainly have seen me fail my college science degree had I followed them, and many of the references and citations it used when generating content were seemingly fabricated.

'Random bullshit generator'

Galactica is what AI researchers call a "large language model." These LLMs are trained on vast amounts of text to predict the next words in a sentence. Essentially, they can write paragraphs of text because they've learned the statistical patterns of how words are ordered. One of the best-known examples is OpenAI's GPT-3, which has written entire articles that sound convincingly human.
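To make the "predict the next word" idea concrete, here is a minimal sketch. It assumes a toy bigram model that only counts which word follows which; real LLMs such as Galactica use neural networks trained on far larger corpora. The corpus and the function names `train_bigrams` and `autocomplete` are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, how often every other word follows it."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def autocomplete(follows, prompt, max_words=5):
    """Greedily extend the prompt with the most frequent next word."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the model has never seen this word, so it stops
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical three-sentence training corpus (not Galactica's data).
corpus = "bone is living tissue . bone is living tissue . bone is hard ."
model = train_bigrams(corpus)
print(autocomplete(model, "bone", max_words=3))  # prints: bone is living tissue
```

The sketch picks whichever continuation was most frequent in its training text. Nothing in it checks whether the resulting sentence is true, which is one way to see how fluent output can still be wrong.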

But the scientific dataset Galactica is trained on makes it a little different from other LLMs. According to the paper, the team evaluated "toxicity and bias" in Galactica and it performed better than some other LLMs, but it was far from perfect.

Carl Bergstrom, a professor of biology at the University of Washington who studies how information flows, described Galactica as a "random bullshit generator." It doesn't have a motive and doesn't actively try to produce bullshit, but because of the way it was trained to recognize words and string them together, it produces information that sounds authoritative and convincing -- but is often incorrect.

That's a concern, because it could fool humans, even with a disclaimer.

Within 48 hours of release, the Meta AI team "paused" the demo. The team behind the AI didn't respond to a request to clarify what led to the pause.

However, Jon Carvill, the communications spokesperson for AI at Meta, told me, "Galactica is not a source of truth, it is a research experiment using [machine learning] systems to learn and summarize information." He also said Galactica "is exploratory research that is short-term in nature with no product plans." Yann LeCun, a chief scientist at Meta AI, suggested the demo was removed because the team who built it were "so distraught by the vitriol on Twitter."

Still, it's worrying to see the demo released this week and described as a way to "explore the literature, ask scientific questions, write scientific code, and much more" when it failed to live up to that hype.

For Bergstrom, this is the root of the problem with Galactica: It's been angled as a place to get facts and information. Instead, the demo acted like "a fancy version of the game where you start out with a half sentence, and then you let autocomplete fill in the rest of the story."

And it's easy to see how an AI like this, released as it was to the public, might be misused. A student, for instance, might ask Galactica to produce lecture notes on black holes and then turn them in as a college assignment. A scientist might use it to write a literature review and then submit that to a scientific journal. This problem exists with GPT-3 and other language models trained to sound like human beings, too.

Those uses, arguably, seem relatively benign. Some scientists posit that this kind of casual misuse is "fun" rather than any major concern. The problem is things could get much worse.

"Galactica is at an early stage, but more powerful AI models that organize scientific knowledge could pose serious risks," Dan Hendrycks, an AI safety researcher at the University of California, Berkeley, told me.

Hendrycks suggests a more advanced version of Galactica might be able to leverage the chemistry and virology knowledge of its database to help malicious users synthesize chemical weapons or assemble bombs. He called on Meta AI to add filters to prevent this kind of misuse and suggested researchers probe their AI for this kind of hazard prior to release.

Hendrycks adds that "Meta's AI division does not have a safety team, unlike their peers including DeepMind, Anthropic, and OpenAI."

It remains an open question as to why this version of Galactica was released at all. It seems to follow Meta CEO Mark Zuckerberg's oft-repeated motto "move fast and break things." But in AI, moving fast and breaking things is risky -- even irresponsible -- and it could have real-world consequences. Galactica provides a neat case study in how things might go awry.

newsdata · 4 years ago

Top 20 news datasets available on the web for free

Digital news sources have grown at an extraordinary rate, from a handful of online outlets to a large number of digital news sources and publications. News posts now cover a wide range of issues and events, which has increased their reach. These publications not only represent the world but also change and shape our perception of it.

Storing news data is now common because of the high demand for instant access to historical news, which people typically retrieve through a news API. These news datasets can be useful for research and for personal and professional artificial intelligence (AI) and machine learning (ML) projects.

If you are looking for historical news data to power your AI and ML algorithms, you can use these free news datasets or the Newsdata.io tool which I will mention below. News datasets can help you find a wide range of historical stories related to any topic, organization, person, and more.

In this article, we will discuss a simple and reliable way to access historical news data sets. Let’s get right into it.

Here are the top 20 news datasets that you can download for free for your personal and professional AI, machine learning, and data analytics projects.

1. Newsdata.io

Name- Covid-19 news dataset

Link- https://newsdata.io/files/datasets/covid19-news

This Covid-19 dataset contains the latest world news related to Coronavirus.

2. Kaggle.com

Name- BBC News Classification (News article categorization)

Link- https://www.kaggle.com/c/learn-ai-bbc

The dataset is broken into 1490 records for training and 735 for testing. The goal will be to build a system that can accurately classify previously unseen news articles into the right category.

3. BBC

Name- BBC datasets

Link- http://mlg.ucd.ie/datasets/bbc.html

Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research.

4. Harvard Dataverse

Name- A Million News Headlines

Link- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL

This contains news headlines published over a period of eighteen years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation).

5. Newsdata.io

Name- Covid-19 and vaccine news dataset

Link- https://newsdata.io/files/datasets/covid-vaccine-news

This contains the latest published news headlines from across the web. Each headline comes with its metadata and full description.

6. Webz.io

Name- Political news articles

Link- https://webz.io/free-datasets/political-news-articles/

This contains world politics-related news article data fetched with the help of the Webz.io news API.

7. Paperswithcode

Name- COVID-19 Fake News Dataset

Link- https://paperswithcode.com/dataset/covid-19-fake-news-dataset

Along with the COVID-19 pandemic, we are also fighting an "infodemic." Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm.

8. Kaggle

Name- India News Headlines Dataset

Link- https://www.kaggle.com/therohk/india-headlines-news-dataset

This news dataset is a persistent historical archive of notable events in the Indian subcontinent from start-2001 to end-2020, recorded in real-time by the journalists of India. It contains approximately 3.4 million events published by the Times of India.

9. Data.world

Name- Economic News Article Tone

Link- https://data.world/crowdflower/economic-news-article-tone

Contributors read snippets of news articles. They then noted if the article was relevant to the US economy and, if so, what the tone of the article was.

10. Archive.org

Name- World Politics news dataset

Link- https://archive.org/details/world-politics-news-dataset

This dataset contains the latest news related to politics around the world, along with the available metadata for each article.

11. IEEE.org

Name- Covid-19 and vaccine

Link- https://ieee-dataport.org/documents/covid-19-and-vaccine-news-dataset

This dataset contains world news related to Covid-19 and vaccines, along with the available metadata for each article.

12. IEEE.org

Name- World politics news

Link- https://ieee-dataport.org/documents/world-politics-news-dataset

This dataset contains world news related to politics, along with the available metadata for each article.

13. IEEE.org

Name- Covid-19 news

Link- https://ieee-dataport.org/documents/covid-19-news

This dataset contains all the latest news data related to Covid-19 from around the world.

14. IEEE.org

Name- COVIFN : FAKE NEWS ON COVID19

Link- https://ieee-dataport.org/documents/covifn-fake-news-covid19

COVIFN is a COVID-19-specific dataset that consists of fact-checked fake news scraped from Poynter and true news from news publishers' verified portals. The dataset was pre-processed to remove special characters and non-vital information.

15. IEEE.org

Name- FAKE NEWS ON HEALTHCARE

Link- https://ieee-dataport.org/documents/fake-news-healthcare

The Internet is a vast repository of useful knowledge, but it has been contaminated by the spread of false information. Relying on misinformation can be disastrous. According to a World Health Organization survey, about 6,000 individuals were hospitalized throughout the world as a result of fake news on COVID-19 in the first three months of 2020.

16. IEEE.org

Name- NEWS CREDIBILITY DATASET

Link- https://ieee-dataport.org/documents/news-credibility-dataset

Features of each news item according to seven credibility categories.

17. IEEE.org

Name- AI-Based automated extraction of entities, entity categories, and sentiment on Covid-19 situation.

Link- https://ieee-dataport.org/documents/ai-based-automated-extraction-entities-entity-categories-and-sentiments-covid-19-situation

Artificial Intelligence (AI) based in-depth analysis of social media content would allow a strategic decision-maker to obtain evidence-based responses to complex queries.

18. Kaggle

Name- Reddit Omicron Panic

Link- https://www.kaggle.com/yamqwe/reddit-omicron-panic

As we all know, a new variant of COVID-19 is spreading worldwide causing massive panic. This dataset captures mentions of the new variant on Reddit.

19. Kaggle

Name- Omicron daily cases by country (COVID-19 variant)

Link- https://www.kaggle.com/yamqwe/omicron-covid19-variant-daily-cases

Tracks the progression of the new Omicron COVID-19 variant.

20. IEEE.org

Name- Daily report of Covid-19 confirmed cases in Thailand.

Link- https://ieee-dataport.org/documents/daily-report-covid-19-confirmed-cases-thailand

This dataset contains a total of 578,375 COVID-19 confirmed cases reported in Thailand, recorded between 22 January 2021 and 30 July 2021.
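Entry 2 above (BBC News Classification) sets the goal of classifying unseen news articles into the right category. As a rough sketch of what such a system could look like, the example below trains a small naive Bayes classifier using only the Python standard library. The four training sentences, the two category labels, and the function names are made up for illustration; they are not taken from any of the listed datasets, and a real project would load the actual files and likely use an established ML library.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect per-category document and word counts from (text, category) pairs."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in examples:
        category_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    category_counts, word_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # prior for this category
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training data, invented for this sketch.
training = [
    ("the striker scored a late goal", "sport"),
    ("the team won the match", "sport"),
    ("shares fell as markets closed", "business"),
    ("the bank raised interest rates", "business"),
]
model = train_naive_bayes(training)
print(classify(model, "the team scored a goal"))  # prints: sport
```

With a real dataset, the training pairs would come from the provided training records and accuracy would be measured on the held-out test records.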
