Intro to Web Scraping
Chances are, if you have access to the internet, you have heard of Data Science. Aside from the buzz generated by the title ‘Data Scientist’, only a few in relevant fields can claim to understand what data science is. Most people, if they think about it at all, imagine a data scientist as a mad-scientist type who manipulates statistics and computers to magically generate crazy visuals and insights seemingly out of thin air.
Given the plethora of definitions of data science found in numerous books and across the internet, the layman’s image of a data scientist may not be that far off.
While the exact definition of ‘data science’ is still a work in progress, most in the know would agree that the data science universe encompasses fields such as:
Big Data
Analytics
Machine Learning
Data Mining
Visualization
Deep Learning
Business Intelligence
Predictive Modeling
Statistics
Data source: top keywords. Image source: Michael Barber.
On further exploration of the skill set that goes into making a data scientist, consensus begins to emerge around the following:
Statistical Analysis
Programming/Coding Skills: R programming, Python coding
Structured Data (SQL)
Unstructured Data (3-5 top NoSQL DBs)
Machine Learning/Data Mining Skills
Data Visualization
Big Data Processing Platforms: Hadoop, Spark, Flink, etc.
Structured vs unstructured data
Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable by simple, straightforward search engine algorithms or other search operations.
Examples of structured data include numbers, dates, and groups of words and numbers called strings.
Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, Web pages, or word-processor document. Source: Unstructured data - Wikipedia
Implied within the definition of unstructured data is the fact that it is very difficult to search. In addition, the vast majority of the data in the world is unstructured. A key skill when it comes to mining insights out of the seeming trash that is unstructured data is web scraping.
What is web scraping?
Everyone has done this: you go to a web site, see an interesting table and try to copy it over to Excel so you can add some numbers up or store it for later. Yet this often does not really work, or the information you want is spread across a large number of web sites. Copying by hand can quickly become very tedious.
You’ve tried everything else, and you haven’t managed to get your hands on the data you want. You’ve found the data on the web, but, alas — no download options are available and copy-paste has failed you. Fear not, there may still be a way to get the data out. Source: Data Journalism Handbook
As a data scientist, the more data you collect, the better your models; but what if the data you want resides on a website? This is the problem in social media analysis, where the data comes from users posting content online and can be extremely unstructured. While there are some websites which support data collection from their pages and have even exposed packages and APIs (such as Twitter), most web pages lack the capability and infrastructure for this. If you are a data scientist who wants to capture data from such pages, you would not want to be the one opening them manually and scraping them one by one. Source: Perceptive Analytics
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Source: Wikipedia
Web Scraping is a method to convert the data from websites, whether structured or unstructured, from HTML into a form on which analysis can be performed.
The advantage of scraping is that you can do it with virtually any web site — from weather forecasts to government spending, even if that site does not have an API for raw data access. While this method is very powerful and can be used in many places, it requires a bit of understanding about how the web works.
There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be impractical if there is a large amount of data to be extracted, or if it is spread over a large number of pages. Instead, specialized tools and techniques can be used to automate this process, by defining what sites to visit, what information to look for, and whether data extraction should stop once the end of a page has been reached or whether to follow hyperlinks and repeat the process recursively. Automating web scraping also lets you define whether the process should run at regular intervals to capture changes in the data.
https://librarycarpentry.github.io/lc-webscraping/
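To make the idea concrete before diving into the tooling, here is a minimal sketch using R's rvest package (introduced in the next section). The URL and the CSS selectors are placeholders of mine rather than anything referenced above:

library(rvest)

# Placeholder URL and selectors: substitute the page and the elements you care about.
page     <- read_html("https://example.com/spending")
headings <- html_text(html_nodes(page, "h2"))          # text of every <h2> element
links    <- html_attr(html_nodes(page, "a"), "href")   # hyperlinks a crawler could follow next

head(headings)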
Web Scraping with R
Atop any data scientist’s toolkit lie Python and R. While Python is a general-purpose language used in a wide variety of situations, R was built from the ground up for statistics and data. From data extraction to clean-up to visualization to publishing, R is in use. Unlike tools such as Tableau, Stata or MATLAB, which are skewed either towards data manipulation or visualization, R is a general-purpose statistical language with functionality cutting across all data management operations. R is also free and open source, which contributes to making it even more popular.
To break through the boundaries that keep data scientists from accessing data locked in web pages, R offers several packages for web scraping. Let us look into the web scraping technique using R.
Harvesting Data with rvest
Hadley Wickham authored the rvest package for web scraping in R, which will be demonstrated in this tutorial. Although web scraping with R is a fairly advanced topic, it is possible to dive in with a few lines of code and, within a few steps, appreciate its utility, versatility and power.
We shall use two examples inspired by Julia Silge’s series on cool things you can do with R in a tweet (a minimal sketch follows the list below):
Scraping the list of districts of Uganda
Getting the list of MPs of the Republic of Rwanda
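As a taste of how little code this takes, here is a minimal sketch for the first example. The Wikipedia URL and the assumption that the districts sit in the first HTML table on the page are mine, not from the sources above, and may need adjusting; the same pattern applies to the list of Rwandan MPs.

library(rvest)

# Assumed URL; that the districts sit in the first HTML table is a guess.
url  <- "https://en.wikipedia.org/wiki/Districts_of_Uganda"
page <- read_html(url)                              # download and parse the page

tables    <- html_table(html_nodes(page, "table"))  # every HTML table, as data frames
districts <- tables[[1]]                            # assume the first table lists the districts

head(districts)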
ANOVA – Analysis of Variance
As previously noted, my chosen data set is taken from GapMinder and is mostly quantitative. However, during the course of data management, the variables of interest were banded together into classes for ease of management and analysis. That said, since ANOVA requires both a categorical and a quantitative variable, an unaltered primary variable (Income Per Person) will be used together with a secondary derived variable (Alcohol Group).
My null hypothesis therefore is that the amount of alcohol consumed is the same regardless of the income level of the consumer.

As evident in the figure above, the p value is less than .0001, which means we can reject the null hypothesis and conclude that there is in fact a significant association between the amount of alcohol consumed and the income level of the consumer.
Because my categorical explanatory variable has multiple levels (9), the F test and p value alone do not tell us which groups differ from one another. To further understand the variability across the 9 levels, post hoc analysis using Duncan’s Multiple Range Test was carried out.
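Purely as an illustration of these same steps (the analysis itself was run in SAS), a one-way ANOVA followed by Duncan's test in R might look like the sketch below; the data-frame and column names (gapminder_sub, incomeperperson, alcgroup) are placeholders of mine, and duncan.test() comes from the agricolae package.

library(agricolae)   # provides duncan.test() for the post hoc comparison

# 'gapminder_sub', 'incomeperperson' and 'alcgroup' are placeholder names.
model <- aov(incomeperperson ~ alcgroup, data = gapminder_sub)
summary(model)                                   # overall F test and p value

duncan.test(model, "alcgroup", console = TRUE)   # Duncan's Multiple Range Test across the groups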


Surprisingly, the Duncan test indicated that there was no significant difference between the groupings. What does this mean? It seems to contradict the p-value.
Simon
Graphing
During the data management phase, the data from my variables of choice (alcohol consumption, income per person and urban rate) were grouped into approximately 10-12 ranges of equal spread.
Following that trend, I chose to chart using the newly created group variables, as opposed to the original primary variables present in the raw GapMinder data. A short sketch of how such a chart might be produced follows, and my graphs are posted after it.
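The plots themselves were made in SAS; as a rough R equivalent, a bar chart of a grouped variable could look something like this, where 'gapminder_sub$alcgroup' is a placeholder for the grouped alcohol-consumption variable:

# Placeholder names: counts per consumption band, then a bar chart of those counts.
counts <- table(gapminder_sub$alcgroup)

barplot(counts,
        las  = 2,                        # rotate band labels so they stay readable
        main = "Alcohol consumption per capita (grouped)",
        ylab = "Number of countries")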

Good News :-) According to the above graph, most of us are not alcoholics. Maybe it is worthwhile to further refine my research question to look at pure alcohol consumption above 3 litres per capita.

No conclusions yet here but interesting bimodal peaks.

Data Management…maybe
Well, my chosen data set is taken from GapMinder and is mostly quantitative. With the variables I have selected (alcohol consumption, urban rate and income per person), the major challenge was getting these into more manageable ranges whose characteristics could be quickly discerned at a glance.
Therefore, for data management purposes, the data were grouped into approximately 10-12 ranges of equal spread, as sketched below.
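The grouping itself was done in SAS; as a rough sketch of the same step, an equal-width binning in R might look like the following, where the data-frame and column names are placeholders of mine:

# Placeholder names: 'gapminder_sub' with a numeric 'alcconsumption' column.
gapminder_sub$alcgroup <- cut(gapminder_sub$alcconsumption,
                              breaks = 12)   # ~12 ranges of equal width

table(gapminder_sub$alcgroup)                # quick check of how the ranges fill up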

Simon
Mo Freq Tables – Assignment 2
Since posting earlier, I decided to take a look at blog posts from classmates to see how they are faring and what challenges they are facing, but mostly, selfishly, to see what I could learn. Also, the fact that I felt my frequency tables did not say much was bugging me.
BTW, one of the blogs I am following here is Chris’ Statistics Page. Highly recommended; I like the style and clarity of the writing there. However, that made me wonder: aside from screenshots, how do I display the results of my SAS program?
Following the blog surf, http://pds-frank22.tumblr.com/ led me to turn to Google to see what I could learn about PROC MEANS;
A quick search and surf led me to http://www.bluechipsolutions.com/SAS_Cheatsheets/Proc_Means_A_Simple_Example.htm
And from there to:

I am definitely learning a few things - Simon
Freq Tables – Assignment 2
Prior to attempting to build frequency tables for my variables of interest, I could visualize how the frequency tables would be useful in getting a better understanding of the data: its discrepancies (missing values) and especially the distribution.
I also hoped that the frequency tables would help in refining my research questions into something more specific and meaningful.
However, on actually building the frequency tables, I was quite disappointed as the data seemed to be all over the place and I could not discern the approximate distributions just by glancing at the occurrences of each variable. Also, it was not quite as neat as presented in the lecture slides/videos ;-)
Anyhow, after a few runs and tweaks here and there, I learned a few things:
· Frequency tables are more revealing for categorical variables or numeric variables that occur over a small spread.
· For numeric variables occurring over a greater spread, it is more effective to first classify the data into groups of ranges before building the frequency tables (a short sketch follows this list).
· In the case of missing values, the frequency tables were very useful.
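As a rough illustration of the second and third points (the course work itself was done in SAS, and the names below are placeholders of mine), binning a wide-ranging numeric variable before tabulating it, and keeping missing values visible, might look like this in R:

# Placeholder names; bin the wide-ranging variable first, then tabulate.
income_band <- cut(gapminder_sub$incomeperperson, breaks = 10)

table(income_band, useNA = "ifany")              # counts, with a row for missing values
round(prop.table(table(income_band)) * 100, 1)   # percentages of the non-missing values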
So, nursing my slight disappointment but also determined to learn more, I went looking through the discussion forum as well as the next lecture slide notes. It was almost a eureka moment for me!
I love the way the next lecture aims to answer the very thing I want answered this very moment. The way one lecture dovetails neatly into the next made me think of a beautifully crafted play!!
Until next time - Simon
Intro - This is my blog for the Passion Driven Statistics Course - Coursera Wesleyan University
Assignment 1 requires that we examine the available datasets (or provide our own) and select two topics to be researched. As a citizen of a ‘3rd World’ country, I was drawn to the Gapminder dataset, which details how far along various nations are in fulfilling the UN Millennium Development Goals.
Gapminder is a non-profit venture – a modern “museum” on the Internet – promoting sustainable global development and achievement of the United Nations Millennium Development Goals. --Gapminder
After reading the Gapminder data dictionary and given the situation in Uganda I decided to build research questions around per capita alcohol consumption.
Ref: http://topdocumentaryfilms.com/drunkest-place-earth/
http://www.time.com/time/world/article/0,8599,1989842,00.html
The following are my preliminary research questions:
MAIN TOPIC:
1.- Is alcohol consumption associated with per capita income?
SECOND TOPIC:
2.- Urbanization and Alcohol Consumption
More specifically, does urbanization increase alcohol consumption, or is the reverse true?
I am almost certain that these topics will be further refined from their present state (hence calling them preliminary here) as I further acquaint myself with the data and topic. My hope is not only to further my data analysis skills but also to learn as much as possible about the scourge that is excess alcohol consumption.
Simon