oditile
R Programming
16 posts
oditile · 8 years ago
Text
jmv – one R package (not just) for the social sciences
07 Dec 2017 • Jonathon Love
tl;dr
many analyses in the social sciences require many R packages
jmv makes these common analyses available from one R package
jmv can be used from jamovi, a graphical statistical spreadsheet, making it super-accessible
introducing jmv
There are many tremendously useful R packages for the social sciences (and similar fields), such as car, afex, vcd, etc. Although running basic analyses (such as t-tests or ANOVA) with these packages is very straightforward, it is typically necessary to perform a number of supplementary analyses to accompany them: post-hoc tests, effect-size calculations, bias corrections, and assumption checks. These additional tests often require the use of many additional R packages, and can make reasonably standard analyses quite time-consuming to perform. For example, in the book Discovering Statistics Using R by Andy Field (a popular textbook in the social sciences), the chapter on ANOVA alone recommends the use of 7 packages.
jmv simplifies this whole process by bringing all of these packages together, making the following analyses, with their most common supplementary tests, corrections and assumption checks, as easy as possible:
Descriptive statistics
T-Tests
ANOVA
ANCOVA
Repeated Measures ANOVA
Non-parametric ANOVAs
Correlation
Linear Regression
Contingency Tables
Proportion Tests
Factor Analysis
and coming soon:
Logistic Regression
Log-linear Regression
jmv aims to make all common statistical tests taught at an undergraduate level available from a single package.
An ANOVA
Let’s begin with a simple, familiar analysis – an ANOVA. In this example we use the ToothGrowth dataset from R, and explore whether different food supplements and their dosage affect how much a guinea pig’s teeth grow. We’ll specify len to be the dependent variable, and supp and dose to be the factors.
library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'))
##
##  ANOVA
##
##  ANOVA
##  ───────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p
##  ───────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001
##    dose                   2426     2         1213.2    92.00    < .001
##    supp:dose               108     2           54.2     4.11     0.022
##    Residuals               712    54           13.2
##  ───────────────────────────────────────────────────────────────────────
This produces what should be a familiar ANOVA table. You have likely seen something like this in R before, though perhaps not as nicely formatted.
Where jmv really comes into its own is with its additional options. In the following example we will perform the same analysis, but additionally request effect sizes, post-hoc tests, homogeneity-of-variances tests, descriptive statistics, and a descriptives plot:
library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'),
           effectSize = 'eta',
           postHoc = c('supp', 'dose'),
           plotHAxis = 'dose',
           plotSepLines = 'supp',
           descStats = TRUE,
           homo = TRUE)
##
##  ANOVA
##
##  ANOVA
##  ────────────────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p         η²
##  ────────────────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001    0.059
##    dose                   2426     2         1213.2    92.00    < .001    0.703
##    supp:dose               108     2           54.2     4.11     0.022    0.031
##    Residuals               712    54           13.2
##  ────────────────────────────────────────────────────────────────────────────────
##
##
##  ASSUMPTION CHECKS
##
##  Test for Homogeneity of Variances (Levene's)
##  ────────────────────────────────────────────
##    F       df1    df2    p
##  ────────────────────────────────────────────
##    1.94      5     54    0.103
##  ────────────────────────────────────────────
##
##
##  POST HOC TESTS
##
##  Post Hoc Comparisons - supp
##  ────────────────────────────────────────────────────────────────────────────
##    supp         supp    Mean Difference    SE       df      t       p-tukey
##  ────────────────────────────────────────────────────────────────────────────
##    OJ      -    VC                 3.70    0.938    54.0    3.95    < .001
##  ────────────────────────────────────────────────────────────────────────────
##
##
##  Post Hoc Comparisons - dose
##  ─────────────────────────────────────────────────────────────────────────────
##    dose         dose    Mean Difference    SE      df      t         p-tukey
##  ─────────────────────────────────────────────────────────────────────────────
##    0.5     -    1                 -9.13    1.15    54.0     -7.95    < .001
##            -    2                -15.50    1.15    54.0    -13.49    < .001
##    1       -    2                 -6.37    1.15    54.0     -5.54    < .001
##  ─────────────────────────────────────────────────────────────────────────────
##
##
##  Descriptives
##  ───────────────────────────────────────
##    supp    dose    N     Mean     SD
##  ───────────────────────────────────────
##      OJ     0.5    10    13.23    4.46
##      OJ       1    10    22.70    3.91
##      OJ       2    10    26.06    2.66
##      VC     0.5    10     7.98    2.75
##      VC       1    10    16.77    2.52
##      VC       2    10    26.14    4.80
##  ───────────────────────────────────────
As can be seen, jmv can provide many additional tests and statistics relevant to the main tests, but with far less effort.
You can explore additional options for the jmv ANOVA here, and the other tests and their available options here.
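The same pattern applies to the other analyses. As a rough illustration (not from the original post), here is a hedged sketch of an independent-samples t-test with jmv on the same data; the argument names effectSize and desc are assumptions and may differ slightly between jmv versions:

library('jmv')
data('ToothGrowth')

# Independent-samples t-test of tooth length by supplement type.
# 'effectSize' and 'desc' are assumed option names; check ?ttestIS.
jmv::ttestIS(ToothGrowth,
             vars = 'len',
             group = 'supp',
             effectSize = TRUE,
             desc = TRUE)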
jamovi integration
jmv is also usable from the jamovi statistical spreadsheet. jamovi makes a range of analyses accessible to a broader audience by making them available from a familiar spreadsheet user-interface. jamovi can also make the underlying R code for each analysis available, making it easy for people to learn R and transition to R scripting if they like.
Here is exactly the same analysis as above, having been performed in jamovi.
[screenshot: the same ANOVA analysis performed in jamovi]
summing up
jmv makes a whole suite of common analyses from the social sciences very easy to perform
jamovi makes these even easier to perform
jmv is available from CRAN
jamovi is available from www.jamovi.org
oditile · 8 years ago
Text
A Tidytext Analysis of the Weinstein Effect
From: https://www.gokhanciflikli.com/post/weinstein-effect/
Dec 3, 2017 · 3099 words · 15 minutes read · Tags: PLOT, R, TIDYTEXT
Quantifying He-Said, She-Said: Newspaper Reporting
I have been meaning to get into quantitative text analysis for a while. I initially planned this post to feature a different package (that I wanted to showcase), however I ran into some problems with their .json parsing methods and am currently waiting for the issue to be solved on their end. The great thing about doing data science with R is that there are multiple avenues leading you to the same destination, so let’s take advantage of that.
My initial inspiration came from David Robinson’s post on gendered verbs. I remember sending it around and thinking it was quite cool. Turns out he was building on Julia Silge’s earlier post on gender pronouns. I see that post and I go, ‘what a gorgeous looking ggplot theme!’. So. Neat. Praise be the open source gods, the code is on GitHub. Let’s take advantage of that too.
I still needed a topic, and even though both the Wikipedia plots and the Jane Austen datasets sound interesting to look at, I felt there was another, obvious choice. It has a Wikipedia page and its own subreddit. Also, the title might have given it away. Let’s get to work.
Getting Full Text News Articles
My first instinct was to check out the NYT APIs—it made sense, given that they broke the news (along with the New Yorker). Everything seemed to be working out just fine, until I realised you cannot get the full text—only the lead. Staying true to my strict data scientist training, I googled ‘full text newspaper api r’ and there it was: GuardianR. Sorry NYC mates, we reckon we will have to cross the pond for this one.
Note that any one source will always be biased. If you are not familiar with the Guardian, it’s British and has a left-centre bias. It might be prudent to pair it with a right-centre newspaper; however, not all newspapers provide APIs (which in itself is another selection bias). Alas, we will move on with just the Guardian—insert idiom regarding salt. Finally, you will need to get a free API key from their open platform. You still have to register, but you are only in danger if you vote Tory and stand on the left side of the escalator. Once you have it, install/load the package via CRAN:
library(GuardianR)
ls(pos = "package:GuardianR")
## [1] "get_guardian"     "get_json"         "parse_json_to_df"
As you can see, the GuardianR package is a simple one: it contains only three (but powerful) functions. We only need the first one to get a hold of the full text articles, and the syntax is super simple:
#not evaluated
articles <- get_guardian(keywords = "sexual+harassment",
                         section = "world",
                         from.date = "2012-11-30",
                         to.date = "2017-11-30",
                         api.key = "your-key-here")
Running the above chunk with your own key will get you all articles published in the Guardian in the last five years, tagged under the news section ‘world’, and containing the keywords ‘sexual harassment’. The keywords can be as simple or complicated as you want; just add more terms using the plus sign.
You might think the time frame is a bit skewed towards the ‘pre’ era—the news broke on October 5, 2017. Going all the way back five full years, we are comparing 58 months’ worth of ‘pre’ to only 2 months of ‘post’ Weinstein. However, luckily for you, blog posts are not written in real-time, meaning you get to see a (somewhat working) final result, so just bear with me. And no, this is not at all like scientists running 514 regressions and failing to mention this tidbit in their publication. Relevant xkcd.
No, the reason is pure pragmatism. It’s not that running the code ‘live’ and getting the articles in ‘real-time’ would slow down this page; that’s just not how it works. However, it is good practice to extract a somewhat bigger chunk than you think you will need, as you can always slice it up later to suit your needs better.
In any case, I am working with downloaded data, so I will just load it up. Feel free to subset the data to see whether the results change if you use a different cut-off point (a short sketch of the symmetric two-month subset appears after the split below). Also, if you go back the same amount of time (i.e. two months before October 5), that would lead to 183 articles for the pre and 121 articles for the post time period—it is a reckoning, alright. Going back five years gets us 1224 articles in total; so we actually have 1103 pre and 121 post articles (89% to 11%). That’s more or less a cross-validation ratio (well, a bit on the low side maybe), and we will roll with that for now.
articles <- read.csv("articles.csv", stringsAsFactors = FALSE)
dim(articles)
## [1] 1224   27
sum(articles$wordcount)
## [1] 1352717
colnames(articles)
##  [1] "id"                   "sectionId"            "sectionName"         ##  [4] "webPublicationDate"   "webTitle"             "webUrl"               ##  [7] "apiUrl"               "newspaperPageNumber"  "trailText"           ## [10] "headline"             "showInRelatedContent" "lastModified"         ## [13] "hasStoryPackage"      "score"                "standfirst"           ## [16] "shortUrl"             "wordcount"            "commentable"         ## [19] "allowUgc"             "isPremoderated"       "byline"               ## [22] "publication"          "newspaperEditionDate" "shouldHideAdverts"   ## [25] "liveBloggingNow"      "commentCloseDate"     "body"
We get a bunch of variables (27) with that call, but we won’t be needing most of them for our analysis:
#laziest subset for only two variables
want.var <- c("webPublicationDate", "body")
want <- which(colnames(articles) %in% want.var)
articles <- articles[, want]

articles$webPublicationDate <- as.Date.factor(articles$webPublicationDate)
The body contains the full-text, however it’s in HTML:
dplyr::glimpse(articles$body[1])
##  chr "<p>Numerous women have accused Don Burke of indecent assault, sexual harassment and bullying during the 1980s a"| __truncated__
At this point, I must admit I resorted to hacking a bit. I’m sure there is a more elegant solution here. I’ll go with this SO answer to extract text from HTML. Basically, the cleaning function removes the HTML using regex. Unfortunately, this does not clear up various apostrophes found in the text. For that, we switch the encoding from ASCII to byte:
articles$body <- iconv(articles$body, "", "ASCII", "byte")
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

articles$body <- cleanFun(articles$body)
dplyr::glimpse(articles$body[1])
##  chr "Numerous women have accused Don Burke of indecent assault, sexual harassment and bullying during the 1980s and "| __truncated__
This will end up cutting some legitimate apostrophes (e.g. “hasn’t”, “didn’t” to “hasn”, “didn”) in some cases, but we will fix that later on.
Let’s split the data on date October 5, 2017 and get rid of the date column afterwards:
#You can also use negative index for subsetting
articles.before <- articles[articles$webPublicationDate < "2017-10-05", ]
articles.after  <- articles[articles$webPublicationDate >= "2017-10-05", ]

full.text.before <- articles.before[, 2]
full.text.before <- as.data.frame(full.text.before)

full.text.after <- articles.after[, 2]
full.text.after <- as.data.frame(full.text.after)
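If you want the symmetric two-month comparison mentioned earlier, here is a minimal sketch along the same lines; the window dates are my own illustration, not the author's:

# not evaluated - symmetric two-month window around October 5, 2017
window.start <- as.Date("2017-08-05")
window.end   <- as.Date("2017-12-05")

articles.window <- articles[articles$webPublicationDate >= window.start &
                            articles$webPublicationDate <= window.end, ]

pre.symmetric  <- articles.window[articles.window$webPublicationDate <  "2017-10-05", ]
post.symmetric <- articles.window[articles.window$webPublicationDate >= "2017-10-05", ]

c(pre = nrow(pre.symmetric), post = nrow(post.symmetric))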
N-Grams and Combinatorics
To me, n-grams are what the prisoner’s dilemma is to a college freshman—that ‘wow, so simple but so cool’ moment. As in, simple after the fact, when someone has already figured it out and explained it to you. N-grams are essentially combinations of n consecutive words; a bigram, for example, is a 2-gram. Using the tidytext package developed by David and Julia, we can create them in a flash with unnest_tokens. After that, we will separate the bigrams into two distinct words. Next, we will subset the bigrams so that the first word is either he or she. Finally, we will transform the words into frequency counts. I’m heavily recycling their code—no need to reinvent the wheel:
library(tidytext)
library(tidyverse) #or just dplyr and tidyr if you are allergic

#Create bigrams
bigrams.before <- full.text.before %>%
  unnest_tokens(bigram,
                full.text.before,
                token = "ngrams",
                n = 2)

nrow(bigrams.before)
## [1] 1311051
head(bigrams.before)
##        bigram
## 1    the walk
## 1.1 walk from
## 1.2  from the
## 1.3  the gare
## 1.4   gare du
## 1.5   du nord
#Separate bigrams into two words
bigrams.separated.before <- bigrams.before %>%
  separate(bigram, c("word1", "word2"), sep = " ")

head(bigrams.separated.before)
##     word1 word2
## 1     the  walk
## 1.1  walk  from
## 1.2  from   the
## 1.3   the  gare
## 1.4  gare    du
## 1.5    du  nord
#Subset he and she in word1
he.she.words.before <- bigrams.separated.before %>%
  filter(word1 %in% c("he", "she"))

#Fix the missing t's after apostrophe
fix.apos <- c("hasn", "hadn", "doesn", "didn", "isn", "wasn", "couldn", "wouldn")
he.she.words.before <- he.she.words.before %>%
  mutate(word2 = ifelse(word2 %in% fix.apos, paste0(word2, "t"), word2))

#10 random samples; the numbers are row numbers not counts
set.seed(1895)
dplyr::sample_n(he.she.words.before, 10)
##       word1  word2
## 4403    she doesnt
## 3732     he    was
## 5222    she  wasnt
## 11862   she   said
## 3972    she  wrote
## 3189     he   says
## 3952    she   sees
## 4878     he    was
## 9314     he   went
## 9408    she  noted
#Transform words into counts, add +1 for log transformation
he.she.counts.before <- he.she.words.before %>%
  count(word1, word2) %>%
  spread(word1, n, fill = 0) %>%
  mutate(total = he + she,
         he = (he + 1) / sum(he + 1),
         she = (she + 1) / sum(she + 1),
         log.ratio = log2(she / he),
         abs.ratio = abs(log.ratio)) %>%
  arrange(desc(log.ratio))

#Top 5 words after she
head(he.she.counts.before)
## # A tibble: 6 x 6
##       word2           he          she total log.ratio abs.ratio
##       <chr>        <dbl>        <dbl> <dbl>     <dbl>     <dbl>
## 1 testified 0.0002194908 0.0027206771    18  3.631734  3.631734
## 2     awoke 0.0001097454 0.0010580411     6  3.269163  3.269163
## 3     filed 0.0002194908 0.0021160822    14  3.269163  3.269163
## 4      woke 0.0002194908 0.0019649335    13  3.162248  3.162248
## 5    misses 0.0001097454 0.0007557437     4  2.783737  2.783737
## 6   quickly 0.0001097454 0.0007557437     4  2.783737  2.783737
A couple of observations. First, n-grams overlap, resulting in about 1.3M observations (and this is only the pre-period). However, we will only use the gendered subset, which is much smaller. Second, as we define the log ratio as (she / he), the sign of the log ratio determines the direction (positive for she, negative for he), while the absolute value of the log ratio is the effect size (without direction).
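To make the log ratio concrete, you can recover the value printed for ‘testified’ in the table above directly from its smoothed frequencies:

# smoothed she/he frequencies for "testified" taken from the table above
log2(0.0027206771 / 0.0002194908)
## [1] 3.631734

Since 2^3.63 is roughly 12, ‘testified’ appears about twelve times more often after she than after he in the smoothed counts.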
Good stuff, no? Wait until you see the visualisations.
Let There Be GGraphs
Both David and Julia utilise neat data visualisations to drive home their point. I especially like the roboto theme/font, so I will just go ahead and use it. You need to install the fonts separately, so if you are missing them you will get an error message.
devtools::install_github("juliasilge/silgelib")
#Required Fonts
#https://fonts.google.com/specimen/Roboto+Condensed
#https://fonts.google.com/specimen/Roboto

library(ggplot2)
library(ggrepel)
library(scales)
library(silgelib)
theme_set(theme_roboto())
We are also loading several other libraries. In addition to the usual suspects, ggrepel will make sure we can plot overlapping labels in a nicer way. Let’s start by looking at the most gendered verbs in articles on sexual harassment. In other words, we are identifying which verbs are most skewed towards one gender. I maintain the original logarithmic scale, so the effect sizes are in magnitudes and easy to interpret. If you read the blog posts, you will notice that Julia reports a unidirectional magnitude (relative to she/he), so her scales go from
.25x .5x x 2x 4x
whereas David uses directions, i.e.
'more he' 4x 2x x 2x 4x 'more she'
In both cases, x denotes the same frequency (equally likely) of usage. I don’t think one approach is necessarily better than the other, but I went with David’s approach. Finally, I filter out non-verbs plus ‘have’ and only plot verbs that occur at least five times. If you are serious about filtering out (or the opposite, filtering on) classes of words—say a certain sentiment or a set of adjectives—you should locate a dictionary from an NLP package and extract the relevant words from there. Here, I am doing it quite ad-hoc (and manually):
he.she.counts.before %>%
  filter(!word2 %in% c("himself", "herself", "ever", "quickly",
                       "actually", "sexually", "allegedly", "have"),
         total >= 5) %>%
  group_by(direction = ifelse(log.ratio > 0, 'More "she"', "More 'he'")) %>%
  top_n(15, abs.ratio) %>%
  ungroup() %>%
  mutate(word2 = reorder(word2, log.ratio)) %>%
  ggplot(aes(word2, log.ratio, fill = direction)) +
  geom_col() +
  coord_flip() +
  labs(x = "",
       y = 'Relative appearance after "she" compared to "he"',
       fill = "",
       title = "Pre Weinstein: 2012-17 The Guardian Articles on Sexual Harassment",
       subtitle = "Top 15 Most Gendered (Skewed) Verbs after he/she; at least 5 occurrences.") +
  scale_y_continuous(labels = c("8X", "6X", "4X", "2X", "Same", "2X", "4X", "6X", "8X"),
                     breaks = seq(-4, 4)) +
  guides(fill = guide_legend(reverse = TRUE)) +
  expand_limits(y = c(-4, 4))
[plot: top 15 most gendered (skewed) verbs after he/she, pre-Weinstein]
Several immediate and depressing trends emerge from the data. The top active verbs for women cluster on bringing charges: ‘testified’, ‘filed’; whereas male verbs seem to react to those with ‘argued’, ‘faces’, ‘acknowledges’, and ‘apologized’. Women ‘awoke’ and ‘woke’, matching the more violent male verbs such as ‘drugged’, ‘assaulted’, ‘punched’, and ‘raped’. ‘Alleged’ occurs four times more after she relative to he, and there is no mention of denial (e.g. ‘denied’, ‘denies’) after he. A note on word variations: in some cases, it might be better to combine conjugations into a single category using a wildcard (such as expect* in the graph above). However, I thought the tense can also contribute to a quantitative story, so I left them as they are.
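As mentioned above, a dictionary from an NLP resource could replace the manual exclusion list. A minimal sketch, assuming your version of tidytext ships the parts_of_speech lexicon (the column names word and pos are assumptions if your version differs):

# keep only words tagged as verbs in the lexicon, instead of excluding words by hand
# (assumes tidytext::parts_of_speech is available in your tidytext version)
verbs <- tidytext::parts_of_speech %>%
  filter(grepl("Verb", pos)) %>%
  distinct(word)

he.she.counts.before %>%
  semi_join(verbs, by = c("word2" = "word")) %>%
  filter(total >= 5) %>%
  arrange(desc(abs.ratio)) %>%
  head()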
Another way of visualising the gendered differences is to plot their magnitude in addition to their frequency. This time, we are not limited to just verbs; however we still filter out some uninteresting words. There are additional ggplot and ggrepel arguments in this plot. First, I added two reference lines: a red y-intercept with geom_hline to act as a baseline and an invisible x-intercept using geom_vline to give the labels more space on the left-hand side. Do you not love tidy grammar? Last but not least, I insert geom_text_repel to give us more readability: segment.alpha controls the line transparency, while the force argument governs the aggressiveness of the jittering algorithm. We could supply it with a fill argument that corresponds to a factor variable to highlight a certain characteristic (say, total occurrence), however there is not much meaningful variation there in our case.
he.she.counts.before %>%
  filter(!word2 %in% c("himself", "herself", "she", "too", "later", "apos", "just", "says"),
         total >= 10) %>%
  top_n(100, abs.ratio) %>%
  ggplot(aes(total, log.ratio)) +
  geom_point() +
  geom_vline(xintercept = 5, color = "NA") +
  geom_hline(yintercept = 0, color = "red") +
  scale_x_log10(breaks = c(10, 100, 1000)) +
  geom_text_repel(aes(label = word2), segment.alpha = .1, force = 2) +
  scale_y_continuous(breaks = seq(-4, 4),
                     labels = c('8X "he"', '6X "he"', '4X "he"', '2X "he"', "Same",
                                '2X "she"', '4X "she"', '6X "she"', '8X "she"')) +
  labs(x = 'Total uses after "he" or "she" (Logarithmic scale)',
       y = 'Relative uses after "she" to after "he"',
       title = "Gendered Reporting: Pre Weinstein, The Guardian",
       subtitle = "Words occurring at least 10 times after he/she:
                  160 unique words (100 displayed) | 11,013 occurrences in total") +
  expand_limits(y = c(4, -4))
Plotting frequencies complements the first plot quite nicely. We can infer reported characteristics more easily when there is a tangible baseline. Words around the red line occur after she or he more or less equally: the y-axis shows the relational effect size (with regards to gender), and the x-axis displays the total number of occurrences. Some additional insights: we see ‘sexually’ and ‘allegedly’ popping up after he quite frequently. There is also the verb ‘admitted’, as well as ‘denies’ (even though it is visually located above the red line, if you follow the grey segment, it sits around 1X ‘he’). For women, more morbid words like ‘suffered’ and ‘died’ are added to the mix. There are also nuances regarding tense: ‘claims’ follows she twice as often as he, while ‘claimed’ is twice as likely to come after he.
Moving on to the post-Weinstein period (‘the effect’), I quietly run the same code and plot the equivalent graphics below. Some caveats: with the smaller sample size, I lowered the inclusion threshold from 5 to 2. Additionally, although the plot is meant to show the top 15 most skewed verbs per gender, frequent ties mean it ends up including more than that.
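For completeness, the post-period objects can be built by mirroring the earlier chunks with full.text.after; a condensed sketch (the author's actual code may differ slightly):

# same pipeline as before, starting from the post-Weinstein text
bigrams.after <- full.text.after %>%
  unnest_tokens(bigram, full.text.after, token = "ngrams", n = 2)

he.she.counts.after <- bigrams.after %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she")) %>%
  mutate(word2 = ifelse(word2 %in% fix.apos, paste0(word2, "t"), word2)) %>%
  count(word1, word2) %>%
  spread(word1, n, fill = 0) %>%
  mutate(total = he + she,
         he = (he + 1) / sum(he + 1),
         she = (she + 1) / sum(she + 1),
         log.ratio = log2(she / he),
         abs.ratio = abs(log.ratio)) %>%
  arrange(desc(log.ratio))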
After the scandal broke, we see that women are reported to have ‘complained’, ‘hoped’, and ‘became’. On the other hand, men are vehemently denying the accusations, with ‘denies’ and ‘denied’ being the most skewed verbs following he. Random point: in the pre-period, it’s ‘apologized’, in the post-period, it’s ‘apologised’. Maybe Brexit can manifest in mysterious ways.
Again we turn to the frequency plot to infer more. In addition to denial, men are also reported to use words such as ‘categorically’ and ‘utterly’. Both ‘claims’ and ‘claimed’ occur more after she, not repeating the earlier dynamic regarding the tense. In addition, we don’t see ‘alleged’ or ‘allegedly’ featured in the plot at all. How much of this change can we attribute to the effect? At a glance, we definitely see a difference. For example, verbs display a good variation for both genders. The post-frequency plot features less violent words than the pre-frequency plot. There is a lot more ‘denying’ and not much ‘alleging’ in the post-Weinstein period.
Some are definitely data artefacts. The post-frequency plot is ‘cleaner’—in addition to (and directly caused by) the smaller n—because the cut-off point is set to ‘more than once’: if we remove the filter, all the violence and harassment terms are back in. Some are probably reporter/reporting biases plus the prevalent gendered thinking (that occurs both on a conscious level and below). And perhaps some are genuine effects—true signal. It is still too early to pass judgement on whether the Weinstein effect will result in tangible, positive change. Currently, all we can get is a short, limited glimpse at the available data.
Hopefully you managed to enjoy this rather depressing data undertaking using the tidytext package. As usual, the underlying code is available on GitHub. N-grams are powerful. Imagine the possibilities: assume you have access to a rich dataset (say, minimum 100K very long articles/essays). You can construct n-grams sequentially, i.e. 2-grams, 3-grams, 4-grams etc., separate the words, and subset for gendered pronouns. This would give you access to structures like “he” + “word” + “her” (direct action) and “she” + “word” + “word” + “him” (allowing for adjective + verb). Then it would be possible to look for all kinds of interactions, revealing their underlying structure. I will be reporting more on this front, until I move onto image processing (looking at you, keras).
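As a starting point for those structures, here is a hedged trigram sketch reusing the pre-period text; this is my own illustration, not code from the post:

# "he/she" + word + "him/her" style structures via trigrams
trigrams.before <- full.text.before %>%
  unnest_tokens(trigram, full.text.before, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(word1 %in% c("he", "she"),
         word3 %in% c("him", "her")) %>%
  count(word1, word2, word3, sort = TRUE)

head(trigrams.before)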
oditile · 8 years ago
Text
Sentiment Analysis of “A Christmas Carol”
Posted: 29 Nov 2017 03:16 PM PST
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
Our family has been reading, listening to and watching “A Christmas Carol” for just about 30 years now. I got it into my crazy noggin to perform a sentiment analysis on it the other day and tweeted out the results, but a large chunk of the R community is not on Twitter and it would be good to get a holiday-themed post or two up for the season.
One reason I embarked on this endeavour is that @juliasilge & @drob made it so gosh darn easy to do so with their tidytext package and the accompanying Text Mining with R book:
(btw: That makes an excellent holiday gift for the data scientist[s] in your life.)
Let us begin!
STAVE I: hrbrmstr’s Code
We need the text of this book to work with and thankfully it’s long been in the public domain. As @drob noted, we can use the gutenbergr package to retrieve it. We’ll use an RStudio project structure for this and cache the results locally to avoid burning bandwidth:
library(rprojroot)
library(gutenbergr)
library(hrbrthemes)
library(stringi)
library(tidytext)
library(tidyverse)

rt <- find_rstudio_root_file()

carol_rds <- file.path(rt, "data", "carol.rds")

if (!file.exists(carol_rds)) {
  carol_df <- gutenberg_download("46")
  write_rds(carol_df, carol_rds)
} else {
  carol_df <- read_rds(carol_rds)
}
How did I know to use 46? We can use gutenberg_works() to get to that info:
gutenberg_works(author == "Dickens, Charles")
## # A tibble: 74 x 8
##    gutenberg_id                                                                                     title
##           <int>                                                                                     <chr>
##  1           46                             A Christmas Carol in Prose; Being a Ghost Story of Christmas
##  2           98                                                                      A Tale of Two Cities
##  3          564                                                                The Mystery of Edwin Drood
##  4          580                                                                       The Pickwick Papers
##  5          588                                                                   Master Humphrey's Clock
##  6          644                                                   The Haunted Man and the Ghost's Bargain
##  7          650                                                                       Pictures from Italy
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Year out and a New Year In"
##  9          675                                                                            American Notes
## 10          678                                           The Cricket on the Hearth: A Fairy Tale of Home
## # ... with 64 more rows, and 6 more variables: author <chr>, gutenberg_author_id <int>, language <chr>,
## #   gutenberg_bookshelf <chr>, rights <chr>, has_text <lgl>
STAVE II: The first of three wrangles
We’re eventually going to make a ggplot2 faceted chart of the sentiments by paragraphs in each stave (chapter). I wanted nicer titles for the facets so we’ll clean up the stave titles first:
#' Convenience only
carol_txt <- carol_df$text

# Just want the chapters (staves)
carol_txt <- carol_txt[-(1:(which(grepl("STAVE I:", carol_txt)))-1)]

#' We'll need this later to make prettier facet titles
data_frame(
  stave = 1:5,
  title = sprintf("Stave %s: %s", stave, carol_txt[stri_detect_fixed(carol_txt, "STAVE")] %>%
    stri_replace_first_regex("STAVE [[:alpha:]]{1,3}: ", "") %>%
    stri_trans_totitle())
) -> stave_titles
stri_trans_totitle() is a super-handy function and all we’re doing here is extracting the stave titles and doing a small transformation. There are scads of ways to do this, so don’t get stuck on this example. Try out other ways of doing this munging.
You’ll also see that I made sure we started at the first stave break vs include the title bits in the analysis.
Now, we need to prep the text for text analysis.
STAVE III: The second of three wrangles
There are other text mining packages and processes in R. I’m using tidytext because it takes care of so many details for you and does so elegantly. I was also at the rOpenSci Unconf where the idea was spawned & worked on and I’m glad it blossomed into such a great package and a book!
Since we (I) want to do the analysis by stave & paragraph, let’s break the text into those chunks. Note that I’m doing an extra break by sentence in the event folks out there want to replicate this work but do so on a more granular level.
#' Break the text up into chapters, paragraphs, sentences, and words,
#' preserving the hierarchy so we can use it later.
data_frame(txt = carol_txt) %>%
  unnest_tokens(chapter, txt, token = "regex",
                pattern = "STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+") %>%
  mutate(stave = 1:n()) %>%
  unnest_tokens(paragraph, chapter, token = "paragraphs") %>%
  group_by(stave) %>%
  mutate(para = 1:n()) %>%
  ungroup() %>%
  unnest_tokens(sentence, paragraph, token = "sentences") %>%
  group_by(stave, para) %>%
  mutate(sent = 1:n()) %>%
  ungroup() %>%
  unnest_tokens(word, sentence) -> carol_tokens

carol_tokens
## # A tibble: 28,710 x 4
##    stave  para  sent   word
##    <int> <int> <int>  <chr>
##  1     1     1     1 marley
##  2     1     1     1    was
##  3     1     1     1   dead
##  4     1     1     1     to
##  5     1     1     1  begin
##  6     1     1     1   with
##  7     1     1     1  there
##  8     1     1     1     is
##  9     1     1     1     no
## 10     1     1     1  doubt
## # ... with 28,700 more rows
By indexing each hierarchy level, we have the flexibility to do all sorts of structured analyses just by choosing grouping combinations.
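For instance, a quick illustrative sketch (not from the original post) of two such groupings:

# word counts per stave, or per stave and paragraph,
# just by changing the grouping columns
carol_tokens %>% count(stave)
carol_tokens %>% count(stave, para)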
STAVE IV: The third of three wrangles
Now, we need to layer in some sentiments and do some basic sentiment calculations. Many of these sentiment-al posts (including this one) take a naive approach with basic match and only looking at 1-grams. One reason I didn’t go further was to make the code accessible to new R folk (since I primarily blog for new R folk :-). I’m prepping some 2018 posts with more involved text analysis themes and will likely add some complexity then with other texts.
#' Retrieve sentiments and compute them.
#'
#' I left the `index` in vs just use `paragraph` since it'll make this easier to reuse
#' this block (which I'm not doing but thought I might).
inner_join(carol_tokens, get_sentiments("nrc"), "word") %>%
  count(stave, index = para, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(stave_titles, "stave") -> carol_with_sent
STAVE V: The end of it
Now, we just need to do some really basic ggplot-ing to get to our desired result:
ggplot(carol_with_sent) +
  geom_segment(aes(index, sentiment, xend = index, yend = 0, color = title), size = 0.33) +
  scale_x_comma(limits = range(carol_with_sent$index)) +
  scale_y_comma() +
  scale_color_ipsum() +
  facet_wrap(~title, scales = "free_x", ncol = 5) +
  labs(x = NULL, y = "Sentiment",
       title = "Sentiment Analysis of A Christmas Carol",
       subtitle = "By stave & ¶",
       caption = "Humbug!") +
  theme_ipsum_rc(grid = "Y", axis_text_size = 8, strip_text_face = "italic", strip_text_size = 10.5) +
  theme(legend.position = "none")
You’ll want to tap/click on that to make it bigger.
Despite using a naive analysis, I think it tracks pretty well with the flow of the book.
Stave one is quite bleak. Marley is morose and frightening. There is no joy apart from Fred’s brief appearance.
The truly terrible (-10 sentiment) paragraph also makes sense:
Marley’s face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look: with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, though the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible; but its horror seemed to be in spite of the face and beyond its control, rather than a part of its own expression.
(I got to that via this snippet which you can use as a template for finding the other significant sentiment points:)
filter(
  carol_tokens, stave == 1,
  para == filter(carol_with_sent, stave == 1) %>%
    filter(sentiment == min(sentiment)) %>%
    pull(index)
)
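A hedged variation on that template pulls the single most positive paragraph of the whole book (which, as discussed below, falls in stave three); ties, if any, would return more than one paragraph:

# book-wide maximum-sentiment paragraph, reusing carol_with_sent from above
top_para <- filter(carol_with_sent, sentiment == max(sentiment))

filter(
  carol_tokens,
  stave == top_para$stave,
  para  == top_para$index
)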
Stave two (Christmas past) is all about Scrooge’s youth and includes details about Fezziwig’s party so the mostly-positive tone also makes sense.
Stave three (Christmas present) has the highest:
The Grocers’! oh, the Grocers’! nearly closed, with perhaps two shutters down, or one; but through those gaps such glimpses! It was not alone that the scales descending on the counter made a merry sound, or that the twine and roller parted company so briskly, or that the canisters were rattled up and down like juggling tricks, or even that the blended scents of tea and coffee were so grateful to the nose, or even that the raisins were so plentiful and rare, the almonds so extremely white, the sticks of cinnamon so long and straight, the other spices so delicious, the candied fruits so caked and spotted with molten sugar as to make the coldest lookers-on feel faint and subsequently bilious. Nor was it that the figs were moist and pulpy, or that the French plums blushed in modest tartness from their highly-decorated boxes, or that everything was good to eat and in its Christmas dress; but the customers were all so hurried and so eager in the hopeful promise of the day, that they tumbled up against each other at the door, crashing their wicker baskets wildly, and left their purchases upon the counter, and came running back to fetch them, and committed hundreds of the like mistakes, in the best humour possible; while the Grocer and his people were so frank and fresh that the polished hearts with which they fastened their aprons behind might have been their own, worn outside for general inspection, and for Christmas daws to peck at if they chose.
and lowest (sentiment) points of the entire book:
And now, without a word of warning from the Ghost, they stood upon a bleak and desert moor, where monstrous masses of rude stone were cast about, as though it were the burial-place of giants; and water spread itself wheresoever it listed, or would have done so, but for the frost that held it prisoner; and nothing grew but moss and furze, and coarse rank grass. Down in the west the setting sun had left a streak of fiery red, which glared upon the desolation for an instant, like a sullen eye, and frowning lower, lower, lower yet, was lost in the thick gloom of darkest night.
Stave four (Christmas yet to come) is fairly middling. I had expected to see lower marks here. The standout negative sentiment paragraph (and the one that follows) are pretty dark, though:
They left the busy scene, and went into an obscure part of the town, where Scrooge had never penetrated before, although he recognised its situation, and its bad repute. The ways were foul and narrow; the shops and houses wretched; the people half-naked, drunken, slipshod, ugly. Alleys and archways, like so many cesspools, disgorged their offences of smell, and dirt, and life, upon the straggling streets; and the whole quarter reeked with crime, with filth, and misery.
Finally, Stave five is both short and positive (whew!). Which I heartily agree with!
FIN
The code is up on GitHub and I hope that it will inspire more folks to experiment with this fun (& useful!) aspect of data science.
Make sure to send links to anything you create and shoot over PRs for anything you think I did that was awry.
For those who celebrate Christmas, I hope you keep Christmas as well as or even better than old Scrooge. “May that be truly said of us, and all of us! And so, as Tiny Tim observed, God bless Us, Every One!”
oditile · 8 years ago
Link
Link datasets ibm
oditile · 8 years ago
Text
Data acquisition in R 1-4 figshare
Data acquisition in R (1/4)
R is an incredible tool for reproducible research. In the present series of blog posts I want to show how one can easily acquire data within an R session, documenting every step in a fully reproducible way. There are numerous data acquisition options for R users. Of course, I do not attempt to show all the data possibilities and tend to focus mostly on demographic data. If your prime interest lies outside human population statistics, it’s worth checking the amazing Open Data Task View.
The series consists of four posts:
Loading prepared datasets
Accessing popular statistical databases
Demographic data sources
Getting spatial data
For each of the data acquisition options I provide a small visualization use case.
Built-in datasets
For illustration purposes, many R packages include data samples. Base R comes with a datasets package that offers a wide range of simple, sometimes very famous, datasets. Quite a detailed list of built-in datasets from various packages is maintained by Vincent Arel-Bundock.
The nice feature of the datasets from the datasets package is that they are “always there”: their unique names can be referred to just like objects in the Global Environment. Let’s have a look at a beautiful small dataset called swiss - Swiss Fertility and Socioeconomic Indicators (1888) Data. I am going to check visually the difference in fertility based on rurality and the dominance of the Catholic population.
library(tidyverse)

swiss %>%
        ggplot(aes(x = Agriculture, y = Fertility,
                   color = Catholic > 50))+
        geom_point()+
        stat_ellipse()+
        theme_minimal(base_family = "mono")
Gapminder
Some packages are created specifically to disseminate datasets in a ready-to-use format. One of the nice examples is the gapminder package, which contains a neat dataset widely used by Hans Rosling in his Gapminder project.
library(tidyverse)
library(gapminder)

gapminder %>%
        ggplot(aes(x = year, y = lifeExp,
                   color = continent))+
        geom_jitter(size = 1, alpha = .2, width = .75)+
        stat_summary(geom = "path", fun.y = mean, size = 1)+
        theme_minimal(base_family = "mono")
Grab a dataset by URL
If a dataset is hosted online and has a direct link to the file, it can easily be imported into the R session just by specifying the URL. For illustration, I will access the Galton dataset from the HistData package using a direct link from Vincent Arel-Bundock’s list.
library(tidyverse)

galton <- read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/HistData/Galton.csv")

galton %>%
        ggplot(aes(x = father, y = height))+
        geom_point(alpha = .2)+
        stat_smooth(method = "lm")+
        theme_minimal(base_family = "mono")
Download and unzip an archive
Quite often datasets are stored in archived form. With R it is very simple to download and unzip the desired data archives. As an example, I will download the Historical New York City Crime Data provided by the Government of the State of New York and hosted at the data.gov portal. The logic of the process is: first, we create a directory for the unzipped data; second, we download the archive; finally, we unzip the archive and read the data.
library(tidyverse)
library(readxl)

# create a directory for the unzipped data
ifelse(!dir.exists("unzipped"), dir.create("unzipped"), "Directory already exists")

# specify the URL of the archive
url_zip <- "http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/citywide_historical_crime_data_archive.zip"

# storing the archive in a temporary file
f <- tempfile()
download.file(url_zip, destfile = f)
unzip(f, exdir = "unzipped/.")
If the zipped file is rather big and we don’t want to download it again the next time we run the code, it might be useful to keep the archived data.
# if we want to keep the .zip file
path_unzip <- "unzipped/data_archive.zip"
ifelse(!file.exists(path_unzip),
       download.file(url_zip, path_unzip, mode = "wb"),
       'file alredy exists')
unzip(path_unzip, exdir = "unzipped/.")
Finally, let’s read and plot some of the downloaded data.
murder <- read_xls("unzipped/Web Data 2010-2011/Seven Major Felony Offenses 2000 - 2011.xls",
                   sheet = 1, range = "A5:M13") %>%
        filter(OFFENSE %>% substr(1, 6) == "MURDER") %>%
        gather("year", "value", 2:13) %>%
        mutate(year = year %>% as.numeric())

murder %>%
        ggplot(aes(year, value))+
        geom_point()+
        stat_smooth(method = "lm")+
        theme_minimal(base_family = "mono")+
        labs(title = "Murders in New York")
Figshare
In academia it is becoming more and more popular to store the datasets accompanying papers in specialized repositories. Figshare is one of the most popular free repositories. There is an R package, rfigshare, to access the datasets from this portal. As an example I will grab the dataset on ice-hockey players’ height that I assembled manually for my blog post. Please note that on the first run the package will ask you to enter your Figshare login details to access the API - a web page will be opened in the browser.
There is a search function fs_search, though my experience shows that it is easier to search for a dataset in a browser and then use the id of a file to download it. The function fs_download turns an id number into a direct URL to download the file.
library(tidyverse)
library(rfigshare)

url <- fs_download(article_id = "3394735")
hockey <- read_csv(url)

hockey %>%
        ggplot(aes(x = year, y = height))+
        geom_jitter(size = 2, color = "#35978f", alpha = .1, width = .25)+
        stat_smooth(method = "lm", size = 1)+
        ylab("height, cm")+
        xlab("year of competition")+
        scale_x_continuous(breaks = seq(2005, 2015, 5), labels = seq(2005, 2015, 5))+
        theme_minimal(base_family = "mono")
Full reproducibility
All the code chunks can be found in this gist.
################################################################################
#
# ikashnitsky.github.io 2017-10-17
# Data acquisition in R - Part 1/4
# https://ikashnitsky.github.io/2017/data-acquisition-one
# Ilya Kashnitsky, [email protected]
#
################################################################################

# Erase all objects in memory
rm(list = ls(all = TRUE))

# load required packages
library(tidyverse)      # data manipulation and viz

# built-in ---------------------------------------------------------------------

# Swiss Fertility and Socioeconomic Indicators (1888) Data. Let's check the
# difference in fertility based of rurality and domination of Catholic population.
swiss %>%
        ggplot(aes(x = Agriculture, y = Fertility,
                   color = Catholic > 50))+
        geom_point()+
        stat_ellipse()+
        theme_minimal(base_family = "mono")

ggsave("swiss.png", width = 8, height = 5)

# gapminder --------------------------------------------------------------------
library(gapminder)

gapminder %>%
        ggplot(aes(x = year, y = lifeExp,
                   color = continent))+
        geom_jitter(size = 1, alpha = .2, width = .75)+
        stat_summary(geom = "path", fun.y = mean, size = 1)+
        theme_minimal(base_family = "mono")

ggsave("gapminder.png", width = 8, height = 5)

# URL --------------------------------------------------------------------------
library(tidyverse)

galton <- read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/HistData/Galton.csv")

galton %>%
        ggplot(aes(x = father, y = height))+
        geom_point(alpha = .2)+
        stat_smooth(method = "lm")+
        theme_minimal(base_family = "mono")

ggsave("galton.png", width = 8, height = 5)

# UNZIP ------------------------------------------------------------------------

# create a directory for the unzipped data
ifelse(!dir.exists("unzipped"), dir.create("unzipped"), "Directory already exists")

# specify the URL of the archive
url_zip <- "http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/citywide_historical_crime_data_archive.zip"

# download, unzip and read data
f <- tempfile()
download.file(url_zip, destfile = f)
unzip(f, exdir = "unzipped/.")

# if we want to keep the .zip file
path_unzip <- "unzipped/data_archive.zip"
ifelse(!file.exists(path_unzip),
       download.file(url_zip, path_unzip, mode="wb"),
       'file alredy exists')
unzip(path_unzip, exdir = "unzipped/.")

library(readxl)

murder <- read_xls("unzipped/Web Data 2010-2011/Seven Major Felony Offenses 2000 - 2011.xls",
                   sheet = 1, range = "A5:M13") %>%
        filter(OFFENSE %>% substr(1, 6) == "MURDER") %>%
        gather("year", "value", 2:13) %>%
        mutate(year = year %>% as.numeric())

# plot
murder %>%
        ggplot(aes(year, value))+
        geom_point()+
        stat_smooth(method = "lm")+
        theme_minimal(base_family = "mono")+
        labs(title = "Murders in New York")

ggsave("new-york.png", width = 8, height = 5)

# Figshare ---------------------------------------------------------------------
library(rfigshare)

# find the dataset
# fs_search("ice hockey players") # not working

url <- fs_download(article_id = "3394735")
hockey <- read_csv(url)

hockey %>%
        ggplot(aes(x = year, y = height))+
        geom_jitter(size = 2, color = "#35978f", alpha = .1, width = .25)+
        stat_smooth(method = "lm", size = 1)+
        ylab("height, cm")+
        xlab("year of competition")+
        scale_x_continuous(breaks = seq(2005, 2015, 5), labels = seq(2005, 2015, 5))+
        theme_minimal(base_family = "mono")

ggsave("ice-hockey.png", width = 8, height = 5)
oditile · 8 years ago
Text
Text Analytics
Text Mining has become quite mainstream nowadays as the tools to make a reasonable text analysis are ready to be exploited and give astoundingly nice and reasonable results. At BNOSAC, we have used it mainly for text mining on call center logs, salesforce data, emails, HR reviews, IT logged tickets, customer reviews, survey feedback and many more. We help companies get a competitive edge by providing text analytics using our proven methodology and our web app txtminer. Interested? Get in touch.
Application Domains
Typical application domains for which we executed projects are
1 Understanding what is written in text by identifying topics/client issues.
2 Enriching predictive models (churn / upsell) with text analytics
3 Process efficiency gains by better email redirection or reducing IT handling times
4 Building a FAQ list + automatic replies to client emails including matched solutions from FAQ questions
5 Automatic solving of client issues
6 Extracting structured content from text, pdf files or websites
txtminer
For citizen data scientists with a deadline, we have developed our text mining application called txtminer, a web application which automates text analytics.
Features: txtminer performs automated text analytics for all major European languages, provides interactive visualisations, integrated root cause analysis and predictive modelling, and can be installed on-premise or in the cloud.
More information here
Our text mining methodology
A typical text mining project consists of executing a number of steps which are detailed below.
 Text analytics steps
1 A first step consists of cleaning the data. Making sure we retain only relevant text. The more structured your text data is already about a specific topic, the better the results will be.
2 Next we perform Natural Language Processing. This means we identify words and label if the word is a noun/verb/adverb/ or if it is a person, a city, a number or a specific object.
3 In the subsequent step we identify sentiment scores and topics in the text.
4 The results are visualised extensively and reported. This is core to understanding the topics and having a dialogue with business users.
5 The text analytics enrichments are next used for other purposes like predictive modelling, automatic replies, better forwarding, creation of FAQ lists, ...
6 The process is automated.
 Visualisation is key
Visualisation is key during text analytics. Some sample visualisations of typical text mining results can be found below. For reasons of confidentiality, these were executed on the open CETA trade agreement instead of client results.
Automation is key
The next step, beyond understanding what is written, is predictive modelling and giving suggestions for process improvements and responses to clients. Automation is always a part of our text mining proposal.
Training
We want you to use the techniques in-house, so if you have data scientists at your site, we train them. We give training in text mining using open source tools. This training covers the following topics.
Text encodings
Cleaning of text data, regular expressions
String distances
Graphical displays of text data
Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation
Sentiment analysis
Statistical topic detection modelling and visualization (latent Dirichlet allocation)
Visualisation of correlations & topics
Word embeddings
Document similarities & Text alignment
More information can be found here.
oditile · 8 years ago
Text
Text Mining with R - upcoming courses in Belgium
We use text mining a lot in day-to-day data mining operations. In order to share our knowledge on this, to show that R is an extremely mature platform to do business-oriented text analytics and to give you practical experience with text mining, our course on Text Mining with R is scheduled for the 3rd consecutive year at LStat, the Leuven Statistics Research Center (Belgium) as well as at the Data Science Academy in Brussels. Courses are scheduled 2 times in November 2017 and also in March 2018.
This course is a hands-on course covering the use of text mining tools for the purpose of data analysis. It covers basic text handling, natural language engineering and statistical modelling on top of textual data. The following items are covered.
Text encodings
Cleaning of text data, regular expressions
String distances
Graphical displays of text data
Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation
Sentiment analysis
Statistical topic detection modelling and visualization (latent Dirichlet allocation)
Visualisation of correlations & topics
Word embeddings
Document similarities & Text alignment
Feel free to register at the following dates:
18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to [email protected]
19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here
Interested in R training? Download our R training brochure in PDF here.
oditile · 8 years ago
Text
Linear Discriminant Analysis in R: An Introduction
11 October 2017 | by Jake Hoare
How does Linear Discriminant Analysis work and how do you use it in R? This post answers these questions and provides an introduction to Linear Discriminant Analysis.
Linear Discriminant Analysis (LDA) is a well-established machine learning technique for predicting categories. Its main advantages, compared to other classification algorithms such as neural networks and random forests, are that the model is interpretable and that prediction is easy.
The intuition behind Linear Discriminant Analysis
Linear Discriminant Analysis takes a data set of cases (also known as observations) as input. For each case, you need to have a categorical variable to define the class and several predictor variables (which are numeric). We often visualize this input data as a matrix, such as shown below, with each case being a row and each variable a column. In this example, the categorical variable is called “class” and the predictive variables (which are numeric) are the other columns.
Think of each case as a point in N-dimensional space, where N is the number of predictor variables. Every point is labeled by its category. (Although it focuses on t-SNE, this video neatly illustrates what we mean by dimensional space).
The LDA algorithm uses this data to divide the space of predictor variables into regions. The regions are labeled by categories and have linear boundaries, hence the “L” in LDA. The model predicts the category of a new unseen case according to which region it lies in. The model predicts that all cases within a region belong to the same category.
The linear boundaries are a consequence of assuming that the predictor variables for each category have the same multivariate Gaussian distribution. Although in practice this assumption may not be 100% true, if it is approximately valid then LDA can still perform well.
Mathematically, LDA uses the input data to derive the coefficients of a scoring function for each category. Each function takes as arguments the numeric predictor variables of a case. It then scales each variable according to its category-specific coefficients and outputs a score. The LDA model looks at the score from each function and uses the highest score to allocate a case to a category (prediction). We call these scoring functions the discriminant functions.
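To make the scoring step concrete, here is a minimal sketch using the MASS package and the built-in iris data; neither is used in this post (flipMultivariates is introduced below), and MASS::lda parameterises the model as discriminant axes rather than one explicit scoring function per class, but the allocation step, assigning each case to the class with the highest score, is the same idea.
library(MASS)

fit <- lda(Species ~ ., data = iris)   # fit LDA with all four numeric predictors
fit$scaling                            # coefficients of the linear discriminants
pred <- predict(fit)
head(pred$posterior)                   # per-class scores, expressed as posterior probabilities
head(pred$class)                       # each case is allocated to the highest-scoring class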
That is as much of the model as I am going to describe here; let's move on to some practical examples. If you would like more detail, I suggest one of my favorite reads, Elements of Statistical Learning (section 4.3).
Linear Discriminant Analysis Example
Predicting the type of vehicle
Even though my eyesight is far from perfect, I can normally tell the difference between a car, a van, and a bus. I might not distinguish a Saab 9000 from an Opel Manta though. They are cars made around 30 years ago (I can’t remember!). Despite my unfamiliarity, I would hope to do a decent job if given a few examples of both.
I will demonstrate Linear Discriminant Analysis by predicting the type of vehicle in an image. The 4 vehicle categories are a double-decker bus, Chevrolet van, Saab 9000 and Opel Manta 400. The input features are not the raw image pixels but are 18 numerical features calculated from silhouettes of the vehicles. You can read more about the data behind this LDA example here.
To start, I load the 846 instances into a data.frame called vehicles. The columns are labeled by the variables, with the target outcome column called class. The earlier table shows this data.
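If you would like to experiment with comparable data outside Displayr, a public copy of the Statlog vehicle-silhouette data ships with the mlbench package; this is a suggested stand-in rather than the file used in the post, and its variable names differ (for example Comp rather than COMPACTNESS).
# install.packages("mlbench")   # if not already installed
library(mlbench)
data(Vehicle)          # 846 silhouettes: 18 numeric features plus the Class column
dim(Vehicle)
table(Vehicle$Class)   # bus, opel, saab, van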
flipMultivariates: A new R package
The package I am going to use is called flipMultivariates (click on the link to get it). It is based on the MASS package, but extends it in the following ways:
Handling of weighted data
Graphical outputs
Options for missing data
Ability to output discriminant functions
The package is installed with the following R code.
library(devtools)
install_github("Displayr/flipMultivariates")
Then the model is created with the following two lines of code.
library(flipMultivariates)
lda <- LDA(class ~ ., data = vehicles)
The output is shown below. The subtitle shows that the model identifies buses and vans well but struggles to tell the difference between the two car models. The first four columns show the means for each variable by category. High values are shaded in blue and low values in red, with values significant at the 5% level in bold. The R-Squared column shows the proportion of variance within each row that is explained by the categories. On this measure, ELONGATEDNESS is the best discriminator.
Customizing the LDA model with alternative inputs in the code
The LDA function in flipMultivariates has a lot more to offer than just the default. Consider the code below:
lda.2 <- LDA(class ~ COMPACTNESS + CIRCULARITY + DISTANCE.CIRCULARITY + RADIUS.RATIO,
             data = vehicles,
             output = "Scatterplot",
             prior = "Equal",
             subset = vehicles$ELONGATEDNESS < 50,
             weight = ifelse(vehicles$class == "saab", 2, 1))
I've set a few new arguments, which include:
output
prior
subset
weight
output: The default is Means, which is what you got in the very first output. In the code immediately above, I changed this to Scatterplot. I explain how to interpret the scatterplot in the next section. Also available are Prediction-Accuracy Table, Detail and Discriminant Functions. The latter produces a table of the coefficients of the discriminant functions described earlier.
prior: This argument sets the prior probabilities of category membership. Observed is the default, which uses the frequencies of the input data. It is also possible to specify a value for each category as a vector (which should sum to 1); a short sketch of that form follows these descriptions. Equal implies that vehicles are equally distributed across the categories.
subset: This is a vector of TRUE/FALSE values which, in the code above, limits the analysis to cases with lower ELONGATEDNESS (scores below 50 on that numeric predictor variable).
weight: This is a vector of values used to scale the cases. In this example, I overweight the saab cars.
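For completeness, here is a sketch of the vector form of prior mentioned above, kept close to the earlier call. The exact format flipMultivariates::LDA expects (named or unnamed, and the ordering of categories) is my assumption, so check ?LDA before relying on it.
lda.3 <- LDA(class ~ COMPACTNESS + CIRCULARITY + DISTANCE.CIRCULARITY + RADIUS.RATIO,
             data = vehicles,
             prior = c(0.25, 0.25, 0.25, 0.25),  # one probability per category, summing to 1
             subset = vehicles$ELONGATEDNESS < 50,
             weight = ifelse(vehicles$class == "saab", 2, 1))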
It is also possible to control treatment of missing variables with the missing argument (not shown in the code example above). The options are Exclude cases with missing data (default), Error if missing data and Imputation (replace missing values with estimates). Imputation allows the user to specify additional variables (which the model uses to estimate replacements for missing data points).
The R command ?LDA gives more information on all of the arguments.
Interpreting the Linear Discriminant Analysis output
The previous block of code above produces the following scatterplot. (Note: I am no longer using all the predictor variables in the example below, for the sake of clarity). I am going to talk about two aspects of interpreting the scatterplot: how each dimension separates the categories, and how the predictor variables correlate with the dimensions.
I said above that I would stop writing about the model. However, to explain the scatterplot I am going to have to mention a few more points about the algorithm. If you prefer to gloss over this, please skip ahead.
An alternative view of linear discriminant analysis is that it projects the data into a space of (number of categories – 1) dimensions. In this example that space has 3 dimensions (4 vehicle categories minus one). While this aspect of dimension reduction has some similarity to Principal Components Analysis (PCA), there is a difference: LDA chooses dimensions that maximally separate the categories (in the transformed space). The LDA model orders the dimensions in terms of how much separation each achieves (the first dimension achieves the most separation, and so forth). Hence the scatterplot shows the means of each category plotted in the first two dimensions of this space. So in our example here, the first dimension (the horizontal axis) distinguishes the cars (right) from the bus and van categories (left). However, the same dimension does not separate the cars well.
Also shown are the correlations between the predictor variables and these new dimensions. Because DISTANCE.CIRCULARITY has a high value along the first linear discriminant it positively correlates with this first dimension. It has a value of almost zero along the second linear discriminant, hence is virtually uncorrelated with the second dimension. Note the scatterplot scales the correlations to appear on the same scale as the means. So you can’t just read their values from the axis. In other words, the means are the primary data, whereas the scatterplot adjusts the correlations to “fit” on the chart.
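To see this projection step outside Displayr, the sketch below uses MASS::lda and the iris data as a stand-in (my choice of example, not the vehicles data from the post). With three classes there are at most two discriminant dimensions, and plotting the projected scores gives a rough, unweighted analogue of the scatterplot discussed above.
library(MASS)

fit <- lda(Species ~ ., data = iris)   # 3 classes, so at most 2 discriminant dimensions
scores <- predict(fit)$x               # each case projected onto LD1 and LD2
plot(scores, col = as.integer(iris$Species), pch = 19,
     xlab = "First linear discriminant", ylab = "Second linear discriminant")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)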
The Prediction-Accuracy Table
Finally, I will leave you with this chart to consider the model’s accuracy. Changing the output argument in the code above to Prediction-Accuracy Table produces the following:
So from this, you can see what the model gets right and wrong (in terms of correctly predicting the class of vehicle). The ideal is for all the cases to lie on the diagonal of this matrix (and so the diagonal is a deep color in terms of shading). But here we are getting some misallocations (no model is ever perfect). For instance, 19 cases that the model predicted as Opel are actually in the bus category (observed). Given the shades of red and the numbers that lie outside this diagonal (particularly with respect to the confusion between Opel and saab) this LDA model is far from perfect.
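If you are working outside Displayr, the same kind of prediction-accuracy (confusion) table can be produced in a couple of lines. This sketch again uses MASS::lda and iris as a stand-in for the flipMultivariates output described above; the diagonal holds the correct allocations.
library(MASS)

fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit)$class
table(Predicted = pred, Observed = iris$Species)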
Try it yourself
I created the analyses in this post with R in Displayr. You can review the underlying data and code or run your own LDA analyses here (just sign into Displayr first). I used the flipMultivariates package (available on GitHub).
Displayr also makes Linear Discriminant Analysis and other machine learning tools available through menus, alleviating the need to write code.
This dataset originates from the Turing Institute, Glasgow, Scotland, which closed in 1994 so I doubt they care, but I’m crediting the source anyway.
0 notes
oditile · 8 years ago
Link
0 notes
oditile · 8 years ago
Link
Guidance on the Short Form 36 (SF-36) questionnaire
SF-36
0 notes
oditile · 8 years ago
Text
Post listing by date
From:  http://ellisp.github.io/blog/
Most recent at the top
04 Jun 2017 » Global choropleth maps of military expenditure
21 May 2017 » Sankey charts for swinging voters
14 May 2017 » Web app for individual party vote from the 2014 New Zealand election study
06 May 2017 » Modelling individual party vote from the 2014 New Zealand election study
30 Apr 2017 » Luke-warm about micromaps
25 Apr 2017 » More cartograms of New Zealand census data (district and city level)!
23 Apr 2017 » Cartograms of New Zealand census data
15 Apr 2017 » Impact of omitted variables on estimating causal effects - simulations
09 Apr 2017 » Exploring propensity score matching and weighting
26 Mar 2017 » New Zealand election forecasts
21 Mar 2017 » House effects in New Zealand voting intention polls
12 Mar 2017 » Simulations to explore excessive lagged X variables in time series modelling
11 Mar 2017 » New data and functions in nzelect 0.3.0 R package
04 Mar 2017 » Visualising relationships between children's books
26 Feb 2017 » Success rates of appeals to the Supreme Court by Circuit
18 Feb 2017 » Moving largish data from R to H2O - spam detection with Enron emails
23 Jan 2017 » US Presidential inauguration speeches
22 Jan 2017 » Does seasonally adjusting first help forecasting?
14 Jan 2017 » Books I like
05 Jan 2017 » Cross-validation of topic modelling
31 Dec 2016 » Sparse matrices, k-means clustering, topic modelling with posts on the 2004 US Presidential election
26 Dec 2016 » Extracting data on shadow economy from PDF tables
24 Dec 2016 » forecastHybrid 0.3.0 on CRAN
18 Dec 2016 » Air quality in Indian cities
10 Dec 2016 » Extrapolation is tough for trees!
07 Dec 2016 » Why time series forecasts prediction intervals aren't as good as we'd hope
27 Nov 2016 » Error, trend, seasonality - ets and its forecast model friends
24 Nov 2016 » Declining sea ice in the Arctic
19 Nov 2016 » Earthquake energy over time
15 Nov 2016 » Extreme pie chart polishing
06 Nov 2016 » Timeseries forecasting using extreme gradient boosting
29 Oct 2016 » FiveThirtyEight's polling data for the US Presidential election
19 Oct 2016 » Tourism forecasting competition data in the Tcomp R package
15 Oct 2016 » Statistics New Zealand experimental API initiative
12 Oct 2016 » Update of `ggseas` for seasonal decomposition on the fly
18 Sep 2016 » New Zealand Election Study individual level data
16 Sep 2016 » Why you need version control
13 Sep 2016 » Analysing the Modelled Territorial Authority GDP estimates for New Zealand
28 Aug 2016 » Dual axes time series plots with various more awkward data
18 Aug 2016 » Dual axes time series plots may be ok sometimes after all
13 Aug 2016 » Elastic net regularization of a model of burned calories
04 Aug 2016 » nzcensus on GitHub
14 Jul 2016 » nzelect 0.2.0 on CRAN
02 Jul 2016 » Animated world inequality map
30 Jun 2016 » International Household Income Inequality data
16 Jun 2016 » Monthly Regional Tourism Estimates
14 Jun 2016 » Presentation slides on using graphics
05 Jun 2016 » Bootstrap and cross-validation for evaluating modelling strategies
29 May 2016 » Actual coverage of confidence intervals for standard deviation
22 May 2016 » Visual contrast of two robust regression methods
14 May 2016 » Minimalist Tufte-inspired axis text with Scottish-New Zealand historical material
07 May 2016 » Announcing new forecastHybrid package
16 Apr 2016 » Election analysis contest entry part 4 - drivers of preference for Green over Labour party
09 Apr 2016 » Election analysis contest entry part 3 - interactive exploration of voting locations with leaflet and Shiny
04 Apr 2016 » Election analysis contest entry part 2 - building the nzelect R package
03 Apr 2016 » Election analysis contest entry part 1 - introducing the nzelect R package
28 Mar 2016 » Seasonal decomposition in the ggplot2 universe with ggseas
19 Mar 2016 » Skill v luck in determining backgammon winners
06 Mar 2016 » Explore with Shiny the impact of sample size on "p-charts"
24 Feb 2016 » New Zealand Tourism Dashboard pushes Shiny to the max
20 Feb 2016 » Dynamic Stochastic General Equilibrium models made (relatively) easy with R
08 Feb 2016 » ggseas package for seasonal adjustment on the fly with ggplot2
06 Feb 2016 » Data from the World Health Organization API
30 Jan 2016 » Better prediction intervals for time series forecasts
23 Jan 2016 » Filling in the gaps - highly granular estimates of income and population for New Zealand from survey data
29 Dec 2015 » A (not too talkative) twitterbot is born
26 Dec 2015 » Network charts of commuting in New Zealand with R and D3
21 Dec 2015 » X13-SEATS-ARIMA as an automated forecasting tool
12 Dec 2015 » Importance of exports and economic growth, cross-country time series
05 Dec 2015 » Inequality measures in the World Development Indicators
26 Nov 2015 » Deaths from assault over time in 40 relatively rich countries
21 Nov 2015 » Create ARIMA time series from bottom up
15 Nov 2015 » Linear model with time series random component
30 Oct 2015 » Modelled Territorial Authority GDP for New Zealand
25 Oct 2015 » Silver flows in the seventeenth and eighteenth centuries
10 Oct 2015 » Seasonal adjustment on the fly with X-13ARIMA-SEATS, seasonal and ggplot2
04 Oct 2015 » Recruiting Analysts for dynamic cutting edge public sector team
30 Sep 2015 » Success rates of automated ARIMA fitting
20 Sep 2015 » How to compare two blackbox timeseries generators?
19 Sep 2015 » Autocorrelation functions of materially different time series
12 Sep 2015 » Sampling distribution of Gini coefficient
07 Sep 2015 » Transforming the breaks to match a scale
05 Sep 2015 » Creating a scale transformation
30 Aug 2015 » Getting started in applied statistics / datascience
21 Aug 2015 » A better way of visualising income distributions with zeroes and negatives
15 Aug 2015 » Importing the New Zealand Income Survey SURF
07 Aug 2015 » Simulating backgammon players' Elo ratings
01 Aug 2015 » New Zealand Data & APIs on GitHub
26 Jul 2015 » Hello world!
Follow this blog with RSS. Or follow posts in this blog featuring R with RSS.
This blog is built with Jekyll, Bootstrap, Ruby and R. Read my acknowledgements for a little more on how. I'm pleased to be aggregated at R-bloggers, the one-stop shop for blog posts featuring R.
Peter's stats stuff by Peter Ellis is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
0 notes
oditile · 8 years ago
Text
Global choropleth maps of military expenditure
From: http://ellisp.github.io/blog/2017/06/04/military-gdp
04 Jun 2017
At a glance:
For choropleth maps showing the whole world, we don't need to stick to static maps with Mercator projections. I like rotating globes, and interactive slippery maps with tooltips.
Global choropleth maps
Today I’m writing about using country-level choropleth maps at a global level, using example data showing nations’ military spending as a percentage of GDP. There are a few specific issues to work through as a result of the global coverage. I’m going to start with my two preferred end results.
First, here’s a Twitter-friendly rotating globe:
There are a few problems with it that I didn't quite iron out. If you watch it for long enough you'll see Canada, the USA or Russia flicker out of the frame a few times. More on that below.
Second, here’s an interactive version, suited for a fully featured web browser:
I find this the most useful version, even though I prefer to look at the rotating globe. The ability to hover over a country to get its name and value, and to zoom into a particular region, adds real value to the visualization. This is really the minimum people expect out of what I call micro-interactivity (I’m sure there’s a proper, technical name) - ability to zoom and get useful information when hovering.
Unless you spend a lot of time looking at maps of the Middle East, you probably wouldn’t have identified those two dark patches of high military spending as Oman and Eritrea, without the help of the hovering tooltips.
Caveat for today’s post - I’m not a GIS expert at all, and in fact am barely into the “conscious incompetence” stage. So there may be mistakes in what follows, and there are almost certainly better ways to do some of what I’m doing.
Data on military spend
First I needed some example data. Choropleth maps are appropriate for variables like proportions and growth rates rather than absolute numbers. Military spending as a percentage of GDP was in the news when I started this, because of USA President Trump drawing attention (in somewhat misleading terms) to the pledge by NATO leaders in 2014 to spend 2% of their GDP on the military by 2024 (the 2% was described as an unofficial floor as early as 2006, but only committed to at the leadership level in 2014 in the atmosphere of nervousness after Russia’s actions in the Ukraine). With seven years to the commitment deadline, it is widely reported that only five countries have yet passed the 2% figure, usually listed as the United States, Greece, United Kingdom, Estonia and Poland. Interestingly, the data I’m using suggest the UK is currently just under 2% and France just over, slightly different to some of the media reporting.
Often my first port of call looking for internationally comparable data is the World Development Indicators of the World Bank. While they are the definitive origin of very little data, the World Bank does a great job in collating data from other sources and making it available in their visualisation tool. There’s also Vincent Arel-Bundock’s handy WDI R package which makes it easy to get the data from the Bank into your own analytical tool.
When I started this exercise a couple of weeks ago, I was able to create this choropleth map from the very friendly WDI visualisation tool:
(but interestingly, returning to the site today, I can't reproduce it). Anyway, there are a few things that leave me unhappy with it:
The data only go to 2015, whereas the data from the source goes up to 2016 in many cases.
The Mercator projection of the world map - according to xkcd’s brilliant summary of ‘what your favorite map projection says about you’, Mercator says “you’re not really into maps”. There are several problems, but for me it’s all summed up by Greenland looking way too large.
There’s a great ability to use the slider at the bottom of the map (not in the above screenshot though) to cycle through the years, but the scale of the colour fill recalibrates itself each year so this doesn’t allow a coherent view of change over time.
The colour scheme isn't particularly good at drawing out differences between countries.
So I went to the source, the Stockholm International Peace Research Institute (SIPRI), who collate data from governments around the world into a Yearbook and publish the data separately - at the time of writing, with data from 1949 to 2016, using the countries that exist in 2016 (ie no East Germany, Soviet Union, etc; data for the Russian Federation, Ukraine, etc begins in 1992 or soon after).
Data exploration
If we plot all the data for all the countries and years at once, it looks like this:
First thing you might notice is the massive spike. That turns out to be Kuwait in 1991, when it spent 17 percent more than its GDP on military - the kind of thing that happens when you’re invaded and become the battleground for what in the USA is referred to as the First Gulf War.
In case you’re wondering, there’s no reason why a country can’t spend more than its GDP on the military (or anything else). In fact, while they are both measured in dollars, GDP is not really the same sort of variable as spend. GDP is the sum of value add of economic activity in the country. If spending is on imported goods and services it can exceed GDP. Similarly, while we see economic indicators like “exports as a percentage of GDP” in use (and it is a good indicator of the importance of trade), there’s no ceiling of that indicator at 100%, and countries like Luxembourg, Singapore, Malta and Ireland do in fact exceed 100%.
Here’s the code to download and import the SIPRI “military as a percentage of GDP” data into R and draw that first graph:
# Load in all the packages we need for the full post
library(cshapes)
library(tidyverse)
library(forcats)
library(scales)
library(openxlsx)
library(directlabels)
library(ggthemes)
library(RColorBrewer)
library(countrycode)
library(maps)
library(viridis)
library(leaflet)
library(rworldmap)
library(sf)
library(maptools)

#------------download------------
tf <- tempfile()
download.file("https://www.sipri.org/sites/default/files/SIPRI-Milex-data-1949-2016.xlsx",
              destfile = tf, mode = "wb")

#---------------import and tidy----------
sipri <- read.xlsx(tf, sheet = "Share of GDP", startRow = 6) %>%
  gather(Year, Value, -Country, -Notes) %>%
  mutate(Value = as.numeric(Value),
         Year = as.numeric(Year)) %>%
  filter(!is.na(Value))

#----------------exploratory chart-----------
sipri %>%
  ggplot(aes(x = Year, y = Value, colour = Country)) +
  geom_line() +
  theme(legend.position = "none") +
  scale_y_continuous("Military expenditure as a percentage of GDP", label = percent) +
  ggtitle("Countries' and regions' expenditure on military as a percentage of GDP") +
  labs(caption = "Source: Stockholm International Peace Research Institute", x = "")
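As a quick cross-check of the NATO 2% discussion above, and assuming the sipri data frame built in the chunk just shown (Value is stored as a proportion, so 2% is 0.02), something like the following lists the countries whose most recent observation is at or above that threshold:
sipri %>%
  group_by(Country) %>%
  filter(Year == max(Year)) %>%   # most recent observation for each country
  ungroup() %>%
  filter(Value >= 0.02) %>%
  arrange(desc(Value))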
As further exploration, I picked the five so-called "Five Eyes" countries. I have a little familiarity with their military history, so I thought this could be a useful way of checking that I understood the data. Here is military spend as a percentage of GDP for those five countries, with major conflicts they were involved in indicated on a timeline.
The dates of the conflicts are just taken from Wikipedia.
The spikes in spending clearly and unsurprisingly relate to conflicts or other known political events, most obviously:
The Korean war led to a big spike for all four countries that had data available.
The surge in intensity (of their involvement) in the Vietnam war led to spikes for the USA and Australia.
There was a blip in UK military spending at the time of the Falklands war, but it did not arrest the overall downwards trend, even more marked since significant military conflict in Northern Ireland finished.
The early 1980s military boom for the USA was the Reagan administration ratcheting up the Cold War as well as militarily engaging in other countries in ways too numerous to show.
The wars of the George W. Bush era in Afghanistan and Iraq, attempts to create a formal “War on Terror”, temporarily reversed the secular decline in the economic importance of US military spending.
Code for drawing this chart (I’m not particularly happy with the manual data entry for the conflicts and their dates, but it will do for a one-off thing like this):
wid <- 0.01 # width of annotation bars
wars <- data_frame(
  name  =         c("Malaysia",  "Korea",     "Aden",       "Vietnam",   "Northern Ireland", "Falklands", "Gulf",      "Afghanistan", "Iraq"),
  start = as.Date(c("15/6/1948", "25/6/1950", "10/12/1963", "1/11/1955", "1/1/1968",         "2/4/1982",  "2/8/1990",  "7/10/2001",   "20/3/2003"),
                  format = "%d/%m/%Y"),
  end   = as.Date(c("12/7/1960", "27/7/1953", "30/11/1967", "30/4/1975", "30/6/1998",        "14/6/1982", "28/2/1991", "28/12/2014",  "18/12/2011"),
                  format = "%d/%m/%Y"),
  lead  =         c("UK",        "USA",       "UK",         "USA",       "UK",               "UK",        "USA",       "USA",         "USA")
) %>%
  mutate(name_seq = n():1,
         ystart = wid * name_seq + 0.12,
         yend = ystart + wid)

countries <- c("Australia", "New Zealand", "USA", "UK", "Canada")
palette <- brewer.pal(5, "Set1")
names(palette) <- countries

p <- sipri %>%
  filter(Country %in% countries) %>%
  mutate(YearDate = as.Date(paste0("30/6/", Year), format = "%d/%m/%Y")) %>%
  ggplot() +
  geom_rect(data = wars, aes(xmin = start, xmax = end, ymin = ystart, ymax = yend, fill = lead),
            alpha = 0.2) +
  geom_text(data = wars, aes(x = end, y = (yend + ystart) / 2, label = name),
            colour = "grey50", hjust = 1, nudge_x = -200, size = 3) +
  geom_line(aes(x = YearDate, y = Value, colour = Country)) +
  scale_y_continuous("Military expenditure as a percentage of GDP", label = percent,
                     breaks = c(0, 5, 10, 15) / 100) +
  ggtitle("Selected countries' expenditure on military as a percentage of GDP",
          "'Five eyes' countries only; periods also shown for conflicts in which they had material deployments") +
  labs(x = "",
       caption = "Source: Stockholm International Peace Research Institute") +
  scale_colour_manual(values = palette) +
  scale_fill_manual(values = palette, guide = "none") +
  theme_tufte(base_family = "myfont") +
  theme(plot.caption = element_text(colour = "grey50")) +
  annotate("text", x = as.Date("1970/6/30"), y = 0.145, hjust = 0.5,
           colour = "grey50", size = 3,
           label = "Not shown - Grenada, Panama, Balkans,\nLibya, Lebanon, Haiti and numerous smaller...")

direct.label(p)
Different world projections
OK, time for maps. First I wanted to sort out some projections for flat versions of the world. ggplot2 makes this convenient. You can define a single map object, and just add a different version of + coord_map() to it to get a different projection. Here are two which worked quite nicely:
Globular projection
Projection with bilateral symmetry about the Prime Meridian and the equator. Hemisphere is circle, circular arc meridians equally spaced on equator, circular arc parallels equally spaced on 0- and 90-degree meridians:
Orthographic projection
Viewed from infinity, centering on 20 degrees latitude, 20 degrees longitude:
Code for the basic maps
There's a bit of work to be done to join the military spend data to map data. Most importantly I need a country code. The countrycode package is a great example of a specialist package that does one thing well: it converts between different country names and coding schemes. For my use case, I want everything to be the ISO three character code (eg NZL for New Zealand). Once this is done, all that is required is to create a flat ggplot-ready version of a map shape file and join the two data frames together.
In the code below I create a single ggplot2 object worldmap, which is an un-projected world choropleth map complete with the information on colour scale (I use viridis inferno), legend, captions, etc. Then I can add whatever projection I want (final two lines of code in the chunk below).
#===============map prep========================
sipri <- sipri %>%
  mutate(iso3c = countrycode(Country, "country.name", destination = "iso3c")) %>%
  # two manual concordances of country code:
  mutate(iso3c = ifelse(Country == "Kosovo", "KOS", iso3c))

# for some reason the tidyverse doesn't work for Central African Republic!
# so we fix it old school:
sipri[sipri$Country == "Central African Rep.", "iso3c"] <- "CAF"

world <- map_data("world") %>%
  mutate(iso3c = countrycode(region, "country.name", destination = "iso3c"))

# data on military for just the latest year for each country
the_data <- sipri %>%
  group_by(Country) %>%
  filter(Year == max(Year))

# using the help at http://ggplot2.tidyverse.org/reference/coord_map.html
world2 <- world %>%
  left_join(the_data, by = "iso3c")

# define a ggplot mapping object
worldmap <- ggplot(world2, aes(x = long, y = lat, group = group, fill = Value)) +
  geom_polygon(colour = "grey75") +
  scale_y_continuous("", breaks = (-2:2) * 30) +
  scale_x_continuous("", breaks = (-4:4) * 45) +
  scale_fill_viridis("", label = percent, option = "inferno", direction = -1) +
  theme_minimal(base_family = "myfont") +
  theme(legend.position = "right",
        axis.text = element_blank()) +
  ggtitle("Military expenditure as a percentage of GDP",
          "Most recent data shown for each country; mostly this is 2016") +
  labs(caption = "Source: Stockholm International Peace Research Institute")

#-------------single map, nice projection-------------------
# Lots of the projections in coord_map have problems with drawing
# polygons.  ok are: globular, harrison
worldmap +
  coord_map("globular")
worldmap +
  coord_map("orthographic", orientation = c(20, 20, 0))
Animated choropleth over time
Remembering my reservations about the World Development Indicators site, one of my aims was to have a choropleth map that showed military expenditure as a proportion of GDP for different years, with colour on the same scale. An animated graphic is the obvious way to do this:
… but it has imperfections:
Because of the few year-country combinations with massive spikes (eg Kuwait in 1991), I needed to use a logarithm transform on the scale or everything looked the same colour.
The slowly changing dimension of country existence proves a real problem, with data before the 1990s sadly lacking.
Data existence in general tends to dominate the visual; so rather than a feel for how military expenditure changes over time, the impression one gets is of more data gradually becoming available over time.
Not really a success. Nonetheless, a nice idea (I think) and here’s the code that does it. It’s basically the same idea as the previous chunk of code, but in a loop that repeats the data filtering and joining for each value of year, and saves a single frame for each printed annual map.
#================animated map over time===============
setwd("C:/temp1") # or whatever folder you want to hold the frames in
years <- unique(sipri$Year)

for(i in years){
   this_year_data <- sipri %>%
      group_by(Country) %>%
      filter(Year == i)

   world2 <- world %>%
      left_join(this_year_data, by = "iso3c")

   worldmap <- ggplot(world2, aes(x = long, y = lat, group = group, fill = Value)) +
      geom_polygon(colour = "grey75") +
      scale_y_continuous("", breaks = (-2:2) * 30) +
      scale_x_continuous("", breaks = (-4:4) * 45) +
      scale_fill_viridis("", label = percent, option = "inferno", direction = -1,
                         limits = range(sipri$Value), trans = "sqrt",
                         breaks = c(1, 10, 40, 100) / 100) +
      theme_minimal(base_family = "myfont") +
      theme(legend.position = "right",
            axis.text = element_blank()) +
      ggtitle(paste("Military expenditure as a percentage of GDP -", i)) +
      labs(caption = "Source: Stockholm International Peace Research Institute")

   png(paste0(i, ".png"), 12 * 72, 6 * 72)
   print(worldmap +
            coord_map("harrison", dist = 1, angle = 30, ylim = c(-75, 85)))
   dev.off()
}

# Use ImageMagick to convert all those PNG frames into a single animated GIF
# (requires ImageMagick to be separately installed)
system('magick -loop 0 -delay 40 *.png "0099-military-gdp-time.gif"')
Animated globe
Finally, on to the two “good” versions of the data. Here’s a reminder of the animated globe:
The strategy is to create 360 different snapshots of the globe, each one 1 degree changed in longitude and latitude from the previous. Longitude moves 1 degree each frame in the same direction, latitude meanders from 30 degrees north to 30 degrees south and back again.
I had to leave the ggplot2 universe to do this (but I’m not sure I had to). I couldn’t get my projections of the ggplot2 version of the globe to work with enough values of latitude and longitude as the centre point. Whenever one of the larger countries like Russia, China or Canada was split in two ggplot2 would draw the polygons in ugly ways. My best effort can be seen in this tweet - which also has some very useful comments and suggestions in response.
In the end I adapted the suggestion of Edzer Pebesma, which is included in a demo in his amazing sf package. I don’t really understand exactly how it successfully clips the polygons, but it works nearly entirely well.
There were still quite a few frames where countries went missing altogether for the wrong combination of latitude and longitude. Experimenting with different versions of world maps, I found that some were vulnerable to different combinations from others. To reduce the problem, I made each frame of the end animation actually two globes of countries drawn on top of each other, to increase the chance that at least one of the maps draws each country. This reduced the missing country problem to only three or four frames.
Here’s the code that does that:
#------------sf approach------------------
# this solution came from the demo in the sf package, tweeted about at
# https://twitter.com/edzerpebesma/status/835249468874321920
circ <- function(l = c(-180:180), lon0 = 0, lat0 = 30) {
  deg2rad = pi / 180
  lat = atan(-cos((l - lon0) * deg2rad)/tan(lat0 * deg2rad)) / deg2rad
  xy = if (lat0 == 0) {
     l1 = lon0 - 90
     l2 = lon0 + 90
     rbind(c(l1,-90), c(l2,-90), c(l2,0), c(l2,90), c(l1,90), c(l1,0), c(l1,-90))
  } else if (lat0 > 0) {
     xy = cbind(lon = l, lat = lat)
     rbind(c(-180,90), xy, c(180,90), c(-180,90))
  } else {
     xy = cbind(lon = l, lat = lat)[length(l):1,]
     rbind(c(180,-90), xy, c(-180,-90), c(180,-90))
  }
  st_sfc(st_polygon(list(xy)), crs = st_crs(4326))
}

m <- st_make_grid()
m <- st_segmentize(m, 4e5)

data(wrld_simpl)

# Two versions of the world map, joined to the military/GDP data
# (note: 'pal', the colour palette function, is defined in the leaflet chunk further below):
w1 <- st_as_sf(countriesLow) %>%
  left_join(the_data, by = c("ISO3" = "iso3c")) %>%
  mutate(fill = pal(Value))
w2 <- st_as_sf(wrld_simpl) %>%
  left_join(the_data, by = c("ISO3" = "iso3c")) %>%
  mutate(fill = pal(Value))

# latitudes will go from 30 north to 30 south and back again:
lats <- rep(c(30:(-30), (-29):29), 3)
# longitudes will start at 50 and go around backwards:
lons <- c(180:(-179))

dir.create("C:/temp2")
setwd("C:/temp2")

# Frame 234 goes missing if using wrld_simpl
# Frame 269 goes missing if using countriesLow or wrld_simpl
# etc
for(i in 1:length(lons)){
  png(paste0(1000 + i, ".png"), 600, 550, res = 100)
  par(mar = rep(0, 4), bg = "black")

  lat <- lats[i]
  lon <- lons[i]

  # define the proj4 string:
  p4s <- paste0("+proj=ortho +lat_0=", lat, " +lon_0=", lon)

  # draw the pale blue globe in space
  blank_globe <- st_transform(m, st_crs(p4s), check = TRUE)
  plot(blank_globe, col = '#e6f2ff', border = 'grey99')

  # create a clipped version of the great circle??
  # I don't really understand what is happening, but it seems to work.
  crc <- circ(lat0 = lat, lon0 = lon)
  w10 <- suppressWarnings(st_intersection(w1, crc))
  w20 <- suppressWarnings(st_intersection(w2, crc))

  # cast and re-project the map itself
  w10 <- st_cast(w10, "MULTIPOLYGON")
  w10 <- st_transform(w10["fill"], st_crs(p4s), check = TRUE)
  w20 <- st_cast(w20, "MULTIPOLYGON")
  w20 <- st_transform(w20["fill"], st_crs(p4s), check = TRUE)

  # draw the map
  plot(w10, col = w10$fill, add = TRUE, border = NA)
  plot(w20, col = w20$fill, add = TRUE, border = "grey75")

  # title and legend
  title("Military expenditure as a percentage of GDP\nNearest year to 2016",
        col.main = "white", font.main = 1, adj = 0)
  title(sub = "Source: Stockholm International Peace Research Institute",
        col.sub = "grey50", adj = 1)

  leg_nums <- seq(from = 4, to = 20, length.out = 6) / 100
  legend("bottomright", legend = paste0(round(leg_nums * 100), "%"),
         pch = 15, adj = 0.1,
         col = pal(leg_nums), text.col = pal(leg_nums),
         bg = "grey80")
  dev.off()
}

system('magick -loop 0 -delay 7 *.png "0099-military-gdp.gif"')
Interactive leaflet choropleth
Finally, there’s the interactive version (my favourite):
To draw this, I used leaflet and went back to sp spatial polygons data frames. The code for this is actually some of the simplest in this post. While most of the examples I’ve seen of leaflet use Mercator projections, it’s possible to reproject your map (so long as you don’t need Google or other map tiles) with a couple of lines of code:
#-------------------------leaflet------------------
shape <- countriesCoarse

# define colour palette
pal <- colorNumeric(
  palette = inferno(10, direction = -1),
  domain = the_data$Value)

# uses the_data defined earlier in the ggplot2 demos
data2 <- shape@data %>%
  left_join(the_data, by = c("ISO_A3" = "iso3c")) %>%
  mutate(tooltip = paste0(ADMIN, " ", Year, ", ", round(Value * 100, 1), "%"))

# EPSG4326 means latitude and longitude
coarse_crs <- leafletCRS(crsClass = "L.CRS.EPSG4326", proj4def = proj4string(countriesCoarse))

shape %>%
  leaflet(options = leafletOptions(crs = coarse_crs)) %>%
  addPolygons(stroke = FALSE, smoothFactor = 0.2, fillOpacity = 1,
              color = ~pal(data2$Value),
              label = data2$tooltip)
Happy mapping!
← Older
Sankey charts for swinging voters
0 notes
oditile · 8 years ago
Text
R by Examples
From: http://www.mayin.org/ajayshah/KB/R/
R by example
Basics
Vectors, lists, matrices, data frames
Sorting
Prices and returns
Writing functions
Amazing R vector notation
Amazing R indexing notation
Making latex tabular objects
Associative arrays / hashes
Matrix notation (portfolio computations in financial economics)
Handling missing data
Reading files
Reading a file with a few columns of numbers, and look at what is there.
Reading a file involving dates
Reading in a file made by CMIE's Business Beacon program
Reading and writing both ascii files and binary files. Also, measure speed of these.
Sending an R data object to someone else, either in email or as a binary file.
Make a "zoo" object, for handling time-series data.
Exporting and importing data.
Reading .gz .bz2 files and URLs
Directly reading Microsoft Excel files
Graphs
A grid of multiple pictures on one screen
Making PDF files that go into books/papers
A histogram with tails in red
z=f(x,y) using contour lines and colours
Show recessions using filled colour in a macro time-series plot
Plotting two series on one graph, one with a left y axis and another with a right y axis.
Probability and statistics
Tables, joint and marginal distributions
`Moving window' standard deviation
Quartiles/deciles tables/graphs.
[ Requires this data file. ]
Distribution of sample mean and sample median
The bootstrap
[ Notes on boot() ]
Doing MLE with your own likelihood function
[ Notes on MLE ]
The strange Cauchy distribution
An example of simulation-based inference
Four standard operations with standard distributions
Two CDFs and a two-sample Kolmogorov-Smirnoff test
Simulation to measure size and power of a test
Regression
Doing OLS
Dummy variables in regression
Generate latex tables of OLS results
`Least squares dummy variable' (LSDV) or `fixed effects' model
Estimate beta of Sun Microsystems using data from Yahoo finance: Elaborate version, Terse version.
Nonlinear regression
Standard tests
Using orthogonal polynomials
A function that takes a model specification as an argument
Time-series analysis
ARMA estimation, diagnostics, forecasting
All these examples in one tarfile.
Outright non-working code is unlikely, though occasionally my fingers fumble or code-rot occurs. More importantly, I am not an R guru. So I will always be happy if you alert me to mistakes or improvements. Please do also send me requests for things that ought to be on this page and aren't (ideally with the code!).
Other useful materials
How to install R (for Debian Gnu/Linux only)
Running R
Getting started with the `boot' package in R for bootstrap inference
Writing your own likelihood function
R for economists
Things to be careful about
I use this `tutorial program' for teaching R. Ana Nelson used her neat roux program to create a prettyprinted terminal session of running tutorial.R.
If you're going to write any R, you are going to need style guidelines. Here are two choices: from google and from Henrik Bengtsson.
Suggestions for learning R
The R project is at http://www.r-project.org : In particular, see the `other docs' there.
Over and above the strong set of functions that you get in `off the shelf' R, there is a concept like CPAN (of the perl world) or CTAN (of the tex world), where there is a large, well-organised collection of 3rd party software, written by people all over the world. This is called `CRAN' for Comprehensive R Archive Network. You will be amazed at the quality and comprehensiveness of the code out there.
The dynamism of R and of the surrounding 3rd party packages has thrown up the need for a newsletter, R News. You should make it a point to look hard at back issues. The standard online documents associated with R tend to be reference manuals targeting someone who already knows quite a bit. The articles in R News are very valuable in taking you from scratch to understanding R. As an example, you can certainly learn using the online documents on the boot() package, by saying:
library(help=boot)
library(boot)
?boot
But you will learn a lot more by reading the article Resampling Methods in R: The boot package by Angelo J. Canty, which appeared in the December 2002 issue of R News.
Ajay Shah, 2005
0 notes
oditile · 8 years ago
Text
Programming in R
From: http://manuals.bioinformatics.ucr.edu/home/programming-in-r
Author: Thomas Girke, UC Riverside
Contents
1 Introduction
2 R Basics
3 Code Editors for R
4 Integrating R with Vim and Tmux
5 Finding Help
6 Control Structures
   6.1 Conditional Executions
      6.1.1 Comparison Operators
      6.1.2 Logical Operators
      6.1.3 If Statements
      6.1.4 Ifelse Statements
   6.2 Loops
      6.2.1 For Loop
      6.2.2 While Loop
      6.2.3 Apply Loop Family
         6.2.3.1 For Two-Dimensional Data Sets: apply
         6.2.3.2 For Ragged Arrays: tapply
         6.2.3.3 For Vectors and Lists: lapply and sapply
      6.2.4 Other Loops
      6.2.5 Improving Speed Performance of Loops
7 Functions
8 Useful Utilities
   8.1 Debugging Utilities
   8.2 Regular Expressions
   8.3 Interpreting Character String as Expression
   8.4 Time, Date and Sleep
   8.5 Calling External Software with System Command
   8.6 Miscellaneous Utilities
9 Running R Programs
10 Object-Oriented Programming (OOP)
   10.1 Define S4 Classes
   10.2 Assign Generics and Methods
11 Building R Packages
12 Reproducible Research by Integrating R with Latex or Markdown
13 R Programming Exercises
   13.1 Exercise Slides
   13.2 Sample Scripts
      13.2.1 Batch Operations on Many Files
      13.2.2 Large-scale Array Analysis
      13.2.3 Graphical Procedures: Feature Map Example
      13.2.4 Sequence Analysis Utilities
      13.2.5 Pattern Matching and Positional Parsing of Sequences
      13.2.6 Identify Over-Represented Strings in Sequence Sets
      13.2.7 Translate DNA into Protein
      13.2.8 Subsetting of Structure Definition Files (SDF)
      13.2.9 Managing Latex BibTeX Databases
      13.2.10 Loan Payments and Amortization Tables
      13.2.11 Course Assignment: GC Content, Reverse & Complement
14 Translation of this Page
Introduction
[ Slides ]   [ R Code ]
General Overview
One of the main attractions of using the R (http://cran.at.r-project.org) environment is the ease with which users can write their own programs and custom functions. The R programming syntax is extremely easy to learn, even for users with no previous programming experience. Once the basic R programming control structures are understood, users can use the R language as a powerful environment to perform complex custom analyses of almost any type of data.
Format of this Manual
In this manual all commands are given in code boxes, where the R code is printed in black, the comment text in blue and the output generated by R in green. All comments/explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way the content in the code boxes can be pasted with their comment text into the R console to evaluate their utility. Occasionally, several commands are printed on one line and separated by a semicolon ';'. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them.
R Basics
The R & BioConductor manual provides a general introduction to the usage of the R environment and its basic command syntax.
Code Editors for R
Several excellent code editors are available that provide functionalities like R syntax highlighting, auto code indenting and utilities to send code/functions to the R console.
Basic code editors provided by R GUIs
RStudio: GUI-based IDE for R
Vim-R-Tmux: R working environment based on vim and tmux
Emacs (ESS add-on package)
gedit and Rgedit
RKWard
Eclipse
Tinn-R
Notepad++ (NppToR)
Screenshots: Programming in R using Vim or Emacs, and programming in R using RStudio.
Integrating R with Vim and Tmux
Users interested in integrating R with vim and tmux may want to consult the Vim-R-Tmux configuration page.
Finding Help
Reference list on R programming (selection)
R Programming for Bioinformatics, by Robert Gentleman
Advanced R, by Hadley Wickham
S Programming, by W. N. Venables and B. D. Ripley
Programming with Data, by John M. Chambers
R Help & R Coding Conventions, Henrik Bengtsson, Lund University
Programming in R (Vincent Zoonekynd)
Peter's R Programming Pages, University of Warwick
Rtips, Paul Johnsson, University of Kansas
R for Programmers, Norm Matloff, UC Davis
High-Performance R, Dirk Eddelbuettel tutorial presented at useR-2008
C/C++ level programming for R, Gopi Goswami
Control Structures
Conditional Executions
Comparison Operators
equal: ==
not equal: !=
greater/less than: > <
greater/less than or equal: >= <=
Logical Operators
and: &
or: |
not: !
If Statements
If statements operate on length-one logical vectors.
Syntax
if(condition) { cmd1 } else { cmd2 }
Example
if(1==0) {
   print(1)
} else {
   print(2)
}
[1] 2
Avoid inserting newlines between '} else'.
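A small illustration of why that matters when pasting code into the console (this sketch is an addition for clarity, not part of the original manual):
# This fails at the top level, because the parser treats the first line as a
# complete if statement before it ever sees the 'else':
#   if(1==0) { print(1) }
#   else { print(2) }    # Error: unexpected 'else'
# Keeping '} else {' on one line, or wrapping the whole construct in braces or a
# function body, avoids the problem:
if(1==0) { print(1) } else { print(2) }
[1] 2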
Ifelse Statements
Ifelse statements operate on vectors of variable length.
Syntax
ifelse(test, true_value, false_value)
Example
x <- 1:10 # Creates sample data
ifelse(x<5 | x>8, x, 0)
[1]  1  2  3  4  0  0  0  0  9 10
Loops
The most commonly used loop structures in R are for, while and apply loops. Less common are repeat loops. The break function is used to break out of loops, and next halts the processing of the current iteration and advances the looping index.
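A brief sketch (added for illustration) of break and next before the individual loop types are covered below:
for(i in 1:10) {
   if(i == 3) next   # skip the rest of this iteration and continue with i = 4
   if(i > 5) break   # leave the loop entirely
   print(i)
}
[1] 1
[1] 2
[1] 4
[1] 5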
For Loop
For loops are controlled by a looping vector. In every iteration of the loop one value in the looping vector is assigned to a variable that can be used in the statements of the body of the loop. Usually, the number of loop iterations is defined by the number of values stored in the looping vector and they are processed in the same order as they are stored in the looping vector.
Syntax
for(variable in sequence) {    statements }
Example
mydf <- iris
myve <- NULL # Creates empty storage container
for(i in seq(along=mydf[,1])) {
   myve <- c(myve, mean(as.numeric(mydf[i, 1:3]))) # Note: the inject approach is much faster than appending with 'c'. See below for details.
}
myve
[1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
[9] 2.900000 3.166667 3.533333 3.266667 3.066667 2.800000 3.666667 3.866667
Example: conditional statement inside a loop
x <- 1:10
z <- NULL
for(i in seq(along=x)) {
   if(x[i] < 5) {
       z <- c(z, x[i] - 1)
   } else {
       z <- c(z, x[i] / x[i])
   }
}
z
[1] 0 1 2 3 1 1 1 1 1 1
Example: stop on condition and print error message
x <- 1:10
z <- NULL
for(i in seq(along=x)) {
   if (x[i]<5) {
       z <- c(z, x[i]-1)
   } else {
       stop("values need to be <5")
   }
}
Error: values need to be <5
z
[1] 0 1 2 3
While Loop
Similar to for loops, but the iterations are controlled by a conditional statement.
Syntax
while(condition) statements
Example
z <- 0
while(z < 5) {
   z <- z + 2
   print(z)
}
[1] 2
[1] 4
[1] 6
Apply Loop Family
For Two-Dimensional Data Sets: apply
Syntax
apply(X, MARGIN, FUN, ARGs)
X: array, matrix or data.frame; MARGIN: 1 for rows, 2 for columns, c(1,2) for both; FUN: one or more functions; ARGs: possible arguments for function
Example
## Example for applying predefined mean function
apply(iris[,1:3], 1, mean)
[1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
...
## With custom function
x <- 1:10
test <- function(x) { # Defines some custom function
   if(x < 5) {
       x-1
   } else {
       x / x
   }
}
apply(as.matrix(x), 1, test) # Returns same result as previous for loop
[1] 0 1 2 3 1 1 1 1 1 1
## Same as above but with a single line of code
apply(as.matrix(x), 1, function(x) { if (x<5) { x-1 } else { x/x } })
[1] 0 1 2 3 1 1 1 1 1 1
For Ragged Arrays: tapply
Applies a function to array categories of variable lengths (ragged array). Grouping is defined by factor.
Syntax
tapply(vector, factor, FUN)
Example
## Computes mean values of vector aggregates defined by factor
tapply(as.vector(iris[,4]), factor(iris[,5]), mean)
    setosa versicolor  virginica
     0.246      1.326      2.026
## The aggregate function provides related utilities
aggregate(iris[,1:4], list(iris$Species), mean)
     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
For Vectors and Lists: lapply and sapply
Both apply a function to vector or list objects. The function lapply returns a list, while sapply attempts to return the simplest data object, such as vector or matrix instead of list.
Syntax
lapply(X, FUN)
sapply(X, FUN)
Example
## Creates a sample list
mylist <- as.list(iris[1:3,1:3])
mylist
$Sepal.Length
[1] 5.1 4.9 4.7
$Sepal.Width
[1] 3.5 3.0 3.2
$Petal.Length
[1] 1.4 1.4 1.3
## Compute sum of each list component and return result as list
lapply(mylist, sum)
$Sepal.Length
[1] 14.7
$Sepal.Width
[1] 9.7
$Petal.Length
[1] 4.1
## Compute sum of each list component and return result as vector
sapply(mylist, sum)
Sepal.Length  Sepal.Width Petal.Length
        14.7          9.7          4.1
Other Loops
Repeat Loop
Syntax
repeat statements
Loop is repeated until a break is specified. This means there needs to be a second statement to test whether or not to break from the loop.
Example
z <- 0
repeat {
   z <- z + 1
   print(z)
   if(z > 100) break()
}
Improving Speed Performance of Loops
Looping over very large data sets can become slow in R. However, this limitation can be overcome by eliminating certain operations in loops or avoiding loops over the data-intensive dimension in an object altogether. The latter can be achieved by performing mainly vector-to-vector or matrix-to-matrix computations, which often run over 100 times faster than the corresponding for() or apply() loops in R. For this purpose, one can make use of the existing speed-optimized R functions (e.g. rowSums, rowMeans, table, tabulate), or one can design custom functions that avoid expensive R loops by using vector- or matrix-based approaches. Alternatively, one can write programs that will perform all time-consuming computations on the C-level.
(1) Speed comparison of for loops with an append versus an inject step:
myMA <- matrix(rnorm(1000000), 100000, 10, dimnames=list(1:100000, paste("C", 1:10, sep="")))
results <- NULL
system.time(for(i in seq(along=myMA[,1])) results <- c(results, mean(myMA[i,])))
   user  system elapsed
 39.156   6.369  45.559
results <- numeric(length(myMA[,1]))
system.time(for(i in seq(along=myMA[,1])) results[i] <- mean(myMA[i,]))
   user  system elapsed
  1.550   0.005   1.556
The inject approach is 20-50 times faster than the append version.
(2) Speed comparison of an apply loop versus rowMeans for computing the mean for each row in a large matrix:
system.time(myMAmean <- apply(myMA, 1, mean))
   user  system elapsed
  1.452   0.005   1.456
system.time(myMAmean <- rowMeans(myMA))
   user  system elapsed
  0.005   0.001   0.006
The rowMeans approach is over 200 times faster than the apply loop.
(3) Speed comparison of an apply loop versus a vectorized approach for computing the standard deviation of each row:
system.time(myMAsd <- apply(myMA, 1, sd))
   user  system elapsed
  3.707   0.014   3.721
myMAsd[1:4]
        1         2         3         4
0.8505795 1.3419460 1.3768646 1.3005428
system.time(myMAsd <- sqrt((rowSums((myMA-rowMeans(myMA))^2)) / (length(myMA[1,])-1)))
   user  system elapsed
  0.020   0.009   0.028
myMAsd[1:4]
        1         2         3         4
0.8505795 1.3419460 1.3768646 1.3005428
The vector-based approach in the last step is over 200 times faster than the apply loop.
(4) Example for computing the mean for any custom selection of columns without compromising the speed performance:
## In the following the columns are named according to their selection in myList
myList <- tapply(colnames(myMA), c(1,1,1,2,2,2,3,3,4,4), list)
myMAmean <- sapply(myList, function(x) rowMeans(myMA[,x]))
colnames(myMAmean) <- sapply(myList, paste, collapse="_")
myMAmean[1:4,]
    C1_C2_C3   C4_C5_C6       C7_C8     C9_C10
1  0.0676799 -0.2860392  0.09651984 -0.7898946
2 -0.6120203 -0.7185961  0.91621371  1.1778427
3  0.2960446 -0.2454476 -1.18768621  0.9019590
4  0.9733695 -0.6242547  0.95078869 -0.7245792

## Alternative to achieve the same result with similar performance, but in a much less elegant way
myselect <- c(1,1,1,2,2,2,3,3,4,4) # The columns are named according to the selection stored in myselect
myList <- tapply(seq(along=myMA[1,]), myselect, function(x) paste("myMA[ ,", x, "]", sep=""))
myList <- sapply(myList, function(x) paste("(", paste(x, collapse=" + "),")/", length(x)))
myMAmean <- sapply(myList, function(x) eval(parse(text=x)))
colnames(myMAmean) <- tapply(colnames(myMA), myselect, paste, collapse="_")
myMAmean[1:4,]
    C1_C2_C3   C4_C5_C6       C7_C8     C9_C10
1  0.0676799 -0.2860392  0.09651984 -0.7898946
2 -0.6120203 -0.7185961  0.91621371  1.1778427
3  0.2960446 -0.2454476 -1.18768621  0.9019590
4  0.9733695 -0.6242547  0.95078869 -0.7245792
Functions
A very useful feature of the R environment is the possibility to expand existing functions and to easily write custom functions. In fact, most of the R software can be viewed as a series of R functions.
Syntax to define functions
myfct <- function(arg1, arg2, ...) {
   function_body
}
The value returned by a function is the value of the function body, which is usually an unassigned final expression or a value passed explicitly with return().
Syntax to call functions
myfct(arg1=..., arg2=...)
Syntax Rules for Functions
General
Functions are defined by (1) assignment with the keyword function, (2) the declaration of arguments/variables (arg1, arg2, ...) and (3) the definition of operations (function_body) that perform computations on the provided arguments. A function name needs to be assigned to call the function (see below).
Naming
Function names can be almost anything. However, the usage of names of existing functions should be avoided.
Arguments
It is often useful to provide default values for arguments (e.g. arg1=1:10). This way they don't need to be provided in a function call. The argument list can also be left empty (myfct <- function() { fct_body }) when a function is expected to always return the same value(s). The argument '...' can be used to allow one function to pass on argument settings to another (see the short sketch after the Scope item below).
Function body
The actual expressions (commands/operations) are defined in the function body which should be enclosed by braces. The individual commands are separated by semicolons or new lines (preferred).
Calling functions
Functions are called by their name followed by parentheses containing the argument values. Empty parentheses after the function name will result in an error message if the function requires certain arguments to be provided by the user. The function name alone will print the definition of a function.
Scope
Variables created inside a function exist only for the lifetime of the function call. Thus, they are not accessible outside of the function. To force variables in functions to exist globally, one can use the special assignment operator '<<-'. If a global variable is used in a function, a local variable of the same name will mask it only within the function.
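A small sketch of the scoping behavior described above (variable names are arbitrary):
x <- 10                      # global variable
f <- function() {
    x <- 99                  # local 'x' masks the global 'x' only inside the function
    y <<- 5                  # '<<-' assigns to the global environment
    x
}
f()                          # returns 99
x                            # still 10; the global value was not changed
y                            # 5; created globally by '<<-'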
Example:
Function basics
myfct <- function(x1, x2=5) {
    z1 <- x1/x1
    z2 <- x2*x2
    myvec <- c(z1, z2)
    return(myvec)
}
myfct             # prints definition of function
myfct(x1=2, x2=5) # applies function to values 2 and 5
[1]  1 25
myfct(2, 5)       # the argument names are not necessary, but then the order of the specified values becomes important
myfct(x1=2)       # does the same as before, but the default value '5' is used in this case
Example:
Function with optional arguments
myfct2 <- function(x1=5, opt_arg) {
    if(missing(opt_arg)) { # 'missing()' is used to test whether a value was specified as an argument
        z1 <- 1:10
    } else {
        z1 <- opt_arg
    }
    cat("my function returns:", "\n")
    return(z1/x1)
}
myfct2(x1=5) # performs calculation on default vector (z1) that is defined in the function body
my function returns:
 [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
myfct2(x1=5, opt_arg=30:20) # a custom vector is used instead when the optional argument (opt_arg) is specified
my function returns:
 [1] 6.0 5.8 5.6 5.4 5.2 5.0 4.8 4.6 4.4 4.2 4.0
Control utilities for functions: return, warning and stop
Return
The evaluation flow of a function may be terminated at any stage with the return function. This is often used in combination with conditional evaluations.
Stop
To stop the action of a function and print an error message, one can use the stop function.
Warning
To print a warning message in unexpected situations without aborting the evaluation flow of a function, one can use the function warning("...").
myfct <- function(x1) {
    if (x1>=0) print(x1) else stop("This function did not finish, because x1 < 0")
    warning("Value needs to be > 0")
}
myfct(x1=2)
[1] 2
Warning message:
In myfct(x1 = 2) : Value needs to be > 0
myfct(x1=-2)
Error in myfct(x1 = -2) : This function did not finish, because x1 < 0
Useful Utilities
Debugging Utilities
Several debugging utilities are available for R. The most important ones are traceback(), browser(), options(error=recover), options(error=NULL) and debug(). The Debugging in R page provides an overview of the available resources.
Regular Expressions
R's regular expression utilities work similarly to those in other scripting languages. To learn how to use them in R, one can consult the main help page on this topic with ?regexp. The following gives a few basic examples.
The grep function can be used for finding patterns in strings, here letter A in vector month.name.
month.name[grep("A", month.name)]
[1] "April"  "August"
Example for using regular expressions to substitute a pattern by another one using the sub/gsub function with a back reference. Remember: single escapes '\' need to be double escaped '\\' in R.
gsub("(i.*a)", "xxx_\\1", "virginica", perl = TRUE) [1] "vxxx_irginica"
Example for split and paste functions
x <- gsub("(a)", "\\1_", month.name[1], perl=TRUE) # performs substitution with back reference, which inserts in this example a '_' character
x
[1] "Ja_nua_ry"
strsplit(x, "_") # splits string on inserted character from above
[[1]]
[1] "Ja"  "nua" "ry"
paste(rev(unlist(strsplit(x, NULL))), collapse="") # reverses character string by splitting first all characters into vector fields and then collapsing them with paste
[1] "yr_aun_aJ"
Example for importing specific lines in a file with a regular expression. The following example demonstrates the retrieval of specific lines from an external file with a regular expression. First, an external file is created with the cat function; all lines of this file are imported into a vector with readLines; the specific elements (lines) are then retrieved with the grep function; and the resulting lines are split into vector fields with strsplit.
cat(month.name, file="zzz.txt", sep="\n")
x <- readLines("zzz.txt")
x <- x[c(grep("^J", as.character(x), perl = TRUE))]
t(as.data.frame(strsplit(x, "u")))
                [,1]  [,2]
c..Jan....ary.. "Jan" "ary"
c..J....ne..    "J"   "ne"
c..J....ly..    "J"   "ly"
Interpreting Character String as Expression
Example
mylist <- ls()                # generates vector of object names in the current session
mylist[1]                     # prints the name of the 1st entry in the vector, but does not execute it as an expression that returns the values of that object
get(mylist[1])                # uses the 1st entry name in the vector and executes it as an expression
eval(parse(text=mylist[1]))   # alternative approach to obtain a similar result
Time, Date and Sleep
Example
system.time(ls()) # returns CPU (and other) times that an expression used, here ls()
   user  system elapsed
      0       0       0
date() # returns the current system date and time
[1] "Wed Dec 11 15:31:17 2012"
Sys.sleep(1) # pauses execution of R expressions for a given number of seconds (e.g. in a loop)
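As a complementary sketch (not part of the original example), Sys.time() can be used to measure elapsed wall-clock time around arbitrary code:
t1 <- Sys.time()                          # record start time
Sys.sleep(2)                              # placeholder for a longer computation
t2 <- Sys.time()
difftime(t2, t1, units="secs")            # elapsed time as a difftime object (~2 secs here)
format(Sys.time(), "%Y-%m-%d %H:%M:%S")   # current date/time as a formatted string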
Calling External Software with the System Command
The system command allows one to call any command-line software from within R on Linux, UNIX and OS X systems.
system("...") # provide under '...' command to run external software e.g. Perl, Python, C++ programs
Related utilities on Windows operating systems
x <- shell("dir", intern=T) # reads current working directory and assigns to file shell.exec("C:/Documents and Settings/Administrator/Desktop/my_file.txt") # opens file with associated program
Miscellaneous Utilities
(1)
Batch import and export of many files.
In the following example all file names ending with *.txt in the current directory are first assigned to a list (the '$' sign is used to anchor the match to the end of a string). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the assign function. Consult help with ?read.table to understand arguments row.names=1 and comment.char = "A". Third, the data frames are exported using their names for file naming and appending the extension *.out.
files <- list.files(pattern=".txt$")
for(i in files) {
    x <- read.table(i, header=TRUE, comment.char = "A", sep="\t")
    assign(i, x)
    print(i)
    write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
}
(2)
Running Web Applications (basics of designing web client/crawling/scraping scripts in R)
Example for obtaining MW values for peptide sequences from the EXPASY pI/MW Tool web page.
myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")
myresult <- NULL
for(i in myentries) {
    myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", i, "&resolution=monoisotopic", sep="")
    x <- url(myurl)
    res <- readLines(x)
    close(x)
    mylines <- res[grep("Theoretical pI/Mw:", res)]
    myresult <- c(myresult, as.numeric(gsub(".*/ ", "", mylines)))
    print(myresult)
    Sys.sleep(1) # halts process for one second to give the web service a break
}
final <- data.frame(Pep=myentries, MW=myresult)
cat("\n The MW values for my peptides are:\n")
final
                 Pep      MW
1 MKWVTFISLLFLFSSAYS 2139.11
2          MWVTFISLL 1108.60
3     MFISLLFLFSSAYS 1624.82
Running R Programs
(1)
Executing an R script from the R console
source("my_script.R")
(2.1)
Syntax for running R programs from the command-line. Executing the script directly (./my_script.R) requires the following statement in the first line of my_script.R: #!/usr/bin/env Rscript
$ Rscript my_script.R # or just ./myscript.R after making file executable with 'chmod +x my_script.R'
All commands starting with a '$' sign need to be executed from a Unix or Linux shell.
(2.2)
Alternatively, one can use the following syntax to run R programs in BATCH mode from the command-line.
$ R CMD BATCH [options] my_script.R [outfile]
The output file lists the commands from the script file together with their outputs. If no outfile is specified, the name of the input file with '.Rout' appended is used. To stop all the usual R command-line information from being written to the outfile, add options(echo=FALSE) as the first line of the my_script.R file. If the command is run as R CMD BATCH --no-save my_script.R, nothing will be saved in the .RData file, which can often get very large. More on this can be found on the help pages: $ R CMD BATCH --help or ?BATCH.
(2.3)
Another alternative for running R programs as silently as possible.
$ R --slave < my_infile > my_outfile
Argument --slave makes R run as 'quietly' as possible.
(3)
Passing Command-Line Arguments to R Programs
Create an R script, here named test.R, like this one:
######################
myarg <- commandArgs()
print(iris[1:myarg[6], ])
######################
Then run it from the command-line like this:
$ Rscript test.R 10
In the given example the number 10 is passed on from the command-line as an argument to the R script, which is used to return to STDOUT the first 10 rows of the iris sample data. If several arguments are provided, they will be interpreted as one string that needs to be split in R with the strsplit function.
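A commonly used, slightly cleaner variant is commandArgs(trailingOnly=TRUE), which returns only the user-supplied arguments; the following sketch (a hypothetical test2.R, not part of the original example) illustrates this:
## test2.R (illustrative variant of test.R)
args <- commandArgs(trailingOnly=TRUE)   # only the arguments given after the script name
n <- as.numeric(args[1])                 # first user argument, converted to a number
print(iris[1:n, ])
## Run from the shell with: Rscript test2.R 10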
(4)
Submitting R script to a Linux cluster via Torque
Create the following shell script my_script.sh
#################################
#!/bin/bash
cd $PBS_O_WORKDIR
R CMD BATCH --no-save my_script.R
#################################
This script doesn't need to have executable permissions. Use the following qsub command to send this shell script to the Linux cluster from the directory where the R script my_script.R is located. To utilize several CPUs on the Linux cluster, one can divide the input data into several smaller subsets and execute a separate process for each subset from a dedicated directory (a rough sketch of this approach is given after the qsub command below).
$ qsub my_script.sh
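The subset-and-submit strategy mentioned above can itself be scripted from within R. The following is only a rough sketch; it assumes, for illustration, that the input lives in a file data.txt and that my_script.R/my_script.sh read their input from the working directory (all file and directory names here are hypothetical):
mydata <- read.delim("data.txt")                                         # full input data set (hypothetical file)
chunks <- split(seq_len(nrow(mydata)), cut(seq_len(nrow(mydata)), 4))    # four roughly equal row subsets
for(i in seq(along=chunks)) {
    jobdir <- paste("job", i, sep="_")
    dir.create(jobdir)
    write.table(mydata[chunks[[i]], ], file=file.path(jobdir, "data.txt"), sep="\t", quote=FALSE)
    file.copy(c("my_script.R", "my_script.sh"), jobdir)                  # each job directory gets its own copies
    system(paste("cd", jobdir, "&& qsub my_script.sh"))                  # submit one job per subset
}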
Here is a short R script that generates the required files and directories automatically and submits the jobs to the nodes: submit2cluster.R. For more details, see also the 'Tutorial on Parallel Programming in R' by Hanna Sevcikova.
(5)
Submitting jobs to Torque or any other queuing/scheduling system via the BatchJobs package. This package provides one of the most advanced resources for submitting jobs to queuing systems from within R. A related package is BiocParallel from Bioconductor, which extends many functionalities of BatchJobs to genome data analysis. Useful documentation for BatchJobs: Technical Report, GitHub page, Slide Show, Config samples.
library(BatchJobs)
loadConfig(conffile = ".BatchJobs.R")
    ## Loads configuration file. Here .BatchJobs.R contains just this line:
    ## cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")
    ## The template file torque.tmpl is expected to be in the current working
    ## directory. It can be downloaded from here:
    ## https://github.com/tudo-r/BatchJobs/blob/master/examples/cfTorque/simple.tmpl
getConfig() # Returns BatchJobs configuration settings
reg <- makeRegistry(id="BatchJobTest", work.dir="results")
    ## Constructs a registry object. Output files from R will be stored under directory "results",
    ## while the standard objects from BatchJobs will be stored in the directory "BatchJobTest-files".
print(reg)
## Some test function
f <- function(x) {
    system("ls -al >> test.txt")
    x
}
## Adds jobs to registry object (here reg)
ids <- batchMap(reg, fun=f, 1:10)
print(ids)
showStatus(reg)
## Submit jobs or chunks of jobs to batch system via cluster function
done <- submitJobs(reg, resources=list(walltime=3600, nodes="1:ppn=4", memory="4gb"))
## Load results from BatchJobTest-files/jobs/01/1-result.RData
loadResult(reg, 1)
Object-Oriented Programming (OOP)
R supports two main systems for object-oriented programming (OOP): the older S3 system and the more recently introduced S4 system. The latter is more formal and supports multiple inheritance, multiple dispatch and introspection; many of these features are not available in the older S3 system. In general, the OOP approach taken by R separates the class specifications from the specifications of generic functions (a function-centric system). The following introduction is restricted to the S4 system, since it is nowadays the preferred OOP method for R; a minimal S3 example is sketched below for contrast. More information about OOP in R can be found in the following introductions: Vincent Zoonekynd's introduction to S3 Classes, S4 Classes in 15 pages, Christophe Genolini's S4 Intro, The R.oo package, BioC Course: Advanced R for Bioinformatics, Programming with R by John Chambers, and R Programming for Bioinformatics by Robert Gentleman.
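For contrast, a minimal S3 sketch (names invented for illustration); in S3, a class is just an attribute on an ordinary object and methods are plain functions named generic.class:
obj <- list(a=1:5)                        # an ordinary R object
class(obj) <- "myS3class"                 # in S3 the class is simply an attribute
print.myS3class <- function(x, ...) {     # an S3 method is a function named generic.class
    cat("An S3 object with", length(x$a), "values in element 'a'\n")
}
obj                                       # auto-printing dispatches to print.myS3class()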
Define S4 Classes
(A)
Define S4 Classes with
setClass() and new()
y <- matrix(1:50, 10, 5) # Sample data set
setClass(Class="myclass",
    representation=representation(a="ANY"),
    prototype=prototype(a=y[1:2,]), # Defines default value (optional)
    validity=function(object) { # Can be defined in a separate step using setValidity
        if(class(object@a)!="matrix") {
            return(paste("expected matrix, but obtained", class(object@a)))
        } else {
            return(TRUE)
        }
    }
)
The setClass function defines classes. Its most important arguments are
Class: the name of the class
representation: the slots that the new class should have and/or other classes that this class extends.
prototype: an object providing default data for the slots.
contains: the classes that this class extends.
validity, access, version: control arguments included for compatibility with S-Plus.
where: the environment to use to store or remove the definition as meta data.
(B)
The function new creates an instance of a class (here myclass)
myobj <- new("myclass", a=y) myobj An object of class "myclass" Slot "a":      [,1] [,2] [,3] [,4] [,5] [1,]    1   11   21   31   41 [2,]    2   12   22   32   42 ... new("myclass", a=iris) # Returns an error message due to wrong input type (iris is data frame) Error in validObject(.Object) :  invalid class “myclass” object: expected matrix, but obtained data.frame
Its arguments are:
Class: the name of the class
...: Data to include in the new object with arguments according to slots in class definition.
(C)
A more generic way of creating class instances is to define an initialization method (details below)
setMethod("initialize", "myclass", function(.Object, a) {    .Object@a <- a/a    .Object }) new("myclass", a = y) [1] "initialize" new("myclass", a = y)> new("myclass", a = y) An object of class "myclass" Slot "a":      [,1] [,2] [,3] [,4] [,5] [1,]    1    1    1    1    1 [2,]    1    1    1    1    1 ...
(D)
Usage and helper functions
myobj@a # The '@' extracts the contents of a slot. Usage should be limited to internal functions!
initialize(.Object=myobj, a=as.matrix(cars[1:3,])) # Creates a new S4 object from an old one.
# removeClass("myclass") # Removes object from current session; does not apply to associated methods.
(E)
Inheritance: allows one to define new classes that inherit all properties (e.g. data slots, methods) from their existing parent classes
setClass("myclass1", representation(a = "character", b = "character")) setClass("myclass2", representation(c = "numeric", d = "numeric")) setClass("myclass3", contains=c("myclass1", "myclass2")) new("myclass3", a=letters[1:4], b=letters[1:4], c=1:4, d=4:1) An object of class "myclass3" Slot "a": [1] "a" "b" "c" "d" Slot "b": [1] "a" "b" "c" "d" Slot "c": [1] 1 2 3 4 Slot "d": [1] 4 3 2 1 getClass("myclass1") Class "myclass1" [in ".GlobalEnv"] Slots:                           Name:          a         b Class: character character Known Subclasses: "myclass3" getClass("myclass2") Class "myclass2" [in ".GlobalEnv"] Slots:                       Name:        c       d Class: numeric numeric Known Subclasses: "myclass3" getClass("myclass3") Class "myclass3" [in ".GlobalEnv"] Slots:                                               Name:          a         b         c         d Class: character character   numeric   numeric Extends: "myclass1", "myclass2"
The argument contains allows one to extend existing classes; this propagates all slots of the parent classes to the new class.
(F)
Coerce objects to another class
setAs(from="myclass", to="character", def=function(from) as.character(as.matrix(from@a))) as(myobj, "character") [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" ...
(G)
Virtual classes are constructs for which no instances will be or can be created. They are used to link together classes which may have distinct representations (e.g. cannot inherit from each other) but for which one wants to provide similar functionality. Often it is desired to create a virtual class and to then have several other classes extend it. Virtual classes can be defined by leaving out the representation argument or including the class VIRTUAL:
setClass("myVclass") setClass("myVclass", representation(a = "character", "VIRTUAL"))
(H)
Functions to introspect classes
getClass("myclass")
getSlots("myclass")
slotNames("myclass")
extends("myclass2")
Assign Generics and Methods
Assign generics and methods with setGeneric() and setMethod().
(A)
Accessor function (to avoid usage of '@')
setGeneric(name="acc", def=function(x) standardGeneric("acc"))    setMethod(f="acc", signature="myclass", definition=function(x) {    return(x@a) }) acc(myobj)      [,1] [,2] [,3] [,4] [,5] [1,]    1   11   21   31   41 [2,]    2   12   22   32   42 ...
(B.1)
Replacement method using custom accessor function (acc <-)
setGeneric(name="acc<-", def=function(x, value) standardGeneric("acc<-")) setReplaceMethod(f="acc", signature="myclass", definition=function(x, value) {    x@a <- value    return(x) }) ## After this the following replace operations with 'acc' work on new object class acc(myobj)[1,1] <- 999 # Replaces first value colnames(acc(myobj)) <- letters[1:5] # Assigns new column names rownames(acc(myobj)) <- letters[1:10] # Assigns new row names myobj An object of class "myclass" Slot "a":    a  b  c  d  e a 999 11 21 31 41 b   2 12 22 32 42 ...
(B.2)
Replacement method using "[" operator ([<-)
setReplaceMethod(f="[", signature="myclass", definition=function(x, i, j, value) {    x@a[i,j] <- value    return(x) }) myobj[1,2] <- 999 myobj An object of class "myclass" Slot "a":    a   b  c  d  e a 999 999 21 31 41 b   2  12 22 32 42 ...
(C)
Define behavior of "[" subsetting operator (no generic required!)
setMethod(f="[", signature="myclass",    definition=function(x, i, j, ..., drop) {    x@a <- x@a[i,j]    return(x) }) myobj[1:2,] # Standard subsetting works now on new class An object of class "myclass" Slot "a":    a   b  c  d  e a 999 999 21 31 41 b   2  12 22 32 42 ...
(D)
Define print behavior
setMethod(f= show", signature="myclass", definition=function(object) {    cat("An instance of ", "\"", class(object), "\"", " with ", length(acc(object)[,1]), " elements", "\n", sep="")    if(length(acc(object)[,1])>=5) {        print(as.data.frame(rbind(acc(object)[1:2,], ...=rep("...", length(acc(object)[1,])),        acc(object)[(length(acc(object)[,1])-1):length(acc(object)[,1]),])))    } else {        print(acc(object)) }}) myobj # Prints object with custom method An instance of "myclass" with 10 elements      a   b   c   d   e a   999 999  21  31  41 b     2  12  22  32  42 ... ... ... ... ... ... i     9  19  29  39  49 j    10  20  30  40  50
(E)
Define a data specific function (here randomize row order)
setGeneric(name="randomize", def=function(x) standardGeneric("randomize")) setMethod(f="randomize", signature="myclass", definition=function(x) {    acc(x)[sample(1:length(acc(x)[,1]), length(acc(x)[,1])), ] }) randomize(myobj)    a   b  c  d  e j  10  20 30 40 50 b   2  12 22 32 42 ...
(F)
Define a graphical plotting function and allow user to access it with generic plot function
setMethod(f="plot", signature="myclass", definition=function(x, ...) {    barplot(as.matrix(acc(x)), ...) }) plot(myobj)
(G)
Functions to inspect methods
showMethods(class="myclass")
findMethods("randomize")
getMethod("randomize", signature="myclass")
existsMethod("randomize", signature="myclass")
Building R Packages
To get familiar with the structure, building and submission process of R packages, users should carefully read the documentation on this topic available on these sites:
Writing R Extensions, R web site
R Packages, by Hadley Wickham
R Package Primer, by Karl Broman
Package Guidelines, Bioconductor
Advanced R Programming Class, Bioconductor
Short Overview of Package Building Process
(A)
Automatic package building with the package.skeleton function:
package.skeleton(name="mypackage", code_files=c("script1.R", "script2.R"))
Note: this is an optional but very convenient function to get started with a new package. The given example will create a directory named mypackage containing the skeleton of the package for all functions, methods and classes defined in the R script(s) passed on to the code_files argument. The basic structure of the package directory is described here. The package directory will also contain a file named 'Read-and-delete-me' with the following instructions for completing the package:
Edit the help file skeletons in man, possibly combining help files for multiple functions.
Edit the exports in NAMESPACE, and add necessary imports.
Put any C/C++/Fortran code in src.
If you have compiled code, add a useDynLib() directive to NAMESPACE.
Run R CMD build to build the package tarball.
Run R CMD check to check the package tarball.
Read Writing R Extensions for more information.
(B)
Once a package skeleton is available one can build the package from the command-line (Linux/OS X):
$ R CMD build mypackage
This will create a tarball of the package with its version number encoded in the file name, e.g.: mypackage_1.0.tar.gz.
Subsequently, the package tarball needs to be checked for errors with:
$ R CMD check mypackage_1.0.tar.gz
All issues in a package's source code and documentation should be addressed until R CMD check returns no error or warning messages anymore.
(C)
Install package from source:
Linux:
install.packages("mypackage_1.0.tar.gz", repos=NULL)
OS X:
install.packages("mypackage_1.0.tar.gz", repos=NULL, type="source")
Windows requires a zip archive for installing R packages, which can be most conveniently created from the command-line (Linux/OS X) by installing the package in a local directory (here tempdir) and then creating a zip archive from the installed package directory:
$ mkdir tempdir
$ R CMD INSTALL -l tempdir mypackage_1.0.tar.gz
$ cd tempdir
$ zip -r mypackage mypackage
## The resulting mypackage.zip archive can be installed under Windows like this:
install.packages("mypackage.zip", repos=NULL)
This procedure only works for packages which do not rely on compiled code (C/C++). Instructions to fully build an R package under Windows can be found here and here.
(D)
Maintain/expand an existing package:
Add new functions, methods and classes to the script files in the ./R directory in your package
Add their names to the NAMESPACE file of the package
Additional *.Rd help templates can be generated with the prompt*() functions like this:
source("myscript.R") # imports functions, methods and classes from myscript.R prompt(myfct) # writes help file myfct.Rd promptClass("myclass") # writes file myclass-class.Rd promptMethods("mymeth") # writes help file mymeth.Rd
The resulting *.Rd help files can be edited in a text editor and properly rendered and viewed from within R like this:
library(tools)
Rd2txt("./mypackage/man/myfct.Rd") # renders *.Rd files as they look in the final help pages
checkRd("./mypackage/man/myfct.Rd") # checks *.Rd help file for problems
(E)
Submit package to a public repository
The best way of sharing an R package with the community is to submit it to one of the main R package repositories, such as CRAN or Bioconductor. The details about the submission process are given on the corresponding repository submission pages:
Submitting to Bioconductor (guidelines, submission, svn control, build/checks release, build/checks devel)
Submitting to CRAN
Reproducible Research by Integrating R with LaTeX or Markdown
See Sweave/Stangle sections of Slide Show for this manual
Sweave Manual
R Markdown
knitr manual
RStudio's Rpubs page
R Programming Exercises
Exercise Slides: [Slides]   [Exercises]   [Additional Exercises]
Download one of the above exercise files, then start editing this R source file with a programming text editor, such as Vim, Emacs or one of the R GUI text editors. Here is the HTML version of the code with syntax coloring.
Sample Scripts
Batch Operations on Many Files
## (1) Start R from an empty test directory
## (2) Create some files as sample data
for(i in month.name) {
    mydf <- data.frame(Month=month.name, Rain=runif(12, min=10, max=100), Evap=runif(12, min=1000, max=2000))
    write.table(mydf, file=paste(i , ".infile", sep=""), quote=F, row.names=F, sep="\t")
}

## (3) Import created files, perform calculations and export to renamed files
files <- list.files(pattern=".infile$")
for(i in seq(along=files)) { # start for loop with numeric or character vector; numeric vector is often more flexible
    x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
    x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) # calculates sum and mean for each data frame
    assign(files[i], x) # generates data frame object and names it after content in variable 'i'
    print(files[i], quote=F) # prints loop iteration to screen to check its status
    write.table(x, paste(files[i], c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
}

## (4) Same as above, but file naming by index data frame. This way one can organize file names by an external table.
name_df <- data.frame(Old_name=sort(files), New_name=sort(month.abb))
for(i in seq(along=name_df[,1])) {
    x <- read.table(as.vector(name_df[i,1]), header=TRUE, row.names=1, comment.char = "A", sep="\t")
    x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
    assign(as.vector(name_df[i,2]), x) # generates data frame object and names it after 'i' entry in column 2
    print(as.vector(name_df[i,1]), quote=F)
    write.table(x, paste(as.vector(name_df[i,2]), c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
}

## (5) Append content of all input files to one file
files <- list.files(pattern=".infile$")
all_files <- data.frame(files=NULL, Month=NULL, Gain=NULL , Loss=NULL, sum=NULL, mean=NULL) # creates empty data frame container
for(i in seq(along=files)) {
    x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
    x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
    x <- data.frame(file=rep(files[i], length(x[,1])), x) # adds file tracking column to x
    all_files <- rbind(all_files, x) # appends data from all files to data frame 'all_files'
    write.table(all_files, file="all_files.xls", quote=FALSE, sep="\t", col.names = NA)
}

## (6) Write the above code into a text file and execute it with the commands 'source' and 'BATCH'
source("my_script.R") # execute from R console
$ R CMD BATCH my_script.R # execute from shell
Large-scale Array Analysis
Sample script to perform large-scale expression array analysis with complex queries: lsArray.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/lsArray.R")
Graphical Procedures: Feature Map Example
Script to plot feature maps of genes or chromosomes: featureMap.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")
Sequence Analysis Utilities
Includes sequence batch import, sub-setting, pattern matching, AA composition, NEEDLE, PHYLIP, etc. The script 'sequenceAnalysis.R' demonstrates how R can be used as a powerful tool for managing and analyzing large sets of biological sequences. This example also shows how easy it is to integrate R with the EMBOSS project or other external programs. The script provides the following functionality:
Batch sequence import into R data frame
Motif searching with hit statistics
Analysis of sequence composition
All-against-all sequence comparisons
Generation of phylogenetic trees
To demonstrate the utilities of the script, users can simply execute it from R with the following source command:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sequenceAnalysis.txt")
Pattern Matching and Positional Parsing of Sequences
Functions for importing sequences into R, retrieving reverse and complement of nucleotide sequences, pattern searching, positional parsing and exporting search results in HTML format: patternSearch.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/patternSearch.R")
Identify Over-Represented Strings in Sequence Sets
Functions for finding over-represented words in sets of DNA, RNA or protein sequences: wordFinder.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/wordFinder.R")
Translate DNA into Protein
Script 'translateDNA.R' for translating NT sequences into AA sequences (requires a codon table). To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/translateDNA.R")
Subsetting of Structure Definition Files (SDF)
Script for importing and subsetting SDF files: sdfSubset.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sdfSubset.R")
Managing Latex BibTeX Databases
Script for importing BibTeX databases into R, retrieving the individual references with a full-text search function and viewing the results in R or in HubMed: BibTex.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/BibTex.R")
Loan Payments and Amortization Tables
This script calculates monthly and annual mortgage or loan payments, generates amortization tables and plots the results: mortgage.R. To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
Course Assignment: GC Content, Reverse & Complement
Apply the above information to write a function that calculates the GC content of a set of DNA sequences and generates their reverse and complement. Here are some useful commands that can be incorporated into this function:
## Generate an example data frame with ID numbers and DNA sequences
fx <- function(test) {
    x <- as.integer(runif(20, min=1, max=5))
    x[x==1] <- "A"; x[x==2] <- "T"; x[x==3] <- "G"; x[x==4] <- "C"
    paste(x, sep = "", collapse ="")
}
z1 <- c()
for(i in 1:50) {
    z1 <- c(fx(i), z1)
}
z1 <- data.frame(ID=seq(along=z1), Seq=z1)
z1

## Write each character of a sequence into a separate vector field and reverse its order
my_split <- strsplit(as.character(z1[1,2]),"")
my_rev <- rev(my_split[[1]])
paste(my_rev, collapse="")

## Generate the sequence complement by replacing G|C|A|T with C|G|T|A
## Use 'apply' or a 'for' loop to apply the above operations to all sequences in the sample data frame 'z1'
## Calculate in the same loop the GC content for each sequence using the following command
table(my_split[[1]])/length(my_split[[1]])
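As one possible hint for the complement step (not the official solution), chartr() maps one character set onto another and can be combined with the reversal shown above:
myseq <- as.character(z1[1,2])       # one of the sample sequences from 'z1' above
chartr("ATGC", "TACG", myseq)        # swaps A<->T and G<->C to give the complement
paste(rev(strsplit(chartr("ATGC", "TACG", myseq), "")[[1]]), collapse="") # reverse complement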
oditile · 8 years ago
Text
Data wrangling
From: http://www.r-exercises.com/2017/05/24/data-wrangling-part-1-io/?utm_source=rss&utm_medium=rss&utm_campaign=data-wrangling-part-1-io
Data wrangling : I/O (Part-1)
24 May 2017 by Vasileios Tsakalos Leave a Comment
Difficulty level: Not rated yet
Data wrangling is a task of great importance in data analysis: it is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst's time. In this series we will go through this process. It will be a brief series, with the goal of sharpening the reader's data wrangling skills. This is the first part of the series and it covers importing and exporting data locally. Please download the data from here
Before proceeding, it might be helpful to look over the help pages for the read.csv, read.table, read_excel, fromJSON, as.data.frame, xmlParse, xmlToList, write.csv, write.table, write.xlsx, toJSON, write, write.xml.
Moreover, please load the following libraries:
install.packages("readxl")
library(readxl)
install.packages("xlsx")
library(xlsx)
install.packages("rjson")
library(rjson)
install.packages("plyr")
library(plyr)
install.packages("XML")
library(XML)
install.packages("kulife")
library(kulife)
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Import the data.csv file to the csv_file object.
Exercise 2
Import the data.txt file to the txt_file object.
Exercise 3
Import the data.xls file to the xls_file object.
Exercise 4
Import the data.xlsx file to the xlsx_file object.
Learn more about Data Pre-Processing in the online course
R Data Pre-Processing & Data Management – Shape your Data!
. In this course you will learn how to:
import data into R in several ways while also being able to identify a suitable import tool
use SQL code within R
And much more
Exercise 5
Import the data.json file to the json_file object.
Exercise 6
Import the data.xml file to the xml_file object.
Exercise 7
Export the csv_file as “data.csv” and txt_file as “data.txt”.
Exercise 8
Export the xls_file as “data.xls” and xlsx_file as “data.xlsx”.
Exercise 9
Export the json_file as “data.json”.
Exercise 10
Export the xml_file as “data.xml”.
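For readers who only want to sanity-check their general approach before looking at the official solutions, here is a rough, untested sketch of the import/export calls using the functions listed above; the exact arguments may need adjusting to the downloaded files:
csv_file  <- read.csv("data.csv")
txt_file  <- read.table("data.txt", header=TRUE)
xls_file  <- read_excel("data.xls")                 # readxl handles both .xls and .xlsx
xlsx_file <- read_excel("data.xlsx")
json_file <- as.data.frame(fromJSON(file="data.json"))
xml_file  <- xmlToList(xmlParse("data.xml"))
write.csv(csv_file, "data.csv", row.names=FALSE)
write.table(txt_file, "data.txt", sep="\t", row.names=FALSE)
write.xlsx(xlsx_file, "data.xlsx")
write(toJSON(json_file), "data.json")
write.xml(as.data.frame(xml_file), file="data.xml") # write.xml() comes from the kulife package loaded above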
oditile · 8 years ago
Text
R vs Python
Taken from: https://gigadom.wordpress.com/2017/05/22/r-vs-python-different-similarities-and-similar-differences/
(This article was first published on   R – Giga thoughts …, and kindly contributed to R-bloggers)
R vs Python: Different similarities and similar differences
By Tinniam V Ganesh, gigadom.wordpress.com
May 22, 2017
rprogramming
A debate about which language is better suited for data science, R or Python, can set off diehard fans of these languages into a tizzy. This post tries to look at some of the different similarities and similar differences between these languages. To a large extent, the ease or difficulty of learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa; this probably depends on the degree of familiarity with the language. To a large extent, both R and Python do the same thing in just slightly different ways and syntaxes. The ease or difficulty of the R/Python constructs is largely in the 'eyes of the beholder', nay, the programmer, we could say. I include my own experience with the languages below.
1. R data types
In everyday use, R relies mostly on two basic data types (logical, integer and complex types also exist; a quick check follows the list below):
Character
Numeric
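A small illustrative snippet (added for this write-up, not part of the original post) to check how R reports these types:
class(3.14); typeof(3.14)   # "numeric" and "double"
class(3L); typeof(3L)       # "integer"
class("abc")                # "character"
class(TRUE)                 # "logical"
class(1+2i)                 # "complex"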
Python has several data types
Int
float
Long
Complex and so on
2a. Other data types – Python
Python also has certain other data types, like the tuple and dictionary, as shown below. R does not have direct equivalents with the same names; nevertheless, we can do everything in R that Python does with them (an R counterpart is sketched after the output below).
# Python tuple
b = (4,5,7,8)
print(b)
# Python dictionary
c = {'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
# Print type of variable c
## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}
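As a rough R counterpart of the tuple and dictionary shown above (an illustrative addition, not part of the original post), a list can stand in for a tuple and a named list for a dictionary:
# R counterpart of the Python tuple: a list (or a plain vector)
b <- list(4, 5, 7, 8)
print(b)
# R counterpart of the Python dictionary: a named list
c <- list(name="Ganesh", age=54, Work="Professional")
print(c$name)   # access by key, similar to c['name'] in Python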
2.Type of Variable
To know the type of a variable in R we use 'class'; in Python the corresponding command is 'type'.
#R - Type of variable
a <- c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"
#Python - Print type of list a and tuple b
a = [4,5,1,3,4,5]
print(type(a))
b = (4,3,"the",2)
print(type(b))
## <class 'list'> ## <class 'tuple'>
3. Length
To know length in R, use length()
#R - Length of vector
# Length of a
a <- c(4,5,1,3,4,5)
print(length(a))
## [1] 6
To know the length of a list,tuple or dict we can use len()
# Python - Length of list, tuple etc.
# Length of a
a = [4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))
## 6 ## 4
4. Accessing help
To access help in R we use the ‘?’ or the ‘help’ function
#R - Help - To be done in R console or RStudio #?sapply #help(sapply)
Help in python on any topic involves
#Python help - This can be done on a (I)Python console #help(len) #?len
7. Getting the index of element
To determine the index of an element which satisfies a specific logical condition in R, use 'which'. In the code below the index of the element which is equal to 1 is 4.
# R - Which
a <- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4
For a Python (NumPy) array we can use np.where to get the same effect. The index will be 3, as indexing starts from 0.
# Python - np.where
import numpy as np
a = [5,2,3,1,7]
a = np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)
8. Data frames
R, by default, comes with a set of built-in datasets. In Python, some datasets come with the scikit-learn package.
# R # To check built datasets use #data() - In R console or in R Studio #iris - Don't print to console
We can use the in-built data sets that come with Scikit package
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
9. Working with dataframes
With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2, etc. For Python we need to use the Pandas package, which is quite comprehensive in what it can do with data frames. The most common operations on a dataframe are:
Check the size of the dataframe
Take a look at the top 5 or bottom 5 rows of dataframe
Check the content of the dataframe
a.Size
In R use dim()
#R - Size
dim(iris)
## [1] 150   5
For Python use .shape
#Python - size import sklearn as sklearn import pandas as pd from sklearn import datasets data = datasets.load_iris() # Convert to Pandas dataframe iris = pd.DataFrame(data.data, columns=data.feature_names) iris.shape
b. Top & bottom 5 rows of dataframe
To know the top and bottom rows of a data frame we use head() & tail as shown below for R and Python
#R
head(iris,5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1          5.1         3.5          1.4         0.2  setosa ## 2          4.9         3.0          1.4         0.2  setosa ## 3          4.7         3.2          1.3         0.2  setosa ## 4          4.6         3.1          1.5         0.2  setosa ## 5          5.0         3.6          1.4         0.2  setosa
tail(iris,5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species ## 146          6.7         3.0          5.2         2.3 virginica ## 147          6.3         2.5          5.0         1.9 virginica ## 148          6.5         3.0          5.2         2.0 virginica ## 149          6.2         3.4          5.4         2.3 virginica ## 150          5.9         3.0          5.1         1.8 virginica
#Python import sklearn as sklearn import pandas as pd from sklearn import datasets data = datasets.load_iris() # Convert to Pandas dataframe iris = pd.DataFrame(data.data, columns=data.feature_names) print(iris.head(5)) print(iris.tail(5))
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) ## 0                5.1               3.5                1.4               0.2 ## 1                4.9               3.0                1.4               0.2 ## 2                4.7               3.2                1.3               0.2 ## 3                4.6               3.1                1.5               0.2 ## 4                5.0               3.6                1.4               0.2 ##      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) ## 145                6.7               3.0                5.2               2.3 ## 146                6.3               2.5                5.0               1.9 ## 147                6.5               3.0                5.2               2.0 ## 148                6.2               3.4                5.4               2.3 ## 149                5.9               3.0                5.1               1.8
c. Check the content of the dataframe
#R
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300   ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199   ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800   ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500   ##        Species   ##  setosa    :50   ##  versicolor:50   ##  virginica :50   ##                 ##                 ##
str(iris)
## 'data.frame':    150 obs. of  5 variables: ##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#Python import sklearn as sklearn import pandas as pd from sklearn import datasets data = datasets.load_iris() # Convert to Pandas dataframe iris = pd.DataFrame(data.data, columns=data.feature_names) print(iris.info())
## <class 'pandas.core.frame.DataFrame'> ## RangeIndex: 150 entries, 0 to 149 ## Data columns (total 4 columns): ## sepal length (cm)    150 non-null float64 ## sepal width (cm)     150 non-null float64 ## petal length (cm)    150 non-null float64 ## petal width (cm)     150 non-null float64 ## dtypes: float64(4) ## memory usage: 4.8 KB ## None
d. Check column names
#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" ## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" ## [5] "Species"
#Python import sklearn as sklearn import pandas as pd from sklearn import datasets data = datasets.load_iris() # Convert to Pandas dataframe iris = pd.DataFrame(data.data, columns=data.feature_names) #Get column names print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', ##        'petal width (cm)'], ##       dtype='object')
e. Rename columns
In R we can assign a vector to column names
#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal"  "lengthOfPetal" "widthOfPetal" ## [5] "Species"
In Python we can assign a list to the DataFrame's columns attribute (iris.columns)
#Python import sklearn as sklearn import pandas as pd from sklearn import datasets data = datasets.load_iris() # Convert to Pandas dataframe iris = pd.DataFrame(data.data, columns=data.feature_names) iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"] print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')
9. Functions
product <- function(a, b) {
    c <- a*b
    c
}
product(5,7)
## [1] 35
def product(a,b):
    c = a*b
    return c

print(product(5,7))
## 35
Conclusion
Personally, I took to R much like a 'duck takes to water'. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is dynamically and weakly typed and, in everyday use, relies mostly on numeric and character data, as opposed to the wider range of built-in data types and structures in Python.
Python, has too many bells and whistles, which can be a little bewildering to the novice. It is possible that they may be useful as one becomes more experienced with the language. Also I found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6. This will leave you scrambling to google to find how to fix these problems. These can be quite frustrating. R on the other hand makes installing R packages a breeze.
Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!
I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!
Note: This post was created using RStudio’s RMarkdown which allows you to embed R and Python code snippets. It works perfectly, except that matplotlib’s pyplot does not display.
Also see 1. Introducing QCSimulator: A 5-qubit quantum computing simulator in R 2. Re-working the Lucy Richardson algorithm in OpenCV 3. Neural Networks: The mechanics of backpropagation 4. A method to crowd source pothole marking on (Indian) roads 5. A Cloud medley with IBM Bluemix, Cloudant DB and Node.js 6. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables To see all posts click Index of posts