Data Mining With Python & Pandas - 2 of N - Why Indians Should Be Happy That Germany Won Against France
It's the 13th minute and Hummels just scored! The excitement!
In one of the earlier posts, I mentioned the data.gov.in website, which is a fantastic place to explore some interesting data.
I have taken this particular data source to decide whom to support in today's FIFA World Cup 2014 Quarter Finals.
Let's get started from where we previously left off.
df.head()
The data is loaded and ready to go.
Germany vs. France, let the game begin!
Before we begin meddling with the data that we have in our hands, let's just look at this snippet from Wikipedia:
According to the Department of Commerce, the fifteen largest trading partners of India represent 62.1% of Indian imports, and 58.1% of Indian exports as of December 2010. These figures do not include services or foreign direct investment, but only trade in goods.
Well, that's one goal to Germany!
But what does the data really say? Well, let's find out...
#some assignments to speed up things later on
country = 'Country exporting to India'
value = 'Value (INR) - 2012-13'
#create a filter
fromGermans = df[country] == 'GERMANY'
#slice the dataframe
germany = df[fromGermans]
germany.sort(columns=[value], ascending=False).head(10)
Interesting code, isn't it? I'll explain it line by line in just a moment, but for now, let's take a look at what the above lines of code produce:
Those are the top 10 goods that India imports from Germany, in descending order of how much India had to spend on each of them, i.e. costliest on top.
For some reason however, data.gov.in hasn't updated the quantity of import for most of the goods in the top 10. Weird!
Okay, let's get back to the code. Pandas does some really clever data indexing, so once you've loaded data into your DataFrame, it can be selected, sliced, drilled down, etc. in any manner you want (and in some really clever ways that you will find out exclusively on pythonplay.com - I couldn't resist a marketing pitch. The effects of late night blogging after watching the World Cup quarters, I suppose!)
Also, in another quarter-final, Federer won and moves on to the next round at Wimbledon.
What I'm doing here is basically called boolean indexing:
#create a filter
fromGermans = df[country] == 'GERMANY'
#slice the dataframe
germany = df[fromGermans]
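I create a filter / criterion for slicing the DataFrame. Notice that df[country] == 'GERMANY' is a vector operation: it compares every value in the column and returns a column of True / False values, yet Pandas lets you write it as if you were comparing a single scalar value. Indexing df with that filter then keeps only the rows marked True.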
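To see the filter on its own, here is a minimal sketch on a tiny made-up DataFrame. The rows and values are invented for illustration and are not from data.gov.in; only the two column names come from the code above. It also assumes a recent pandas, where sort_values(by=...) is the method to use instead of the older sort(columns=...).

```python
import pandas as pd

country = 'Country exporting to India'
value = 'Value (INR) - 2012-13'

# Made-up rows, purely for illustration
toy = pd.DataFrame({
    country: ['GERMANY', 'FRANCE', 'GERMANY', 'FRANCE'],
    value: [300, 100, 500, 200],
})

# The comparison runs over the whole column and yields one boolean per row
mask = toy[country] == 'GERMANY'
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
german_rows = toy[mask].sort_values(by=value, ascending=False)
print(german_rows[value].tolist())
```

The mask is an ordinary column of booleans, so it can be stored in a variable, inspected, or combined with other masks before it is used for slicing.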
Hold on, France is attacking...
fromFrench = df[country] == 'FRANCE'
france = df[fromFrench]
france.sort(columns=[value], ascending=False).head(10)
Ah, but how much more do we spend on German goods than on French goods? It turns out that number is 560723316290! I can't even comprehend this number at one look.
About 56,000 Crores.
germany[value].sum() - france[value].sum()
India imports stuff from Germany that is worth roughly INR 56K Crores more than what it imports from France!
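As a quick check of that conversion, here is the arithmetic as a small sketch. One crore is ten million (10**7); the helper name to_crores is made up for this example.

```python
CRORE = 10 ** 7  # 1 crore = 10,000,000

def to_crores(rupees):
    """Convert an amount in rupees to crores."""
    return rupees / CRORE

difference = 560723316290  # the figure quoted above, in INR
print(round(to_crores(difference)))  # 56072
```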
So what if the Indian economy is influenced by all this? We just want a good game of football, don't we?
Germany 1 - 0 France.
Data Mining With Python & Pandas - 1 of N - A Sunday Afternoon Data Hack!
This is the first post of a series titled 'India In Numbers', whose political, economic and social angles are explored on my personal blog - suhas.co
Here I'll be talking about the science of it - and the beauty of the library called Pandas.
I started by exploring data.gov.in which publicly provides some really interesting data - and I found this which excited me - Country-wise commodity imports of India
Pandas works in memory, so once I had downloaded the CSV from the above link, I had to load it before I could analyze the data, and this is how you do it:
import pandas as pd
df = pd.read_csv("data.csv")
And that's it - the data structure that I've named df is what's called a DataFrame; take a look at its documentation.
All operations in pandas are a breeze once our data has been 'loaded' into the DataFrame.
To really get a feel for the data, we usually need to take a sneak peek at the actual data itself, and sneak peeks are easy to do with a DataFrame. Here's how you do it -
df.head()
And sure enough, we get our first few rows -
Now that we've loaded the data and taken a sneak peek at it, let's analyze it and dig out some knowledge. Look out for it in the article titled Data Mining With Python & Pandas - 2 of N.
- by Suhas SG
Highest (Greatest) Common Element Among Two Lists In Python
>>> from collections import Counter
>>> xs = [1, 3, 5, 7, 9, 15, 45]
>>> ys = [7, 8, 9, 15, 1, 1, 2]
>>> max(list((Counter(xs) & Counter(ys)).elements()))
15
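If duplicates do not matter, a plain set intersection gives the same answer. This is an alternative sketch in Python 3, not the Counter approach above; the names common and highest are mine.

```python
xs = [1, 3, 5, 7, 9, 15, 45]
ys = [7, 8, 9, 15, 1, 1, 2]

# Values present in both lists, duplicates collapsed
common = set(xs) & set(ys)

# default=None avoids a ValueError when the lists share nothing
highest = max(common, default=None)
print(highest)  # 15
```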
Code Golf, Recursive Lambdas and Saturday Morning Fun with Fibonacci Numbers
This is probably the naive way of implementing Fibonacci Numbers:
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

print [fib(_) for _ in range(10)]
But that is six lines of code (including the blank line). No I'm not happy with six, I want to do it in a one-liner.
Okay, a slightly condensed form of the above can be something like this:
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)

print [fib(_) for _ in range(10)]
Still, that's four lines of code. Not good enough. Is there a way to do this in just one line?
Turns out, there is a way! Recursive Lambdas! :)
print [(lambda x:lambda y:x(x, y))(lambda z, u:u if u < 2 else z(z, u-1) + z(z, u-2))(_) for _ in range(10)]
Looks a little crazy doesn't it? :)
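Here is the same trick pulled apart, as a sketch in Python 3 syntax (where print is a function). The names apply_to_self and fib_step are mine. A lambda has no name to call itself by, so the outer lambda hands the inner function a reference to itself as its first argument:

```python
# Takes a function f and returns a one-argument function that calls f with itself
apply_to_self = lambda f: lambda n: f(f, n)

# The Fibonacci rule; 'me' is the function itself, passed in explicitly
fib_step = lambda me, n: n if n < 2 else me(me, n - 1) + me(me, n - 2)

fib = apply_to_self(fib_step)
print([fib(n) for n in range(10)])
```

Substituting both definitions back into the last line gives the one-liner again.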
Merging Two Dicts On Common Keys With Values As A List
What do you do when you have two dicts like this
x = {'one': 1, 'three': 3, 'two': 2}
y = {'one': 1.0, 'two': 2.0, 'three': 3.0}
and you need to combine them in such a way that the resultant dict will be something like this:
{'one': [1, 1.0], 'three': [3, 3.0], 'two': [2, 2.0]}
Here's a quick way of doing it:
result = dict(x.items() + y.items() + [(k, [x[k], y[k]]) for k in x.viewkeys() & y.viewkeys()])
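That one-liner relies on Python 2 behaviour (items() returning lists that can be added, and viewkeys()). A rough Python 3 equivalent, offered as a sketch:

```python
x = {'one': 1, 'three': 3, 'two': 2}
y = {'one': 1.0, 'two': 2.0, 'three': 3.0}

# Keys present in both dicts; dict key views support set operations
common = x.keys() & y.keys()

# Start from both dicts, then overwrite the shared keys with [x value, y value]
result = {**x, **y, **{k: [x[k], y[k]] for k in common}}
print(result)
```

As in the original, a key that appears in only one of the dicts keeps its plain value; only the shared keys end up holding lists.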
Grouping Consecutive Occurrences of Tuples in a List
So I have a data set like this,
data = [('a',1),('a',2),('a',3),('b',1),('b',2),('a',4)]
Let's say I want to group them by consecutive occurrences of the first element of each tuple. So the output I'm expecting is something like this,
a 1,2,3
b 1,2
a 4
Take a deep breath, and don't be surprised, I'm going to do this in just two lines.
>>> import itertools
>>> for k, v in itertools.groupby(data, key = lambda x : x[0]):
...     print k, [_[1] for _ in list(v)]
...
a [1, 2, 3]
b [1, 2]
a [4]
Isn't python fun?
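The same idea as a Python 3 sketch, collecting the groups into a list so they can be reused. Note that groupby only merges neighbours, which is exactly why the last ('a', 4) ends up in a group of its own. The name groups is mine.

```python
import itertools

data = [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('a', 4)]

# One (key, values) pair per consecutive run of equal first elements
groups = [(key, [value for _, value in run])
          for key, run in itertools.groupby(data, key=lambda pair: pair[0])]

for key, values in groups:
    print(key, values)
```

If all the 'a' entries should land in one group regardless of position, sort the data by the same key before calling groupby.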
How to find out CDF of data in python? (The simple, non-probabilistic version)
In [1]:
book_prices = [23.5,47.5,55.0,21.0,1.5,2.6,33.5,45.5,99.5,20.5,21.5,100.0,88.5,40.5, 30.0,18.99,23.5,22.25,45.5,90.0,85.5,90.0,15.0]
In [2]:
i = 0
cumulative_prices = []
for p in sorted(book_prices, reverse=True):
    if i==0:
        cumulative_prices.append(p)
    else:
        cumulative_prices.append(p+cumulative_prices[i-1])
    i+=1
cumulative_prices
Out[2]:
[100.0, 199.5, 289.5, 379.5, 468.0, 553.5, 608.5, 656.0, 701.5, 747.0, 787.5, 821.0, 851.0, 874.5, 898.0, 920.25, 941.75, 962.75, 983.25, 1002.24, 1017.24, 1019.84, 1021.34]
In [3]:
sum(book_prices)
Out[3]:
1021.34
In [4]:
cumulative_percentages = [ (c*100.0)/ sum(book_prices) for c in cumulative_prices ]
In [5]:
cumulative_percentages
Out[5]:
[9.7910588050991834, 19.53316231617287, 28.345115240762134, 37.157068165351397, 45.822155207864178, 54.193510486223978, 59.57859282902853, 64.229345761450645, 68.684277517770767, 73.139209274090902, 77.104588090156071, 80.3845927898643, 83.321910431394045, 85.622809250592354, 87.923708069790663, 90.102218653925235, 92.207296297021557, 94.26341864609239, 96.270585701137719, 98.129907768226047, 99.598566588990934, 99.853134117923503, 100.0]
In [6]:
import matplotlib.pyplot as plt
In [7]:
plt.plot(cumulative_percentages)
Out[7]:
[<matplotlib.lines.Line2D at 0x79ab978>]
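The running total can also be written with itertools.accumulate from the standard library (Python 3). This is a sketch of the same computation as the loop in In [2], without the plotting step:

```python
from itertools import accumulate

book_prices = [23.5, 47.5, 55.0, 21.0, 1.5, 2.6, 33.5, 45.5, 99.5, 20.5, 21.5,
               100.0, 88.5, 40.5, 30.0, 18.99, 23.5, 22.25, 45.5, 90.0, 85.5,
               90.0, 15.0]

# Running total over the prices, largest first, as in the loop above
cumulative_prices = list(accumulate(sorted(book_prices, reverse=True)))

# The last running total is the grand total
total = cumulative_prices[-1]
cumulative_percentages = [c * 100.0 / total for c in cumulative_prices]
```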
Top 10 Highlights From PyCon India 2013
1. A lot of people use IPython Notebooks.
2. A lot of people use Python for Data Analysis, and Scientific Computing, and especially Machine Learning.
3. Python is up and coming, it's used increasingly in production, and people love its simplicity and elegance.
4. Web Frameworks like Django and Flask are popular. The former is an all-in-one solution and the latter a lightweight, get-your-app-running-in-five-minutes type.
5. Some people are targeting python early in education, and are campaigning towards a notion of fun and easy coding for kids. It's already part of School Syllabus in CBSE, and some parents are worried about this.
6. Python is diverse in its capabilities, from telephony to robotics, and even to predict black swan events!
7. People are hiring if you are capable of conversing in python. (shouldn't take an awful lot of time thanks to its simplicity)
8. The Tee Shirt sizes were properly thought out, mine fit me!
9. Students of India are an enthusiastic lot, so are the corporates. Enthusiasm was everywhere.
10. Python, although gaining more popularity by the day, is still not being used in highly scalable, and time-efficient systems.
An Elegant Way To Find Out Median Of Three Numbers In Python
There are multiple ways of doing this.
The first one is the comparison way -
>>> def median3(a,b,c):
...     if a<b:
...         if c<a:
...             return a
...         elif b<c:
...             return b
...         else:
...             return c
...     else:
...         if a<c:
...             return a
...         elif c<b:
...             return b
...         else:
...             return c
...
>>> median3(1,5,2)
2
>>> median3(3,5,2)
3
>>> median3(3,5,7)
5
>>> median3(7,5,2)
5
>>> median3(7,5,1)
5
>>> median3(2,5,1)
2
This takes at least two comparisons and at most three to compute.
Another elegant way to do this might be to sort the three numbers and return the middle one.
Like this:
>>> def median3(a,b,c):
...     return sorted([a,b,c])[1]
...
>>> median3(1,5,2)
2
>>> median3(3,5,2)
3
>>> median3(3,5,7)
5
>>> median3(7,5,6)
6
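On Python 3.4 and later there is a third option: the standard library's statistics module. A sketch, reusing the median3 name from above:

```python
import statistics

def median3(a, b, c):
    # For an odd number of values, statistics.median returns the middle one
    return statistics.median([a, b, c])

print(median3(7, 5, 6))
```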
Divide and Conquer With Python - Merge Sort
In Python, the messy handling of memory with pointers and such is non-existent - which really makes way for a more fluent translation of pseudo-code into working python code.
Sometimes, it's so simple and elegant that the python code looks simpler than the pseudo-code.
That said, let's look at how to approach merge sort in python.
Merge Sort is a divide and conquer based approach to sorting which runs in O(nlogn) time. All you have to do is to divide the unsorted input array into a LEFT sub-array and a RIGHT sub-array and recursively call it again and again until the base case when the sub-array is of length 1 - where it's trivially sorted, and then merge back (conquer).
Here's how simple it is:
def mergesort(nums):
    if len(nums) <= 1:
        return nums
    #floor division, so that mid is an integer on both Python 2 and 3
    mid = len(nums)//2
    sorted_left_array = mergesort(nums[:mid])
    sorted_right_array = mergesort(nums[mid:])
    return merge(sorted_left_array,sorted_right_array)
Quite simple.
Now all we have to do is implement the merge method that takes two sorted arrays and merges them together into one single sorted array.
Here's the idea of the merge sub-routine:
1. Traverse through sorted_left_array and sorted_right_array with say indices i and j
2. Compare sorted_left_array[i] and sorted_right_array[j] and whichever is smaller, append it to the result - increment i if sorted_left_array[i] is appended or increment j otherwise. (This will become obvious once you see the implementation.)
This is how you can do it -
def merge(xs,ys):
    ms = []
    i = 0
    j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            ms.append(xs[i])
            i = i+1
        else:
            ms.append(ys[j])
            j = j+1
    while i < len(xs) and j == len(ys):
        ms.append(xs[i])
        i = i+1
    while i == len(xs) and j < len(ys):
        ms.append(ys[j])
        j = j+1
    return ms
How many Lines Of Code will it take in other languages? :)
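For comparison, the standard library already ships a merge step. Here is a sketch that swaps the hand-written merge for heapq.merge; it is a different implementation from the one above, shown only as a sanity check, and it uses Python 3 syntax.

```python
import heapq

def mergesort(nums):
    # A list of length 0 or 1 is trivially sorted
    if len(nums) <= 1:
        return nums
    mid = len(nums) // 2
    left = mergesort(nums[:mid])
    right = mergesort(nums[mid:])
    # heapq.merge lazily merges already-sorted inputs into one sorted stream
    return list(heapq.merge(left, right))

print(mergesort([5, 2, 9, 1, 5, 6]))
```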
Functional Programming - lambdas, some fun, and then Currying!
Functions can be assigned.
>>> def square(x):
...     return x*x
...
>>> f = square
>>> f(4)
16
They can also be passed as arguments.
>>> def fsum(f,foo):
...     return sum(map(f,foo))
...
>>> fsum(square,[3,4])
25
Lambdas are convenient anonymous functions. They are defined without being bound to any identifier - Nameless functions! Lambdas can be handy when you need to pass one function to some higher order function. Like in the above example, you can do this:
>>> fsum(lambda x: x*x, [3,4])
25
Now that we know that, let's try currying in python.
As always, let's start with a list.
Today, this will be my list:
foo = range(1,10)
I need to do the following additive operations on this list:
Sum of squares of each element in the list (1^2) + (2^2) + (3^2) + (4^2) ...
Sum of all the elements in the list when each element is multiplied by, say 3 (1*3) + (2*3) + (3*3) + (4*3) ...
Sum of all the elements in the list when each element is added with, say 5 (1+5) + (2+5) + (3+5) + (4+5) ...
At first thought, you'd probably think of defining three functions for them, no?
>>> foo = range(1,10)
>>> def square_adder(foo):
...     return sum(map(lambda x: x**2, foo))
...
>>> def mul_adder(foo):
...     return sum(map(lambda x: x*3, foo))
...
>>> def sum_adder(foo):
...     return sum(map(lambda x: x+5, foo))
...
>>> square_adder(foo)
285
>>> mul_adder(foo)
135
>>> sum_adder(foo)
90
Well, here's how it can be done using currying:
>>> def fsum(f):
...     return lambda x,y: sum(map(f,range(x,y)))
...
>>> fsum(lambda x: x**2)(1,10)
285
>>> fsum(lambda x: x*3)(1,10)
135
>>> fsum(lambda x: x+5)(1,10)
90
and this is possible too:
>>> square_adder = fsum(lambda x: x**2)
>>> square_adder(1,10)
285
>>> mul_adder = fsum(lambda x: x*3)
>>> mul_adder(1,10)
135
>>> sum_adder = fsum(lambda x: x+5)
>>> sum_adder(1,10)
90
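The standard library offers a related tool, functools.partial, which fixes some arguments of an ordinary function and returns a new callable. Strictly speaking this is partial application rather than hand-rolled currying, so take it as a sketch of an alternative. Here fsum is redefined to take all three arguments at once.

```python
from functools import partial

def fsum(f, start, stop):
    # Sum of f(x) for every x in range(start, stop)
    return sum(map(f, range(start, stop)))

# partial fixes the first argument and leaves start and stop open
square_adder = partial(fsum, lambda x: x ** 2)
mul_adder = partial(fsum, lambda x: x * 3)
sum_adder = partial(fsum, lambda x: x + 5)

print(square_adder(1, 10), mul_adder(1, 10), sum_adder(1, 10))
```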
Functional Programming - map()
From the docs:
map(function, sequence) calls function(item) for each of the sequence’s items and returns a list of the return values. For example, to compute some cubes:
>>> def cube(x):
...     return x*x*x
...
>>> map(cube,range(1,11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
More than one sequence may be passed; the function must then have as many arguments as there are sequences and is called with the corresponding item from each sequence (or None if some sequence is shorter than another). For example:
>>> seq=range(8)
>>> def add(x,y): return x+y
...
>>> map(add,seq,seq)
[0, 2, 4, 6, 8, 10, 12, 14]
If I wanted the sum of squares of 1 to 10, then this would do it:
>>> def square(x):
...     return x*x
...
>>> sum(map(square,range(1,11)))
385
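The quotes above describe Python 2. In Python 3, map returns a lazy iterator rather than a list, and with several sequences it stops at the shortest one instead of padding with None. A small sketch of both differences (the variable names are mine):

```python
def cube(x):
    return x * x * x

# Wrap in list() to turn the iterator into an actual list
cubes = list(map(cube, range(1, 11)))

# With two sequences of unequal length, the result has the shorter length
sums = list(map(lambda x, y: x + y, range(8), range(3)))

# sum() consumes the iterator directly, so no list is needed here
sum_of_squares = sum(map(lambda x: x * x, range(1, 11)))

print(cubes, sums, sum_of_squares)
```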
Functional Programming In Python - Filtering Lists Using filter()
Consider this list,
foo = range(1,100)
Now I have to find only multiples of 5 in foo.
Without FP, You (or I) would probably do something like this:
>>> bar = []
>>> for x in foo:
...     if x%5 == 0:
...         bar.append(x)
...
>>> print bar
[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
But with FP, the solution is quite elegant, here's how it can be done:
>>> def f(x): return x%5 == 0
...
>>> filter(f, foo)
[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
filter(function, list) applies the function to each item of the list and keeps only the items for which it returns a true value.
From the docs:
filter(function, sequence) returns a sequence consisting of those items from the sequence for which function(item) is true.
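In Python 3, filter is lazy just like map, so the result needs list() around it before it prints as a list. Here is a sketch of that, next to the equivalent list comprehension; the function and variable names are mine.

```python
foo = range(1, 100)

def is_multiple_of_five(x):
    return x % 5 == 0

# filter() returns an iterator in Python 3; list() materialises it
with_filter = list(filter(is_multiple_of_five, foo))

# The same selection written as a list comprehension
with_comprehension = [x for x in foo if x % 5 == 0]

print(with_filter == with_comprehension)
```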
Introduction To List Comprehensions And Reading Lines From File Into A List
List comprehensions are an exciting feature of Python. You can build entire lists using one expression. Say I want a list of the squares of the numbers from 1 to 9, or a list of numbers x to the power of some y, or only the even squares; then I do this:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> [x**2 for x in range(1,10)]
[1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> [x**y for x in range(1,5) for y in range(1,3)]
[1, 1, 2, 4, 3, 9, 4, 16]
>>> [x**2 for x in range(1,10) if x%2 == 0]
[4, 16, 36, 64]
>>> [x**y for x in range(1,5) for y in range(1,4) if not x==y]
[1, 1, 2, 8, 3, 9, 4, 16, 64]
>>>
And here's how you can read lines from a file into a list using list comprehension:
import sys

#sys.argv[1] is already a string, so it can be passed to open() directly
lines = [line.strip() for line in open(sys.argv[1],"r")]
for line in lines:
    print line
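One thing the snippet above leaves to the interpreter is closing the file. A sketch of the same comprehension inside a with block, which closes the file as soon as the list is built; the function name read_lines is made up for this example.

```python
def read_lines(path):
    # The with block closes the file even if reading fails part-way
    with open(path, "r") as handle:
        return [line.strip() for line in handle]
```

Calling read_lines(sys.argv[1]) would then give the same list as before.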
Executing Linux Shell (Bash) Commands Within Python Code
Ever wondered how to run a linux command like, say, grep from within python code? It's fairly simple. Say I have file1.txt with the following lines:
Hello1 World1 Hello2 World2
Now let's say I have to copy all the lines from file1.txt that start with Hello to file2.txt
Here's the grep command that'll do it:
grep "^Hello" file1.txt > file2.txt
Okay, now how do I execute this inside python code? Using subprocess:
import subprocess
p = subprocess.Popen('grep "^Hello" file1.txt > file2.txt', stdout=subprocess.PIPE,shell=True)
p.communicate()
If you don't want to redirect to file2.txt, but rather you want to parse the output of the command within python, then you can do this:
import subprocess
p = subprocess.Popen('grep "^Hello" file1.txt', stdout=subprocess.PIPE,shell=True)
output, errormsg = p.communicate()
#do something with output
print output
Also do note that as the docs say:
Warning: Passing shell=True can be a security hazard if combined with untrusted input.
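When the command needs no shell features such as redirection, one way to sidestep that warning is to pass the arguments as a list and leave shell at its default of False. A sketch for Python 3.5 or later, assuming grep is on the PATH; the function name grep_hello is made up for this example.

```python
import subprocess

def grep_hello(path):
    # Arguments go in a list, so no shell ever parses the command line
    completed = subprocess.run(
        ["grep", "^Hello", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode the output to str instead of bytes
    )
    # grep exits with status 1 when nothing matches; stdout is then empty
    return completed.stdout.splitlines()
```

Because the file name is passed as its own list element, a name containing spaces or shell metacharacters is handed to grep untouched.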
Reading And Parsing Files Using Command Line Arguments In Python
Say I have this file.txt:
Line1
Line2
Line3
I need to parse it line by line.
For this, I can do something like this in python:
#readfile.py
import sys

def read_file():
    #sys.argv[1] is the first command line argument (already a string)
    for line in open(sys.argv[1],"r"):
        #do something with each line here
        #.strip() removes the trailing \n and any surrounding whitespace
        print line.strip()

if __name__ == "__main__":
    read_file()
We can run this like so:
$ python readfile.py file.txt
Line1
Line2
Line3
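If the script grows more arguments, the standard library's argparse gives usage messages and error handling for free. A Python 3 sketch under that assumption; parse_args and the argument name path are my own choices, not part of the script above.

```python
import argparse

def parse_args(argv=None):
    # argv=None makes argparse fall back to the real command line
    parser = argparse.ArgumentParser(description="Print each line of a file")
    parser.add_argument("path", help="file to read")
    return parser.parse_args(argv)

def read_file(path):
    with open(path, "r") as handle:
        for line in handle:
            # do something with each line here
            print(line.strip())
```

Inside the usual if __name__ == "__main__": block, the call would be read_file(parse_args().path). Running the script without a file name then prints a usage message instead of an IndexError.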
What Do The Underscores Mean In __init__()?
From the Python Style Guide (PEP 8):
_single_leading_underscore: weak "internal use" indicator. E.g.
from M import *
does NOT import objects whose name starts with an underscore.
single_trailing_underscore_: used by convention to avoid conflicts with Python keyword, e.g.
Tkinter.Toplevel(master, class_='ClassName')
__double_leading_underscore: when naming a class attribute, invokes name mangling (inside class FooBar, __boo becomes _FooBar__boo; see below).
__double_leading_and_trailing_underscore__: "magic" objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.
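A small sketch of the name mangling rule in action. The class name FooBar and the attribute __boo follow the quote above; _hint is an extra attribute I added to contrast the single underscore.

```python
class FooBar(object):
    def __init__(self):
        self._hint = "internal by convention only"
        self.__boo = "mangled"  # stored on the instance as _FooBar__boo

fb = FooBar()
print(fb._hint)              # still reachable; the underscore is just a signal
print(fb._FooBar__boo)       # the mangled name works from outside the class
print(hasattr(fb, "__boo"))  # False: the unmangled name does not exist
```

Mangling is meant to avoid accidental name clashes in subclasses; it is not an access control mechanism.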