datascistuff-blog
DataDriven
16 posts
datascistuff-blog · 10 years ago
Stacking and Blending
The basic idea behind stacked generalization is to use a pool of base classifiers and then a second classifier to combine their predictions, with the aim of reducing the generalization error.
Let's say you want to do 2-fold stacking (this generalizes to k folds):
Split the train set into 2 parts: train_a and train_b.
Fit a first-stage model on train_a and create predictions for train_b.
Fit the same model on train_b and create predictions for train_a.
Finally, fit the model on the entire train set and create predictions for the test set. In effect, the first-stage predictions for the train set are produced via 2-fold CV.
Now train a second-stage stacker model on the out-of-fold probabilities from the first-stage model(s).
A stacker model gets more information on the problem space by using the first-stage predictions as features than if it were trained in isolation.
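As a rough illustration, here is a minimal base-R sketch of the 2-fold stacking recipe, assuming data frames train and test with the same predictors, a binary factor target y in train, and the randomForest package (all names are placeholders, not code from the original post):

library(randomForest)

set.seed(1)
fold <- sample(rep(1:2, length.out = nrow(train)))   # assign every row to fold 1 or 2
oof  <- numeric(nrow(train))                         # out-of-fold first-stage predictions

for (k in 1:2) {
  # fit the first-stage model on one half, predict the held-out half
  fit <- randomForest(y ~ ., data = train[fold != k, ])
  oof[fold == k] <- predict(fit, train[fold == k, ], type = "prob")[, 2]
}

# refit on the full train set to create first-stage predictions for the test set
fit_full  <- randomForest(y ~ ., data = train)
test_meta <- predict(fit_full, test, type = "prob")[, 2]

# second-stage stacker trained on the out-of-fold predictions
stacker   <- glm(train$y ~ oof, family = binomial)
test_pred <- predict(stacker, newdata = data.frame(oof = test_meta), type = "response")

In practice you would use several first-stage models, each contributing one out-of-fold column to the stacker's feature matrix, as the competition script further down does.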
With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.
Blending has a few benefits:
It is simpler than stacking.
It guards against information leakage: the generalizers and the stacker use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.
The cons are:
You use less data overall
The final model may overfit to the holdout set.
Your CV estimate is more solid with stacking (it is computed over many folds) than with a single small holdout set.

"""Kaggle competition: Predicting a Biological Response.

Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to
[0,1]. The blending scheme is related to the idea Jose H. Solorzano
presented here:
http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950
'''You can try this: In one of the 5 folds, train the models, then use
the results of the models as 'variables' in logistic regression over
the validation data of that fold'''. Or at least this is the
implementation of my understanding of that idea :-)

The predictions are saved in test.csv. The code below created my best
submission to the competition:
- public score (25%): 0.43464
- private score (75%): 0.37751
- final rank on the private leaderboard: 17th over 711 teams :-)

Note: if you increase the number of estimators of the classifiers,
e.g. n_estimators=1000, you get a better score/rank on the private
test set.

Copyright 2012, Emanuele Olivetti.
BSD license, 3 clauses.
"""

from __future__ import division
import numpy as np
import load_data
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


def logloss(attempt, actual, epsilon=1.0e-15):
    """Logloss, i.e. the score of the bioresponse competition."""
    attempt = np.clip(attempt, epsilon, 1.0-epsilon)
    return - np.mean(actual * np.log(attempt) + (1.0 - actual) * np.log(1.0 - attempt))


if __name__ == '__main__':

    np.random.seed(0)  # seed to shuffle the train set

    n_folds = 10
    verbose = True
    shuffle = False

    X, y, X_submission = load_data.load()

    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]

    skf = list(StratifiedKFold(y, n_folds))

    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            GradientBoostingClassifier(learn_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    print "Creating train and test sets for blending."

    dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
    dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))

    for j, clf in enumerate(clfs):
        print j, clf
        dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
        for i, (train, test) in enumerate(skf):
            print "Fold", i
            X_train = X[train]
            y_train = y[train]
            X_test = X[test]
            y_test = y[test]
            clf.fit(X_train, y_train)
            y_submission = clf.predict_proba(X_test)[:, 1]
            dataset_blend_train[test, j] = y_submission
            dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:, 1]
        dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)

    print
    print "Blending."
    clf = LogisticRegression()
    clf.fit(dataset_blend_train, y)
    y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

    print "Linear stretch of predictions to [0,1]"
    y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())

    print "Saving Results."
    np.savetxt(fname='test.csv', X=y_submission, fmt='%0.9f')
datascistuff-blog · 11 years ago
CrossValidation
Cross-validation: one should build the model entirely on the training data. The test data should not be used during the model-building and tuning phases.
One can use cross-validation to tune parameters, choose which algorithm to use, etc. (see the sketch at the end of this post).
K-fold CV
------------
If K is large -- the training sets of the different folds overlap heavily (each is nearly the full data) and each validation fold is small. The resulting error estimate has LOW BIAS and HIGH VARIANCE.
If K is small --- the training sets are much smaller than the full data and differ more between folds, giving an estimate with HIGH BIAS and LOW VARIANCE.
High variance in the estimate makes it easier to OVER-FIT the model-selection choices.
One can also use the bootstrap (sampling with replacement) or repeated random subsampling (sampling without replacement) instead of K-fold CV.
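As a concrete illustration, assuming the caret and rpart packages and a training data frame train with a factor outcome y (placeholder names), K-fold CV can drive parameter tuning without ever touching the test data:

library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)      # 10-fold CV on the training data only
fit  <- train(y ~ ., data = train, method = "rpart",  # decision tree, tuned over its complexity parameter
              tuneLength = 5, trControl = ctrl)
fit$bestTune                                          # the parameter value chosen by CV
fit$results                                           # resampled performance for each candidate value
# trainControl(method = "boot") would use the bootstrap instead of K-fold CV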
datascistuff-blog · 11 years ago
Using Regression : R NOTES
Steps (an R illustration follows at the end of this post):
1. Identify the kind of features the data has (numeric or categorical).
2. Regression needs numeric features, so categorical features must be converted to numeric (for example, bucketed or dummy-coded).
3. Distributions:
     Check the distribution of the Dependent Variable. A roughly normal distribution is generally needed, so some transformation might be required.
    Compute the joint distribution of the Dependent Variable with the Independent Variables.
  Numeric variables -- correlation, density plots.    Categorical variables --- use table().
4. Distributions of Independent Variables:
       a. Correlation
       b. Graph using pairs, pairs.panels (library psych)
 5. Use lm function to build REGRESSION MODEL.
 6. Look at summary of the model using summary() 
    a. Look for dist of Residual Errors
    b. R-sq value - how well the dependent variable is modeled, i.e. how much of its variance is explained.
    c. Look at Signif. Values to identify how much predictive power each feature has.
   The stars (for example, ***) indicate the predictive power of each feature in the model. The significance level (as listed by the Signif. codes in the footer) provides a measure of how likely it is that the true coefficient is zero given the value of the estimate. Three stars indicate a p-value below 0.001, which means the feature is extremely unlikely to be unrelated to the dependent variable. A common practice is to use a significance level of 0.05 to denote a statistically significant variable. If the model had few statistically significant features, it would be cause for concern, since it would indicate that our features are not very predictive of the outcome. Here, our model has several significant variables, and they seem to be related to the outcome in logical ways.
7. In Regression, feature selection and model specs are the analyst's job.
  a. non-linear terms x^2, xy , etc
  b. Transformation – converting a numeric variable to a binary indicator
  c.  Domain Knowledge Helps
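A compressed R illustration of steps 3 to 6 (a sketch only: df and its dependent-variable column y are placeholder names, and the psych package is assumed for pairs.panels):

library(psych)

hist(df$y)                          # step 3: distribution of the dependent variable
num <- df[sapply(df, is.numeric)]
cor(num)                            # correlations among the numeric variables
pairs.panels(num)                   # step 4b: scatterplots, densities, and correlations

model <- lm(y ~ ., data = df)       # step 5: fit the regression model
summary(model)                      # step 6: residuals, R-squared, significance stars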
datascistuff-blog · 11 years ago
Exploratory Data Science Recipes 1
Given a dataset in a tabular form, we want to study the relationships between predictor variables and dependent variable.
Recipes:
1. Given a data.frame, how to split it into training, test, and calibration data.
a. Create a vector of length equal to the number of rows in the data.frame, populated with random numbers uniformly generated between 0 and 1 (runif(nrow(data.f))), then split on it:
   data.f$rgroup <- runif(nrow(data.f))
   dTrainAll <- subset(data.f, rgroup <= 0.9)  #Approx 90%
   dTest <- subset(data.f, rgroup > 0.9)
  ###Split dTrainAll into Calibration Data and Train Data
  useForCal <- rbinom(n=dim(dTrainAll)[[1]],size=1,prob=0.1)>0
 dCal <-subset(dTrainAll,useForCal)
 dTrain<-subset(dTrainAll,!useForCal)
Find out which variables are Categorical and which are Numeric
vars <-colnames(data.f)
catVars <- vars[sapply(dTrainAll[,vars],class) %in% c('factor','character')]
numericVars<-vars[sapply(dTrainAll[,vars],class) %in% c('numeric','integer')]
datascistuff-blog · 11 years ago
Generative Learning Algorithms: NAIVE BAYES
1. The input data x is discrete. For example, in text classification x is a vector of 0s and 1s (bag of words), and the vector's size is the size of the vocabulary V.
2. In generative algorithms, we build a model of the input data. For example, we can assume the input data follows a multivariate Gaussian distribution.
Q. What is the model of the input data in NB?
  A MULTINOMIAL distribution: a vector of size V can take 2^V possible values, so modeling x directly would mean a multinomial over 2^V outcomes, i.e. 2^V - 1 parameters, which is intractable.
Q. How do we solve this dimensionality problem?
 By making the INDEPENDENCE assumption, which is clearly not correct but works well in practice.
 What is independence? If x and y are independent, then
  p(x) = p(x|y)
In NB we assume the words are conditionally independent given the class, so the chain rule
 p(x1,x2,…,xV|y) = p(x1|y) p(x2|y,x1) p(x3|y,x1,x2) …
collapses to
 p(x1|y) p(x2|y) … p(xV|y)
That is, the words in a document are treated as independent of each other given the class.
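A toy R illustration of the factorization (all numbers are made up): given estimates of p(word present | y = spam) for a three-word vocabulary, the class-conditional probability of a whole document is just a product.

# made-up estimates of P(word present | class = spam)
p_word_given_spam <- c(viagra = 0.80, offer = 0.60, meeting = 0.05)

# a new document as a 0/1 bag-of-words vector over the same vocabulary
x <- c(viagra = 1, offer = 0, meeting = 1)

# naive Bayes assumption: P(x | spam) = prod_j P(x_j | spam)
prob_x_given_spam <- prod(ifelse(x == 1, p_word_given_spam, 1 - p_word_given_spam))
prob_x_given_spam   # 0.8 * 0.4 * 0.05 = 0.016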
datascistuff-blog · 11 years ago
Generative Learning Models: Gaussian Discriminant Analysis Model
GDA is a classification-based machine learning algorithm. The basic premise is as follows. Suppose we have two classes with labels y=1 and y=0. The assumption GDA makes (which, if wrong, is a bad assumption to make) is that the data from the two classes comes from multivariate Gaussian distributions with different means and a shared covariance matrix.
Generative models assume/model the distribution of the data for each class; here, the data is generated by a Gaussian distribution. GDA then finds parameters that separate the two classes such that P(y=1|x) is greater than 0.5 on one side of the decision boundary and less than 0.5 on the other.
GDA Parameters
----------------------
Gaussian distributions for the 2 classes, with means mean1 and mean2 and a shared covariance matrix sigma.
Also a prior distribution over the classes c1 and c2, which is Bernoulli with parameter PHI.
GDA and Logistic Regression
Numerical Optimization
------------------------------
We estimate the parameters by maximizing the likelihood, i.e. choosing the parameters under which the observed data is most probable.
There are two concepts here:
Log-likelihood
Likelihood (maximized via MLE)
GDA vs LR (Duality)
-------------------------------------
GDA and LR have a duality. If the class-conditional data really is Gaussian, GDA makes sense because that is exactly the assumption it builds on.
P(y=1|x) can then be expressed as a logistic regression equation for some Theta and the same x.
If p(x|y) is Gaussian, GDA does better than LR.
However, LR does not assume the data is Gaussian and hence is more robust to incorrect modelling assumptions. If the data is Poisson, for example, LR still works well but GDA does not.
That is why LR is more popular: it makes no assumptions about the input data distribution.
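As a sketch: GDA with a shared covariance matrix is essentially what R's MASS::lda implements. Assuming a data frame df with a factor class column y (placeholder names):

library(MASS)

fit  <- lda(y ~ ., data = df)        # estimates class means, the pooled covariance, and class priors (phi)
pred <- predict(fit, newdata = df)
head(pred$posterior)                 # P(y = k | x) for each row
head(pred$class)                     # predicted class (posterior above 0.5 in the two-class case)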
datascistuff-blog · 12 years ago
R Text Mining: library(help = tm)
Information on package ‘tm’

Description:

Package:            tm
Title:              Text Mining Package
Version:            0.5-9
Date:               2013-06-17
Authors@R:          c(person("Ingo", "Feinerer", role = c("aut", "cre"), email = "[email protected]"), person("Kurt", "Hornik", role = "aut"))
Depends:            R (>= 2.14.0), methods
Imports:            parallel, slam (>= 0.1-22)
Suggests:           filehash, proxy, Rgraphviz, SnowballC, XML
SystemRequirements: Antiword (http://www.winfield.demon.nl/) for reading MS Word files, pdftotext from Poppler (http://poppler.freedesktop.org/) for reading PDF
Description:        A framework for text mining applications within R.
License:            GPL (>= 2)
URL:                http://tm.r-forge.r-project.org/
Packaged:           2013-06-18 09:35:09 UTC; hornik
Author:             Ingo Feinerer [aut, cre], Kurt Hornik [aut]
Maintainer:         Ingo Feinerer <[email protected]>
NeedsCompilation:   yes
Repository:         CRAN
Date/Publication:   2013-06-18 12:14:27
Built:              R 3.0.0; x86_64-pc-linux-gnu; 2013-07-01 08:39:04 UTC; unix

Index:

DataframeSource              Data Frame Source
Dictionary                   Dictionary
DirSource                    Directory Source
FunctionGenerator            Function Generator
GmaneSource                  Gmane Source
PCorpus                      Permanent Corpus Constructor
PlainTextDocument            Plain Text Document
RCV1Document                 RCV1 Text Document
Reuters21578Document         Reuters-21578 Text Document
ReutersSource                Reuters-21578 XML Source
Source                       Access Sources
TermDocumentMatrix           Term-Document Matrix
TextDocument                 Access and Modify Text Documents
TextRepository               Text Repository
URISource                    Uniform Resource Identifier Source
VCorpus                      Volatile Corpus
VectorSource                 Vector Source
WeightFunction               Weighting Function
XMLSource                    XML Source
Zipf_plot                    Explore Corpus Term Frequency Characteristics
acq                          50 Exemplary News Articles from the Reuters-21578 XML Data Set of Topic acq
as.PlainTextDocument         Create Objects of Class PlainTextDocument
c.Corpus                     Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors
crude                        20 Exemplary News Articles from the Reuters-21578 XML Data Set of Topic crude
dissimilarity                Dissimilarity
findAssocs                   Find Associations in a Term-Document Matrix
findFreqTerms                Find Frequent Terms
getFilters                   List Available Filters
getReaders                   List Available Readers
getSources                   List Available Sources
getTokenizers                List Available Tokenizers
getTransformations           List Available Transformations
inspect                      Inspect Objects
makeChunks                   Split a Corpus into Chunks
materialize                  Materialize Lazy Mappings
meta                         Meta Data Management
ncol.TermDocumentMatrix      The Number of Rows/Columns/Dimensions/Documents/Terms of a Term-Document Matrix
plot.TermDocumentMatrix      Visualize a Term-Document Matrix
preprocessReut21578XML       Preprocess the Reuters-21578 XML archive.
prescindMeta                 Prescind Document Meta Data
readDOC                      Read In a MS Word Document
readGmane                    Read In a Gmane RSS Feed
readPDF                      Read In a PDF Document
readPlain                    Read In a Text Document
readRCV1                     Read In a Reuters Corpus Volume 1 Document
readReut21578XML             Read In a Reuters-21578 XML Document
readTabular                  Read In a Text Document
readXML                      Read In an XML Document
read_dtm_Blei_et_al          Read Document-Term Matrices
removeNumbers                Remove Numbers from a Text Document
removePunctuation            Remove Punctuation Marks from a Text Document
removeSparseTerms            Remove Sparse Terms from a Term-Document Matrix
removeWords                  Remove Words from a Text Document
rownames.TermDocumentMatrix  Row, Column, Dim Names, Document IDs, and Terms
sFilter                      Statement Filter
scan_tokenizer               Tokenizers
searchFullText               Full Text Search
stemCompletion               Complete Stems
stemDocument                 Stem Words
stopwords                    Stopwords
stripWhitespace              Strip Whitespace from a Text Document
termFreq                     Term Frequency Vector
tm_filter                    Filter and Index Functions on Corpora
tm_intersect                 Intersection between Documents and Words
tm_map                       Transformations on Corpora
tm_reduce                    Combine Transformations
tm_tag_score                 Compute a Tag Score
weightBin                    Weight Binary
weightSMART                  SMART Weightings
weightTf                     Weight by Term Frequency
weightTfIdf                  Weight by Term Frequency - Inverse Document Frequency
writeCorpus                  Write a Corpus to Disk

Further information is available in the following vignettes in directory ‘/home/sverma/R/x86_64-pc-linux-gnu-library/3.0/tm/doc’:

extensions: Extensions (source, pdf)
tm: Introduction to the tm Package (source, pdf)
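A minimal usage sketch tying a few of the listed functions together (the example texts are made up):

library(tm)

docs   <- c("Text mining with the tm package.",
            "The tm package builds corpora and term-document matrices.")
corpus <- Corpus(VectorSource(docs))          # volatile corpus from a character vector
corpus <- tm_map(corpus, removePunctuation)   # one of the listed transformations
corpus <- tm_map(corpus, stripWhitespace)
tdm    <- TermDocumentMatrix(corpus)          # terms x documents matrix
inspect(tdm)
findFreqTerms(tdm, 1)                         # terms appearing at least once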
datascistuff-blog · 12 years ago
LDA: Topic Modeling using topicmodels package in R
Packages Required : topicmodels, tm (textmining), SnowballC (R interface to the C libstemmer library that implements Porter's word stemming algorithm for collapsing words)
library(topicmodels)
library(tm)
library("XML")
library("SnowballC")
set.seed(1102)
install.packages("corpus.JSS.papers",repos ="http://datacube.wu.ac.at/")
data("JSS_papers", package = "corpus.JSS.papers")
attributes(JSS_papers) #dim 556, 15 Matrix
remove_HTML_markup <- function(s) {
  doc <- htmlTreeParse(s, asText = TRUE, trim = FALSE)
  iconv(xmlValue(xmlRoot(doc)), "", "UTF-8")
}

# Prepare corpus and document-term matrix
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
dtm <- DocumentTermMatrix(corpus, control = list(stemming = TRUE, stopwords = TRUE,
                                                 minWordLength = 3, removeNumbers = TRUE))
dtm <- removeSparseTerms(dtm, 0.99)
dim(dtm)
#LDA
jss_LDA <- LDA(dtm[1:450, ], control = list(alpha = 0.1), k = 10)
post <- posterior(jss_LDA, newdata = dtm[-c(1:450), ])
get_terms(jss_LDA, 5)
     Topic 1  Topic 2  Topic 3    Topic 4   Topic 5   Topic 6    Topic 7   Topic 8   Topic 9
[1,] "model"  "use"    "model"    "packag"  "test"    "use"      "packag"  "statist" "packag"
[2,] "use"    "packag" "estim"    "use"     "model"   "packag"   "provid"  "use"     "data"
[3,] "estim"  "method" "packag"   "data"    "use"     "cluster"  "data"    "comput"  "model"
[4,] "can"    "time"   "function" "user"    "statist" "data"     "analysi" "can"     "use"
[5,] "packag" "provid" "use"      "graphic" "method"  "function" "statist" "calcul"  "function"
     Topic 10
[1,] "use"
[2,] "data"
[3,] "program"
[4,] "can"
[5,] "factor"
datascistuff-blog · 12 years ago
Installing topicmodels R package
Original Link http://theoryno3.blogspot.sg/2010/12/installing-topicmodels-r-package.html
It has been quite annoying trying to install the "topicmodels" package for R. But here's a run down, in case others out there encounter the following error message when installing the package into a non-standard location:

ctm.c:29:25: error: gsl/gsl_rng.h: No such file or directory
ctm.c:30:28: error: gsl/gsl_vector.h: No such file or directory
ctm.c:31:28: error: gsl/gsl_matrix.h: No such file or directory

First, you will need to install the GNU GSL library. Pick it up from here: ftp://ftp.gnu.org/gnu/gsl/. The typical yum or manual compilation should work just fine. If you're installing this library into a non-standard location, take note of the installation path because you will need that below.

Second, even when passing the "--configure-vars" option to point to the location of the GSL include and shared library folder, R CMD INSTALL will fail. The solution? Here:

1) Download the topicmodel source from here: http://cran.r-project.org/web/packages/topicmodels/index.html
2) Unpack into a working folder.
3) Modify the src/Makevars file to read as follows:

LIB_GSL=/path/to/gsl/installation/
PKG_LIBS=-lgsl -lgslcblas -L${LIB_GSL}/lib
PKG_CPPFLAGS=-I$(LIB_GSL)/include

Of course, modify the value for LIB_GSL according to your installation path. Now re-compress the folder structure like so: "tar zcvf topicmodels.tar.gz /path/to/working/folder". Now run "R CMD INSTALL topicmodels.tar.gz". Note that you will also encounter an issue with the vignette generation, so you'll need to install the OAIHarvester package in an R session: install.packages("OAIHarvester")
datascistuff-blog · 12 years ago
LDA (Latent Dirichlet Allocation) Topic Modeling
Given a document corpus, we are interested in understanding the key topics into which these documents can be classified.  
One such algorithm for Topic Modeling is LDA. LDA computes the hidden structure that most likely generated the document corpus. 
1. We have a bunch of documents. Each document exhibits a few topics. The document corpus has a few topics (say 20,50, 100). A given document may have, say, 5 topics.  Each document has a topic distribution. 
2. A topic is a hidden variable. It can be viewed as a distribution over a fixed vocabulary set. 
3. Observed and Hidden Variables: Documents are observed. The topic structure - the topics, per-document topic distribution and the per-document per-word topic assignment are LATENT.
4. LDA is a probabilistic algorithm. In generative probabilistic modeling, data is assumed to arise from a generative process with hidden variables. The generative process defines a joint probability distribution over the observed and hidden variables. From this joint distribution we compute the conditional distribution of the hidden variables given the observed variables; this conditional distribution is called the posterior distribution.
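A tiny R sketch of this generative process for a single document; the vocabulary, the two topic-word distributions, and alpha are all made up, and the Dirichlet draw is simulated by normalizing gamma variables:

set.seed(1)
vocab <- c("gene", "dna", "cell", "ball", "game", "team")
beta  <- rbind(c(0.4, 0.3, 0.3, 0.0, 0.0, 0.0),   # topic 1: word distribution (biology-ish)
               c(0.0, 0.0, 0.0, 0.4, 0.3, 0.3))   # topic 2: word distribution (sports-ish)
alpha <- 0.1

# per-document topic proportions theta ~ Dirichlet(alpha, alpha)
theta <- rgamma(2, shape = alpha)
theta <- theta / sum(theta)

# for each word slot: draw a topic assignment z, then a word from that topic
z     <- sample(1:2, size = 15, replace = TRUE, prob = theta)
words <- sapply(z, function(k) sample(vocab, 1, prob = beta[k, ]))
paste(words, collapse = " ")

LDA inference runs this in reverse: given only the words, it infers the posterior over theta, z, and the topics.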
datascistuff-blog · 12 years ago
Naive Bayes
1. Bayes' theorem
            P(A|B) = P(B|A) P(A) / P(B)
        P(A) - prior, P(B|A) - likelihood, P(B) - marginal likelihood
        The goal is to compute the posterior probability by computing the remaining three probabilities.
        P(B|A) P(A) = P(A ∩ B), the joint probability.
        If A and B are independent, P(A ∩ B) = P(A) P(B).
2. NB uses Bayes' theorem for classification: it computes class probabilities.
     Wrong (but useful) assumption:
   All features are equally important and INDEPENDENT. Without the independence assumption, the computation may be infeasible. Assuming independence among features makes the joint probability easy to compute; this matters especially when the number of features is large.
   Not good for datasets with a large number of numeric features; some discretization may be needed, and the BINNING technique can be useful.
   The probabilities it computes are less reliable, but the predictions in terms of classes are generally okay.
3. The Laplace estimator: if the count-based probability of a feature value given a class is 0 out of n, add a small count (typically 1) to every feature value so that no probability is exactly zero; roughly, 0/n becomes 1/n.
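A sketch with the e1071 package, which exposes the Laplace estimator directly; train (with factor outcome y) and test are placeholder data frames:

library(e1071)

model <- naiveBayes(y ~ ., data = train, laplace = 1)  # laplace = 1 adds 1 to every feature/class count
pred  <- predict(model, newdata = test)                # predicted classes
prob  <- predict(model, newdata = test, type = "raw")  # class probabilities (less reliable, as noted above)
table(pred, test$y)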
datascistuff-blog · 12 years ago
Learning from Data
HW1:
Bins and Marbles 3. We have 2 opaque bags, each containing 2 balls. One bag has 2 black balls and the other has a black ball and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball, it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black?
A. Bayes' theorem
Assume: Bag1 has 2 black balls, Bag2 has 1 white/1black ball
We want to find the probability that we picked bag 1, given that the first ball drawn is black.
P(bag=1 | firstball=black)
= P(firstball=black | bag=1) P(bag=1) / P(firstball=black)
= 1 * 0.5 / (P(firstball=black | bag=1) P(bag=1) + P(firstball=black | bag=2) P(bag=2))
= 0.5 / (1*0.5 + 0.5*0.5) = 0.5/0.75 = 2/3
________________________________________________________
Consider a sample of 10 marbles drawn from a bin that has red and green marbles. The probability that any marble we draw is red is μ = 0.55 (independently, with replacement). We address the probability of getting no red marbles (ν = 0) in the following cases:
4. We draw only one such sample. Compute the probability that ν = 0. The closest answer is (closest is the answer that makes the expression |your answer− given option| closest to 0):
0.0003405 (binomial with nCr, n = r = 10; the answer is (1 - μ)^10 = 0.45^10)
5. We draw 1,000 independent samples. Compute the probability that (at least) one of the samples has ν = 0. The closest answer is:
P(at least one sample has ν = 0) = 1 - P(no sample has ν = 0)
= 1 - (1 - 0.0003405)^1000 ≈ 0.289
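A quick R check of both answers:

(1 - 0.55)^10                    # Q4: P(nu = 0) for one sample of 10 marbles, about 0.00034
1 - (1 - (1 - 0.55)^10)^1000     # Q5: P(at least one of 1,000 samples has nu = 0), about 0.289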
datascistuff-blog · 12 years ago
R : SVM ROC/AUC and Splitting Data Into Training and Testing Data
#Function begins
# split the data set in test and training set
split.data <- function(data, p = 0.7, s = 666){
  set.seed(s)
  index <- sample(1:dim(data)[1])
  train <- data[index[1:floor(dim(data)[1] * p)], ]
  test <- data[index[((ceiling(dim(data)[1] * p)) + 1):dim(data)[1]], ]
  return(list(train = train, test = test))
}
#Function ends
The function takes a matrix and splits it into two parts , train & test.
#dati
dati = split.data(magic04, p = 0.7)
train <- dati$train
test <- dati$test
#str(train)
#str(test)

#SVM TRAINING
library(e1071)
model <- svm(train[, 1:10], train[, 11], probability = T)

# prediction on the test set
pred <- predict(model, test[, 1:(dim(test)[[2]] - 1)], probability = T)

# Check the predictions
table(pred, test[, dim(test)[2]])
pred.prob <- attr(pred, "probabilities")
pred.to.roc <- pred.prob[, 1]

# performance assessment
library(ROCR)
pred.rocr <- prediction(pred.to.roc, as.factor(test[, (dim(test)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
cat("AUC =", deparse(as.numeric(perf.rocr@y.values)), "\n")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = T)
AUC = 0.914772127332079
[ROC curve plot]
datascistuff-blog · 12 years ago
Resources for Data Engineering/Science
DataSci@Harvard http://cs109.org/
  MLBase http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html
  Practical Quant Blog http://practicalquant.blogspot.sg/2013/09/data-analysis-just-one-component-of-the-data-science-workflow.html
Start http://strata.oreilly.com/2013/08/big-data-and-advertising.html
Programming Bayesian Methods in Python
https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
Strata
http://strata.oreilly.com/tag/data-science
datascistuff-blog · 12 years ago
Recommender Systems
http://spectrum.ieee.org/computing/software/deconstructing-recommender-systems
datascistuff-blog · 12 years ago
Processing WikiPedia
Loading from HDFS to Hbase   https://github.com/whym/wikihadoop