mjliu
The Game is On
83 posts
"When you have eliminated the impossible, whatever remains, however improbable, must be the truth?" - Sherlock Holmes
mjliu · 5 years ago
BF = Posterior odds / Prior odds
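A tiny worked example of that ratio, as a sketch only: the helper name `bayes_factor` and the probabilities (prior 1/2, posterior 3/4) are made up for illustration.

```python
from fractions import Fraction

def bayes_factor(prior_prob, posterior_prob):
    """BF = posterior odds / prior odds, with odds = p / (1 - p)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = posterior_prob / (1 - posterior_prob)
    return posterior_odds / prior_odds

# Hypothetical numbers: belief in H1 moves from 1/2 to 3/4 after seeing the data.
bf = bayes_factor(Fraction(1, 2), Fraction(3, 4))
print(bf)  # prior odds 1, posterior odds 3, so BF = 3
```

Exact fractions are used here so the odds arithmetic stays free of rounding noise.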
mjliu · 5 years ago
Mathematica for free, heh?!
Mathematica is a sorta intriguing tool to have in one's toolbox. I have it on my Raspberry Pi, where it works just fine. And when I need it elsewhere? Well, Wolfram just released the Wolfram Engine to developers, so here are the steps:
download and install the Wolfram Engine
git clone https://github.com/WolframResearch/WolframLanguageForJupyter.git
(optional) vi configure-jupyter.wls to change the display name from Wolfram Language to Mathematica (garwsh, I like the old name much better)
./configure-jupyter.wls add "path_to_wolfram_engine" "path_to_jupyter"
launch your Jupyter notebook and enjoy
For me, on the Microsoft Data Science VM (Ubuntu), the engine path is /usr/local/Wolfram/WolframEngine/12.0/Executables/wolfram and the Jupyter path is /data/anaconda/envs/py35/bin/jupyter
mjliu · 5 years ago
LDA vs word2vec, really?
These two are related but not directly comparable. LDA's intent is to identify the hidden topics in a corpus (plural: corpora), while word2vec represents words in a high-dimensional embedding space that preserves their context. So word2vec is contextual, well, at least to some degree.
Some say to compare LDA to doc2vec instead, but those are not comparable either, as doc2vec is essentially word2vec extended to whole documents (in its simplest form, a sum or average of the word vectors). If we really want a comparison, then word2vec (or doc2vec) + k-means (or another clustering technique) might be something to compare LDA topic modelling to.
To make things more interesting: could we feed the word2vec representation, instead of tf-idf, into the LDA calculations to model topics?
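To make the "word vectors + clustering" idea concrete, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the 2-D "embeddings" are hand-made rather than trained word2vec vectors, documents are represented by the average of their word vectors, and the k-means is a bare-bones version with a naive initialisation.

```python
# Hand-made 2-D stand-ins for word2vec vectors (hypothetical values).
word_vecs = {
    "tea": (0.9, 0.1), "coffee": (0.8, 0.2), "cup": (0.85, 0.15),
    "spark": (0.1, 0.9), "cluster": (0.2, 0.8), "python": (0.15, 0.95),
}

def doc_vector(tokens):
    """Represent a document as the average of its known word vectors."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=10):
    """Bare-bones k-means; the first k points serve as initial centers."""
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: nearest center for every point
        labels = [min(range(k), key=lambda j: sq_dist(pt, centers[j]))
                  for pt in points]
        # update step: move each center to the mean of its members
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

docs = [["tea", "cup"], ["spark", "cluster"], ["coffee", "tea"], ["python", "spark"]]
labels = kmeans([doc_vector(d) for d in docs], k=2)
print(labels)  # the two beverage docs share one label, the two tech docs the other
```

The cluster labels play the role that topics play in LDA, except that each document gets exactly one cluster, whereas LDA gives each document a mixture over topics.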
mjliu · 5 years ago
Seems to be something for customer survey mining .. customer-topic
mjliu · 6 years ago
My awesome A/B test calculator http://bit.ly/ABCalc
mjliu · 6 years ago
maiden voyage with word2vec on spark
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('abc').getOrCreate()
sc = spark.sparkContext

# toy corpus: one sentence per row
l = pd.DataFrame([['hello world', 1],
                  ['alice wonderland', 2],
                  ['simplicity is thy ultimate sophistication', 3]],
                 columns=['text', 'id'])
df = spark.createDataFrame(l)
in_col = df.columns[0]

# split the text on non-word characters
regexTokenizer = RegexTokenizer(inputCol=in_col, outputCol='words', pattern='\\W')
regexTokenized = regexTokenizer.transform(df)

# remove stop words though not necessary
remover = StopWordsRemover(inputCol='words', outputCol='filtered')
filtered = remover.transform(regexTokenized)

word2vec = Word2Vec(vectorSize=20, minCount=1, inputCol='filtered', outputCol='result')
model = word2vec.fit(filtered)
result = model.transform(filtered)
V = model.getVectors()

# find top 3 synonyms
model.findSynonyms('simplicity', 3).show()

# workaround for a Spark session-sync bug: re-attach the active session
V.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
V._sc = spark._sc

# put results into local pandas
V_pd = V.toPandas()
spark.stop()
mjliu · 6 years ago
quick test drive on LDA topic modelling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

corpus_verbatim = ["computer, tea, earl grey, hot"]

# consider removing stop words and punctuation
cnt_vec = CountVectorizer(stop_words='english')
cnt_data = cnt_vec.fit_transform(corpus_verbatim)
words = cnt_vec.get_feature_names_out()  # get_feature_names() on older scikit-learn

lda = LDA(n_components=3)
lda.fit(cnt_data)

# print the top 10 words of each topic
for topic_idx, topic in enumerate(lda.components_):
    print('\nTopic #%d:' % (topic_idx + 1))
    print(' '.join([words[i] for i in topic.argsort()[:-11:-1]]))
mjliu · 6 years ago
Some days you get the bear, some days bear gets ya
mjliu · 6 years ago
Plot lift chart using R
layout: default
title: "Plot lift chart using R"
date: 2018-08-16 19:31:45 -0700
categories: www
Here is the code to draw a cumulative lift chart:
require(tidyverse)

# label: observed 0/1 responses; prediction: model scores (both assumed to exist already)
n.groups <- 100
lift.tbl <- data.frame(label = label, pred = prediction) %>%
  drop_na() %>%
  mutate(cutoffs = ntile(-pred, n.groups)) %>%   # group 1 holds the highest scores
  group_by(cutoffs) %>%
  summarise(total = n(), totalresp = sum(label, na.rm = TRUE)) %>%
  mutate(cum.resp = cumsum(totalresp),
         gain = cum.resp / sum(totalresp),
         cum.lift = gain / (cutoffs / n.groups))

# keep the table and plot it in a separate step
with(lift.tbl, plot(cutoffs, cum.lift, type = "l"))
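For anyone outside R, the same cumulative-lift arithmetic can be sketched in plain Python. This is only an illustration under a few assumptions: the function name `cumulative_lift` and the toy labels and scores are made up, and groups are cut by simple integer division, which can differ slightly from `ntile` when the row count is not a multiple of the number of groups.

```python
def cumulative_lift(labels, preds, n_groups=10):
    """Return (group, gain, cum_lift) tuples, best-scored group first."""
    order = sorted(range(len(preds)), key=lambda i: -preds[i])  # high scores first
    n = len(labels)
    total_resp = sum(labels)
    rows, cum_resp, start = [], 0, 0
    for g in range(1, n_groups + 1):
        end = g * n // n_groups                      # right edge of group g
        cum_resp += sum(labels[i] for i in order[start:end])
        start = end
        gain = cum_resp / total_resp                 # share of all responders captured
        rows.append((g, gain, gain / (g / n_groups)))  # lift vs. random targeting
    return rows

# Toy data: 10 customers, 4 responders, scores already roughly informative.
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
preds = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
lift = cumulative_lift(labels, preds, n_groups=5)
for g, gain, cum_lift in lift:
    print(g, round(gain, 3), round(cum_lift, 3))
```

The last group always ends with a lift of 1, since by then the whole population has been targeted.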
LLAP
mjliu · 6 years ago
bayes analyses (SAS) class notes
layout: default
title: "bayes analyses (SAS) class notes"
date: 2017-06-26 07:26:27 -0700
/* Chapter I */

/* Demo #1: GENMOD, non-informative prior */
/* (opening PROC statement lost from the source) */
  model low = alcohol hist_hyp mother_wt prev_pretrm / dist=binomial link=logit;
  bayes seed=90210 outpost=out_birth stats=all;
  title 'Bayesian Analysis of Low Birth Weight Model with non informative prior';
run;

/* autocall macros */
/* gelman is used to compare chains */
/* %gelman(chains); */
%tadplot(data=out_birth, var=loglike logpost intercept alcohol);
%geweke(data=out_birth);
%heidel(data=out_birth);
%raftery(data=out_birth);
%mcse(data=out_birth);
%ess(data=out_birth);
%postcov(data=out_birth);
%postcor(data=out_birth);
%postint(data=out_birth);
%postsum(data=out_birth);
%cater(data=out_birth, var=intercept alcohol);

proc print data=pred(obs=50);
run;

proc means data=pred clm var mean n std;
run;

/* Demo #2: Informative Prior */
data prior_birth;  /* DATA statement restored; name taken from coeffprior= below */
  input _TYPE_ $ alcohol1 hist_hyp mother_wt prev_pretrm;
  datalines;
Mean 1.0986 0 0 0
Var 0.00116 1e6 1e6 1e6
;
run;

proc genmod data=sasuser.birth desc;
  class alcohol(desc);
  model low = alcohol hist_hyp mother_wt prev_pretrm / dist=binomial link=logit;
  lsmeans alcohol / diff oddsratio plots=all cl;
  bayes seed=27513 coeffprior=normal(input=prior_birth) sampling=arms
        outpost=out_birth2 plots(smooth)=all diag=all nmc=25000;
  title 'Bayesian Analysis of Low Birth Weight Model with informative prior for alcohol1';
run;

/* Demo #3: PHREG .. what is survival analysis? How does it link to telco?
   Churn analysis, time to call, time to device burnout, etc. */
data prior_methadone;  /* DATA statement restored; name taken from coeffprior= below */
  input _TYPE_ $ dose clinic1 prison;
  datalines;
Mean -0.034160 0 0
Var .0003217 1e6 1e6
;
run;

proc phreg data=sasuser.methadone;
  class clinic (param=ref ref='2');
  model time*status(0)=clinic dose prison / ties=exact;
  bayes seed=27513 coeffprior=normal(input=prior_methadone) diag=all
        plots(smooth)=all sampling=rwm thin=10 nmc=200000 statistics=all;
  hazardratio "HR1" clinic;
  hazardratio "HR2" dose / units=10;
  hazardratio "HR3" prison;
  title "Bayesian Analysis with Informative Prior for Methadone Data";
run;

/* Chapter II */

/* Demo #1: Logistic Regression */
/* --> see the birth example above */
/* and take time to implement it in R? */

/* Demo #2: PREDDIST */
/* --> see above for the output in genmod? Nope, the pre-existing procs do not support PREDDIST */
/* hand calculation on the posterior using IML is required */

/* Demo #3: Mixed Model, random effect .. a hierarchical model is considered a
   particular type of Bayesian network, not to be confused with a mixture (mixing) model?
   reference: http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_fmm_a0000000343.htm
   mixed model:            y = f + gamma
   mixing model (mixture): y = f * pr(U)
*/
/* proc hpbnet */
/* (opening PROC statement lost from the source) */
  class adhesive toy;
  model pressure = adhesive;
  repeated subject = toy;
run;

proc glimmix data=sasuser.toy;
  class adhesive toy;
  model pressure = adhesive;
  random int / subject=toy residual;
run;

proc fmm data=sasuser.toy;
  class adhesive toy;
  model pressure = adhesive;
  model + toy;
  bayes;
run;

proc sgplot data=sasuser.toy;
  series x=adhesive y=pressure;
run;

proc mcmc data=sasuser.toy plots=all stats=all;
  parm beta0 beta1 beta2 sigma2;
  prior beta: ~ N(0, var=1e6);
  prior sigma2 ~ igamma(2.001, 1.001);
  mu = beta0 + beta1*adhesive;
  random ga ~ gamma(2.001, 1.001) subject=toy;
  model pressure ~ N(mu, var=sigma2);
run;

/* Demo #4: ZIP, ZINB: genmod or fmm?
   In this particular case, genmod does not support bayes while fmm does */
/* (opening PROC statement lost from the source) */
  model roots = photo | bap / dist=poi;
  model + / dist=constant;
  bayes outpost=roots_out;
run;

/* Demo #5: Missing Value */
/* impute missing values before modeling or build it right into the model */

/* Chapter III */

/* Demo #1: historical data, prior elicitation */
/* over tea, napkin sketches */
proc mixed data=sasuser.crossover;
  class patient sequence visit drug;
  model changehr = drug sequence visit / solution;
  random patient;
run;

proc genmod data=sasuser.crossover;
  class patient sequence visit drug;
  model changehr = drug sequence visit;
  repeated sub=patient;
run;

proc glimmix data=sasuser.crossover;
  class patient sequence visit drug;
  model changehr = drug sequence visit / solution;
  random drug / sub=patient;
run;

proc univariate data=crossover;
  var changehr;
  histogram;
run;

proc fmm data=crossover;
  class patient drug sequence visit;
  model changehr = drug sequence visit;
  model + patient;
  bayes metrop nmc=500000 thin=15 nbi=5000 betapriorparms=(0, 100);
run;

/* Demo #2: exact likelihood meta analysis */
/* Demo #3: normal approximation meta analysis */
/* further thoughts ... not really meta analysis per se:
   hierarchical (multilevel) modeling
   social network analysis */
/* Poisson or negative binomial: should we even consider Poisson? It is the more
   direct way of thinking, but the mu = var assumption might be a limitation? */

/* solution II */
proc fmm data=sasuser.bakery;
  class surf flour;
  model volume = surf;
  model + flour;
  bayes outpost=bakeryout2;
  title "fmm this make more sense or is it?";
run;

proc genmod data=sasuser.bakery;
  class surf flour;
  model volume = surf;
  repeated subject=flour;
  lsmeans surf / plots=none;
run;

proc glimmix data=sasuser.bakery;
  class surf flour;
  model volume = surf;
  random int / sub=flour;
  lsmeans surf / plots=none;
run;

/* posterior draws per surfactant level and pairwise "greater than" indicators */
data bakery_post;
  set bakeryout2;
  surf1 = parm_1 + parm_2;
  surf2 = parm_1 + parm_3;
  surf3 = parm_1;
  contrast_1_2 = (surf1-surf2 gt 0);
  contrast_1_3 = (surf1-surf3 gt 0);
  contrast_2_3 = (surf2-surf3 gt 0);
run;

proc means data=bakery_post n mean stddev stderr var clm maxdec=2;
  title "bakery example post analysis";
run;

proc sgplot data=bakery_post;
  density surf1;
  density surf2;
  density surf3;
run;

%postint(data=bakery_post);
mjliu · 6 years ago
Stats for Hackers
title: "Stats for Hackers"
author: "M. Liu"
date: 2015-01-10 11:11:11 -0700
categories: www
output: html_document
warm-up: coin toss
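22 heads out of 30 coin tosses. Is it a fair coin? How probable is a result like that?
Classic
```{r}
# the third argument is the null proportion: 0.5 for a fair coin
prop.test(22, 30, p = 0.5)
```
Simulation
```{r}
N <- 10000
x <- numeric(N)
for (i in 1:N) {
  # 1 if a run of 30 fair tosses gives at least 22 heads
  x[i] <- ifelse(sum(sample(c(0, 1), 30, replace = TRUE)) >= 22, 1, 0)
}
sum(x) / N
```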
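As a cross-check on the simulation, the tail probability can also be computed exactly from the binomial distribution. A minimal sketch in Python; the function name `binom_tail` is just a label chosen here.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Probability of at least 22 heads in 30 tosses of a fair coin.
p_value = binom_tail(22, 30)
print(round(p_value, 5))  # about 0.00806, so under 1%
```

The simulated proportion should land near this value, give or take sampling noise.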
sneetches
Classic Student's t-test
```{r}
y1 <- c(84, 72, 57, 46, 63, 76, 99, 91)
y2 <- c(81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69)
# Welch's version by default; add var.equal = TRUE for the classic Student's test
t.test(y1, y2)
```
Bayesian estimates
```{r}
require(BEST)
best.sneetches <- BESTmcmc(y1, y2)
summary(best.sneetches)
plotAll(best.sneetches)
```
Bayes factor: posterior odds = prior odds × likelihood ratio (aka Bayes factor)
0.578 https://en.wikipedia.org/wiki/Bayes_factor
```{r}
require(BayesFactor)
bf.sneetches <- ttestBF(y1, y2, posterior = TRUE, iterations = 10000)
summary(bf.sneetches)
plot(bf.sneetches)
```
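shuffling and resampling
```{r}
ys <- c(y1, y2)
mean_diff <- numeric(N)
for (i in 1:N) {
  # shuffle the labels: pick 8 scores (without replacement) to play the role of y1
  idx <- sample(length(ys), length(y1))
  mean_diff[i] <- mean(ys[idx]) - mean(ys[-idx])
}
# share of shuffles with a difference at least as large as the observed one
mean(mean_diff >= mean(y1) - mean(y2))
```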
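Since the shuffling step is the part that is easiest to get wrong, here is the same idea as a small Python sketch: pool both groups, shuffle, split back into groups of the original sizes, and count how often the shuffled difference in means is at least as large as the observed one. The function name, the fixed seed and the one-sided comparison are choices made for this illustration.

```python
import random

def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """One-sided permutation test on the difference in means (a minus b)."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                      # relabel without replacement
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = sum(left) / len(left) - sum(right) / len(right)
        if diff >= observed:
            hits += 1
    return observed, hits / n_iter

y1 = [84, 72, 57, 46, 63, 76, 99, 91]
y2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]
observed, p = permutation_p_value(y1, y2)
print(round(observed, 2), p)
```

The observed difference is 73.5 minus about 66.92; the p-value depends on the random shuffles, so it is printed rather than quoted.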
yertle
bootstrapping
```{r}
yertle <- c(48, 24, 32, 61, 51, 12, 32, 18, 19, 24, 21, 41, 29, 21, 25, 23, 42, 18, 23, 13)
yertle.mean <- mean(yertle)
yertle.se <- sqrt(var(yertle) / length(yertle))
ymean <- numeric(N)
for (i in 1:N) {
  # resample the observed heights with replacement and keep the mean
  y.sample <- sample(yertle, length(yertle), replace = TRUE)
  ymean[i] <- mean(y.sample)
}
mean(ymean)  # compare with yertle.mean
sd(ymean)    # bootstrap standard error, compare with yertle.se
```
cross validation
mjliu · 6 years ago
Factorial experiments - hierarchical Bayesian mcmc
mjliu · 6 years ago
Experimentation drives innovation
First, something about experiments
Experiment
Two group experiment
Typical control and treatment
T test
U test
Proportion test
Chi-Square test, exact test
[All of the above look for a p-value. p > .05 does not mean H naught is true, merely that there is no statistical evidence to reject it. Absence of evidence is not evidence of absence.]
Simulation, BF (Bayes factor)
Factorial
2+ factors such as dosage and age group
ANOVA
Hierarchical Bayesian regression
Crossover, sequential?
Cochran Armitage trend test
Bandit (Thompson sampling, Bayesian UCB)
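The bandit entry is the least familiar item on the list, so here is a minimal sketch of Thompson sampling for two variants with yes/no outcomes. The conversion rates, the seed and the Beta(1, 1) priors are assumptions for illustration, not anything taken from a real experiment.

```python
import random

def thompson_sampling(true_rates, n_rounds=1000, seed=42):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was played."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins, losses, pulls = [0] * n_arms, [0] * n_arms, [0] * n_arms
    for _ in range(n_rounds):
        # draw a plausible conversion rate per arm from its Beta posterior
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = max(range(n_arms), key=draws.__getitem__)  # play the best draw
        pulls[arm] += 1
        # simulate the outcome (true_rates is unknown to the algorithm itself)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.8])  # hypothetical control vs. treatment rates
print(pulls)  # traffic should drift heavily toward the second arm
```

Unlike a fixed 50/50 split, the allocation shifts toward the better variant as evidence accumulates, which is the appeal of bandits over a classic two-group test.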