mjliu - Tumblr blog

mjliu · 5 years ago

Text

BF = Posterior odds / Prior odds
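A quick sanity check of that identity, as a minimal sketch: the helper names `odds` and `bayes_factor` and the 0.5 / 0.8 probabilities are made up for illustration, not taken from any particular analysis.

```python
def odds(prob):
    """Convert a probability into odds in favour of the hypothesis."""
    return prob / (1.0 - prob)


def bayes_factor(prior_prob, posterior_prob):
    """BF = posterior odds / prior odds."""
    return odds(posterior_prob) / odds(prior_prob)


# Hypothetical numbers: belief in H1 moves from 50% to 80% after seeing the data.
bf = bayes_factor(prior_prob=0.5, posterior_prob=0.8)
print(bf)  # roughly 4: the data favour H1 about four to one
```

Read the other way round, posterior odds = BF x prior odds, so a BF above 1 shifts belief toward H1 and a BF below 1 shifts it away.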

mjliu · 5 years ago

Text

Mathematica for free, heh?!

Mathematica is a sort of intriguing tool to have in one's toolbox. I have it on my Raspberry Pi and it works just fine. And when I need it elsewhere? Well, Wolfram just released the Wolfram Engine to developers, so here are the steps:

download and install wolfram engine

git clone https://github.com/WolframResearch/WolframLanguageForJupyter.git

(optional) edit configure-jupyter.wls (e.g. with vi) to change the display name from Wolfram Language to Mathematica (gawrsh, I like the old name much better)

from inside the cloned directory, run ./configure-jupyter.wls add "path_to_wolfram_engine" "path_to_jupyter"

launch your Jupyter notebook and enjoy

For me, on the Microsoft Data Science VM (Ubuntu), the engine path is /usr/local/Wolfram/WolframEngine/12.0/Executables/wolfram and the jupyter path is /data/anaconda/envs/py35/bin/jupyter

#mathematica #jupyter #wolframengine

mjliu · 5 years ago

Text

LDA vs word2vec, really?

These two are related but not directly comparable. LDA's intent is to identify the hidden topics in a corpus, while word2vec represents words in a high-dimensional embedding space that preserves context. So word2vec is contextual, at least to some degree.

Some say to compare LDA to doc2vec instead, but those are not comparable either: doc2vec gives one dense vector per document (in its crudest form, just the sum or average of the document's word2vec vectors), not a set of topics. If we really want a comparison, word2vec (or doc2vec) + k-means (or another clustering technique) is something that could be compared to LDA topic modelling.

To make things more interesting: could we feed the word2vec representation, instead of term counts or tf-idf, into the LDA calculations to model topics?
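Here is a rough sketch of the "word vectors + k-means" idea, in plain Python so it runs anywhere. The 2-d vectors in `WORD_VECTORS` are invented for illustration (real word2vec vectors are learned and much wider), and `doc_vector` / `kmeans` are hypothetical helper names, not anything from a library.

```python
import random

# Hypothetical 2-d "word vectors"; a real model would learn these from text.
WORD_VECTORS = {
    "tea": (0.9, 0.1), "coffee": (0.8, 0.2), "cup": (0.7, 0.3),
    "cpu": (0.1, 0.9), "gpu": (0.2, 0.8), "ram": (0.3, 0.7),
}


def doc_vector(tokens):
    """Represent a document as the average of its known word vectors."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid wins
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels


docs = [["tea", "cup"], ["coffee", "cup"], ["cpu", "ram"], ["gpu", "ram"]]
labels = kmeans([doc_vector(d) for d in docs], k=2)
print(labels)  # the two drink docs share one label, the two hardware docs the other
```

Each cluster then plays the role of a "topic", with the caveat that k-means puts a document in exactly one cluster while LDA gives it a mixture over topics.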

#nlp #lda #word2vec #word embedding #topic model

mjliu · 5 years ago

Link

mjliu · 5 years ago

Link

mjliu · 5 years ago

Photo

Seems to be something for customer survey mining: customer-topic

mjliu · 6 years ago

Link

mjliu · 6 years ago

Link

mjliu · 6 years ago

Text

My awesome A/B test calculator http://bit.ly/ABCalc

mjliu · 6 years ago

Text

maiden voyage with word2vec on spark

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('abc').getOrCreate()
sc = spark.sparkContext

# toy corpus: one short text plus an id per row
l = pd.DataFrame((['hello world', 1],
                  ['alice wonderland', 2],
                  ['simplicity is thy ultimate sophistication', 3]))
df = spark.createDataFrame(l)
in_col = df.columns[0]

# split each text on non-word characters
regexTokenizer = RegexTokenizer(inputCol=in_col, outputCol='words', pattern='\\W')
regexTokenized = regexTokenizer.transform(df)

# remove stop words though not necessary
remover = StopWordsRemover(inputCol='words', outputCol='filtered')
filtered = remover.transform(regexTokenized)

word2vec = Word2Vec(vectorSize=20, minCount=1, inputCol='filtered', outputCol='result')
model = word2vec.fit(filtered)
result = model.transform(filtered)
V = model.getVectors()

# find top 3 synonyms
model.findSynonyms('simplicity', 3).show()

# workaround for a Spark session sync bug: re-attach the live session to V
# (pokes at private attributes, may not be needed on newer Spark versions)
V.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
V._sc = spark._sc

# put results into local pandas
V_pd = V.toPandas()

spark.stop()

#word2vec #word embedding #spark

mjliu · 6 years ago

Text

quick test drive on LDA topic modelling

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

corpus_verbatim = ["computer, tea, earl grey, hot"]

# consider removing stop words and punctuation
cnt_vec = CountVectorizer(stop_words='english')
cnt_data = cnt_vec.fit_transform(corpus_verbatim)
words = cnt_vec.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0

lda = LDA(n_components=3)
lda.fit(cnt_data)

for topic_idx, topic in enumerate(lda.components_):
    # parentheses matter: '%d' % topic_idx + 1 would try to add 1 to a string
    print('\nTopic #%d:' % (topic_idx + 1))
    # top 10 words by weight for this topic
    print(' '.join([words[i] for i in topic.argsort()[:-11:-1]]))

#LDA #topic-modelling

mjliu · 6 years ago

Text

Some days you get the bear, some days the bear gets ya

mjliu · 6 years ago

Text

Plot lift chart using R

layout: default
title: "Plot lift chart using R"
date: 2018-08-16 19:31:45 -0700
categories: www

Here is the code to draw a cumulative lift chart:

library(tidyverse)

# label: 0/1 response vector, prediction: model score (both assumed to exist already)
n.groups <- 100
lift.tbl <- data.frame(label = label, pred = prediction) %>%
  drop_na() %>%
  mutate(cutoffs = ntile(-pred, n.groups)) %>%   # group 1 = highest scores
  group_by(cutoffs) %>%
  summarise(total = n(), totalresp = sum(label, na.rm = TRUE)) %>%
  mutate(cum.resp = cumsum(totalresp),
         gain = cum.resp / sum(totalresp),
         cum.lift = gain / (cutoffs / n.groups))

# plot separately so lift.tbl keeps the table rather than plot()'s return value
with(lift.tbl, plot(cutoffs, cum.lift, type = "l"))
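To see the arithmetic on numbers small enough to check by hand, here is a minimal sketch of the same cumulative lift calculation in plain Python. The function name `cumulative_lift` and the ten labels/scores are made up for illustration, and the group boundaries are cut with simple integer division, which may differ slightly from `ntile` when the rows do not divide evenly.

```python
def cumulative_lift(labels, scores, n_groups=10):
    """Cumulative lift per group, highest-scored group first."""
    # rank responses by descending score
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    n = len(ranked)
    total_resp = sum(ranked)
    lifts = []
    cum_resp = 0
    for g in range(1, n_groups + 1):
        start = (g - 1) * n // n_groups
        end = g * n // n_groups
        cum_resp += sum(ranked[start:end])
        gain = cum_resp / total_resp          # share of all responders captured so far
        lifts.append(gain / (g / n_groups))   # versus share of population contacted
    return lifts


# Hypothetical data: 10 customers, 4 responders, mostly among the high scores.
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lifts = cumulative_lift(labels, scores, n_groups=5)
print(lifts)
```

The top 20% of scores holds 2 of the 4 responders, so the first value is 0.5 / 0.2 = 2.5, and the curve necessarily ends at 1.0 once everyone is included.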

LLAP

mjliu · 6 years ago

Text

bayes analyses (SAS) class notes

layout: default
title: "bayes analyses (SAS) class notes"
date: 2017-06-26 07:26:27 -0700

/* Chapter I */

/* Demo #1: GENMOD non informative prior */

model low = alcohol hist_hyp mother_wt prev_pretrm / dist=binomial link=logit;
bayes seed=90210 outpost=out_birth stats=all;
title 'Bayesian Analysis of Low Birth Weight Model with non informative prior';
run;

/* autocall macros */
/* gelman is used to compare chains */
/* %gelman(chains); */
%tadplot(data=out_birth, var=loglike logpost intercept alcohol);
%geweke(data=out_birth);
%heidel(data=out_birth);
%raftery(data=out_birth);
%mcse(data=out_birth);
%ess(data=out_birth);
%postcov(data=out_birth);
%postcor(data=out_birth);
%postint(data=out_birth);
%postsum(data=out_birth);
%cater(data=out_birth, var=intercept alcohol);

proc print data=pred(obs=50);
run;

proc means data=pred clm var mean n std;
run;

/* Demo #2: Informative Prior */

data prior_birth;
  input _TYPE_ $ alcohol1 hist_hyp mother_wt prev_pretrm;
  datalines;
Mean 1.0986 0 0 0
Var 0.00116 1e6 1e6 1e6
;
run;

proc genmod data=sasuser.birth desc;
  class alcohol(desc);
  model low = alcohol hist_hyp mother_wt prev_pretrm / dist=binomial link=logit;
  lsmeans alcohol / diff oddsratio plots=all cl;
  bayes seed=27513 coeffprior=normal(input=prior_birth) sampling=arms
        outpost=out_birth2 plots(smooth)=all diag=all nmc=25000;
  title 'Bayesian Analysis of Low Birth Weight Model with informative prior for alcohol1';
run;

/* Demo #3: PHREG .. what is survival analysis? how does that link to telco?
   churn analysis, time to call, time to device burnout etc. */

data prior_methadone;
  input _TYPE_ $ dose clinic1 prison;
  datalines;
Mean -0.034160 0 0
Var .0003217 1e6 1e6
;
run;

proc phreg data=sasuser.methadone;
  class clinic (param=ref ref='2');
  model time*status(0)=clinic dose prison / ties=exact;
  bayes seed=27513 coeffprior=normal(input=prior_methadone) diag=all
        plots(smooth)=all sampling=rwm thin=10 nmc=200000 statistics=all;
  hazardratio "HR1" clinic;
  hazardratio "HR2" dose / units=10;
  hazardratio "HR3" prison;
  title "Bayesian Analysis with Informative Prior for Methadone Data";
run;

/* Chapter II */

/* Demo #1: Logistic Regression */
/* --> see above birth example */
/* and take time to implement it in R? */

/* Demo #2: PREDDIST */
/* --> see above to output in genmod? nope, pre-existing procs do not support PREDDIST */
/* hand calc required on posterior using IML */

/* Demo #3: Mixed Model, random effect .. a hierarchical model is considered a
   particular type of Bayesian network, not to be confused with a mixture (mixing) model?
   reference: http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_fmm_a0000000343.htm
   mixed model:             y = f + gamma
   mixing model (mixture):  y = f * pr(U)
*/

/* proc hpbnet */

class adhesive toy;
model pressure = adhesive;
repeated subject = toy;
run;

proc glimmix data=sasuser.toy;
  class adhesive toy;
  model pressure = adhesive;
  random int / subject=toy residual;
run;

proc fmm data=sasuser.toy;
  class adhesive toy;
  model pressure = adhesive;
  model + toy;
  bayes;
run;

proc sgplot data=sasuser.toy;
  series x=adhesive y=pressure;
run;

proc mcmc data=sasuser.toy plots=all stats=all;
  parms beta0 beta1 beta2 sigma2;
  prior beta: ~ N(0, var=1e6);
  prior sigma2 ~ igamma(2.001, 1.001);
  mu = beta0 + beta1*adhesive;
  random ga ~ gamma(2.001, 1.001) subject=toy;
  model pressure ~ N(mu, var=sigma2);
run;

/* Demo #4: ZIP, ZINB, genmod or fmm?
   in this particular case, genmod does not support bayes while fmm does */

model roots = photo | bap / dist=poi;
model + / dist=constant;
bayes outpost=roots_out;
run;

/* Demo #5: Missing Value */
/* impute missing values before modeling or build it right into the model */

/* Chapter III */

/* Demo #1: historical data, prior elicitation */
/* over tea, napkin sketches */

proc mixed data=sasuser.crossover;
  class patient sequence visit drug;
  model changehr = drug sequence visit / solution;
  random patient;
run;

proc genmod data=sasuser.crossover;
  class patient sequence visit drug;
  model changehr = drug sequence visit;
  repeated sub=patient;
run;

proc glimmix data=sasuser.crossover;
  class patient sequence visit drug;
  model changehr = drug sequence visit / solution;
  random drug / sub=patient;
run;

proc univariate data=crossover;
  var changehr;
  histogram;
run;

proc fmm data=crossover;
  class patient drug sequence visit;
  model changehr = drug sequence visit;
  model + patient;
  bayes metrop nmc=500000 thin=15 nbi=5000 betapriorparms=(0, 100);
run;

/* Demo #2: exact likelihood meta analysis */
/* Demo #3: normal approximation meta analysis */

/* further thoughts ... not really meta analysis per se:
   hierarchical (multilevel) modeling, social network analysis */
/* poisson or negative binomial: should we even consider poisson? it is the more direct
   way of thinking, but the mu = var assumption might be a limitation? */

/* solution II */

proc fmm data=sasuser.bakery;
  class surf flour;
  model volume = surf;
  model + flour;
  bayes outpost=bakeryout2;
  title "fmm this make more sense or is it?";
run;

proc genmod data=sasuser.bakery;
  class surf flour;
  model volume = surf;
  repeated subject=flour;
  lsmeans surf / plots=none;
run;

proc glimmix data=sasuser.bakery;
  class surf flour;
  model volume = surf;
  random int / sub=flour;
  lsmeans surf / plots=none;
run;

data bakery_post;
  set bakeryout2;
  surf1 = parm_1 + parm_2;
  surf2 = parm_1 + parm_3;
  surf3 = parm_1;
  contrast_1_2 = (surf1-surf2 gt 0);
  contrast_1_3 = (surf1-surf3 gt 0);
  contrast_2_3 = (surf2-surf3 gt 0);
run;

proc means data=bakery_post n mean stddev stderr var clm maxdec=2;
  title "bakery example post analysis";
run;

proc sgplot data=bakery_post;
  density surf1;
  density surf2;
  density surf3;
run;

%postint(data=bakery_post);

mjliu · 6 years ago

Text

Stats for Hackers

title: "Stats for Hackers"
author: "M. Liu"
date: 2015-01-10 11:11:11 -0700
categories: www
output: html_document

warm-up: coin toss

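22 heads out of 30 coin tosses. Is it a fair coin? How probable is a result like that?

Classic

```{r}
# test against the fair-coin null, p = 0.5
prop.test(22, 30, p = 0.5)
```

Simulation

```{r}
N <- 10000
x <- numeric(N)
for (i in 1:N) {
  # 1 if 30 simulated fair tosses give at least 22 heads
  x[i] <- ifelse(sum(sample(c(0, 1), 30, replace = TRUE)) >= 22, 1, 0)
}
sum(x) / N
```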
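The simulated share can be cross-checked against the exact binomial tail. A minimal sketch in Python, where `binom_tail` is just a helper name picked for this note:

```python
from math import comb


def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# chance of 22 or more heads in 30 tosses of a fair coin
p_value = binom_tail(22, 30)
print(round(p_value, 4))  # 0.0081
```

So a fair coin produces 22+ heads in well under 1% of 30-toss runs, and the simulation should land close to that.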

sneetches

Classic Student's t-test

```{r}
y1 <- c(84, 72, 57, 46, 63, 76, 99, 91)
y2 <- c(81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69)
# t.test() runs Welch's test by default; add var.equal = TRUE for the classic Student version
t.test(y1, y2)
```

Bayesian estimation

```{r}
require(BEST)
best.sneetches <- BESTmcmc(y1, y2)
summary(best.sneetches)
plotAll(best.sneetches)
```

Bayes factor: posterior odds = prior odds x likelihood ratio (the likelihood ratio is the Bayes factor)

0.578 https://en.wikipedia.org/wiki/Bayes_factor

```{r}
require(BayesFactor)
bf.sneetches <- ttestBF(y1, y2, posterior = TRUE, iterations = 10000)
summary(bf.sneetches)
plot(bf.sneetches)
```

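shuffling and resampling

```{r}
ys <- c(y1, y2)
obs_diff <- mean(y1) - mean(y2)
mean_diff <- numeric(N)

for (i in 1:N) {
  # shuffle the labels: 8 scores drawn without replacement form group 1, the other 12 group 2
  idx <- sample(length(ys), length(y1))
  mean_diff[i] <- mean(ys[idx]) - mean(ys[-idx])
}

# share of shuffles with a difference at least as large as the observed one
mean(mean_diff >= obs_diff)
```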

yertle bootstrapping

```{r}
yertle <- c(48, 24, 32, 61, 51, 12, 32, 18, 19, 24, 21, 41, 29, 21, 25, 23, 42, 18, 23, 13)
ymean <- numeric(N)
for (i in 1:N) {
  # resample the observed heights with replacement, same size as the original sample
  y.sample <- sample(yertle, length(yertle), replace = TRUE)
  ymean[i] <- mean(y.sample)
}
mean(ymean)  # bootstrap estimate of the mean
sd(ymean)    # bootstrap standard error of the mean
```

cross validation

mjliu · 6 years ago

Text

Factorial experiments - hierarchical Bayesian mcmc

mjliu · 6 years ago

Text

Experimentation drives innovation

First, something about experiments

Experiment

Two group experiment

Typical control and treatment

T test

U test

Proportion test

Chi-Square test, exact test

[All of the above rely on a p-value. p > .05 does not mean the null hypothesis (H naught) is true, merely that there is not enough statistical evidence to reject it. Absence of evidence is not evidence of absence.]

Simulation, BF (Bayes factor)

Factorial

2+ factors such as dosage and age group

ANOVA

Hierarchical Bayesian regression

Crossover, sequential?

Cochran-Armitage trend test

Bandit (Thompson sampling, Bayesian UCB)
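For the bandit item, a minimal Thompson sampling sketch with Bernoulli arms and Beta(1, 1) priors. The function name `thompson_sampling`, the two conversion rates, the round count and the seed are all assumptions made up for illustration.

```python
import random


def thompson_sampling(true_rates, rounds=2000, seed=42):
    """Simulate a Bernoulli bandit; returns how often each arm was played."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(rounds):
        # draw one plausible conversion rate per arm from its Beta posterior
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))      # play the arm with the best draw
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


# Hypothetical experiment: variant B (60%) truly beats variant A (20%).
pulls = thompson_sampling([0.2, 0.6])
print(pulls)  # most of the traffic ends up on the second arm
```

Unlike a fixed 50/50 A/B split, the allocation shifts toward the better arm while the experiment is still running, which is the main appeal over the classic tests listed above.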
