I am deeply thankful to anyone who is willing to take action against the dictatorship currently practiced in China. No matter who you are and no matter what your profession is. No matter whether you are a famous person or a nobody like me. No matter whether you live far away from the Chinese government's influence or right under its regime. No matter whether the action you have taken is as small as signing a petition or as big as organizing your own public protest. I believe every action counts, and eventually we will remove the evil that watches its people ruthlessly and mercilessly, the tyrant that constantly uses its economic advantage to force international corporations to comply with its own rules. We will return justice to the people who insist on it. We will return freedom of speech to the people who no longer believe they will ever have it.
Random ideas about NLP
About text summarization:
Instead of using "alignment", why not use a similarity score? I think this problem is similar to the "paraphrasing" problem, except that the summary should be short compared to the original article. There are many ways to summarize the original article; we just need the best (or better) ones, not the "exact" one.
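To make the similarity-score idea a bit more concrete, here is a minimal sketch of my own (not from any paper) that ranks candidate summaries by TF-IDF cosine similarity to the source article; the article and candidates are made up.

```python
# Rank candidate summaries by how similar they are to the source article,
# instead of forcing an exact word-level alignment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "The central bank raised interest rates again to fight rising inflation."
candidates = [
    "Bank raises rates to curb inflation.",
    "A new smartphone was released today.",
]

vectorizer = TfidfVectorizer().fit([article] + candidates)
article_vec = vectorizer.transform([article])
candidate_vecs = vectorizer.transform(candidates)

# Higher score = candidate covers more of the article's content.
scores = cosine_similarity(candidate_vecs, article_vec).ravel()
print(sorted(zip(scores, candidates), reverse=True))
```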
Crazy thoughts:
Since humans can be taught to program with a finite set of formal languages, why not "train" a computer to condense natural language into a similarly small formal language, instead of creating a complicated graph to describe the dependencies? Maybe the first step is to create a complicated graph describing all possible dependencies and then simplify that graph...
Anyone care to comment on these, or let me know where I should post this nonsense?
2/26/2018 Feed Summary
2/23/2018 Feed Summary on DropBox
2018/2/19 Feed summary on dropbox
Last compressive sensing introduction
Interesting fake news tracer (encoding graph)
2018/02/18 Feed summary on Dropbox
2018/02/02 Feed summary on DropBox
2018/1/29 Feed summary on DropBox
2018/1/28 Feed Summary
Machine Learning Mastery
A Gentle Introduction to Neural Machine Translation: The first post in the Neural Translation series. This post gives some Machine Translation background. The development of Machine Translation over time is listed and illustrated as follows:
Rule-based MT incorporates all the linguistic properties and constructs grammar rules in order to capture the syntactic structure.
Data-driven MT, or Statistical Machine Translation, uses the training data alone and maximizes the likelihood of source-target pairs at aligned positions. It uses phrase-based translation, which allows variable-length inputs, and it can construct latent or hidden states that play the same role as the context vector produced by an encoder-decoder. However, this approach suffers when training cases are scarce and on rare words; it therefore needs to incorporate linguistic or syntactic information.
Neural-based MT is a variant of data-driven MT which features phrase-based translation and allows a variable-length source phrase to be transformed into a shared context vector through the model's encoder/decoder mechanism. The fixed length of the shared context vector was an issue for phrase-based translation until the attention mechanism (jointly align and translate) was introduced into neural translation; the variable-length context generated through attention allows more flexibility in the model.
Encoder-Decoder recurrent model: In this second post, the author introduces the encoder-decoder recurrent model, which is the core model used in the Google translation service (since 2014). He introduces two variations of the encoder-decoder RNN:
Sutskever NMT model: An end-to-end model which encodes the input into a fixed-length context vector and then outputs a variable-length target translation with the decoder. It was first developed for English-French translation. It uses LSTMs and gradient clipping to tackle the exploding-gradient problem. Pre-processing appends a tag to the source sentence, reverses the source input order, and maps out-of-vocabulary words to UNK.
Cho NMT model: A similar sequence-to-sequence (end-to-end) model to the Sutskever NMT model, but using GRUs (Gated Recurrent Units, a simplified variant of the LSTM) instead of full LSTM units. As above, it trains an English-French translation model, with a much smaller batch size. It also uses a pooling layer such as maxout.
Cho NMT model + Attention Mechanism: Building on the previous paper, Cho et al. observed decreasing performance as the input sequence length and the vocabulary size increase. To mitigate this, they proposed the attention mechanism, in which a variable-length context vector is jointly trained with alignment (the model outputs a target word and finds the best alignment position).
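As a rough illustration of the encoder-decoder idea above (not the exact models from the Sutskever or Cho papers), here is a minimal Keras sketch with made-up vocabulary sizes and dimensions, using the usual teacher-forcing setup:

```python
# Minimal encoder-decoder sketch: the encoder compresses the variable-length
# source into a fixed context (h, c); the decoder generates the target
# sequence conditioned on that context.
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding

src_vocab, tgt_vocab, latent_dim = 10000, 12000, 256   # hypothetical sizes

enc_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

dec_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                         return_state=True)(dec_emb,
                                            initial_state=[state_h, state_c])
dec_outputs = Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```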
Beam search in NMT: In order to find the best output sequence, two popular methods are often used in NLP: greedy decoding, which picks the maximum-likelihood word when generating the next word, and beam search, which instead keeps the K most probable candidate sequences while generating (K, also called the beam width, is a tunable parameter). Beam search often does better than greedy decoding. Simple Python code is available in the post.
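The post has its own code; as a stand-in, here is a toy beam-search sketch of mine over a fixed matrix of per-step log-probabilities (a real decoder would recompute the probabilities conditioned on the chosen prefix at every step):

```python
import numpy as np

def beam_search(step_log_probs, k=3):
    """Toy beam search over per-step log-probabilities (timesteps x vocab).
    Keeps the k highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]                      # (token sequence, cumulative log-prob)
    for log_probs in step_log_probs:
        candidates = []
        for seq, score in beams:
            for token, lp in enumerate(log_probs):
                candidates.append((seq + [token], score + lp))
        # Keep only the k best-scoring candidates (k = the beam width).
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Hypothetical decoder output: 4 timesteps over a vocabulary of 5 tokens.
probs = np.random.dirichlet(np.ones(5), size=4)
print(beam_search(np.log(probs), k=3))
```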
Configure Encoder-Decoder in NMT: Mainly discusses a 2017 paper on a large-scale exploration of NMT architectures, which uses English-German translation and discusses the configurations a model needs to achieve state-of-the-art translation results. The hyperparameters they studied are listed in a table in the post. They found that attention (attention dimension and attention type) and beam search (beam size and length penalty) improve results significantly compared with models without them. Other hyperparameter tuning makes only a minor difference, such as embedding dimension (128 is generally good; higher is better, e.g. 2048 achieves the best result, but only marginally). They compare different RNN cells, including vanilla RNN, GRU and LSTM; performance tracks the complexity of the cell type, with LSTM performing best. The depth of the encoder and decoder matters little (one layer is sufficient for good results in one direction). For the encoder direction, bidirectional is better than unidirectional, and reversing the source is better than not reversing it.
The main difficulty in conducting this survey is that the space of possible model configurations is too large to explore exhaustively. Some heuristic knowledge might be required to tune a seq2seq model.
Hands-on with Keras, using a French-English dataset:
Prepare French-English data: Uses the European Parliament 1996-2011 English-French dataset. Some basic text processing is done, such as tokenizing on whitespace, lowercasing, removing punctuation, converting French characters to Latin (ASCII), and removing non-printable and non-alphabetic tokens (numbers etc.). A minor note: he frequently uses str.maketrans in his code to build a translation table for single-character replacement, rather than going through re.sub (a small cleaning sketch follows after this list).
Using Keras to build the model from scratch: Uses pre-trained embeddings and the Keras wrapper.
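The cleaning sketch mentioned above, not the author's exact script; the sample sentence is made up:

```python
import string
import unicodedata

def clean_line(line):
    # Normalize accented French characters to their closest Latin/ASCII form.
    line = unicodedata.normalize("NFD", line).encode("ascii", "ignore").decode("ascii")
    line = line.lower()
    # str.maketrans builds a translation table; here it deletes punctuation.
    line = line.translate(str.maketrans("", "", string.punctuation))
    # Keep only printable characters, then drop tokens that are not alphabetic.
    line = "".join(ch for ch in line if ch in string.printable)
    tokens = [tok for tok in line.split() if tok.isalpha()]
    return " ".join(tokens)

print(clean_line("Reprise de la session: numéro 42 !"))
# -> "reprise de la session numero"
```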
Terminology:
phrase-based translation: A variable-length translation model which doesn't rely on window-based segmentation of the whole sentence.
end-to-end model: There are no components that are trained separately.
Today's Paper:
Massive Exploration of Neural Machine Translation Architectures
2018/1/27 Feed Summary
The Morning Paper
One Model to Learn Them All: In this post, the author summarizes a paper introducing MultiModel, a general deep learning framework that trains 8 tasks at the same time (image recognition, image caption generation, speech recognition, parsing, German/French-to-English translation and the reverse English-to-German/French translation). MultiModel consists of an encoder/decoder architecture that shares a common learning unit, plus a mixture-of-experts layer to dispatch the learning effort. The article referred to in this post shows that training such a mixed model doesn't cause any performance degradation and sometimes helps tasks with less data available (parsing). This might imply that much larger-scale cross-domain transfer or multi-task learning experiments could be done in the future. The article also points out that even though certain computational building blocks need to be present for specific domains (convolutional neural networks for images, attention / mixture-of-experts for language models), their presence does not interfere with the capability to learn tasks in other domains.
(RW: At first glimpse of this short summary, the article referred to in this post doesn't show any convincing result that cross-domain training works, e.g. by showing how different these domains really are; it just iterates conclusions similar to past experiments. Does image captioning take on the role of bridging the image and language domains? How about choosing tasks randomly to break the MultiModel, in order to learn the degree of correlation shared among models? Would a sub-model that is trained insufficiently take the whole unified model down? I think those questions might shed some light on cross-domain learning.)
KDNuggets
Kogentix Automated Machine Learning Platform: Another MLaaS that targets business data (pipeline shown in a figure in the post) and claims to be the only platform running natively on Spark.
Data Engineer Introduction Part 1: In a figure borrowed from The AI Hierarchy of Needs, Airflow (a workflow/monitoring tool from Airbnb) sits at the second layer of the AI Needs pyramid.
The Democratization of Artificial Intelligence and Deep Learning: Free e-book give-away. The democratization of Artificial Intelligence is the idea of making AI applications accessible to everyone.
Data Science Job Market Trends: automation, data empowerment, mass cleanup, ethics & influence, and blockchain apps
O'Reilly Media / AI
TensorFlow + mobile devices: introduces TensorFlow Lite
Using Apache MXNet for anomaly detection: Tutorial for using MXNet. Traditional methods used to detect anomalies include the Kalman filter, KNN, K-Means, and autoencoders with DL. IoT time-series data is used for the demonstration. The author trains an encoder-decoder and flags as an anomaly any data point whose reconstruction error falls outside 3 standard deviations (a rough sketch of this rule appears after this list).
LSTM introduction with TensorFlow: using an LSTM to classify stock tweets
2018 trends in AI, O'Reilly version, including:
Bayesian methods in deep learning, and optimizing training through neuro-evolution on top of gradient-based deep learning.
Low-cost hardware to improve computation efficiency.
Fast-evolving AI tools, including simulators (and reinforcement learning to automate deep learning training, such as AutoML), AI development toolboxes handling more complicated / multimodal inputs, and finally tools not aimed at data scientists or AI engineers, such as friendly UI/UX or intelligent wearables.
Replacing low-skilled tasks with automation.
Other ethics issues around AI applications.
Convolutional NN for language modeling: tutorial using 1D kernels and TensorFlow
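The anomaly-detection sketch mentioned above: just the 3-standard-deviation thresholding step applied to hypothetical reconstruction errors (no MXNet code here):

```python
import numpy as np

def flag_anomalies(errors, n_sigma=3.0):
    """Flag points whose reconstruction error is more than n_sigma
    standard deviations away from the mean error."""
    errors = np.asarray(errors)
    mu, sigma = errors.mean(), errors.std()
    return np.abs(errors - mu) > n_sigma * sigma

# Hypothetical reconstruction errors from an encoder-decoder model,
# with two obvious outliers appended at the end.
errors = np.concatenate([np.random.normal(0.05, 0.01, 500), [0.4, 0.5]])
print(np.where(flag_anomalies(errors))[0])   # indices of suspected anomalies
```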
2018/1/26 Feed Summary
No Free Hunch
Another tutorial for GANs: Describes a current challenge for GANs, mode collapse, where the Generator only learns one class of image pattern at each turn, until the Discriminator learns how to combat the Generator's learned behavior. This problem is said to be alleviated by backpropagating through the Discriminator as well. It also provides a link to ICCV for more GAN details.
Other resources about mode collapse: Mode Collapse in GANs: Mode collapse in GANs happens when the target probability distribution is multi-modal (which is often true in real-life cases) and the Generator-Discriminator duel gets stuck in a cat-and-mouse cycle. In this never-ending cycle, the Generator first learns one mode until the Discriminator learns how to combat it, then switches to another mode, and so on (this might imply GANs work worse when the dataset consists of distinct distributions rather than one big mixture). The solutions in the current literature can roughly be divided into four (or three) categories:
Produce more diverse samples within one batch. This can be done by studying the differences between minibatches of true images and fake images, which leads to methods using minibatch discrimination and feature matching.
Using a minimax objective over more of the game instead: the loss function used for GANs in Ian Goodfellow's 2016 paper doesn't take the whole play history into consideration, as most AI game agents do. Incorporating more simulation runs might mitigate this. Cons: time-consuming to train.
Randomly insert old batches into the current run: this is similar to category 2, forcing the Discriminator to become a more "memorable" learner.
Boosting multiple GANs (actually the more accurate word is stacking): each GAN in the ensemble should learn one mode of the whole sample. Cons: time-consuming to train.
The author also suggests looking at the f-divergence used in the GAN objective function and finds that the original GAN objective settles into seeking one mode. This problem should be alleviated when more information (as mentioned above) is included (see below).
The GAN objective, from practice to theory and back again: In this post, the author carefully examines the loss function used in DCGAN and experiments with different f-divergences, including KL-divergence and JS-divergence. The author contends that, in theory, the minimum should be achieved when the target probability distribution equals the proposed one no matter which divergence is used; in practice, however, the choice of divergence does matter. In terms of the modes of the target distribution versus the proposed one, the behavior can be classified as "mode-seeking" (find the prominent mode) or "mode-covering" (try to average over all the modes). The divergence used in DCGAN exhibits mode-seeking behavior, as shown in the GAN improvement paper (which explains the mode collapse problem). The author further tries three other divergences: f-GAN, reverse KL-divergence and JS-divergence. The generated images are shown, but not the loss behavior during training. PS: Recall the loss function used in DCGAN: each step does two feed-forward passes through the Discriminator, one with real images and one with fake images; in the generator phase the backpropagation update is applied only to the Generator (the Discriminator is updated in its own phase).
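To illustrate the two-phase loop described in the PS, here is a schematic Keras GAN training sketch of my own, with made-up dimensions and random data standing in for real images (not the DCGAN architecture itself):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras.optimizers import Adam

latent_dim, data_dim = 32, 64    # hypothetical sizes

generator = Sequential([Dense(128, input_dim=latent_dim), LeakyReLU(0.2),
                        Dense(data_dim, activation="tanh")])
discriminator = Sequential([Dense(128, input_dim=data_dim), LeakyReLU(0.2),
                            Dense(1, activation="sigmoid")])
discriminator.compile(optimizer=Adam(1e-4), loss="binary_crossentropy")

# Combined model: generator followed by a frozen discriminator, so the
# generator update backpropagates through D but only changes G.
# (The discriminator keeps its own, already-compiled training step.)
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer=Adam(1e-4), loss="binary_crossentropy")

real = np.random.normal(size=(128, data_dim))          # stand-in for real images
for step in range(10):
    noise = np.random.normal(size=(128, latent_dim))
    fake = generator.predict(noise)
    # Phase 1: the discriminator sees a real batch and a fake batch.
    discriminator.train_on_batch(real, np.ones((128, 1)))
    discriminator.train_on_batch(fake, np.zeros((128, 1)))
    # Phase 2: generator update through the frozen discriminator.
    gan.train_on_batch(noise, np.ones((128, 1)))
```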
Algorithmia
Racial bias in facial recognition: the article discusses racial bias, and OpenFace serves as an MNIST-like training set to standardize face recognition training.
Serverless Microservices: Introduction and [extend AWS Alexa]: Algorithmia features a fast deployment cycle by combining several development stages (from a monolithic architecture) into the same stage via microservices. Based on a serverless structure, development can focus on the code rather than the environment (server) (FaaS). Current cloud platforms providing serverless microservices are Google Cloud Functions, AWS Lambda, Azure Functions, and the Algorithmia Serverless AI cloud.
Doing Bayesian Data Analysis
The problem of treating ordinal ratings as metric in movie ratings: An ordinal variable is not appropriate to express as a distance or metric measure. It should be modeled with an ordered-probit model (similar to a latent model / Dirichlet process: each scale level has its own descriptive distribution parameters, but the means of the scale distributions are constrained to follow the pre-defined order). For example, in the movie rating problem, the 5-star rating scale cannot be measured directly because the exact spacing is unknown; the only information we have is the ordinal scale. With the assumption that the ordinal scale is the same for everyone (and therefore the same across all movies), one can model each scale level as having its own distribution. In this post, the author also shows that, when ratings are treated as a metric (computing the average rating for one movie), the mean does not increase consistently or monotonically with the movie's rating scale. The author also shows that two movies can have very similar average ratings (treated as metric) but very different means under the ordinal-probit model, or one movie can have a slightly worse average rating than another but end up with a higher mean under the ordinal-probit model. RW: However, the assumption that everyone shares the same ordinal scale is not always true, but modeling sentiment labels as a multinomial distribution probably isn't correct either. Hence, quoting the author's reply to a comment about the same-ordinal-scale assumption:
"there is no guarantee that a useful model is a correct model"
Supplementary reading: beyond one-hot encoding: in this post the author tries 7 encoding methods for categorical variables, which are:
One-hot encoding (or effects coding) handles unordered categorical variables. This is often used to compare all groups to the grand mean.
Dummy coding is similar to one-hot encoding but with a control group (represented as all zeros). This is often used to compare non-control groups to the control group.
Contrast coding is used when the a priori structure is known: the between-group difference is known to be significantly larger while the within-group difference is small. It requires assigning orthogonal codings in a regression setting, with each assigned coding constrained to sum to zero. Other variants of contrast coding include: polynomial coding, a variant of ordinal coding for when the spacing is all equal (not really understood); backward difference coding; and Helmert coding.
Ordinal coding can be applied to categorical variables whose values are ordered; encoding this way directly takes the integer level as the value.
Binary: further turns the integer encoding into a binary representation; this often results in a shorter code length than one-hot.
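A small pandas sketch contrasting three of the encodings above (one-hot, ordinal, binary) on a made-up column; this is my own toy example, not the post's code:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "medium", "large", "medium"]})

# One-hot: one indicator column per category, no order implied.
one_hot = pd.get_dummies(df["size"], prefix="size")

# Ordinal: map categories directly to their integer level.
order = {"small": 1, "medium": 2, "large": 3}
ordinal = df["size"].map(order)

# Binary: write the integer level in base 2, one column per bit
# (shorter than one-hot once the number of categories grows).
binary = ordinal.apply(lambda v: pd.Series([int(b) for b in format(v, "02b")]))

print(one_hot, ordinal, binary, sep="\n\n")
```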
Today's Paper
Neural Machine Translation by Jointly Learning to Align and Translate: the original attention mechanism paper
2018/1/25 Daily Feeds Summary
DeepMind
Multi-agent asymmetric games backed by game theory: A decomposition method is proposed in this paper, using game theory to decompose an asymmetric multi-agent game into two or more symmetric games. A symmetric game is a classic game theory problem such as the prisoner's dilemma or a zero-sum game (often two players); such a game only achieves minimal loss when the players' choices of actions strike a Nash equilibrium. In an asymmetric game, on the other hand, the two agents have the same set of actions but separate reward systems, which makes the overall analysis more complicated. A typical asymmetric game is the "Battle of the Sexes", where the best solution is often unstable under traditional analysis techniques (a toy check for its pure-strategy equilibria appears after this list). After decomposition, the equilibria of the asymmetric game can be found. The analysis could be applied to multi-agent environments where complicated strategies such as coordination are needed.
Generalized Probability Smoothing: In this paper, a complete code-length analysis is conducted, leading to a state-of-the-art data compression algorithm.
DeepMind Control Suite: In this paper, a benchmark set for reinforcement learning is made available for public access.
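The toy check promised above: a brute-force search for pure-strategy Nash equilibria in a 2x2 Battle of the Sexes game, using one common choice of payoffs. Both coordinated outcomes come out as equilibria, which is part of what makes the asymmetric game awkward to analyze with symmetric tools:

```python
import numpy as np

actions = ["opera", "football"]
# payoff[i][j] = (row player's payoff, column player's payoff)
payoff = np.array([[(2, 1), (0, 0)],
                   [(0, 0), (1, 2)]])

def pure_nash(payoff):
    eq = []
    n_rows, n_cols = payoff.shape[:2]
    for i in range(n_rows):
        for j in range(n_cols):
            row_best = payoff[i, j, 0] >= payoff[:, j, 0].max()
            col_best = payoff[i, j, 1] >= payoff[i, :, 1].max()
            if row_best and col_best:     # neither player gains by deviating alone
                eq.append((actions[i], actions[j]))
    return eq

print(pure_nash(payoff))   # both coordinated outcomes are equilibria
```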
No Free Hunch (Kaggle)
Interviews:
A collection of interviews with dataset publishing award winners: learn how they collect data and keep contributing code / kernels to competitions.
Masking Challenge winner interview: The Mercedes-Benz Masking Challenge poses the problem of tackling high-dimensional and also diverse features. The winner starts off with 67,528 features and retains only 900 features at the end (by manually grouping variables whose sample size is lower than 50, and by using decision trees to transform categorical variables into numerical ones). The model used is (of course) XGBoost. In this interview, some modeling insights are shared (especially trapping individual variables and proposing important interactions):
cumulative sums of binary variables can capture joint information about two or more variables, but also introduce artificial dependencies between variables (which XGBoost handles); see the sketch after this list
trapping individual variables is applied in the stacking phase (though not very successfully, as the author puts it)
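The sketch referenced above: a tiny pandas example of cumulative sums over a fixed ordering of binary columns, so each new column also encodes the values of the columns before it (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"X001": [0, 1, 1, 0],
                   "X002": [1, 1, 0, 0],
                   "X003": [0, 1, 1, 1]})

# Running sum left-to-right across the binary columns.
cum = df[["X001", "X002", "X003"]].cumsum(axis=1).add_suffix("_cumsum")
print(pd.concat([df, cum], axis=1))
```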
Today's Paper:
Online Dictionary Learning for Sparse Coding
2018/1/24 Today's paper:
Neural Attention Model for Text Summarization: From Facebook AI Research
2018/1/24 Feed Summary
inference
The Generalization Mystery: Sharp vs Flat Minima: Two papers that analyze the loss landscape and use flatness as an indicator of generalization are presented and discussed in this post. They are a paper analyzing sharp minima (flatness can be manipulated by certain network re-parameterizations, so it cannot be the only indicator of generalization) and one visualizing the loss landscape (where flatness can be visualized in 1D or 2D plots through re-parameterization; however, the author contends that this experiment only considers a small portion of the possible re-parameterizations and so cannot be conclusive). The author further develops his own indicator of generalization using the ratio of two generalization measurements. In fact, the author derives a "local measure of generalization ability" which considers the losses of two consecutive minibatches, divided by a hyper-parameter $\epsilon$ denoting a restricted region the next minibatch step is allowed to move within (i.e. the region of flatness). The author further proposes using a KL divergence instead of the Euclidean norm as the $\epsilon$ measure, to maintain true invariance. All of this assumes SGD, since the small batches of SGD may yield better generalization. (RW: looks like computing statistics from adaptive learning-rate methods such as momentum.)
KDNuggets
ML SaaS comparison: Compares Amazon (ML and SageMaker), Microsoft (Azure) and Google (Prediction API and ML Engine). The comparison is based on which common models can be trained in each cloud and how easy they are to use in terms of automation and parameter tuning (table available). The NLP APIs (speech and text) and image recognition APIs provided online are also covered. Details included.
GA for hyperparameter tuning of RNNs: a tutorial on constructing a GA (DEAP) to optimize an RNN (Keras)
Excel with Pandas: a tutorial article on using pandas with a bundle of Excel packages
H2O + R + Deep Learning: introduces "Flow", a web-based frontend for the DL framework
Cognitive Computing:
evaluating conversational AI: paper
topic-based conversational bot evaluation: paper
About Alexa: podcast and TED talk