fastforwardlabs - Tumblr blog

fastforwardlabs · 8 years ago

Text

Thomas Wiecki on Probabilistic Programming with PyMC3

A rolling regression with PyMC3: instead of the regression coefficients being constant over time (the points are daily stock prices of 2 stocks), this model assumes they follow a random-walk and can thus slowly adapt them over time to fit the data best.

Probabilistic programming is coming of age. While normal programming languages denote procedures, probabilistic programming languages denote models and perform inference on these models. Users write code to specify a model for their data, and the languages run sampling algorithms across probability distributions to output answers with confidence rates and levels of uncertainty across a full distribution. These languages, in turn, open up a whole range of analytical possibilities that have historically been too hard to implement in commercial products.

One sector where probabilistic programming will likely have significant impact is financial services. Be it when predicting future market behavior or loan defaults, when analyzing individual credit patterns or anomalies that might indicate fraud, financial services organizations live and breathe risk. In that world, a tool that makes it easy and fast to predict future scenarios while quantifying uncertainty could have tremendous impact. That’s why Thomas Wiecki, Director of Data Science for the crowdsourced investment management firm Quantopian, is so excited about probabilistic programming and the new release of PyMC3 3.0.

We interviewed Dr. Wiecki to get his thoughts on why probabilistic programming is taking off now and why he thinks it’s important. Check out his blog, and keep reading for highlights!

A key benefit of probabilistic programming is that it makes it easier to construct and fit Bayesian inference models. You have a history working with Bayesian methods in your doctoral work on cognition and psychiatry. How did you use them?

One of the main problems in psychiatry today is that disorders like depression or schizophrenia are diagnosed based purely on subjective reporting of symptoms, not biological traits you can measure. By way of comparison, imagine if a cardiologist were to prescribe heart medication based on answers you gave in a questionnaire! Even the categories used to diagnose depression aren’t that valid, as two patients may have completely different symptoms, caused by different underlying biological mechanisms, but both fall under the broad category “depressed.” My thesis tried to change that by identifying differences in cognitive function -- rather than reported symptoms -- to diagnose psychiatric diseases. Towards that goal, we used computational models of the brain, estimated in a Bayesian framework, to try to measure cognitive function. Once we had accurate measures of cognitive function, we used machine learning to train classifiers to predict whether individuals were suffering from certain psychiatric or neurological disorders. The ultimate goal was to replace disease categories based on subjective descriptions of symptoms with objectively measurable cognitive function. This new field of research is generally known as computational psychiatry, and is starting to take root in industries like pharmaceuticals to test the efficacy of new drugs.

What exactly was Bayesian about your approach?

We mainly used it to get accurate fits of our models to behavior. Bayesian methods are especially powerful when there is hierarchical structure in data. In computational psychiatry, individual subjects either belong to a healthy group or a group with psychiatric disease. In terms of cognitive function, individuals are likely to share similarities with other members of their group. Including these groupings into a hierarchical model gave more powerful and informed estimates about individual subjects so we could make better and more confident predictions with less data.

Bayesian inference provides robust means to test hypotheses by estimating how different two different groups are from one another.

How did you go from computational psychiatry to data science at Quantopian?

I started working part-time at Quantopian during my PhD and just loved the process of building an actual product and solving really difficult applied problems. After I finished my PhD, it was an easy decision to come on full-time and lead the data science efforts there. Quantopian is a community of over 100.000 scientists, developers, students, and finance professionals interested in algorithmic trading. We provide all the tools and data necessary to build state-of-the-art trading algorithms. As a company, we try to identify the most promising algorithms and work with the authors to license them for our upcoming fund, which will launch later this year. The authors retain the IP of their strategy and get a share of the net profits.

What’s one challenging data science problem you face at Quantopian?

Identifying the best strategies is a really interesting data science problem because people often overfit their strategies to historical data. A lot of strategies thus often look great historically but falter when actually used to trade with real money. As such, we let strategies bake in the oven a bit and accumulate out-of-sample data that the author of the strategy did not have access to, simply because it hadn’t happened yet when the strategy was conceived. We want to wait long enough to gain confidence, but not so long that strategies lose their edge. Probabilistic programming allows us to track uncertainty over time, informing us when we’ve waited long enough to have confidence that the strategy is actually viable and what level of risk we take on when investing in it.

It’s tricky to understand probabilistic programming when you first encounter it. How would you define it?

Probabilistic programming allows you to flexibly construct and fit Bayesian models in computer code. These models are generative: they relate unobservable causes to observable data, to simulate how we believe data is created in the real world. This is actually a very intuitive way to express how you think about a dataset and formulate specific questions. We start by specifying a model, something like “this data fits into a normal distribution”. Then, we run flexible estimation algorithms, like Markov Chain Monte Carlo (MCMC), to sample from the “posterior”, the distribution updated in light of our real-world data, which quantifies our belief into the most likely causes underlying the data. The key with probabilistic programming is that model construction and inference are almost completely independent. It used to be that those two were inherently tied together so you had to do a lot of math in order to fit a given model. Probabilistic programming can estimate almost any model you dream up which provides the data scientist with a lot of flexibility to iterate quickly on new models that might describe the data even better. Finally, because we operate in a Bayesian framework, the models rest on a very well thought out statistical foundation that handles uncertainty in a principled way.

Much of the math behind Bayesian inference and statistical sampling techniques like MCMC is not new, but probabilistic tooling is. Why is this taking off now?

There are mainly three reasons why probabilistic programming is more viable today than it was in the past. First is simply the increase in compute power, as these MCMC samplers are quite costly to run. Secondly, there have been theoretical advances in the sampling algorithms themselves, especially a new class called Hamiltonian Monte Carlo samplers. These are much more powerful and efficient in how they sample data, allowing us to fit highly complex models. Instead of sampling at random, Hamiltonian samplers use the gradient of the model to focus sampling on high probability areas. By contrast, older packages like BUGS could not compute gradients. Finally, the third required piece was software using automatic differentiation -- an automatic procedure to compute gradients on arbitrary models.

What are the skills required to use probabilistic programming? Can any data scientist get started today or are there prerequisites?

Probabilistic programming is like statistics for hackers. It used to be that even basic statistical modeling required a lot of fancy math. We also used to have to sacrifice the ability to really map the complexity in data to make models that were tractable, but just too simple. For example, with probabilistic programming we don’t have to do something like assume our data is normally distributed just to make our model tractable. This assumption is everywhere because it’s mathematically convenient, but no real-world data looks like this! Probabilistic programming enables us to capture these complex distributions. The required skills are the ability to code in a language like Python and a basic knowledge of probability to be able to state your model. There are also a lot of great resources out there to get started, like Bayesian Analysis with Python, Bayesian Methods for Hackers, and of course the soon-to-be-released Fast Forward Labs report!

Congratulations on the new release of PyMC3! What differentiates PyMC3 from other probabilistic programming languages? What kinds of problems does it solve best? What are its limitations?

Thanks, we are really excited to finally release it, as PyMC3 has been under continuous development for the last 5 years! Stan and PyMC3 are among the current state-of-the-art probabilistic programming frameworks. The main difference is that Stan requires you to write models in a custom language, while PyMC3 models are pure Python code. This makes model specification, interaction, and deployment easier and more direct. In addition to advanced Hamiltonian Monte Carlo samplers, PyMC3 also features streaming variational inference, which allows for very fast model estimation on large data sets as we fit a distribution to the posterior, rather than trying to sample from it. In version 3.1, we plan to support more variational inference algorithms and GPUs, which will make things go even faster!

For which applications is probabilistic programming the right tool? For which is it the wrong tool?

If you only care about pure prediction accuracy, probabilistic programming is probably the wrong tool. However, if you want to gain insight into your data, probabilistic programming allows you to build causal models with high interpretability. This is especially relevant in the sciences and in regulated sectors like healthcare, where predictions have to be justified and can’t just come from a black-box. Another benefit is that because we are in a Bayesian framework, we get uncertainty in our parameters and in our predictions, which is important for areas where we make high-stakes decisions under very noisy conditions, like in finance. Also, if you have prior information about a domain you can very directly build this into the model. For example, let’s say you wanted to estimate the risk of diabetes from a dataset. There are many things we already know even without looking at the data, like that high blood sugar increases that risk dramatically -- we can build that into the model by using an informed prior, something that’s not possible with most machine learning algorithms.

Finally, hierarchical models are very powerful, but often underappreciated. A lot of data sets have an inherent hierarchical structure. For example, take individual preferences of users on a fashion website. Each individual has unique tastes, but often shares tastes with similar users. For example, people are more likely to have similar taste if they have the same sex, or are in the same age group, or live in the same city, state, or country. Such a model can leverage what it has learned from other group members and apply it back to an individual, leading to much more accurate predictions, even in the case where we might only have few data points per individual (which can lead to cold start problems in collaborative filtering). These hierarchies exist everywhere but are all too rarely taken into account properly. Probabilistic programming is the perfect framework to construct and fit hierarchical models.

Interpretability is certainly an issue with deep neural nets, which also require far more data than Bayesian models to train. Do you think Bayesian methods will be important for the future of deep learning?

Yes, and it’s a very exciting area! As we’re able to specify and estimate deep nets or other machine learning methods in probabilistic programming, it could really become a lingua franca that removes the barrier between statistics and machine learning, giving a common tool to do both. One thing that’s great about PyMC3 is that the underlying library is Theano, which was originally developed for deep learning. Theano helps bridge these two areas, combining the power nets have to extract latent representations out of high-dimensional data with variational inference algorithms to estimate models in a Bayesian framework. Bayesian deep learning is hot right now, so much so that NIPS offered a day-long workshop. I’ve also written about the benefits in this post and this post, explaining how Bayesian methods provide more rigor around the uncertainty and estimations of deep net predictions and provides better simulations. Finally, Bayesian Deep Learning will also allow to build exciting new architectures, like Hierarchical Bayesian Deep Networks that are useful for transfer learning. A bit like the work you did to get stronger results from Pictograph using the Wordnet hierarchy.

Bayesian deep nets provide greater insight into the uncertainty around predicted values at a given point. Read more here.

What books, papers, and people have had the greatest influence on you and your career?

I love Dan Simmons’ Hyperion Cantos series, which got me hooked on science fiction. Michael Frank (my PhD advisor) and EJ Wagenmakers first introduced me to Bayesian statistics. The Stan guys, who developed the NUTS sampler and black-box variational inference, have had a huge influence on PyMC3. They continue to push the boundaries of applied Bayesian statistics. I also really like the work coming out of the labs of David Blei and Max Welling. We hope that PyMC3 will also be an influential tool on the productivity and capabilities on data scientists across the world.

How do you think data and AI will change the financial services industry over the next few years? What should all hedge fund managers know?

I think it’s already had a big impact on finance! And as the mountains of data continue to grow, so will the advantage computers have over humans in their ability to combine and extract information out of that data. Data scientists, with their ability to pull that data together and build the predictive models will be the center of attention. That is really at the core of what we’re doing at Quantopian. We believe that by giving people everywhere on earth a platform that’s state-of-the-art for free we can find that talent before anyone else can.

#probabilistic programming #data science #interview #bayesian statistics

6 notes · View notes

fastforwardlabs · 8 years ago

Text

Five 2016 Trends We Expect to Come to Fruition in 2017

The start of a new year is an excellent occasion for audacious extrapolation. Based on 2016 developments, what do we expect for 2017?

This blog post covers five prominent trends: Deep Learning Beyond Cats, Chat Bots - Take Two, All the News In The World - Turning Text Into Action, The Proliferation of Data Roles, and What Are You Doing to My Data?

(1) Deep Learning Beyond Cats

In 2012, Google found cats on the internet using deep neural networks. With a strange sense of nostalgia, the post reminds us how far we have come in only four years, with more nuanced reporting as well as technical progress. The 2012 paper predicted the findings could be useful in the development of speech and image recognition software, including translation services. In 2016, Google’s WaveNet can generate human speech, General Adversarial Networks (GANs), Plug & Play Generative Networks, and PixelCNN can generate images of (almost) naturalistic scenes including animals and objects, and machine translation has improved significantly. Welcome to the future!

In 2016, we saw neural networks combined with reinforcement learning (i.e., deep reinforcement learning) beat the reigning champion Lee Sedol in Go (the battle continues online) and solve a real problem; deep reinforcement learning significantly reduces Google’s energy consumption. The combination of neural networks with probabilistic programming (i.e., Bayesian Deep Learning) and symbolic reasoning proved (almost) equally powerful. We saw significant advances in neural network architecture, for example, the addition of long-term memory (Neural Turing Machines) which adds a capacity resembling “common sense” to neural networks and may help us build (more) sophisticated dialogue agents.

In 2017, enabled by open-sourced software like Google’s TensorFlow released in late 2015, Theano, and Keras, neural networks will find (more) applications in industry (e.g., recommender systems), but widespread adoption won’t come easily. Algorithms are good at playing games like Go because games easily allow to generate the amount of data needed to train these advanced, data-hungry algorithms. The availability of data, or lack thereof, is a real bottleneck. Efforts to use pre-trained models for novel tasks using transfer learning (i.e., using what you have learned on one task to solve another, novel task) will mature and unlock a bigger class of use cases.

Parallel work on deep neural network architecture will enhance said architecture, deepen our understanding, and hopefully help us develop principled approaches for choosing the right architecture (there are many) for tasks beyond “CNNs are good for translation invariance and RNNs for sequences”.

In 2017, neural networks will go beyond game playing and deliver on their promise to industry.

(2) Chat Bots - Take Two

2016 had been declared by many the year of the bots, and it wasn’t. The narrative was loud but the results, more often than not, disappointing. Why?

Amongst the many reasons; lack of avenues for distribution, lack of enabling technologies, and the tendency to treat bots as a purely technical not product or design challenge. Through hard work and often failure, the best driver of future success, the bot community learned some valuable lessons in 2016. Bots can be brand ambassadors (e.g., Casper’s Insomnobot-3000) or marketing tools (e.g., Call of Duty’s Lt Reyes). Bots are good for tasks with clear objectives (e.g., scheduling a meeting) while exploration, especially if the content can be visualized, is better left to apps (you can, of course, squeeze it into a chatbot solution). Facebook’s messenger platform added an avenue for distribution; Google (Home, Allo) may follow while Apple (Siri) will probably stay closed. Facebook’s Wit.ai adds technology to enable developers to build bots, at re:Invent 2016, Amazon unveiled Lex.

After excitement and inflated expectations in 2016, we will see useful, goal-oriented, narrow-domain chatbots with use case appropriate personality supported by human agents when the bot’s intent recognition fails or when it wrangles a conversation. We will see more sophisticated intent recognition, graceful error handling, and more variety in the largely human-written template responses while ongoing research into end-to-end dialogue systems promises more sophisticated chatbots in the years to come. After the hype, a small, committed core remains and they will deliver useful chatbots in 2017.

Who wins our “The Weirdest Bot Of 2016” award? The Invisible Boyfriend.

(3) All the News In the World - Turning Text into Action

At the beginning there was the number; algorithms work on numerical data. Traditionally, natural language was difficult to turn into numbers that capture the meaning of words. Conventional bag-of-word approaches, useful in practice, fail to use syntactic information and fail to understand that “great” and “awesome” or Cassius Clay and Muhammad Ali are related concepts.

In 2013, Tomas Mikolov proposed a fast and efficient way to train word embeddings. A word embedding is a numerical representation of a word, say “great”, that is learned by studying the context in which the “great” tends to appear. The word embedding captures the meaning of “great” in the sense that “great” and “awesome” will be close to one another in the multi-dimensional word embedding space, the algorithm learned they are related. Alternatives like GloVe, word2vec for documents (i.e., doc2vec), and underlying methods like skip-gram and skip-thought further improved our ability to turn text into numbers and opened up natural language to machine learning and artificial intelligence.

In 2016, fastText allowed us to deal with out-of-vocab words (words the language model was not originally trained on) and SyntaxNet enhances our ability to not only encode the meaning of words but to parse the syntactic structure of sentences. Powerful, open-source natural language processing tool kits like spaCy allow data scientists and machine learning engineers without deep expertise in natural language processing to get started. FastText? Just pip install! Fuelled by this progress in the field, we saw a quiet but strong trend in industry towards utilizing these new powerful natural language processing tools to build large-scale industry applications that turn 6.6M news articles into a numerical indicator for banking distress or use 3M news articles to assess systemic risk in the European banking system. Algorithms will help us not only to make sense of the information in the world, they will help write content, too, and of course they will help bring our chit chatty chat bots to live.

In 2017, we will expect more data products built on top of vast amounts of news data especially data products that condense information into small, meaningful, actionable insights. Our world has become overwhelming, there is too much content. Algorithms can help! Somewhat ironically, we will also be using machines to create more content. A battle of machines.

In a world shaken by “fake news”, of course, one may regard these innovations with suspicion. As new technology enters the mainstream there is always hesitancy, but the critics are right. How do we know the compression is not biased? How do we train people to evaluate the trustworthiness of the information they consume especially when it has been condensed and computer generated? How do we fix the incentive problem of the news industry, distribution platforms like Facebook do not incentivise for deep, thoughtful writing; they monetize a few seconds of attention and are likely to feed existing biases, not all challenges are technical but should concern technically minded people.

The best AI writer of the year goes to? Benjamin, a recurrent neural network that wrote the movie script Sunspring.

(4) The Proliferation Of Data Roles

Remember when data scientist was branded the sexiest job of the 21st century? How about machine learning engineer? AI, deep learning, or NLP specialist? As a discipline, data science is maturing. Organizations have increasingly recognized the value of data science to their business, entire companies are based on AI products leveraging the power of deep learning for image recognition (e.g., Clarifai) or offering natural language generation solutions (e.g., Narrative Science). With success comes a greater recognition, appreciation of differences, and specialization. What’s more, the sheer complexity of new, emerging algorithms requires deep expertise. 2017 will see a proliferation of data roles.

The opportunity to specialize allows data people to focus on what they are good at and enjoy, great. But there will be growing pains. It takes time to understand the meaning of new job titles; companies will be advertising data science roles when they want machine learning engineers and vice versa. Hype combined with a fierce battle over talent will lead to an overabundance of “trendy” roles blurring useful differences. As a community, we will have to clarify what the new roles mean (and we’ll have to hold ourselves accountable when hiring hits a rough patch).

We will have to figure out processes for data scientists, machine learning engineers, and deep learning/AI/NLP specialists to work together productively which will affect adjacent roles. Andrew Ng, Chief Scientist at Baidu, argues for the new role of AI Product Manager who sets expectations by providing data folks with the test set (i.e., the data an already trained algorithm should perform well on). We may need transitional roles like the Chief AI Officer to guide companies in recognizing and leveraging the power of emerging algorithms.

2017 will be an exciting year for teams to experiment, but there will be battle scars.

(5) What Are You Doing To My Data?

By developing models to guide law enforcement, models to predict recidivism, models to predict job performance based on Facebook profiles, data scientists are playing high stakes games with other people’s lives. Models make mistakes; a perfectly qualified and capable candidates may not get her dream job. Data is biased; word embeddings (mentioned above) encode the meaning of words through the context in which they are used, allow simple analogies, and, trained on Google News articles, reveal gender stereotypes—”man is to programmer as woman is to homemaker”. Faulty reward functions can cause agents to go haywire. Models are powerful tools. Caution.

In 2016, Cathy O’Neill published Weapons of Math Destruction on the potential harm of algorithms which got significant attention (e.g., Scientific American, NPR). FATML, a conference on Fairness, Accountability, and Transparency in Machine Learning had record attendance. The EU issued new regulation including “the right to be forgotten”, giving individuals control over their data, and restricts the use of automated, individual decision-making especially if decisions the algorithm makes cannot be explained to the individual. Problematically, automated, individual decision-making is what neural networks do and their inner workings are hard to explain.

2017 will see companies grappling with the consequences of this “right to an explanation” which Oxford researchers have started to explore. In 2017, we may come to a refined understanding of what we mean when we say: “a model is interpretable”. Human decisions are interpretable in some sense, we can provide explanations for our decisions, but not others, we do not (yet) understand the brain dynamics underlying (complex) decisions. We will make progress on algorithms that help us understand model behavior and exercise the much needed caution when we build predictive models areas like healthcare, education, and law enforcement.

In 2017, let’s commit to responsible data science and machine learning.

– Friederike

Many thanks to Jeremy Karnowski for helpful comments.

#deep learning #language processing #language generation

5 notes · View notes

fastforwardlabs · 8 years ago

Text

Learning to Use React

React and Redux helped us keep application state manageable in our probabilistic programming prototypes.

For every topic we research at Fast Forward Labs, we create prototypes to show how the technology can be applied to make great products. Finite, stand-alone projects, our prototype web applications are great opportunities to experiment with new front-end tech. In our latest report on probabilistic programming, I used the React javascript library to create the interface with Redux for managing the data state of UI components. This setup was extremely helpful in prototyping: by keeping the application state in one place with Redux, it was much easier to switch components in and out as the prototype direction changed. The setup (which also involved Webpack, Babel and learning ES6 syntax features) did feel overwhelming at times, though new tools and tutorials made the process much smoother than my experience using React on the previous Text Summarization prototype. Once everything was rolling, it was the most enjoyable front-end coding experience I’ve ever had.

When Aditya, our Data Visualization and Prototyping Intern, started with us this fall, I asked him to get familiar with React and Redux (and Webpack, and ES6 features…) in preparation for future products. Now that he’s experimented with it, we decided to document what has been useful and not so useful in the process.

– Grant

The Process

Grant: First I sent Aditya some helpful resources:

How to Build a Todo App Using React, Redux, and Immutable.js

Getting Started with Redux

React Starter Kit

Aditya: I was familiar with Javascript but had never used a Front-end framework, so first read the React documentation, and proceeded fairly quickly to the sitepoint tutorial. The sitepoint tutorial (about Redux) really tripped me up because it used a lot of unfamiliar syntax. A web search for alternate intros to Redux just ended up confusing me more.

Fortunately I found the egghead videos by Dan Abramov, which make no assumptions about the learner’s prior knowledge, and explain things like ES6 syntax that might throw off new learners.

Grant: While I later realized it was too much to drop on Aditya all at once, I suggested the sitepoint tutorial as a starting point because I like how it shows how Redux works across an entire React app. Abramov’s Egghead videos are great, but I got impatient with them because I wanted to incorporate Redux into the React app straight away. The sitepoint article helped me figure out how to structure Redux to adapt it into the prototype I was working on – but that was after I had gone through several other tutorials and banged my head against the code in various ways.

Aditya: Eventually, I realized that my confusion had little to do with the tutorials themselves. I really needed to go back and iron out my understanding of React before jumping into Redux. I realized this as I came across Abramov’s You might not need Redux, where he stresses the importance of learning to think in React before attempting to learn Redux. I took a step back to introspect and indeed found that my React knowledge hadn’t really seeped in yet.

When the creator of Redux encourages you to think carefully about whether you need to use it, it’s best to listen to him.

Grant: From the beginning I debated whether to introduce React alone or together with Redux. In retrospect, I should have started with React alone. I was anxious to get Redux in early because adopting it had made my own React development process simpler and more enjoyable. Especially when dealing with APIs (thanks in large part to Soham Kamani’s A simplified approach to calling APIs with Redux), Redux helped me maintain a readily understandable model of what was happening where – versus the spaghetti-ish ComponentDidMount situation I found myself in with React. (Note: You can have a perfectly reasonable set-up without Redux if you structure your React app well, which I’m not convinced I did. The point is, the opinionated structure of Redux was super helpful for developing and maintaining a manageable set-up.)

That’s my rationalization for introducing Redux at the start. But as Aditya points out, it was too many new moving parts at once! Much better to get a handle on React first -- and even work with the messiness Redux helps reduce -- than have the whole system dropped on you all at once. I knew this in principle, but this process was a good reminder.

Aditya: Thinking in React was a great resource to deepen my understanding. It teaches concepts by showing one the processes used to design a React app: I like this approach because it can be overwhelming to read scores of definitions with no idea how they ultimately interplay to form the grand picture. ‘Thinking in React’ helped me break down the development process into discrete steps that made the fundamental concepts easier to understand.

Take, for example, the concepts of props and states in React. Both were well defined in the standard documentation, but when I started making a simple app on my own, I was a bit confused. What part of my app would be a state? What would be a prop? Should I write the parent component first or begin with the leaves? We might say I knew the words but not the grammar. ‘Thinking in React’ introduced a framework to help think about their code architecture.

I’m very impressed by the React documentation, and it’s still the primary source for most of my React-related questions. As it should be. There are a lot of third-party resources (most behind a paywall) that are so fragmented that learning these technologies can be jarring. Third-party resources should exist, and are even required in the ecosystem, but those who make technology are also responsible for explaining it to users. Kudos to the React community for not only making a great technology but also by giving a damn about good documentation. Providing a great number of examples (whether in codepens or code blocks), a way to quick start ‘hello-world’ programs (such as Create React App), and a way to think about process is key to a good developer experience. It’s no coincidence that libraries, like D3.js and React, that provide that kind of an experience are so successful and widely adopted.

Conclusions

Grant: There’s a lot of worry and excitement about the current state of Javascript development. I’m mainly excited. There is a pretty big learning curve to this stuff, but once you get going tools like React and Redux can make front-end development a lot more fun. They enabled me to much more quickly develop and experiment with different interface components in our probabilistic programming prototype. Sometimes I was surprised (in a good way!) by how different components worked in combination, and I took inspiration from those interactions to build out a feature or interaction I hadn’t planned on.

That said, there is definitely work to be done to help people with that learning curve. We touched on some starter materials in this post. One thing we didn’t discuss is the React Starter Pack, which include the Create React App command-line tool. One of the biggest headaches I had in my earlier attempts to get started with React was getting the development environment setup. The pipeline created within Create React App made it much easier to get up and running smoothly.

Even with all the great work being done, it can still be quite overwhelming to jump into. It helps me to read about others’ process and experiences (especially their mistakes and lessons learned). Hopefully this post is helpful to others.

Aditya: I think it's really important for people building these technologies to start thinking about documentation at a fundamental level. We are facing an avalanche of new technologies everyday and developers need to be mindful that not everyone comes to the table with the same stack or the same level of experience. What a Senior Software Engineer expects from documentation is probably going to differ from that of a college sophomore.

At Fast Forward Labs, we work on cutting edge machine learning technologies, but remain mindful that this radically new ecosystem, where knowledge is concentrated in the hands of a few, can lead to socio-economic inequalities. Lowering barriers to entry is critically important as the technology ecosystem evolves at an increasingly rapid clip.

2 notes · View notes

fastforwardlabs · 9 years ago

Text

Hilary Mason at Data Driven NYC

Hilary Mason, Fast Forward Labs Founder & CEO, gave a talk at November’s Data Driven NYC Meetup. Check it out to hear our thoughts on:

How innovation works in academia, startups, and large enterprise

Why it often makes sense to build, not buy, AI products

How to predict that a new AI technology will be impactful

Technologies we’re excited about, including demos of our prototypes for natural language generation, deep learning for image analysis, automated text summarization, and, coming soon, probabilistic programming!

We’re running a holiday promotion on the research reports and prototypes Hilary introduces in her talk. Our research is a great resource to educate your organization on what’s truly possible in contemporary machine learning. We cover a range of technologies to help our clients make informed choices on which algorithms will work best for their data and problems. Write to us at [email protected] to learn more!

#ai #data science #data driven #innovation

1 note · View note

fastforwardlabs · 9 years ago

Text

Machines in Conversation

We at Fast Forward Labs have long been interested in speech recognition technologies. This year’s chatbot craze has seen growing interest in machines that interface with users in friendly, accessible language. Bots, however, only rarely understand the complexity of colloquial conversation: many practical customer service bots are trained on a very constrained set of queries (”I lost my password”). That’s why we’re excited to highlight Gridspace, a San Francisco-based startup that provides products, services, and an easy-to-use API focused on making human-to-human conversation analyzable by machines. Gridspace Co-Founders Evan Macmillan and Anthony Scodary share their thoughts and demo their API below. Catch them this week at the IEEE Workshop on Spoken Language Technology (SLT) in San Diego!

When most people think about speech systems, they think about virtual assistants like Siri and Alexa, which parse human speech intended for a machine. Extracting useful information from human-to-machine speech is a challenge. Virtual assistants must decode speech audio with a high degree of accuracy and map a complex tree of possible natural language queries, called an ontology, to distinguish one-word differences in similar yet distinct requests.

But compared to processing human-to-human speech, current virtual assistants have it easy! Virtual assistants work with a restricted set of human-to-machine speech requests (i.e. “Call Mom!” or “Will it rain tomorrow?”) and process only a few seconds of speech audio at a time. Virtual assistants also get the benefit of relatively slow and clear speech audio to process. After all, the user knows she is only speaking with a machine.

Today, most of our spoken conversations don’t involve machines, but they soon could. A big hurdle to making machines capable of processing more types of conversation, specifically natural conversations we have with other people, is called natural language understanding (NLU).The jump between NLU for human-to-machine conversations to NLU for long human-to-human conversations is non-trivial, but it’s a challenge we chose to tackle at Gridspace.

Gridespace Sift API

The Gridspace Sift API provides capabilities specifically tailored to long-form, human-to-human speech. The transcription and speech signal analysis pipeline has been trained on tens of thousands of hours of noisy, distorted, and distant speech signals of organic, colloquial human speech. The downstream natural language processing capabilities, which perform tasks like phrase matching, topic modelling, entity extraction, and classification, were all designed to accept noisy transcripts and signals.

The following examples can all be run in the Sift hosted script browser environment, in which the full API (and handling of asynchronous events like speech or telephone calls) is accessed through short snippets of javascript.

For example, let’s say we want to call restaurants to get the wait times. You could prompt restaurants to enter the wait time into a keypad, but you’ll likely have a low success rate. However, by using natural human recordings and allowing the restaurants to respond with natural language, a response rate of over 30% is achievable. This response rate could be even higher if you exclude restaurants that operate an IVR (interactive voice response) system. In the JavaScript sketch below (the full example can be viewed and run here), we call a list of restaurants, greet whoever answers the phone, and then ask for a wait time:

gs.onStart = function() { var waitTimes = {} for (var restaurant in NUMBERS) { var number = NUMBERS[restaurant]; console.log(restaurant); console.log(number); var conn = gs.createPhoneCall(number); var trans = ""; for (var j = 0; j < MAX_TRANS; j++) { console.log("Try " + j); if (j == 0) { console.log("Saying hello..."); newTrans = conn.getFreeResponse({"promptUrl": "http://apicdn.gridspace.com/examples/assets/hi_there.wav"}); } else { console.log("Asking for the wait time..."); newTrans = conn.getFreeResponse({"promptUrl": "http://apicdn.gridspace.com/examples/assets/wait_time.wav"}); } if (newTrans) { trans += newTrans + " "; if (j > 1 || trans.indexOf('minute') != -1 || trans.indexOf('wait') != -1) { break; } } } console.log("Saying thank you..."); conn.play("http://apicdn.gridspace.com/examples/assets/thanks.wav"); waitTimes[restaurant] = trans; conn.hangUp(); } console.log(waitTimes); }

As soon as we hear the word ‘minute’, we thank them and hang up. The results of each restaurant are simply printed to the console.

In our experiments, about one in three restaurants provide a response, but this basic example can be easily improved upon (and we encourage you to try!).

One glaring problem is the crudeness of the parser (we simply look for the word ‘minute’ and call it a day). In the next example (the full sketch is here), we listen in on a simulated conference call, wherein status updates for different employees are extracted.

In this example, instead of simply looking for exact words, we scan for approximate matches for status reports and names. This fuzzy natural language extraction allows for soft rephrasing and extraction of general concepts like numbers, names, dates, and times. Even if the conversation lasts for hours, each time a status update is detected, a sound is played, and the employee status is parsed. This entire behavior is implemented in just a couple lines of JavaScript.

In the Sift API hosted script environment, you’ll find a wide array of other examples including automated political polling, deep support call analysis, an FAA weather scraper, and interactive voice agents. Each example is only a couple dozen lines long and demonstrates a broad spectrum of speech analysis capabilities.

While there is still much work to done in the area of conversational speech processing, we are excited about what is already possible. Human-to-human speech systems can now listen, learn and react to increasingly complex patterns in long-form conversational speech. For businesses and developers, these advancements in speech processing mean more structured data and the opportunity to build new kinds of voice applications.

- Evan Macmillan and Anthony Scodary, Co-Founders, Gridspace

0 notes

fastforwardlabs · 9 years ago

Text

Dimensionality Reduction and Intuition

“I call our world Flatland, not because we call it so, but to make its nature clearer to you, my happy readers, who are privileged to live in Space.”

So reads the first sentence of Edwin Abbott Abbott’s 1884 work of science fiction and social satire, Flatland: A Romance of Many Dimensions. At the time, Abbott used contemporary developments in the fields of geometry and topology (he was a contemporary of Poincaré) to illustrate the rigid social hierarchies in Victorian England. A century later, with machine learning algorithms playing an increasingly prominent role in our daily lives, Abbott’s play on the conceptual leaps required to cross dimensions is relevant again. This time, however, the dimensionality shifts lie not between two human social classes, but between the domains of human reasoning and intuition and machine reasoning and computation.

Much of the recent excitement around artificial intelligence stems from the fact that computers are newly able to process data historically too complex to analyze. At Fast Forward Labs, we’ve been excited by new capabilities to use computers to perceive objects in images, extract the most important sentences from long bodies of text, and translate between languages. But making complex data like images or text tractable for machines involves representing the data in high-dimensional vectors, long strings of numbers that encode the complexity of pixel clusters or relationships between words. The problem is these vectors become so large that it’s hard for humans to make sense of them: plotting them often requires a space of way more than the three dimensions we live in and perceive!

On the other hand, machine learning techniques that entirely remove humans from the loop, like automatic machine learning and unsupervised learning, are still active areas of research. For now, machines perform best when nudged by humans. And that means we need a way to reverse engineer the high-dimensionality vectors machines compute in back down to the two and three dimensional spaces our visual systems have evolved to make sense of.

What follows is a brief survey of some tools available to reduce and visualize high-dimensional data. Send us a note at [email protected] if you know of others!

Google’s Embedding Projector

Yesterday, Google open-sourced the Embedding Projector, a web application for interactive visualization and analysis of high-dimensional data that is part of TensorFlow. The release highlights how the tool helps researchers navigate embeddings, or mathematical vector representations of data, which have proved useful for tasks like natural language processing. A popular example is to use embeddings to do “algebra” on words, using the space between vectors as a proxy for semantic relationships like man:king::woman:queen. Embedding Projector includes a few dimensionality reduction techniques like Principal Component Analysis (PCA) and t-SNE. Here’s an example of using PCA on an image data set (done before Google’s release).

t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an increasingly popular non-linear dimensionality reduction technique useful for exploring local neighborhoods and finding clusters in data. As explained in this post, t-SNE algorithms adapt transformations to the structure of the input data they work on, and have a tuneable parameter called “perplexity” that “says (loosely) how to balance attention between local and global aspects of your data.” While the algorithms are powerful, their output representations must be read with care, as the perplexity parameter can create confusion.

Visualization of how distance between clusters vary widely under different parameters on a t-SNE algorithm.

Mike Tyka, a machine learning artist, has used t-SNE to cluster images per similarity in Deep Dream’s neural network architecture. The resulting “map” reveals some interesting conclusions, showing, for example, that Deep Dream clusters violins near trombones. As the shapes of these two instruments differ to our eyes, their proximity in the neural network space may mean that Deep Dream uses the context of “people playing instruments” as a discriminatory feature for classification.

Topological Data Analysis

Palo Alto-based Ayasdi uses theory from topology, the study of geometrical properties that stay constant even when shapes are transformed, to help humans find patterns in large data sets. As CEO Gurjeet Singh explains in this O’Reilly interview, the two key benefits of using topology for machine learning are:

The ability to combine results from different machine learning algorithms, while still maintaining guarantees about the underlying shapes or distributions

The ability to discover the underlying shape of data so you don’t assume it and, thereby, impact the parameters for an optimization problem

Ayasdi’s product visualizes relationships in data as graphs, enabling users to visually perceive relationships that would be hard to uncover in the language of formal equations. We love the parallel insight that we, as humans, excel at what topologists call “deformation invariance,” the property that the letter A is still the letter A in different fonts.

Machines using an autoencoder to reconstruct digits with moderate deformation invariance, as we explained in this blog post.

Data Visualization for the 3-D Web

Finally, Datavized is working on a data analytics tool fit for the 3-D web. While they’ve yet to work on dimensionality reduction, they have embarked on projects to give consumers of data a more empathic, first-person interpretation of statistics and conclusions. We look forward to the release of their product in 2017!

Conclusion

Our ability to represent rich, complex data, like images and text, in numbers required for mathematical functions on computers requires a Mephistophelean deal with the devil. These high-dimensional vectors are impossible to understand and interpret. But there’s been great progress in dimensionality reduction and visualization tools that enable us, in our Flatland, to make sense of the strange, cold world of machine intelligence.

- Kathryn

#data science #dimensionality #data visualization #t-sne

3 notes · View notes

fastforwardlabs · 9 years ago

Text

Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters

Probabilistic data structures store data compactly with low memory and provide approximate answers to queries about stored data. They are designed to answer queries in a space-efficient manner, which can mean sacrificing accuracy. However, they typically provide guarantees and bounds on error rates depending on specifications of the data structure in question. Because they provide low memory footprints, probabilisitic data structures are particularly useful in streaming and low power settings. As such, they are extremely useful in big data situations like counting views on a video or maintaining a list of unique tweets in the past. A single HyperLogLog++ structure, for example, can count up to 7.9 billion unique items using 2.56KB of memory with only a 1.65% error rate.

The Fast Forward Labs team explored probabilistic data structures in our "Probabilistic Methods for Real-time Streams" report and prototype (contact us if you're interested in this topic). This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).

Application

While they seem esoteric, probabilistic data structures are very useful. Consider large scale internet applications like Twitter that struggle to keep new users engaged. To tackle this, Twitter's growth & engagement team develop marketing campaigns to encourage new and unengaged users to use Twitter more often. To aid this work, every new user can be added to a Cuckoo filter. When he/she becomes active, he/she can be removed, and the engagement team can target growth campaigns to individuals currently in the Cuckoo filter. The Cuckoo filter can add and remove users down the line depending on their activity level. Cuckoo filters are easy to implement, so are a good choice for this use case. With hundreds of millions of users, it helps to have a low memory footprint and low false positive rates.

What's in a name: "Cuckoo"

Like Bloom filters, the Cuckoo filter is a probabilistic data structure for testing set membership. The 'Cuckoo' in the name comes from the filter's use of the Cuckoo hashtable as its underlying storage structure. The Cuckoo hashtable is named after the cuckoo bird becauses it leverages the brood parasitic behavior of the bird in its design. Cuckoo birds are known to lay eggs in the nests of other birds, and once an egg hatches, the young bird typically ejects the host's eggs from the nest. A Cuckoo hash table employs similar behavior in dealing with items to be inserted into occupied 'buckets' in a Cuckoo hash table. We explain this behavior in the section on Cuckoo filter. Now, we'll provide a brief overview of a Bloom filter before exploring Cuckoo filters.

Bloom filter overview

Bloom filters are a popular probabilistic data structure that allow space-efficient testing of set membership. When monitoring a real-time stream of tweets, for example, a Bloom filter allows us to test whether a tweet is new or has been seen before. Bloom filters use hash functions to compactly encode items as integers; these serve as indices of a bit array that is then set. To test if an item has been seen before, a Bloom filter hashes the item to produce its set of indices, and each index is checked to see if it has been set. Since it's possible to hash multiple items to the same indices, a membership test returns either false or maybe. That means, Bloom filters give no false negatives but a controllable rate of false positives. If a Bloom filter indicates that an item has not been seen before, we can be certain that's the case; but if it indicates an item has been seen, it's possible that's not the case (a false positive).

Traditional Bloom filters do not support deletions because hashing is lossy and irreversible. That means, deletions require the entire filter to be rebuilt. But what if we want to delete items seen in the past, like certain tweets in the Twitter example above? The counting Bloom filter was introduced to solve this problem. To support deletions, counting Bloom filters extend buckets in traditional Bloom filters from single bit values to n-bit counters. Here, insertions increment rather than set Bloom filter indices.

Cuckoo filter

The Cuckoo filter is an alternative to the Bloom filter when one requires support for deletions. They were introduced in 2014 by Fan et. al. Like the counting Bloom filter, Cuckoo filters provide insert, delete, and lookup capabilities. However, Cuckoo filters use different underlying data structures and different insertion procedures than Bloom filters.

The Cuckoo filter consists of a Cuckoo hash table that stores the 'fingerprints' of items inserted. The fingerprint of an item is a bit string derived from the hash of that item. A cuckoo hash table consists of an array of buckets where an item to be inserted is mapped to two possible buckets based on two hash functions. Each bucket can be configured to store a variable number of fingerprints. Typically, a Cuckoo filter is identified by its fingerprint and bucket size. For example, a (2,4) Cuckoo filter stores 2 bit length fingerprints and each bucket in the Cuckoo hash table can store up to 4 fingerprints. Following the above paper, we implemented the cuckoo filter in python. Below, we initialize an example cuckoo filter and test simple inserts and deletions. We also implement a counting Bloom filter to compare performance.

from cuckoofilter import CuckooFilter c_filter = CuckooFilter(10000, 2) #specify capacity and fingerprint size

c_filter.insert("James") print("James in c_filter == {}".format("James" in c_filter)) # James in c_filter == True c_filter.remove("James") print("James in c_filter == {}".format("James" in c_filter)) # James in c_filter == False

from cuckoofilter import CountingBloomFilter b_filter = CountingBloomFilter(10000) #specify the capacity of a counting bloom filter b_filter.add("James") print("James in b_filter == {}".format("James" in b_filter)) # James in b_filter == True b_filter.remove("James") print("James in b_filter == {}".format("James" in b_filter)) # James in b_filter == False

Inserting into a Cuckoo filter

The Cuckoo filter supports three key operations: insert, delete, and lookup. The figure above, from the Fan et. al. paper, shows how insertion into the Cuckoo filter works. Of all the Cuckoo filter operations, the insert operation is most involved. To insert an item into the Cuckoo filter, one derives two indices from the item based on hashing the item and its fingerprint. On obtaining these indices, one then inserts the item's fingerprint into one of the two possible buckets that correspond to the derived indices. In our implementation, we default to the first index.

As the Cuckoo hash table begins to fill up, one can encounter a situation where the two possible indices where an item can be inserted has been filled. In this case, items currently in the Cuckoo hash table are swapped to their alternative indices to free up space for inserting the new item. By implementing insertion in this manner, one can easily delete an item from the table by looking up its fingerprint in one of two possible indices, and deleting this fingerprint if present. To make the insertion procedure more concrete, we provide code below implementing the insertion procedure.

#example function to demonstrate how to insert into a cuckoo filter. import mmh3 def obtain_indices_from_item(item_to_insert, fingerprint_size, capacity): #hash the string item hash_value = mmh3.hash_bytes(item_to_insert) #subset the hash to a fingerprint size fingerprint = hash_value[:fingerprint_size] #derive the index index_1 = int.from_bytes(hash_value, byteorder="big") index_1 = index_1 % capacity #derive the index from the fingerprint hashed_fingerprint = mmh3.hash_bytes(fingerprint) finger_print_index = int.from_bytes(hashed_fingerprint, byteorder="big") finger_print_index = finger_print_index % capacity #second index -> first_index xor index derived from hash(fingerprint) index_2 = index_1 ^ finger_print_index index_2 = index_2 % capacity return index_1, index_2, fingerprint def insert_into_table(table, index_1, index_2, bucket_capacity): #now insert item into the table if len(table[index_1])

#let's create a crude cuckoo hashtable capacity = 10 #capacity of our cuckoo hashtable bucket_capacity = 4 table = [[] for _ in range(capacity)]

#obtain possibe indices index_1, index_2, fp = obtain_indices_from_item("James", 2, 10) #now let's insert "James into the table" table, _ = insert_into_table(table, index_1, index_2, bucket_capacity) print("Table after James is inserted.") #check to see that "james" has been inserted print(table) # Table after James is inserted. # [[], [], [], [], [], [], [], [], [], [b'\xc0\n']]

#let's insert "james" again. index_1, index_2, fp = obtain_indices_from_item("James", 2, 10) #now let's insert "James into the table" table, _ = insert_into_table(table, index_1, index_2, bucket_capacity) print("Table after James is inserted a second time.") #now let's check to see that "James" has been inserted again print(table) print("\n") # Table after James is inserted a second time. # [[], [], [], [], [], [], [], [], [], [b'\xc0\n', b'\xc0\n']] #let's insert a different item now index_1, index_2, fp = obtain_indices_from_item("Henry", 2, 10) table, _ = insert_into_table(table, index_1, index_2, bucket_capacity) print("Table after Henry is inserted.") #now let's check to see that "Henry" has been inserted into the table. print(table) # Table after Henry is inserted. # [[], [], [b'\x1c\xb2'], [], [], [], [], [], [], [b'\xc0\n', b'\xc0\n']]

Bench marking against counting Bloom filter

False positive rate comparison

Let's compare the Cuckoo filter to the counting Bloom filter. A critical metric for probabilistic data structures like the Bloom and Cuckoo filters is the false positive rate. As shown in the insertion section, comparing the Cuckoo filter and the Bloom filter can be tricky given the difference in their internal workings. To tackle the issue of false positive rates, we fix the space allocation for both filters and then vary the capacities in order to observe the change in false positive rate. Below we show a graph of the false positive rate vs the capacity for both structures.

As seen in the graph, a key advantage of the Cuckoo filter is that with fixed space, the Cuckoo filter provides much lower false positive rates at smaller capacities. As noted in the original paper, for applications that desire lower than 3 percent false positive rate (blue dashed line), the Cuckoo filter is particularly ideal. Of note is that the Cuckoo filter here is a straightforward implementation without any space optimizations. This further indicates that a Cuckoo filter provides better performance without any tuning compared to optimized Bloom filters. See Notebook for other performance benchmarks comparing the counting Bloom filter to the Cuckoo filter.

Insertion throughput comparison

Another important metric to consider is the insertion throughput. Insertion throughput is essentially how long it takes to insert an item in an existing filter. From the design of the Counting Bloom filter, time to insert into the filter does not change as the filter fills up. However, with the Cuckoo filter, time to insert into the filter increases as the filter fills up to capacity. If an item is to be inserted into a Cuckoo table, and both of its possible indices are fully occupied, then the current items are swapped to their alternative indices to free up space for the item being inserted. As the Cuckoo table fills up, more swapping would typically occur as there are more items to relocate.

The figure below shows the insertion time for a counting Bloom filter and a Cuckoo filter of the same capacities as both fill up (See Notebook for details). With the Cuckoo filter, we notice an insertion throughput increase of up to 85 percent as it fills up to 80 percent capacity, while the insertion throughput for the counting Bloom filter remains relatively stable over this range. In the figure, we further notice that the Cuckoo filter is about 3 times faster than the counting Bloom filter over the entire range despite the significant increase in insertion throughput for the Cuckoo filter. While such differences are significant here, counting Bloom filters can be optimized to provide similar insertion speeds to Cuckoo filters. Here we seek to emphasize the significant change in insertion throughput that occurs as a Cuckoo filter fills up.

Conclusion

Bloom filters and its variants have proven useful in streaming applications and others where membership testing is critical. In this post, we have shown how a Cuckoo filter, which can be implemented simply, provides better practical performance, under certain circumstances, out of the box and without tuning than counting Bloom filters. Ultimately, Cuckoo filters can serve as alternatives in scenarios where a counting Bloom filter would normally be used.

– Julius

#code #realtime data

7 notes · View notes

fastforwardlabs · 9 years ago

Text

Job Opportunities at Signal

We’re excited to partner with Prehype and an international news media organization to develop Signal, a new way to understand and structure large volumes of streaming news and media content. Signal will combine novel data with emerging algorithms. We're looking for a couple of folks to join the team. Read more below, and if this is interesting to you, get in touch!

About Signal

At Signal we are building natural language technology to understand and structure large, streaming news and media content to help surface critical, need-to-know information. We take it as a given that consumers are overwhelmed by the amount of news and information being produced every day. As the volume of information has grown, making informed decisions has become increasingly difficult. Our mission is to gather the signal from the noise. We are partnered with a major international news media organization and a team of experts in the machine learning and data science fields at Fast Forward Labs to develop an applied machine learning solution to understand, structure, and personalize the torrent of online news. This means that we will have the opportunity to reach millions of users upon launch.

Job Description: Lead Data Engineer

The Role

We are a small team. Everyone will be involved in every part of the product development process from ideation to design, prototyping, planning, and execution. As the lead engineer, you will lead the design and build of a fast, scalable, and durable system to ingest data from multiple, static and streaming data sources—an opportunity to design and build a system from scratch. You will work alongside the lead machine learning engineer to enable easy access to data and to help productionalize machine learning solutions. You have experience with distributed data storage and message-passing systems, you are excited about microservices architecture. While this is a lead role, we are open to candidates with no prior experience leading the development of a machine learning product. We are looking for people who are eager to learn, driven to build amazing products, and enjoy being a part of a small, versatile, innovative team.

The initial phase of the project is a six month engagement. We are open to full time and contract work.

What we look for

You really know python. If you know Scala and/or Go, awesome.

You have experience with distributed data storage solutions (e.g., Hadoop, Spark).

You have experience with real-time messaging systems (e.g., Kafka, NSQ, RabbitMQ).

You have sufficient experience with cloud computing services to set up and provide computing resources to the team (e.g., AWS).

You enjoy the process of iterating on a product to try to get it right. You think creatively about solutions.

You enjoy being part of a small team.

Apply to this position by e-mailing your resume and cover letter to [email protected]. We look forward to working with you!

Job Description: Lead Machine Learning Engineer/Data Scientist

The Role

We are a small team. Everyone will be involved in every part of the product development process from ideation to design, prototyping, planning, and execution. As the lead machine learning engineer/data scientist, you will drive the development of a new data product—machine learning algorithms to cluster news and media content into digestible pieces of information. You will dig into techniques for text summarization and natural language generation (NLG). You will design algorithms for smart alerting to surface content that rises above the continuing chatter of news media streams. You have prior experience with clustering, natural language processing techniques (NLP), and anomaly detection. You enjoy thinking about creative applications of out-of-the-box techniques, and you love developing your own, custom solutions. Ideally, you have experience working in a production environment: you will work alongside the lead data engineer to ship data science solutions and machine learning features to production, you know or are excited to learn how to build algorithms that scale. While this is a lead role, we are open to applicants with no prior experience leading a data science or machine learning team. We are looking for people who are eager to learn, driven to build amazing products, and enjoy being a part of a small, versatile, innovative team.

The initial phase of the project is a six month engagement. We are open to full time and contract work.

What we look for

You have prior python programming experience, you write clean object-oriented code, and you know how to hack together an API.

You know and have worked with popular, off-the-shelf machine learning (e.g., scikit-learn) and NLP libraries (e.g., nltk) on both structured and unstructured text, and time series data.

You understand the principles behind clustering. If you know and understand hierarchical (agglomerative) clustering techniques, we would be very excited.

You know popular approaches and algorithms for anomaly detection.

You feel comfortable and you have worked with standard NLP tools. Working with us, you’d be excited to expand upon your knowledge (e.g., text summarization, NLG).

You have or you are excited about developing custom machine learning solutions.

You love the process of iterating on a product to try to get it right. You think creatively about solutions.

You enjoy being part of a small team.

Apply to this position by e-mailing your resume and cover letter to [email protected]. We look forward to working with you!

#announcement #jobs

5 notes · View notes

fastforwardlabs · 9 years ago

Text

Exploring Deep Learning on Satellite Data

This is a guest post featuring a project Patrick Doupe, now a Senior Data Analyst at Icahn School of Medicine at Mount Sinai, completed as a fellow in the Insight Data Science program. In our partnership with Insight, we occassionally advise fellows on month-long projects and how to build a career in data science.

Machines are getting better at identifying objects in images. These technologies are used to do more than organise your photos or chat your family and friends with snappy augmented pictures and movies. Some companies are using them to better understand how the world works. Be it by improving forecasts on Chinese economic growth from satellite images of construction sites or estimating deforestation, algorithms and data can help provide useful information about the current and future states of society.

In early 2016, I developed a prototype of a model to predict population from satellite images. This extends existing classification tasks, which ask whether something exists in an image. In my prototype, I ask how much of something not directly visible is in an image? The regression task is difficult; current advice is to turn any regression problem into a classification task. But I wanted to aim higher. After all, satellite image appear different across populated and non populated areas.

Populated region

Empty region

The prototype was developed in conjuction with Fast Forward Labs, as my project in the Insight Data Science program. I trained convolutional neural networks on LANDSAT satellite imagery to predict Census population estimates. I also learned all of this, from understanding what a convolutional neural network is, to dealing with satellite images to building a website within four weeks at Insight. If I can do this in a few weeks, your data scientists too can take your project from idea to prototype in a short amount of time.

LANDSAT-landstats

Counting people is an important task. We need to know where people are to provide government services like health care and to develop infrastructure like school buildings. There are also constitutional reasons for a Census, which I'll leave to Sam Seaborn.

We typically get this information from a Census or other government surveys like the American Community Survey. These are not perfect measures. For example, the inaccuracies are biased against those who are likely to use government services.

If we could develop a model that could estimate the population well at the community level, we could help government services better target those in need. The model could also help governments that facing resources constraints that prevent the running of a census. Also, if it works for counting humans, then maybe it could work for estimating other socio-economic statistics. Maybe even help provide universal internet access. So much promise!

So much reality

Satellite images are huge. To keep the project manageable I chose two US States that are similar in their environmental and human landscape; one State for model training and another for model testing. Oregon and Washington seemed to fit the bill. Since these states were chosen based on their similarity, I thought I would stretch the model by choosing a very different state as a tougher test. I'm from Victoria, Australia, so I chose this glorious region.

Satellite images are also messy and full of interference. To minimise this issue and focus on the model, I chose the LANDSAT Top Of Atmosphere (TOA) annual composite satellite image for 2010. This image is already stitched together from satellite images with minimal interference. I obtained the satellite images from the Google Earth Engine. I began with low resolution images (1km) and lowered my resolution in each iteration of the model.

For the Census estimates, I wanted the highest spatial resolution, which is the Census block. A typical Census block contains between 600 and 3000 people, or about a city block. To combine these datasets I assigned each pixel its geographic coordinates and merged each pixel to its census population estimates using various Python geospatial tools. This took enough time that I dropped the bigger plans. Best get something complete than a half baked idea.

A very high level overview of training Convolutional Neural Networks

The problem I faced is a classic supervised learning problem: train a model on satellite images to predict census data. Then I could use standard methods, like linear regression or neural networks. For every pixel there is number corresponding to the intensity of various light bandwidths. We then have the number of features equal to the number of bandwidths by the number of pixels. Sure, we could do some more complicated feature engineering but the basic idea could work, right?

Not really. You see, a satellite image is not a collection of independent pixels. Each pixel is connected to other pixels and this connection has meaning. A mountain range is connected across pixels and human built infrastructure is connected across pixels. We want to retain this information. Instead of modelling pixels independently, we need to model pixels in connection with their neighbours.

Convolutional neural networks (hereafter, "convnets") do exactly this. These networks are super powerful at image classification, with many models reporting better accuracy than humans. What we can do is swap the loss function and run a regression.

Diagram of a simple convolutional neural network processing an input image. From Fast Forward Labs report on Deep Learning: Image Analysis

Training the model

Unfortunately convnets can be hard to train. First, there are a lot of parameters to set in a convnet: how many convolutional layers? Max-pooling or average-pooling? How do I initialise my weights? Which activations? It's super easy to get overwhelmed. Micha suggested I use the well known VGGNet as a starting base for a model. For other parameters, I based the network on what seemed to be the current best practices. I learned these by following this winter's convolutional neural network course at Stanford.

Second, they take a lot of time and data to train. This results in training periods of hours to weeks, while we want fast results for a prototype. One option is to use pre-trained models, like those available at the Caffe model zoo. I was writing my model using the Keras python library, which at present doesn't have as large a zoo of models. Instead, I chose to use a smaller model and see if the results pointed in a promising direction.

Results

To validate the model, I used data from on Washington and Victoria, Australia. I show the model's accuracy on the following scatter plot of the model's predictions against reality. The unit of observation is the small image-observation used by the network and I estimate the population density in an image. Since each image size is the same, this is the same as estimating population. Last, the data is quasi log-normalised[6]. Let's start with Washington

Washington State

We see that the model is picking up the signal. Higher actual population densities are associated with higher model predictions. Also noticeable is that the model struggles to estimate regions of zero population density. The R2 of the model is 0.74. That is, the model explains about 74 percent of the spatial variation in population. This is up from 26 percent in the four weeks achieved in Insight.

Victoria

A harder test is a region like Victora with a different natural and built environment. The scatter plot of model performance shows the reduced performance. The model's inability to pick regions of low population is more apparent here. Not only does the model struggle with areas of zero population, it predicts higher population for low population areas. Nevertheless, with an R2 of 0.63, the overall fit is good for a harder test.

An interesting outcome is that the regression estimates are quite similar for both Washington and Victoria: the model consistently underestimates reality. In sample, we still have a model that underestimates population. Given that the images are unlikely to have enough information to identify human settlements at current resolution, it's understandable that the model struggles to estimate population in these regions.

Variable A perfect model Washington Victoria Oregon (in sample) Intercept 0 -0.43 -0.37 -0.04 Slope 1 0.6 0.6 0.86 R2 1 0.74 0.63 0.96

Conclusion

LANDSAT-landstats was an experiment to see if convnets could estimate objects they couldn't 'see.' Given project complexity, the timeframe, and my limited understanding of the algorithms at the outset, the results are promising. We're not at a stage to provide precise estimates of a region's population, but with improved image resolution and advances in our understanding of convnets, we may not be far away.

-Patrick Doupe

#deep learning #code

0 notes

fastforwardlabs · 9 years ago

Text

New TensorFlow Code for Text Summarization

Yesterday, Google released new TensorFlow model code for text summarization, specifically for generating news headlines on the Annotated English Gigaword dataset. We’re excited to see others working on summarization, as we did in our last report: our ability to “digest large amounts of information in a compressed form” will only become more important as unstructured information grows.

The TensorFlow release uses sequence-to-sequence learning to train models that write headlines for news articles. Interestingly, the models output abstractive - not extractive - summaries. Extractive summarization involves weighing words/sentences in a document according to some metric, and then selecting those words/sentences with high scores as proxies for the important content in a document. Abstractive summarization looks more like a human-written summary: inputting a document and outputting the points in one’s own words. It’s a hard problem to solve.

Like the Facebook NAMAS model, the TensorFlow code works well on relatively short input data (100 words for Facebook; the first few sentences of an article for Google), but struggles to achieve strong results on longer, more complicated text. We faced similar challenges when we built Brief (our summarization prototype) and decided to opt for extractive summaries to provide meaningful results on long-form articles like those in the New Yorker or the n+1. We anticipate quick progress on abstractive summarization this year, given progress with recurrent neural nets and this new release.

If you’d like to learn more about summarization, contact us ([email protected]) to discuss our research report & prototype or come hear Mike Williams’ talk at Strata September 28!

#natural language processing #recurrent neural network #deep learning #summarization

3 notes · View notes

fastforwardlabs · 9 years ago

Text

Next Economics: Interview with Jimi Crawford

Building shadows as proxies for construction rates in Shanghai. Photos courtesy of Orbital Insight/Digital Globe.

It’s no small feat to commercialize new technologies that arise from scientific and academic research. The useful is a small subset of the possible, and the features technology users (let alone corporate buyers) care about rarely align with the problems researchers want to solve. But it’s immensely exciting when it works. When the phase transition is complete. When the general public starts to appreciate how a bunch of mathematics can impact their business, their lives, and their understanding of how the world works. It’s why the Fast Forward Labs team wakes up every day. It’s why we love what we do. It drives us. And it’s why we’re always on the lookout for people who are doing it well.

Orbital Insight is an excellent example of a company that is successfully commercializing deep learning technologies. 2015 saw a series of improvements in the performance of object recognition and computer vision systems. The technology is being applied across domains, to improve medical diagnosis, gain brand insights, or update our social media experience.

Building on his experience at The Climate Corporation, Orbital Insight CEO & Founder Jimi Crawford decided to aim big and apply the latest in computer vision to satellite imagery. His team focused their first commercial offering on the financial services industry, honing their tools to count cars in parking lots to infer company performance and, transitively, stock market behavior. But hedge funds are just the beginning. Crawford’s long-term ambition (as that of FeatureX) is to reform macroeconomics, to replace government reports with quantified observations about the physical world. Investors have taken notice.

We interviewed Jimi, discussing what he learned in the past, what he does in the present, and what he envisions for the future. Read on for highlights.

You’ve been in artificial intelligence long enough to see the rise and fall of different theoretical trends. How has the field evolved over the years?

AI was different when I did my doctorate at UT Austin in the late 80s. Machine learning as induction from data wasn’t as important as it is now. We were concerned with getting computers to know what people know when they think or make true statements, which meant using variance of first-order logic as foundation of knowledge. The goal of our research was to program human common sense into a system using logical - or symbolic - techniques. While this branch of AI has since been eclipsed by machine learning and statistical techniques, there are still challenges in intelligent systems (like mimicking common sense) that will likely only be solved by synthesizing symbolic and neural (deep learning) techniques. We can make a loose analogy to the structure of the brain: a small part is the cerebral cortex, which executes logical thought; the rest is a dense, complex network of neurons.

Does that mean that near-term advances in AI will continue to involve human-machine partnerships as opposed to straight-up automation?

I think that will be the case for the foreseeable future. Even in chess, a very controlled, rules-based game, joint human-computer teams beat teams of only computers or only humans. If we add the complexity of real-world data and real-world problems, things only get messier. At Orbital Insight, we consider computers to be a mechanism to focus human attention on the objects and entities in the world that have significance for a given task or purpose (e.g., counting how many cars there are in a store parking lot at a given time of day). The world is big. Without computer vision tools, we’d need 8 million people to review and analyze satellite images at one meter resolution to get the insights we derive using automation. That’s a massive economy of scale.

You’ve had a rich career, having worked at NASA, Google Books, and the Climate Corporation before founding Orbital Insight. Are there parallels between the problems you worked on at Google Books and those you work on at Orbital Insight?

Google Books was deeply inspirational for Orbital Insight. In essence, both projects are about taking a complex input and transforming it into a simple output people care about. At Google Books, the input was images of millions of book pages. The project’s main purpose was to improve Google’s search engines. We’d digitize images, pass them through an OCR pipeline to figure out what the text was, and annotate them with copyright information etc. The goal was to transform all this raw information into the quotes and passages people could search for and cared about. At Orbital Insight, we follow a similar human-computer data processing pipeline, preparing images, analyzing them with convolutional neural nets, and processing them to output the information people care about, like how many cars are in a company parking lot.

There were some interesting takeaways from the Google Books project. One of our 20% projects (i.e., the 20% of work time Google employees are free to devote to creative research projects) was the Ngram Viewer, which displays graphs showing how different words or phrases occur in book corpuses over selected years. Using the tool, we were able to see a shift from saying “the United States are,” at the signing of the Constitution, to “the United States is,” right around the Civil War. Some linguists used the Ngram Viewer to correlate verb conjugation regularity with frequency of use: the tool shows that conjugations of verbs like “to be,” which are used all the time, vary more frequently.

N-gram of mathematics trends from 1800-2000.

Orbital Insight is a data product company, where development involves the right balance between data science and software engineering. How do you manage that balance?

When I was SVP of Science and Engineering at The Climate Corporation, I had about 100 people on my team. A little less than half were data scientists; the rest were software engineers. That experience taught me to think carefully about the gap between prototypes and products. Many data scientists are not trained as computer scientists: they are comfortable writing prototypes in R or Python, but then pass models to computer scientists to rewrite code for production. Leadership teams have to be mindful of what it takes to go from prototype to bulletproof production code, and include that in timelines and collaboration between teams.

What kinds of problems are Orbital Insight data scientists working on?

We have an interesting mix at Orbital Insight. Part of the team specializes in computer vision, using convolutional neural nets to interpret satellite data. They transform pixels to numbers. We are in the business of counting objects in images, which differs from the classification techniques used for object recognition (as the Fast Forward Labs team researched with Pictograph) that dominate the literature. Say the task is counting how many cars are in a parking lot. We classify each pixel, cluster together areas in the image that contain cars, and then count number of pixels. We hit challenges if we change contexts. The algorithms are trained to count cars in retail parking lots, so don’t automatically transfer to, say, the lot of a car manufacturing plant, where makers place cars inches apart to squeeze in as many as possible. This space differential muddles the clusters. So we have to retrain algorithms for different contexts.

The second group of data scientists is focused on analytics and statistics. They transform numbers to English. They take the millions of numbers about parked cars and distil this information into a single sentence that matters for the user. These scientists have different backgrounds and PhDs than the computer vision team, so I do think a lot about helping them collaborate successfully.

What are some other challenges you face working with satellite data?

We’re limited by what we’re able to collect. The satellites we work with orbit over the geographical space where retailers conduct business on a daily basis. That means, we may see the parking lot of a Walmart store in Massachusetts every day at around 10 am, and a different branch in the midwest every day at 2 pm. So we have to compute a time of day curve for every retailer, and do some statistics to get the timing right. We can back up any inferences with six years of data. The other limitation is that we don’t have data about parking lot patterns in the evening. So our technique really doesn’t work for certain sectors, like evening restaurant chains or movie theaters.

You get to see and work with multiple satellite providers. What hardware developments are you most excited about?

The most interesting developments for us are the ability to use new spectral bands and the increased frequency of imagery. Counting cars falls within the bandwidth of human vision, but there are other applications we’re keen to work on that require low-range infrared or ultraviolet. We want to do things like predict the right spot to mine for iron ore, predict crop health based upon soil moisture levels, discern if a building is occupied or unoccupied based on heat levels, or discern whether a power plant is active. A few new vendors are using novel detectors to push outside of human visual spectrum.

The uptick in image frequency, provided by companies like Planet (with whom we just partnered), provides more data to drive more accurate insights. This shift is remarkable, and is enabled by new hardware and rapidly falling costs. What’s interesting here is when Moore’s Law applies and when it doesn’t. The laws of optics don’t follow Moore’s Law, but the ability to mass produce devices does. Development has therefore not been focused on getting higher and higher resolution from space: in most use cases (like counting cars), getting satellites to 1 or 0.5 meter resolution is perfectly fine, as people want to measure and count things we can also see. So the more useful development was to mass produce hardware, to make cheaper commodities that could be reused and relaunched….and may eventually lead to Elon Musk launching a million vehicles into space.

What is Orbital Insight’s long-term vision?

We want to understand the Earth. It’s amazing how poor our current understanding is: people review government reports and stock reports that say, for example, that steel up and crops are down, but it’s all really just guesses upon guesses given the absence of ground truth. And if you probe economists, their analyses are inevitably built on government reports. Our vision is to replace these reports - and this system - with quantified observations. We want to be able to measure and track the physical world economy like we currently measure and track the digital world (clicks, views, likes).This will impact stocks, agencies, and supply chains: major aircraft manufacturers, who worry about titanium supply, will be able to track how titanium mines are functioning. In short, we want to help rebuild economics on top of real-world observations.

Headshot courtesy of Orbital Insight

What recent developments in machine learning are you most excited about?

Deep learning has only just gotten started. It has tremendous power. AlphaGo beating the world Go champion is mind bending: Go is an intuitive, visual game that is far more complex than chess. And we’re just getting started, especially when we apply this algorithmic power to data from the internet of things. We’re testing this model at Orbital Insight. We’re a data company, but a highly differentiated data company that fuses techniques to create reports that are valuable and hard to create. There are a tremendous number of new data streams, and the game is on for entrepreneurs and data scientists to explore the data, push the algorithms, and create something that is truly unique and new.

What advice would you give to young entrepreneurs looking to push the boundaries and build something new?

We just had a party to celebrate a successful B round and what struck me was the number of folks present who helped get the company started. One great thing about being in Silicon Valley is the access to people and resources who truly support you if they see your vision and think you can be something someday. At the beginning, people gave us free office space, made dozens of intros, shared countless pieces of advice. I’d tell young entrepreneurs to build and rely on their network, and to be open to their input and sensitive to their feedback. Everyone in my network said Orbital Insight was a great idea. And it helped to act with the confidence of a clear signal from the beginning.

#deep learning #interview #satellite data #machine learning #hedge funds

0 notes

fastforwardlabs · 9 years ago

Text

Under the Hood of the Variational Autoencoder (in Prose and Code)

The Variational Autoencoder (VAE) neatly synthesizes unsupervised deep learning and variational Bayesian methods into one sleek package. In Part I of this series, we introduced the theory and intuition behind the VAE, an exciting development in machine learning for combined generative modeling and inference—“machines that imagine and reason.”

To recap: VAEs put a probabilistic spin on the basic autoencoder paradigm—treating their inputs, hidden representations, and reconstructed outputs as probabilistic random variables within a directed graphical model. With this Bayesian perspective, the encoder becomes a variational inference network, mapping observed inputs to (approximate) posterior distributions over latent space, and the decoder becomes a generative network, capable of mapping arbitrary latent coordinates back to distributions over the original data space.

The beauty of this setup is that we can take a principled Bayesian approach toward building systems with a rich internal “mental model” of the observed world, all by training a single, cleverly-designed deep neural network.

These benefits derive from an enriched understanding of data as merely the tip of the iceberg—the observed result of an underlying causative probabilistic process.

The power of the resulting model is captured by Feynman’s famous chalkboard quote: “What I cannot create, I do not understand.” When trained on MNIST handwritten digits, our VAE model can parse the information spread thinly over the high-dimensional observed world of pixels, and condense the most meaningful features into a structured distribution over reduced latent dimensions.

Having recovered the latent manifold and assigned it a coordinate system, it becomes trivial to walk from one point to another along the manifold, creatively generating realistic digits all the while:

In this post, we’ll take a look under the hood at the math and technical details that allow us to optimize the VAE model we sketched in Part I.

Along the way, we’ll show how to implement a VAE in TensorFlow—a library for efficient numerical computation using data flow graphs, with key features like automatic differentiation and parallelizability (across clusters, CPUs, GPUs…and TPUs if you’re lucky). You can find (and tinker with!) the full implementation here, along with a couple pre-trained models.

Building the Model

Let’s dive into code (Python 3.4), starting with the necessary imports:

import functools from functional import compose, partial import numpy as np import tensorflow as tf

One perk of these models is their modularity—VAEs are naturally amenable to swapping in whatever encoder/decoder architecture is most fitting for the task at hand: recurrent neural networks, convolutional and deconvolutional networks, etc.

For our purposes, we will model the relatively simple MNIST dataset using densely-connected layers, wired symmetrically around the hidden code.

class Dense(): """Fully-connected layer""" def __init__(self, scope="dense_layer", size=None, dropout=1., nonlinearity=tf.identity): # (str, int, (float | tf.Tensor), tf.op) assert size, "Must specify layer size (num nodes)" self.scope = scope self.size = size self.dropout = dropout # keep_prob self.nonlinearity = nonlinearity def __call__(self, x): """Dense layer currying, to apply layer to any input tensor `x`""" # tf.Tensor -> tf.Tensor with tf.name_scope(self.scope): while True: try: # reuse weights if already initialized return self.nonlinearity(tf.matmul(x, self.w) + self.b) except(AttributeError): self.w, self.b = self.wbVars(x.get_shape()[1].value, self.size) self.w = tf.nn.dropout(self.w, self.dropout) ...

We can initialize a Dense layer with our choice of nonlinearity for the layer nodes (i.e. neural network units that apply a nonlinear activation function to a linear combination of their inputs, as per line 18).

We’ll use ELUs (Exponential Linear Units), a recent advance in building nodes that learn quickly by avoiding the problem of vanishing gradients. We wrap up the class with a helper function (Dense.wbVars) for compatible random initialization of weights and biases, to further accelerate learning.

In TensorFlow, neural networks are defined as numerical computation graphs. We will build the graph using partial function composition of sequential layers, which is amenable to an arbitrary number of hidden layers.

def composeAll(*args): """Util for multiple function composition i.e. composed = composeAll([f, g, h]) composed(x) # == f(g(h(x))) """ # adapted from https://docs.python.org/3.1/howto/functional.html return partial(functools.reduce, compose)(*args)

Now that we’ve defined our model primitives, we can tackle the VAE itself.

Keep in mind: the TensorFlow computational graph is cleanly divorced from the numerical computations themselves. In other words, a tf.Graph wireframes the underlying skeleton of the model, upon which we may hang values only within the context of a tf.Session.

Below, we initialize class VAE and activate a session for future convenience (so we can initialize and evaluate tensors within a single session, e.g. to persist weights and biases across rounds of training).

Here are some relevant snippets, cobbled together from the full source code:

class VAE(): """Variational Autoencoder see: Kingma & Welling - Auto-Encoding Variational Bayes (https://arxiv.org/abs/1312.6114) """ DEFAULTS = { "batch_size": 128, "learning_rate": 1E-3, "dropout": 1., # keep_prob "lambda_l2_reg": 0., "nonlinearity": tf.nn.elu, "squashing": tf.nn.sigmoid } RESTORE_KEY = "to_restore" def __init__(self, architecture, d_hyperparams={}, meta_graph=None, save_graph_def=True, log_dir="./log"): """(Re)build a symmetric VAE model with given: * architecture (list of nodes per encoder layer); e.g. [1000, 500, 250, 10] specifies a VAE with 1000-D inputs, 10-D latents, & end-to-end architecture [1000, 500, 250, 10, 250, 500, 1000] * hyperparameters (optional dictionary of updates to `DEFAULTS`) """ self.architecture = architecture self.__dict__.update(VAE.DEFAULTS, **d_hyperparams) self.sesh = tf.Session() if not meta_graph: # new model handles = self._buildGraph() ... self.sesh.run(tf.initialize_all_variables())

Assuming that we are building a model from scratch (rather than restoring a saved meta_graph), the key initialization step is the call to VAE._buildGraph (line 32). This internal method constructs nodes representing the placeholders and operations through which the data will flow—before any data is actually piped in.

Finally, we unpack the iterable handles (populated by _buildGraph) into convenient class attributes—pointers not to numerical values, but rather to nodes in the graph:

... # unpack handles for tensor ops to feed or fetch (self.x_in, self.dropout_, self.z_mean, self.z_log_sigma, self.x_reconstructed, self.z_, self.x_reconstructed_, self.cost, self.global_step, self.train_op) = handles

How are these nodes defined? The _buildGraph method encapsulates the core of the VAE model framework—starting with the encoder/inference network:

def _buildGraph(self): x_in = tf.placeholder(tf.float32, shape=[None, # enables variable batch size self.architecture[0]], name="x") dropout = tf.placeholder_with_default(1., shape=[], name="dropout") # encoding / "recognition": q(z|x) encoding = [Dense("encoding", hidden_size, dropout, self.nonlinearity) # hidden layers reversed for function composition: outer -> inner for hidden_size in reversed(self.architecture[1:-1])] h_encoded = composeAll(encoding)(x_in) # latent distribution parameterized by hidden encoding # z ~ N(z_mean, np.exp(z_log_sigma)**2) z_mean = Dense("z_mean", self.architecture[-1], dropout)(h_encoded) z_log_sigma = Dense("z_log_sigma", self.architecture[-1], dropout)(h_encoded)

Here, we build a pipe from x_in (an empty placeholder for input data \(x\)), through the sequential hidden encoding, to the corresponding distribution over latent space—the variational approximate posterior, or hidden representation, \(z \sim q_\phi(z|x)\).

As observed in lines 14 - 15, latent \(z\) is distributed as a multivariate normal with mean \(\mu\) and diagonal covariance values \(\sigma^2\) (the square of the “sigma” in z_log_sigma) directly parameterized by the encoder: \(\mathcal{N}(\mu, \sigma^2I)\). In other words, we set out to “explain” highly complex observations as the consequence of an unobserved collection of simplified latent variables, i.e. independent Gaussians. (This is dictated by our choice of a conjugate spherical Gaussian prior over \(z\)—see Part I.)

Next, we sample from this latent distribution (in practice, one draw is enough given sufficient minibatch size, i.e. >100). This method involves a trick—can you figure out why?—that we will explore in more detail later.

z = self.sampleGaussian(z_mean, z_log_sigma)

The sampled \(z\) is then passed to the decoder/generative network, which symmetrically builds back out to generate the conditional distribution over input space, reconstruction \(\tilde{x} \sim p_\theta(x|z)\).

# decoding / "generative": p(x|z) decoding = [Dense("decoding", hidden_size, dropout, self.nonlinearity) for hidden_size in self.architecture[1:-1]] # assumes symmetry # final reconstruction: restore original dims, squash outputs [0, 1] decoding.insert(0, Dense( # prepend as outermost function "reconstruction", self.architecture[0], dropout, self.squashing)) x_reconstructed = tf.identity(composeAll(decoding)(z), name="x_reconstructed")

Alternately, we add a placeholder to directly feed arbitrary values of \(z\) to the generative network (to fabricate realistic outputs—no input data necessary!):

# ops to directly explore latent space # defaults to prior z ~ N(0, I) z_ = tf.placeholder_with_default(tf.random_normal([1, self.architecture[-1]]), shape=[None, self.architecture[-1]], name="latent_in") x_reconstructed_ = composeAll(decoding)(z_)

TensorFlow automatically flows data through the appropriate subgraph, based on the nodes that we fetch and feed with the tf.Session.run method. Defining the encoder, decoder, and end-to-end VAE is then trivial (see linked code).

We’ll finish the VAE._buildGraph method later in the post, as we walk through the nuances of the model.

The Reparameterization Trick

In order to estimate the latent representation \(z\) for a given observation \(x\), we want to sample from the approximate posterior \(q_\phi(z|x)\) according to the distribution defined by the encoder.

However, model training by gradient descent requires that our model be differentiable with respect to its learned parameters (which is how we propagate the gradients). This presupposes that the model is deterministic—i.e. a given input always returns the same output for a fixed set of parameters, so the only source of stochasticity are the inputs. Incorporating a probabilistic “sampling” node would make the model itself stochastic!

Instead, we inject randomness into the model by introducing input from an auxiliary random variable: \(\epsilon \sim p(\epsilon)\).

For our purposes, rather than sampling \(z\) directly from \(q_\phi(z|x) \sim \mathcal{N}(\mu, \sigma^2I)\), we generate Gaussian noise \(\epsilon \sim \mathcal{N}(0, I)\) and compute \[z = \mu + \sigma \odot \epsilon\] (where \(\odot\) is the element-wise product). In code:

def sampleGaussian(self, mu, log_sigma): """Draw sample from Gaussian with given shape, subject to random noise epsilon""" with tf.name_scope("sample_gaussian"): # reparameterization trick epsilon = tf.random_normal(tf.shape(log_sigma), name="epsilon") return mu + epsilon * tf.exp(log_sigma) # N(mu, sigma**2)

By “reparameterizing” this step, inference and generation become entirely differentiable and hence, learnable.

Cost Function

Now, in order to optimize the model, we need a metric for how well its parameters capture the true data-generating and latent distributions. That is, how likely is observation \(x\) under the joint distribution \(p(x, z)\)?

Recall that we represent the global encoder and decoder parameters (i.e. neural network weights and biases) as \(\phi\) and \(\theta\), respectively.

In other words, we want to simultaneously tune these complementary parameters such that we maximize \(log(p(x|\phi, \theta))\)—the log-likelihood across all datapoints \(x\) under the current model settings, after marginalizing out the latent variables \(z\). This term is also known as the model evidence.

We can express this marginal likelihood as the sum of what we’ll call the variational or evidence lower bound \(\mathcal{L}\) and the Kullback-Leibler (KL) divergence \(\mathcal{D}_{KL}\) between the approximate and true latent posteriors: \[ log(p(x)) = \mathcal{L}(\phi, \theta; x) + \mathcal{D}_{KL}(q_\phi(z|x) || p_\theta(z|x)) \]

Here, the KL divergence can be (fuzzily!) intuited as a metric for the misfit of the approximate posterior \(q_\phi\). We’ll delve into this further in a moment, but for now the important thing is that it is non-negative by definition; consequently, the first term acts as a lower bound on the total. So, we maximize the lower bound \(\mathcal{L}\) as a (computationally-tractable) proxy for the total marginal likelihood of the data under the model. (And the better our approximate posterior, the tighter the gap between the lower bound and the total model evidence.)

With some mathematical wrangling, we can decompose \(\mathcal{L}\) into the following objective function: \[ \mathcal{L}(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[log(p_\theta(x|z))] - \mathcal{D}_{KL}(q_\phi(z|x) || p_\theta(z)) \] (Phrased as a cost, we optimize the model by minimizing \({-\mathcal{L}}\).)

Here, the perhaps unfriendly-looking first term is, in fact, familiar! It’s the probability density of generated output \(\tilde{x}\) given the inferred latent distribution over \(z\)—i.e. the (negative) expected reconstruction error. This loss term is intrinsic to perhaps every autoencoder: how accurately does the output replicate the input?

Choosing an appropriate metric for image resemblance is hard (but that’s another story). We’ll use the binary cross-entropy, which is commonly used for data like MNIST that can be modeled as Bernoulli trials. Expressed as a static method of the VAE class:

@staticmethod def crossEntropy(obs, actual, offset=1e-7): """Binary cross-entropy, per training example""" # (tf.Tensor, tf.Tensor, float) -> tf.Tensor with tf.name_scope("cross_entropy"): # bound by clipping to avoid nan obs_ = tf.clip_by_value(obs, offset, 1 - offset) return -tf.reduce_sum(actual * tf.log(obs_) + (1 - actual) * tf.log(1 - obs_), 1)

The second term in the objective is the KL divergence of the prior \(p\) from the (approximate) posterior \(q\) over the latent space. We’ll approach this conceptually, then mathematically.

The KL divergence \(\mathcal{D}_{KL}(q||p)\) is defined as the relative entropy between probability density functions \(q\) and \(p\). In information theory, entropy represents information content (measured in nats), so \(\mathcal{D}_{KL}\) quantifies the information gained by revising the candidate prior \(p\) to match some “ground truth” \(q\).

In a related vein, the KL divergence between posterior and prior beliefs (i.e. distributions) can be conceived as a measure of “surprise”: the extent to which the model must update its “worldview” (parameters) to accomodate new observations.

(Note that the formula is asymmetric—i.e. \(\mathcal{D}_{KL}(q||p) \neq \mathcal{D}_{KL}(p||q)\)—with implications for its use in generative models. This is also why it is not a true metric.)

By inducing the learned approximation \(q_\phi(z|x)\) (the encoder) to match the continuous imposed prior \(p(z)\), the KL term encourages robustness to small perturbations along the latent manifold, enabling smooth interpolation within and between classes (e.g. MNIST digits). This reduces “spottiness” in the latent space that is often observed in autoencoders without such regularization.

Mathematical bonus: we can strategically choose certain conjugate priors over \(z\) that let us analytically integrate the KL divergence, yielding a closed-form equation. This is true of the spherical Gaussian we chose, such that \[ {-\mathcal{D}}_{KL}(q_\phi(z|x) || p_\theta(z)) = \frac{1} 2 \sum{(1 + log(\sigma^2) - \mu^2 - \sigma^2)} \] (summed over the latent dimensions). In TensorFlow, that looks like this:

@staticmethod def kullbackLeibler(mu, log_sigma): """(Gaussian) Kullback-Leibler divergence KL(q||p), per training example""" # (tf.Tensor, tf.Tensor) -> tf.Tensor with tf.name_scope("KL_divergence"): # = -0.5 * (1 + log(sigma**2) - mu**2 - sigma**2) return -0.5 * tf.reduce_sum(1 + 2 * log_sigma - mu**2 - tf.exp(2 * log_sigma), 1)

Together, these complementary loss terms capture the trade-off between expressivity and concision, between data complexity and simplicity of the prior. Reconstruction loss pushes the model toward perfectionist tendencies, while KL loss (along with the addition of auxiliary noise) encourages it to explore sensibly.

To elaborate (building on the VAE._buildGraph method started above):

# reconstruction loss: mismatch b/w x & x_reconstructed # binary cross-entropy -- assumes p(x) & p(x|z) are iid Bernoullis rec_loss = VAE.crossEntropy(x_reconstructed, x_in) # Kullback-Leibler divergence: mismatch b/w approximate posterior & imposed prior # KL[q(z|x) || p(z)] kl_loss = VAE.kullbackLeibler(z_mean, z_log_sigma) # average over minibatch cost = tf.reduce_mean(rec_loss + kl_loss, name="cost")

Beyond its concise elegance and solid grounding in Bayesian theory, the cost function lends itself well to intuitive metaphor:

Information theory-wise, the VAE is a terse game of Telephone, with the aim of finding the minimum description length to convey the input from end to end. Here, reconstruction loss is the information “lost in translation,” while KL loss captures how overly “wordy” the model must be to convey the message through an unpredictable medium (hidden code imperfectly optimized for the input data).

Or, framing the VAE as a lossy compression algorithm, reconstruction loss accounts for the fidelity of (de)compression while KL loss penalizes the model for using a sub-optimal compression scheme.

Training

At last, our VAE cost function in hand (after factoring in optional \(\ell_2\)-regularization), we finish VAE._buildGraph with optimization nodes to be evaluated at each step of SGD (with the Adam optimizer)…

# optimization global_step = tf.Variable(0, trainable=False) with tf.name_scope("Adam_optimizer"): optimizer = tf.train.AdamOptimizer(self.learning_rate) tvars = tf.trainable_variables() grads_and_vars = optimizer.compute_gradients(cost, tvars) clipped = [(tf.clip_by_value(grad, -5, 5), tvar) # gradient clipping for grad, tvar in grads_and_vars] train_op = optimizer.apply_gradients(clipped, global_step=global_step, name="minimize_cost") # back-prop

…and return all of the nodes we want to access in the future to the VAE.__init__ method where buildGraph was called.

return (x_in, dropout, z_mean, z_log_sigma, x_reconstructed, z_, x_reconstructed_, cost, global_step, train_op)

Using SGD to optimize the function parameters of the inference and generative networks simultaneously is called Stochastic Gradient Variational Bayes.

This is where TensorFlow really shines: all of the gradient backpropagation and parameter updates are performed via automatic differentation, and abstracted away from the researcher in the train_op (essentially) one-liner on line 48.

Model training (with optional cross-validation) is then as simple as feeding minibatches from dataset X to the x_in placeholder and evaluating (“fetching”) the train_op. Here are some relevant chunks, excerpted from the full class method:

def train(self, X, max_iter=np.inf, max_epochs=np.inf, cross_validate=True, verbose=True, save=False, outdir="./out", plots_outdir="./png"): try: err_train = 0 now = datetime.now().isoformat()[11:] print("------- Training begin: {} -------\n".format(now)) while True: x, _ = X.train.next_batch(self.batch_size) feed_dict = {self.x_in: x, self.dropout_: self.dropout} fetches = [self.x_reconstructed, self.cost, self.global_step, self.train_op] x_reconstructed, cost, i, _ = self.sesh.run(fetches, feed_dict) err_train += cost if i%1000 == 0 and verbose: print("round {} --> avg cost: ".format(i), err_train / i) if i >= max_iter or X.train.epochs_completed >= max_epochs: print("final avg cost (@ step {} = epoch {}): {}".format( i, X.train.epochs_completed, err_train / i)) now = datetime.now().isoformat()[11:] print("------- Training end: {} -------\n".format(now)) break

Helpfully, TensorFlow comes with a built-in visualization dashboard. Here’s the computational graph for an end-to-end VAE with two hidden encoder/decoder layers (that’s what all the tf.name_scope-ing was for):

Wrapping Up

The future of deep latent models lies in models that can reason about the world—“understanding” complex observations, transforming them into meaningful internal representations, and even leveraging these representations to make decisions—all while coping with scarce data, and in semisupervised or unsupervised settings. VAEs are an important step toward this future, demonstrating the power of new ways of thinking that result from unifying variational Bayesian methods and deep learning.

We now understand how these fields come together to make the VAE possible, through a theoretically-sound objective function that balances accuracy (reconstruction loss) with variational regularization (KL loss), and efficient optimization of the fully differentiable model thanks to the reparameterization trick.

We’ll wrap up for now with one more way of visualizing the condensed information encapsulated in VAE latent space.

Previously, we showed the correspondence between the inference and generative networks by plotting the encoder and decoder perspectives of the latent space in the same 2-D coordinate system. For the decoder perspective, this meant feeding linearly spaced latent coordinates to the generative network and plotting their corresponding outputs.

To get an undistorted sense of the full latent manifold, we can sample and decode latent space coordinates proportionally to the model’s distribution over latent space. In other words—thanks to variational regularization provided by the KL loss!—we simply sample relative to our chosen prior distribution over \(z\). In our case, this means sampling linearly spaced percentiles from the inverse CDF of a spherical Gaussian.1

Once again, evolving over (logarithmic) time:

Interestingly, we can see that the slim tails of the distribution (edges of the frame) are not well-formed. Presumably, this results from few observed inputs being mapped to latent posteriors with significant density in these regions.

Here are a few resulting constellations (from a single model):

Theoretically, we could subdivide the latent space into infinitely many points (limited in practice only by the computer’s floating point precision), and let the generative network dream up infinite constellations of creative variations on MNIST.

That’s enough digits for now! Keep your eyes out for the next installment, where we’ll tinker with the vanilla VAE model in the context of a new dataset.

– Miriam

Thanks Kyle McDonald (@kcimc) and Tom White (@dribnet) for noting this!↩

#code #deep learning #probabilistic programming

6 notes · View notes

fastforwardlabs · 9 years ago

Text

Giving Speech a Voice in the Home

This is a guest post by Sean Lorenz, the Founder & CEO of SENTER, a Boston-based startup using sensors and data science to support healthcare in the home. Sean explains how techniques from computational neuroscience can help make the smart home smarter and describes the speech recognition hurdles developers have to overcome to realize smart home potential.

Consumer IoT pundits rave about the “smart home,” where our lights, shades, sprinklers and coffeemakers do what we want them to do automatically as they learn about our behaviors and habits. But the fact is that our homes are still far from being smart. Manufacturers have focused primarily on enabling existing products to send and receive data to/from a customer’s mobile phone. Much of this work is outsourced to services teams with expertise in full stack web and mobile app development; they’re great at whipping up dashboards and control buttons, but not at solving the problems that matter most to consumers.

Today, the smart home lacks three critical ingredients that hinder widespread consumer adoption:

1. Lack of protocol agreement, with increasing local protocol alliances.

2. Lack of intelligence. IFTTT is great for early adopter techies, but my cookie-baking 62-year-old midwestern mom is never going to create a rule to combine her Philips Hue lights and SmartThings motion sensors to perform an automated action. Like...never ever ever.

3. Lack of user experience. Home automation software seems to be stuck in the era of Wham!, Duck Hunt, and power-dressing with shoulder pads. Apple announced their own HomeKit app to give IoT iPhone developers a hand, but we’re a long way from usability.

The upshot? Data scientists have an important role to play in taking the home from being connected to being smart. I believe this will result from creating context-aware, speech-based applications that combine, and make better use of, data streaming in from sensors across potentially dozens of connected products in the home.

At SENTER, we are tackling one lobe in the smart home brain – health. As the US transitions from fee-for-service to value-based care, health care management is migrating from the hospital to our homes. We all know that a few charts and graphs telling patients how many steps they took today isn’t enough to reduce hospitalizations or predict flare ups in chronic illness.

But magic happens when we combine data from multiple sensors to create a user experience that makes managing health easier and more natural. At Senter we knew that a simple, rules-based system wouldn’t work for predicting an individual’s unique health concerns. Traditional machine learning approaches weren’t working well either. The biggest data science problem we faced was dealing with feature stacking across numerous time series streams.

In the rest of this post, I’ll dive deeper into the data science problems we’re working on to make a smart home health system work: 1) learnable sensor fusion algorithms and 2) better voice-based intelligent assistant applications.

Sensor fusion

As a computational neuroscience PhD student, I devoured papers on multimodal sensor integration in the mammalian brain. There’s a very special part of the brain called the posterior parietal cortex (PPC) whose job is to bind together inputs from across the sensory and motor areas to create higher level cognitive decision-making and planning. Modeling this area of the brain is extremely nonlinear and very hard to do (see my sad attempt here).

(An old-school brain functions diagram)

What does this have to do with the smart home? IoT needs to tackle the same problem, only with non-biological sensors. The goal of sensor fusion is to combine data from various sensor inputs to make smarter decisions.

Consider the example of predicting sepsis, a very serious condition among the elderly. Some key symptoms are fever, shaking chills, very low body temperature, decreased urination, rapid pulse and breathing rate, and vomiting. With smart home tools, we could use a connected bed mat to track body temperature, shaking or chills motion, and breathing rate; motion sensors to track times entered the bathroom; urine detection sensors in the toilet; and a wrist wearable to track heart rate. With the data collected, how would we fuse all these sensors to predict septic events?

There are different data science methods that are well-equipped for time series analysis. I’ve looked into recurrent neural networks with LSTM (see my IoT Slam talk and Ajit Jaokar’s work for reference). Another popular method, used by the Google self-driving car team, is Bayesian inference (see here). Alexandre Pouget and his research team even suggested that the brain uses a form of Bayesian inference to integrate and make sense of all this sensory input data. That said, there is plenty of preprocessing that goes on before it even reaches sensor fusion…but that’s a topic for another time!

How voice-based interaction systems need to evolve

Just predicting that a person is septic is not enough. We probably want to let them know! Part two of making the smart home actually smart requires seamless user interaction to improve algorithmic performance over time. So the next question becomes, what’s the best way to get users to engage with the systems and make their smart homes smarter? I believe voice-based devices and intelligent assistants like the Amazon Echo or Google Home will soon be the predominant site of user interaction, overtaking smartphones or tablets.

Imagine you’re a homeowner who just contracted a developer to build an AWS application that streams real-time IoT products to manage and reduce home energy usage. Your developer starts by creating a lambda function on sensor fusion algorithms to automatically adjust lights and shades, turn off outlets, and change room temperatures to keep the electric bill low. If her algorithms open the shades at times you don’t like, you need a way to correct that behavior, to tell the application to adjust its network weights. You could certainly tune weights with a prompt in a smartphone app, but it’s far more natural to say “Alexa, please raise my blinds back up.” An Alexa custom skill can then relay this feedback up the chain to the AWS application so it can update its behavior.

While this may sound good in theory, is it actually possible? Yes and no. At SENTER we’ve found that people (particularly elderly patients) absolutely love the idea of using voice-based devices for user experience. Returning to our sepsis example, we can now ask seniors qualitative questions about how they’re feeling to strengthen confidence scores. But a number of UX and interaction issues still need to be solved before systems like Amazon Echo can really take off in smart home applications.

The biggest issue with voice-based interfaces (and Amazon Echo in particular) is two-way interaction. There is currently no way for a developer to program Echo to ask homeowners unprompted questions (but developers frequently request this from the Alexa team). Let’s say we want to use the motion sensors to trigger when someone is in the same room as the Echo so that we can ask them the occasional health-related question or ask “Did you just fall, Mrs. Jones? Should I call for help?” Denny Britz’s excellent vision of conversational interfaces with machines in the home will have to wait a little longer.

Equally problematic is how these devices process human language. It’s a very hard problem to build a bot that can process a statement it hasn’t seen before, making inferences like we do in daily conversations. Indeed, there are frustratingly many responses to the simple question “How are you feeling today?” When building an Echo app today, developers must provide a list of sample utterances for how a user might respond, which hinders the ability to continually learn. Deep learning may advance flexibility in the future, but we have work to do. Amazon’s got a healthy head start, and Viv, Apple and Google are following suite.

Lastly, the combination of smart homes and voice-based interfaces need stronger use cases (beyond knowing my IoT toothbrush brush count or having my refrigerator tweet when I need milk). I’ve spoken to hundreds of device manufacturers, investors, homeowners and IoT conference attendees over the years, and can confidently say that people don’t want a smartphone app for every connected product they buy. They want it all to just work together. In one simple user experience. And most importantly — they want their smart home to manage typical functions like energy, safety, lighting or health.

Intelligent, semi-supervised sensor fusion coupled with natural communication via a speech-based assistant in the home will get us there. Alexa, please Google “sensor fusion papers”. Let’s get to work.

- Sean Lorenz

#smart home #deep learning #connected home #data science #whitepaper

0 notes

fastforwardlabs · 9 years ago

Text

Introducing Variational Autoencoders (in Prose and Code)

Effective machine learning means building expressive models that sift out signal from noise—that simplify the complexity of real-world data, yet accurately intuit and capture its subtle underlying patterns.

Whatever the downstream application, a primary challenge often boils down to this: How do we represent, or even synthesize, complex data in the context of a tractable model?

This challenge is compounded when working in a limited data setting—especially when samples are in the form of richly-structured, high-dimensional observations like natural images, audio waveforms, or gene expression data.

Cue the Variational Autoencoder, a fascinating development in unsupervised machine learning that marries probabilistic Bayesian inference with deep learning.

Benefiting from advances in both research communities, the Variational Autoencoder addresses these challenges by leveraging innovative deep learning techniques grounded in a solid Bayesian theoretical framework...and can be explained through mesmerizing GIFs:

(Read on, and all will become clear...)

Intro

Traditional autoencoders are models (usually multilayer artificial neural networks) designed to output a reconstruction of their input. Specifically, autoencoders sequentially deconstruct input data into hidden representations, then use these representations to sequentially reconstruct outputs that resemble the originals. Fittingly, this process of teasing out a mapping from input to hidden representation is called representation learning.

The appeal of this setup is that the model learns its own definition of a "meaningful" representation based only on the data—no human-derived heuristics or labels! This approach stands in contrast to the majority of deep learning systems in production today, which rely on expensive-to-obtain labeled data ("This image is a kitten; this image is a panda."). Alternatives to such supervised learning frameworks provide a way to benefit from a world brimming with valuable raw data.

Though trained holistically, autoencoders are often built for the part instead of the whole: researchers might exploit the data-to-representation mapping for semantic embeddings, or the representation-to-output mapping for extraordinarily complex generative modeling

But an autoencoder with unlimited capacity is doomed to the role of a wonky, computationally-expensive Xerox machine. To ensure that the transformations to or from the hidden representation are useful, we impose some type of regularization or constraint. As a tradeoff for some loss in fidelity, such impositions push the model to distill the most salient features from a cacophonous real-world dataset.

Variational Autoencoders (VAEs) incorporate regularization by explicitly learning the joint distribution over data and a set of latent variables that is most compatible with observed datapoints and some designated prior distribution over latent space. The prior informs the model by shaping the corresponding posterior, conditioned on a given observation, into a regularized distribution over latent space (the coordinate system spanned by the hidden representation).

As a result, VAEs are an excellent tool for manifold learning—recovering the "true" manifold in lower-dimensional space along which the observed data lives with high probability mass—and generative modeling of complex datasets like images, text, and audio—conjuring up brand new examples, consistent with the observed training set, that do not exist in nature.

Building on other informative posts, this is the first installment of a guide to Variational Autoencoders: the lovechild of Bayesian inference and unsupervised deep learning.

In this post, we'll sketch out the model and provide an intuitive context for the math- and code-flavored follow-up. In Post II, we'll walk through a technical implementation of a VAE (in TensorFlow and Python 3). In Post III, we'll venture beyond the popular MNIST dataset using a twist on the vanilla VAE.

The Variational Autoencoder Setup

An end-to-end autoencoder (input to reconstructed input) can be split into two complementary networks: an encoder and a decoder. The encoder maps input \(x\) to a latent representation, or so-called hidden code, \(z\). The decoder maps the hidden code to reconstructed input value \(\tilde x\).

Whereas a vanilla autoencoder is deterministic, a Variational Autoencoder is stochastic—a mashup of:

a probabilistic encoder \(q_\phi(z|x)\), approximating the true (but intractable) posterior distribution \(p(z|x)\), and

a generative decoder \(p_\theta(x|z)\), which notably does not rely on any particular input \(x\).

Both the encoder and decoder are artificial neural networks (i.e. hierarchical, highly nonlinear functions) with tunable parameters \(\phi\) and \(\theta\), respectively.

Learning these conditional distributions is facilitated by enforcing a plausible mathematically-convenient prior over the latent variables, generally a standard spherical Gaussian: \(z \sim \mathcal{N}(0, I)\).

Given this conjugate prior, the encoder's job is to supply the mean and variance of the Gaussian posterior over each latent space dimension corresponding to a given input. Latent \(z\) is sampled from this distribution, then passed to the decoder to be transformed back into a distribution over the original data space.

In other words, a VAE represents a directed probabilistic graphical model, in which approximate inference is performed by the encoder and optimized alongside an easy-to-sample generative decoder. For this reason, these complementary halves are also known as the inference (or recognition) network and the generative network. By reformulating this graphical model as a differentiable neural net with a single, pithy cost function (derived from the variational lower bound), the whole package can be trained by stochastic gradient descent (SGD) thanks to the "amusing" universe we live in.

Bayes, Meet Neural Networks

In fact, many developments in deep learning research can also be understood through a probabilistic, or Bayesian, lens. Some of these analogies are more theoretical, whereas others share a parallel mathematical interpretation. For example, \(\ell_2\)-regularization can be viewed as imposing a Gaussian prior over neural network weights, and reinforcement learning can be formalized through variational inference.

VAEs exemplify a case where this relationship is made explicit and elegant, and variational Bayesian inference is the guiding principle shaping the model's cost function and instrinsic architecture.

Why does this setup make sense?

In the Bayesian worldview, datapoints are observations drawn from some data-generating distribution: (observed) variable \(x \sim p(x)\). So, the MNIST dataset of handwritten digits describes a random variable with an intricate set of dependencies among all 28*28 pixels. Each MNIST image offers a glimpse into one arrangement of 784 pixel values with high probability—whereas a 28*28 block of white noise, or the Jolly Roger, (theoretically) occupy low probability mass under the distribution.

It would be a headache to model the conditional dependencies in 784-dimensional pixel space. Instead, we make the simplifying assumption that the distribution over these observed variables is the consequence of a distribution over some set of hidden variables: \(z \sim p(z)\). Intuitively, this paradigm is analogous to how scientists study the natural world, by working backwards from observed phenomena to recover the unifying hidden laws that govern them. In the case of MNIST, these latent variables could represent concepts like number identity and tiltedness, whereas more complex natural images like the Frey faces could have latent dimensions for facial expression and azimuth.

Inference is the process of disentangling these rich real-world dependencies into simplified latent dependencies, by predicting \(p(z|x) -\) the distribution over one set of variables (the latent variables) conditioned on another variable (the observed data). (This is where Bayes' theorem enters the picture.)

With this Bayesian frame-of-mind, training a generative model is the same as learning the joint distribution over the data and latent variables: \(p(x, z)\). This approach lends itself well to small datasets, since inference relies on the data-generating distribution rather than individual datapoints per se. It also lets us bake prior knowledge into the model by imposing simplifying a priori distributions over variables.

Classical (iterative, non-learned) approaches to inference are often inefficient and do not scale well to large datasets. With a few theoretical and mathematical tricks, we can train a neural network to do the dirty work of both variational inference and generative modeling...while reaping the additional benefits deep learning provides (universal approximating power, cheap test-time evaluation, minibatched SGD, advances like batch normalization and dropout, etc).

The next post in the series will delve into these theoretical and mathematical tricks and show how to implement them in TensorFlow (a toolbox for efficient numerical computation with data flow graphs).

MNIST

For now, we will take our VAE model for a spin using handwritten MNIST digits.

import tensorflow as tf from tensorflow.examples.tutorials.mnist import input_data import vae # this is our model - to be explored in the next post IMG_DIM = 28 ARCHITECTURE = [IMG_DIM**2, # 784 pixels 500, 500, # intermediate encoding 50] # latent space dims # (and symmetrically back out again) HYPERPARAMS = { "batch_size": 128, "learning_rate": 1E-3, "dropout": 0.9, "lambda_l2_reg": 1E-5, "nonlinearity": tf.nn.elu, "squashing": tf.nn.sigmoid } mnist = input_data.read_data_sets("mnist_data") v = vae.VAE(ARCHITECTURE, HYPERPARAMS) v.train(mnist, max_iter=20000)

Let's verify the model by eye, by plotting how well it parses random MNIST inputs (top) and reconstructs them (bottom):

Note that these inputs are from the test set, so the model has never seen them before. Not bad!

For latent space visualizations, we can train a VAE with 2-D latent variables (though this space is generally too small for the intrinsic dimensionality of real-world data). Picturing this compressed latent space lets us see how the model has disentangled complex raw data into abstract higher-order features.

We'll visualize the latent manifold over the course of training in two ways, to see the complementary evolution of the encoder and decoder over (logarithmic) time.

This is how the encoder/inference network learns to map the training set from the input data space to the latent space...

...and this is how the decoder/generative network learns to map latent coordinates into reconstructions of the original data space:

Here we are sampling evenly-spaced percentiles along the latent manifold and plotting their corresponding output from the decoder, with the same axis labels as above.

Looking at both plots side-by-side clarifies how optimizing the encoder and decoder in tandem enables efficient pairing of inference and generation:

This tableau highlights the overall smoothness of the latent manifold—and how any "unrealistic" outputs from the generative decoder correspond to apparent discontinuities in the variational posterior of the encoder (e.g. between the "7-space" and the "1-space"). These gaps could probably be improved by experimenting with model hyperparameters.

Whereas the original data dotted a sparse landscape in 784 dimensions, where "realistic" images were few and far between, this 2-dimensional latent manifold is densely populated with such samples. Beyond its inherent visual coolness, latent space smoothness shows the model's ability to leverage its "understanding" of the underlying data-generating process to generalize beyond the training set.

Smooth interpolation within and between digits—in contrast to the spotty latent space characteristic of many autoencoders—is a direct result of the variational regularization intrinsic to VAEs.

Take-aways

Bayesian methods provide a framework for reasoning about uncertainty. Deep learning provides an efficient way to approximate arbitrarily complex functions, and ripe opportunities to probe uncertainty (over parameters, hyperparameters, data, model architectures...).

While differences in language can obscure overlapping ideas, recent research has revealed not just the power of cross-validating theories across fields (interesting in itself), but also a productive new methodology through a unified synthesis of the two.

This research becomes ever more relevant as we seek to leverage today's most interesting real-world data, which is often high-dimensional and rich in structure, yet limited in number and wholly or partially unlabeled.

(But don't take my word for it.)

Variational Autoencoders are:

A reminder that productive sparks fly when deep learning and Bayesian methods are not treated as alternatives, but combined.

Just the beginning of creative applications for deep learning.

Stay tuned for more technical details (math and code!) in Part II.

- Miriam

#code #deep learning #probabilistic programming

6 notes · View notes