SigOpt Fundamentals: Covariance Kernels for Avoiding Boundaries
By: Mike McCourt, Ph.D.
Here at SigOpt, Gaussian processes and reproducing kernel Hilbert spaces (RKHS) are important components of our Bayesian optimization methodology. Our research into this topic often exists at the intersection of approximation theory and spatial statistics. We recently attended the SIAM Computational Science and Engineering 2017 meeting in Atlanta to continue learning about outstanding accomplishments in these and other exciting topics in computation.
In addition to participating in the career fair, we also organized a minisymposium to bring together experts in Gaussian processes and RKHS and discuss recent advances that are relevant to Bayesian optimization. First, we would like to extend our thanks to Jian Wu, Jie Chen and Greg Fasshauer for contributing talks to the session; their presentations provided a diverse set of insights into new strategies for improving the viability and performance of RKHS theory in Bayesian optimization. This post serves to discuss our ongoing collaboration with Dr. Fasshauer and explore the topic of his talk at a level suitable for a broad audience. For a more thorough discussion of the interplay between Green’s functions and covariance kernels, we will be releasing an In Depth blog post soon.
Covariance Kernels
A fundamental component of a Gaussian process is its covariance kernel, the mechanism by which information at different locations is believed to jointly vary. This defines the degree to which knowledge at one location provides knowledge at another; it also provides the key tool by which existing function observations can be used to predict function values at unobserved locations.
As such, the choice of covariance kernel can play a significant role in the quality of the kernel-based approximation to the function and, in turn, the quality of any Bayesian optimization in which that approximation is involved. We have previously discussed the topic of optimally choosing covariance kernels from a parametrized family using maximum likelihood estimation as well as the value of alternate parametrization metrics. In this post we explore the potential for a new family of kernels, with specially designed properties, to yield benefits during Bayesian optimization.
It should be noted that the role of positive definite (covariance) kernels extends beyond our application here in spatial statistics. Their role in support vector machines, studying text data, high dimensional integration, density estimation and other fields demonstrates their broad value. In this discussion, however, we will be focusing only on their role in scattered data approximation.
Isotropic and Anisotropic Radial Covariance Kernels
Before we talk about new kernels, it would be prudent to review kernels already common in the literature, especially as this affords us an opportunity to look at some pictures. By far, the most popular covariance kernels across disciplines are radial (thus the term radial basis function), and among those the most popular is the Gaussian kernel (also referred to, in a persistent bit of misnomer, as simply the RBF kernel). The images below show examples of these radial Gaussian (a.k.a. squared exponential) kernels with different shape parameters.
Figure 1: Isotropic radial Gaussian covariance kernels with different shape parameters. (left) A smaller shape parameter yields a larger length scale and interaction between data at longer distances. (right) A larger shape parameter yields a shorter length scale and thus more localized behavior.
These are examples of radial kernels (also called isotropic), but not all kernels are radial. Radial kernels can be converted from their isotropic form (where all dimensions have the same bandwidth/length scale) to their anisotropic form (where dimensions can act differently, also known more generally as stationary or, in some communities, automatic relevance determination) by redefining the sense of distance. Below are two anisotropic variants on the Gaussian kernels above with a skewed sense of distance.
Figure 2: Anisotropic Gaussian covariance kernels. (left) The x dimension has a wider length scale than the y dimension. (right) A skewed sense of distance yields a longer length scale in the x + y direction than in the x - y direction.
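To make these pictures concrete, below is a minimal sketch of how such Gaussian kernels can be evaluated; the function and parameter names are ours, chosen for illustration rather than taken from any particular library.

```python
import numpy as np

def gaussian_kernel_isotropic(x, z, shape_param=1.0):
    """Isotropic Gaussian (squared exponential) kernel exp(-eps^2 * ||x - z||^2).

    A larger shape parameter shrinks the length scale and localizes the kernel;
    a smaller one allows interaction between data at longer distances."""
    r2 = np.sum((np.asarray(x) - np.asarray(z)) ** 2)
    return np.exp(-shape_param ** 2 * r2)

def gaussian_kernel_anisotropic(x, z, shape_params):
    """Anisotropic variant: each dimension has its own shape parameter, so the
    kernel can use different length scales in different directions."""
    d = np.asarray(shape_params) * (np.asarray(x) - np.asarray(z))
    return np.exp(-np.sum(d ** 2))

# The same pair of points under different parameterizations.
x, z = np.array([0.2, 0.8]), np.array([0.5, 0.4])
print(gaussian_kernel_isotropic(x, z, shape_param=1.0))
print(gaussian_kernel_anisotropic(x, z, shape_params=[0.5, 3.0]))
```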
Nonstationary Iterated Brownian Bridge Covariance Kernels
All of the kernels described above, and actually most common covariance kernels, have the property that the distance between observations defines the covariance (thus the term radial). The reasons for their popularity are both practical (computational benefits exist for radial kernels) and historical (this article reviews much of this history regarding completely monotone functions and interpretations in the context of the radial Fourier transform).
Nonradial kernels have, in the spatial statistics literature, enjoyed less popularity. Part of this is for theoretical reasons: kernels which are a function of only a distance quantity can then be studied as functions of one variable. In the recent past, however, there has been a renewed emphasis on the potential benefits of using nonradial kernels, partly driven by the numerical/functional analysis community (this text provides some context).
The potential benefits of using nonradial kernels include the opportunity to model functions which vary in a nonstationary way; because these kernels are not translation invariant they have the ability to recognize variations in the function on different scales in different parts of the domain. See, e.g., this article for more discussion on the topic. One point of growth has been in the use of kernels defined through Green’s functions solutions to differential operators, a topic normally restricted to the solution of differential equations but which serves as a previously untapped source of positive definite kernels. Early papers on this topic include these two articles. An upcoming blog post will discuss this topic more thoroughly.
To put a tangible face to this topic, we can look at examples of one dimensional nonradial kernels belonging to the iterated Brownian bridge (IBB) family of kernels (introduced first in this article, and later in this text).
Figure 3: Several iterated Brownian bridge kernels, centered throughout the domain, all of which are zero on the boundaries. (left) Certain parameters can yield nonsmooth kernels. (right) Other parameters can yield kernels with greater smoothness and locality.
Those readers that are familiar with the Matérn class of positive definite kernels may see some similarities to the kernels in the figure above. This is not a coincidence: the full space Matérn kernels share the same differential operator, but the kernels above are only defined on a compact domain with specific boundary conditions. Originally, in fact, these IBB kernels were given the title “Compact Matérn” for this reason. A visual comparison between the full space Matérn and IBB kernels is shown in the figure below.
Figure 4: IBB kernels(0) in blue and standard Matérn kernels in dashed red; they behave similarly near the middle of the domain. (left) Nonsmooth kernels. (right) Kernels with continuous derivatives.
In the setting as I have described it so far, these kernels only exist in 1D; this can be extended to higher dimensions by considering product forms of these kernels. The figure below demonstrates this in 2 dimensions (beyond 2D it is rather difficult to visualize). In a future post we will more thoroughly study the performance of these tensor product kernels in higher dimensional spaces, where the pain of searching around the boundary is amplified.
Figure 5: Tensor products of these IBB kernels can be used to define covariances in more than one dimension. (left) A nonsmooth kernel. (right) A kernel with continuous derivatives.
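For readers who want something executable, here is a sketch of evaluating an IBB kernel through a truncated Mercer series, along with the tensor-product extension to two dimensions. Scaling conventions differ between references, so treat this as an illustration of the structure (zero boundary values, smoothness controlled by a parameter) rather than a definitive implementation.

```python
import numpy as np

def ibb_kernel_1d(x, z, eps=1.0, beta=1, n_terms=500):
    """Truncated series form of an iterated Brownian bridge kernel on [0, 1].

    The eigenfunctions sqrt(2) * sin(n * pi * x) vanish at 0 and 1, so the kernel
    is zero on the boundary; beta controls smoothness and eps the locality."""
    n = np.arange(1, n_terms + 1)
    lam = (n ** 2 * np.pi ** 2 + eps ** 2) ** (-float(beta))   # kernel eigenvalues
    phi_x = np.sqrt(2.0) * np.sin(n * np.pi * x)
    phi_z = np.sqrt(2.0) * np.sin(n * np.pi * z)
    return np.sum(lam * phi_x * phi_z)

def ibb_kernel_2d(x, z, eps=1.0, beta=1):
    """Tensor-product extension of the 1D kernel to two dimensions."""
    return (ibb_kernel_1d(x[0], z[0], eps, beta) *
            ibb_kernel_1d(x[1], z[1], eps, beta))

print(ibb_kernel_1d(0.5, 0.6, eps=2.0, beta=1))    # interior point: nonzero
print(ibb_kernel_1d(0.0, 0.6, eps=2.0, beta=1))    # boundary point: zero
print(ibb_kernel_2d([0.3, 0.7], [0.4, 0.5], beta=3))
```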
IBB Kernels in Bayesian Optimization
These kernels were originally invented with the goal of using them in a collocation solution of boundary value problems: by automatically enforcing boundary conditions, these kernels are an attractive basis with which to approximate a solution. Chapter 20 of this book provides an example of this strategy.
Recently, we decided that these IBB kernels could potentially be adapted for use in Bayesian optimization. In particular, one common complaint about some Bayesian optimization methods is their propensity to sample near the boundary; this problem was addressed recently in this article which describes using a quadratic mean to dissuade samples from the boundary. Our strategy here is to use these IBB kernels to enforce behavior near the boundary, and thus decrease the desire to sample around the boundary until the interior has been sufficiently explored.
In the figures below, we demonstrate the impact of this choice in comparison to a standard Matérn kernel with the same smoothness and shape parameter; a deterministic linear mean is assumed in these kriging models. Eight initialization points, randomly selected from the domain, are used to seed the optimization. These examples involve fixed covariances over the course of the optimization.(1)
Figure 6a: The evolution of the optimization using a radial Matérn kernel may result in an inefficient sampling in the corners and around the boundary.
Figure 6b: By using the iterated Brownian bridge kernel with the same smoothness and locality properties, there is less desire to sample on the boundary.
In the example above, the function actually satisfies the boundary conditions (zero values on the boundary), so one could (legitimately) argue that this superior behavior for the iterated Brownian bridge kernel is to be expected.(2) Below we show a function with arbitrary boundary behavior and multiple local optima and again compare the behavior.
Figure 7a: The radial kernel samples somewhat inefficiently for the multimodal nature of this function.
Figure 7b: The iterated Brownian bridge kernel is better able to sample near the optimum by avoiding sampling near the boundary.
Conclusion
A common complaint of Bayesian optimization methods using Gaussian process models is their propensity to sample around the boundary and in the corners of the solution domain. Here we introduced a strategy using special positive definite kernels to sample away from the boundary by enforcing specific behavior at the boundary. In a future blog post we will delve into the derivation of these kernels and explore more complicated optimization problems.
Acknowledgements
Special thanks, again, to Greg Fasshauer for his continued collaboration on this project. Also, thanks to both Jian Wu and Jie Chen for their valued contributions to our SIAM minisymposium.
More from SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
(0) The IBB kernels in this picture have been rescaled to match the scale of the Matérn kernels.
(1) This strategy is in contrast to the standard strategy of refitting the covariances using the current data but will help to visualize the described behavior. Also, the EI values can drop too low to be resolved in the plot, although the optimization is still conducted as is standard.
(2) There is an argument to be made here, which is often made in the approximation theory community, that using the correct interpolating basis is the appropriate behavior. However, in a truly black box setting, such knowledge is impractical.
SigOpt In Depth: Expected Improvement vs. Knowledge Gradient
By: Harvey Cheng.
In previous blog posts, we discussed the intuition behind using Gaussian processes and compared the difference between using marginal maximum likelihood and kriging variance for estimating the hyperparameters of the covariance kernels. These insights contribute to building better surrogate models of the objective function to guide our search for an optimal solution. In this blog post, we discuss another ingredient of Bayesian optimization: the sampling policy (or acquisition function). The combination of these tools empowers SigOpt to help customers efficiently optimize their critical metrics, including the accuracy of machine learning models, the performance of a reinforcement learning algorithm and the output of complex simulations.
Bayesian optimization in the Markov decision process framework
Before we dive into the different types of sampling policies, we would like to familiarize the reader with the notation used in this post. We want to look at the optimization loop in the context of a sequential decision making problem or, more precisely, a Markov decision process (MDP). In this setting, the decision maker alternates between choosing decisions ($x^n$) and observing outcomes from a stochastic process $\{Y^n\}$, as shown below:
$$ \text{choose } x^n \rightarrow \text{observe } Y^{n+1} \rightarrow \text{choose } x^{n+1} \rightarrow \text{observe } Y^{n+2} \cdots $$
We define the state of the world ($S^n$) as all the relevant information sufficient to make a decision regarding $x^{n}$; in the standard definition, it is common to restrict the state to only include necessary information, so as to have minimal dimension. For this post, our state will always include the current estimate of the objective function and, for certain sampling policies, additional information (such as the best incumbent observation).
For Bayesian optimization, the decisions ($x^n$) are the suggestions (or the query points). For each decision, we receive a random/noisy observation ($Y_{x^n}^{n+1}$) from the customer(1); this is the exogenous stochastic process. Using our knowledge of the state, the decision, and the observation, we can evolve to another state via a transition function $S^{n+1} = g( S^n, x^n, Y^{n+1} )$.
Because of the random nature of the observations, the idea of the optimal decision does not apply since decisions are influenced by the random outcomes. Instead, the goal is to search for a policy that is optimal in expectation (or over some risk measure)(2). A policy is a decision rule consisting of a function mapping a state to a decision for each iteration $n$.
A simple illustration
Consider a problem where our decision space consists of five options, A through E, each with an outcome. Our goal is to discover the best option given a limited budget to sequentially sample the options. Let us further assume that the observations from the samples are noisy.
We have said that the state variable encapsulates all the information required to compute the decision. If our state variable only contains the posterior mean of the options, as seen in Figure 1, where do we want to sample next if we want to discover the best option?
Figure 1. The posterior means of the five options.
In this simple example, we would want to sample option A since it has the highest estimated outcome. But what if we expand our state variable to include the posterior variance of the options, as shown in Figure 2?
Figure 2. The full posterior distributions of the five options.
With the added information, this becomes a more interesting decision. If your answer is still option A (because it has the highest estimated value), then you are exploiting. If you changed to option D (because it has the highest standard deviation), then you are exploring. A good sampling policy needs to carefully balance between exploitation and exploration.
One class of policies involves defining an improvement criterion using the state variable for the next decision. The idea of using an improvement-based approach to help guide the sampling procedure can be traced back to Kushner, 1964, namely the probability of improvement policy.
Expected improvement
One of the most well-known optimization criteria is the expected improvement (EI), first introduced in Mockus et al., 1978. This idea was combined with Gaussian process (GP) models in the efficient global optimization algorithm [Jones et al., 1998] and sequential kriging optimization [Huang et al., 2006]. These two algorithms form the foundation for most Bayesian optimization algorithms.
Let $f^n(x)$ be the estimated mean of the objective function evaluated at $x$ after $n$ samples and $\mathcal X$ be the decision space of the problem. The EI function is defined as
$$EI^n(x) = \textbf E \Big[ \max (f^{n+1} (x) − y_{\text{best}}, 0 ) | S^n, x^n = x \Big],$$
where the expectation is taken with respect to the random observation from decision $x^n$. The expected improvement policy is to select the maximal argument:
$$X^{EI,n} = \arg \max_{x \in \mathcal X} EI^n(x).$$
The EI policy has become the de facto policy for Bayesian optimization because the EI function can be easily computed (with an analytical solution) under the Gaussian assumptions.
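As a sketch of that computation (a generic closed form, not SigOpt's implementation): when the quantity inside the expectation is Gaussian with posterior mean mu and standard deviation sigma at the candidate point, EI reduces to a short expression involving the normal density and distribution functions.

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for a Gaussian posterior with mean mu and standard
    deviation sigma, measuring improvement over the incumbent y_best
    (maximization convention)."""
    if sigma <= 0.0:
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# A point whose posterior mean is below the incumbent can still have positive
# EI if its posterior uncertainty is large enough.
print(expected_improvement(mu=8.5, sigma=2.5, y_best=9.5))
print(expected_improvement(mu=10.0, sigma=0.5, y_best=9.5))
```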
Knowledge gradient
A close variant of the EI algorithm is the knowledge gradient (KG) policy, which was introduced in Frazier et al., 2006 for finite discrete decision spaces with normally distributed alternatives. Scott et al., 2010 later extended the KG policy to work with GP models.
The KG function is defined as
$$ KG^n (x) = \textbf E \Big[ \max_{x' \in \mathcal X} f^{n+1} (x') - \max_{x' \in \mathcal X} f^n (x') | S^n, x^n = x \Big], $$
where $\max_{x' \in \mathcal X} f^n(x')$ is the optimal value of the current posterior mean. Analogously, the KG policy is then
$$ X^{KG,n} = \arg \max_{x \in \mathcal X} KG^n(x). $$
The exact computation of the KG function is more costly than EI, due to the maximization within the expected value.
A numerical comparison
The EI and KG functions appear to be almost identical at a glance. In fact, these two are equivalent when the observations are deterministic(3). In the presence of noisy observations, these two policies behave very differently, but in what way? Let us revisit the example in Figure 2; we associate some numbers with the five options below in Table 1.
posterior    A      B      C      D      E
mean         10     7      6      8.5    7
variance     0.5    1.0    0.5    2.5    2.5
Table 1. Posterior means and variances of the five options.
Let us also assume the variance of the observation is 0.5 across all options and that the best incumbent outcome is 9.5. Computing the KG and EI values using these numbers, we can compare the results in Figure 3.
Figure 3. The KG and EI values for the five options (in log scale) when our best incumbent outcome is 9.5. The KG policy prefers option D and the EI policy prefers option A.
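For readers who want to reproduce a comparison along these lines, below is a minimal Monte Carlo sketch that treats the five options as independent normal beliefs with a simple conjugate update. This is an illustrative simplification rather than the exact computation behind Figure 3, but it recovers the same qualitative preferences (EI favors option A, KG favors option D).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([10.0, 7.0, 6.0, 8.5, 7.0])    # posterior means for options A..E
var = np.array([0.5, 1.0, 0.5, 2.5, 2.5])    # posterior variances
noise_var = 0.5                               # variance of a new observation
y_best = 9.5                                  # best incumbent outcome

def updated_mean_samples(x, n_samples):
    """Simulate next-step posterior means f^{n+1}(x) after one noisy sample at x."""
    y = rng.normal(mu[x], np.sqrt(var[x] + noise_var), size=n_samples)
    post_var = 1.0 / (1.0 / var[x] + 1.0 / noise_var)      # conjugate normal update
    return post_var * (mu[x] / var[x] + y / noise_var)

def ei_estimate(x, n_samples=200000):
    """EI(x): expected improvement of the updated mean over the incumbent."""
    return np.mean(np.maximum(updated_mean_samples(x, n_samples) - y_best, 0.0))

def kg_estimate(x, n_samples=200000):
    """KG(x): expected increase in the maximum posterior mean after sampling x."""
    new_max = np.maximum(updated_mean_samples(x, n_samples), np.max(np.delete(mu, x)))
    return np.mean(new_max) - np.max(mu)

for name, x in zip("ABCDE", range(5)):
    print(name, round(ei_estimate(x), 4), round(kg_estimate(x), 4))
```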
It is apparent from this comparison that the EI policy will select option A as the next sample; the KG policy, option D. The EI policy relies on the best incumbent solution as a guideline for the next sampling point, while the KG policy relies on the posterior distribution of the model as its guideline. In a sense, the KG policy trusts the model and believes that option A has a lower marginal value (since it is already the optimal choice so far). The EI policy does not solely rely on the posterior distribution but also on the best incumbent solution. If our best incumbent result is 11.5 instead, the EI policy now suggests that we should be sampling option D next.
Figure 4. The KG and EI values for the five options (in log scale) when our best incumbent outcome is 11.5. Both KG and EI policies prefer option D now.
This difference in the policies also exists when we use a GP to model the objective function. In Figure 5, we can see that given the same GP model, the EI policy wants to direct the next sample near the true optimal region of the function (which is also that of the posterior mean). On the other hand, the KG policy wants to direct the sampling decision to the region where the posterior variance is high. The KG policy is confident in the model in the optimal region, and wants to sample in the region that is more likely to improve upon the current best estimate.
Figure 5. Comparing KG and EI policies for an unknown continuous function (in dotted orange) using GP as the surrogate model. The EI policy prefers to sample in the region around 0.5 and the KG policy prefers to sample in the region around 1.5.
Conclusion
We outlined two improvement-based policies and examined the difference in their behaviors. While the EI policy is the de facto sampling policy in most Bayesian optimization algorithms, users who are confident in their statistical model might consider the KG policy for a greedier approach(4).
(1) The reason that we index the random outcome $Y$ by the superscript $n+1$ here is to create a consistent way of denoting whether variables are deterministic or random at iteration $n$. More formally, the sigma algebra $\mathcal F^n$ is generated by everything up to iteration $n$, i.e. $\mathcal F^n = \sigma (S^m, Y^m)_{m \leq n}$. We can define the filtration $\{ \mathcal F^n \}_{n \geq 0}$, such that $\mathcal F^0 \subset \mathcal F^1 \subset \cdots \subset \mathcal F$.
(2) We have used the terms sampling policy and acquisition function interchangeably thus far; however, it should be noted that they are two different concepts. The EI policy should really be called the max EI policy, since it takes the maximal argument of the EI function.
(3) The curious reader can find the derivation of this statement in Scott et al., 2010.
(4) KG is greedier in the sense that it believes in the posterior model and thus spends less time validating the posterior maximal region if the variance is low.
Common Problems in Hyperparameter Optimization
By Alexandra Johnson, Software Engineer.
Originally given as a talk at MLConf NYC 2017. Slides.
SigOpt’s Bayesian Optimization service tunes hyperparameters in machine learning pipelines at Fortune 500 companies and top research labs around the world. In this blog post, we are going to show solutions to some of the most common problems we’ve seen people run into when implementing hyperparameter optimization.
Introduction
Hyperparameters are your model’s magic numbers — values you set on your model before you train with any data. Examples include the number of trees in a random forest or the number of hidden layers in a deep neural net. Tweaking the values of your hyperparameters by just a small amount can have a huge impact on the performance of your model.
The process by which we search for “the best” hyperparameters is commonly known as hyperparameter optimization. Hyperparameter optimization is a powerful tool for unlocking the maximum potential of your model, but only when it is correctly implemented. Here, we are going to share seven common problems we’ve seen while executing hyperparameter optimization.
#1 Trusting the Defaults
The biggest mistake in hyperparameter optimization is not performing hyperparameter optimization at all. When you don’t explicitly set the hyperparameters on your model you are implicitly relying on the model developer’s default hyperparameters — and these values may be completely inappropriate for your problem. In an example from the SigOpt blog of building and tuning a TensorFlow ConvNet to predict Google Street View house digits from the SVHN dataset, we saw a 315% improvement over the baseline default hyperparameters, which had been hand-optimized for a similar task on the similar MNIST dataset.
#2 Using the Wrong Metric
In the early days of Bing, Microsoft researchers used the total number of searches as a measurement of the quality of their algorithmic search engine results. After optimizing their search engine for that metric, they found that searches had indeed gone up; however, on manual inspection, the search results had become worse after the optimization. Why? The underlying assumption that if the number of searches increased, then the algorithm was performing better was false — the number of searches had gone up because users needed to search more to find what they were looking for!
Stories like this illustrate a large issue when performing hyperparameter optimization; it is designed to amplify the evaluation criterion that you, the practitioner, have chosen. If you have incorrect underlying assumptions about your metric, hyperparameter optimization has the potential to amplify those incorrect underlying assumptions.
In your model you can balance multiple, competing metrics in evaluation, such as revenue and quality. You may want to consider building a scalar-valued composite of competing metrics, or exploring the space of all possible solutions through a multi-metric approach.
#3 Overfitting
Overfitting is the issue in which our model performs extremely well during training and optimization, and very poorly out of sample. You can avoid overfitting by using techniques such as cross validation, backtesting, or regularization. When using a technique like k-fold cross validation, your model evaluation for hyperparameter optimization would be the average of k model evaluations from each of the k folds of your data. Techniques like this will help ensure that the metric you optimize generalizes well to unseen data.
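As a sketch of what this looks like in practice, the value handed to the optimizer can be the average of the k fold scores rather than the score from a single train/test split; the model and dataset below are illustrative stand-ins.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(n_estimators, max_depth, X, y):
    """Model evaluation reported to the hyperparameter optimizer:
    the mean accuracy across k = 5 cross validation folds."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   random_state=0)
    return np.mean(cross_val_score(model, X, y, cv=5))

X, y = load_digits(return_X_y=True)
print(objective(n_estimators=100, max_depth=8, X=X, y=y))
```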
#4 Too Few Hyperparameters
When building a machine learning pipeline, from raw data to feature extraction to model building, often feature extraction will involve tunable parameters like transformations or learned feature representations. It is important to remember to optimize your feature parameters as well to get maximum performance. In this example from the SigOpt blog we built an xgboost classifier for SVHN digits and showed compelling results from tuning your feature parameters at the same time as your model hyperparameters. It is recommended that you optimize all hyperparameters of your model, including architecture parameters and model parameters, at the same time.
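Below is one hedged sketch of joint tuning, using a generic scikit-learn pipeline rather than the exact setup from that example: the feature-extraction parameter and the model hyperparameters share a single search space, so the optimizer can trade them off against each other.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Feature extraction (PCA) and the classifier live in one pipeline, so their
# parameters can be optimized at the same time.
pipeline = Pipeline([("pca", PCA()), ("clf", GradientBoostingClassifier())])
param_space = {
    "pca__n_components": [16, 32, 48, 64],        # feature parameter
    "clf__max_depth": [2, 3, 4, 5],               # model hyperparameters
    "clf__learning_rate": [0.01, 0.05, 0.1, 0.3],
}

X, y = load_digits(return_X_y=True)
search = RandomizedSearchCV(pipeline, param_space, n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```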
#5 Hand-tuning
An optimization method is the strategy by which the next set of hyperparameters are suggested during hyperparameter optimization. There are many different optimization methods to choose from, and they will all have different setup steps, time requirements, and performance outcomes.
When you manually tweak the values of your hyperparameters, you are the optimization method. And you are most likely an inefficient optimization strategy. At the end of the day, humans are usually poor at performing high dimensional, non-convex optimization in their heads. In this example from the SigOpt blog we show that algorithmic optimization can beat out hand tuning for a deep neural net in a number of hours, only requiring knowledge of a bounding box for the hyperparameters representing algorithmic weights or the structure of a neural net. Choosing an algorithmic optimization method will save you time and help you achieve better performance. Below, we go over the differences between some of the most popular methods.
Grid search, random search, and Bayesian optimization are three popular optimization methods that produce very different outcomes for your model’s peak performance.
#6 Grid Search
Grid search is a very common and often advocated approach where you lay down a grid over the space of possible hyperparameters and evaluate at each point on the grid; the hyperparameters from the grid with the best objective value are then used in production. At SigOpt, we are not fans of grid search. The most prominent reason is that grid search suffers from the curse of dimensionality: the number of times you are required to evaluate your model during hyperparameter optimization grows exponentially in the number of parameters. Additionally, it is not even guaranteed to find the best solution, often aliasing over the best configuration.
#7 Random Search
Random search is as easy to understand and implement as grid search and, in some cases, theoretically more effective. It is performed by evaluating n uniformly random points in the hyperparameter space and selecting the one producing the best performance.
The drawback of random search is unnecessarily high variance. The method is, after all, entirely random, and uses no intelligence in selecting which points to try. You are relying on luck to get good results. Intelligent methods like simulated annealing, Bayesian optimization (used by SigOpt), genetic algorithms, convex optimizers, and swarm intelligence methods can produce better performance for your model compared to grid search and random search. Furthermore, these methods require fewer evaluations of your model and have lower variance because they intelligently search the parameter space. Best of all, in the last few years, simple interfaces have been published for many of these methods, meaning that you don’t need a PhD in mathematics to bring the most sophisticated techniques from research into your model-building pipeline.
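To make the contrast concrete, here is a minimal sketch of how grid search and random search generate candidate points for a two-parameter space (the parameter names and ranges are illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
bounds = {"learning_rate": (1e-4, 1e-1), "max_depth": (2, 12)}

# Grid search: the number of evaluations grows exponentially with dimension,
# and the grid can alias over the best configuration between its points.
grid_points = list(itertools.product(
    np.linspace(*bounds["learning_rate"], 4),
    np.linspace(*bounds["max_depth"], 4)))        # 4^2 = 16 evaluations

# Random search: the same budget spread uniformly over the space, with no
# intelligence about which regions have looked promising so far.
random_points = [(rng.uniform(*bounds["learning_rate"]),
                  rng.uniform(*bounds["max_depth"])) for _ in range(16)]

print(len(grid_points), len(random_points))
```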
Conclusion
Hyperparameter optimization is an important part of any modern machine learning pipeline. To achieve performance gains, though, it must be implemented correctly. We hope that after reading this blog post you will avoid some of these common mistakes in hyperparameter optimization. Thanks for reading, and happy optimizing!
SigOpt provides a simple API for hyperparameter optimization. With three easy API calls you can integrate a sophisticated ensemble of Bayesian Optimization techniques into your machine learning pipeline. Learn more at sigopt.com/getstarted.
SigOpt for ML: Using Bayesian Optimization for Reinforcement Learning
By: Eric Bai and Olivia Kim, Software Engineering Interns.
In this post, we will show you how Bayesian optimization was able to dramatically improve the performance of a reinforcement learning algorithm in an AI challenge. We’ll provide background information, detailed examples, code, and references.
Background
Reinforcement learning is a field of machine learning in which a software agent is taught to maximize its acquisition of rewards in a given environment. Observations of the state of the environment are used by the agent to make decisions about which action it should perform in order to maximize its reward.
Reinforcement learning has recently garnered significant news coverage as a result of innovations in deep Q-networks (DQNs) by DeepMind Technologies. Through deep reinforcement learning, DeepMind was able to teach computers to play Atari games better than humans, as well as defeat one of the top Go players in the world.
As noted in DeepMind’s paper, an “informal search” for hyperparameter values was conducted in order to avoid the high computational cost of performing grid search. Because the complexity of grid search grows exponentially with the number of parameters being tuned, experts often spend considerable time and resources performing these “informal searches.” This may lead to suboptimal performance, or can lead to the systems not being tuned at all. Bayesian optimization represents a way to efficiently optimize these high dimensional, time consuming, and expensive problems.
We will demonstrate the power of hyperparameter optimization by using SigOpt’s ensemble of state-of-the-art Bayesian optimization techniques to tune a DQN. We’ll show how this approach finds better hyperparameter values much faster than traditional methods such as grid and random search, without requiring expert time spent doing “informal” hand tuning of parameters. The DQN under consideration will be used to solve a classic learning control problem called the Cart-Pole problem(1). In this problem, a pole must be balanced upright on a cart for as long as possible. To simulate this environment, we will use OpenAI’s Gym library.
Figure 1: A rendered episode from the OpenAI Gym’s Cart-Pole environment
The OpenAI Gym provides a common interface to various reinforcement learning environments; the code written for this post (available on Github) can be easily modified to solve other learning control problems from the Gym’s environments.
The Environment
In OpenAI’s simulation of the cart-pole problem, the software agent controls the movement of the cart, earning a reward of +1 for each timestep until the terminating step. A terminating step occurs when the pole is more than 15 degrees from vertical or if the cart has moved more than 2.4 units from the center. The agent receives 4 continuous values that make up the state of the environment at each timestep: the position of the cart on the track, the angle of the pole, the cart velocity, and the rate of change of the angle. The agent’s only possible actions at each timestep are to push the cart to the left or right by applying a force of either -1 or +1, respectively. A series of states and actions, ending in a terminating state, is known as an episode. The agent will have no prior concept about the meaning of the values that represent these states and actions.
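For readers unfamiliar with the Gym interface, a single episode with a purely random agent looks roughly like this (written against the Gym API available at the time of this post):

```python
import gym

# The agent observes a 4-dimensional state and chooses one of two discrete
# actions (push the cart left or right) at every timestep.
env = gym.make("CartPole-v0")
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()               # random agent, for illustration
    observation, reward, done, info = env.step(action)
    total_reward += reward                            # +1 per timestep until termination
print("episode reward:", total_reward)
```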
Q-learning
Q-learning is a reinforcement learning technique that develops an action-value function (also known as the Q-function) that returns an expected utility of an action given a current state. Thus, the policy of the agent is to take the action with the highest expected utility.
Assume there exists an all-knowing Q-function that always selects the best action for a given state. Through Q-learning, we construct an approximation of this all-knowing function by continually updating the approximation using the results of previously attempted actions. The Q-learning algorithm updates the Q-function iteratively, as is explained below; initial Q-values are arbitrarily selected. An existing expected utility is updated when given new information using the following algorithm(2),
$$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha (r_{t+1} + \gamma \max_a(Q_t(s_{t+1}, a)) - Q_t(s_t,a_t)).$$
$a_t$ is the action executed in the state $s_t$.
$s_{t+1}$ is the new state observed. In a deterministic environment(3), it is a function of $s_t$ and $a_t$.
$r_{t+1}$ is the immediate reward gained. It is a function of $s_t$, $a_t$, and $s_{t+1}$.
$\alpha$ is the constant learning rate; how much the new information is weighted relative to the old information.
$\gamma$ is the constant discount factor that determines how much long-term rewards should be valued.
In its simplest form, the Q-function can be implemented as a table mapping all possible combinations of states and actions to expected utility values. Since this is infeasible in environments with large or continuous action and observation spaces, we use a neural net to approximate this lookup table. As the agent continues to act within the environment, the estimated Q-function is updated to better approximate the true Q-function via backpropagation.
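Before moving on to the neural approximation, here is a minimal tabular sketch of the update above; it assumes a small discrete state space and is for illustration only, not the DQN used in our experiments.

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.99                  # learning rate and discount factor
Q = defaultdict(float)                     # Q[(state, action)] -> expected utility

def q_update(state, action, reward, next_state, actions):
    """One tabular Q-learning update, following the formula above."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def choose_action(state, actions, epsilon=0.1):
    """Epsilon-greedy policy: usually exploit the highest Q-value, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```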
The Objective Metric
To properly tune the hyperparameters of our DQN, we have to select an appropriate objective metric value for SigOpt to optimize. While we are primarily concerned with maximizing the agent’s reward acquisition, we must also consider the DQN’s stability and efficiency. To ensure our agent’s training is efficient, we will train the DQN over the course of only 350 episodes and record the total reward accumulated for each episode. We use a rolling average of the reward for each set of 100 consecutive episodes (episodes 1 to 100, 2 to 101, etc.) and take the maximum for our objective metric. This helps stabilize the agent’s learning while also giving a robust metric for the overall quality of the agent with respect to the reward.
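Concretely, given the array of per-episode rewards, this metric can be computed with a rolling average; the snippet below sketches the idea rather than reproducing the exact code from our repository.

```python
import numpy as np

def objective_metric(episode_rewards, window=100):
    """Maximum rolling average of total reward over each window of `window`
    consecutive episodes (episodes 1-100, 2-101, and so on)."""
    rewards = np.asarray(episode_rewards, dtype=float)
    rolling = np.convolve(rewards, np.ones(window) / window, mode="valid")
    return rolling.max()

# 350 training episodes yield 251 overlapping windows of length 100.
fake_rewards = np.random.default_rng(0).integers(10, 200, size=350)
print(objective_metric(fake_rewards))
```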
Tunable Parameters of Reinforcement Learning Via Deep Q-Networks
While there are many tunable hyperparameters in the realm of reinforcement learning and deep Q-networks(4), for this blog post the following 7 parameters(5) were selected:
minibatch_size: The number of training cases used to update the Q-network at each training step. These training cases, or minibatches, are randomly selected from the agent’s replay memory. In our implementation, the replay memory contains the last 1,000,000 transitions in the environment.
epsilon_decay_steps: The number of episodes required for the initial $\varepsilon$ value to linearly decay until it reaches its end value. $\varepsilon$ is the probability that our agent takes a random action, which decreases over time to balance exploration and exploitation. The upper bound for this parameter depends on the total number of episodes run. Initially, $\varepsilon$ is 1, and it will decrease until it is 0.1, as suggested in DeepMind’s paper.
hidden_multiplier: Determines the number of nodes in the hidden layers of the Q-network. We set the number of nodes by multiplying this value by the size of the observation space. We formulated this parameter in this way to make it easier to switch to environments with different observation spaces.
initial_weight_stddev and initial_bias_stddev: Both the Q-network’s weights and biases are randomly initialized from normal distributions with a mean of 0. The standard deviations of these distributions affect the rate of convergence of the network.
learning_rate: Regulates the speed and accuracy of the Q-network by controlling the rate at which the weights of the network are updated. We look at this parameter on the logarithmic scale. This is equivalent to $\alpha$ in the Q-learning formula.
discount_factor: Determines the importance of future rewards to the agent. A value closer to zero will place more importance on short-term rewards, and a value closer to 1 will place more importance on long-term rewards. This is equivalent to $\gamma$ in the Q-learning formula.
Other good hyperparameters to consider tuning are the minimum epsilon value, the replay memory size, and the number of episodes of pure exploration (_final_epsilon, _replay_memory_size, and _episodes_pure_exploration in the Agent class).
The Code
The code is available on Github. The only dependencies required to run this example are NumPy, Gym, TensorFlow, and SigOpt. If you don’t have a SigOpt account, you can sign up for a free trial. SigOpt also has a free plan available for academic users.
Running this code can be computationally intensive. If possible, try running this example on a CPU optimized machine. On a c4.4xlarge AWS instance, the entire example can take up to 5 hours to run. If you are running the code on an AWS instance, you can try using the SigOpt Community AMI that includes several pre-installed machine learning libraries.
Results
We compared the results of SigOpt’s Bayesian optimization to two standard hyperparameter tuning methods: grid search and random search(6). 128 objective evaluations for each optimization method were run, and we took the median of 5 runs.
              SigOpt     Random Search     Grid Search
Best Found    847.19     569.93            194.62
Table 1: SigOpt outperforms random search and grid search.
Figure 2: The best seen trace of hyperparameter tuning methods over the course of 128 objective evaluations.
SigOpt does dramatically better than random search and grid search! For fun, let’s look at the performance of the DQN with the best configuration found by SigOpt. Below are snapshots showing the progress of the sample network’s evolution over the 350 episodes.
Figure 3: Episode 64 of 350. The pole tilts too far, ending the episode.
Figure 4: Episode 125 of 350. The cart goes too far in one direction, ending the episode.
Figure 5: Episode 216 of 350. The agent performs well. The video cuts off before the agent fails.
As shown above, the agent initially has trouble keeping the pole balanced. Eventually, it learns that it can go in the direction that the pole is angled in order to prevent it from falling over immediately. However, the cart would then travel too far in one direction, causing the episode to terminate. Finally, the agent learns to move just enough to swing the pole the opposite way so that it is not constantly travelling in a single direction. This is the power of tuning discount_factor effectively!
Closing Remarks
Through hyperparameter tuning with Bayesian optimization, we were able to achieve better performance than otherwise possible with standard search methods. The example code presented in this post is easily adaptable to explore more computationally intensive tasks. We encourage you to try:
Implementing more sophisticated DQN features to improve performance
Tuning a greater number of hyperparameters
Attempting more complicated games from the OpenAI Gym, such as Acrobot-v1 and LunarLander-v0. Our code currently supports games with a discrete action space and a 1-D array of continuous states for the observation space
Tuning a DQN to maximize general performance in multiple environments
Let us know what you try!
More From SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Deep Learning: Tensorflow ConvNets on a Budget
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
(1) We use the version of the cart-pole problem as described by Barto, Sutton, and Anderson (warning: large PDF).
(2) For further insight into the Q-function, as well as reinforcement learning in general, check out this blog post from Nervana
(3) The environment does not need to be deterministic for Q-learning to work. The proof that Q-learning converges takes into account stochastic environments.
(4) DeepMind lists the hyperparameters used in their algorithm to train an agent to play Atari games
(5) The types and ranges of the hyperparameters used in this example:
minibatch_size: integer [10, 500]
epsilon_decay_steps: integer [$n$/10, $n$] where $n$ is the number of episodes (for the cart-pole problem, this is set to 350)
hidden_multiplier: integer [5, 100]
initial_weight_stddev: double [0.01, 0.5]
initial_bias_stddev: double [0.0, 0.5]
log(learning_rate): double [log(0.0001), log(1)]
discount_factor: double [0.5, 0.9999]
(6) We used uniform random search and the vertices of the hypercube for grid search.
SigOpt In Depth: Building a Better Mousetrap via Multicriteria Bayesian Optimization
By: Mike McCourt, Ph.D.
In an earlier post, we introduced the topic of multicriteria optimization and some mechanisms for solving multicriteria problems (also called multiobjective problems) using the SigOpt optimization tool; SigOpt provides a RESTful API to solve scalar black-box optimization problems for customers from fields such as industrial design and image processing. The earlier introduction discussed a somewhat simple problem of choosing the driving speed on a trip subject to concerns about minimizing both time in transit and cost of fuel. In this post we want to analyze a more complex situation in which the parameters of a given model produce a random output, and our multicriteria problem involves maximizing the mean while minimizing the variance of that random variable.
The standard example in literature of simultaneously considering the mean behavior of a model/process and the variance is in financial portfolio management. Generally, investors want to maximize expected gain while minimizing risk (which exists in the form of variance) which motivated the development of the Nobel prize winning theory of mean-variance analysis. The original work and a 2014 retrospective by the same author (as well as modern texts such as here and here) serve as helpful introductions to the theory of defining (even implicitly) a utility function and, potentially, an associated quadratic (mean-variance) approximation. Under some conditions (discussed in those references) an expert’s selection of a portfolio from the mean-variance Pareto frontier should approximately optimize the utility function.
Our example today is less practical and more fun(1). We start with a mouse that has some ability to intelligently traverse a maze and study how long it takes the mouse to solve a randomly generated maze. Then we try to tune parameters of the random maze generation algorithm to create mazes which are very difficult for the mouse to solve. This will be phrased as a multicriteria optimization problem, with the goal of maximizing the mean time to solve while minimizing its variance.
Variations on a theme
The motivating goal of mean-variance analysis involves both wanting good performance (high portfolio return) and wanting to avoid bad performance (high risk and potential for loss). These are certainly related, but are not necessarily the same thing.
For example, gamblers who play slot machines are told on the machine exactly how much they will win/lose on average by playing the slot machine. If the slot machine paid out \$0.90 for every \$1.00 put into it, with no variance whatsoever, I doubt anyone would play(2). But, gamblers may be willing to accept a low mean payout in exchange for a high variance because the higher variance implies a greater chance for a million dollar payout. In this setting, the gambler embraces the high likelihood of a bad performance in exchange for a potentially exceptional performance.
Other circumstances where our goal would not be to maximize the mean and minimize variance include many industrial processes: in the formation of a tool such as a nail, the goal is not to maximize the mean length of the nail but rather to minimize deviation from a desired length. This goal was formalized with the development of the Taguchi loss function(3), which served as a scalarization of the multicriteria optimization problem involving minimizing both the variance and the deviation from the desired outcome.
Maze solver description
Our maze solver strategy is the very simple wall follower strategy, which will, admittedly, only work on simply connected mazes (no loops, no levels). Given that, it is quite easy to implement and has predictable (non-random) behavior which helps push the entirety of the randomness to the maze generation algorithm.
For those of you interested, you can see the implementation of the maze construction and solution in our examples Github repo.
The strategy involves the mouse continually following the wall on the right all the way through the maze until reaching the end. If the mouse can turn right, it does so; otherwise it moves forward; if both those paths are blocked, it attempts to turn left. If all three directions are blocked, the mouse has reached a dead end and reverses course, continuing to hug the right wall to guide it forward. There is an outstanding explanation and graphic of this wall follower strategy available here which I will not try to supersede. The figure below, instead, shows why the wall follower strategy succeeds for simply connected mazes, but may fail for mazes which have their exit inside a loop.
Figure 1: Two mice moving through slightly different mazes (see if you can spot the difference) using the wall follower strategy. (left) This maze is solvable because all walls are attached to the boundary. (right) This maze is not solvable because the solution is next to a wall which is not attached to the boundary; eventually, the mouse simply returns to the start.
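For concreteness, here is a sketch of a single step of this right-hand rule; `maze.can_move` is a hypothetical helper (it would report whether the wall in a given direction is open), and this is not the implementation from our examples repo.

```python
# Compass directions ordered clockwise, so a right turn is the next index.
ORDER = ["N", "E", "S", "W"]
OFFSETS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}   # y grows downward

def wall_follower_step(maze, position, heading):
    """One step of the right-hand wall follower: prefer turning right, then going
    straight, then turning left; if all three are blocked, reverse course."""
    i = ORDER.index(heading)
    right, left, back = ORDER[(i + 1) % 4], ORDER[(i - 1) % 4], ORDER[(i + 2) % 4]
    for direction in (right, heading, left, back):
        if maze.can_move(position, direction):        # hypothetical maze API
            dx, dy = OFFSETS[direction]
            return (position[0] + dx, position[1] + dy), direction
    return position, heading                           # completely walled in
```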
Maze generator description
It is quite tempting for me, as a research-oriented person, to see the problem of designing a maze as an opportunity to spend several days devising diabolical labyrinths under the cover of “developing content marketing”. I refrained, however, if only because so many great resources are already available on the internet. In particular, Wikipedia already had a perfectly acceptable starting point available involving a random depth-first search. Furthermore, the resulting maze structure guarantees that the failure to solve observed in Figure 1 cannot be encountered.
The only shortcoming with the code already available was that there were no tunable parameters. We needed to introduce a mechanism for making the mazes harder or easier for the mouse to solve; we did this by varying the relative probability of the depth-first search choosing each of the possible directions in which it could move, after eliminating those directions which it had already traversed. Essentially, at each step of the maze generation, there is a relative probability of punching through a wall. The figure below shows one such randomly generated maze, which has no preference for any direction.
Figure 2: A sample maze creation process with equal probability of moving in each direction.
By allowing the creation algorithm the possibility of different probabilities, mazes will start to gain a more identifiable structure, as demonstrated in the figure below. The probability of moving left, up, right and down are the parameters that we tune using Bayesian optimization.(4)
Figure 3: In this sample maze creation, the depth-first search algorithm is set with relative probabilities of [.2, 1, .2, 1] for [left, up, right, down]. This results in a maze which prefers long vertical corridors.
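A simplified sketch of the weighted direction choice inside the depth-first search (again, an illustration rather than the implementation in our examples repo):

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_direction(candidate_directions, relative_probs):
    """Choose where the depth-first search punches through a wall next.

    `relative_probs` holds the relative probability of each direction, e.g.
    {"left": 0.2, "up": 1, "right": 0.2, "down": 1} for the maze in Figure 3;
    directions already traversed have been removed from the candidates."""
    weights = np.array([relative_probs[d] for d in candidate_directions], dtype=float)
    return rng.choice(candidate_directions, p=weights / weights.sum())

print(pick_direction(["left", "up", "down"],
                     {"left": 0.2, "up": 1.0, "right": 0.2, "down": 1.0}))
```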
Problem Structure
At this point we have a relatively smart mouse with a fixed maze solving strategy and a random maze generation tool with 3 tunable parameters (the four construction direction relative probabilities minus the need for probabilities to sum to 1(4)). Our goal is to create a maze which maximizes the average time required to solve the maze while minimizing the standard deviation(5) of that time. Basically, how can we make a maze which is consistently difficult?
A precursor to that involves studying the impact randomness has on the mouse’s solution time, and how that random variable changes as a function of the various parameters. The histograms in the figure below would not be practical in a production setting, but are provided here to hint at the impact of the parameters on the mouse’s performance. Mouse maze solve times are described in terms of the proportion of total cells in the maze; the minimum possible value is 0 and the max is 2(6). The mouse always starts at the top left cell and the exit is always at the bottom right cell, for simplicity.
Figure 4: Histograms (showing the proportion, not frequency) of 5000 30x30 maze solve times for various maze construction parameterizations of the [left, up, right, down] relative probabilities. (top left) [1, 1, 1, 1]. (top right) [5, 1, 1, 1]. (bottom left) [1, 5, 1, 1]. (bottom right) [1, 1, 5, 1].
As we can see there is a significant, and seemingly unpredictable, impact from changing the parameters. Maximizing the mean (or conceivably, median) of this distribution while minimizing the variance (or maybe interquartile range) of this mouse’s speed will be non-intuitive. This black box style of problem is exactly what SigOpt is designed to handle … except for the fact that SigOpt does not (as of August 2016) natively support multicriteria optimization (though, thanks to the support of our investors, we hopefully will soon).
Devising a solution strategy
So, how can we use scalar Bayesian optimization to provide a solution to this problem? An earlier post introducing the topic of multicriteria optimization provided possible strategies, which we implement here. Depending on the situation, there may be multiple ways to interpret “solution” for a multicriteria optimization problem:
Some scalar function of the multicriteria components could be formed, which is then optimized to define the “solution”. This would require only a single scalar optimization, but would require prior understanding as to how the scalarization balances the mean and variance.
An approximation to the Pareto frontier can be formed by solving several scalarized versions of the problem. A selection from these scalarized solutions can then be made by a domain expert to determine which is most appropriate given the level of risk aversion. This strategy is the focus of our work today; a minimal sketch of two such scalarizations appears after this list.
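As a rough sketch of the two phrasings (the penalty treatment of the constraint is our simplification for illustration; the exact setup of the SigOpt experiments is described below):

```python
def weighted_sum_objective(mean_time, std_time, w):
    """Weighted sum scalarization: maximize w * mean - (1 - w) * std.

    Sweeping w over, say, 15 evenly spaced values in [.05, .95] produces a
    sequence of scalar problems whose optima approximate the Pareto frontier
    when the component functions are convex."""
    return w * mean_time - (1.0 - w) * std_time

def epsilon_constrained_objective(mean_time, std_time, mean_floor, penalty=1e3):
    """Epsilon-constrained scalarization: minimize the standard deviation subject
    to the mean exceeding a chosen floor, handled here with a simple penalty.
    Larger returned values are better."""
    violation = max(mean_floor - mean_time, 0.0)
    return -std_time - penalty * violation
```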
As a baseline, the figure below was generated using a 1000 point low discrepancy grid sampling over the 3D parameter space (varying left, up, right construction probabilities while holding down fixed): it shows the spread of mean/variance mouse solve times estimated over 100 randomly generated 30x30 mazes.
Figure 5: 1000 low discrepancy samples, logarithmically spaced in the domain [0.01, 100]^3, are used for left, up, right probabilities to construct random mazes. We seek parametrizations that are high on the y-axis and low on the x-axis (compare to portfolio optimization). The red points are Pareto efficient samples.
There are a lot of things that can be said about this figure, and a lot of interesting research questions (e. g., why is there an empty region in the bottom left?), but we will refrain from a thorough investigation in the interests of following a practicable workflow; the amount of work required to create this picture is significantly more than we hope to invest to find a solution. The remainder of this section will detail our attempts to do this efficiently.
The shape of the feasible set indicates that the metrics of interest may not be convex; this presents limitations to the use of weighted sum scalarization in trying to define the Pareto frontier. Of course, we likely would not know this at the start of the optimization process, so we will proceed as though we believe the component functions to be convex and see how the results turn out.
Our first strategy is to still use the weighted sum method (even though it could be troubled) to approximate the Pareto frontier; recall that, if the component functions are convex, then a weighted sum (which sums to 1) will have an optimum on the Pareto frontier. We will consider 15 weighted sum experiments with weights evenly spaced in [.05, .95]. Mean and variance values for suggested left, up and right turn parameters will be estimated over 100 maze constructions. The figure below shows the result of such optimizations, as conducted using SigOpt.
Figure 6: (left) A sequence of weighted sum scalarizations with 300 total maze generations produces the approximate Pareto frontier in red; all inefficient points are in blue. The yellow points were the final points of a scalar optimization which did not lie on the frontier. (right) A magnified view of Figure 5 which has the randomly searched Pareto frontier in black. 1000 points were needed to produce this approximate frontier, with no consistent gain in quality over the cheaper solution on the left produced using SigOpt.
The mean and variance times computed for a given set of parameters are not unique to the weight in the scalarization; as such, we can reuse information from previous weight values as “initial data” for the next weighted sum experiment(7). This allows us to use only 20 function evaluations per scalarized experiment without a substantial loss in quality. Because the mean and variance are not convex functions of the turn parameters, however, some of the optimization solutions are not present on the true Pareto frontier.
To work around any potential lack of convexity in the component functions, we can instead phrase the problem as an epsilon-constrained scalar optimization, similarly to what was discussed in our earlier multicriteria post. Each of these scalar optimization problems can be phrased as minimizing only the standard deviation, subject to the mean being greater than a pre-chosen constraint. The image below was created with 10 constraint-based experiments, using mean constraints evenly spaced in [.95, 1.2].
Figure 7: (left) Like Figure 6, this depicts the Pareto frontier (red are efficient, blue are inefficient, yellow are the scalar constraint terminations), but here it is approximated using constrained optimization in SigOpt. (right) The same random search approximated frontier from Figure 6, which required 1000 maze generations, rather than the 300 for the SigOpt generated frontier on the left.
Choosing a solution
So, after all of this, it is important to remember that simply forming the Pareto frontier does not solve the practical issue of designing the product: an expert must choose the “best” solution from the approximate frontier. Because we are not maze experts, we simply present some of the possible choices in the figure below and allow the reader to make their own judgment. In a more practical/industrial setting, the balance imposed by the expert may take the name utility or risk aversion.
Figure 8: Sample mazes (starting in the top left, ending in the bottom right) representing points on the Pareto Frontier: the parameters are listed, along with the estimated mean and standard deviation. (top left) [.034, .021, 88.365], mean .068, sd .002; (top right) [51.4, 51.3, 1.385], mean 1.0, sd .009; (bottom left) [.015, 5.85, .0196], mean 1.1, sd .09; (bottom right) [.042, 61.226, .052], mean 1.21, sd .173.
Conclusion
In many settings there is a need to balance performance against robustness: Toyota may want to increase fuel efficiency but it cannot produce vehicles which gain 20% more efficiency at the cost of failing to start 1 out of every 5 times they are used. This tradeoff between improving average behavior and introducing a greater potential for bad behavior is the crux of this post, which we have addressed in the context of making a better mouse-stumping maze.
We have used SigOpt to identify multiple approximately Pareto efficient maze configurations from which an expert (maze designer, not an expert mouse) could select the maze which best balances difficulty and consistency. For these multicriteria problems, the need to efficiently conduct black box optimization is amplified because of the need to generate multiple solutions on the Pareto frontier. Using Bayesian optimization has a multiplicative benefit because each point on the Pareto frontier is generated at less experimental cost than with, e. g., grid search.
Again, any interested readers can feel free to experiment on their own using our implementation of the maze construction and solution available in our examples Github repo.
Special Thanks
Again, I would like to extend my thanks to Devon Sigler, an expert in multicriteria optimization who will shortly be finishing his Ph.D. He has helped introduce me to the wide range of literature that discusses multicriteria optimization and has provided valuable insights regarding the theoretical implications of popular solution strategies.
More from SigOpt
SigOpt's 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
(1) I have been told that my posts are too serious and informative and should be more fun. I hope you find this post to be more fun.
(2) SigOpt does not advocate for gambling. If you choose to gamble, please gamble responsibly. Gambling at a game where you automatically lose 10% of your bet with no potential to win is an example of irresponsible gambling.
(3) Normally, this is where I would interject with my own discussion of the competing interests of theoretical statistics/mathematics and practical applications, given the compelling history of Genichi Taguchi, but Wikipedia has already beaten me to it. Kudos, again, to the outstanding contributions on Wikipedia and those of you out there doing the contributing.
(4) In our experiments, we allow ourselves 3 tunable parameters: the relative probabilities of moving left, up and right. We use the term “relative probability” to mean relative to the probability of moving down, which is always fixed at 1; the requirement that all possible probabilities sum to 1 is why there are only 3 tunable parameters. It is entirely possible (and more natural) to phrase this situation as having 4 free parameters, but in doing so there will be repeated behavior throughout the 4D parameter space. The random mazes generated with (1, 2, 3, 4) would look the same as (2, 4, 6, 8) or (.1, .2, .3, .4). To avoid this degeneracy (where a 4D problem actually only has 3 dimensions of complexity) we fix the final element in the vector to 1. This is not the only way to resolve this (see, e. g., mixture designs) but it is the simplest, in our opinion.
(5) It is more common to read about minimizing the variance than the standard deviation (as the name mean-variance analysis suggests) but here we deal with the standard deviation because it has the same units as mean. This helps to put both components of the multicriteria problem on the same scale, which can be helpful when identifying a solution.
(6) Because the mouse follows the wall follower strategy, it can never reach a location more than twice (once coming and once going if that path led to a dead end). That is why the maximum is 2. The minimum is conceivably 0 if the exit and start were the same location, but functionally a number greater than 0 because we place the exit at the other side of the maze.
(7) Since latter experiments (in the sequence of weighted sum problems) have the benefit of earlier data, they should produce a better solution to the scalarized optimization problem. Although we did not here, it might be wise to run earlier experiments with a few extra observations to provide them with more opportunity for the optimization algorithm to converge.
Text
Solving Common Issues in Distributed Hyperparameter Optimization
By: Alexandra Johnson, Software Engineer.
Easily identify issues in distributed hyperparameter optimization with SigOpt’s Monitor Dashboard.
You’ve spent time gathering data and engineering features and now you’re ready to pick a model and maximize its performance. The next step is to optimize the hyperparameters of your model. You’ve chosen Bayesian optimization for hyperparameter optimization because it has been shown to achieve better results than grid search and random search, while also requiring fewer evaluations of your model [1].
Since you probably want to complete hyperparameter optimization in a reasonable amount of time, you’re considering some method of distributing the total work across a cluster. If you’re considering distributing the model itself over the cluster, Netflix has a great blog post about the tradeoffs involved in that decision [2]. For now, let’s assume that model training is contained on a single machine, and individual workers train new models for each set of hyperparameters, in a setup similar to the one below.
Distributed hyperparameter optimization using the SigOpt API to request Suggestions (suggested sets of hyperparameters) and report Observations (the value of the model on a given set of hyperparameters). From SigOpt’s docs on parallelization.
However, you still have a few problems:
Progress monitoring
Custom hyperparameters
We’ll show you how you can use SigOpt’s API for Bayesian optimization to manage your cluster and solve both of these problems.
Problem #1: Progress Monitoring
The performance benefits of Bayesian optimization come at the cost of additional bookkeeping. Unlike grid search or random search, a master node will also need to store the model performance associated with each set of hyperparameters. Because of this added complexity, the development of a distributed Bayesian hyperparameter optimization cluster requires simultaneously debugging both your job scheduler and the health of your cluster. Spark and AWS offer cluster monitoring tools for the health of your machines, but you need an at-a-glance answer to the question “Is my hyperparameter optimization running smoothly?”
SigOpt’s answer is visualization. In your hyperparameter optimization cluster, each node will repeatedly execute the following steps:
Receive a suggested configuration of hyperparameters
Evaluate the model using the suggested hyperparameters
Report back an observation on the model’s performance
When using SigOpt, steps 1 and 3 are both simple API calls. We built a dashboard of useful graphs on top of information that we infer from API calls, such as the time an API call is made.
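For concreteness, the loop each worker runs might look like the rough sketch below, written against the SigOpt Python client; the experiment id and the train_and_evaluate_model function are hypothetical placeholders, and the exact argument names should be checked against our API documentation.

import socket
from sigopt import Connection

conn = Connection(client_token="YOUR_API_TOKEN")
EXPERIMENT_ID = 12345     # hypothetical experiment id
WORKER_BUDGET = 50        # evaluations this worker is responsible for

for _ in range(WORKER_BUDGET):
    # Step 1: receive a suggested configuration of hyperparameters
    suggestion = conn.experiments(EXPERIMENT_ID).suggestions().create()
    try:
        # Step 2: evaluate the model using the suggested hyperparameters
        value = train_and_evaluate_model(suggestion.assignments)  # user-defined placeholder
        # Step 3: report back an observation, tagged with the hostname so the
        # dashboard can slice results by machine
        conn.experiments(EXPERIMENT_ID).observations().create(
            suggestion=suggestion.id,
            value=value,
            metadata={"hostname": socket.gethostname()},
        )
    except Exception:
        # A crashed or diverged evaluation is reported as a failed observation
        # rather than silently dropped
        conn.experiments(EXPERIMENT_ID).observations().create(
            suggestion=suggestion.id,
            failed=True,
            metadata={"hostname": socket.gethostname()},
        )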
SigOpt’s web dashboards visualize the progress of your model’s performance, so you can quickly diagnose potential system failings. If a classifier’s accuracy was 80% with default hyperparameters but stayed flat at 50% after 20 model evaluations, you might have a problem.
Side-by-side model evaluation progress for well-running optimization (left) versus poorly-running optimization (right).
In addition, SigOpt’s web dashboards visualize the number of performance observations from step 3 that have been reported. Our API allows you to report failed performance for any reason, and we include this data in our visualization. You can also use metadata to slice your data by hostname. It’s easy to glance at a chart, see red in the column for host-2, and know that something is wrong.
Side-by-side histogram of performance observations tagged by hostname for well-running optimization (left) versus poorly-running optimization (right).
Problem #2: Custom Hyperparameters
Now, let’s say you’ve got one particular set of hyperparameters that you’re dying to try out because your intuition tells you it’s gonna be great. The problem is, you’ve already started the distributed hyperparameter optimization and you don’t want to shut anything down. At SigOpt, we have designed our API to act as a distributed job scheduler. We manage storing, queueing, and serving hyperparameter configurations, allowing you to run the same optimization code on every node, whether in the sequential or parallel workflow. This centralized design allowed us to easily build a feature for queuing up custom hyperparameter configurations.
From the website, navigate to an experiment’s monitoring dashboard to queue a suggestion. It will automatically be included in the distributed hyperparameter jobs that are sent out via the API, with no code changes necessary on your end. It even works on mobile.
Queue up Suggestions from the desktop or mobile website.
Conclusion
SigOpt’s API for hyperparameter optimization leaves us well-positioned to build exciting features for anyone who wants to perform Bayesian hyperparameter optimization in parallel. We offer an easy-to-use REST API with R, Java and Python clients, and all data that is passed through the API is available on both desktop and mobile websites. We’re constantly building new features and visualizations for monitoring your progress and exploring your data.
More From SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
References
[1] Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexandra Johnson, George Ke. A Stratified Analysis of Bayesian Optimization Methods. 2016 PDF
[2] Alex Chen, Justin Basilico, and Xavier Amatriain. Distributed Neural Networks with GPUs in the AWS Cloud. 2014 Link
Text
We Raised $6.6 Million To Amplify Your Research
By: Scott Clark, CEO.
Our Series A Round Was Led By Andreessen Horowitz
Today is a big day at SigOpt. Since the seed round we secured last year, we’ve continued to build toward our mission to ‘optimize everything,’ and are now helping dozens of companies amplify their research and development and drive business results with our cloud Bayesian optimization platform.
Following the surge of investment in cloud, big data, and analytics, enterprises have begun leveraging their data through machine learning, artificial intelligence, and predictive analytics. SigOpt amplifies model development and process research, efficiently tuning them to maximize the business value of these investments.
In recognition of this traction, we’ve received one of our strongest validations to date, securing a $6.6 million Series A financing round. The round was led by Andreessen Horowitz, who also led our seed round. You can read a post by Martin Casado of Andreessen Horowitz about the need for SigOpt, and how many business endeavors can be framed as an optimization problem here. We also received additional follow-on participation from Data Collective and welcomed new investors: SV Angel, Stanford University, and Blumberg Capital. We’ll use the money to scale our team, continue to execute for our customers, and expand the capabilities of our platform.
The idea for SigOpt came while I was a PhD student at Cornell. I proved it out at Yelp, optimizing the machine learning models that powered their advertising system. SigOpt was founded on the belief that researchers across a variety of industries face the same challenges I did when they need to tune their systems.
Since starting SigOpt, we’ve gained traction within the financial services and consumer products industries - notably companies that can drive an immediate impact on their bottom line through improvements in their modeling and processes. Today, we count Prudential, MillerCoors, Johnson & Johnson, and some of the world’s leading hedge funds as customers, among others.
“SigOpt helps us optimize the financial models in our funds management platform,” said SigOpt customer Max Logunov, PhD, a developer at Optimus Socially Responsible Investments LTD. “The API is convenient and easy to set up. SigOpt gets us better results, optimizing 15 times faster than before. Now, we’re happy to focus on other matters, because SigOpt saves us so much time.”
Make Your Modelers The Hero
The SigOpt optimization platform helps experts across industries amplify their modeling and process optimization initiatives. Our customers include algorithmic traders at hedge funds, advanced risk modelers at large banks and research scientists working to brew tastier beers. By replacing costly trial and error and traditional Design of Experiments (DoE), SigOpt empowers experts to get the most out of their models and processes.
By using SigOpt to improve model performance and reduce the time and resources spent fine tuning, a data scientist or researcher can become a hero on their team, driving significant business results.
Continuing To Invest In Technology Leadership
Our platform was built for seamless adoption. It easily bolts on top of existing machine learning, AI, and predictive analytics pipelines. In developing our technology we’ve sought to anticipate what our customers need and to reinforce our leading position in the market. Since our last funding, we’ve extended our platform to become more:
Flexible: Integrations with Google’s TensorFlow, SAP HANA, R, Java, Python, and scikit-learn enable SigOpt customers to quickly integrate into any machine learning stack and conduct efficient feature parameter and hyperparameter optimization.
Intuitive: We expanded the SigOpt dashboard to include multi-dimensional visualization, collaboration, and improved experiment management. We’ve also launched a number of improvements to our core API, including a complete REST API overhaul.
Powerful: We’ve invested to ensure that SigOpt’s core optimization engine continues to be the most powerful in the world. The SigOpt research engineering team has extended support for parallel optimization across a cluster of machines, and continues to produce world class, peer reviewed research in Bayesian optimization and optimal learning.
“We are thrilled to partner with Scott and the SigOpt team,” said Matt Bornstein, principal at Blumberg Capital. “Machine learning holds massive potential, but it’s complicated – once described as ‘a black box with 500 million knobs.’ SigOpt turns the knobs for you. We think that’s a pretty powerful concept and the company’s customers think so, too.”
The attention we received from investors during this round was humbling and encouraging, and we’re thrilled with our growing team that believes in our vision. We’re also proud of the customer traction and product leadership that we’ve been able to achieve, but we won’t stop here. This round is going to help us execute for our existing customers and grow our business, build our team, and drive product innovation to bring us closer to our vision of optimizing everything. As we say at SigOpt: Stay tuned!
Text
SigOpt In Depth: Intro to Multicriteria Optimization
By: Mike McCourt, Ph.D.
In this post we will discuss the topic of multicriteria optimization, when you need to optimize a model for more than a single metric, and how to use SigOpt to solve these problems.
Here at SigOpt, we help clients to more quickly tune financial models and design industrial processes by providing access to an ensemble of Bayesian Optimization strategies through our REST API or web interface and associated Python, Java and R clients. Our fundamental goal is to find the maximum of a black box function (or metric)---one which takes as input several values (real numbers, integers or categories) and outputs 1 real number to be maximized, with no information about the underlying mechanism by which the inputs interact to produce the output (which is why it is called a “black box”). Tasks as diverse as classifying images and betting on sports require optimization of a suitable black box metric (which serves as a proxy for quality) to achieve the best performance; failure to do so can yield subpar results for even powerful tools such as deep neural networks.
Fortunately, many situations neatly present such a black box function: financial trading problems may leverage backtesting, machine learning problems may use cross-validation, industrial projects may study product quality subject to tolerance. Some situations, however, do not fit as neatly into the black box function setting described above, where a single metric is being optimized. Consider this relatively standard situation of driving on the freeway(1): an increase in speed will cause a decrease in the time until I reach my destination, but may also cause an increase in the cost by using more fuel. How, then, can I choose a speed so as to simultaneously minimize time to the destination and minimize cost of the trip?
Such a problem is a multicriteria optimization and does not fit into the standard framework for black box optimization; applications including portfolio management, cancer treatment and network theory have already benefited from the use of multicriteria optimization. This post will discuss the complexities of this problem and some strategies for manipulating it into a single criterion problem.
Vector-valued functions
\[ \def\a{\alpha} \def\e{\varepsilon} \def\s{\sigma} \def\RR{\mathbb{R}} \def\mC{\mathsf{C}} \def\mK{\mathsf{K}} \def\mI{\mathsf{I}} \def\ggamma{\boldsymbol{\gamma}} \def\kk{\boldsymbol{k}} \def\uu{\boldsymbol{u}} \def\vv{\boldsymbol{v}} \def\ww{\boldsymbol{w}} \def\xx{\boldsymbol{x}} \def\xopt{\xx_{\text{opt}}} \def\yy{\boldsymbol{y}} \def\zz{\boldsymbol{z}} \def\cD{\mathcal{D}} \def\cX{\mathcal{X}} \def\lmle{L_{\text{MLE}}} \def\lmple{L_{\text{MPLE}}} \def\lkv{L_{\text{KV}}} \def\lclv{L_{\text{CLV}}} \def\ppa{\frac{\partial}{\partial\a}} \DeclareMathOperator*{\argmin}{argmin} \]
Single criterion optimization problems are phrased in the following language: \begin{align*} \xopt = \argmin_{\xx\in\Omega} f(\xx), \qquad f:\Omega\to\RR, \end{align*} where $\Omega$ is the set of possible inputs (think real numbers, integers and categories), $\xx$ is an object (just a vector if $\Omega$ consists of only real numbers) describing the inputs and $f$ is the function we hope to minimize. The standard problem is phrased as minimization, though $f$ could instead be maximized by minimizing $-f$. This optimization problem can be efficiently solved using the Bayesian optimization tools in SigOpt.
The situation changes when $f:\Omega\to\RR$ becomes the vector-valued function $g:\Omega\to\RR^k$, that is, the function we hope to optimize now maps inputs $\xx\in\Omega$ to vectors $g(\xx)\in\RR^k$. The reason this is a more complicated situation is that an ordering of vectors in $\RR^k$ does not exist. Any two numbers $\alpha,\beta\in\RR$ must satisfy exactly one of $\alpha<\beta$, $\alpha>\beta$ or $\alpha=\beta$, but such a requirement is not true for vectors $\uu,\vv\in\RR^k$; how would one order the vectors \begin{align*} \uu=\begin{pmatrix}1\\\\2\\\\3\end{pmatrix}, \qquad \vv=\begin{pmatrix}2\\\\1\\\\3\end{pmatrix}, \qquad \ww=\begin{pmatrix}3\\\\2\\\\1\end{pmatrix}? \end{align*} The inability to order these vectors produces ambiguity in any attempt to minimize $g$: if $\uu<\vv$ is ill-defined, how could one ever declare that $\uu=g(\xx)$ is the minimum value? To read in depth about problems of this variety and their solution, see this text or this text.
Multicriteria Example
We can return now to the problem posed earlier about simultaneously trying to minimize transit time and cost while driving on a freeway. In that setting, there is some function $g$ of the form \begin{align*} g(\text{speed}) = \begin{pmatrix}\text{time to destination} \\ \text{cost of trip}\end{pmatrix} \qquad\longleftrightarrow\qquad g(\xx) = \uu = \begin{pmatrix}u_1 \\ u_2\end{pmatrix}. \end{align*} What does $g$ look like? We do not know---that is the nature of a black box problem. We likely know certain attributes of $g$, such as a decrease in speed probably causes an increase in time to destination, but the exact relationship may be difficult to write down. Experimentation has been conducted on this problem to empirically estimate $g$ (see here or here). We use the 1997 data on "Fuel Economy by Speed" hosted at Oak Ridge to generate the figure below.
Figure 1: (left) Mileage data from this report and a model used to make predictions. (right) Relationship between the cost and time required to make a trip, as determined by this model. Using the data from that report, we can create an approximation with which we can predict mileage at speeds that have not yet been observed.
Looking at this red curve, it is apparent that some results could not possibly be considered acceptable minima: the speed of 10 miles/hour requires 10 hours and would cost more than $\$14$, and both of those values can be improved upon by a better choice of speed. Such points, where all components of the objective vector $\uu$ are greater than those of some other observed result $\vv$, are called dominated. Any results which are not dominated are considered Pareto optimal (also called Pareto efficient), and the entire set of such points forms the Pareto frontier, which is displayed for this travel problem in the figure below.
Figure 2: A zoomed-in version of the right half of Figure 1, where the Pareto frontier (manifold of Pareto optimal results) is colored in blue. All other possible outcomes are dominated by at least one point on the Pareto frontier. The two stars represent two speeds which have the same cost; the yellow star’s travel time is greater than the blue star, thus it is dominated.
The blue portion of the curve in this figure contains points which are all considered Pareto optimal, which implies that further decreasing one of the components of the objective would cause an increase in the other. Because we have limited our speed to 75 miles/hour(1), our required time cannot approach 0 (which would happen as we drove increasingly fast(2)), which is why the curve ends where it does. Orange represents speeds which are inferior; the two stars have equal cost but unequal time.
Tradeoffs
It should be clear from Figure 2 why speeds that are not Pareto optimal cannot be considered solutions to the multicriteria optimization problem: there is little reason to drive 38.5 miles/hour and arrive in 2.6 hours (orange star) when one could drive 61.2 miles/hour and arrive one hour earlier at the same cost. For this example, the interface between the orange and blue sections represents the minimum cost to reach the destination; any attempt to decrease the time of transit would result in an increase in the cost of the trip. How then can we decide which of the results on the blue curve are "the solution"?
In truth, there is no single way to decide which of the speeds that represent the blue curve are the solution. Driving 52.4 miles/hour represents the minimum cost, driving 75 miles/hour represents the minimum time, and any speed in between could be considered optimal depending on personal preference. The tradeoffs between these two criteria can either be managed by some supervisory decision maker (the driver of the car in this example) or by merging the multiple criteria into a single criterion and phrasing the problem as a standard optimization problem. Either way, once this Pareto frontier has been identified (or, more likely, approximated), no further refinement of the solution to the multicriteria optimization is possible.
These types of tradeoffs appear in the real world all the time, as shown in this comic strip popular among graduate students; specifically, the image below depicts (in an entirely non-rigorous way) multiple factors competing for the lunch dollars of a graduate student. The point below denoted "Optimal Mediocrity" is dependent on the degree to which "Gross", "Effort" and "Cost" are valued by the decision maker. It may, in fact, be very common for students to more greatly value "Cost" and prefer the Instant Noodles.

Figure 3: The tradeoffs under consideration for a graduate student at lunchtime. Graphic courtesy of www.phdcomics.com.
Linear scalarization
Each research field has its own domain experts that can serve as decision makers, and as such it is difficult to discuss the decision maker strategy in generalities. The method of forming a scalar objective function $f:\Omega\to\RR$ from the vector objective function $g:\Omega\to\RR^k$ is more universal, so we restrict our discussion to this linear scalarization methodology. Scalarization (discussed nicely in chapter 2 of this text) is the process by which some transition function $T:\RR^k\to\RR$ is applied to the multiple objectives from $g$ to rephrase the multicriteria problem as \begin{align*} \xopt = \argmin_{\xx\in\Omega} T(g(\xx)). \end{align*} Unfortunately, all this does is transition the burden of decision making between the components of $g$ from some expert decision maker to the function $T$; choosing an appropriate function $T$ is important because a poor choice can result in solutions to this scalar optimization problem which are not Pareto optimal in the multicriteria problem.
A variety of scalarization strategies have been developed for different circumstances, but the simplest among them is a weighted-sum strategy involving a nonnegative weighted average of the components of $g$: for weights $\gamma_1,\ldots,\gamma_k>0$ such that $\gamma_1+\ldots+\gamma_k=1$, we define \begin{align*} T_\ggamma(g(\xx)) = T_\ggamma\begin{pmatrix}g_1(\xx)\\\\\vdots\\\\g_k(\xx)\end{pmatrix} = \gamma_1 g_1(\xx) + \ldots + \gamma_k g_k(\xx). \end{align*} One benefit of this strategy is that any solution to this problem must be Pareto optimal(3). Another benefit is its easy interpretability: larger $\gamma$ values correspond to components which are more important to minimize. Within the context of an industrial process, it would probably still take an expert to choose an appropriate balance between these $\gamma$ values, but this strategy provides a mechanism for applying certain preferences while retaining the benefit of Pareto optimality.
Referring back to the car driving problem, how should the $\gamma_1$ and $\gamma_2$ values be chosen to scalarize the vector-valued objective? Because $\gamma_1+\gamma_2=1$, we actually need only choose one parameter, which we will just call $\gamma$, to create the scalar objective function \begin{align*} T_\gamma(g(\text{speed})) = \gamma (\text{time to destination}) + (1-\gamma) (\text{cost of trip}). \end{align*} So, what impact does the choice of $\gamma$ have on the objective function? The figure below shows the $T_\gamma(g)$ objective for several choices of $\gamma$, as well as the optimal speed as a function of $\gamma$.
Figure 4: (left) The scalarized form of the objective function for various $\gamma$ values. (right) As explained by the equation above, smaller $\gamma$ values place more emphasis on fuel efficiency and larger values prefer a shorter duration. Because we assume no one will violate traffic laws(1), choosing $\gamma>.91$ will always return 75 miles/hour.
Choosing $\gamma$ large states a preference towards less time, and choosing $\gamma$ small states a preference towards less cost. The actual relationship, as depicted in the right half of Figure 4, is complicated by the fact that the components of the objective $g$ do not have the same units and on the Pareto frontier they are not on the same scale (the topic of scale is mentioned in this post): cost varies between $\$9-\$14$ whereas time varies between 1-2 hours. In a physical sense, even adding values with different units is inconsistent, though we skirt this issue by stating the $T_\gamma(g)$ optimization problem in a fully mathematical (read unphysical) setting. Given that, the role that $\gamma$ plays is very much a function of those units, which is why there is a 10 miles/hour range for $\gamma\in[0, .8]$ and another 10 miles/hour range for $\gamma\in[.8, .9]$. Problems phrased in this way can be readily optimized by tools like SigOpt.
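To make the role of $\gamma$ concrete, below is a small, self-contained Python sketch of the weighted sum scalarization for the driving example; the trip length, gas price and fuel economy curve are toy stand-ins rather than the Oak Ridge data used for the figures, so the numbers will differ from Figure 4, but the mechanics are the same.

import numpy as np

DISTANCE = 100.0   # miles; illustrative trip length (assumption)
GAS_PRICE = 2.50   # dollars per gallon (assumption)

def g(speed):
    """Toy vector-valued objective: (time to destination, cost of trip)."""
    mpg = 32.0 - 0.02 * (speed - 45.0) ** 2   # stand-in fuel economy curve
    return DISTANCE / speed, GAS_PRICE * DISTANCE / mpg

def scalarized(speed, gamma):
    """Weighted sum scalarization T_gamma(g(speed))."""
    time, cost = g(speed)
    return gamma * time + (1.0 - gamma) * cost

# For each gamma, locate the best speed over a fine grid in [10, 75] mph;
# SigOpt would instead search this space with far fewer evaluations.
speeds = np.linspace(10.0, 75.0, 651)
for gamma in (0.1, 0.5, 0.9):
    best = speeds[np.argmin([scalarized(s, gamma) for s in speeds])]
    print("gamma = %.1f -> optimal speed of roughly %.1f mph" % (gamma, best))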
In a perfect world, all the components of a problem would be on exactly the same scale, allowing for an easier prediction of the role that $\ggamma$ plays. Below is a figure which shows the same optimal solution relationship, but for slightly different versions of the $g$ function. We can clearly see that rescaled versions of these functions have a significant impact on the resulting optimum for a given $\gamma$. On the left, we see, perhaps, a more consistent transition as $\gamma$ increases; this is the result of rescaling the objective components to $[0,1]$, which is only possible for this simple problem because we know the exact values of the Pareto frontier. On the right, we see an extreme transition period, caused by converting the currency of the gasoline purchase to Chinese renminbi (multiplying $g_2$ by 6.67 RMB/USD).
Figure 5: (left) Objective values from $g$ are scaled to $[0,1]$. (right) $g_2$ is multiplied by 6.67.
From a purely mathematical perspective, it does not matter what rescaling, if any, is done to this problem: for every value on the Pareto frontier, there is at least one $\gamma$ for which it is the solution of the scalarized problem(4). From a practical standpoint, it can be significant, though, because some expert must make a decision regarding the appropriate $\gamma$, and that decision is much easier in the left half of Figure 5 (where a small change in $\gamma$ produces a small change in optimal speed) than in the right half of Figure 5 (where a small change in $\gamma$ has an unpredictable impact on the optimal speed).
$\epsilon$-constraint scalarization
In some circumstances, it may be the case that one component of $g$, say the $i$th component $g_i$, is more important than the rest; as long as the other $k-1$ components are "satisfactory" only the $g_i$ component needs to be minimized. Mathematically, this becomes a constrained scalar optimization problem, \begin{align*} \xopt = \argmin_{\xx\in\cX} g_i(\xx), \quad\mbox{ subject to }\quad g_1(\xx)\leq\epsilon_1,\;\ldots,\;g_{i-1}(\xx)\leq\epsilon_{i-1},\; g_{i+1}(\xx)\leq\epsilon_{i+1},\;\ldots,\;g_k(\xx)\leq\epsilon_k. \end{align*} This notation is somewhat dense, but can be interpreted more easily in the context of the example above; suppose the driver has only $\$11$ available and must reach the destination at a cost below that. In that setting, $\epsilon_2=11$ and the multicriteria optimization problem becomes the scalar problem \begin{align*} \min_{0\leq\text{speed}\leq75}\; \text{time to destination}(\text{speed}), \qquad\mbox{ subject to }\qquad \text{cost of trip}(\text{speed})\leq 11. \end{align*} Graphically this can be seen in the figure below. Problems phrased in this way can be optimized by tools like SigOpt by taking advantage of reporting “failure” observations when criteria are not met.
Figure 6: (left) If we constrain the cost to less than $\$11$ and concern ourselves only with minimizing the time to destination, the red star represents the solution. (right) If, instead, we simply must arrive in less than 90 minutes and want to incur the minimum cost, the solution is the red star.
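For comparison, here is the analogous sketch of the $\epsilon$-constraint formulation, again using the toy time/cost model from the weighted sum sketch above (an assumption, not the actual data); when the cost constraint is violated, the evaluation is flagged the way SigOpt would see it, as a failed observation.

import numpy as np

DISTANCE, GAS_PRICE, COST_BUDGET = 100.0, 2.50, 11.0   # toy values; the budget plays the role of epsilon_2

def g(speed):
    """Same toy (time, cost) model as in the weighted sum sketch."""
    mpg = 32.0 - 0.02 * (speed - 45.0) ** 2
    return DISTANCE / speed, GAS_PRICE * DISTANCE / mpg

def evaluate(speed):
    """Objective to minimize (time), or None to flag a 'failure' observation
    when the cost constraint is violated."""
    time, cost = g(speed)
    return None if cost > COST_BUDGET else time

# Brute force over a grid for illustration; a Bayesian optimizer would instead
# propose speeds sequentially and receive these values (or failures) back.
speeds = np.linspace(10.0, 75.0, 651)
results = [(s, evaluate(s)) for s in speeds]
feasible = [(s, t) for s, t in results if t is not None]
best_speed, best_time = min(feasible, key=lambda st: st[1])
print("fastest trip under $%.0f: %.1f mph in %.2f hours" % (COST_BUDGET, best_speed, best_time))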
Conclusion
Optimization is what we do at SigOpt, but not all optimization problems are created equally. Multicriteria optimization problems, which involve the simultaneous optimization of multiple objective functions, require additional specification to solve them uniquely. This post shows an example of such a problem, and walks through a strategy for converting it into a standard problem which can be efficiently optimized using the Bayesian optimization tools within SigOpt. Stay tuned for more updates on using SigOpt to tune these problems and others as we continue to expand the capabilities of our optimization platform.
Special thanks
I would like to extend my gratitude to Devon Sigler, a top-tier mathematician and multicriteria expert at the University of Colorado Denver, for his help refining some of the descriptions in this post and clearing up my own understanding of the implications of nonconvexity in weighted sum scalarization.
More from SigOpt
SigOpt's 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
(1) SigOpt recommends following posted safety laws while driving, including driving under the speed limit.
(2) Of course limited by the speed of light, and any associated relativistic implications.
(3) The requirement that the $\gamma$ values be strictly greater than zero is needed to guarantee Pareto optimality. If, instead, only weak Pareto optimality is required, it is possible to allow some $\gamma$ values to be 0.
(4) This is only true if all objective functions are convex and the feasible region is convex. If this is not the case there can be points on the Pareto frontier that no value of gamma will find. This is one of the primary drawbacks for the weighted sum scalarization method. This is, however, not a problem for the $\epsilon$-constraint method---there are $\epsilon$ choices that can be used to find each point on the Pareto frontier.
Text
SigOpt for ML : Bayesian Optimization for Collaborative Filtering with MLlib
By: Ian Dewancker, Research Engineer.
In this post we will show how to tune an MLlib collaborative filtering pipeline using Bayesian optimization via SigOpt. Code examples from this post can be found on our github repo.
Collaborative Filtering
A popular approach when building the basis of a recommendation system is to learn a model that can predict user preferences or product ratings. With an effective predictive model, and enough contextual information about users, online systems can better suggest content or products, keeping customers better engaged and leading to more sales, subscribers or conversions.
Figure 1 : Collaborative Filtering via Low-Rank Matrix Factorization
A common recommender systems model involves using a low-rank factorization of a user-product ratings matrix to predict the ratings of other products for each user [2]. In general, algorithms related to collaborative filtering and recommendation systems will have tunable parameters similar to ones we have discussed in previous posts. In this problem, for example, the regularization term on the user and product factors can be difficult to choose a priori without some trial and error. Using the Bayesian optimization service provided by SigOpt, machine learning engineers and data scientists can quickly tune algorithms and systems to get the most out of models in production.
Show Me The Code
In this example we consider the MovieLens dataset and use the MLlib package within Apache Spark. The code for this example is available in our github repo.
The largest MovieLens dataset ratings matrix has approximately 22 million user ratings for 33,000 movies by 240,000 users. To run this example, we recommend creating a small Spark cluster in EC2 using the spark-ec2 tool provided by the Spark library. We ran this experiment using a 3 machine cluster (1 master, 2 workers) in AWS using the m1.large instance for all nodes. Once the cluster is set up, you can run the provided setup script on the master instance, build the example project, and submit the job to your master Spark instance.
Alternating Least Squares
To solve for the latent user and movie factors, MLlib implements a variant of what is known as quadratically regularized PCA [1]. Intuitively, this optimization problem aims to learn latent factors $X,Y$ that best recreate the ratings matrix $A$, with a regularization penalty coefficient $\lambda$ on the learned factors.
This minimization problem can be solved using a technique known as alternating least squares [1]. A distinct advantage of using this formulation is that it can be easily parallelized into many independent least square problems as outlined in the pseudocode below. Each factor matrix $X,Y$ is randomly initialized and the algorithm alternates between solving for the user factors $X$, holding the movie factors $Y$ constant, then solving for the $Y$ factors, holding $X$ constant. The algorithm takes as input $A$ the ratings matrix, $\lambda$ the regularization term, $k$ the desired rank of the factorization, and $T$ the number of iterations of each alternating step in the minimization. We expose $\lambda$, $k$ and $T$ as tunable parameters to SigOpt.
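Since the pseudocode figure is not reproduced here, the following is a rough, dense-matrix numpy sketch of those alternating updates; MLlib's actual implementation is distributed and fits only the observed ratings, but the alternating structure and the roles of $\lambda$, $k$ and $T$ are the same.

import numpy as np

def als(A, lam, k, T, seed=0):
    """Alternating least squares for the quadratically regularized factorization
    minimizing ||A - X Y^T||_F^2 + lam * (||X||_F^2 + ||Y||_F^2).
    Dense illustration only; MLlib distributes this and uses only observed entries."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.standard_normal((m, k))   # user factors
    Y = rng.standard_normal((n, k))   # movie factors
    I = np.eye(k)
    for _ in range(T):
        # Solve for the user factors X with the movie factors Y held fixed
        X = A @ Y @ np.linalg.inv(Y.T @ Y + lam * I)
        # Solve for the movie factors Y with the user factors X held fixed
        Y = A.T @ X @ np.linalg.inv(X.T @ X + lam * I)
    return X, Y

# Toy usage on a small random ratings matrix; lam, k and T are the parameters
# exposed to SigOpt in this post.
A = np.random.default_rng(1).uniform(1, 5, size=(20, 15))
X, Y = als(A, lam=0.1, k=5, T=10)
print("reconstruction RMSE: %.3f" % np.sqrt(np.mean((A - X @ Y.T) ** 2)))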
The regularization term $\lambda$ is particularly difficult to select optimally as it can drastically change performance of the algorithm. Previous work has attempted to use a Bayesian formulation of this problem to avoid optimizing for this regularization term explicitly [4].
Performance
As an error metric for this example, we used the standard measurement of the root mean square error [3] of the reconstructions on a random subset of nonzero entries from the ratings matrix.
Defining an appropriate error measurement for a recommendation task is critical for achieving success. Many other metrics have been proposed for evaluating recommendation systems and careful selection is required to tune for models that are best for the application at hand. Bayesian optimization methods like SigOpt can be used to tune any underlying metric, or a composite metric of many metrics (like accuracy and training time).
In this example the training, validation and holdout rating entries are randomly sampled non-zero entries from the full ratings matrix $A$ as summarized in the diagram below:
SigOpt tunes the alternating least square algorithm parameters with respect to the root mean squared error of the validation set. We also report the performance on the hold out set as a measure of how well the algorithm generalizes to data it has not seen. We compare parameters tuned using SigOpt against leaving the alternating least square parameters untuned. While the ratings entries for the train, valid and test sets were randomly sampled, they were identical sets for both SigOpt and the untuned comparison.
                 SigOpt            No Tuning (Default MLlib ALS)
Hold out RMSE    0.7864 (-40.7%)   1.3263
Table 1: Comparison of RMSE on the hold out (test) ratings after tuning ALS algorithm
Closing Remarks
Collaborative filtering and recommendation system techniques have the potential to add new levels of personalization and drive user engagement and conversions in online systems. However, it can often be an unintuitive process to tune the parameters governing the machine learning or optimization algorithms involved in building a production recommendation system. SigOpt offers a succinct API with clients in Python, Java, and R to quickly give machine learning engineers and data scientists access to our hosted, proven Bayesian optimization service.
git clone https://github.com/sigopt/sigopt-examples.git
More from SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
References:
[1] Madeleine Udell, Corinne Horn, Reza Zadeh and Stephen Boyd. Generalized Low-Rank Models. Foundations and Trends in Machine Learning. 2016 [PDF]
[2] AMPLab UC Berkeley. Movie Recommendation with MLlib. AMP Camp Hands On Exercises. 2014 [LINK]
[3] Yehuda Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Proceedings of the 14th ACM SIGKDD. 2008 [PDF]
[4] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems. 2008 [PDF]
Text
SigOpt Fundamentals: Breaking Free of the Grid
By: Mike McCourt, Ph.D.
SigOpt is a software as a service optimization platform that accelerates modeling and industrial processes. Scientists across industries face challenging optimization problems. Data scientists building predictive models to classify images, financial analysts building trading models, and industrial engineers building complex systems like airplane wings all need to optimize the performance of a model or process that has numerous input parameters and is expensive to test. In this post, we review the history of some of the tools implemented within SigOpt, and then we discuss the original solution to this black box optimization problem, known as full factorial experiments or grid search. Finally, we compare both naive and intelligent grid-based strategies to the latest advances in Bayesian optimization, delivered via the SigOpt platform.
The Mathematical and Statistical Foundations of SigOpt
The recent advances that power SigOpt have built upon generations of prior research: most substantially, reproducing kernel Hilbert space theory began more than 100 years ago, although one could argue that the theory was not fully developed until roughly 60 years ago (see also this classic text, starting at chapter III.4). The separate development of the topic within spatial statistics can be dated back precisely to 1951, though the formal theory involving Gaussian processes was not developed until at least the mid-1960s, and the connection between the two fields was later realized in 1970. Other parts of our blog have discussed interpreting Gaussian processes (here, here and here) and, for the truly interested reader, chapter 1 of my book provides a more comprehensive retrospective on the development of this theory.
Given this long mathematical and statistical history, it may be somewhat surprising that the earliest well-known instances of the use of Gaussian processes in a Bayesian optimization framework were documented within the USSR by this early 1970s conference report. In fact, the real adoption of Gaussian process-backed optimization (also called kriging metamodel optimization) seems to have only begun subsequent to the 1998 development of expected improvement (discussed more simply in slides I picked up from grad school). Since then, there has been an explosion of research (including outside Gaussian processes) which SigOpt is continually surveying to update our own software and provide the best possible experience for our customers in an easy-to-use API. Our posters at ICML 2016 workshops discuss our hierarchical comparison methodology for simultaneously analyzing multiple optimization techniques within our quality assurance framework.
Grid search as an optimization tool
Of course, prior to these new developments in efficient optimization, researchers still had black box functions that needed to be optimized---planes needed to fly before computers even existed. This raises the interesting question of “How were these metrics maximized before Bayesian optimization existed?” In a word, the answer is slowly. Historically, experts would spend significant time analyzing their models so as to manually identify, for example, the financial investment strategy which is predicted to maximize their portfolio’s value. Each time the strategy was changed, the expert would be forced to reinvest this time, which was a waste of that expertise. This 2004 article belabored the cost of hand-tuning in the context of robotics, and this 2014 retrospective recalls the earlier belief that hand-tuning was required for good numerical linear algebra computational performance.
More formulaic methods for global black box optimization were developed prior to sequential kriging optimization to help remove this burden from the experts. The simplest and most logical among these methods is grid search: try small changes in each of the parameters of the model as defined using a regular grid. For a financial investment scenario, one could imagine a very simple portfolio where 100 units of asset A are purchased and x units of asset B are sold to appropriately hedge the position. Using grid search to determine the value of x which maximized the worst case value of the portfolio would consist of defining a domain for x, a desired number of points T at which to compute the expected value, and then computing the expected value at T equally spaced points within that domain, as shown in the figure below.
Figure 1: (left) The median and inner 90th percentile range for the value of the portfolio. (right) Grid search using T=10 observations to maximize the worst case behavior of the portfolio.
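A short sketch of that one-dimensional grid search appears below; the worst-case portfolio value function is a hypothetical stand-in (a simple two-asset Monte Carlo model), not a real trading model, but the grid search mechanics are what matter.

import numpy as np

def worst_case_value(x, n_scenarios=2000, seed=0):
    """Hypothetical stand-in for the worst-case (5th percentile) value of a
    portfolio that is long 100 units of asset A and short x units of asset B."""
    rng = np.random.default_rng(seed)
    ret_a = rng.normal(0.05, 0.20, n_scenarios)                  # asset A returns
    ret_b = 0.6 * ret_a + rng.normal(0.02, 0.10, n_scenarios)    # correlated asset B returns
    value = 100.0 * (1.0 + ret_a) - x * ret_b                    # short position P&L is -x * ret_b
    return np.percentile(value, 5)

# Grid search: evaluate the metric at T equally spaced points in the chosen domain
T = 10
grid = np.linspace(0.0, 100.0, T)
values = [worst_case_value(x) for x in grid]
best_x = grid[int(np.argmax(values))]
print("best hedge on the grid: sell %.1f units of B (worst-case value %.2f)" % (best_x, max(values)))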
Grid search does an acceptable job identifying the maximum of the red curve using T=10 investment predictions (each of which costs some amount of time for an expert or model to generate). But, unfortunately, as more dimensions (or parameters or degrees of freedom) are introduced to model a more complex process, the resulting metric becomes too complicated to feasibly implement a grid search. Sampling with a grid using T points in each of d dimensions requires T^d total points, which quickly becomes impracticable as suggested in the figure below.
Figure 2: (from left to right) For 1, 2 and 3 dimensions, a grid search with T=10 points per dimension would require 10, 100 and 1000 total metric evaluations. Quickly, this becomes too costly for all but the simplest of underlying models.
Modifications to grid search
We have identified grid search as a classical strategy for maximizing black box functions, but also suggested that for larger problems it is unreasonably costly. Below, we discuss ways in which grid search can be changed to allow it to work in higher dimensional settings.
Random subset grid search
The simplest way to simulate a 10000 point grid search with only 100 samples would be to create the full 10000 point grid and only sample from a subset of those points. To do so, we will randomly sample, without replacement, 100 of the 10000 points and call that a random subset grid search. This is similar to a standard random search(1), except rather than sampling from the continuous uniform distribution we will sample from the discrete distribution defined by the full grid. The figure below shows the benefit of randomly sampling from a dense grid when the dense grid contains too many points to be feasibly computed.
Figure 3: (left) The full grid search with T=10 fills very little of the space in the first 100 points. (right) 100 points are randomly chosen from the grid, which provides a better spread of values throughout the domain.
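A sketch of this subsampling is below; the 4D usage example mirrors the 10000 point grid mentioned above (10 points in each of 4 dimensions), with the black box metric left out since any function of the sampled points could be plugged in.

import itertools
import numpy as np

def random_subset_grid(bounds, points_per_dim=10, n_samples=100, seed=0):
    """Sample n_samples points, without replacement, from the full tensor
    product grid with points_per_dim points along each dimension."""
    rng = np.random.default_rng(seed)
    axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in bounds]
    full_grid = np.array(list(itertools.product(*axes)))
    idx = rng.choice(len(full_grid), size=n_samples, replace=False)
    return full_grid[idx]

# Choose 100 of the 10**4 = 10000 full-grid points in a 4D unit cube
samples = random_subset_grid([(0.0, 1.0)] * 4, points_per_dim=10, n_samples=100)
print(samples.shape)   # (100, 4)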
Iterative 1D grid slices
This concept fixes all dimensions except one, and then conducts a 1D grid search in that dimension. The best result for that dimension is then fixed and a 1D grid search is conducted over a different dimension. We continue to cycle through 1D grid searches one dimension at a time until 100 function evaluations have been exhausted. The figure below shows one possible version of this process.
Figure 4: (left) For this example function, the maximum in dark blue is not near the first 100 points of the full grid search. (right) 1D slices through the domain provide a faster route to the maximum of this sample function.(2)
There are some design decisions which can have an important (but unpredictable) impact on the outcome of this marginal optimization strategy. The first decision that should be made is in what order the dimensions should be marginally optimized, or even if they should be optimized in a fixed (rather than adaptive, or even random) order. The next decision involves the T value to be used for each dimension’s 1D grid search, and whether that should be constant across all dimensions. Our experiments below involve T=10 in all dimensions and the initial values for all dimensions are chosen randomly from the 10000 point grid. The order through the dimensions is also chosen randomly.
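A sketch of the iterative 1D slice strategy, including the random starting point and random dimension ordering mentioned above, might look like the following; the toy objective at the bottom is only there to make the example runnable.

import numpy as np

def iterative_1d_slices(objective, bounds, points_per_dim=10, budget=100, seed=0):
    """Cycle through the dimensions, running a 1D grid search in one dimension
    at a time with the other coordinates held at their current best values,
    until the evaluation budget is exhausted. Maximizes the objective."""
    rng = np.random.default_rng(seed)
    axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in bounds]
    x = np.array([rng.choice(axis) for axis in axes])   # random start from the full grid
    best_val = objective(x)
    evals = 1
    order = rng.permutation(len(bounds))                # random dimension ordering
    while evals < budget:
        for dim in order:
            for candidate in axes[dim]:
                if evals >= budget:
                    break
                trial = x.copy()
                trial[dim] = candidate
                val = objective(trial)
                evals += 1
                if val > best_val:
                    best_val, x = val, trial
    return x, best_val

# Toy 4D function to maximize, just to exercise the routine
f = lambda x: -np.sum((x - 0.3) ** 2)
x_best, val_best = iterative_1d_slices(f, [(0.0, 1.0)] * 4)
print(x_best, val_best)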
Low discrepancy grid search
The computational limitations of uniform grids (or, more generally, tensor product grids) have long been a thorn in the side of the numerical integration community. Quasi-Monte Carlo methods have been developed to facilitate integration in higher dimensions with faster convergence, closer to standard quadrature methods than true Monte Carlo methods (which resemble random search). These quasi-Monte Carlo methods require the development of low discrepancy points (such as the Sobol and Halton sequences, latin hypercube sampling and lattice rules) which are very effective at sampling in a high dimensional space, which is shown in the figure below.
Figure 5: (left) The standard uniform grid search with T=5 has difficulty recognizing certain relationships between dimensions as evidenced by the inconsistent gaps between points when viewed at certain angles. (right) In contrast, the (2, 3, 5) Halton sequence(3) with the same 125 points is able to keep a more consistent density at all angles, which suggests a better understanding of the relationship between dimensions.
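For reference, a compact implementation of the (2, 3, 5) Halton sequence used in that figure might look like the following; as noted in footnote (3), the point at the origin is skipped.

import numpy as np

def halton_sequence(n, primes=(2, 3, 5)):
    """First n Halton points with the given prime bases (one base per
    dimension), skipping the initial point at the origin."""
    def radical_inverse(i, base):
        inv, f = 0.0, 1.0 / base
        while i > 0:
            inv += f * (i % base)
            i //= base
            f /= base
        return inv
    return np.array([[radical_inverse(i, b) for b in primes]
                     for i in range(1, n + 1)])

points = halton_sequence(125)   # 125 low discrepancy points in the unit cube
print(points.shape)             # (125, 3)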
Numerical experimentation
Here we try maximizing a number of 4D functions which are available in the SigOpt evalset(4). We compare the performance on a T=10 full grid (with a total of 10000 function evaluations) to each of the modified versions described above using only 100 function evaluations; we also provide a comparison to SigOpt’s optimization service. The figures below contain the best seen traces (described in an earlier blog post) which depict the best function value observed after a specific number of function evaluations.
Figure 6: For these two functions, we see roughly similar behavior for the random subsample and Halton low discrepancy strategies. The iterative 1D slice strategy performs better than them, including potentially reaching the full grid performance, as seen on the left. (left) The Hartmann4 function. (right) The Shekel (m=7) function.
Figure 7: All three modifications have significantly different behavior for these two functions. On the left, the random subsampling is distinctly better than Halton, whereas on the right the Halton method can surpass even the full 10000 point grid search solution. (left) The negative log10 of the Gear function. (right) The Michalewicz function.
Conclusion
In higher dimensional problems, conducting a full grid search is infeasible; we have proposed three modifications to that simple approach to parameter tuning which allow for some approximation to grid search using fewer model evaluations. None of these three strategies are always the best, although the iterative 1D slice strategy seems most likely to reproduce the quality of the full grid using (in the experiments above) only 1% of the full grid evaluations. Take caution when generalizing these results, since these are only 4 examples out of the entire possible space of functions that could be considered, but the graphs above suggest these modifications could allow for viable results using many fewer evaluations than the full grid.
Another important point to be reminded of here is the superior performance of Bayesian optimization through SigOpt on each of these functions. Using that same limit of 100 function evaluations, these advanced mathematical methods are able to outperform not only all of the modified strategies but even the full grid search requiring 10000 function evaluations. For our finance customers, this improvement could correspond to finding a more profitable or less risky trading/investment strategy faster. For our machine learning customers, this could correspond to models with higher predictive capacity or more robustness with less compute cost. For any clients who need the best possible performance from their models with as little training time as possible, SigOpt provides a RESTful API, accessible through our Python, Java or R clients, which conducts this optimization using an ensemble of Bayesian methods, so that domain experts can focus their time and energy utilizing their expertise rather than tuning models.
SigOpt is offering a free 30 day trial to TensorFlow users. To get started, download the code:
git clone https://github.com/sigopt/sigopt-examples.git
More from SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
(1) We reference our blog post about random search because we believe that to be the more common definition of random search within model tuning literature. The Wikipedia entry more closely hews towards a randomized trust region method, which can be expanded to a simulated annealing strategy to deal with nonconvex functions that are cheap to evaluate or to conduct MCMC sampling from a multimodal function.
(2) For simplicity in generating this plot, we ignore the fact that during each 1D slice, some number of desired function values may already be present from previous slices. Correctly storing and managing those values would reduce the cost of this method.
(3) Some literature uses Halton sequences starting at the origin, but this example skipped that first point. The (2, 3, 5) Halton sequence refers to the prime numbers used to generate the sequence in each of the dimensions.
(4) https://www.github.com/sigopt/evalset is a Python implementation of real-valued functions that SigOpt uses to conduct regression testing as we try to deploy new optimization strategies. Some of these functions were used for testing in our ICML AutoML submission. The functions provided in the evalset are meant to be minimized, so these experiments instead maximize the inverse of the functions.
Text
Much Deeper, Much Faster: Deep Neural Network Optimization with SigOpt and Nervana Cloud
By: Scott Clark, Ian Dewancker, and Sathish Nagappan
Tools like neon, Caffe, Theano, and TensorFlow make it easier than ever to build custom neural networks and to reproduce groundbreaking research. Current advancements in cloud-based platforms like the Nervana Cloud enable practitioners to seamlessly build, train, and deploy these powerful methods. Finding the best configurations of these deep nets and efficiently tuning their parameters, however, remains one of the most limiting aspects in realizing the benefit of these techniques.
“...we still don't really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”
Xavier Amatriain VP Engineering at Quora (former Director of Research at Netflix)
Bayesian optimization represents an approach to solving the problem of parameter tuning in a robust way. SigOpt was designed to help solve this problem directly by providing a simple, scalable, and stable ensemble of these methods. SigOpt has been shown to outperform standard methods of model tuning like random and grid search as well as open source global optimization alternatives across a wide variety of machine learning models, including neural networks.
The Nervana Cloud is built to be an underlying deep learning infrastructure and democratizes access to the tools and infrastructure necessary to create neural nets. It obviates the need to manage infrastructure, facilitates data ingestion from a variety of sources, aids data exploration, is powered by the fastest deep learning framework, and can seamlessly be deployed in production applications.
By combining the ease of building, training, and deploying neural networks with Nervana Cloud with the efficient tuning of SigOpt, the promise of easily accessible, production quality deep learning is now a reality for everyone. In this post we will detail how to use SigOpt with the Nervana Cloud and show results on how SigOpt and Nervana are able to reproduce, and beat, the state of the art performance in two papers.
Figure 1: The SigOpt Optimization Loop. SigOpt bolts on top of machine learning systems like the Nervana Cloud via a simple REST API. First, SigOpt suggests parameter configurations. These configurations are evaluated on the Nervana Cloud. The observed results are then reported back to SigOpt and the process is repeated, converging on an optimal configuration.
How does Bayesian optimization work?
Bayesian optimization is fundamentally about making the best possible decision of what parameter configuration to evaluate next for a model, given an objective to maximize and previous observations about how historical configurations have performed. Inherently it is making a tradeoff between exploration and exploitation. Exploration learns more about the model, the length scales over which individual parameters vary, and how they combine to influence the overall objective. Exploitation uses the historical knowledge already gained from previous observations to suggest the best possible next configuration, maximizing the expected result. By efficiently and automatically making these tradeoffs, Bayesian optimization techniques can quickly find the global optima of difficult-to-optimize problems, often much faster than traditional methods like local search or brute force methods like grid or random search that ignore historical information.
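As a small illustration of that exploration/exploitation tradeoff (and not a description of SigOpt's internals), here is a sketch of a Bayesian optimization loop built on the expected improvement acquisition function, using scikit-learn's Gaussian process regressor as the surrogate model.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, f_best, xi=0.01):
    """Expected improvement over the best observed value f_best, for a
    maximization problem, under the Gaussian process posterior."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy 1D loop: fit a GP to the observations so far, then evaluate the candidate
# that maximizes expected improvement, and repeat.
f = lambda x: np.sin(3 * x) - x ** 2          # black box metric to maximize
X = np.array([[0.1], [0.5], [0.9]])           # initial observations
y = f(X).ravel()
candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
for _ in range(10):
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)  # alpha adds jitter
    ei = expected_improvement(candidates, gp, y.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).ravel())
print("best observed: x = %.3f, f = %.3f" % (X[np.argmax(y)][0], y.max()))

SigOpt's ensemble goes well beyond a single acquisition function, but the suggest, evaluate, and report structure is the same.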
There are many open source packages devoted to providing interfaces to this research, including MOE, Spearmint, SMAC, and Hyperopt. In practice these tools usually implement a single Bayesian optimization approach and are often too infrastructurally brittle to be deployed generically in production environments, even after the substantial administration required to get them up and running. SigOpt provides an ensemble of state-of-the-art Bayesian optimization research behind a simple, parallel-friendly API, allowing users to get the best of the research without any of the overhead. More information about the underlying optimization methods in this post can be found on the SigOpt research page.
Bayesian optimization of deep neural networks
Deep neural networks (DNNs) have many hyperparameters to tune, including learning rate, weight decay, momentum, weight initializations, number of activations, number of layers, and batch size, among many others. Even a simple neural network can have dozens of parameters (Figure 2), and often these parameters are continuous variables with only loose guidelines on numerical ranges. Tuning all of these knobs has historically been more an art than a science. Even small changes can greatly impact whether or not a model converges and produces good results. Further complicating the problem, many of these parameters have coupled effects. This leads to much wasted time on trial and error, as each configuration of parameters must be tried and iterated on based on results. Training DNNs can be extremely time-consuming given the volume of data and compute density required, which means that Bayesian optimization can be especially helpful in tuning DNNs.
Figure 2: A visualization of a simple neural network for classifying binary data in a 2D space that has 22 different tunable parameters. Finding the best configuration via an exhaustive grid search would be computationally intractable.
Using SigOpt and Nervana Cloud to train DNNs
SigOpt and Nervana Cloud work well together to help data scientists produce high-accuracy trained DNN solutions quickly (Figure 1). First, the data scientist selects a deep learning model architecture from Nervana’s state-of-the-art model library. The particular starting model will depend on the problem to be solved. For example, Fast Region-based Convolutional Neural Networks (Fast R-CNN) are well-suited for detecting object locations in images, while Deep Residual Nets are better at classification. Once the data scientist has selected the model architecture to start with, they customize the model script based on various factors such as the specific problem in question, the format of the input data, etc. Finally, the model script is updated to support hyperparameter suggestions coming from the SigOpt REST API rather than from manual inputs by the data scientist or default values from the model.
After each suggested parameter configuration is received from SigOpt, the model script is submitted to the Nervana Cloud for training and evaluation. Nervana Cloud has been optimized down to the silicon to handle the most complex deep learning training at scale, which has enabled the Nervana Cloud to achieve training speeds 10x faster than conventional GPU-based systems and frameworks. So each training run takes place at world-class speed.
Through SigOpt integration, the model script conducts many training evaluations in succession, each using a unique suggested configuration of hyperparameters. After an evaluation has completed, the observed accuracy (or any other objective metric) of the trained model is measured and reported back to SigOpt. Based on this result, SigOpt suggests a new set of hyperparameters to be evaluated next. The process continues, and ultimately, the system converges on the best combination of hyperparameters. SigOpt finds good parameter configurations 10x faster than traditional approaches like grid search, allowing experts to get to the best version of their models in less time and with less compute overhead.
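To make this loop concrete, here is a minimal sketch of the suggest / evaluate / report cycle using the SigOpt Python client; the `train_and_evaluate_on_nervana` helper and the parameter list shown are hypothetical placeholders rather than the exact script used for these experiments.

```python
# A minimal sketch of the SigOpt optimization loop, assuming the SigOpt Python
# client; the training helper and parameter list below are illustrative only.
from sigopt import Connection

conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")

experiment = conn.experiments().create(
    name="All-CNN on CIFAR-10",
    parameters=[
        {"name": "log_learning_rate", "type": "double", "bounds": {"min": -3.0, "max": -0.3}},
        {"name": "momentum_coef", "type": "double", "bounds": {"min": 0.001, "max": 0.999}},
        {"name": "epochs", "type": "int", "bounds": {"min": 50, "max": 500}},
    ],
)

def train_and_evaluate_on_nervana(assignments):
    # Hypothetical helper: submit the neon model script to the Nervana Cloud
    # with these hyperparameters and return the validation accuracy.
    raise NotImplementedError

for _ in range(90):  # roughly 10x the number of tuned parameters
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = train_and_evaluate_on_nervana(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=accuracy,
    )
```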
The combination of Nervana Cloud and SigOpt enables the data scientist to achieve a level of accuracy much faster and in far fewer steps than with standard methods, and without the need for manual expert tuning.
Convolutional Neural Network Error Reduction
By combining SigOpt and neon on the Nervana Cloud we are able to find better parameter configurations more efficiently than standard methods and expertly published results on an all convolutional neural network. These architectures are growing in popularity and consist solely of convolutional layers. Notably, pooling and affine layers are supplanted with strided convolutions and one-by-one filter convolutions, respectively. All convolutional nets allow for learned spatial sampling, dense labeling, and image generation, among other applications.
Convolutional neural networks contain many tunable parameters that affect performance, including learning rates, epochs, stochastic gradient descent parameters, and more. Finding the best configuration for these parameters is extremely non-intuitive and can often be very dataset dependent. In practice the configurations published or used in production are often found via an expertly refined grid or random search. Bayesian optimization represents a way to get to better configurations in fewer evaluations, without requiring expert administration.
Neon is equipped with an extensive model zoo, including the all convolutional neural network from Springenberg et al. We compared the expert fine tuning of this network on the CIFAR-10 dataset to SigOpt and to random search, a common non-Bayesian hyperparameter tuning strategy. SigOpt was able to achieve better results than the standard method while also requiring fewer total iterations. In addition, it was able to improve upon the expertly tuned results from the paper, reducing the relative error rate by 1.6%.
Figure 3: The best found trace of SigOpt and random search. The line represents the accuracy on the validation set achieved by the best performing configuration from each optimization strategy after a number of total evaluations. SigOpt is usually able to find good optima after a number of evaluations equal to 10 to 20 times the number of parameters being tuned. In this case we tuned 9 hyperparameters (1).
Method | Accuracy Achieved | Required Evaluations | Error Reduction vs. Expert Baseline
Random Search | 0.6086 | 121 | N/A
SigOpt | 0.9011 | 86 | 1.6%
Expert Tuning | 0.8995 (2) | unknown | -
Table 1: A comparison of various hyperparameter tuning methods for tuning an all convolutional neural net. SigOpt was able to get better results than the standard method and the expert, while also requiring far fewer total evaluations than the standard method, saving expert and computational time while also producing a better model.
By using the ensemble of Bayesian optimization methods provided by SigOpt we were able to find a better result faster than the standard random search approach, while also improving the expert result without any expert time required.
15% Deep Residual Net ImageNet Error Reduction
These results are not limited to only neural networks with specific architectures, but can be extended to any method. We show that using SigOpt and the Nervana Cloud we were able to reduce the relative error in a deep residual network (ResNet) as well. These methods currently hold the crown in the annual ImageNet competition, achieving 3.57% error. These networks surpass even humans in image classification in many cases. Depth has been shown to be an extremely important attribute in the quality of neural networks. Increasing depth naively, however, leads to a host of optimization challenges. ResNets stem from the residual learning framework. The authors argue that learning perturbations from the identity function is easier than learning an unreferenced function. Their final network is composed of 152 layers and can take up to a few weeks to train.
In this example we took the deep residual network from He et al. and compared the expert fine tuning to SigOpt, using the implementation in neon’s model zoo on the CIFAR-10 dataset. Random search was omitted from this example due to time constraints; these networks typically take a very long time to evaluate, which highlights the need for intelligent hyperparameter search methods. SigOpt was able to improve upon the expertly tuned results from the paper, reducing the relative error rate by 15%, while not requiring any of the expert time usually spent on fine tuning.
Method | Accuracy Achieved | Required Evaluations | Error Reduction vs. Expert Baseline
SigOpt | 0.9436 | 126 | 15%
Expert Tuning | 0.9339 (3) | unknown | -
Table 2: SigOpt was able to reduce the relative error rate by 15% compared to an expertly tuned baseline by tuning 9 hyperparameters in the model(4).
Figure 4: SigOpt was able to reduce the relative error rate by 15% compared to an expertly tuned baseline, and it was able to do so exponentially faster (after 130 iterations) than standard methods like grid search.
Conclusion
Combining Nervana Cloud with SigOpt allows users to train and tune neural networks a combined 100x faster than standard methods while also producing better results.
The code from these examples can be found here. You can sign up for a free SigOpt trial, and can contact Nervana about accessing the Nervana Cloud by emailing [email protected].
Footnotes
[1]: Parameters optimized in All CNN experiment:
epochs : number of passes over dataset during SGD - int [50, 500]
log(learning_rate) : step size in SGD - double [-3.0, -0.3]
log(weight_decay) : weight decay in SGD - double [-3.0, 0.0]
gaussian_scale : standard dev of weight initialization - double [0.01, 0.5]
momentum_coef : momentum term in SGD - double [0.001, 0.999]
momentum_step_change : multiplicative factor by which momentum is decreased - double [0.001, 0.999]
momentum_step_schedule_start : epoch to start momentum decay - double [50,300]
momentum_step_schedule_step_width : # of epochs between momentum decays - int [5,100]
momentum_step_schedule_steps : how many momentum decays to occur - int [1,20]
[2]: Accuracy observed on the Nervana Cloud using the hyperparameters published in the paper.
[3]: The accuracy reported in the paper.
[4]: Parameters optimized in the DeepRes experiment :
depth : number of layers in net - int [1, 20]
all the same parameters and domains as All CNN(1)
Winning on Wall Street: Tuning Trading Models with Bayesian Optimization
By: Mike McCourt, Ph.D.
SigOpt provides customers the opportunity to build better machine learning and financial models by providing users a path to efficiently maximizing key metrics which define their success. In this post we demonstrate the relevance of model tuning on a basic prediction strategy for investing in bond futures.
The First Piece: A Linear Model
Many factors can be used to build a financial trading model of an asset’s price/value, but there are generally three agreed upon components that are of most significance: correlations with other assets, the trend of the asset we wish to model, and external economic factors such as government reports. Our goal in this blog is to create a model of the US 10 year bond future contract price $Y_{10}$ using only correlation data with the 2 year $Y_2$ and 5 year $Y_5$ bond future contract prices; these prices are recorded at the end of each trading day $t$ and we hope to make predictions about the next trading day $t+1$. There are a number of other factors we do not consider here that may be relevant to the contract price, but our goal initially is to create as simple a model as possible and demonstrate the benefit that can be gained by tuning it.
To that end, we start with the absolute simplest model possible: the linear model
\[\hat{Y}_{10}(t+1) = a + bY_2(t) + cY_5(t);\]
note that the use of the hat on $\hat{Y}_{10}$ reminds us that it is a prediction and not the true value. The numbers a, b and c are constants which define the degree to which $Y_2$ and $Y_5$ influence our predictions. This model should be interpreted as “Tomorrow’s 10 year contract price prediction $\hat{Y}_{10}(t+1)$ is a (weighted) summation of today’s 2 year $Y_2(t)$ and 5 year $Y_5(t)$ prices, plus a constant term which serves as a base”.
We can fit this model (use a least squares solver to determine the best $a$, $b$ and $c$ values) to historical data from 01/05/2009 - 03/21/2016(0). It turns out that even this simple model is somewhat capable of accurate predictions thanks to the strong correlation between contract prices. The figure below shows the outcome of this linear model.
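As a concrete illustration of this fitting step, a minimal sketch with NumPy (array names are illustrative, not the actual analysis code) might look like:

```python
import numpy as np

# y2, y5, y10 are aligned arrays of daily 2, 5 and 10 year contract prices
# (illustrative names); we regress tomorrow's 10 year price on today's prices.
def fit_linear_model(y2, y5, y10):
    X = np.column_stack([np.ones(len(y2) - 1), y2[:-1], y5[:-1]])  # [1, Y2(t), Y5(t)]
    target = y10[1:]                                               # Y10(t+1)
    (a, b, c), *_ = np.linalg.lstsq(X, target, rcond=None)
    return a, b, c

def predict_next(a, b, c, y2_today, y5_today):
    return a + b * y2_today + c * y5_today
```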
Figure 1: (left) Predictions from the linear model for tomorrow’s 10 year bond contract price given today’s 2 and 5 year prices(0) are represented by the transparent blue plane. (right) Predictions for the US 10 year price given Germany’s and Britain’s. The circles represent actual data which is colored to represent distance from the plane (blue is closest, red is furthest).
The left component of this figure shows a clear relationship between the input values $Y_2$ and $Y_5$, as can be recognized graphically by the proximity of the observed results (the circles) to the prediction (the blue plane). The right component presents a contrast by instead trying to predict $Y_{10}$ using Germany’s and Britain’s 10 year bond future contract prices; as we can see, there is a much greater distance from the circles to the plane, suggesting a much lower predictive capacity. Beyond this qualitative analysis, there are rigorous statistical tests available for quantifying how well or poorly the predictions match the data.
The Structure of the Trading Model
Given that we have identified the 2 year and 5 year contract price as having a strong correlation to the 10 year price, we want to build a model that can use the knowledge of this correlation to profitably buy or sell 10 year contracts. Our strategy for doing so centers around a standard concept: our model should change over time(1). We choose to fit the model using only the most recent m days, using that information to determine the $a$, $b$ and $c$ values. We then consider all of the most recent n days of contract prices (rather than only today) to make a prediction for tomorrow. We also introduce a decay factor, $\lambda$, to increase the impact of more recent data. This changes our models in two ways:
the values $a$, $b$ and $c$ which represent the correlations are replaced with $a(t;m)$, $b(t;m)$ and $c(t;m)$ because they can change over time, and
the values $Y_2(t)$ and $Y_5(t)$ are replaced with $\bar{Y}_2(t;\lambda,n)$ and $\bar{Y}_5(t;\lambda,n)$ to denote the inclusion of data from previous days.
This changes the model from its simple form above to the similar but more complicated form: \[\hat{Y}_{10}(t+1) = a(t;m) +b(t;m)\bar{Y}_2(t;\lambda,n) + c(t;m)\bar{Y}_5(t;\lambda,n).\] $\bar{Y}_2(t;\lambda,n)$ and $\bar{Y}_5(t;\lambda,n)$ will determine how the information from earlier days plays a role in forming the prediction $\hat{Y}_{10}(t+1)$. Earlier data already plays an indirect role in establishing the coefficients $a(t;m)$, $b(t;m)$ and $c(t;m)$, but we are trying to also provide a direct path to asserting influence.
There are many ways to construct $\bar{Y}_2(t;\lambda,n)$ and $\bar{Y}_5(t;\lambda,n)$; for this post we use the simple strategy of a weighted average of earlier values with an exponentially decaying weight as we walk further back in time: \[\bar{Y}_2(t;\lambda,n)=\frac{\sum_{i=0}^n Y_2(t-i)e^{-\lambda i}}{\sum_{i=0}^n e^{-\lambda i}}.\] As the decay factor $\lambda\geq 0$ grows, the recent history is much more heavily weighted. If $\lambda=0$ we recover an unweighted average with all of the most recent $n$ values treated as equally important. The coefficients $a(t;m)$, $b(t;m)$ and $c(t;m)$ which measure the correlations from the previous $m$ days are still determined using a least squares fit.
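A minimal sketch of this weighted average (NumPy, illustrative names; not the production implementation):

```python
import numpy as np

def decayed_average(prices, t, lam, n):
    # Exponentially weighted average of prices[t], prices[t-1], ..., prices[t-n];
    # larger lam weights recent days more heavily, lam = 0 gives a plain average.
    lags = np.arange(n + 1)
    weights = np.exp(-lam * lags)
    values = prices[t - lags]
    return np.sum(weights * values) / np.sum(weights)
```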
Trading Decisions Utilizing this Trading Model
The $\hat{Y}_{10}(t+1)$ predictions use the knowledge of the past to make predictions about the future. For the same reason that we limit the influence of the past to the most recent m days, we limit the viability of our coefficients $a(t;m)$, $b(t;m)$ and $c(t;m)$ to only the nearest $p$ days in the future. This prevents us from trying to make predictions using stale data, but also requires the recomputation of the coefficients every $p$ days. The figure below shows the role of these $n$, $m$ and $p$ values.
Figure 2: This timeline shows how the different regions of time are defined.
At the end of day $t$ we know the true value of $Y_{10}(t)$ and can use the model to predict the value $\hat{Y}_{10}(t+1)$. If $\hat{Y}_{10}(t+1)>Y_{10}(t)$, we should consider a long position, and if the predicted value is less than today’s value we prefer a short position; we assume all positions are closed at the end of the trading day(1). We introduce a trading threshold $\sigma\geq0$ so that no position is taken on day $t+1$ unless $|\hat{Y}_{10}(t+1) -Y_{10}(t)|/ \hat{Y}_{10}(t+1)>\sigma$. This prevents our trading strategy from reacting to situations where $\hat{Y}_{10}(t+1)\approx Y_{10}(t)$ and from making trades based on predictions of incredibly small changes (where no change at all may fall within the predictive interval).
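The resulting trading rule is then a simple threshold comparison; a sketch (illustrative only):

```python
def trading_decision(y10_today, y10_predicted, sigma):
    # Return 'long', 'short' or 'no trade' for day t+1; positions close at day's end.
    relative_move = abs(y10_predicted - y10_today) / y10_predicted
    if relative_move <= sigma:
        return "no trade"  # predicted change is too small to act on
    return "long" if y10_predicted > y10_today else "short"
```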
Tuning the Trading Model
While the design of this trading model is logically consistent (tomorrow should look more like today than 67 days ago) it introduces new complexities. In particular, it leaves open the question of how far in the past the contract values should be studied to determine the model coefficients, and how far in the future we can make predictions without resetting those values. These choices are not fixed; we are free to use the last 30, 35, 77, or 1254 days to determine the correlations and we are free to make predictions for the next 1, 5, 18, or 543 days. Moreover, these choices will impact our ability to make effective predictions: looking too far in the past includes outdated data, but failing to look far enough in the past can yield bad correlation estimates. As such, these free parameters in our model must be chosen in some intelligent way.
To optimally choose these parameters we must define a key metric which serves as a proxy for the quality of the model. In this setting, we use data from January 2009 to April 2014 as a sort of validation data set, whereby the actual profit per trade is determined using known historical data. By choosing the free parameters
$n$ - number of past days over which we form $\bar{Y}_2$ and $\bar{Y}_5$,
$\lambda$ - decay rate determining the relative impact between subsequent days in forming $\bar{Y}_2$ and $\bar{Y}_5$,
$m$ - number of past days used to determine $a$, $b$ and $c$,
$p$ - number of future days for which the $a$, $b$ and $c$ values are considered valid before requiring a refitting, and
$\sigma$ - trading threshold for which we choose not to trade if the predicted difference is too small
to maximize this profit per trade, we can create a trading strategy which is, in some sense, designed to maximize that profit. Of course, the question then becomes “How can we maximize the profit per trade?”
SigOpt provides an efficient optimization tool for exactly these kind of black box problems where the interaction of $n$, $\lambda$, $m$, $p$, $\sigma$ and profit is complicated and unintuitive. Furthermore, the cost of finding the coefficients $a(t;m)$, $b(t;m)$ and $c(t;m)$, while relatively cheap here, could be much more expensive in less simple settings. The ensemble of Bayesian optimization strategies behind SigOpt can efficiently conduct the optimization in these contexts.
The Impact of Tuning on Profitability
Even with a relatively simple design, models must be appropriately tuned in order to behave well in practice. To demonstrate this point, we analyze the profit per trade for different optimization mechanisms and compare them to the defaults suggested by some experts. For simplicity, we only allow the purchase or sale of one bond contract, although one could easily imagine a more complicated strategy with larger positions based on the model’s predictions.
One important component when tuning parameters is the domain which they span. Setting these bounds inappropriately can significantly cripple optimization methods like exhaustive search by forcing them to search large regions of parameter space with little benefit. We use the following domains for the parameters in this example: $n\in[200,400]$, $\lambda\in[0,4]$, $m\in[1,20]$, $p\in[4,20]$, $\sigma\in[0,.0078]$. The value $p$ was set to be at least 4 to require some amount of prediction into the future (beyond simply tomorrow), and $\sigma$ was capped at .0078 to compel some number of trades (.0078 was the 95th percentile of magnitude change over the data set).
The model tuning progress, as a function of the number of parameter configurations tested on the model, is shown below for SigOpt as well as two common, simpler alternatives: random search, and a partial grid search we refer to as a parameter sweep, in which each parameter is optimized sequentially over a 1D grid while the other parameters are held fixed, cumulatively, at the best values found in the previous 1D grid searches (a sketch of this procedure appears below).
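For concreteness, here is a minimal sketch of that sequential 1D grid procedure; the objective function and grid resolution are illustrative assumptions, not the exact benchmark code.

```python
import numpy as np

def parameter_sweep(objective, domains, points_per_dim=10):
    # Optimize each parameter in turn over a 1D grid, holding the others at the
    # best values found so far; `domains` maps name -> (low, high) and
    # `objective` maps a dict of parameter values -> profit per trade.
    best = {name: (low + high) / 2.0 for name, (low, high) in domains.items()}
    best_value = objective(best)
    for name, (low, high) in domains.items():
        for value in np.linspace(low, high, points_per_dim):
            candidate = dict(best, **{name: value})
            candidate_value = objective(candidate)
            if candidate_value > best_value:
                best, best_value = candidate, candidate_value
    return best, best_value
```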
Figure 3: These curves show the progress of the optimization, with the median and interquartile ranges presented to show a range of possible outcomes(2) over 30 independent optimizations of the same model. SigOpt is more efficient and effective than a random search strategy or a parameter sweep, getting to better parameter values, faster. See this earlier post for a detailed analysis on how to robustly compare optimization strategies.
We find that the Bayesian optimization strategies we employ often tune models effectively after a number of evaluations equal to roughly 10 times the number of parameters, so we use 50 model evaluations to tune these 5 parameters. Using the best parameters observed over the 50 model evaluations we can back test our model on the contract prices from April 2014 to April 2016, the data omitted from the model tuning as a “holdout” dataset. The total profit accumulated over that time is displayed in the figure below for each of the optimization strategies described above.
Figure 4: These histograms show the distribution(2) of future profits for the models tuned by SigOpt (median=\$9523), random search (median=\$7781) and a parameter sweep (median=\$7977). The median values are depicted with a black line.
Takeaways
This post shows that even a simple financial model will likely have free parameters and that tuning those free parameters can be a difficult task with serious implications for the viability of the model. SigOpt can provide benefits to practitioners who want to see better performance from their models. This also allows for more models to be developed and tested because SigOpt accelerates the necessary tuning process.
There are significant extensions that can be made to the underlying financial model we present here. First, no trend/autoregressive term is present which would likely improve its predictive capacity; this choice was intentional, to limit the model’s complexity, but given the baseline in this post it would be interesting to see the benefit of such a component. What interests us the most is the introduction of an “Economic News” component which allows the model to account for erratic behavior on days with important reports such as GDP or unemployment rate. Stay tuned for a follow up post looking at that problem.
SigOpt is offering a free 30 day trial to new users. To get started, download the example code:
git clone https://github.com/sigopt/sigopt-examples.git
More from SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Interested in helping push the state of optimization forward? SigOpt is hiring!
(0) The data used in this analysis was extracted from CME group. The quantities under analysis represent the dollar value of a bond futures contract: this is equivalent to the price multiplied by \$1000 for the 5 year and 10 year bonds and \$2000 for the 2 year bond. This might also be referred to as the contract value.
(1) In futures markets we do not explicitly model the time value of money under the assumption that the clearing house clears trades every day and thus the necessary transfer of margin (from the short position to the long or vice versa) accounts for the change in price.
(2) An optimization using any of these three methods will be random because certain decisions must be made randomly: the initialization phase for SigOpt, the order of one dimensional optimizations for parameter sweep, or everything for random search. The medians and interquartile ranges provide some insight as to the average behavior.
Evaluating Hyperparameter Optimization Strategies
By: Alexandra Johnson, Software Engineer.
Hyperparameter optimization is a common problem in machine learning. Machine learning algorithms, from logistic regression to neural nets, depend on well-tuned hyperparameters to reach maximum effectiveness. Different hyperparameter optimization strategies have varied performance and cost (in time, money, and compute cycles). So how do you choose?
Evaluating optimization strategies is non-intuitive. Stochastic optimization strategies produce a distribution of best found values, and how should we choose between two similarly performing strategies? This blog post provides solutions for comparing different optimization strategies for any optimization problem, using hyperparameter tuning as the motivating example.
Function Optimization
We can generalize this problem as: given a function that accepts inputs and returns a numerical output, how can we efficiently find the inputs, or parameters, that maximize the function’s output?
In the case of hyperparameter tuning, the input to our function is the hyperparameters of the model, and the output is the result of measuring the corresponding model’s performance on an offline dataset, a common practice in machine learning.
Parameter Space
The inputs of the function are parameters, and the parameter space represents all possible values of the parameters, usually defined as “acceptable” bounds for each parameter. The dimension of the function is the number of these parameters.
Optimization Strategy
An optimization strategy defines how we will select parameters and how many function evaluations we will perform for every optimization. The final result of an optimization is the “winning” set of parameters that have produced the highest output observed for our function. Considerations when evaluating the performance of an optimization strategy include:
What is the best output value observed by this strategy?
What is the expected cost of this strategy? How many times do I have to evaluate the function to achieve the best observed value?
Example Optimization Strategies
We’ll explore three different methods of optimizing hyperparameters: grid search, random search, and Bayesian optimization. There are other potential strategies, but many of these require too many function evaluations per optimization to be feasible.
A visual presentation of three optimization strategies: grid search, random search, and Bayesian optimization
Grid Search
Grid search suggests parameter configurations deterministically, by laying down a grid of all possible configurations inside your parameter space. To optimize with this strategy, evaluate your function at every point on this grid. With as few as four parameters this problem can become impractical, because the number of function evaluations required for this strategy increases exponentially with each additional parameter, due to the curse of dimensionality.
Random Search
Random Search suggests configurations randomly from your parameter space. While less common in machine learning practice than grid search, random search has been shown to find equal or better values than grid search within fewer function evaluations for certain types of problems. To optimize with random search, evaluate your function at some number of random configurations in the parameter space; note that it may be unclear how to determine the number of function evaluations required for your particular problem.
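A minimal sketch of how these two strategies generate candidate configurations (parameter names and bounds are illustrative):

```python
import itertools
import random

bounds = {"learning_rate": (0.001, 0.1), "reg_strength": (0.0, 1.0)}  # illustrative

def grid_suggestions(points_per_dim=5):
    # Deterministic: every combination of evenly spaced values along each axis.
    axes = {
        name: [low + i * (high - low) / (points_per_dim - 1) for i in range(points_per_dim)]
        for name, (low, high) in bounds.items()
    }
    for combo in itertools.product(*axes.values()):
        yield dict(zip(axes.keys(), combo))

def random_suggestions(n):
    # Stochastic: sample each parameter uniformly and independently within its bounds.
    for _ in range(n):
        yield {name: random.uniform(low, high) for name, (low, high) in bounds.items()}
```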
Bayesian Optimization
Bayesian optimization is an adaptive approach to parameter optimization, trading off between exploring new areas of the parameter space, and exploiting historical information to find the parameters that maximize the function quickly. Like random search, Bayesian optimization is stochastic. In practice, tools like SigOpt perform well with a number of evaluations equal to 10 to 20 times the dimension of the optimization problem.
Optimization Strategy Performance
Best Found Value
The best found value is the highest output value of your function reported during an optimization. When comparing stochastic optimization strategies, we consider the distribution of the best found values over many optimizations of a single strategy and apply a statistical test, similar to the test commonly used when comparing CTR distributions across variations in an A/B test.
Pictured: The Distribution of Best Found Values over 50 Optimizations of a 2-parameter version of this text sentiment classifier. Dashed lines indicate the median of the distribution.
Each optimization in the graph above consists of 20 function evaluations. For grid search, we used a 4 x 5 grid. Grid search, a deterministic optimization strategy, found effectively the same best value on every optimization, but the median accuracy was low, likely because the grid was too coarse. On the other hand, the best found values for random search and SigOpt have a wider range, but the median of each distribution is higher than that of grid search.
Number of Function Evaluations
Optimization strategies must be practical, and an important consideration is the number of function evaluations required by the strategy to reach its optimal value. A strategy like grid search requires an exponentially growing number of function evaluations with the number of parameters and becomes intractable in higher dimensions. Strategies like random search and Bayesian optimization attempt to reach an optimal value with function evaluations linearly proportional to the dimensions of the problem.
Best Seen Trace
To evaluate how quickly a strategy reaches an optimal value, you can examine a best seen trace, which plots the best value observed so far at function evaluation 1, 2, 3…n. Because some of our optimization methods are inherently stochastic we examine the interquartile range of the best seen trace – when aggregating 50 optimizations the interquartile range represents the span of the middle 25 results. This plot shows the evolution of the best seen trace over 40 function evaluations.
Pictured: Best Seen Trace and interquartile ranges of aggregated optimizations, each with 40 function evaluations [source]
When two optimizations with the same fixed number of function evaluations result in similar best found values we can break ties by determining which optimization strategy was able to observe the best result first. We use the Area Under Curve (AUC) of the best seen trace to see which strategy got to the best results in the fewest function evaluations. To convert the best seen traces into numerical measures fit for comparison, we use the following formula to find the AUC of these best seen traces:
where $f_{LB}$ is a lower bound for $f$ designed to ensure that the AUC is always positive.
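As a sketch of this computation, one plausible formulation (assuming the AUC is taken as the sum of the best seen trace above the lower bound; the exact formula used internally may differ) is:

```python
import numpy as np

def best_seen_trace(observed_values):
    # Best value observed so far after evaluation 1, 2, ..., n.
    return np.maximum.accumulate(np.asarray(observed_values))

def trace_auc(observed_values, f_lower_bound):
    # One plausible AUC: sum of the best seen trace shifted by the lower bound,
    # so the result is always positive.
    return float(np.sum(best_seen_trace(observed_values) - f_lower_bound))
```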
Comparing with Confidence
We want assurances that one strategy is better than another with statistical significance. We define “better” to mean:
The best found value from method A will be better than the best found value from method B with some level of confidence
If A and B find the same best value, A finds this value within fewer function evaluations than B with some level of confidence as determined by the AUC of the best seen trace
To compare two optimization strategies with a specific level of confidence, we can apply the Mann-Whitney U test to the best found values from each strategy. We can then use the Area Under the Curve of the best seen traces to break ties. More information can be found in this paper. Our internal testing on functions publicly available in the EvalSet shows that, at significance .01, SigOpt outperforms grid search in 399 out of 400 statistically significant tests.
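In Python, this comparison could be sketched with SciPy; the input arrays are illustrative placeholders for the best found values collected from repeated optimizations of each strategy.

```python
from scipy.stats import mannwhitneyu

def a_beats_b(best_values_a, best_values_b, alpha=0.01):
    # One-sided Mann-Whitney U test that strategy A's best found values tend to
    # be larger than strategy B's, at significance level alpha.
    _, p_value = mannwhitneyu(best_values_a, best_values_b, alternative="greater")
    return p_value < alpha
```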
Conclusion
There are many theoretical and practical concerns when evaluating optimization strategies. The best strategy for your problem is the one that finds the best value the fastest, with the fewest function evaluations. The best found value and the Area Under the Curve are two measurements for comparing optimization strategies. These measurements, combined with the Mann-Whitney U test, allow us to make rigorous statistical statements about which optimization strategy performs better. SigOpt uses Bayesian optimization to help you find the best parameters for a given function or model the fastest, and we frequently use these methods to determine our own performance against popular optimization strategies. Sign up today!
More from SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Deep Learning: Tensorflow ConvNets on a Budget
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
SigOpt Fundamentals: Dealing with Troublesome Metrics
By: Mike McCourt, Ph.D.
At SigOpt, our goal is to help our customers build better models, simulations, and processes by maximizing key metrics for them. These metrics can come from any number of different fields, including optical character recognition, financial modeling and even sports betting. Each metric may have a different interpretation or a different significance (e.g., financial modeling practitioners may be trying to maximize portfolio value), but all of them are functions, a key mathematical concept which is the focus of this blog post. We will introduce the concept of functions and discuss what properties make certain functions more dangerous to maximize.
Metrics as functions
In the simplest sense, a function is a mapping between inputs and outputs: for a given input a function should return a single output(1). In a business context, these functions are a formal representation of the relationship between decisions a business can make (inputs/parameters) and key quantities which may describe the business’s success or failure (outputs/metric values). I use the term “formal” representation because, unfortunately, these functions exist only in some cryptic (and potentially implicit) sense and the actual structure of the function is obscured by layers of complexity specific to the business operations, like a machine learning model or physical process. As a result, while the input-output relationship may be well-established (see the figure below) basic mathematical tools are rendered useless by the lack of a clean mathematical formulation (say, $e=mc^2$). Such a function is often referred to as a black-box function because the inner workings of the function are unknown.
It is likely that you are surrounded by many such metrics on any given day and that you are, perhaps without even realizing it, trying to maximize them to make your life easier (few people try to find the least bang for their buck). As suggested, this could arise in various circumstances, a number of which we discuss at the end of this post, but we will try to focus our discussion here on a topic of common knowledge: cookies. Indeed, what factors play into the “quality” of baked cookies, the metric of primary interest here? We define quality here to be the average result of a customer survey in which customers ranked the taste of the cookies on a scale between 0 and 10. Only four key factors are mentioned here, all of which happen to be continuous: the amount of sugar, the amount of shortening, the temperature at which the cookies are baked and the duration for which they are baked.
Figure 1: Many factors, including sugar, shortening, temperature and time, can play a role in the quality of delicious cookies, but their interaction may be complicated and unintuitive. We call such a metric, which is dependent on many factors in confusing ways, a black box function.
These black-box functions, which incorporate the complicated aspects of corporate and industrial operations into a single fundamental measurement of success, are the metrics of most relevance. This leaves experts administering these operations, who have the task of maximizing these metrics, in the difficult position of having to do so without some of the powerful mathematics developed for simpler optimization problems(2). SigOpt was built to help these experts optimize these operations, guiding them to the best metric values with less trial and error. This is especially important because each measurement of this metric is often costly, either in time (unfortunately, cookies must cool before they can be eaten) or in resources (eggs are not free, and the oven in which the cookies are baked is not cheap either). This cost implies the need to maximize the function in as few iterations as possible.
Fortunately, strategies exist for maximizing these metrics even under such restrictive circumstances. At SigOpt, our weapon of choice is sequential model-based optimization (SMBO) which, as the name suggests, requires the construction of a model of the function being maximized. This model is continuously fit and updated as new outputs of the function (observations of the metric) are generated, thus the inclusion of the word sequential. The quality of this model is paramount, since a good model will effectively represent the desired metric and suggest likely inputs for which the metric is maximized.
We use Gaussian processes to model customer metrics (other options also exist) which have natural mechanisms for developing a well-fit model, including, but not limited to, maximum likelihood estimation. Unfortunately, a naively implemented Gaussian process model may be limited by properties of the desired function, leaving users of open source SMBO software with a complicated question: which functions can be effectively modeled, and thus effectively optimized, with this Gaussian process model-based strategy? The inability to answer this question can stifle the progress of an SMBO algorithm, costing users time and money.
From a purely mathematical standpoint, the correct strategy for identifying functions which are well-modeled with a given Gaussian process involves the reproducing kernel Hilbert space associated with the covariance of that Gaussian process. For the 90% of readers who do not know what that means, do not worry because the 10% of readers who know what that means know that statement has limited practical relevance. This blog post will discuss potentially troubling functions (all in one dimension(3), to keep the topic more tangible) and strategies that SigOpt employs to help our customers efficiently maximize their important metrics.
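To give a feel for what an SMBO loop looks like in code, here is a minimal one-dimensional sketch using a Gaussian process surrogate from scikit-learn and a simple mean-plus-uncertainty acquisition rule; this illustrates the general idea only and is not SigOpt's implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def smbo_maximize(metric, low, high, n_init=5, n_iter=20):
    # Sequential model-based optimization of a 1D black-box metric: fit a GP to
    # the observations so far, then evaluate where the model looks most promising.
    X = np.random.uniform(low, high, size=(n_init, 1))
    y = np.array([metric(x[0]) for x in X])
    candidates = np.linspace(low, high, 500).reshape(-1, 1)
    for _ in range(n_iter):
        model = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        mean, std = model.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(mean + std)]  # crude explore/exploit tradeoff
        X = np.vstack([X, [x_next]])
        y = np.append(y, metric(x_next[0]))
    return X[np.argmax(y), 0], float(np.max(y))
```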
Unscaled functions
Quick explanation: Functions with values on different scales often may produce models that can only approximate one scale well.
As is the case with all of these properties, this is best described with a picture. In the figure below the same function is displayed on two different domains, which present a very different perspective.
Figure 2a: A function with a distinct maximum near x=12.98. left: The function values for 5<x<15, are overwhelmed by the much smaller values for x<5 and x>15. right: If only the typical values on 5<x<15 are studied, the maximum is easier to visually identify. The problem on the right is better scaled and, because it may be easier to model, it may be easier to maximize.
The idea behind scale is a measurement of what is considered a “typical,” or perhaps average, value of the function; a well-scaled function is one whose values are all near one another. A function with a range on the same order as its typical value could be considered reasonably scaled. In the context of optimization, the typical value around which we are interested in creating a model is the maximum of the function. Therefore, a poorly scaled function would be a function with values many orders of magnitude away from the optimum.
Consider a cookie example that might look like this: perhaps x could be a measurement of time spent baking, and y could be a measurement of how much customers like the resulting cookies. As the graph implies, baking for too short or long (less than 5 minutes or more than 15 minutes) yields a swift decline in perceived quality, which anyone who has ever made cookies would know. But for bakers who want to maximize their cookie quality, finding the optimal baking time can be the difference between opening a kiosk at the mall and competing with Mrs. Fields.
There is nothing fundamentally different between modeling either of the functions presented in the figure above, but there is a potential downside to modeling a function that varies over such a large range in the metric, as demonstrated in the figure below.
Figure 2b: Two models (in blue) fit to data (in orange) sampled from the functions in Figure 2a. left: Data sampled at 32 points(4) over 5<x<15 creates a model which acts erratically within the domain of interest; only the region of 5<x<15 has been plotted. right: A model is created using the same data within 5<x<15 (without any of the values at a different scale) and more closely models the true function of interest.
Both models in the figure above pass through the same observed data, but the model on the left has the added pressure of needing to match values in the (ungraphed) regions of x<5 or x>15. This pressure spawns the erratic behavior in the model on the left, and can cause that model to perform worse than the model on the right.
Main takeaway about reasonably scaled functions: Choosing a function whose typical value is on the same order of magnitude as the entire range can produce a better model.
How SigOpt manages this for you: At SigOpt, we understand that some of the most interesting metrics are also those which exist on multiple different scales, which is why we have multiple strategies to help facilitate efficient modeling of even the worst scaled functions. At the simplest level we always apply some form of invertible transformation to the observed data to help guarantee that a wildly large range will not dominate the computation (including so-called warped Gaussian processes). There are also decisions we actively make to choose an appropriate covariance kernel for these unscaled situations. We are also developing newer strategies for modeling data that quickly shifts between scales, including the adaptation of differential equation solvers for shocks.
Well chosen domain
Quick explanation: Prior to optimization, a domain of interest must be specified; if that region is too large, modeling may be inefficient.
For a specific amount of shortening, certain sugar amounts would be too great to produce viable cookies - including those large values in the optimization problem would make the function harder to model. That philosophy is what motivates this idea of a well chosen domain, and the distinction between a well-chosen domain and a poorly chosen domain is on display in the figure below.
Figure 3a: The same function plotted over two domains. left: The function appears rather boring in most of the domain, with the region of interest occupying only a small portion. right: By choosing a better (smaller) domain, the action of the function is more easily modeled because less energy is expended in the boring region.
Visual inspection of the different domains in the figure above would present a cognizant user with the same conclusion: there is a maximum somewhere near x=.85 cups of sugar. But, when constructing a model on which to optimize, only limited samples of the function are available. Thus, using some of those samples in a region nowhere near the maximum is wasteful and can slow the development of the model and, in turn, the progress of the optimization. This is demonstrated in the figure below.
Figure 3b: The evolution of the models (in blue) as increasingly many points (in orange) are sampled. left: When samples from the boring region dominate the model, it takes a longer time to resolve the interesting region. right: When the domain is well focused on the interesting region the model requires fewer points to converge on the desired behavior.
Clearly, the model with the well chosen domain is able to more quickly act like the true function, whereas the model with the poorly chosen domain(5) is still evolving even with the same number of function evaluations. Since function evaluations are assumed to be expensive (either in the form of time or money/resources), being able to generate a better model with fewer points is beneficial.
Main takeaway about well chosen domains: Better identifying the location of the maximum may yield a better model with fewer (costly) function evaluations.
How SigOpt manages this for you: At SigOpt, we want customers to focus on developing the best models and to leave the model tuning to us. Given the data that customers provide, we have implemented several strategies to identify appropriate sampling strategies and avoid the wasted observations shown on the left in Figure 3b. Our optimization routines often begin with a low discrepancy phase, providing a mathematically optimal mechanism for preparing the underlying model. We have also researched strategies for dividing up the region of interest, so as to fit different Gaussian processes in different regions: these strategies include treed Gaussian processes and local interpolation. Furthermore, our models advance the optimization by studying a modified form of expected improvement, which balances the need to fit an effective model (exploration) against the need to find the maximum in as few function evaluations as possible (exploitation).
Continuous functions
Quick explanation: Functions that are disconnected can behave unpredictably and are more difficult to model.
Continuity is an important topic which can, at times, be tricky. In the simplest sense, a continuous function is a function which is connected, and, thus, small changes in inputs yield small changes in outputs. Many functions of interest are continuous, but some important metrics are not, and a naive implementation of Gaussian processes may have difficulty in that setting. An example of this difficulty is displayed in the figure below.
Figure 4: We attempt to model a discontinuous function. left: The function, which seems harmless and has a clear maximum at x=400 with discontinuities at x=350 and x=450. right: Our model (in blue) of this reasonable function, which predicts minima near x=350 and x=450, even as more and more data (in orange) is provided to the model.
In the context of baking cookies, suppose the baking time is already fixed and the optimal temperature (the x-axis in the figure above, in Fahrenheit) is under analysis. If the temperature is too low, the cookies fail to bake and the chef disposes of the cookies. If the temperature is too high, the cookies are burnt to a crisp and are, again, disposed of. In both circumstances, the pastry chef perceives the customer preference as the worst value possible which could produce a discontinuity when actual customer preferences are recorded; after all, some of us really enjoy the uncooked dough as much as the baked cookies.
As is painfully obvious, this disconnect between the true customer preferences and the censoring imposed by the pastry chef may yield a model that predicts specious behavior around x=350 and x=450. Historically, this has been referred to as the Gibbs phenomenon and cannot be avoided for this particular choice of Gaussian process model(6). In a sense, the model is trying to “build up energy” to jump from one branch of the function to the other, which results in these undesired oscillations near the discontinuities.
Main takeaway about discontinuous functions: Creating a model for discontinuous data requires careful work to make sure that unphysical behavior does not dominate the predictions.
How SigOpt manages this for you: This situation has troubled researchers since the 1800s, and among possible resolutions the simplest is to create a model using a more suitable covariance kernel. Unfortunately, models which work well for Figure 4 may perform worse on functions that have no discontinuities and are actually smooth, such as the function from Figure 3a. Effectively conducting this tradeoff is a key part of the work we do at SigOpt to better model our customer’s metrics. We also consider nontraditional Gaussian process strategies involving local interpolation (discussed here, originally, in the context of preconditioning) and the adaptation of methods primarily designed for the solution of partial differential equations (including partition of unity methods and essentially non-oscillatory methods).
Functions are all around us
Although all of the examples discussed here involve baking cookies, the mathematics applies to a broad range of situations which can be described with functions. The efficiency of an assembly line could be a function of the number of operators, the number of supervisors, the ratio of work time to break time, the temperature, and other such factors. The accuracy of a support vector machine may be a function of the choice of feature map, associated hyperparameters, the box constraint, and the choice of loss function. The profit of a financial asset trading strategy could be a function of the desired risk, perceived volatility, duration of past data influence and frequency of model refitting. But in all of these cases, the quality of an SMBO algorithm is limited by the ability of the model to resemble the function of interest.
This blog post provides some insight into common issues within the function approximation/modeling community and advice that we would give practitioners who wanted to implement an open source or internal SMBO solution. At SigOpt, our goal is to provide users a simple framework through which to maximize their important metrics, whether they be as industrial as the design of an airplane or as satisfying as the tastiness of cookies. We do not expect users to conduct complicated manipulations on their metrics to avoid the problems described above; we would rather users spend their time using their expert knowledge to develop more interesting and valuable models. Thus, our best advice is to sign up for SigOpt today to start optimizing!
SigOpt is offering a free 30 day trial to new users. To get started, download the example code:
git clone https://github.com/sigopt/sigopt-examples.git
More from SigOpt
SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
Footnotes
(1) There are also multivalued functions where a single input maps to multiple outputs. They arise naturally when dealing with complex variables and may occur in some applications, but are not something we consider here.
(2) The standard reference for numerical optimization would be the classic text by Dennis and Schnabel. That book deals with standard mathematical strategies for numerical optimization, and not the model-based optimization strategy we employ at SigOpt, but some concepts, including those in chapter 7, are relevant in any optimization setting.
(3) The examples presented in this post all describe a four dimensional problem (where sugar, shortening, temperature and time are turned into cookies of varying quality) in terms of a single dimension. This analysis of a single dimension, with all others fixed, is done to make easily interpretable graphs, but is not, generally, an effective strategy for conducting optimization in high dimensional spaces.
(4) In this particular example we have used equally spaced points to create our model, but this decision is only for demonstrative purposes. During sequential optimization, these points would almost certainly be sampled according to some acquisition function. Furthermore, if the goal were only to create a good model, statistics would suggest an optimal design while approximation theory might suggest using the Chebyshev-Gauss-Lobatto points.
(5) This larger domain with a boring region serves a useful purpose in demonstrating the value of identifying a domain which tightly bounds the optimum. The idea that there may be different regions of activity for a function (as opposed to just boring and interesting) is indicative of something larger, however. This property might be referred to as nonstationarity in the spatial statistics or machine learning communities, and has been addressed (albeit somewhat anonymously) in the numerical analysis community with local interpolation and multilevel methods.
(6) This model was designed to accentuate this undesired behavior, to demonstrate a severe situation that could arise when someone new to SMBO first experiments. As suggested above, a different choice of covariance kernel (in particular, one with less smoothness) may not have such large undesired oscillations. Of course, knowing that a priori would require extra analysis on the function of interest prior to optimization.
SigOpt for ML: TensorFlow ConvNets on a Budget with Bayesian Optimization
By: Ian Dewancker, Research Engineer
In this post on integrating SigOpt with machine learning frameworks, we will show you how to use SigOpt and TensorFlow to efficiently search for an optimal configuration of a convolutional neural network (CNN). There are a large number of tunable parameters associated with defining and training deep neural networks (Bergstra [1]), and SigOpt accelerates searching through these settings to find optimal configurations. This search is typically a slow and expensive process, especially when using standard techniques like grid or random search, as evaluating each configuration can take multiple hours. SigOpt finds good combinations far more efficiently than these standard methods by employing an ensemble of state-of-the-art Bayesian optimization techniques, allowing users to arrive at the best models faster and cheaper.
In this example, we consider the same optical character recognition task of the SVHN dataset as discussed in a previous post. Our goal is to build a model capable of recognizing digits (0-9) in small, real-world images of house numbers. We use SigOpt to efficiently find a good structure and training configuration for a convolutional neural net. Check out the code here if you’d like to start experimenting!
Convolutional Neural Net Structure
The structure and topology of a deep neural network can have dramatic implications for performance on a given task (Bengio [3]). Many small decisions go into the connectivity and aggregation strategies for each of the layers that make up a deep neural net. These parameters can be non-intuitive to choose in an optimal, or even acceptable, fashion. In this experiment we used a TensorFlow CNN example designed for the MNIST dataset as a starting point. Figure 1 represents a typical CNN structure, highlighting the parameters we chose to vary in this experiment. A more complete discussion of these architectural decisions can be found in an online course from Stanford (Li [5]). It should be noted that Figure 1 is an approximation of the architecture used in this example, and the code serves as a more complete reference.
Figure 1: Representative convolutional neural net topology. Important parameters include the width and depth of the convolutional filters, as well as dropout probability. (Sermanet [2])
TensorFlow has greatly simplified the effort required to build and experiment with deep neural network (DNN) designs. Tuning these networks, however, is still an incredibly important part of creating a successful model. The optimal structural parameters often highly depend on the dataset under consideration. SigOpt offers Bayesian optimization as a service to minimize the amount of trial and error required to find good structural parameters for DNNs and CNNs.
Stochastic Gradient Descent Parameters ($\alpha, \beta, \gamma$)
Once the structure of the neural net has been selected, an optimization strategy based on stochastic gradient descent (SGD) is used to fit the weight parameters of the convolutional neural net. There is no shortage of SGD variants implemented in TensorFlow, and several parametrizations of RMSProp, one such variant, are compared in Figure 2.
Figure 2: Progression of RMSProp gradient descent with different parametrizations. left: Various decay rates with other parameters fixed: purple = .01, black = .5, red = .93. center: Various learning rates with other parameters fixed: purple = .016, black = .1, red = .6. right: Various momentums with other parameters fixed: purple = .2, black = .6, red = .93.
Optimally configuring a particular SGD algorithm for a given model and dataset can be a counterintuitive and time-consuming task. To simplify this tedious process, we expose to SigOpt the parameters that govern the RMSProp optimization algorithm. Important parameters governing its behavior are the learning rate ($\alpha$), momentum ($\beta$) and decay ($\gamma$) terms. These parameters define the RMSProp gradient update step:
Algorithm 1: Pseudocode for RMSProp stochastic gradient descent. Stochastic gradient refers to the fact that we are estimating the loss function gradient using a subsample (batch) of the entire training data
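Concretely, a single RMSProp update might look like the minimal sketch below (following Hinton's course notes [4]); the exact TensorFlow implementation may differ in small details such as where the stabilizing epsilon term is added.

import numpy as np

def rmsprop_step(theta, grad, mean_sq, mom, alpha, beta, gamma, eps=1e-10):
    # gamma (decay) controls the running average of the squared gradient
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad ** 2
    # alpha (learning rate) scales the rescaled gradient; beta (momentum)
    # carries over part of the previous update
    mom = beta * mom + alpha * grad / np.sqrt(mean_sq + eps)
    return theta - mom, mean_sq, mom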
For this example, we used only a single epoch of the training data, where one epoch refers to a complete presentation of the entire training data (~500K images in our example). Batch size refers to the number of training examples used in the computation of each stochastic gradient (10K images in our example). One epoch is made up of several batch-sized updates, so as to limit the in-memory resources required for the optimization (Hinton [4]). Using only a single epoch can be detrimental to performance, but this was done in the interest of time for this example.
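For concreteness, the arithmetic behind this setup (numbers taken from this example):

# ~500K training images with a batch size of 10K means one epoch corresponds
# to roughly 50 stochastic gradient updates
num_train_images = 500000
batch_size = 10000
updates_per_epoch = num_train_images // batch_size  # = 50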
Classification Performance
To compare tuning the CNN's hyperparameters with random search versus SigOpt, we ran 5 experiments using each method and compared the median best seen trace. The objective was the classification accuracy on a single 80/20 fold of the training and “extra” sets of the SVHN dataset (71K + 500K images). The median best seen trace for each optimization strategy is shown below in Figure 3.
In our experiment we allowed SigOpt and random search to perform 80 function evaluations (each representing a different proposed configuration of the CNN). A progression of the best seen objective at each evaluation for both methods is shown below in Figure 3. We include, as a baseline, the accuracy of an untuned TensorFlow CNN using the default parameters suggested in the official TensorFlow example. We also include the performance of a random forest classifier using sklearn defaults.
Figure 3: Median best seen trace of CV accuracy over 5 independent optimization runs using SigOpt and random search, as well as two baselines where no tuning was performed
After hyperparameter optimization was completed for each method, we compared accuracy using a completely held out dataset (SVHN test set, 26K images) and the best configuration found in the tuning phase. The best hyperparameter configuration found by each method in each of the 5 optimization runs was used for evaluation. The mean of these accuracies is reported in the table below. We also include the same baseline models described above and report their performance on the held out evaluation set.
Hold out ACC:
SigOpt (TensorFlow CNN): 0.8130 (+315.2%)
Random Search (TensorFlow CNN): 0.5690
No Tuning (sklearn RF): 0.5278
No Tuning (TensorFlow CNN): 0.1958
Table 1: Comparison of model accuracy on the held out dataset after different tuning strategies
Cost Analysis
Using SigOpt to optimize deep learning architectures instead of a standard approach like random search can translate to real savings in the total cost of tuning a model. This is especially true when expensive computational resources (for example GPU EC2 instances) are required by your modelling efforts.
We compare the cost required to reach specific performance levels on the CV accuracy objective in our example experiment. Quickly finding optimal configurations directly reduces the computational cost of tuning, on top of the performance benefit of ending up with a better model. Here we assume each observation costs $2.60, which is the cost per hour of a single on-demand g2.8xlarge instance in EC2.
Model Performance (CV Acc. threshold) | Random Search Cost | SigOpt Cost | SigOpt Cost Savings | Potential Production Savings (50 GPUs)
87% | $275 | $42 | 84% | $12,530
85% | $195 | $23 | 88% | $8,750
80% | $46 | $21 | 55% | $1,340
70% | $29 | $21 | 27% | $400
Table 2: Required costs for achieving same performance when tuning with SigOpt and random search. For CNNs in production more epochs are traditionally used; for this example we assume 50 GPUs and that the results scale perfectly with the parallelism.
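As a rough sketch of the arithmetic behind these numbers (an illustration, not an exact accounting): the tuning cost is simply the number of observations needed to reach a given threshold times the assumed cost per observation.

# $2.60 per observation: one hour on an on-demand g2.8xlarge instance
COST_PER_OBSERVATION = 2.60

def tuning_cost(observations_needed):
    return observations_needed * COST_PER_OBSERVATION

# e.g. roughly 106 random search observations (~$275) versus roughly 16
# SigOpt observations (~$42) to reach the 87% CV accuracy threshold above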
We observe that SigOpt offers a drastic discount in cost to achieve equivalent performance levels when compared with a standard method like random search. While this experiment required only a relatively modest amount of computational resources, more sophisticated models and larger datasets require more instances training for up to weeks at a time, as was the case for the AlphaGo DNN, which used 50 GPUs for training. In this setting, an 80% reduction in computational costs could easily translate to tens of thousands of dollars in savings.
Closing Remarks
Deep learning has quickly become an exciting new area in applied machine learning. Development and innovation are often slowed by the complexity and effort required to find optimal structures and training strategies for deep learning architectures. Optimal configurations for one dataset don’t necessarily translate to others, and using default parameters can often lead to suboptimal results. This inhibited R&D cycle can be frustrating for practitioners, but it also carries a very real monetary cost. SigOpt offers Bayesian optimization as a service to assist machine learning engineers and data scientists in being more cost effective in their modelling efforts. Start building state-of-the-art machine learning models on a budget today!
SigOpt is offering a free 30 day trial to TensorFlow users. To get started, download the code:
git clone https://github.com/sigopt/sigopt-examples.git
More from SigOpt
SigOpt’s 90 Second Tutorial demonstrates our optimization engine with no sign-up required.
Using Model Tuning to Beat Vegas (sklearn)
Automatically Tuning Text Classifiers (sklearn)
Unsupervised Learning with Bayesian Optimization (xgboost)
References
[1]: James Bergstra, Rémi Bardenet, Yoshua Bengio and Balázs Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems. 2011. [PDF]
[2]: Pierre Sermanet, Soumith Chintala and Yann LeCun. Convolutional Neural Networks Applied to House Numbers Digit Classification. International Conference on Pattern Recognition (ICPR). 2012. [PDF]
[3]: Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. 2009. [PDF]
[4]: Geoffrey Hinton, Nitish Srivastava and Kevin Swersky. Neural Networks for Machine Learning. University of Toronto course slides. [LINK]
[5]: Fei-Fei Li, Andrej Karpathy and Justin Johnson. Convolutional Neural Networks for Visual Recognition. Stanford online course. [LINK]
Text
SigOpt for ML: Unsupervised Learning with Even Less Supervision Using Bayesian Optimization
By: Ian Dewancker, Research Engineer
In this post on integrating SigOpt with machine learning frameworks, we will show you how to use SigOpt and XGBoost to efficiently optimize an unsupervised learning algorithm’s hyperparameters to increase performance on a classification task.
As we previously discussed, fully supervised learning algorithms require each data point to have an associated class or output. In practice, however, it is often the case that relatively few labels are available during training time and labels are costly or time consuming to acquire. For example, it might be a very slow and expensive process for a group of experts to manually investigate and classify thousands of credit card transaction records as fraudulent or legitimate. A better strategy might be to study the large collection of transaction data without labels, building a representation that better captures the variations in the transaction data automatically.
Unsupervised Learning
Unsupervised learning algorithms are designed with the hope of capturing some useful latent structure in data. These techniques can often enable dramatic gains in performance on subsequent supervised learning tasks, without requiring more labels from experts. In this post we will use an unsupervised method on an image recognition task posed by researchers at Stanford [1] where we try to recognize house numbers from images collected using Google street view (SVHN). This is a more challenging problem than MNIST (another popular digit recognition data set) as the appearance of each house number varies quite a bit and the images are often cluttered with neighboring digits:
Figure 1: 32x32 cropped samples from the classification task of the SVHN dataset. Each sample is assigned only a single digit label (0 to 9) corresponding to the center digit. (Sermanet [6])
In this example, we assume access to a large collection of unlabelled images $X_u$, where the correct answer is not known, and a relatively small amount of labelled data $(X_s, y)$ for which the true digit in each image is known (often requiring a non-trivial amount of time and money to collect). Our hope is to find a suitable unsupervised model, built using our large collection of unlabelled images, that transforms images into a more useful representation for our classification task.
Unsupervised and supervised learning algorithms are typically governed by small sets of hyperparameters $(\lambda_u, \lambda_s)$, that control algorithm behavior. In our example pipeline below, $X_u$ is used to build the unsupervised model $f_u$ which is then used to transform the labelled data $(X_s, y)$ before the supervised model $f_s$ is trained. Our task is to efficiently search for good hyperparameter configurations $(\lambda_u, \lambda_s)$ for both the unsupervised and supervised algorithms. SigOpt minimizes the classification error $E(\lambda_u, \lambda_s)$ by sequentially generating suggestions for the hyperparameters of the model $(\lambda_u, \lambda_s)$. For each suggested hyperparameter configuration a new unsupervised data representation is formed and fed into the supervised model. The observed classification error is reported and the process repeats, converging on the set of hyperparameters that minimizes the classification error.
Figure 2: Process for coupled unsupervised and supervised model tuning
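A minimal sketch of one evaluation in this loop is shown below; fit_unsupervised and fit_supervised are hypothetical stand-ins for the models described in the following sections, and the prefix-based split of the assignments is purely illustrative.

def evaluate_pipeline(assignments, X_u, X_s, y):
    # split the suggested assignments into unsupervised and supervised groups
    lambda_u = {k: v for k, v in assignments.items() if k.startswith("unsup_")}
    lambda_s = {k: v for k, v in assignments.items() if k.startswith("sup_")}
    f_u = fit_unsupervised(X_u, **lambda_u)   # e.g. whitening + k-means centroids
    Z = f_u.transform(X_s)                    # new representation of the labelled data
    error = fit_supervised(Z, y, **lambda_s)  # cross-validated classification error
    return error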
Data scientists use their domain knowledge to select appropriate unsupervised and supervised models, but the task of selecting the best parameters for these models is often daunting and non-intuitive. Even simple unsupervised models—such as the whitening strategy we discuss below—often introduce tunable parameters, leaving potentially good models on the table simply because a winning configuration was never found.
SigOpt offers Bayesian optimization as a service, capable of efficiently searching through the joint variation $(\lambda_u, \lambda_s)$ of both the supervised and unsupervised aspects of machine learning systems (Figure 2). This allows experts to unlock the power of unsupervised strategies with the assurance that each model is reaching its full potential automatically. The gains achieved by methods like SigOpt are additive with feature engineering, allowing for better results and faster iteration with less trial and error.
Show Me The Code
The source code for this experiment and setup script are available here. SigOpt can quickly identify the optimal configuration of these complicated models, much faster than traditional methods such as grid search and random search, especially when more than 2 or 3 hyperparameters are at play. Finding optima more quickly can also drastically save on computational resource costs (e.g. AWS instances) while still achieving comparable or better model performance.
Unsupervised Model
We start with the initial features describing the data: raw pixel intensities for each image. The goal of the unsupervised model is to transform the data from its original representation to a new (more useful) learned representation without using labeled data. Specifically, you can think of this unsupervised model as a function $f : \mathbb{R}^N \rightarrow \mathbb{R}^J$, where $N$ is the number of features in our original representation and $J$ is the number of features in the learned representation. In practice, expanded representations (sometimes referred to as a feature map), where $J$ is much larger than $N$, often work well for improving performance on classification tasks [2].
Image Transform Parameters ($s$, $w$, $K$)
A simple but surprisingly effective transformation for small images was proposed in a paper by Coates [1] where image patches are transformed into distances to $K$ learned centroids (average patches) using the k-means algorithm, and then pooled together to form a final feature representation as outlined in the figure below:
Figure 3: Feature extraction using a w-by-w receptive field and stride s. Patches of size w-by-w, separated by s pixels, are mapped to K-dimensional feature vectors to form a new image representation. These vectors are then pooled over 4 quadrants of the image to form the classifier feature vector. (Coates [1])
In this example we are working with the 32x32 (n=32) converted gray-scale (d=1) images of the SVHN dataset. We allow SigOpt to vary the stride length ($s$) and patch width ($w$) parameters. The figure above illustrates a pooling strategy that considers quadrants in the 2x2 grid of the transformed image representation, summing them to get the final transformed vector. We used the suggested resolution in [1] and kept $\text{pool_res}$ fixed at 2. $f(x)$ represents a $K$ dimensional vector that encodes the distances to the $K$ learned centroids, and $f_i(x)$ refers to the distance of instance $x$ to centroid $i$. In this experiment, $K$ is also a tunable parameter. The final feature representation of each image will have $J = K * \text{pool_res}^2$ features.
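As a rough sketch of this transform (whitening and the sparsity threshold described in the next two subsections are omitted here, and helper names are illustrative), the per-image pipeline might look like:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def extract_patches(img, w, s):
    # img: (n, n) grayscale image; returns (num_patches, w*w) patches and the
    # number of patch positions per side of the resulting grid
    n = img.shape[0]
    starts = list(range(0, n - w + 1, s))
    patches = [img[i:i + w, j:j + w].ravel() for i in starts for j in starts]
    return np.array(patches), len(starts)

def transform_image(img, kmeans, w, s, pool_res=2):
    patches, grid = extract_patches(img, w, s)
    Z = kmeans.transform(patches)          # distances to the K learned centroids
    K = Z.shape[1]
    Z = Z.reshape(grid, grid, K)           # restore the spatial layout of the patches
    pooled = np.zeros((pool_res, pool_res, K))
    h = int(np.ceil(grid / float(pool_res)))
    for a in range(pool_res):
        for b in range(pool_res):
            block = Z[a * h:(a + 1) * h, b * h:(b + 1) * h]
            pooled[a, b] = block.sum(axis=(0, 1))   # sum-pool each quadrant
    return pooled.ravel()                  # J = K * pool_res**2 features

# Centroids would be learned once from a large sample of (whitened) patches:
#   kmeans = MiniBatchKMeans(n_clusters=K).fit(sample_of_whitened_patches)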
Whitening Transform Parameter ($\epsilon_{\text{zca}}$)
Before generating the image patch centroids and any subsequent patch comparisons to these centroids, we apply a whitening transform to each patch. When dealing with image data, whitening is a common preprocessing transform which removes the correlation between all pairs of individual pixels [3]. Intuitively, it can be thought of as a transformation that highlights contrast in images. It has been shown to be helpful in image recognition tasks, and may also be useful for other feature data. The figure below shows several example image patches before and after the whitening transform is applied.
Figure 4: Comparison of image patches before and after whitening (Stansbury [7])
The whitening transformation we use is known as ZCA whitening [4]. This transform is achieved by cleverly applying the eigendecomposition of the covariance matrix estimate to a mean adjusted version of the data matrix, so that the expected covariance of the data matrix becomes the identity. A regularization term $\epsilon_{\text{zca}}$ is added to the diagonal eigenvalue matrix, and $\epsilon_{\text{zca}}$ is exposed as a tunable parameter to SigOpt.
import numpy
from sklearn.covariance import LedoitWolf

# fit covariance estimate to data matrix X (n x m: n samples, m features)
cov = LedoitWolf().fit(X)
D, U = numpy.linalg.eigh(cov.covariance_)
V = numpy.sqrt(numpy.linalg.inv(numpy.diag(D + eps_zca)))
# ZCA whitening transform matrix
Wh = numpy.dot(numpy.dot(U, V), U.T)
mu = numpy.mean(X, axis=0)
X_zca = numpy.dot(X - mu, Wh)
Centroid Distance Sparsity Parameter ($\text{sparse_p}$)
Each whitened patch in the image is transformed by considering its distances to the $K$ learned centroids. To control the sparsity of the representation, we keep only distances that fall below a certain percentile, $\text{sparse_p}$, of the pairwise distances between the current patch and the centroids. Intuitively this acts as a threshold which allows only the “close” centroids to be active in our representation.
# compute distances between patches and all centroids
Z = k_means.transform(img_ptchs)
# keep only distances below the sparse_p percentile; zero out the rest
tau = numpy.percentile(Z, sparse_p, axis=1, keepdims=True)
Z = numpy.maximum(0, tau - Z)
The figure below illustrates the idea with a simplified example. A whitened image patch (in the upper right) is compared against the 4 learned centroids after k-means clustering. Here, let’s imagine we have set the percentile threshold to 50, so only the distances in the lower half of all centroid distances persist in the final representation; the others are zeroed out.
Figure 5: Sparsity transform; distances from a patch to centroids above 50th percentile are set to 0
While the convolutional aspects of this unsupervised model are tailored to image data, the general approach of transforming feature data into a representation that reflects distances to learned archetypes seems suitable for other data sets and feature spaces [8].
Supervised Model
With the learned representation of our data, we now seek to maximize performance on our classification task using a smaller labelled dataset. While random forests are an excellent, and simple, classification tool, better performance can typically be achieved by using carefully tuned ensembles of boosted classification trees.
Gradient Boosting Parameters ($\gamma, \theta, M$)
We consider the popular library XGBoost as our gradient boosting implementation. Gradient boosting is a generic boosting algorithm that incrementally builds an additive model of base learners, which are themselves simpler classification or regression models. Gradient boosting works by building a new model at each iteration that best reconstructs the gradient of the loss function with respect to the previous ensemble model. In this way it can be seen as a sort of functional gradient descent, and is outlined in more detail below. In the pseudocode below we outline building an ensemble of regression trees, but the same method can be used with a classification loss function $L$.
Algorithm 1: Pseudocode for supervised gradient boosting using regression trees as base learners
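As a concrete, hedged illustration of this procedure, the sketch below builds an ensemble of regression trees under squared-error loss, where the negative gradient of the loss is simply the residual; XGBoost's actual implementation adds regularization and second-order information beyond this simplification.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, gamma=0.1, max_depth=3):
    F = np.full(len(y), y.mean())        # initial constant model
    trees = []
    for _ in range(M):
        residuals = y - F                # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F = F + gamma * tree.predict(X)  # add the new base learner's contribution
        trees.append(tree)
    return trees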
Important parameters governing the gradient boosting algorithm include $M$, the number of base learner models in the ensemble, and $\gamma$, the learning rate, which controls the relative contribution of each new base learner to the final additive model. Each base learner is also governed by its own set of parameters $\theta$. Here we consider classification trees as our base learners, governed by a familiar set of parameters managing tree growth and regularization (e.g., max depth, sub_sample). We expose these parameters to SigOpt to optimize simultaneously with the parameters governing the unsupervised transformation discussed previously, as sketched below.
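A hedged sketch of how these parameters might be exposed through XGBoost's scikit-learn wrapper follows; the parameter names in the assignments dict and their ranges are illustrative, and the exact set used in this experiment lives in the linked code.

import xgboost as xgb

def make_booster(assignments):
    return xgb.XGBClassifier(
        n_estimators=assignments["n_estimators"],    # M, the ensemble size
        learning_rate=assignments["learning_rate"],  # gamma
        max_depth=assignments["max_depth"],          # per-tree growth control (theta)
        subsample=assignments["subsample"],          # per-tree row subsampling (theta)
    )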
Classification Performance
To compare model performance, we use accuracy, which can be understood as the probability of correctly classifying each image in the test set. For example, a model that correctly recognizes the house numbers in 91% of the images would have an accuracy score of 0.91.
We compare the ability of SigOpt to find the best hyperparameter configuration against the industry standard of random search, which usually outperforms grid search and manual search (Bergstra [9]), as well as a baseline untuned model.
Because the underlying methods are inherently stochastic, we performed 10 independent hyperparameter optimizations using both SigOpt and random search for both the purely supervised and combined models. Hyperparameter optimization was performed on the accuracy estimate from an 80/20 cross validation fold of the training data (73K examples). The ‘extra’ set associated with the SVHN dataset (530K examples) was used to simulate the unlabelled data $X_u$ in the unsupervised parts of this example.
For the combined model (unsupervised features + XGBoost), 90 sequential configuration evaluations (~50 CPU hrs) were used for both SigOpt and random search. For the purely supervised model, 40 sequential configuration evaluations (~8 CPU hrs) were used for both SigOpt and random search. In practice, SigOpt is usually able to find good hyperparameter configurations with a number of evaluations equal to roughly 10 times the number of parameters being tuned (9 for the combined model, 4 for the purely supervised model). The same parameters and domains were used for XGBoost in both the combined and purely supervised settings. As a baseline, we also report the hold out accuracy of an untuned scikit-learn random forest using the raw pixel intensity features.
After hyperparameter optimization was completed for each method, we compared accuracy using a completely held out dataset (SVHN test set, 26K examples) and the best configuration found in the tuning phase. The hold out dataset was evaluated 10 times for each method's best hyperparameter configuration, and the mean of these runs is reported in the table below. SigOpt outperforms random search with a p-value of 0.0008 using the unpaired Mann-Whitney U test.
Hold out ACC:
SigOpt (xgboost + Unsup. Feats): 0.8601 (+49.2%)
Rnd Search (xgboost + Unsup. Feats): 0.8190
SigOpt (xgboost + Raw Feats): 0.7483
Rnd Search (xgboost + Raw Feats): 0.7386
No Tuning (sklearn RF + Raw Feats): 0.5756
Table 1: Comparison of model accuracy on held out (test) dataset after different tuning strategies
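For reference, a sketch of that significance test with scipy, assuming sigopt_accuracies and random_search_accuracies are hypothetical lists holding the 10 hold-out accuracies for each method:

from scipy.stats import mannwhitneyu

stat, p_value = mannwhitneyu(sigopt_accuracies, random_search_accuracies,
                             alternative="greater")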
The chart below tracks the optimization path of SigOpt vs random search optimization strategies when tuning the unsupervised model (Unsup Feats) and only the supervised model (Raw Feats). We plot the interquartile range of the best seen cross validated accuracy score on the training set at each objective evaluation during the optimization. As mentioned above, 90 evaluations were used in the optimization of the unsupervised model and 40 in the supervised setting. SigOpt outperforms random search in both settings on this training data (p-value 0.005 using the same Mann-Whitney U test as before).
Figure 6: Optimization traces of CV accuracy using SigOpt and random search
Closing Remarks
Unsupervised learning algorithms can be a powerful tool for boosting the performance of your supervised models when labelling is an expensive or slow process. Tuning automatically brings each model to its full potential. SigOpt was built to help with this non-intuitive task. As this example demonstrates, careful parameter tuning can enable engineering and data science teams to better leverage their unlabelled data and build more predictive data products in less time and at lower cost. Sign up for a free evaluation today and get the most from your models!
Additional Reading
SigOpt effectively optimizes machine learning models across a variety of datasets and algorithms. See our other examples:
Using model tuning to beat Vegas
Automatically Tuning Text Classifiers
References
[1]: Adam Coates, Honglak Lee and Andrew Y. Ng. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. International Conference on Artificial Intelligence and Statistics (AISTATS). 2011. [PDF]
[2]: Yoshua Bengio. Deep Learning of Representations for Unsupervised and Transfer Learning. JMLR Workshop Proceedings: Unsupervised and Transfer Learning. 2012. [PDF]
[3]: Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009. [PDF]
[4]: Adam Coates and Andrew Y. Ng. Learning Feature Representations with K-means. 2012. [PDF]
[5]: Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu and Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. 2011. [PDF]
[6]: Pierre Sermanet, Soumith Chintala and Yann LeCun. Convolutional Neural Networks Applied to House Numbers Digit Classification. International Conference on Pattern Recognition (ICPR). 2012. [PDF]
[7]: Dustin Stansbury. The Statistical Whitening Transform. The Clever Machine. 2014. [LINK]
[8]: Sander Dieleman and Benjamin Schrauwen. Multiscale Approaches to Music and Audio Feature Learning. International Society for Music Information Retrieval Conference. 2013. [PDF]
[9]: James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research (JMLR). 2012. [PDF]
Text
Lessons from a RESTful API Redesign
By: Alexandra Johnson, Software Engineer
The fast-paced initial design of our v0 API left us with a few outstanding issues related to data representation, authentication, and resource management. When building the new and improved v1 API, we stepped back and chose to more closely adhere to RESTful architectural principles. Structured resource management improved the clarity of our core endpoints, but unexpected benefits also came to a feature we call parallelism. The v0 implementation required additional parameters, endpoints and resources; in v1, parallelism can be handled with normal RESTful resource management.
Background of our original API
The most important resource in SigOpt is an experiment. An experiment encapsulates an objective that SigOpt is optimizing, the relevant parameters, and the underlying data. After creating an experiment, you run an optimization loop (shown in the code snippet below): receive a suggestion for the experiment, evaluate your metric on the suggested parameters, and report back an observation.
The v0 API was organized around the experiment. The HTTP verbs were restricted to GET and POST, and most endpoints were structured as:
GET /v0/experiments/id/FUNCTION
POST /v0/experiments/id/FUNCTION
Thus, the v0 optimization loop was as follows:
v0 Optimization Loop
Step 1: receive a suggestion
curl -X POST "https://api.sigopt.com/v0/experiments/ID/suggest" \ -d 'client_token=client_token'
Step 2: Evaluate your metric (implement a function)
Step 3: report an observation
curl -X POST "https://api.sigopt.com/v0/experiments/ID/report" \ -d 'client_token=client_token' \ -d 'data={ "assignments": { "param2": "XXXX", "param1": "XXXX" }, "value": "YYY" }'
In the basic use case, the user runs one iteration of the optimization loop after another and evaluates their metric sequentially. When clients have the necessary computational resources, multiple evaluations of the metric in parallel can be advantageous. Organization of the work (running the optimization loop) follows the standard thread pool design pattern.
The report endpoint creates an observation, and extends naturally to the parallel case.
The suggest endpoint was trickier. Although it was created as a POST request, this endpoint didn’t necessarily create a new suggestion: it might return a stored suggestion or create a new one, depending on availability. There was no way to re-request a specific suggestion that you had previously seen. To support parallelism, we introduced a system to "claim" suggestions. The claiming system necessitated the creation of several new endpoints to manage its infrastructure. Between resource administration, the optimization loop, and parallelism, we had almost 20 function-endpoints for parallel experiments.
Switchover
We chose REST for our public API because we wanted an architectural style that was clear, easy to follow, and familiar to developers[0]. REST is structured, meaning that knowledge about how one resource is manipulated easily transfers to other resources.
Our first goal was to improve the optimization loop. To be truly RESTful, we needed to stop thinking of observations and suggestions as objects returned by functions, and start thinking of them as resources. We focused on keeping experiments, observations and suggestions at predictable URIs based on unique identifiers, and using the set of HTTP verbs POST, GET, PUT, and DELETE for performing creates, fetches, updates and deletes. An experiment now lives at /v1/experiments/id, while observations and suggestions are nested beneath their respective experiments.
Overall, we reduced our list of endpoints from around 20 experiment-function endpoints to 6 supported endpoints for each of these objects (suggestions currently do not support an update, because there are no fields that can be updated). In the table below, you can see how resources are manipulated with different HTTP verbs.
Table 1: v1 API endpoints organized around the observation resource
Parallelism
Extending to parallelism, we scrapped the claiming system and started over from scratch. We needed to support multiple workers each running their own optimization loops at the same time for a given experiment.
The RESTful redesign makes suggestions resources that are manipulated by the user. Similar to experiments and observations, the initial GET /v1/experiments/id/suggestions returns an empty list, because no resources have been created yet. Unlike the old POST /v0/experiments/id/suggest, each call to POST /v1/experiments/id/suggestions creates a brand new suggestion resource for the user.
As it turns out, since resource management was built into the RESTful redesign, we did not need to build any additional API infrastructure to handle parallelism. Because each POST creates an independent resource, individual threads, processes, or machines can create their own suggestions without any additional API calls or parameters. Should one or more threads fail, all open suggestions can be viewed or deleted with one of the following two requests:
GET /v1/experiments/id/suggestions?state=open
DELETE /v1/experiments/id/suggestions?state=open
allowing the user to easily recover from the failure. The v1 optimization loop reflects this resource management:
v1 Optimization Loop
Suggestions, observations, and experiments are all resources.
Step 1: Receive a suggestion
curl -X POST "https://api.sigopt.com/v1/experiments/ID/suggestions" \ -u $CLIENT_TOKEN:
Step 2: Evaluate your metric (implement a function)
Step 3: Report an observation
curl -X POST "https://api.sigopt.com/v1/experiments/ID/observations" \ -u $CLIENT_TOKEN: \ -H "Content-Type: application/json" \ -d '{ "suggestion":"SUGGESTION_ID", "value":"YYY" }'
Conclusion
Imagine you have a cluster of machines on which you can conduct experiments. An effective use of that pool of resources would be to have machines examining suggested configurations in parallel. Our v1 API has, through its RESTful design, the clarity and structure to facilitate these simultaneous optimization loops. Resource management is now under the user’s control, which gives more freedom but also the responsibility of cleaning up the created resources. Endpoints and resources are more easily explained, and parallelism, an important feature for advanced customers, no longer requires additional API infrastructure to manage.
To learn more about our API, check out our API Docs. We also have an official python client!
Footnotes
[0]: Why REST Keeps Me Up At Night. It’s worth noting that REST is not the only way to design an API, but it is one of the best solutions for “a large number of known and unknown developers to easily consume the services that the API offers”, which is exactly what we were designing for.