Text
End of February Reading List
Yang, Tianbao and Li, Yu-Feng and Mahdavi, Mehrdad and Jin, Rong and Zhou, Zhi-Hua, "Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison", Advances in Neural Information Processing Systems, 2012
Bach, Francis, "On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions", 2015
Eldredge, Nathaniel, "Analysis and Probability on Infinite-Dimensional Spaces", 2016
Bach, Francis, "Breaking the Curse of Dimensionality with Convex Neural Networks", JMLR, 2017
Roy, D. and Teh, Y. W., "The Mondrian Process", Advances in Neural Information Processing Systems, 2009
Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo, "Less is More: Nyström Computational Regularization", 2015
Schölkopf, Bernhard and Smola, Alexander J. and Müller, Klaus-Robert, "Kernel Principal Component Analysis", Advances in Kernel Methods: Support Vector Learning, 1999
Rosasco, Lorenzo, "On Learning with Integral Operators", Journal of Machine Learning Research, 2010
Avron, Haim and Clarkson, Kenneth L. and Woodruff, David P., "Faster Kernel Ridge Regression Using Sketching and Preconditioning", 2016
Gonen, Alon and Shalev-Shwartz, Shai, "Faster SGD Using Sketched Conditioning", 2015
Text
What’s currently on my Remarkable
Rosasco, Lorenzo, "On Learning with Integral Operators", Journal of Machine Learning Research, 2010
Avron, Haim and Clarkson, Kenneth L. and Woodruff, David P., "Faster Kernel Ridge Regression Using Sketching and Preconditioning", 2016
Gonen, Alon and Shalev-Shwartz, Shai, "Faster SGD Using Sketched Conditioning", 2015
Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo, "Less is More: Nyström Computational Regularization", 2015
Rudi, Alessandro and Carratino, Luigi and Rosasco, Lorenzo, "FALKON: An Optimal Large Scale Kernel Method", 2017
Daniely, Amit and Frostig, Roy and Singer, Yoram, "Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity", 2016
Ma, Siyuan and Belkin, Mikhail, "Diving into the shallows: a computational perspective on large-scale shallow learning", 2017
Bach, Francis, "Breaking the Curse of Dimensionality with Convex Neural Networks", JMLR, 2017
Liu, Yong and Liao, Shizhong and Lin, Hailun and Yue, Yinliang and Wang, Weiping, "Infinite Kernel Learning: Generalization Bounds and Algorithms", Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), 2017
Text
What’s currently on my Remarkable
Cortes, Corinna and Kloft, Marius and Mohri, Mehryar, "Learning Kernels Using Local Rademacher Complexity", Advances in Neural Information Processing Systems, 2013
Foster, Dylan J. and Rakhlin, Alexander and Sridharan, Karthik, "ZigZag: A new approach to adaptive online learning", 2017
Bartlett, Peter L. and Bousquet, Olivier and Mendelson, Shahar, "Local Rademacher Complexities", Annals of Statistics, 2005
Suzuki, Taiji, "Fast learning rate of deep learning via a kernel perspective", 2017
Zhu, Yinchu and Bradic, Jelena, "Breaking the curse of dimensionality in regression", 2017
Zhang, Yuchen and Lee, Jason and Jordan, Michael, "$\ell_1$-regularized Neural Networks are Improperly Learnable in Polynomial Time", arXiv, 2015
Daniely, Amit and Frostig, Roy and Singer, Yoram, "Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity", 2016
Kaibel, Volker, "Basic Polyhedral Theory", Wiley Encyclopedia of Operations Research and Management Science, 2011
Zhang, Yuchen and Liang, Percy and Wainwright, Martin J., "Convexified Convolutional Neural Networks", 2016
McCoy, Michael B. and Tropp, Joel A., "From Steiner Formulas for Cones to Concentration of Intrinsic Volumes", Discrete and Computational Geometry, 2014
Wu, Mike and Hughes, Michael C. and Parbhoo, Sonali and Zazzi, Maurizio and Roth, Volker and Doshi-Velez, Finale, "Beyond Sparsity: Tree Regularization of Deep Models for Interpretability", 2017
Kawaguchi, Kenji and Kaelbling, Leslie Pack and Bengio, Yoshua, "Generalization in Deep Learning", 2017
Liang, Tengyuan and Poggio, Tomaso and Rakhlin, Alexander and Stokes, James, "Fisher-Rao Metric, Geometry, and Complexity of Neural Networks", 2017
Maurer, Andreas, "A Vector-Contraction Inequality for Rademacher Complexities", Lecture Notes in Computer Science, 2016
Mohri, Mehryar and Rostamizadeh, Afshin and Talwalkar, Ameet, "Foundations of Machine Learning", 2012
Text
How-to print onto the Remarkable via USB on Mac
I will show you how to set up the reMarkable so that you can use it like a printer from your Mac.
First of all, you have to enable the USB mode on the reMarkable device. You'll find the option in the settings.
Now start the Automator app that is installed on every Mac. Create a new workflow and choose to create a new Print Plugin.
Next, in the list of possible actions, search for "Run Shell Script" and drag it into your new workflow. In the action's options, set "Pass input" to "as arguments" (the script loops over "$@"), and use the following script:
for f in "$@"
do
  # create a temporary copy of the printed file with a .pdf extension
  TMPPS_PREFIX=$(mktemp "XXXXXX")
  TMPPS=$(mktemp "${TMPPS_PREFIX}.pdf")
  rm "${TMPPS_PREFIX}"
  cp "${f}" "${TMPPS}"
  # push the file to the small web server running on the device
  curl -vvv -H "Origin: http://10.11.99.1" \
       -H 'Referer: http://10.11.99.1/' \
       -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
       -F "file=@${TMPPS};filename=\"${f}\";type=application/pdf" \
       http://10.11.99.1/upload
  # clean up the temporary copy
  rm "${TMPPS}"
done
The script creates a temporary copy of the PDF you want on your device and then pushes it to the small web server running on your reMarkable before deleting the temporary file. On a side note, if you want a .json file listing all the documents on your device, you can query http://10.11.99.1/documents. You can now save the workflow under a descriptive name; I chose "Print to Remarkable via USB".
Now connect the reMarkable to your Mac via USB, open a document you want on your reader, and open the print dialog. Under "PDF" you will find your new workflow:
Your document will show up on your device almost instantly. Have fun with it!
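As a side note for scripting outside of Automator: here is a minimal Python sketch (my own, not part of the workflow above) that talks to the same two endpoints, the /documents listing and the /upload form; the file name paper.pdf is just a placeholder.
import requests

BASE = "http://10.11.99.1"

# list all documents on the device as JSON
print(requests.get(BASE + "/documents").json())

# upload a PDF, mimicking the curl call from the Automator workflow
with open("paper.pdf", "rb") as fh:
    response = requests.post(
        BASE + "/upload",
        headers={"Origin": BASE, "Referer": BASE + "/"},
        files={"file": ("paper.pdf", fh, "application/pdf")},
    )
print(response.status_code)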
Link
Learning Low-Rank Document Embeddings with Weighted Nuclear Norm Regularization
Abstract: Recently, neural embeddings of documents have shown success in various language processing tasks. These low-dimensional and dense feature vectors of text documents capture semantic similarities better than traditional methods. However, the underlying optimization problem is non-convex and usually solved using stochastic gradient descent. Hence solutions are most likely suboptimal and not reproducible, as they are the result of a randomized algorithm.
We present an alternative formulation for learning low-rank representations. Instead of explicitly learning low-dimensional features, we compute a low-rank representation implicitly by regularizing full-dimensional solutions. Our approach uses the weighted nuclear norm, a regularizer that penalizes the singular values of a matrix. We optimize the regularized objective using accelerated proximal gradient descent. We apply the approach to learn embeddings of documents.
We show that our approach outperforms traditional convex approaches in a numerical study. Furthermore, we demonstrate that the embeddings are useful for detecting similarities on a standard dataset. Then we apply our approach in an interdisciplinary research project to detect topics in religious online discussions. The topic descriptions obtained from a clustering of the embeddings are coherent and insightful. In comparison to existing approaches, they are also reproducible.
An earlier version of this work stated that the weighted nuclear norm is a convex regularizer. This is wrong: the weighted nuclear norm is non-convex, even though the name falsely suggests that it is a matrix norm.
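For illustration (this is my own sketch, not code from the paper), the key step of the proximal gradient method is the proximal operator of the weighted nuclear norm, which soft-thresholds each singular value by its own weight; the closed form below is exact only under suitable ordering conditions on the weights, and the weights and data here are made up.
import numpy as np

def prox_weighted_nuclear_norm(X, weights, step=1.0):
    # weighted singular value thresholding: shrink each singular value by its weight
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - step * np.asarray(weights), 0.0)
    return (U * s_shrunk) @ Vt

# one proximal gradient step on a quadratic fit term 0.5 * ||X - A||_F^2 (sketch only)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
weights = np.linspace(0.1, 1.0, 10)   # hypothetical weights, small ones for the leading singular values
X = np.zeros_like(A)
gradient = X - A
X = prox_weighted_nuclear_norm(X - gradient, weights)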
Text
Spectrally-Normalized Margin Bounds for Neural Networks
Bartlett, Foster and Telgarsky show that the generalization performance of deep networks depends on the margin achieved on the training dataset [1]. We study classification tasks and assume the following setting: a deep network predicts the class whose output neuron has the maximum activation. The margin is defined as the difference between the activation of the output neuron corresponding to the true class and the highest activation of all other output neurons. The margin is positive if the prediction of the network is correct.
Obviously the margin is a relative quantity: with ReLU activations, scaling the weight matrix of a single layer by a factor of two doubles the margin without changing the prediction. More generally, for Lipschitz-continuous activations, the change of the output, and hence of the margin, is controlled by the Lipschitz constants of the layers. Recall that the spectral norm measures the maximal change in Euclidean norm induced by a matrix multiplication. Thus we have to normalize the margins by the product of the spectral norms of all affine transformations, as well as by the norm of the data sample, to obtain an absolute measure of margin.
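As a rough illustration (my own sketch, not the actual bound from [1], whose normalization involves further terms), one can compute a spectrally-normalized margin for a small ReLU network by dividing the raw margin by the product of the spectral norms of the weight matrices and the norm of the input; the network and data below are made up.
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical two-layer ReLU network with random weights
W1, W2 = rng.standard_normal((64, 32)), rng.standard_normal((10, 64))

def forward(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

def normalized_margin(x, y):
    # margin of the true class y, normalized by the spectral norms and ||x||
    scores = forward(x)
    margin = scores[y] - np.max(np.delete(scores, y))
    lipschitz = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)  # product of spectral norms
    return margin / (lipschitz * np.linalg.norm(x))

x, y = rng.standard_normal(32), 3
print(normalized_margin(x, y))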
This post is going to summarize the contribution of [1] in some detail once I have the time to write it up. Till then, also see [2].
[1] P. Bartlett, D. J. Foster, and M. Telgarsky, "Spectrally-normalized margin bounds for neural networks," 2017. [https://arxiv.org/abs/1706.08498]
[2] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, “A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks,” pp. 1–7, 2017. [https://arxiv.org/pdf/1707.09564.pdf]
Text
Empirical Rademacher Complexities of Neural Networks
Zhang et al. [1] observe that standard deep neural networks can achieve zero training error even on datasets with random labels. Obviously, such models cannot generalize, which raises the question why deep neural networks do generalize in practice.
An established tool to derive generalization properties of function classes is the Rademacher complexity. The empirical Rademacher complexity of a function class $\mathcal F$ given a data sample $x_1,\ldots,x_n$ is defined as $$R_n(\mathcal F) = \mathbb E_\sigma \left[\sup\limits_{f\in \mathcal F}\frac 1 n \sum\limits_{i=1}^n \sigma_i f(x_i)\right]$$ where the expectation is with respect to the Rademacher distribution of $\sigma_i \in \{-1,+1\}$. With probability $1-\delta$, the risk of a classifier $f$ under a loss function $l$ with $0 \leq l(y,y') \leq 1$ is related to the empirical risk on a dataset of size $n$, $$E_n(l(f(x),y))=\frac 1 n \sum\limits_{i=1}^n l(f(x_i),y_i),$$ via $$\mathbb E_{(x,y) \sim D}\, l(f(x),y) \leq E_n(l(f(x),y)) + 2 R_n(l \circ \mathcal F) + 3\sqrt{\frac {\ln \frac 2 \delta} {2n}}.$$
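As a toy illustration (my own sketch, unrelated to [1]), the empirical Rademacher complexity can be estimated by Monte Carlo sampling of the sign vectors $\sigma$; for the simple class of linear functions with bounded weight norm the inner supremum even has a closed form.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # a fixed data sample x_1, ..., x_n
B = 1.0                             # weight-norm bound of the linear class

def empirical_rademacher_linear(X, B, num_draws=1000):
    # Monte Carlo estimate of R_n for {x -> w.x : ||w||_2 <= B};
    # for this class the supremum is (B / n) * ||sum_i sigma_i x_i||_2
    n = X.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(num_draws, n))
    sups = B / n * np.linalg.norm(sigmas @ X, axis=1)
    return sups.mean()

print(empirical_rademacher_linear(X, B))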
Now what does this mean for deep neural networks? Zhang et al. point out that the empirical Rademacher complexity essentially contains their experiment of training a neural net on real data with random labels. We can verify that the supremum $\sup\limits_{f\in \mathcal F}\frac 1 n \sum\limits_{i=1}^n \sigma_i l(f(x_i),y_i)$ is achieved when $f$ predicts correctly for $\sigma_i=-1$ and as badly as possible for $\sigma_i=+1$. If a neural network can learn arbitrary labels, it can learn the labels that achieve the highest possible loss. Let us assume that the worst loss is 1 and the best loss is 0. Then the empirical Rademacher complexity of the loss function class $l \circ \mathcal F$ is $0.5$. Plugging this into the risk bound above yields a useless bound that only tells us that the expected loss of our function $f$ is smaller than 1, which is trivially true as the maximum loss value is 1.
A slightly more rigorous argument can be made if we limit our analysis to specific network architectures. For instance, if we use only ReLU activations $a(x) = \max \{0,x\}$, we can characterize the function class $\mathcal F$ as a subset of all piecewise linear functions. We consider a binary classification problem with classification rule $\mbox{sign}(f(x))$. Arora et al. [2] show that such a neural network with $k$ hidden layers and $s$ artificial neurons represents a piecewise linear function with at most $\left(\frac{2s}k\right)^k$ pieces. This number is often larger than $n$, hence $\mathcal F$ contains the piecewise linear function that realizes the worst possible loss. By the results of [2], a network with $k$ hidden layers needs at least $\frac 1 2 k n^{\frac 1 k} -1$ neurons to express a piecewise linear function with $n$ pieces. Every class of ReLU networks that fulfills this requirement can attain the empirical Rademacher complexity of $0.5$.
Does this mean all efforts to bound empirical Rademacher complexities of neural networks are doomed? Not necessarily: we can still use the analysis to identify factors that influence generalization, develop regularization schemes based on these factors, or restrict the function class expressed by deep nets to guarantee generalization with high probability.
[1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in ICLR, 2017. [https://openreview.net/pdf?id=Sy8gdB9xx]
[2] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding Deep Neural Networks with Rectified Linear Units,” pp. 1–21, 2016. [https://arxiv.org/pdf/1611.01491.pdf]
Text
The one with the REINFORCE baseline
The REINFORCE algorithm minimizes $R(\theta) := \mathbb E_{a \sim \pi_\theta} L(a)$ with respect to $\theta$ by gradient descent. The gradient with respect to $\theta$ can be expressed as
$$\nabla_\theta R(\theta) = \mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot L(a) \right]$$
using the little trick $\nabla \pi(a) = \pi(a)\nabla\log \pi(a)$.
Since $\sum\limits_a \pi_\theta(a) \nabla\log \pi_\theta(a) = \sum\limits_a \nabla_\theta \pi_\theta(a) = \nabla_\theta 1 = 0$, we can subtract a constant baseline $b$ from the reward [1] and optimize using the gradient estimator $$\nabla_\theta R(\theta) = \mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot (L(a) - b) \right].$$ Choosing a baseline is important in practice. Usually, we choose the $b$ that minimizes the variance of the estimator, but a simpler approach is to use the expected loss $b = \mathbb E_{a \sim \pi_\theta} L(a)$.
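As a minimal sketch (my own, with a made-up categorical policy and made-up per-action losses): for a softmax policy $\pi_\theta = \mathrm{softmax}(\theta)$ the score $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector of $a$ minus $\pi_\theta$, so the estimator with and without the baseline can be compared directly.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(4)        # logits of a categorical policy over 4 actions
L = np.array([1.0, 0.2, 0.8, 0.5])    # hypothetical per-action losses

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_gradient(theta, num_samples=10000, use_baseline=True):
    # Monte Carlo estimate of grad_theta E_{a~pi}[L(a)], optionally with baseline b = E[L(a)]
    pi = policy(theta)
    actions = rng.choice(len(pi), size=num_samples, p=pi)
    b = (pi * L).sum() if use_baseline else 0.0
    scores = np.eye(len(pi))[actions] - pi   # score function: one-hot(a) - pi
    return (scores * (L[actions] - b)[:, None]).mean(axis=0)

# both estimates have the same mean, but the baseline version has much lower variance
print(reinforce_gradient(theta, use_baseline=False))
print(reinforce_gradient(theta, use_baseline=True))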
Choosing a good baseline is particularly crucial in the setting of [2], where we jointly optimize a policy network $\pi_\theta$ and a network that computes the loss $L_\theta$. The gradient for the joint objective is a sum of two components $$\mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot L_\theta(a) + \nabla_\theta L_\theta(a)\right].$$ We run into problems when $L_\theta(a)$ is large: the first term then dominates and the gradient overall is large, so we end up optimizing only over the policy. While this formulation is mathematically sound, in practice it does not work. After a few gradient steps (often only one) the policy converges to a deterministic policy. The network is then stuck with this bad policy but cannot decrease $L$ on it, since the gradient is still dominated by the policy term rather than the loss term.
In this scenario, subtracting the baseline $b = \mathbb E_{a \sim \pi_\theta} L_\theta(a)$ can help. If the variance of $L_\theta(a)$ is small, we no longer multiply the policy gradient by large loss values but only by small centered deviations $L_\theta(a) - b$.
If subtracting the baseline is not enough, we can manually scale down the policy-gradient term by a small constant. However, it is unclear how to justify this from a theoretical perspective.
[1] Peters, Jan, and Stefan Schaal. "Reinforcement learning of motor skills with policy gradients." Neural Networks 21.4 (2008): 682-697.
[2] https://digitalehipsterkladde.tumblr.com/post/154546721174/learning-models-with-reinforcement
Text
Learning Models with Reinforcement
Most neural models have a static computation graph: the model performs the same computations $\hat y(x)$ regardless of the input $x$. In a sense this is also true for recurrent neural networks: while the computation graph grows linearly as it is unfolded over time, each timestep consists of the same sequence of computations $\hat y_t(x_t,\hat y_{1},\ldots,\hat y_{t-1})$.
However, we can also train models that vary their computation graph based on the input. Consider the following example:
\begin{equation} \hat y(x) = \left\{\begin{array}{lr} f(x) & \mbox{ if }a = F \mbox{ for }a \sim \pi(x) \newline g(x) & \mbox{otherwise} \end{array}\right. \end{equation}
where $\pi$ is a probability distribution over the space of events $\{F,G\}$. Obviously, this is an abuse of mathematical notation, since $\hat y$ is no longer a function but the result of a stochastic process that depends on $\pi$. Now $f$ and $g$ might be neural networks with parameters $\theta_f,\theta_g$. If we want to minimize a loss function of the output of $\hat y$ with respect to these parameters $\theta$, we can do so using stochastic gradient descent.
Formally, we minimize
\begin{equation} \min_\theta \mathbb E_{a \sim \pi}\left[L(\hat y_\theta(x,a))\right] \end{equation}
where $\hat y_\theta(x,a)$ is the deterministic version of $\hat y$ that does not sample $a$ but is given its choice as a parameter. The gradient $\nabla_\theta \mathbb E_{a \sim \pi}\left[L(\hat y_\theta(x,a))\right]=\mathbb E_{a \sim \pi}\left[\nabla_\theta L(\hat y_\theta(x,a))\right]$ can be estimated by sampling actions $a \sim \pi$. For these samples we can compute the gradient using backpropagation applied to either $f$ or $g$ depending on the choice of $a$ and average over the set of samples. If the set of actions is small and the probability distribution $\pi$ is known, we can compute the expected value exactly without the need to sample.
What happens if $\pi$ is also parametrized by a neural network? That is, we consider $\pi_\theta(a \mid x)$, which computes probabilities for all actions in a set of possible actions $A$ given an example $x$. In the example above this set contains only two actions, $F$ and $G$. If the space of possible actions is small, we can still compute the exact expectation and use standard backpropagation to find a good policy. This is for instance done by Kontschieder et al. [1] in order to learn decision trees with neural network nodes: they define a tree structure and learn the decision rules by backpropagation through the expected output of the network.
In order to learn a good policy $\pi_\theta(a \mid x)$ when the space of possible actions is too large to compute the exact expected loss, we apply reinforcement learning, more specifically the REINFORCE algorithm [2,3].
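Here is a minimal sketch (my own, in PyTorch, with made-up networks and data) of this exact-expectation approach for the two-branch example above; the expected loss sums over both branches, so one backward pass updates $f$, $g$, and the policy network at once.
import torch

x = torch.randn(8, 16)                       # a batch of hypothetical inputs
y = torch.randn(8, 1)                        # regression targets

f = torch.nn.Linear(16, 1)                   # branch F
g = torch.nn.Linear(16, 1)                   # branch G
policy = torch.nn.Sequential(torch.nn.Linear(16, 2), torch.nn.Softmax(dim=-1))
params = list(f.parameters()) + list(g.parameters()) + list(policy.parameters())
opt = torch.optim.SGD(params, lr=0.1)

pi = policy(x)                               # probabilities for actions F and G
loss_f = (f(x) - y).pow(2)                   # per-example loss of branch F
loss_g = (g(x) - y).pow(2)                   # per-example loss of branch G
expected_loss = (pi[:, :1] * loss_f + pi[:, 1:] * loss_g).mean()

opt.zero_grad()
expected_loss.backward()                     # gradients flow into f, g and the policy
opt.step()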
We want to minimize $$\mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \sum\limits_{a} \pi_\theta(a) \cdot L(\hat y_\theta(x,a))$$ by gradient descent. Thus we have to compute the gradient with respect to $\theta$. Using the product-rule of calculus, we obtain $$\nabla_\theta \mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \sum\limits_a \nabla_\theta \pi_\theta(a) \cdot L(\hat y_\theta(x,a)) + \pi_\theta(a) \cdot \nabla_\theta L(\hat y_\theta(x,a))$$ and using the trick $\nabla \pi(a) = \pi(a)\nabla\log \pi(a)$ we obtain $$\nabla_\theta \mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \sum\limits_a \pi_\theta(a) \nabla_\theta \log \pi_\theta(a) \cdot L(\hat y_\theta(x,a)) + \pi_\theta(a) \cdot \nabla_\theta L(\hat y_\theta(x,a))$$ which can be rewritten as $$\nabla_\theta \mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot L(\hat y_\theta(x,a)) + \nabla_\theta L(\hat y_\theta(x,a))\right]$$
This expectation can be empirically estimated by averaging over a set of samples.
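A sketch of how this estimator is typically obtained from automatic differentiation (again my own illustration with made-up networks, not a reference implementation): one builds the surrogate objective $\log \pi_\theta(a)\cdot L + L$, where the loss in the first term is detached so that no gradient is taken through it, and lets backpropagation do the rest.
import torch

x, y = torch.randn(8, 16), torch.randn(8, 1)               # hypothetical data, as above
f, g = torch.nn.Linear(16, 1), torch.nn.Linear(16, 1)       # the two branches
policy = torch.nn.Sequential(torch.nn.Linear(16, 2), torch.nn.Softmax(dim=-1))
params = list(f.parameters()) + list(g.parameters()) + list(policy.parameters())
opt = torch.optim.SGD(params, lr=0.1)

dist = torch.distributions.Categorical(probs=policy(x))
a = dist.sample()                                           # one sampled action per example, 0 = F, 1 = G
branch_loss = torch.where(a.unsqueeze(1) == 0, (f(x) - y).pow(2), (g(x) - y).pow(2))

# surrogate whose gradient is exactly the estimator derived above:
# grad log pi(a) * L (loss detached) + grad L (through the chosen branch)
surrogate = dist.log_prob(a).unsqueeze(1) * branch_loss.detach() + branch_loss

opt.zero_grad()
surrogate.mean().backward()
opt.step()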
References
[1] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulò, "Deep neural decision forests," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1467–1475.
[2] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992.
[3] http://www.scholarpedia.org/article/Policy_gradient_methods
Text
Mathematician
Well this is basically just a test. But I would really love to have some formulas in this blog, like $f(x)=x^2$ and some other stuff.
$$\frac{\partial f(x)}{\partial x} = 2x.$$