Text
End of February Reading List
Yang, Tianbao and Li, Yu-Feng and Mahdavi, Mehrdad and Jin, Rong and Zhou, Zhi-Hua, "Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison", Advances in Neural Information Processing Systems, 2012
Bach, Francis, "On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions", 2015
Eldredge, Nathaniel, "Analysis and Probability on Infinite-Dimensional Spaces", 2016
Bach, Francis, "Breaking the Curse of Dimensionality with Convex Neural Networks", JMLR, 2017
Roy, D. and Teh, Y. W., "The Mondrian Process", Advances in Neural Information Processing Systems, 2009
Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo, "Less is More: Nyström Computational Regularization", 2015
Schölkopf, Bernhard and Smola, Alexander J. and Müller, Klaus-Robert, "Kernel Principal Component Analysis", Advances in Kernel Methods: Support Vector Learning, 1999
Rosasco, Lorenzo, "On Learning with Integral Operators", Journal of Machine Learning Research, 2010
Avron, Haim and Clarkson, Kenneth L. and Woodruff, David P., "Faster Kernel Ridge Regression Using Sketching and Preconditioning", 2016
Gonen, Alon and Shalev-Shwartz, Shai, "Faster SGD Using Sketched Conditioning", 2015
Text
What’s currently on my Remarkable
Rosasco, Lorenzo, "On Learning with Integral Operators", Journal of Machine Learning Research, 2010
Avron, Haim and Clarkson, Kenneth L. and Woodruff, David P., "Faster Kernel Ridge Regression Using Sketching and Preconditioning", 2016
Gonen, Alon and Shalev-Shwartz, Shai, "Faster SGD Using Sketched Conditioning", 2015
Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo, "Less is More: Nyström Computational Regularization", 2015
Rudi, Alessandro and Carratino, Luigi and Rosasco, Lorenzo, "FALKON: An Optimal Large Scale Kernel Method", 2017
Daniely, Amit and Frostig, Roy and Singer, Yoram, "Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity", 2016
Ma, Siyuan and Belkin, Mikhail, "Diving into the shallows: a computational perspective on large-scale shallow learning", 2017
Bach, Francis, "Breaking the Curse of Dimensionality with Convex Neural Networks", JMLR, 2017
Liu, Yong and Liao, Shizhong and Lin, Hailun and Yue, Yinliang and Wang, Weiping, "Infinite Kernel Learning: Generalization Bounds and Algorithms", Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), 2017
Text
What’s currently on my Remarkable
Cortes, Corinna and Kloft, Marius and Mohri, Mehryar, "Learning Kernels Using Local Rademacher Complexity", Advances in Neural Information Processing Systems, 2013
Foster, Dylan J. and Rakhlin, Alexander and Sridharan, Karthik, "ZigZag: A new approach to adaptive online learning", 2017
Bartlett, Peter L. and Bousquet, Olivier and Mendelson, Shahar, "Local Rademacher Complexities", Annals of Statistics, 2005
Suzuki, Taiji, "Fast learning rate of deep learning via a kernel perspective", 2017
Zhu, Yinchu and Bradic, Jelena, "Breaking the curse of dimensionality in regression", 2017
Zhang, Yuchen and Lee, Jason and Jordan, Michael, "$\ell_1$-regularized Neural Networks are Improperly Learnable in Polynomial Time", arXiv, 2015
Daniely, Amit and Frostig, Roy and Singer, Yoram, "Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity", 2016
Kaibel, Volker, "Basic Polyhedral Theory", Wiley Encyclopedia of Operations Research and Management Science, 2011
Zhang, Yuchen and Liang, Percy and Wainwright, Martin J., "Convexified Convolutional Neural Networks", 2016
McCoy, Michael B. and Tropp, Joel A., "From Steiner Formulas for Cones to Concentration of Intrinsic Volumes", Discrete and Computational Geometry, 2014
Wu, Mike and Hughes, Michael C. and Parbhoo, Sonali and Zazzi, Maurizio and Roth, Volker and Doshi-Velez, Finale, "Beyond Sparsity: Tree Regularization of Deep Models for Interpretability", 2017
Kawaguchi, Kenji and Kaelbling, Leslie Pack and Bengio, Yoshua, "Generalization in Deep Learning", 2017
Liang, Tengyuan and Poggio, Tomaso and Rakhlin, Alexander and Stokes, James, "Fisher-Rao Metric, Geometry, and Complexity of Neural Networks", 2017
Maurer, Andreas, "A Vector-Contraction Inequality for Rademacher Complexities", Lecture Notes in Computer Science, 2016
Mohri, Mehryar and Rostamizadeh, Afshin and Talwalkar, Ameet, "Foundations of Machine Learning", 2012
Text
How-to print onto the Remarkable via USB on Mac
I will show you how to set up the reMarkable so that you can use it like a printer from your Mac.
First of all, you have to enable the USB mode on the reMarkable device. You'll find the option in the settings.
Now start the Automator app that is installed on every Mac. Create a new workflow and choose to create a new Print Plugin.
Next, in the list of possible actions, search for "Run Shell Script" and drag it into your new workflow. In the action's options, set "Pass input" to "as arguments" (the script loops over "$@"), and use the following script:
for f in "$@"
do
  # create a temporary copy of the printed file with a .pdf extension
  TMPPS_PREFIX=$(mktemp "XXXXXX")
  TMPPS=$(mktemp "${TMPPS_PREFIX}.pdf")
  rm "${TMPPS_PREFIX}"
  cp "${f}" "${TMPPS}"
  # push the file to the small web server running on the device
  curl -vvv -H "Origin: http://10.11.99.1" \
       -H 'Referer: http://10.11.99.1/' \
       -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
       -F "file=@${TMPPS};filename=\"${f}\";type=application/pdf" \
       http://10.11.99.1/upload
  # clean up the temporary copy
  rm "${TMPPS}"
done
The script creates a temporary copy of the PDF you want on your device and then pushes it to the small web server running on your reMarkable before deleting the temporary file. On a side note, if you want a .json file listing all the documents on your device, you can query http://10.11.99.1/documents. You can now save the workflow under a descriptive name; I chose "Print to Remarkable via USB".
Now connect the reMarkable to your Mac via USB, open a document you want on your reader, and open the print dialog. Under "PDF" you will find your new workflow:
Your document will show up on your device almost instantly. Have fun with it!
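As a side note for scripting outside of Automator: here is a minimal Python sketch (my own, not part of the workflow above) that talks to the same two endpoints, the /documents listing and the /upload form; the file name paper.pdf is just a placeholder.
import requests

BASE = "http://10.11.99.1"

# list all documents on the device as JSON
print(requests.get(BASE + "/documents").json())

# upload a PDF, mimicking the curl call from the Automator workflow
with open("paper.pdf", "rb") as fh:
    response = requests.post(
        BASE + "/upload",
        headers={"Origin": BASE, "Referer": BASE + "/"},
        files={"file": ("paper.pdf", fh, "application/pdf")},
    )
print(response.status_code)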
Link
Learning Low-Rank Document Embeddings with Weighted Nuclear Norm Regularization
Abstract: Recently, neural embeddings of documents have shown success in various language processing tasks. These low-dimensional and dense feature vectors of text documents capture semantic similarities better than traditional methods. However, the underlying optimization problem is non-convex and usually solved using stochastic gradient descent. Hence solutions are most likely suboptimal and not reproducible, as they are the result of a randomized algorithm.
We present an alternative formulation for learning low-rank representations. Instead of explicitly learning low-dimensional features, we compute a low-rank representation implicitly by regularizing full-dimensional solutions. Our approach uses the weighted nuclear norm, a regularizer that penalizes the singular values of a matrix. We optimize the regularized objective using accelerated proximal gradient descent. We apply the approach to learn embeddings of documents.
We show that our approach outperforms traditional convex approaches in a numerical study. Furthermore, we demonstrate that the embeddings are useful for detecting similarities on a standard dataset. Then we apply our approach in an interdisciplinary research project to detect topics in religious online discussions. The topic descriptions obtained from a clustering of the embeddings are coherent and insightful. In comparison to existing approaches, they are also reproducible.
An earlier version of this work stated that the weighted nuclear norm is a convex regularizer. This is wrong: the weighted nuclear norm is non-convex, even though the name falsely suggests that it is a matrix norm.
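For illustration (this is my own sketch, not code from the paper), the key step of the proximal gradient method is the proximal operator of the weighted nuclear norm, which soft-thresholds each singular value by its own weight; the closed form below is exact only under suitable ordering conditions on the weights, and the weights and data here are made up.
import numpy as np

def prox_weighted_nuclear_norm(X, weights, step=1.0):
    # weighted singular value thresholding: shrink each singular value by its weight
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - step * np.asarray(weights), 0.0)
    return (U * s_shrunk) @ Vt

# one proximal gradient step on a quadratic fit term 0.5 * ||X - A||_F^2 (sketch only)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
weights = np.linspace(0.1, 1.0, 10)   # hypothetical weights, small ones for the leading singular values
X = np.zeros_like(A)
gradient = X - A
X = prox_weighted_nuclear_norm(X - gradient, weights)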
Text
Spectrally-Normalized Margin Bounds for Neural Networks
Bartlett, Foster and Telgarsky show that the generalization performance of deep networks depends on the margin achieved on the training dataset [1]. We study classification tasks and assume the following setting: a deep network predicts the class whose output neuron has the maximum activation. The margin is defined as the difference between the activation of the output neuron corresponding to the true class and the highest activation of all other output neurons. The margin is positive if the prediction of the network is correct.
Obviously the margin is a relative quantity: with ReLU activations, scaling the weight matrix of a single layer by a factor of two doubles the margin without changing the prediction. More generally, for Lipschitz-continuous activations, the change of the output, and hence of the margin, is controlled by the Lipschitz constants of the layers. Recall that the spectral norm measures the maximal change in Euclidean norm induced by a matrix multiplication. Thus we have to normalize the margins by the product of the spectral norms of all affine transformations, as well as by the norm of the data sample, to obtain an absolute measure of margin.
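As a rough illustration (my own sketch, not the actual bound from [1], whose normalization involves further terms), one can compute a spectrally-normalized margin for a small ReLU network by dividing the raw margin by the product of the spectral norms of the weight matrices and the norm of the input; the network and data below are made up.
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical two-layer ReLU network with random weights
W1, W2 = rng.standard_normal((64, 32)), rng.standard_normal((10, 64))

def forward(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

def normalized_margin(x, y):
    # margin of the true class y, normalized by the spectral norms and ||x||
    scores = forward(x)
    margin = scores[y] - np.max(np.delete(scores, y))
    lipschitz = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)  # product of spectral norms
    return margin / (lipschitz * np.linalg.norm(x))

x, y = rng.standard_normal(32), 3
print(normalized_margin(x, y))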
This post is going to summarize the contribution of [1] in some detail once I have the time to write it up. Till then, also see [2].
[1] P. Bartlett, D. J. Foster, and M. Telgarsky, "Spectrally-normalized margin bounds for neural networks," 2017. [https://arxiv.org/abs/1706.08498]
[2] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, “A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks,” pp. 1–7, 2017. [https://arxiv.org/pdf/1707.09564.pdf]
Text
Empirical Rademacher Complexities of Neural Networks
Zhang et al. [1] observe that standard deep neural networks can achieve zero training error even on datasets with random labels. Obviously, such models cannot generalize, which raises the question why deep neural networks do generalize in practice.
An established tool to derive generalization properties of function classes is the Rademacher complexity. The empirical Rademacher complexity of a function class $\mathcal F$ given a data sample $x_1,\ldots,x_n$ is defined as $$R_n(\mathcal F) = \mathbb E_\sigma \left[\sup\limits_{f\in \mathcal F}\frac 1 n \sum\limits_{i=1}^n \sigma_i f(x_i)\right]$$ where the expectation is with respect to the Rademacher distribution of $\sigma_i \in \{-1,+1\}$. With probability $1-\delta$, the risk of a classifier $f$ under a loss function $l$ with $0 \leq l(y,y') \leq 1$ is related to the empirical risk on a dataset of size $n$, $$E_n(l(f(x),y))=\frac 1 n \sum\limits_{i=1}^n l(f(x_i),y_i),$$ via $$\mathbb E_{(x,y) \sim D}\, l(f(x),y) \leq E_n(l(f(x),y)) + 2 R_n(l \circ \mathcal F) + 3\sqrt{\frac {\ln \frac 2 \delta} {2n}}.$$
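As a toy illustration (my own sketch, unrelated to [1]), the empirical Rademacher complexity can be estimated by Monte Carlo sampling of the sign vectors $\sigma$; for the simple class of linear functions with bounded weight norm the inner supremum even has a closed form.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # a fixed data sample x_1, ..., x_n
B = 1.0                             # weight-norm bound of the linear class

def empirical_rademacher_linear(X, B, num_draws=1000):
    # Monte Carlo estimate of R_n for {x -> w.x : ||w||_2 <= B};
    # for this class the supremum is (B / n) * ||sum_i sigma_i x_i||_2
    n = X.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(num_draws, n))
    sups = B / n * np.linalg.norm(sigmas @ X, axis=1)
    return sups.mean()

print(empirical_rademacher_linear(X, B))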
Now what does this mean for deep neural networks? Zhang et al. point out that the empirical Rademacher complexity essentially contains their experiment of training a neural net on real data with random labels. We can verify that the supremum $\sup\limits_{f\in \mathcal F}\frac 1 n \sum\limits_{i=1}^n \sigma_i l(f(x_i),y_i)$ is achieved when $f$ predicts correctly for $\sigma_i=-1$ and as badly as possible for $\sigma_i=+1$. If a neural network can learn arbitrary labels, it can learn the labels that achieve the highest possible loss. Let us assume that the worst loss is 1 and the best loss is 0. Then the empirical Rademacher complexity of the loss function class $l \circ \mathcal F$ is $0.5$. Plugging this into the risk bound above yields a useless bound that only tells us that the expected loss of our function $f$ is smaller than 1, which is trivially true as the maximum loss value is 1.
A slightly more rigorous argument can be made if we limit our analysis to specific network architectures. For instance, if we use only ReLU activations $a(x) = \max \{0,x\}$, we can characterize the function class $\mathcal F$ as a subset of all piecewise linear functions. We consider a binary classification problem with classification rule $\mbox{sign}(f(x))$. Arora et al. [2] show that such a neural network with $k$ hidden layers and $s$ artificial neurons represents a piecewise linear function with at most $\left(\frac{2s}k\right)^k$ pieces. This number is often larger than $n$, hence $\mathcal F$ contains the piecewise linear function that realizes the worst possible loss. By the results of [2], a network with $k$ hidden layers needs at least $\frac 1 2 k n^{\frac 1 k} -1$ neurons to express a piecewise linear function with $n$ pieces. Every class of ReLU networks that fulfills this requirement can attain the empirical Rademacher complexity of $0.5$.
Does this mean all efforts to bound empirical Rademacher complexities of neural networks are doomed? Not necessarily: we can still use the analysis to identify factors that influence generalization, develop regularization schemes based on these factors, or restrict the function class expressed by deep nets to guarantee generalization with high probability.
[1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in ICLR, 2017. [https://openreview.net/pdf?id=Sy8gdB9xx]
[2] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding Deep Neural Networks with Rectified Linear Units,” pp. 1–21, 2016. [https://arxiv.org/pdf/1611.01491.pdf]
Text
The one with the REINFORCE baseline
The REINFORCE algorithm minimizes $R(\theta) := \mathbb E_{a \sim \pi_\theta} L(a)$ with respect to $\theta$ by gradient descent. The gradient with respect to $\theta$ can be expressed as
$$\nabla_\theta R(\theta) = \mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot L(a) \right]$$
using the little trick $\nabla \pi(a) = \pi(a)\nabla\log \pi(a)$.
Since $\sum\limits_a \pi_\theta(a) \nabla\log \pi_\theta(a) = \sum\limits_a \nabla_\theta \pi_\theta(a) = \nabla_\theta 1 = 0$, we can subtract a constant baseline $b$ from the reward [1] and optimize using the gradient estimator $$\nabla_\theta R(\theta) = \mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot (L(a) - b) \right].$$ Choosing a baseline is important in practice. Usually, we choose the $b$ that minimizes the variance of the estimator, but a simpler approach is to use the expected loss $b = \mathbb E_{a \sim \pi_\theta} L(a)$.
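As a minimal sketch (my own, with a made-up categorical policy and made-up per-action losses): for a softmax policy $\pi_\theta = \mathrm{softmax}(\theta)$ the score $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector of $a$ minus $\pi_\theta$, so the estimator with and without the baseline can be compared directly.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(4)        # logits of a categorical policy over 4 actions
L = np.array([1.0, 0.2, 0.8, 0.5])    # hypothetical per-action losses

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_gradient(theta, num_samples=10000, use_baseline=True):
    # Monte Carlo estimate of grad_theta E_{a~pi}[L(a)], optionally with baseline b = E[L(a)]
    pi = policy(theta)
    actions = rng.choice(len(pi), size=num_samples, p=pi)
    b = (pi * L).sum() if use_baseline else 0.0
    scores = np.eye(len(pi))[actions] - pi   # score function: one-hot(a) - pi
    return (scores * (L[actions] - b)[:, None]).mean(axis=0)

# both estimates have the same mean, but the baseline version has much lower variance
print(reinforce_gradient(theta, use_baseline=False))
print(reinforce_gradient(theta, use_baseline=True))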
Choosing a good baseline is particularly crucial in the setting of [2], where we jointly optimize a policy network $\pi_\theta$ and a network that computes the loss $L_\theta$. The gradient for the joint objective is a sum of two components $$\mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot L_\theta(a) + \nabla_\theta L_\theta(a)\right].$$ We run into problems when $L_\theta(a)$ is large: the first term then dominates and the gradient overall is large, so we end up optimizing only over the policy. While this formulation is mathematically sound, in practice it does not work. After a few gradient steps (often only one) the policy converges to a deterministic policy. The network is then stuck with this bad policy but cannot decrease $L$ on it, since the gradient is still dominated by the policy term rather than the loss term.
In this scenario, subtracting the baseline $b = \mathbb E_{a \sim \pi_\theta} L_\theta(a)$ can help. If the variance of $L_\theta(a)$ is small, we no longer multiply the policy gradient by large loss values but only by small centered deviations $L_\theta(a) - b$.
If subtracting the baseline is not enough, we can manually scale down the policy-gradient term by a small constant. However, it is unclear how to justify this from a theoretical perspective.
[1] Peters, Jan, and Stefan Schaal. "Reinforcement learning of motor skills with policy gradients." Neural Networks 21.4 (2008): 682-697.
[2] https://digitalehipsterkladde.tumblr.com/post/154546721174/learning-models-with-reinforcement
Text
Learning Models with Reinforcement
Most neural models have a static computation graph: the model performs the same computations $\hat y(x)$ regardless of the input $x$. In a sense this is also true for recurrent neural networks: while the computation graph grows linearly as it is unfolded over time, each timestep consists of the same sequence of computations $\hat y_t(x_t,\hat y_{1},\ldots,\hat y_{t-1})$.
However, we can also train models that vary their computation graph based on the input. Consider the following example:
\begin{equation} \hat y(x) = \left\{\begin{array}{lr} f(x) & \mbox{ if }a = F \mbox{ for }a \sim \pi(x) \newline g(x) & \mbox{otherwise} \end{array}\right. \end{equation}
where $\pi$ is a probability distribution over the space of events $\{F,G\}$. Obviously, this is an abuse of mathematical notation, since $\hat y$ is no longer a function but the result of a stochastic process that depends on $\pi$. Now $f$ and $g$ might be neural networks with parameters $\theta_f,\theta_g$. If we want to minimize a loss function of the output of $\hat y$ with respect to these parameters $\theta$, we can do so using stochastic gradient descent.
Formally, we minimize
\begin{equation} \min_\theta \mathbb E_{a \sim \pi}\left[L(\hat y_\theta(x,a))\right] \end{equation}
where $\hat y_\theta(x,a)$ is the deterministic version of $\hat y$ that does not sample $a$ but is given its choice as a parameter. The gradient $\nabla_\theta \mathbb E_{a \sim \pi}\left[L(\hat y_\theta(x,a))\right]=\mathbb E_{a \sim \pi}\left[\nabla_\theta L(\hat y_\theta(x,a))\right]$ can be estimated by sampling actions $a \sim \pi$. For these samples we can compute the gradient using backpropagation applied to either $f$ or $g$ depending on the choice of $a$ and average over the set of samples. If the set of actions is small and the probability distribution $\pi$ is known, we can compute the expected value exactly without the need to sample.
What happens if $\pi$ is also parametrized by a neural network? That is, we consider $\pi_\theta(a \mid x)$, which computes probabilities for all actions in a set of possible actions $A$ given an example $x$. In the example above this set contains only two actions, $F$ and $G$. If the space of possible actions is small, we can still compute the exact expectation and use standard backpropagation to find a good policy. This is for instance done by Kontschieder et al. [1] in order to learn decision trees with neural network nodes: they define a tree structure and learn the decision rules by backpropagation through the expected output of the network.
In order to learn a good policy $\pi_\theta(a \mid x)$ when the space of possible actions is too large to compute the exact expected loss, we apply reinforcement learning, more specifically the REINFORCE algorithm [2,3].
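Here is a minimal sketch (my own, in PyTorch, with made-up networks and data) of this exact-expectation approach for the two-branch example above; the expected loss sums over both branches, so one backward pass updates $f$, $g$, and the policy network at once.
import torch

x = torch.randn(8, 16)                       # a batch of hypothetical inputs
y = torch.randn(8, 1)                        # regression targets

f = torch.nn.Linear(16, 1)                   # branch F
g = torch.nn.Linear(16, 1)                   # branch G
policy = torch.nn.Sequential(torch.nn.Linear(16, 2), torch.nn.Softmax(dim=-1))
params = list(f.parameters()) + list(g.parameters()) + list(policy.parameters())
opt = torch.optim.SGD(params, lr=0.1)

pi = policy(x)                               # probabilities for actions F and G
loss_f = (f(x) - y).pow(2)                   # per-example loss of branch F
loss_g = (g(x) - y).pow(2)                   # per-example loss of branch G
expected_loss = (pi[:, :1] * loss_f + pi[:, 1:] * loss_g).mean()

opt.zero_grad()
expected_loss.backward()                     # gradients flow into f, g and the policy
opt.step()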
We want to minimize $$\mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \sum\limits_{a} \pi_\theta(a) \cdot L(\hat y_\theta(x,a))$$ by gradient descent. Thus we have to compute the gradient with respect to $\theta$. Using the product-rule of calculus, we obtain $$\nabla_\theta \mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \sum\limits_a \nabla_\theta \pi_\theta(a) \cdot L(\hat y_\theta(x,a)) + \pi_\theta(a) \cdot \nabla_\theta L(\hat y_\theta(x,a))$$ and using the trick $\nabla \pi(a) = \pi(a)\nabla\log \pi(a)$ we obtain $$\nabla_\theta \mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \sum\limits_a \pi_\theta(a) \nabla_\theta \log \pi_\theta(a) \cdot L(\hat y_\theta(x,a)) + \pi_\theta(a) \cdot \nabla_\theta L(\hat y_\theta(x,a))$$ which can be rewritten as $$\nabla_\theta \mathbb E_{a \sim \pi_\theta}\left[L(\hat y_\theta(x,a))\right] = \mathbb E_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \cdot L(\hat y_\theta(x,a)) + \nabla_\theta L(\hat y_\theta(x,a))\right]$$
This expectation can be empirically estimated by averaging over a set of samples.
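A sketch of how this estimator is typically obtained from automatic differentiation (again my own illustration with made-up networks, not a reference implementation): one builds the surrogate objective $\log \pi_\theta(a)\cdot L + L$, where the loss in the first term is detached so that no gradient is taken through it, and lets backpropagation do the rest.
import torch

x, y = torch.randn(8, 16), torch.randn(8, 1)               # hypothetical data, as above
f, g = torch.nn.Linear(16, 1), torch.nn.Linear(16, 1)       # the two branches
policy = torch.nn.Sequential(torch.nn.Linear(16, 2), torch.nn.Softmax(dim=-1))
params = list(f.parameters()) + list(g.parameters()) + list(policy.parameters())
opt = torch.optim.SGD(params, lr=0.1)

dist = torch.distributions.Categorical(probs=policy(x))
a = dist.sample()                                           # one sampled action per example, 0 = F, 1 = G
branch_loss = torch.where(a.unsqueeze(1) == 0, (f(x) - y).pow(2), (g(x) - y).pow(2))

# surrogate whose gradient is exactly the estimator derived above:
# grad log pi(a) * L (loss detached) + grad L (through the chosen branch)
surrogate = dist.log_prob(a).unsqueeze(1) * branch_loss.detach() + branch_loss

opt.zero_grad()
surrogate.mean().backward()
opt.step()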
References
[1] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulò, "Deep neural decision forests," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1467–1475.
[2] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992.
[3] http://www.scholarpedia.org/article/Policy_gradient_methods
Text
Mathematician
Well this is basically just a test. But I would really love to have some formulas in this blog, like $f(x)=x^2$ and some other stuff.
$$\frac{\partial f(x)}{\partial x} = 2x.$$