tag:blogger.com,1999:blog-9707590.post5787904569009894000..comments 2017-04-27T16:00:58.211-07:00
Comments on Random Ponderings: A Brief Overview of Deep Learning
Yisong Yue https://plus.google.com/102868941298496735783 noreply@blogger.com
Blogger27125
tag:blogger.com,1999:blog-9707590.post-3532879967275136239 2017-04-27T16:00:58.211-07:00 2017-04-27T16:00:58.211-07:00
Your website is really cool and this is a great inspiring article.<br /><a href="http://www.ihtmlvault.com/basics-of-html-that-you-need-to-learn-seo/" rel="nofollow">HTML basics</a>
naila naz http://www.blogger.com/profile/07586960217816126563 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-1956509210981399816 2016-11-03T08:51:37.390-07:00 2016-11-03T08:51:37.390-07:00
Hello all knowing Yisong Yue. Great article!<br /><br />"Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension."<br /><br />Is it bad to also use log(1 + x) when the input dimension doesn't vary a lot? If yes, why?
Logical Deduction http://www.blogger.com/profile/05675460953195321023 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3358349321133094352 2016-01-01T16:44:01.460-08:00 2016-01-01T16:44:01.460-08:00
Great post, thanks!<br /><br />I have a question regarding preprocessing that I would love for you or Prof. Hinton to share your thoughts on. You write the following:<br /><br />"It is essential to center the data so that its mean is zero and so that the variance of each of its dimensions is one. Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension. Basically, it’s important to find a faithful encoding of the input with zero mean and sensibly bounded dimensions. Doing so makes learning work much better."<br /><br />This makes sense. The issue is that my dataset is highly sparse, which makes it difficult to obtain both unit variance and sensibly bounded dimensions. To obtain unit variance I must divide by the std. deviation, which is roughly 0.25 for each channel, making around 30-50% of my values in each channel fall outside of what I would call a "sensibly bounded interval" such as [-3,3].<br /><br />My dataset consists of roughly 1.4 million observations of non-negative data, and I train on it using a ConvNet with 8 layers with weights (conv. layers and dense layers). I am not modeling images or text (the typical uses of ConvNets). One can imagine that 90% of my values are 0 and the rest are uniformly distributed on ]0,10].<br /><br />I have the following questions:<br /><br />1. Based on a knowledge of SGD (and perhaps ConvNets), is it more important to aim for σ=1 or for a sensible interval such as [-3,3]?<br /><br />2. What useful transformations could I apply to both achieve σ=1 and keep my values in a sensible interval?<br /><br />Note that I have asked the same question on http://stats.stackexchange.com/questions/188925/feature-standardization-for-convolutional-network-on-sparse-data in case you don't have time to reply. 
Feel very free to answer there instead.
Bjarke noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3993737905299718435 2015-11-26T04:24:30.173-08:00 2015-11-26T04:24:30.173-08:00
Wow... Great blog... I have learnt so many things about deep learning from this blog... Thank you
Free equity tips http://www.researchvia.com/equity-tips/ noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3737641990222152619 2015-04-29T02:38:17.850-07:00 2015-04-29T02:38:17.850-07:00
So, deep learning is essentially a self-created solution lookup table?
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-8306895519265178312 2015-04-11T23:59:30.414-07:00 2015-04-11T23:59:30.414-07:00
Gary Bradski: What the heck are you talking about?
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-5693877024126709227 2015-03-12T18:33:07.230-07:00 2015-03-12T18:33:07.230-07:00
BTW, as "philosophical" comments from all comers seem to be allowed - I always think that the fact that these nets work just shows that we look at stuff that is adapted to our vision - in other words, we have built alphabets for our hands, eyes, and quills; we have created roads and cars together; most of what we aim to recognize out there has actually already evolved in some way to be recognized. And so we should put stuff like astro data or even stock market data into a different category, as it has not necessarily evolved to be understood.<br /><br />Edmund
Edmund Ronald Ph. 
D. http://www.blogger.com/profile/04377336351536210266 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-7529534037412961842 2015-03-12T18:24:39.154-07:00 2015-03-12T18:24:39.154-07:00
Yisong - I've taken the liberty of linking to this post from my own Deep Neural blog.<br /><br />I would like to thank Ilya for giving away all that sacred knowledge for free :)<br /><br />Ilya, do you think recurrent nets will now be much more commonly used thanks to your work? I believe it was possible in the past to train them by reinforcement.<br /><br />Edmund<br />http://deepneural.blogspot.fr/2015/03/go-and-read-ilya-sutskevers.html
Edmund Ronald Ph. D. http://www.blogger.com/profile/04377336351536210266 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3532535673533732071 2015-01-30T01:05:50.117-08:00 2015-01-30T01:05:50.117-08:00
How about the opinion that Deep Neural Networks have no similarities to biological neural networks, and thus naming them Neural is a deep offense to biology? Meaning that DNNs can be easily cheated, are not robust to clutter and noise, have training results limited to specific data sets, etc.?
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-4475570609384285112 2015-01-24T19:10:15.563-08:00 2015-01-24T19:10:15.563-08:00
Nice summary. One thing you've left out of the "ten step bio compute" speculation can be seen in robotics and in your dreams: ongoing (IMO causal-dynamic) state. The feedforward networks update a causal model of the world. Robots represent themselves in the world and use simulations in the model for planning and simulated learning. We do the same.<br /><br />There is a lot more feedback than feedforward in the brain. Wires are expensive. 
The brain does this IMO to support this simulated world where perception actually takes place -- in the consensus of this abstracted physical model and its deep NN inputs. So, 10 steps to feed data in, but really many more steps support the ongoing recognition.<br /><br />Build a robot and you'll end up building some version of this interior causal model. I add dynamics because, unlike our NNs, our brain is mostly dynamically stable, not absolutely. Learning is never turned off; most of the stable patterns are there because the stability comes from the external world. Again, IMO. There are some key learning critical periods to bootstrap up the categorical structure that drives the internal model, again, IMHO.<br /><br />Gary
Gary Bradski http://www.blogger.com/profile/06981290138335322319 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-8363528775570616770 2015-01-19T18:08:07.831-08:00 2015-01-19T18:08:07.831-08:00
That is right. It is not about one specific algorithm, but about the architecture of the net being deep! There is even theory which only deals with the depth aspect, irrespective of how you learn. Of course everyone would like to have *the algorithm*, but we are not there yet!
Yoshua Bengio http://www.blogger.com/profile/01024092022366764581 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3854052916925984983 2015-01-19T04:50:45.000-08:00 2015-01-19T04:50:45.000-08:00
Thanks for the response. There is a lot of terminology flying around on DL; you and LeCun are all about convnets, Hinton's papers are mostly about RBMs, and they are all considered "deep", utilizing backprop in some shape or form. I guess I thought, naively, that when there was a breakthrough in DL it was one specific method. It now looks like the breakthrough is really about the net being deep (hence the name); in the network implementation a multitude of approaches can be utilized. 
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-5756287140070040567 2015-01-19T04:32:25.888-08:00 2015-01-19T04:32:25.888-08:00
Stacks of unsupervised feature learning layers are STILL useful when you are in a regime with insufficient labeled examples, for transfer learning or domain adaptation. It is a regularizer. But when the number of labeled examples becomes large enough, the advantage of that regularizer becomes much smaller. I suspect however that this story is far from over! There are other ways besides pre-training of combining supervised and unsupervised learning, and I believe that we still have a lot to improve in terms of our unsupervised learning algorithms.
Yoshua Bengio http://www.blogger.com/profile/01024092022366764581 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-4823615745274157689 2015-01-19T02:58:15.172-08:00 2015-01-19T02:58:15.172-08:00
So is this post out of date?<br /><br />http://deeplearning.net/tutorial/DBN.html<br /><br />Or Kevin Murphy's latest book (he himself said it might be out of date, but) where he talks about stacked RBMs.<br /><br />And I just saw this post<br /><br />https://www.paypal-engineering.com/2015/01/12/deep-learning-on-hadoop-2-0-2/<br /><br />where a data scientist at PayPal implemented deep learning on Hadoop, using a stack of RBMs.<br /><br />Are you saying all these are behind the most recent state of the art?
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-8810638463579070416 2015-01-18T22:24:14.095-08:00 2015-01-18T22:24:14.095-08:00
It's obvious that humans learn from a very wide variety of information sources, and that input-output examples are only one of many such information sources. Humans get most of their "labels" indirectly, and only a small fraction of human learning is done with explicit input-output examples.
Ilya Sutskever http://www.blogger.com/profile/12014815911370408456 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-4797061693709935109 2015-01-18T18:15:45.717-08:00 2015-01-18T18:15:45.717-08:00
Hi Ilya, very nice post, thanks.<br /><br />You say:<br />> Make sure that you have a high-quality dataset of input-output examples that is large, representative, and has relatively clean labels. Learning is completely impossible without such a dataset.<br /><br />My question is, do you think that humans really learn from this type of data? I find it implausible, since it seems there are many things humans can learn from a handful of examples. On the other hand, I think I remember Yoshua arguing that, for vision, e.g., even if you can learn about some new object from a few examples, it is only because of the huge amount of "training data" we have processed in childhood.<br />Anyway, do you think there is some difference between the data humans can learn from, and what LDNNs can learn from?
Greg V http://www.blogger.com/profile/08362278486243625798 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-4823009800104738767 2015-01-18T14:25:00.365-08:00 2015-01-18T14:25:00.365-08:00
And initialization and depth.
Yoshua Bengio http://www.blogger.com/profile/01024092022366764581 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-2537439419172904664 2015-01-18T14:17:41.780-08:00 2015-01-18T14:17:41.780-08:00
No one uses RBMs any more. All the state-of-the-art models for speech and object recognition are standard feedforward neural networks, trained with backprop. The only algorithmic differences from the 80s are ReLU and Dropout.
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-2956184711680930784 2015-01-18T06:29:02.201-08:00 2015-01-18T06:29:02.201-08:00
You seem to equate 80s NNs with today's NNs. Beyond computers being slow and there being less data, the NN tech has changed. Instead of punishing/readjusting each neuron, we now have layers of RBMs which do their own learning, separately. 
This is an entirely different approach than what we had during the 80s.
Anonymous noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-7595790474693239595 2015-01-17T11:20:30.723-08:00 2015-01-17T11:20:30.723-08:00
Olivier: basically, if you focus on the training data, you risk overfitting. If your training set is enormous in comparison to the size of the model, then overfitting is not a concern. But if your training set is smaller than the model, then as you keep training, eventually the validation (and hence test) error will start increasing. You definitely don't want to continue training once this happens.<br /><br />However, it has been observed that a 2x-10x reduction in the LR results in a very rapid reduction in both training and validation (and hence test) errors. Which is why, when you see that validation error no longer makes any progress with the large LR (so it may start increasing soon, which is bad), you reduce the LR, to get an additional gain.<br /><br />I am pretty sure that this effect will hold true for convex problems as well -- this particular learning rate schedule attempts to find a parameter setting with the lowest validation error, which is something we care about much more than training error, which is relevant only to the extent it is correlated with the test error.<br /><br />Optimization and generalization are intertwined, and it is possible to optimize the training set better while doing worse on the test set.<br /><br />Michael: it is possible that some types of additional structures will be helpful, especially if they are very general or easily trainable. As for biological realism, it's a matter of opinion. 
I think that our artificial neural networks have much in common with biological neural networks, but people who know more about real neurons may disagree.
Ilya Sutskever http://www.blogger.com/profile/12014815911370408456 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-7959982753357844557 2015-01-15T21:51:40.438-08:00 2015-01-15T21:51:40.438-08:00
Ilya, what do you think about Geoff Hinton's capsule-based neural networks? Do you agree we need the additional structure, as opposed to "simple" layers of neurons?<br />Do you think we are gradually moving towards more biologically realistic models of information processing?
Michael Klachko noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-607037496989840543 2015-01-15T14:19:10.159-08:00 2015-01-15T14:19:10.159-08:00
Very clear and interesting blog post. Thanks Ilya for taking the time to write it down.<br /><br />I have a question about the learning rate schedule you recommend. For convex models trained with SGD, finding the optimal learning rate schedule is purely an optimization problem, and therefore the optimal learning rate schedule should only depend on the training data and the loss function itself. It is my understanding that there should be no need to use a held-out validation set to find the optimal learning rate schedule in that case (leaving overfitting issues aside, assuming we have enough labeled samples to train on).<br /><br />However, for deep nets, you explicitly mention that you need to decay the learning rate based on the lack of improvement of the loss computed on some held-out validation set, rather than using the evolution of the cost on the training set.<br /><br />What can go wrong when you do the learning rate scheduling based on the training cost instead of using a held-out validation set? 
Would the network just converge slower to an equivalent solution (assuming you still use the validation set for early stopping but not for the learning rate scheduling)? Or is it expected to converge to a significantly worse solution (e.g. by getting stuck on a plateau near a saddle point more easily)?<br /><br />It seems that for deep networks, it might no longer be possible to separate the optimization problem from the learning / estimation problem. Do you have more intuitions to share on this topic?
Olivier Grisel http://www.blogger.com/profile/05751090858946703320 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3968531387730991055 2015-01-14T20:59:07.809-08:00 2015-01-14T20:59:07.809-08:00
I basically agree with everything. Abstract independent hidden factors of real data are unquestionably part of the explanation as to why regular deep nets succeed as well as they do in practice. At the same time, I think that independent hidden factors are not the whole story, and that there may be models whose representations will be so different from the ones we are dealing with now that we may not think of them as conventional distributed representations at all (although they will necessarily be distributed, strictly speaking).
Ilya Sutskever http://www.blogger.com/profile/12014815911370408456 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-3280177976281700151 2015-01-14T20:37:24.633-08:00 2015-01-14T20:37:24.633-08:00
Ok.<br /><br />It is what you are calling 'powerful' which encompasses my notion of 'better prior which gives rise to better generalization' (in addition to having enough capacity). I am not 100% sure why this prior works so well, but I have a theory, which I have expressed in various papers. The idea is that the data was generated by different factors, almost independently. This is what allows you to learn about the effect of one of the factors, without having to know everything about all the other factors and their exponentially large number of interactions. For example, I see images of people, and one factor is gender, another is the hair color, another is age, and another is wearing glasses. You see that you can build a detector for each of these factors without really needing to see all the configurations of all the other factors in the data. This assumption is really equivalent to saying that a good representation is a distributed one, in terms of having a good generalization from few examples. But to have the right representation, you need enough depth (think about the depth needed, in my example with images of persons), to extract these almost independent factors or features. So you need deep distributed representations. This assumption also arises naturally if you first assume something that appears very straightforward: the input data we observe are the effects of some underlying causes, and these causes are marginally related to each other in simple ways (e.g. independent causes being the extreme case), while the things to predict are more directly connected to causes (whereas the inputs are effects). This assumption also suggests that unsupervised pre-training and semi-supervised learning of representations will work well, and they do (when there is not enough labeled data for supervised learning to do it all). 
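The counting behind the independent-factors argument earlier in this comment can be made explicit. As a rough sketch, assume (purely for illustration; the comment does not fix these details) that the data is generated by $k$ binary factors:

```latex
\#\{\text{joint configurations}\} = 2^{k},
\qquad
\#\{\text{per-factor detectors}\} = k .
```

For the four factors named in the example (gender, hair color, age, glasses), treated here as binary for simplicity, that is $2^{4} = 16$ joint configurations against 4 detectors, and the gap grows exponentially with $k$.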
It would make sense that brains have evolved to find the causes of what we observe, by the way...<br /><br />Another note: in principle (and maybe not in practice), I argue that even a shallow neural net has a potentially very serious advantage over a Gaussian kernel SVM. Some functions representable very efficiently by the neural net can require an exponential number of support vectors for the kernel SVM (this idea is found in several papers, including in the most crisp way in our last NIPS paper on the number of regions associated with shallow and deep rectifier nets).<br /><br />-- Yoshua Bengio
Yoshua Bengio http://www.blogger.com/profile/01024092022366764581 noreply@blogger.com
tag:blogger.com,1999:blog-9707590.post-2976554079112843802 2015-01-14T20:19:58.556-08:00 2015-01-14T20:19:58.556-08:00
Thanks!<br /><br />I think that we are expressing similar concepts in different words, and I suspect that we use the term generalization in slightly different ways.<br /><br />I don't see a particular difference between a shallow net with a reasonable number of neurons and a kernel machine with a reasonable number of support vectors (it's not useful to consider kernel machines with exponentially many support vectors, just like there isn't a point in considering the universal approximation theorem, as both require exponential resources) --- both of these models are nearly identical, and thus equally limited in power. Both of these models will be inferior to an LDNN with a comparable number of parameters precisely because the LDNN can do computation and the shallow models cannot. The LDNN can sort, do integer multiplication, compute analytic functions, decompose an input into small pieces and recombine it later in a higher level representation, partition the input space into an exponential number of non-arbitrary tiny regions, etc. Ultimately, if the LDNN has 10,000 layers, then it can, in principle, execute any parallel algorithm that runs in fewer than 10,000 steps, giving this LDNN an incredible expressive power. Thus, I don't think that the argument in the article suggests that huge kernel machines should be able to solve these hard problems --- they would need to have exponentially many support vectors.<br /><br />Although I didn't define it in the article, generalization (to me) means that the gap between the training and the test error is small. So for example, a very bad model that has similar training and test errors does not overfit, and hence generalizes, according to the way I use these concepts. 
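This usage can be written compactly. The notation below is assumed here for illustration and does not appear in the comment: $E_{\mathrm{train}}(f)$ and $E_{\mathrm{test}}(f)$ denote the training and test errors of a model $f$.

```latex
\mathrm{gap}(f) = E_{\mathrm{test}}(f) - E_{\mathrm{train}}(f),
\qquad
f \text{ generalizes} \iff \mathrm{gap}(f) \approx 0 .
```

With made-up numbers, a model with $E_{\mathrm{train}} = 0.40$ and $E_{\mathrm{test}} = 0.41$ generalizes in this sense even though it is a poor model, while one with $E_{\mathrm{train}} = 0.01$ and $E_{\mathrm{test}} = 0.30$ does not.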
It follows that generalization is easy to achieve whenever the capacity of the model (as measured by the number of parameters or its VC-dimension) is limited --- we merely need to use more training cases than the model has parameters / VC dimension. Thus, the difficult part is to get a low training error.<br /><br />Now why is it that models that can do computation are in some sense "right" compared to models that cannot? Why is the inductive bias captured by an LDNN "good", or even "correct"? Why do LDNNs succeed on the natural problems that we often want to solve in practice? I think that it is a very nontrivial fact about the universe, and is a bit like asking "why are typical recognition problems solvable by an efficient computer program". I don't know the answer but I have two theories: 1) if they weren't solvable by an efficient computer program, then humans and animals wouldn't be solving them in the first place; and 2) there is something about the nature of physics and possibly even evolution that gives rise to problems that are usually solvable by efficient algorithms. But that is idle speculation on my part.
Ilya Sutskever http://www.blogger.com/profile/12014815911370408456 noreply@blogger.com
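Returning to the learning rate schedule discussed in the exchange between Olivier Grisel and Ilya Sutskever above: the rule "reduce the LR once validation error stops making progress" can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used by anyone in the thread. The function name `decay_on_plateau`, the `patience` parameter, and the example error values are all hypothetical; the thread only says that a 2x-10x reduction is applied when validation error no longer improves.

```python
def decay_on_plateau(val_errors, lr0=0.1, factor=0.5, patience=2):
    """Return the learning rate that would be used at each epoch.

    val_errors: validation error measured after each epoch (hypothetical data).
    factor:     multiplicative decay; 0.5 is a 2x reduction, 0.1 a 10x one.
    patience:   assumed number of consecutive non-improving epochs to wait
                before decaying (not specified in the thread).
    """
    lr = lr0
    best = float("inf")   # lowest validation error seen so far
    stale = 0             # epochs since the last improvement
    schedule = []
    for err in val_errors:
        schedule.append(lr)          # rate in effect during this epoch
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:    # validation error has plateaued
                lr *= factor
                stale = 0
    return schedule


if __name__ == "__main__":
    # Made-up validation errors: progress, a plateau, then progress again.
    errors = [0.50, 0.40, 0.35, 0.35, 0.36, 0.30, 0.31, 0.32]
    print(decay_on_plateau(errors))
```

The decision uses validation error rather than training cost, which matches the reasoning in the reply to Olivier: the schedule is aimed at the lowest validation error, not the lowest training error.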