[Scikit-learn-general] multilayer perceptron questions
David Marek
2012-05-14 22:12:34 UTC
Hi,

I have worked on the multilayer perceptron and I've got a basic
implementation working. You can see it at
https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp The most
important part is the sgd implementation, which can be found here
https://github.com/davidmarek/scikit-learn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx

I have encountered a few problems and I would like to know your opinion.

1) There are classes like SequentialDataset and WeightVector which are
used in sgd for linear_model, but I am not sure if I should use them
here as well. I have to do more with samples and weights than just
multiply and add them together. I wouldn't be able to use numpy
functions like tanh and do batch updates, would I? What do you think?
Am I missing something that would help me do everything I need with
SequentialDataset? I implemented my own LossFunction because I need a
vectorized version, I think that is the same problem.

2) I used Andreas' implementation as an inspiration and I am not sure
I understand some parts of it:
* Shouldn't the bias vector be initialized with ones instead of
zeros? I guess there is no difference.
* I am not sure why the bias is updated with:
bias_output += lr * np.mean(delta_o, axis=0)
shouldn't it be:
bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
* Shouldn't the backward step for computing delta_h be:
delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
where hidden.doutput is the derivative of the activation function for the
hidden layer? (A rough sketch of the whole update step I mean follows below.)
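Something like the following is what I have in mind, in plain numpy (made-up
variable names, not the exact code from my branch; signs depend on how the
deltas are defined):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1)[:, np.newaxis])
    return e / e.sum(axis=1)[:, np.newaxis]

def backprop_step(X, Y, W_hidden, b_hidden, W_output, b_output, lr):
    # one minibatch step for a 1-hidden-layer MLP:
    # tanh hidden units, softmax output, cross-entropy loss
    batch_size = X.shape[0]

    # forward pass
    x_hidden = np.tanh(np.dot(X, W_hidden) + b_hidden)
    x_output = softmax(np.dot(x_hidden, W_output) + b_output)

    # backward pass: with softmax + cross-entropy the output delta is just
    delta_o = x_output - Y
    # hidden delta = backpropagated error times tanh'(activation) = 1 - tanh^2
    delta_h = np.dot(delta_o, W_output.T) * (1.0 - x_hidden ** 2)

    # updates; np.mean over axis 0 already divides by batch_size
    W_output -= lr * np.dot(x_hidden.T, delta_o) / batch_size
    b_output -= lr * np.mean(delta_o, axis=0)
    W_hidden -= lr * np.dot(X.T, delta_h) / batch_size
    b_hidden -= lr * np.mean(delta_h, axis=0)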

I hope my questions are not too stupid. Thank you.

David
Andreas Mueller
2012-05-15 07:37:34 UTC
Hi David.
I'll have a look at your code later today.
Let me first answer your questions about my code.
Post by David Marek
Hi,
2) I used Andreas' implementation as an inspiration and I am not sure
* Shouldn't the bias vector be initialized with ones instead of
zeros? I guess there is no difference.
I am always initializing it with zeros. If you initialize it
with ones, you might get out of the linear part of the
nonlinearity. At the beginning, you definitely want to stay
close to the linear part to have meaningful derivatives.
What would be the reason to initialize with ones?
Btw, there is a paper by Bengio's group on how to initialize
the weights in a "good" way. You should have a look at that,
but I don't have the reference at the moment.
Post by David Marek
bias_output += lr * np.mean(delta_o, axis=0)
bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
By taking the mean, the batch_size doesn't have an influence on the size
of the gradient, if I'm not mistaken.
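For example, np.mean already contains the 1 / batch_size factor, so dividing
by batch_size again would make the step shrink as the batch grows:

import numpy as np

delta_o = np.random.randn(32, 10)      # e.g. batch_size=32, 10 output units
batch_size = delta_o.shape[0]

step_mean = np.mean(delta_o, axis=0)               # what the code does
step_sum = np.sum(delta_o, axis=0) / batch_size    # the same thing, written out
assert np.allclose(step_mean, step_sum)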
Post by David Marek
delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
where hidden.doutput is the derivative of the activation function for the
hidden layer?
Yes, it should be. For softmax and maximum-entropy loss, loads of stuff
cancels and the derivative w.r.t. the output is linear.
Try Wolfram Alpha if you don't believe me ;) I haven't really found a place
with a good derivation for this. It is not very obvious to me.
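If you'd rather convince yourself numerically, here is a quick throwaway
finite-difference check of the "everything cancels" result (just a sketch,
not from any of our code):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot target vector, z the pre-softmax activations
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.RandomState(0)
z = rng.randn(5)
y = np.zeros(5)
y[2] = 1.0

analytic = softmax(z) - y        # what is left after everything cancels
numeric = np.empty_like(z)
eps = 1e-6
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)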
Post by David Marek
I hope my questions are not too stupid. Thank you.
Not at all.

Cheers,
Andy
David Warde-Farley
2012-05-15 14:59:14 UTC
Post by David Marek
Hi,
I have worked on the multilayer perceptron and I've got a basic
implementation working. You can see it at
https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp The most
important part is the sgd implementation, which can be found here
https://github.com/davidmarek/scikit-learn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx
I have encountered a few problems and I would like to know your opinion.
1) There are classes like SequentialDataset and WeightVector which are
used in sgd for linear_model, but I am not sure if I should use them
here as well. I have to do more with samples and weights than just
multiply and add them together. I wouldn't be able to use numpy
functions like tanh and do batch updates, would I? What do you think?
I haven't had a look at these classes myself but I think working with raw
NumPy arrays is a better idea in terms of efficiency.
Post by David Marek
Am I missing something that would help me do everything I need with
SequentialDataset? I implemented my own LossFunction because I need a
vectorized version, I think that is the same problem.
2) I used Andreas' implementation as an inspiration and I am not sure
* Shouldn't the bias vector be initialized with ones instead of
zeros? I guess there is no difference.
If the training set is mean-centered, then absolutely, yes.

Otherwise the biases in the hidden layer should be initialized to
the mean over the training set of -Wx, where W are the initial weights.
This ensures that the activation function is near its linear regime.
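In code, that initialization would look roughly like this (a throwaway sketch
with placeholder data, not code from anyone's branch):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 20) * 5.0       # stand-in for a non-centered training set
n_hidden = 10

W = rng.uniform(-0.1, 0.1, size=(X.shape[1], n_hidden))  # small initial weights
# pick the hidden biases so the mean pre-activation over the training
# set is zero, i.e. the mean over the training set of -Wx
b = -np.mean(np.dot(X, W), axis=0)

# pre-activations are now centered, so tanh/logistic start near their linear part
assert np.allclose(np.mean(np.dot(X, W) + b, axis=0), 0.0)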
Post by David Marek
bias_output += lr * np.mean(delta_o, axis=0)
bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
As Andy said, the former allows you to set the learning rate without taking
into account the batch size, which makes things a little simpler.
Post by David Marek
delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
where hidden.doutput is the derivative of the activation function for the
hidden layer?
Offhand that sounds right. You can use Theano as a sanity check for your
implementation.

David
Mathieu Blondel
2012-05-15 15:16:21 UTC
On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley wrote:
Post by David Warde-Farley
I haven't had a look at these classes myself but I think working with raw
NumPy arrays is a better idea in terms of efficiency.
Since it abstracts away the data representation, SequentialDataset is
useful if you want to support both dense and sparse representations in your
MLP implementation.

Mathieu
David Warde-Farley
2012-05-15 15:44:40 UTC
Post by Mathieu Blondel
On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley wrote:
Post by David Warde-Farley
I haven't had a look at these classes myself but I think working with raw
NumPy arrays is a better idea in terms of efficiency.
Since it abstracts away the data representation, SequentialDataset is
useful if you want to support both dense and sparse representations in your
MLP implementation.
Ah, ok. As long as there are sufficient ways to avoid lots of large
temporaries being allocated, that seems like a good idea.

David
Andreas Mueller
2012-05-15 19:23:55 UTC
Post by Mathieu Blondel
On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley
I haven't had a look at these classes myself but I think working with raw
NumPy arrays is a better idea in terms of efficiency.
Since it abstracts away the data representation, SequentialDataset is
useful if you want to support both dense and sparse representations in
your MLP implementation.
I am not sure if we want to support sparse data. I have no experience
with using MLPs on sparse data.
Could this be done efficiently? The weight vector would need to be
represented explicitly and densely, I guess.

Any ideas?
David Warde-Farley
2012-05-15 20:06:00 UTC
I am not sure if we want to support sparse data. I have no experience with using MLPs on sparse data.
Could this be done efficiently? The weight vector would need to be represented explicitly and densely, I guess.
Any ideas?
People can and do use neural nets with sparse inputs, dense-sparse products aren't usually too bad in my experience. Careful regularization and/or lots of data (a decent number of examples where each feature is non-zero) will be necessary to get good results, but this goes for basically any parametric model operating on sparse inputs.
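Concretely, the sparsity only really matters at the input layer; a rough
sketch of the forward pass with scipy.sparse (made-up shapes, just to
illustrate):

import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
X_dense = rng.rand(200, 1000)
X_dense[X_dense < 0.99] = 0.0      # roughly 1% non-zeros
X = sp.csr_matrix(X_dense)         # sparse inputs

W = rng.randn(1000, 50) * 0.01     # dense hidden weights
b = np.zeros(50)

# the sparse * dense product comes out dense, so everything after the
# first layer is ordinary dense numpy
hidden = np.tanh(X.dot(W) + b)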

Aside: there was interesting work on autoencoder-based pre-training of MLPs with sparse (binary, I think) inputs done by my colleagues here in Montreal. They showed that in the reconstruction step, you can get away with reconstructing the non-zero entries in the original input and a small random sample of the zero entries, and it works just as well as doing the (much more expensive, when the input is high-dimensional) exhaustive reconstruction. Neat stuff.

David
Andreas Mueller
2012-05-15 20:31:16 UTC
Post by David Warde-Farley
I am not sure if we want to support sparse data. I have no experience with using MLPs on sparse data.
Could this be done efficiently? The weight vector would need to be represented explicitly and densely, I guess.
Any ideas?
People can and do use neural nets with sparse inputs, dense-sparse products aren't usually too bad in my experience. Careful regularization and/or lots of data (a decent number of examples where each feature is non-zero) will be necessary to get good results, but this goes for basically any parametric model operating on sparse inputs.
Looking at the SequentialDataset implementation and the algorithms again,
I tend to agree with David (M.) that using numpy arrays might be better.
If we want to support a sparse version, we'd need another implementation
(of the low-level functions).

The SequentialDataset was made for vector x vector operations. Depending on
whether we do mini-batch or online learning, the MLP needs vector x matrix
or matrix x matrix operations. In particular, matrix x matrix is probably
not feasible with the SequentialDataset, and I think even vector x matrix
might be ugly and possibly slow, though I'm not sure there.

What do you think Mathieu (and the others)?

On the same topic: I'm not sure if we decided whether we want minibatch,
batch and online learning. I have the feeling that it might be possible
to do particular optimizations for online learning, and this is the
algorithm that I favor the most.

Comments?

David M., what do you think?

Btw, two comments on your current code:
I think this looks pretty good already. Atm, the tests are failing, though.
Also, I feel like using squared error for classification is a very bad habit
that for some reason survived the last 20 years in some dark corner.

Did you compare timings and results against my implementation?
Once you are pretty sure that the code is correct, you should disable the
boundscheck in Cython, as this can improve speed a lot :)

Cheers,
Andy
Mathieu Blondel
2012-05-16 04:11:40 UTC
On Wed, May 16, 2012 at 5:31 AM, Andreas Mueller
Post by Andreas Mueller
The SequentialDataset was made for vector x vector operations. Depending
on whether we
do mini-batch or online learning, the MLP needs vector x matrix or
matrix x matrix operations.
In particular matrix x matrix is probably not feasible with the
SequentialDataset, though I think
even vector x matrix might be ugly and possibly slow, though I'm not
sure there.
What do you think Mathieu (and the others)?
I think that it is worth investigating the separation between the core
algorithm logic and the data representation dependent parts. SGD used to be
implemented separately for dense and sparse inputs but the rewrite based on
SequentialDataset significantly simplified the source code (but Peter is
the best person to comment on this). David could start by getting the numpy
array based implementation right, then before implementing the sparse
version, investigate how to abstract away the data representation dependent
parts either by using/extending SequentialDataset/WeightVector or by
creating his own utility classes.

Mathieu

PS: When it makes sense, it would be nice if we could strive to add sparse
matrix support whenever we add a new estimator.
Peter Prettenhofer
2012-05-16 07:23:27 UTC
Post by Mathieu Blondel
Post by Andreas Mueller
The SequentialDataset was made for vector x vector operations. Depending
on whether we
do mini-batch or online learning, the MLP needs vector x matrix or
matrix x matrix operations.
In particular matrix x matrix is probably not feasible with the
SequentialDataset, though I think
even vector x matrix might be ugly and possibly slow, though I'm not
sure there.
What do you think Mathieu (and the others)?
I think that it is worth investigating the separation between the core
algorithm logic and the data representation dependent parts. SGD used to be
implemented separately for dense and sparse inputs but the rewrite based on
SequentialDataset significantly simplified the source code (but Peter is the
best person to comment on this). David could start by getting the numpy
array based implementation right, then before implementing the sparse
version, investigate how to abstract away the data representation dependent
parts either by using/extending SequentialDataset/WeightVector or by
creating his own utility classes.
Mathieu
PS: When it makes sense, it would be nice if we could strive to add sparse
matrix support whenever we add a new estimator.
I totally agree
--
Peter Prettenhofer
David Marek
2012-05-16 11:20:40 UTC
On Tue, May 15, 2012 at 10:31 PM, Andreas Mueller
Post by Andreas Mueller
On the same topic: I'm not sure if we decided whether we want minibatch,
batch and online learning.
I have the feeling that it might be possible to do particular
optimizations for online learning, and this
is the algorithm that I favor the most.
Comments?
David M., what do you think?
Well, I am not sure yet what optimizations could be done for online
learning. At first I thought it would be possible to use SequentialDataset
for online learning, but now I don't think it's a good idea to reimplement
the matrix operations that will be needed when we have numpy. If we find
optimizations that would make online learning faster than the other
options, then I'd vote for it. But so far I think the batch_size argument
is ok.
Post by Andreas Mueller
I think this looks pretty good already. Atm, the tests are failing, though.
Also, I feel like using squared error for classification is a very bad habit
that for some reason survived the last 20 years in some dark corner.
Well, the first test should not fail, it's just XOR. The second one is
recognizing hand-written numbers and I don't expect it to be 100%
successful, I am just using it as a simple benchmark. Thank you for
confirming what I thought about my Neural Networks course at university:
they teach 20-year-old things :-D
Post by Andreas Mueller
Did you compare timings and results against my implementation?
Once you are pretty sure that the code is correct, you should disable
the boundscheck in Cython, as this can improve speed a lot :)
I haven't yet, but I will look at it. I have seen boundscheck and other
options used in sgd_fast and will have to try them.

Thanks

David
Peter Prettenhofer
2012-05-16 07:14:44 UTC
Post by Mathieu Blondel
On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley
Post by David Warde-Farley
I haven't had a look at these classes myself but I think working with raw
NumPy arrays is a better idea in terms of efficiency.
Since it abstracts away the data representation, SequentialDataset is useful
if you want to support both dense and sparse representations in your MLP
implementation.
Mathieu
Hi everybody,

sorry for my late reply - Mathieu is correct: The only purpose of
SequentialDataset is to create a common interface to both dense and
sparse representations. It is pretty much tailored to the needs of the
SGD module (as Andy already pointed out). I think if you want to
support both dense and sparse data you'll have to think about such an
abstraction eventually. Maybe it's a good idea to start with a dense
implementation and then we could try to refactor it to support both
dense and sparse inputs using a suitable abstraction.

best,
Peter
--
Peter Prettenhofer
David Marek
2012-05-16 10:15:56 UTC
On Tue, May 15, 2012 at 4:59 PM, David Warde-Farley
Post by David Warde-Farley
Post by David Marek
Hi,
I have worked on the multilayer perceptron and I've got a basic
implementation working. You can see it at
https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp The most
important part is the sgd implementation, which can be found here
https://github.com/davidmarek/scikit-learn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx
I have encountered a few problems and I would like to know your opinion.
1) There are classes like SequentialDataset and WeightVector which are
used in sgd for linear_model, but I am not sure if I should use them
here as well. I have to do more with samples and weights than just
multiply and add them together. I wouldn't be able to use numpy
functions like tanh and do batch updates, would I? What do you think?
I haven't had a look at these classes myself but I think working with raw
NumPy arrays is a better idea in terms of efficiency.
Post by David Marek
Am I missing something that would help me do everything I need with
SequentialDataset? I implemented my own LossFunction because I need a
vectorized version, I think that is the same problem.
2) I used Andreas' implementation as an inspiration and I am not sure
 * Shouldn't the bias vector be initialized with ones instead of
zeros? I guess there is no difference.
If the training set is mean-centered, then absolutely, yes.
Otherwise the biases in the hidden layer should be initialized to
the mean over the training set of -Wx, where W are the initial weights.
This ensures that the activation function is near its linear regime.
Ok, so the rule of thumb is that the bias should be initialized so that the
activation function starts in its linear regime.
Post by David Warde-Farley
Post by David Marek
   bias_output += lr * np.mean(delta_o, axis=0)
   bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
As Andy said, the former allows you to set the learning rate without taking
into account the batch size, which makes things a little simpler.
I see, it's pretty obvious when I look at it now.
Post by David Warde-Farley
Post by David Marek
   delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
   where hidden.doutput is the derivative of the activation function for the
hidden layer?
Offhand that sounds right. You can use Theano as a sanity check for your
implementation.
Thank you David and Andreas for answering my questions. I will look at Theano.

David
Andreas Mueller
2012-05-16 10:22:04 UTC
Hi David.
Did you also see this mail:
http://permalink.gmane.org/gmane.comp.python.scikit-learn/3071
For some reason it doesn't show up in my inbox and you didn't quote it.
So just making sure.

Cheers,
Andy
Post by David Marek
Thank you David and Andreas for answering my questions. I will look at Theano.
David Marek
2012-05-16 10:29:29 UTC
Hi

Yes, I did. I am using Gmail, so I just quoted one mail; I didn't want to
answer each mail separately when they are so similar. Sorry, I will try
to be more specific in my quoting.

David
Post by Andreas Mueller
Hi David.
http://permalink.gmane.org/gmane.comp.python.scikit-learn/3071
For some reason it doesn't show up in my inbox and you didn't quote it.
So just making sure.
Cheers,
Andy
Post by David Marek
Thank you David and Andreas for answering my questions. I will look at Theano.
Andreas Mueller
2012-05-16 10:31:16 UTC
Post by David Marek
Hi
Yes, I did. I am using gmail so I just quote one mail, didn't want to
answer each mail separately when they are so similar. Sorry, I will
try to be more specific in quoting.
Never mind, probably my mail program just acted up.
Btw, I am not sure theano is the best way to compute derivatives ;)
David Warde-Farley
2012-05-16 17:30:56 UTC
Post by Andreas Mueller
Btw, I am not sure theano is the best way to compute derivatives ;)
No? I would agree in the general case. However, in the case of MLPs and backprop, it's a use case for which Theano has been designed and heavily optimized. With it, it's very easy and quick to produce a correct MLP implementation (the deep learning tutorials contain one).

It's *not* the best way to obtain a readable mathematical expression for the gradients, but it'll allow you to compute them easily/correctly, which makes it a useful thing to verify against. I've done this a fair bit myself.

I've never had so much success with symbolic tools like Wolfram Alpha in situations involving lots of sums over indexed scalar quantities and whatnot, but perhaps I didn't try hard enough.

Once the initial version is working, Theano will serve another purpose: as a speed benchmark to try and beat (or at least not be too far behind). :)

David
Justin Bayer
2012-05-17 08:56:27 UTC
Post by David Marek
Post by David Warde-Farley
Post by David Marek
   delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
   where hidden.doutput is the derivative of the activation function for the
hidden layer?
Offhand that sounds right. You can use Theano as a sanity check for your
implementation.
Thank you David and Andreas for answering my questions. I will look at Theano.
Alternatively, you can just check it numerically. Scipy already comes
with an implementation [1] for scalar-to-scalar mappings, which you
can use with a double for loop for vector-to-vector functions. It is
much more straightforward to add this to unit tests than Theano
(obviously, because there is no additional dependency) and less hassle
than writing out the derivatives by hand.

[1] http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.derivative.html
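For example, something roughly like this for a single tanh layer with a
squared-error loss (a rough sketch, just to show the idea):

import numpy as np
from scipy.misc import derivative

rng = np.random.RandomState(0)
X = rng.randn(4, 3)            # a tiny batch of inputs
Y = rng.randn(4, 2)            # targets, just for the check
W = rng.randn(3, 2) * 0.1      # weights of a single tanh layer

def loss(W_):
    out = np.tanh(np.dot(X, W_))
    return 0.5 * np.sum((out - Y) ** 2)

# analytic gradient from backprop for this layer
out = np.tanh(np.dot(X, W))
delta = (out - Y) * (1.0 - out ** 2)
grad_analytic = np.dot(X.T, delta)

# numeric gradient: scipy's scalar derivative plus a double for loop
grad_numeric = np.empty_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        def f(wij, i=i, j=j):
            W_ = W.copy()
            W_[i, j] = wij
            return loss(W_)
        grad_numeric[i, j] = derivative(f, W[i, j], dx=1e-6)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)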
--
Dipl. Inf. Justin Bayer
Lehrstuhl für Robotik und Echtzeitsysteme, Technische Universität München
http://www6.in.tum.de/Main/Bayerj
Andreas Mueller
2012-05-31 19:02:02 UTC
Hey David.
How is it going?
I haven't heard from you in a while.
Did you blog anything about your progress?

Cheers,
Andy
David Marek
2012-06-01 00:49:35 UTC
Hi,

I don't have much time these days because I have exams at school. I am
sorry I haven't kept you informed.

I have implemented multi-class cross-entropy and softmax functions and
turned off some of the Cython checks; the result is that the Cython
implementation is only slightly better. I guess that's because I am using
objects as output functions, I will have to benchmark them to know more.

The next step is to test that the gradient descent is working correctly. I
am a little unsure how to approach this. One thing I will do is compute one
step of backpropagation by hand and check that the implementation is doing
the same. Another thing I will try is to compute the gradients numerically;
I am not exactly sure whether it's enough to use the derivative from scipy
and apply it to the forward step.

David
Post by Andreas Mueller
Hey David.
How is it going?
I haven't heard from you in a while.
Did you blog anything about your progress?
Cheers,
Andy
Gael Varoquaux
2012-06-01 05:27:33 UTC
Post by David Marek
I don't have much time these days because I have exams at school.
Good luck!
Post by David Marek
I have implemented multi-class cross-entropy and softmax functions and
turned off some of the Cython checks; the result is that the Cython
implementation is only slightly better. I guess that's because I am using
objects as output functions, I will have to benchmark them to know more.
Do you have any code on GitHub that you can show us? I am not trying to
micro-manage you, but rather to see if we can help by offering ideas after
seeing the code.

G
Andreas Mueller
2012-06-18 06:44:19 UTC
Hey David.
Olivier dug up this paper by LeCun's group:
http://users.ics.aalto.fi/kcho/papers/icml11.pdf
I think this might be quite interesting for the MLP.

It is probably also interesting for the linear SGD.
I'm surprised that they didn't compare against diagonal stochastic
Levenberg-Marquardt
with constant learning rate...

Cheers,
Andy
Olivier Grisel
2012-06-18 09:43:51 UTC
Post by Andreas Mueller
Hey David.
http://users.ics.aalto.fi/kcho/papers/icml11.pdf
I think this might be quite interesting for the MLP.
Err, no. The paper I mentioned is even newer:

http://arxiv.org/abs/1206.1106

Just to make it more explicit about the paper content, the title is:
"No More Pesky Learning Rates" and it's a method for estimating the
optimal learning rate schedule online from the data, while learning a
model using SGD with a smooth loss (convex or not). It's a pre-print
for NIPS 2012. It looks very promising. It would be great to try and
reproduce some of their empirical results.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas Mueller
2012-06-18 09:45:25 UTC
Post by Olivier Grisel
Post by Andreas Mueller
Hey David.
http://users.ics.aalto.fi/kcho/papers/icml11.pdf
I think this might be quite interesting for the MLP.
http://arxiv.org/abs/1206.1106
"No More Pesky Learning Rates" and it's a method for estimating the
optimal learning rate schedule online from the data, while learning a
model using SGD with a smooth loss (convex or not). It's a pre-print
for NIPS 2012. It looks very promising. It would be great to try and
reproduce some of their empirical results.
Sorry, copy&paste error :-/
David Marek
2012-06-19 09:09:40 UTC
Hi

On Mon, Jun 18, 2012 at 11:43 AM, Olivier Grisel
Post by Olivier Grisel
http://arxiv.org/abs/1206.1106
"No More Pesky Learning Rates" and it's a method for estimating the
optimal learning rate schedule online from the data, while learning a
model using SGD with a smooth loss (convex or not). It's a pre-print
for NIPS 2012. It looks very promising. It would be great to try and
reproduce some of their empirical results.
Thanks, I will look at it.

David
