[Scikit-learn-general] Le Bergstra Nouveau est arrivé


Olivier Grisel

2012-03-08 19:18:28 UTC

Some fresh news from the hyperparameters tuning front-lines:

http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf

Some interesting snippets from the conclusion (I have not yet read the
rest of the paper):

"""
We have shown that random experiments are more efficient than grid
experiments for hyper-parameter optimization in the case of several
learning algorithms on several data sets. Our analysis of the
hyper-parameter response surface (Ψ) suggests that random experiments
are more efficient because not all hyper-parameters are equally
important to tune. Grid search experiments allocate too many trials to
the exploration of dimensions that do not matter and suffer from poor
coverage in dimensions that are important.
"""

"""
Random experiments are also easier to carry out than grid experiments
for practical reasons related to the statistical independence of every
trial.

• The experiment can be stopped any time and the trials form a
complete experiment.

• If extra computers become available, new trials can be added to an
experiment without having to adjust the grid and commit to a much
larger experiment.

• Every trial can be carried out asynchronously.

• If the computer carrying out a trial fails for any reason, its trial
can be either abandoned or restarted without jeopardizing the
experiment.
"""
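
A minimal sketch of the argument in the excerpts above, assuming a toy
response surface in which only one of two hyper-parameters matters. The
function names, the parameter names (learning_rate, momentum) and the
optimum at 0.37 are hypothetical and are not taken from the paper. With
the same budget of 9 trials, a 3x3 grid only ever tests 3 distinct values
of the important hyper-parameter, whereas random sampling tests 9.

```python
import itertools
import random


def validation_error(learning_rate, momentum):
    # Hypothetical response surface: momentum has no effect at all,
    # so only the learning rate is worth exploring.
    return (learning_rate - 0.37) ** 2


def grid_trials(values_per_axis):
    # Regular grid on [0, 1] x [0, 1].
    axis = [i / (values_per_axis - 1) for i in range(values_per_axis)]
    return list(itertools.product(axis, axis))


def random_trials(n_trials, seed=0):
    # Independent uniform draws; every trial stands on its own, so the
    # experiment can be stopped or extended at any point.
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_trials)]


grid = grid_trials(3)      # 9 trials
rand = random_trials(9)    # 9 trials, same budget

# Coverage of the dimension that matters.
distinct_grid = len({lr for lr, _ in grid})
distinct_rand = len({lr for lr, _ in rand})

best_grid = min(validation_error(*trial) for trial in grid)
best_rand = min(validation_error(*trial) for trial in rand)

print("distinct learning rates (grid, random):", distinct_grid, distinct_rand)
print("best error (grid, random):", best_grid, best_rand)
```

Adding more random trials later only requires drawing more points; the
grid would have to be redefined.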

I wonder how this would carry over to scikit-learn models, which often
have far fewer hyper-parameters than the average Deep Belief Network.
Still, it's very interesting food for thought if someone wants to dive
into improving the model selection tooling in the scikit.

Maybe a new GSoC topic? Would anybody be interested as a mentor or candidate?

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Jacob VanderPlas

2012-03-08 20:25:34 UTC

Interesting!
Has anyone ever seen Gaussian process learning used for this sort of
hyperparameter estimation? I'm thinking of something similar to the
Kriging approach to likelihood surfaces, where some random starting
points are used to train a GPML solution, and this surface is minimized
to guess the next best location to try (or locations, if things are
being done in parallel). In this case, the points would be locations in
hyper-parameter space, and the evaluation is the cross-validation score.
It seems like this sort of approach could out-perform the random
selection used in this paper.
Jake
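
The loop described above can be sketched roughly as follows, for a single
hyper-parameter and in pure Python. This is only an illustration under
stated assumptions: cv_error is a hypothetical stand-in for a real
cross-validation score, the kernel length scale and jitter are arbitrary,
and the next point is chosen by minimizing a lower confidence bound
(mean minus kappa times standard deviation) rather than the posterior
mean alone, which is one possible choice and not something specified in
the message.

```python
import math
import random


def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel between two scalar values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def cho_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, jitter=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, cho_solve(L, ys)))
    var = 1.0 - sum(k * v for k, v in zip(k_star, cho_solve(L, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def cv_error(x):
    # Hypothetical stand-in for a cross-validation error to minimize.
    return (x - 0.7) ** 2


def gp_search(n_init=3, n_iter=5, kappa=1.0, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]   # random starting points
    ys = [cv_error(x) for x in xs]
    candidates = [i / 50 for i in range(51)]     # locations on [0, 1]
    for _ in range(n_iter):
        def lower_bound(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean - kappa * std
        x_next = min(candidates, key=lower_bound)  # next location to try
        xs.append(x_next)
        ys.append(cv_error(x_next))
    return xs, ys


xs, ys = gp_search()
best_x = xs[ys.index(min(ys))]
print("best hyper-parameter found:", best_x)
```

For parallel evaluation one would pick several candidates per iteration
instead of one; that variant is not shown here.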


Alexandre Passos

2012-03-08 20:29:36 UTC

Post by Jacob VanderPlas
Interesting!
Has anyone ever seen Gaussian process learning used for this sort of
hyperparameter estimation? I'm thinking of something similar to the
Kriging approach to likelihood surfaces, where some random starting
points are used to train a GPML solution, and this surface is minimized
to guess the next best location to try (or locations, if things are
being done in parallel). In this case, the points would be locations in
hyper-parameter space, and the evaluation is the cross-validation score.
It seems like this sort of approach could out-perform the random
selection used in this paper.
Jake

There's a follow-up to this JMLR paper that came out last NIPS which
does exactly that:
http://books.nips.cc/papers/files/nips24/NIPS2011_1385.pdf . There's
also code online for it: https://github.com/jaberg/hyperopt

--
- Alexandre

Alexandre Gramfort

2012-03-08 20:30:27 UTC

yes:

http://people.fas.harvard.edu/~bergstra/files/pub/11_nips_hyperopt.pdf

and a nice blog post by alex passos:

http://atpassos.posterous.com/bayesian-optimization

Alex


Gael Varoquaux

2012-03-08 20:34:29 UTC

Post by Alexandre Gramfort
http://people.fas.harvard.edu/~bergstra/files/pub/11_nips_hyperopt.pdf

Darn, we are a bunch of bots, but I am the slowest one.

G

James Bergstra

2012-03-19 15:45:00 UTC

Hey guys, I should add that currently the GP implementation in
hyperopt is not in good shape. The TPE algo works, but the GP algo
was originally written in a very crooked way (I would write
hyper-parameters to a text file, ssh it to France, where Remi's
workstation would run his GP implementation in matlab, and send it
back), and the reimplementation in hyperopt is not finished.

- James


Immanuel

2012-03-20 19:51:58 UTC

Hello all,

I have followed the mailing list and poked around in the source code for
the last couple of weeks.
Now I'm absolutely sure that I would enjoy working on scikit-learn as
a GSoC project.

I especially like the proposed online NMF project, could you enlighten
me on the following points?

There was some discussion about the integration of some NMF code in
scikit-learn. How will
this influence the proposed online NMF project?

@Vlad
Looks like we have the same interest, I like the robust PCA project too.
Do you already have a preference? I guess it makes little sense to
pitch against you ;).

@Olivier
I did some preliminary reading on the topic and found the following
paper interesting:
"Efficient Document Clustering via Online Nonnegative Matrix Factorizations"
source: http://research.microsoft.com/apps/pubs/default.aspx?id=143211

It claims:
* to efficiently handle very large-scale and/or streaming datasets
* low memory consumption
Different algorithm versions are presented in the paper. I don't know
which one would be the most attractive for scikit.

Finally, some words about me:
I'm a student at RWTH Aachen University (Germany), enrolled in
Computational Engineering Science. I am currently writing my diploma
thesis (master's equivalent) on a bioinformatics topic using machine
learning techniques. I took classes in machine learning, optimization,
stats, data-based modelling, etc. I worked as a student research
assistant, doing implementations for different projects.

best,
Immanuel Bayer

Gael Varoquaux

2012-03-20 20:23:13 UTC

Hi Immanuel,

My gut feeling about your project is that it is an interesting proposal,
but ideally a GSOC project should be more ambitious than a single
algorithm. You could consider a full application problem that the
algorithm is trying to solve and contribute a few different algorithms.
This is what Vlad did last year, with different matrix
factorization/dictionary learning algorithms, and it was very successful.

Thanks a lot for your proposal,

Gaël


--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info

Mathieu Blondel

2012-03-21 03:24:39 UTC

On Wed, Mar 21, 2012 at 5:23 AM, Gael Varoquaux

Post by Gael Varoquaux
My gut feeling about your project is that it is an interesting proposal,
but ideally a GSOC project should be more ambitious than a single
algorithm. You could consider a full application problem that the
algorithm is trying to solve and contribute a few different algorithms.
This is what Vlad did last year, with different matrix
factorization/dictionary learning algorithms, and it was very successful.

If the online NMF and SGD-based matrix factorization proposals are
merged as I suggested before, I think it would make a decent GSOC
project. Besides, if two different students were to work on the two
proposals in parallel, I think there would be too much overlap.

One thing I would like to see is an option to choose the loss
function. In the general case, we can use the squared loss but if the
values/ratings are binary, we can use the hinge loss and obtain
maximum margin matrix factorization, and if the values/ratings are
discrete, we can use ordinal regression losses. Jason Rennie, who is
following this list, did work on both. [*]
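
A minimal sketch of what a pluggable loss could look like in an SGD
matrix factorization, assuming the loss is passed in as the derivative
with respect to the predicted value. All names here (sgd_factorize,
squared_loss_grad, hinge_loss_grad) and the toy ratings are hypothetical;
this is not the scikit-learn SGD module and ordinal losses are left out.

```python
import random


def squared_loss_grad(pred, target):
    # d/dpred of 0.5 * (pred - target)^2
    return pred - target


def hinge_loss_grad(pred, target):
    # Subgradient of max(0, 1 - target * pred), with target in {-1, +1}.
    return -target if target * pred < 1.0 else 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgd_factorize(ratings, n_rows, n_cols, loss_grad, rank=2,
                  lr=0.05, reg=0.01, n_epochs=200, seed=0):
    """Factorize observed entries {(i, j): value} as U[i] . V[j]."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(ratings.items())
    for _ in range(n_epochs):
        rng.shuffle(entries)
        for (i, j), target in entries:
            g = loss_grad(dot(U[i], V[j]), target)
            for k in range(rank):
                u_ik, v_jk = U[i][k], V[j][k]
                # Only the loss gradient g depends on the chosen loss.
                U[i][k] -= lr * (g * v_jk + reg * u_ik)
                V[j][k] -= lr * (g * u_ik + reg * v_jk)
    return U, V


def mean_squared_error(ratings, U, V):
    return sum((dot(U[i], V[j]) - t) ** 2
               for (i, j), t in ratings.items()) / len(ratings)


ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}
U0, V0 = sgd_factorize(ratings, 2, 2, squared_loss_grad, n_epochs=0)
U, V = sgd_factorize(ratings, 2, 2, squared_loss_grad, n_epochs=200)
initial_mse = mean_squared_error(ratings, U0, V0)
final_mse = mean_squared_error(ratings, U, V)
print("MSE before and after training:", initial_mse, final_mse)
```

Switching to binary ratings would only mean passing hinge_loss_grad and
encoding the targets as -1 and +1; the update loop stays the same.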

Also, I would like the current SGD module for
classification/regression and the future SGD module for matrix
factorization to share as much Cython code as possible. After all,
multivariate regression and multiclass classification can be seen as
matrix factorization problems (the same way you need to solve multiple
Lasso problems to do dictionary learning).

Mathieu

[*]
http://people.csail.mit.edu/jrennie/papers/
http://people.csail.mit.edu/jrennie/writing/

Gael Varoquaux

2012-03-21 05:56:42 UTC

Post by Mathieu Blondel
If the online NMF and SGD-based matrix factorization proposals are
merged as I suggested before, I think it would make a decent GSOC
project. Besides, if two different students were to work on the two
proposals in parallel, I think there would be too much overlap.

Agreed. In general I think that such a project would have a good profile
for a GSOC.

I wonder if adapting Peter's pyrsvd to the scikit would fit in
such a project.

Gaël

Immanuel B

2012-03-21 11:35:21 UTC

Post by Gael Varoquaux

Post by Mathieu Blondel
If the online NMF and SGD-based matrix factorization proposals are
merged as I suggested before, I think it would make a decent GSOC
project. Besides, if two different students were to work on the two
proposals in parallel, I think there would be too much overlap.

Agreed. In general I think that such a project would have a good profile
for a GSOC.

Okay, that sounds reasonable to me too.
It appears to me that it might be in everyone's interest if I apply for
a different project. I'm considering "Coordinated descent in linear
models beyond squared loss (eg Logistic)".
I'm currently working on a p>>N problem using the R scout package,
where I’m running into "out of memory" and performance issues due to
R's memory restrictions. I could imagine that scikit-learn could
really profit if we could get around these problems.
In short, I think it could be interesting to implement the scout method too:
"We show that ridge regression, the lasso, and the elastic net are
special cases of covariance-regularized regression"
http://www-stat.stanford.edu/~tibs/ftp/WittenTibshirani2008.pdf

Best,
Immanuel

Alexandre Gramfort

2012-03-21 17:42:36 UTC

Post by Immanuel B
Okay, that sounds reasonable to me too.
It appears to me that it might be in everyone's interest if I apply for
a different project. I'm considering "Coordinated descent in linear
models beyond squared loss (eg Logistic)".
I'm currently working on a p>>N problem using the R scout package,
where I’m running into "out of memory" and performance issues due to
R's memory restrictions. I could imagine that scikit-learn could
really profit if we could get around these problems.

Hum, it seems surprising that a coordinate descent procedure blows up the
memory, but I'll have to read the paper. When I find the time …

I had more in mind the glmnet approach for multinomial logistic regression,
which scales pretty well AFAIK.

Post by Immanuel B
"We show that ridge regression, the lasso, and the elastic net are
special cases of covariance-regularized regression"
http://www-stat.stanford.edu/~tibs/ftp/WittenTibshirani2008.pdf

being more general is neat but the price you might have to pay is less
efficiency for the simpler problems.

Alex

Gael Varoquaux

2012-03-21 22:08:09 UTC

Post by Alexandre Gramfort

Post by Immanuel B
"We show that ridge regression, the lasso, and the elastic net are
special cases of covariance-regularized regression"
http://www-stat.stanford.edu/~tibs/ftp/WittenTibshirani2008.pdf

being more general is neat but the price you might have to pay is less
efficiency for the simpler problems.

That's my gut feeling too. I'd prefer a really fast solver for l1 and l2
penalized regression with the standard losses (square, hinge and
logistic), in the case n >> p. There are different papers mentioning
techniques for that.
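
For reference, a minimal sketch of cyclic coordinate descent for l1/l2
penalized regression, in the spirit of the glmnet approach mentioned in
this thread. It is an illustration only: it handles the squared loss
alone (not hinge or logistic), uses plain Python lists, and has no
screening rules or convergence check. The function names and the
alpha / l1_ratio parametrization are assumptions made for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (proximal step of the l1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, alpha=0.1, l1_ratio=1.0, n_iter=100):
    """Minimize (1/2n)||y - Xw||^2 + alpha * l1_ratio * ||w||_1
    + 0.5 * alpha * (1 - l1_ratio) * ||w||^2 one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # y - Xw, kept up to date after every update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes
            # feature j's own current contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j])
                      for i in range(n)) / n
            new_w = (soft_threshold(rho, alpha * l1_ratio)
                     / (col_sq[j] + alpha * (1.0 - l1_ratio)))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Two orthogonal features; the second one is only weakly related to y.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
w = elastic_net_cd(X, y, alpha=0.2, l1_ratio=1.0)
print("lasso coefficients:", w)
```

With l1_ratio=1.0 this is the lasso and the weak coefficient is set
exactly to zero; with l1_ratio=0.0 it reduces to ridge regression.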

My 2 cents,

Gael

Olivier Grisel

2012-03-20 20:48:52 UTC

Post by Immanuel
Hello all,
I have followed the mailing list and poked around in the source code for
the last couple of weeks.
Now I'm absolutely sure that I would enjoy working on scikit-learn as
a GSoC project.
I especially like the proposed online NMF project, could you enlighten
me on the following points?
There was some discussion about the integration of some NMF code in
scikit-learn. How will
this influence the proposed online NMF project?
@Vlad
Looks like we have the same interest, I like the robust PCA project too.
Do you already have a preference? I guess it makes little sense to
pitch against you ;).
@Olivier
I did some preliminary reading on the topic and found the following
"Efficient Document Clustering via Online Nonnegative Matrix Factorizations"
source: http://research.microsoft.com/apps/pubs/default.aspx?id=143211
* to efficiently handle very large-scale and/or streaming datasets
* low memory consumption
Different algorithm versions are presented in the paper. I don't know
which one would be the most attractive for scikit.

Sounds like a good starting point. Please add your name as a potential
candidate on the wiki and the article as a reference in the proposal
on the wiki.

If we are to extend this proposal I would also include extending the
existing MiniBatchSparseDictionaryLearning code (that does online
block coordinate descent) to accept sparse inputs and positivity
constraints.

We could also compare those algorithms with MiniBatchKMeans extended
to perform soft assignments with cosine similarity as the metric instead
of euclidean distance. Maybe @mblondel knows some references for this
part.
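
One possible reading of "soft assignments with cosine similarity", as a
rough sketch: each sample gets a weight per centroid, computed as a
softmax over cosine similarities. The softmax form and the temperature
parameter are assumptions made for this example; this is not what
MiniBatchKMeans does, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def soft_assign(sample, centroids, temperature=0.1):
    """Return one weight per centroid; the weights sum to 1."""
    sims = [cosine_similarity(sample, c) for c in centroids]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


centroids = [[1.0, 0.0], [0.0, 1.0]]
weights = soft_assign([3.0, 1.0], centroids)
print("assignment weights:", weights)
```

A lower temperature makes the assignment closer to the usual hard
k-means assignment; a higher one spreads the weight across centroids.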

But rather than implementing 3 different algorithms, I would prefer
to focus on one implementation and make it scale to large datasets
(large enough to work out-of-core) and make it work as well as
possible on a bunch of realistic datasets.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Gael Varoquaux

2012-03-08 20:33:36 UTC

Post by Jacob VanderPlas
Has anyone ever seen Gaussian process learning used for this sort of
hyperparameter estimation?

Yes, James Bergstra did that. The core idea is, at each iteration, to take
the point maximizing the chance of improving the current score.

I think this would be a simple way to get an optimizer that is quite
useful for tuning a few parameters together.
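
A small sketch of the "chance of improving the current score" criterion,
assuming the surrogate model gives a Gaussian prediction (mean and
standard deviation) of the score at each candidate. The candidate names
and their predicted values below are made up for illustration, and the
function name is hypothetical.

```python
import math


def probability_of_improvement(mean, std, best_score):
    """P(score > best_score) for a Gaussian prediction N(mean, std^2)."""
    if std <= 0.0:
        return 1.0 if mean > best_score else 0.0
    z = (mean - best_score) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


best_score = 0.90  # best cross-validation score seen so far

# Hypothetical surrogate predictions: candidate -> (mean, std).
predictions = {
    "C=1": (0.85, 0.01),    # well explored, unlikely to beat 0.90
    "C=10": (0.88, 0.05),   # close to the best, moderately uncertain
    "C=100": (0.80, 0.20),  # poor mean, but very uncertain
}

chances = {name: probability_of_improvement(mean, std, best_score)
           for name, (mean, std) in predictions.items()}
next_candidate = max(chances, key=chances.get)
print("next point to evaluate:", next_candidate)
```

The criterion trades off a good predicted mean against high
uncertainty; other criteria, such as expected improvement, weigh the
size of the improvement as well.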

Gaël

Immanuel B

2012-03-23 12:31:11 UTC

Post by Alexandre Gramfort
Hum, it seems surprising that a coordinate descent procedure blows up the
memory, but I'll have to read the paper. When I find the time …
I had more in mind the glmnet approach for multinomial logistic regression,
which scales pretty well AFAIK.

These remarks were quite useful to me, thanks. I'm now using the
glmnet package, which is indeed both fast and low on memory
consumption (and includes the strong rules :) ). The referenced papers
are quite interesting too.
