Discussion:
[Scikit-learn-general] Hyperparameter optimization
Alexandre Passos
2011-11-15 03:06:36 UTC
Permalink
Hello, scikiters,

Recent work by James Bergstra demonstrated that careful hyperparameter
optimization, as well as careless random sampling, is often better
than manual searching for many problems. You can see results in the
following NIPS paper:
http://people.fas.harvard.edu/~bergstra/files/pub/11_nips_hyperopt.pdf

I wonder if there's interest in adding some simple versions of these
techniques to the scikit's very useful GridSearchCV? There is code
available at https://github.com/jaberg/hyperopt, but it seems to be
research code and it uses Theano, so it's not directly applicable to the
scikit.

This could be a nice sprint project for someone.
--
 - Alexandre
Gael Varoquaux
2011-11-15 06:52:03 UTC
Permalink
Hi Alex,

When I mentioned that to James, he seemed to imply that this approach is
useful only for optimizing many parameters, around 8 or more. You would have
to confirm this. I believe that he'll be around at the sprints. As far as
I am concerned, I don't optimize that many parameters in the scikit.

Gaël
Post by Alexandre Passos
Recent work by James Bergstra demonstrated that careful hyperparameter
optimization, as well as careless random sampling, is often better
than manual searching for many problems. You can see results in the
http://people.fas.harvard.edu/~bergstra/files/pub/11_nips_hyperopt.pdf
I wonder if there's interest in adding some simple versions of these
techniques to the scikit's very useful GridSearchCV? There is code
available https://github.com/jaberg/hyperopt but it seems to be
research code and it uses theano, so it's not applicable to the
scikit.
Paolo Losi
2011-11-15 08:51:13 UTC
Permalink
Hi Alexandre,

I recently took a look at the subject as well.

In "Parameter determination of support vector machine and feature
selection using simulated annealing approach" [1] a stochastic optimization
method that has nice theoretical properties [2] is used to optimize at the
same time both feature selection and rbf svm hyper-parameters.

Starting from there I verified that all stochastic and heuristics-based
methods could be effectively used to optimized both problems (feature
selection, hyperparameters optimization, or both at the same time).
There are many papers on the subject...
James Bergstra
2011-11-20 02:15:43 UTC
Permalink
Hi Alexandre, I haven't been checking my email and I heard about your
message last night from a slightly drunken Gramfort, Grisel, Pinto and
Poilvert in French in a loud bar here in Cambridge. Thanks for the PR
:)

I think there are some findings on this topic that would be good and
appropriate for scikits, and easy to do.

1. random sampling should generally be used instead of grid search.
They may feel similar, but theoretically and empirically, sampling
from a hypercube parameter space will typically work better than
iterating over the points of a grid lattice for hyper-parameter
optimization. For some response functions the lattice can be slightly
more efficient, but risks being terribly inefficient. So if you have
to pick one, pick uniform sampling.

2. Gaussian process w. Expected Improvement global optimization.
This is an established technique for global optimization that has
about the right scaling properties to be good for hyper-parameter
optimization. I think you probably can't do much better than a
Gaussian Process (GP) with Expected Improvement (EI) for optimizing
the parameters of say, an SVM, but we can only try and see (and
compare with the variety of other techniques for global optimization).
The scikit already has GP fitting in it, scipy has good optimization
routines, so why not put them together to make a hyper-parameter
optimizer? I think this would be a good addition to the scikit, and
not too hard (the hard parts are already done).
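
Both points lend themselves to a short sketch. The following is a hedged
illustration (our own code, not from the paper and not in the scikit):
candidate settings are drawn uniformly at random from a box (point 1), and
a Gaussian process fit to the trials so far scores them with Expected
Improvement to pick the next one to evaluate (point 2). It uses today's
GaussianProcessRegressor API rather than the 2011-era
sklearn.gaussian_process.GaussianProcess class, and all function names
here are ours.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expected_improvement(gp, X_candidates, y_best, xi=0.01):
        # EI for maximization: E[max(f(x) - y_best - xi, 0)] under the GP posterior
        mu, sigma = gp.predict(X_candidates, return_std=True)
        sigma = np.maximum(sigma, 1e-12)
        z = (mu - y_best - xi) / sigma
        return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    def propose_next(X_tried, y_tried, bounds, rng, n_candidates=1000):
        # point 1: uniform random candidates in the hyper-rectangle `bounds`
        bounds = np.asarray(bounds, dtype=float)
        X_cand = rng.uniform(bounds[:, 0], bounds[:, 1],
                             size=(n_candidates, len(bounds)))
        # point 2: rank the candidates by Expected Improvement under a GP
        gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
        gp.fit(X_tried, y_tried)
        ei = expected_improvement(gp, X_cand, np.max(y_tried))
        return X_cand[np.argmax(ei)]

Evaluating the proposed point (e.g. with cross-validation) and appending it
to the history closes the loop.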

- James

On Mon, Nov 14, 2011 at 10:06 PM, Alexandre Passos
Post by Alexandre Passos
Hello, scikiters,
Recent work by James Bergstra demonstrated that careful hyperparameter
optimization, as well as careless random sampling, is often better
than manual searching for many problems. You can see results in the
http://people.fas.harvard.edu/~bergstra/files/pub/11_nips_hyperopt.pdf
I wonder if there's interest in adding some simple versions of these
techniques to the scikit's very useful GridSearchCV? There is code
available https://github.com/jaberg/hyperopt but it seems to be
research code and it uses theano, so it's not applicable to the
scikit.
This could be a nice sprint project for someone.
--
 - Alexandre
Alexandre Gramfort
2011-11-20 20:56:47 UTC
Permalink
Post by James Bergstra
Hi Alexandre, I haven't been checking my email and I heard about your
message last night from a slightly drunken Gramfort, Grisel, Pinto and
Poilvert in French in a loud bar here in Cambridge. Thanks for the PR
:)
too much information :)
Post by James Bergstra
I think there are some findings on this topic that would be good and
appropriate for scikits, and easy to do.
1. random sampling should generally be used instead of grid search.
They may feel similar, but theoretically and empirically, sampling
from a hypercube parameter space will typically work better than
iterating over the points of a grid lattice for hyper-parameter
optimization.  For some response functions the lattice can be slightly
more efficient, but risks being terribly inefficient. So if you have
to pick one, pick uniform sampling.
2. Gaussian process w. Expected Improvement global optimization.
This is an established technique for global optimization that has
about the right scaling properties to be good for hyper-parameter
optimization.  I think you probably can't do much better than a
Gaussian Process (GP) with Expected Improvement (EI) for optimizing
the parameters of say, an SVM, but we can only try and see (and
compare with the variety of other techniques for global optimization).
The scikit already has GP fitting in it, scipy has good optimization
routines, so why not put them together to make a hyper-parameter
optimizer? I think this would be a good addition to the scikit, and
not too hard (the hard parts are already done).
can you point us to some pdfs ? or maybe write some kind of pseudo code?

And as usual pull request / patch welcome :)

Alex
James Bergstra
2011-11-22 03:41:05 UTC
Permalink
On Sun, Nov 20, 2011 at 3:56 PM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by James Bergstra
2. Gaussian process w. Expected Improvement global optimization.
This is an established technique for global optimization that has
about the right scaling properties to be good for hyper-parameter
optimization.  I think you probably can't do much better than a
Gaussian Process (GP) with Expected Improvement (EI) for optimizing
the parameters of say, an SVM, but we can only try and see (and
compare with the variety of other techniques for global optimization).
The scikit already has GP fitting in it, scipy has good optimization
routines, so why not put them together to make a hyper-parameter
optimizer? I think this would be a good addition to the scikit, and
not too hard (the hard parts are already done).
can you point us to some pdfs ? or maybe write some kind of pseudo code?
Eric Brochu's thesis: chapter 2 is very readable, gives lots of good
reference as well.
Post by Alexandre Gramfort
And as usual pull request / patch welcome :)
Let me work out the bugs in hyperopt's GP optimization first, and then
maybe we can talk more about it at NIPS.

- James
Gael Varoquaux
2011-12-03 06:40:24 UTC
Permalink
Post by James Bergstra
2. Gaussian process w. Expected Improvement global optimization.
This is an established technique for global optimization that has
about the right scaling properties to be good for hyper-parameter
optimization.
Without knowing that this was an established technique, I had been
thinking about this for quite a while. I am thrilled to know that it
actually works, and would be _very_ interested about having this in the
scikit. Let's discuss it at the sprints.

With regard to the random sampling, I am a bit worried that the results
only hold for a fair number of points, and that with a small number of points
(which is typically the situation in which many of us hide) it becomes
very sensitive to the seed.

Thanks for your input, James,

Gael
Olivier Grisel
2011-12-03 11:32:59 UTC
Permalink
Post by Gael Varoquaux
thinking about this for quite a while. I am thrilled to know that it
actually works, and would be _very_ interested about having this in the
scikit. Let's discuss it at the sprints.
Alexandre has a new blog post about this with a simple Python snippet
using sklearn's GaussianProcess:

http://atpassos.posterous.com/bayesian-optimization
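
For readers without access to the post, here is a small self-contained
sketch in the same spirit (our own code, not Alexandre's snippet), tuning
log10(C) of a linear SVC with a GP and an upper-confidence-bound
acquisition, written against today's sklearn API rather than the 2011
GaussianProcess class:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    def objective(log_C):
        return cross_val_score(SVC(kernel="linear", C=10 ** log_C), X, y, cv=3).mean()

    # a few widely spaced starting points, then a GP-guided loop
    trials = [(c, objective(c)) for c in (-3.0, 0.0, 3.0)]
    for _ in range(10):
        gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
        gp.fit(np.array([[c] for c, _ in trials]),
               np.array([s for _, s in trials]))
        cand = np.linspace(-4.0, 4.0, 201).reshape(-1, 1)
        mu, sigma = gp.predict(cand, return_std=True)
        nxt = float(cand[np.argmax(mu + sigma), 0])  # upper-confidence-bound pick
        trials.append((nxt, objective(nxt)))

    best_log_C, best_score = max(trials, key=lambda t: t[1])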
Post by Gael Varoquaux
With regards to the random sampling, I am a bit worried that the results
hold for a fair amount of points, and with a small amount of points
(which is typically the situation in which many of us hide) it becomes
very sensitive to the seed.
I guess you should monitor the improvement before deciding to stop the search.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-12-03 15:25:04 UTC
Permalink
Post by Olivier Grisel
Alexandre has a new blog post about this with simple python snippet
http://atpassos.posterous.com/bayesian-optimization
That's pretty cool. If Alexandre agrees, this code could definitely serve
as the basis for a scikit-learn implementation: it is simple and
readable, looks very testable, and brings in the necessary
functionality.

G
Alexandre Passos
2011-12-03 15:38:03 UTC
Permalink
On Sat, Dec 3, 2011 at 10:25, Gael Varoquaux
Post by Gael Varoquaux
Post by Olivier Grisel
Alexandre has a new blog post about this with simple python snippet
  http://atpassos.posterous.com/bayesian-optimization
That's pretty cool. If Alexandre agrees, this code could definitely serve
as the basis for a scikit-learn implementation: it is simple and
readable, looks very testable, and brings in the necessary
functionality.
That was the point of writing that code, actually.

Currently it's in a very bad state for the scikit, as it's far slower
and more limited than it should be, but I plan on cleaning it up
eventually (I'd love to do this at the post-NIPS sprint but personal
life makes it complicated).

The main problems with it right now are:

0. The initialization is left out of it, and it's actually pretty
important for good performance. A few widely-spaced random samples
from the space of possibilities would be ideal.

1. Simulated annealing is a pretty naive way of maximizing over the
Gaussian process. It starts from a single point and has no knowledge
of where the objective function is good or bad. Something that is
aware of the previously evaluated points is a better idea. Is there
any implementation of a GA-like optimizer for scipy we could use? We
could also run more than one simulated annealing pass, starting from
many different good points, to better explore the state space.

2. The simulated annealing code has no way right now of specifying
the boundaries of the state space. This is very bad, as the variance
of a Gaussian process grows the further you go away from the known
points, so naively the simulated annealing will just keep exploring at
infinity and find ridiculously huge upper confidence bounds on the
optimal value. (A bounded, multi-restart alternative is sketched after
this list.)

3. It has no clear way of dealing with discrete variables or setting
up the kernel of the GP to be something less badly chosen. Tuning the
kernel is easy, but dealing with discrete hyperparameters not so much
(as the simulated annealing code and the kernel would have to be
adapted).
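
For points 1 and 2 above, one hedged alternative (our own sketch, not part
of the scikit or of the blog-post code) is to replace the single unbounded
simulated-annealing run with several bounded local searches (L-BFGS-B)
restarted from random points inside the box:

    import numpy as np
    from scipy.optimize import minimize

    def maximize_acquisition(acq, bounds, rng, n_restarts=10):
        """acq maps a 1-D parameter vector to a score to maximize."""
        bounds = np.asarray(bounds, dtype=float)
        best_x, best_val = None, -np.inf
        for _ in range(n_restarts):
            # random restart drawn inside the box, so the search stays bounded
            x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
            res = minimize(lambda x: -acq(x), x0, method="L-BFGS-B", bounds=bounds)
            if -res.fun > best_val:
                best_x, best_val = res.x, -res.fun
        return best_x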
--
 - Alexandre
James Bergstra
2011-12-05 18:28:06 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
With regards to the random sampling, I am a bit worried that the results
hold for a fair amount of points, and with a small amount of points
(which is typically the situation in which many of us hide) it becomes
very sensitive to the seed.
I guess you should monitor the improvement before deciding to stop the search.
My experience has been

1. that you start from an idea of a grid you'd like to try (ranges for
hyper-parameters, intervals for each hyper-parameter that might make a
difference),

2. you realize there's a huge number of points in the ideal grid, and
you have a budget for like 250

3a. you pick a good grid that still gets "the most important part", vs.

3b. you sample randomly in the original (huge) space.

If you sample randomly in a space that is close to the grid you were
going to try, but includes some of the finer resolution that you had
to throw out to get down to 250 grid points, you should do better with
250 random points (3b) than your grid (3a).

You're right that with just a few (i.e. < 10) random samples, mileage
will vary greatly... but that's not really the regime in which you can
do a grid search anyway.

I can hopefully offer more convincing evidence soon... I have a
journal paper on this that has been accepted, but I still need to
polish it up for publication.

- James
James Bergstra
2011-12-05 18:31:03 UTC
Permalink
I should probably not have scared ppl off speaking of a 250-job
budget. My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.

- James
Post by James Bergstra
Post by Olivier Grisel
Post by Gael Varoquaux
With regards to the random sampling, I am a bit worried that the results
hold for a fair amount of points, and with a small amount of points
(which is typically the situation in which many of us hide) it becomes
very sensitive to the seed.
I guess you should monitor the improvement before deciding to stop the search.
My experience has been
1. that you start from an idea of a grid you'd like to try (ranges for
hyper-parameters, intervals for each hyper-parameter that might make a
difference),
2. you realize there's a huge number of points in the ideal grid, and
you have a budget for like 250
3a. you pick a good grid that still gets "the most important part" , vs.
3b. you sample randomly in the original (huge) space.
If you sample randomly in a space that is close to the grid you were
going to try, but includes some of the finer resolution that you had
to throw out to get down to 250 grid points, you should do better with
250 random points (3b) than your grid (3a).
You're right that with just a few  (i.e. < 10) random samples, mileage
will vary greatly... but that's not really the regime in which you can
do a grid search anyway.
I can hopefully offer more convincing evidence soon... I have a
journal paper on this that has been accepted, but I still need to
polish it up for publication.
- James
Alexandre Passos
2011-12-05 18:41:53 UTC
Permalink
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget.  My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?

Presented like this I can easily see why this is better than (a) going
over the grid in order until the budget is exhausted or (b) using a
coarser grid to match the budget. This would also be very easy to
implement in sklearn.

Do I make sense?
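
A minimal sketch of that idea (a hypothetical helper, not the actual pull
request): enumerate the usual grid, shuffle it, and stop once the budget is
spent.

    import random
    from itertools import product

    def budgeted_grid(param_grid, budget, seed=0):
        """Return at most `budget` randomly ordered points from a param grid."""
        keys = sorted(param_grid)
        points = [dict(zip(keys, values))
                  for values in product(*(param_grid[k] for k in keys))]
        random.Random(seed).shuffle(points)
        return points[:budget]

    # e.g. budgeted_grid({"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]},
    #                    budget=5)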
--
 - Alexandre
Olivier Grisel
2011-12-05 18:44:54 UTC
Permalink
Post by Alexandre Passos
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget.  My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?
Presented like this I can easily see why this is better than (a) going
over the grid in order until the budget is exhausted or (b) using a
coarser grid to match the budget. This would also be very easy to
implement in sklearn.
Do I make sense?
Yes. +1 for a pull request: one could just add a "budget" integer
argument (None by default) to the existing GridSearchCV class.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas Müller
2011-12-05 19:19:11 UTC
Permalink
Post by Olivier Grisel
Post by Alexandre Passos
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget. My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?
Presented like this I can easily see why this is better than (a) going
over the grid in order until the budget is exhausted or (b) using a
coarser grid to match the budget. This would also be very easy to
implement in sklearn.
Do I make sense?
Yes. +1 for a pull request: one could just add a "budget" integer
argument (None by default) to the existing GridSearchCV class.
+1

on a related note: what about coarse-to-fine grid searches?
For categorical variables, that doesn't make much sense but
I think it does for many of the numerical variables.
Alexandre Passos
2011-12-05 19:23:15 UTC
Permalink
Post by Andreas Müller
on a related note: what about coarse to fine grid-searches?
For categorial variables, that doesn't make much sense but
I think it does for many of the numerical variables.
Coarse-to-fine grid searches (where you expand search in regions near
good points) sound a lot like the Gaussian process approach discussed
above, which can scale to more dimensions (as it doesn't need to
enumerate all candidate grid points).
--
 - Alexandre
Alexandre Passos
2011-12-05 19:45:50 UTC
Permalink
Post by Olivier Grisel
Yes. +1 for a pull request: one could just add a "budget" integer
argument (None by default) to the existing GridSearchCV class.
Just did that, the pull request is at
https://github.com/scikit-learn/scikit-learn/pull/455

So far no tests. How do you think this should be tested? Just a sanity
check to see whether, given a large enough budget, it always finds the same
result as regular GridSearchCV?
--
 - Alexandre
James Bergstra
2011-12-05 21:26:34 UTC
Permalink
Post by Alexandre Passos
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget.  My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?
Presented like this I can easily see why this is better than (a) going
over the grid in order until the budget is exhausted or (b) using a
coarser grid to match the budget. This would also be very easy to
implement in sklearn.
Do I make sense?
--
 - Alexandre
+1

This is definitely a good idea. I think random sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid. Within the existing grid implementation, though, the
option to shuffle points and stop early would be great.

- James
Alexandre Passos
2011-12-05 21:38:30 UTC
Permalink
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
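
One possible convention (a hedged sketch of roughly what later became
RandomizedSearchCV's param_distributions; the sample_point helper below is
hypothetical, and a reasonably recent scipy is assumed for
rvs(random_state=...)): let each axis be either a list of discrete choices
or any object with an .rvs() method, with scipy.stats.reciprocal covering
the orders-of-magnitude case.

    import numpy as np
    from scipy.stats import randint, reciprocal, uniform

    rng = np.random.RandomState(0)

    # each axis is either a list of discrete choices or an object with .rvs()
    space = {
        "kernel": ["rbf", "linear"],          # discrete choice
        "C": reciprocal(1e-3, 1e3),           # log-uniform: orders of magnitude
        "tol": uniform(loc=0.1, scale=0.4),   # uniform on [0.1, 0.5)
        "degree": randint(2, 5),              # integers 2, 3, 4
    }

    def sample_point(space, rng):
        point = {}
        for name, axis in space.items():
            if hasattr(axis, "rvs"):
                point[name] = axis.rvs(random_state=rng)
            else:
                point[name] = axis[rng.randint(len(axis))]
        return point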
--
 - Alexandre
Olivier Grisel
2011-12-05 22:06:57 UTC
Permalink
Post by Alexandre Passos
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
You should discuss this with Gael next week during NIPS. I tend to use
np.logspace and np.linspace to build my grids.
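
For example (not Olivier's exact code):

    import numpy as np

    Cs = np.logspace(-3, 3, 7)          # 0.001, 0.01, ..., 1000: orders of magnitude
    alphas = np.linspace(0.1, 0.9, 9)   # 0.1, 0.2, ..., 0.9: uniform steps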
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
James Bergstra
2011-12-06 15:36:21 UTC
Permalink
Post by Alexandre Passos
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
This is essentially why the algorithms in my "hyperopt" project [1]
are implemented as they are. They work for a variety of kinds of
distributions (uniform, log-uniform, normal, log-normal, randint),
including what I call "conditional" ones. For example, suppose you're
trying to optimize all the elements of a learning pipeline, and even
the choice of elements. You only want to pick the PCA pre-processing
parameters *if* you're actually doing PCA, because otherwise your
parameter optimization algorithm might attribute the score (result /
performance) to the PCA parameter choices that you know very well were
irrelevant.
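
A hedged illustration of such a "conditional" space using hyperopt's hp
primitives (the labels and ranges are made up for the example): the PCA
parameter is only sampled when the PCA branch is chosen, so it can never be
blamed for a score it did not influence.

    from hyperopt import hp

    space = {
        "preprocessing": hp.choice("preprocessing", [
            {"kind": "none"},
            {"kind": "pca",
             "n_components": hp.quniform("pca_n_components", 2, 50, 1)},
        ]),
        "C": hp.loguniform("C", -7, 7),   # exp(-7) .. exp(7), sampled log-uniformly
    }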

hyperopt implementations are relatively tricky, but at this point I
don't think they could be done in a straightforward simple way that
would make them scikit-learn compatible. I think scikit-learn users
would be better served by specific hand-written hyper-parameter
optimizers for certain specific, particularly useful pipelines. Other
customized pipelines can use grid search, random search, manual
search, or the docs could maybe refer them to hyperopt, as it matures.

- James

[1] https://github.com/jaberg/hyperopt
James Bergstra
2013-02-11 21:10:59 UTC
Permalink
Interesting to see this thread revived! FYI I've made hyperopt a lot
friendlier since that original posting.

http://jaberg.github.com/hyperopt/

pip install hyperopt

1. It has docs.
2. The minimization interface is based on an fmin() function that
should be pretty accessible (a minimal call is sketched below this list).
3. It can be installed straight from PyPI.
4. It just depends on numpy, scipy, and networkx. (optional pymongo and nose)
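
The minimal call mentioned in point 2 looks roughly like this (a hedged
example; argument names follow the hyperopt docs):

    from hyperopt import fmin, hp, tpe

    best = fmin(
        fn=lambda x: (x - 3) ** 2,        # objective to minimize
        space=hp.uniform("x", -10, 10),   # search space
        algo=tpe.suggest,                 # the TPE algorithm mentioned below
        max_evals=100,
    )
    print(best)  # e.g. {'x': 2.98...}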

Adding new algorithms to it (SMBO based on GPs and regression trees)
is work in progress. The current non-trivial algorithm that's in there
(TPE) is probably relatively good for high-dimensional spaces, but for
lower-dimensional search spaces I think these other algos might be
more efficient. I'll keep the list posted on how that comes along (or
feel free to get in touch if you'd like to help out.)

- James

On Tue, Dec 6, 2011 at 10:36 AM, James Bergstra
Post by James Bergstra
Post by Alexandre Passos
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
This is essentially why the algorithms in my "hyperopt" project [1]
are implemented as they are. They work for a variety of kinds of
distributions (uniform, log-uniform, normal, log-normal, randint),
including what I call "conditional" ones. For example, suppose you're
trying to optimize all the elements of a learning pipeline, and even
the choice of elements. You only want to pick the PCA pre-processing
parameters *if* you're actually doing PCA, because otherwise your
parameter optimization algorithm might attribute the score (result /
performance) to the PCA parameter choices that you know very well were
irrelevant.
hyperopt implementations are relatively tricky, but at this point I
don't think they could be done in a straightforward simple way that
would make them scikit-learn compatible. I think scikit-learn users
would be better served by specific hand-written hyper-parameter
optimizers for certain specific, particularly useful pipelines. Other
customized pipelines can use grid search, random search, manual
search, or the docs could maybe refer them to hyperopt, as it matures.
- James
[1] https://github.com/jaberg/hyperopt
James Bergstra
2013-02-19 22:36:04 UTC
Permalink
Further to this: I started a project on github to look at how to
combine hyperopt with sklearn.
https://github.com/jaberg/hyperopt-sklearn

I've only wrapped one algorithm so far: Perceptron
https://github.com/jaberg/hyperopt-sklearn/blob/master/hpsklearn/perceptron.py

My idea is that little files like perceptron.py would encode
(a) domain expertise about what values make sense for a particular
hyper-parameter (see the `search_space()` function), and
(b) a sklearn-style fit/predict interface that encapsulates search
over those hyper-parameters (see `AutoPerceptron`); a rough sketch of
the pattern follows below.
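
The following is our own hypothetical sketch of that pattern, not the
actual hpsklearn code; modern sklearn import paths are assumed, and the
ranges are only illustrative:

    import numpy as np
    from hyperopt import fmin, hp, tpe
    from sklearn.linear_model import Perceptron
    from sklearn.model_selection import cross_val_score

    PENALTIES = [None, "l2", "l1"]

    def search_space():
        # hand-written domain knowledge about plausible ranges
        return {
            "alpha": hp.loguniform("alpha", np.log(1e-6), np.log(1e-2)),
            "penalty": hp.choice("penalty", PENALTIES),
        }

    class AutoPerceptron(object):
        def __init__(self, max_evals=50):
            self.max_evals = max_evals

        def fit(self, X, y):
            def loss(params):
                # minimize negative cross-validated accuracy
                return -cross_val_score(Perceptron(**params), X, y, cv=3).mean()
            best = fmin(loss, search_space(), algo=tpe.suggest,
                        max_evals=self.max_evals)
            best["penalty"] = PENALTIES[best["penalty"]]  # hp.choice returns an index
            self.estimator_ = Perceptron(**best).fit(X, y)
            return self

        def predict(self, X):
            return self.estimator_.predict(X)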

I just wrote it up today and I've only tried it on one data set, but
at least on Iris it improves the default Perceptron's performance to
85% accuracy from 70%. Better than nothing! Of course it takes 100
times as long when hyperopt is run serially, but .05 seconds and 5
seconds are both pretty quick. (And who would have thought that the
Perceptron would have 8 hyper-parameters??)

I'm not planning to do any more work on this in the very short term,
so if anyone is curious to adapt the Perceptron example to other
algorithms, send PRs :)

- James

On Mon, Feb 11, 2013 at 4:10 PM, James Bergstra
Post by James Bergstra
Interesting to see this thread revived! FYI I've made hyperopt a lot
friendlier since that original posting.
http://jaberg.github.com/hyperopt/
pip install hyperopt
1. It has docs.
2. The minimization interface is based on an fmin() function, that
should be pretty accessible.
3. It can be installed straight from pypi
4. It just depends on numpy, scipy, and networkx. (optional pymongo and nose)
Adding new algorithms to it (SMBO based on GPs and regression trees)
is work in progress. The current non-trivial algorithm that's in there
(TPE) is probably relatively good for high-dimensional spaces, but for
lower-dimensional search spaces I think these other algos might be
more efficient. I'll keep the list posted on how that comes along (or
feel free to get in touch if you'd like to help out.)
- James
On Tue, Dec 6, 2011 at 10:36 AM, James Bergstra
Post by James Bergstra
Post by Alexandre Passos
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
This is essentially why the algorithms in my "hyperopt" project [1]
are implemented as they are. They work for a variety of kinds of
distributions (uniform, log-uniform, normal, log-normal, randint),
including what I call "conditional" ones. For example, suppose you're
trying to optimize all the elements of a learning pipeline, and even
the choice of elements. You only want to pick the PCA pre-processing
parameters *if* you're actually doing PCA, because otherwise your
parameter optimization algorithm might attribute the score (result /
performance) to the PCA parameter choices that you know very well were
irrelevant.
hyperopt implementations are relatively tricky, but at this point I
don't think they could be done in a straightforward simple way that
would make them scikit-learn compatible. I think scikit-learn users
would be better served by specific hand-written hyper-parameter
optimizers for certain specific, particularly useful pipelines. Other
customized pipelines can use grid search, random search, manual
search, or the docs could maybe refer them to hyperopt, as it matures.
- James
[1] https://github.com/jaberg/hyperopt
James Bergstra
2013-02-20 00:12:03 UTC
Permalink
I should add: if anyone has thoughts about the design, I'm interested
to get your input. Easier to redesign things now, before more code is
written.

- James

On Tue, Feb 19, 2013 at 5:36 PM, James Bergstra
Post by James Bergstra
Further to this: I started a project on github to look at how to
combine hyperopt with sklearn.
https://github.com/jaberg/hyperopt-sklearn
I've only wrapped on algorithm so far: Perceptron
https://github.com/jaberg/hyperopt-sklearn/blob/master/hpsklearn/perceptron.py
My idea is that little files like perceptron.py would encode
(a) domain expertise about what values make sense for a particular
hyper-parameter (see the `search_space()` function and
(b) a sklearn-style fit/predict interface that encapsulates search
over those hyper-parameters (see `AutoPerceptron`)
I just wrote it up today and I've only tried it on one data set, but
at least on Iris it improves the default Perceptron's performance to
85% accuracy from 70%. Better than nothing! Of course it takes 100
times as long when hyperopt is run serially, but .05 seconds and 5
seconds are both pretty quick. (And who would have thought that the
Perceptron would have 8 hyper-parameters??)
I'm not planning to do any more work on this in the very short term,
so if anyone is curious to adapt the Perceptron example to other
algorithms, send PRs :)
- James
On Mon, Feb 11, 2013 at 4:10 PM, James Bergstra
Post by James Bergstra
Interesting to see this thread revived! FYI I've made hyperopt a lot
friendlier since that original posting.
http://jaberg.github.com/hyperopt/
pip install hyperopt
1. It has docs.
2. The minimization interface is based on an fmin() function, that
should be pretty accessible.
3. It can be installed straight from pypi
4. It just depends on numpy, scipy, and networkx. (optional pymongo and nose)
Adding new algorithms to it (SMBO based on GPs and regression trees)
is work in progress. The current non-trivial algorithm that's in there
(TPE) is probably relatively good for high-dimensional spaces, but for
lower-dimensional search spaces I think these other algos might be
more efficient. I'll keep the list posted on how that comes along (or
feel free to get in touch if you'd like to help out.)
- James
On Tue, Dec 6, 2011 at 10:36 AM, James Bergstra
Post by James Bergstra
Post by Alexandre Passos
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
This is essentially why the algorithms in my "hyperopt" project [1]
are implemented as they are. They work for a variety of kinds of
distributions (uniform, log-uniform, normal, log-normal, randint),
including what I call "conditional" ones. For example, suppose you're
trying to optimize all the elements of a learning pipeline, and even
the choice of elements. You only want to pick the PCA pre-processing
parameters *if* you're actually doing PCA, because otherwise your
parameter optimization algorithm might attribute the score (result /
performance) to the PCA parameter choices that you know very well were
irrelevant.
hyperopt implementations are relatively tricky, but at this point I
don't think they could be done in a straightforward simple way that
would make them scikit-learn compatible. I think scikit-learn users
would be better served by specific hand-written hyper-parameter
optimizers for certain specific, particularly useful pipelines. Other
customized pipelines can use grid search, random search, manual
search, or the docs could maybe refer them to hyperopt, as it matures.
- James
[1] https://github.com/jaberg/hyperopt
James Jong
2013-02-20 00:29:35 UTC
Permalink
Hi there,

I presume some of you may have already seen this, but if not, caret in R is
a nice example of how to do model selection with a unified interface to a
variety of classification & regression methods:

http://caret.r-forge.r-project.org/

James
Post by James Bergstra
I should add: if anyone has thoughts about the design, I'm interested
to get your input. Easier to redesign things now, before more code is
written.
- James
On Tue, Feb 19, 2013 at 5:36 PM, James Bergstra
Post by James Bergstra
Further to this: I started a project on github to look at how to
combine hyperopt with sklearn.
https://github.com/jaberg/hyperopt-sklearn
I've only wrapped on algorithm so far: Perceptron
https://github.com/jaberg/hyperopt-sklearn/blob/master/hpsklearn/perceptron.py
Post by James Bergstra
My idea is that little files like perceptron.py would encode
(a) domain expertise about what values make sense for a particular
hyper-parameter (see the `search_space()` function and
(b) a sklearn-style fit/predict interface that encapsulates search
over those hyper-parameters (see `AutoPerceptron`)
I just wrote it up today and I've only tried it on one data set, but
at least on Iris it improves the default Perceptron's performance to
85% accuracy from 70%. Better than nothing! Of course it takes 100
times as long when hyperopt is run serially, but .05 seconds and 5
seconds are both pretty quick. (And who would have thought that the
Perceptron would have 8 hyper-parameters??)
I'm not planning to do any more work on this in the very short term,
so if anyone is curious to adapt the Perceptron example to other
algorithms, send PRs :)
- James
On Mon, Feb 11, 2013 at 4:10 PM, James Bergstra
Post by James Bergstra
Interesting to see this thread revived! FYI I've made hyperopt a lot
friendlier since that original posting.
http://jaberg.github.com/hyperopt/
pip install hyperopt
1. It has docs.
2. The minimization interface is based on an fmin() function, that
should be pretty accessible.
3. It can be installed straight from pypi
4. It just depends on numpy, scipy, and networkx. (optional pymongo and
nose)
Post by James Bergstra
Post by James Bergstra
Adding new algorithms to it (SMBO based on GPs and regression trees)
is work in progress. The current non-trivial algorithm that's in there
(TPE) is probably relatively good for high-dimensional spaces, but for
lower-dimensional search spaces I think these other algos might be
more efficient. I'll keep the list posted on how that comes along (or
feel free to get in touch if you'd like to help out.)
- James
On Tue, Dec 6, 2011 at 10:36 AM, James Bergstra
On Mon, Dec 5, 2011 at 4:38 PM, Alexandre Passos
On Mon, Dec 5, 2011 at 16:26, James Bergstra
Post by James Bergstra
This is definitely a good idea. I think randomly sampling is still
useful though. It is not hard to get into settings where the grid is
in theory very large and the user has a budget that is a tiny fraction
of the full grid.
I'd like to implement this, but I'm stuck on a nice way of specifying
distributions over each axis (i.e., sometimes you want to sample
across orders of magnitude (say, 0.001, 0.01, 0.1, 1, etc), sometimes
you want to sample uniformly (0.1, 0.2, 0.3, 0.4 ...)) that is obvious
and readable and flexible.
This is essentially why the algorithms in my "hyperopt" project [1]
are implemented as they are. They work for a variety of kinds of
distributions (uniform, log-uniform, normal, log-normal, randint),
including what I call "conditional" ones. For example, suppose you're
trying to optimize all the elements of a learning pipeline, and even
the choice of elements. You only want to pick the PCA pre-processing
parameters *if* you're actually doing PCA, because otherwise your
parameter optimization algorithm might attribute the score (result /
performance) to the PCA parameter choices that you know very well were
irrelevant.
hyperopt implementations are relatively tricky, but at this point I
don't think they could be done in a straightforward simple way that
would make them scikit-learn compatible. I think scikit-learn users
would be better served by specific hand-written hyper-parameter
optimizers for certain specific, particularly useful pipelines. Other
customized pipelines can use grid search, random search, manual
search, or the docs could maybe refer them to hyperopt, as it matures.
- James
[1] https://github.com/jaberg/hyperopt
Mathieu Blondel
2013-02-20 00:52:28 UTC
Permalink
On Wed, Feb 20, 2013 at 7:36 AM, James Bergstra
<***@gmail.com> wrote:
Post by James Bergstra
And who would have thought that the
Perceptron would have 8 hyper-parameters??
I think the Perceptron is not a good candidate. I'd rather choose
SGDClassifier (you can thus add the loss function to the parameter
space). Perceptron in scikit-learn has many parameters because it
inherits from the SGDClassifier machinery. However, if you use the
default options, you get the standard Perceptron (which doesn't have
any hyperparameter). Since it is indeed confusing, we could remove the
parameters (people who want to tune those parameters can use
SGDClassifier(loss="perceptron") anyway) or at the very least update
the docstring to reflect that the default options lead to the standard
Perceptron.

Is it possible to gain insights from the hyperparameter search? Like
what parameter (or combination of parameters) contributes the most to
the accuracy?

Mathieu
James Bergstra
2013-02-20 02:02:30 UTC
Permalink
Post by Mathieu Blondel
On Wed, Feb 20, 2013 at 7:36 AM, James Bergstra
Post by James Bergstra
And who would have thought that the
Perceptron would have 8 hyper-parameters??
I think the Perceptron is not a good candidate. I'd rather choose
SGDClassifier (you can thus add the loss function to the parameter
space). Perceptron in scikit-learn has many parameters because it
inherits from the SGDClassifier machinery. However, if you use the
default options, you get the standard Perceptron (which doesn't have
any hyperparameter). Since it is indeed confusing, we could remove the
parameters (people who want to tune those parameters can use
SGDClassifier(loss="perceptron") anyway) or at the very least update
the docstring to reflect that the default options lead to the standard
Perceptron.
Interesting, I didn't dig under the hood of the Perceptron class. If
the Perceptron is essentially just a simplified interface to the
underlying SGDClassifier machinery, then yes - the hyper-parameter
tuning code should instead target the more general underlying API.
Thanks.
Post by Mathieu Blondel
Is it possible to gain insights from the hyperparameter search? Like
what parameter (or combination of parameters) contributes the most to
the accuracy?
When you try some hyper-parameter assignments and measure some model
fitness (e.g. validation-set classification accuracy), you
accumulate a new (input, output) data set. Insight is about finding
some kind of statistical pattern in that input -> output mapping.
So for sure you can get insight, by doing... machine learning :)
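
For instance, a hedged sketch of that idea (the names are ours; `trials` is
assumed to be a list of (params_dict, score) pairs with numeric parameters):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def hyperparam_importances(trials, param_names):
        X = np.array([[params[name] for name in param_names]
                      for params, _ in trials])
        y = np.array([score for _, score in trials])
        reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
        # which hyper-parameters the fitted regressor relies on most
        return dict(zip(param_names, reg.feature_importances_))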

The algorithm for hyperparameter search that I'm using in hyperopt is
doing that. It's a regression algorithm that slowly adapts to the
hyper-parameter -> performance relationship to make hyperparameter
search faster. Have a look for "Sequential Model Based Optimization"
to learn more about this, or "Bayesian Optimization"

Hyperopt comes with some visualization tools for trying to understand
high-dimensional hyperparameter spaces. It can be interesting to
visualize correlations between individual hyperparameters and fitness,
or pairs, but beyond that there isn't usually enough data to estimate
a correlation accurately (to say nothing of how many possible triples
there are to fit on the screen).

- James
Mathieu Blondel
2013-02-20 06:45:07 UTC
Permalink
On Wed, Feb 20, 2013 at 11:02 AM, James Bergstra
Post by James Bergstra
Hyperopt comes with some visualization tools for trying to understand
high-dimensional hyperparameter spaces. It can be interesting to
visualize correlations between individual hyperparameters and fitness,
or pairs, but beyond that there isn't usually enough data to estimate
a correlation accurately (to say nothing of how many possible triples
there are to fit on the screen).
My question was more specifically with respect to Hyperopt. So, the
above answers my question.

Thanks,
Mathieu

Lars Buitinck
2013-02-20 00:55:45 UTC
Permalink
Post by James Bergstra
Further to this: I started a project on github to look at how to
combine hyperopt with sklearn.
https://github.com/jaberg/hyperopt-sklearn
I've only wrapped on algorithm so far: Perceptron
https://github.com/jaberg/hyperopt-sklearn/blob/master/hpsklearn/perceptron.py
My idea is that little files like perceptron.py would encode
(a) domain expertise about what values make sense for a particular
hyper-parameter (see the `search_space()` function and
(b) a sklearn-style fit/predict interface that encapsulates search
over those hyper-parameters (see `AutoPerceptron`)
I'm not sure what your long-term goals with this project are, but I
see three problems with this approach:
1. The values might be problem-dependent rather than
estimator-dependent. In your example, you're optimizing for accuracy,
but you might want to optimize for F1-score instead.
2. The number of estimators is *huge* if you also consider
combinations like SelectKBest(chi2) -> RBFSamples -> SGDClassifier
pipelines (a classifier that I was trying out only yesterday).
3. The estimator parameters change sometimes, so this would have to be
kept in sync with scikit-learn.

When I wrote the scikit-learn wrapper for NLTK [1], I chose a strategy
where *no scikit-learn code is imported at all* (except when the user
runs the demo or unit tests). Instead, the user is responsible for
importing it and constructing the appropriate estimator. This makes
the code robust to API changes, and it can handle arbitrarily complex
sklearn.Pipeline objects, as well as estimators that follow the API
conventions but are not in scikit-learn proper.

I think a similar approach can be followed here. While some
suggestions for parameters to try might be shipped as examples, an
estimator- and evaluation-agnostic wrapper class ("meta-estimator") is
a stronger basis for a package like the one you're writing.
scikit-learn's own GridSearch is also implemented like this, to a
large extent.

[1] https://github.com/nltk/nltk/blob/f7f3b73f0f051639d87cfeea43b0aabf6f167b8f/nltk/classify/scikitlearn.py
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
James Bergstra
2013-02-20 01:47:19 UTC
Permalink
Post by Lars Buitinck
Post by James Bergstra
Further to this: I started a project on github to look at how to
combine hyperopt with sklearn.
https://github.com/jaberg/hyperopt-sklearn
I've only wrapped on algorithm so far: Perceptron
https://github.com/jaberg/hyperopt-sklearn/blob/master/hpsklearn/perceptron.py
My idea is that little files like perceptron.py would encode
(a) domain expertise about what values make sense for a particular
hyper-parameter (see the `search_space()` function and
(b) a sklearn-style fit/predict interface that encapsulates search
over those hyper-parameters (see `AutoPerceptron`)
I'm not sure what your long-term goals with this project are, but I
1. The values might be problem-dependent rather than estimator
dependent. In your example, you're optimizing for accuracy, but you
might want to optimize for F1-score instead.
Good point, and if I understand correctly, it's related to your other
point below about GridSearch. I think you are pointing out that the
design of the AutoPerceptron is off the mark for 2 reasons:

1. There is only one line in that class that actually refers to
Perceptron, so why not make the actual estimator a constructor
argument? (I agree, it should be an argument.)

2. The class mainly consists of plumbing, but also is hard-coded to
compute classification error. This is silly, it would be better to use
either (a) the native loss of the estimator or else (b) some specific
user-supplied validation metric.

I agree with both of these points. Let me know if I misunderstood you though.
Post by Lars Buitinck
2. The number is estimators is *huge* if you also consider
combinations like SelectKBest(chi2) -> RBFSamples -> SGDClassifier
pipelines (a classifier that I was trying out only yesterday).
Yes, the number of estimators in a search space can be huge. In my
research on visual system models I found that hyperopt was
surprisingly useful, even in the face of daunting configuration
problems. The point of this project, for me, is to see how it stacks
up.

One design aspect that doesn't come through in the current code sample
is that the hard-coded parameter spaces (which I'll come to in a
second) must compose. What I mean is that if someone has written up a
standard SGDClassifier search space, and someone has coded up search
spaces for SelectKBest and RBFSamples, then you should be able to just
string those all together and search the joint space without much
trouble.

Your particular case is exactly the sort of case I would hope
eventually to address - it's difficult to give sensible defaults to
each of those modules before knowing either (a) what kind of data they
will process and (b) what's going on in the rest of the pipeline.
Playing with a bunch of interacting variables as measured by
long-running programs is hard for people; automatic methods don't
actually have to be all that efficient to be competitive.
Post by Lars Buitinck
3. The estimator parameters change sometimes, so this would have to be
kept in sync with scikit-learn.
This is a price I was expecting to have to pay, I don't see any way
around it. Part of the value of this library is encoding parameter
ranges for specific estimators. That tight coupling is not something
to be dodged.

- James
Post by Lars Buitinck
When I wrote the scikit-learn wrapper for NLTK [1], I chose a strategy
where *no scikit-learn code is imported at all* (except when the user
runs the demo or unit tests). Instead, the user is responsible for
importing it and constructing the appropriate estimator. This makes
the code robust to API changes, and it can handle arbitrarily complex
sklearn.Pipeline objects, as well as estimators that follow the API
conventions but are not in scikit-learn proper.
I think a similar approach can be followed here. While some
suggestions for parameters to try might be shipped as examples, an
estimator- and evaluation-agnostic wrapper class ("meta-estimator") is
a stronger basis for a package like the one you're writing.
scikit-learn's own GridSearch is also implemented like this, to a
large extent.
[1] https://github.com/nltk/nltk/blob/f7f3b73f0f051639d87cfeea43b0aabf6f167b8f/nltk/classify/scikitlearn.py
Thanks, yes, there is a strong similarity between what I'm trying to
do and GridSearch, so it makes sense to use similar strategies for
comparing model outputs. The "AutoPerceptron" class would be improved
by being more generic, like GridSearch.

- James
Gael Varoquaux
2011-12-06 04:03:08 UTC
Permalink
Post by Alexandre Passos
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget.  My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?
Does sound reasonable.

When doing grid searches, I find that an important aspect is that some
grid points take a fraction of the time of others. This is actually a big
motivation for doing things in parallel: with enough CPUs (8), the time of
a grid search can be fully dominated by the time of computing the fit for
the different folds of a single grid point.

Thus the notion of budget is relevant, but the right budget is not
exactly the number of fit points computed.

That said, taking this into account will probably make the code much more
complex, so I suggest that we put it on hold.

G
Olivier Grisel
2011-12-06 09:09:23 UTC
Permalink
Post by Gael Varoquaux
Post by Alexandre Passos
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget.  My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?
Does sound reasonnable.
When doing grid searches, I find that an important aspect is that some
grid points take a fraction of the time of others. This is actually a big
motivation for doing things in parallel: with enough CPU (8) the time of
a grid search can be fully limited by the time of computing the fit for
the different folds on only one grid point.
Thus the notion of budget is relevant, but the right budget is not
exactly the number of fit points computed.
This is very true, and I think that would be a great area of future
work for James' next papers: train 2 Gaussian processes, one to
estimate the expected cross-validation error and the other to estimate
the expected runtime (CPU cost).

Then build a decision function that selects the next points to explore
from the estimated Pareto optimal front of those two objectives (low
cross validation error, low CPU cost).

Intuitively this would amount to using a proxy to the uncomputable yet
universal Solomonoff prior as a regularizer, which sounds like a good
thing to do (Epicurus, Occam and Bayes would all agree to work that
way if they had access to MacBook Pros :). See:
http://www.scholarpedia.org/article/Algorithmic_probability

5 years ago I think the state of the art for multi-objective
optimization was using evolutionary algorithms such as Non-dominated
Sorting Genetic Algorithm-II (NSGA-II) and Strength Pareto
Evolutionary Algorithm 2 (SPEA-2). It might have improved since. There
are interesting links here:
http://en.wikipedia.org/wiki/Multi-objective_optimization . A nice
feature of EAs is that they are embarrassingly parallelizable (hence
cloud-ready :).

From a more practical standpoint, one could also define a scalar
utility function that combines the two components (cross-validation
error and CPU cost) into a single objective and select the next points
by minimizing it.
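
A hedged sketch of that practical variant (all names are ours, and the
modern GaussianProcessRegressor API is assumed): fit one GP for the
validation error and one for log-runtime, then rank candidates by a scalar
utility that trades the two off.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def next_point(X_tried, errors, runtimes, X_candidates, runtime_weight=0.1):
        gp_err = GaussianProcessRegressor(normalize_y=True).fit(X_tried, errors)
        gp_time = GaussianProcessRegressor(normalize_y=True).fit(X_tried,
                                                                 np.log(runtimes))
        mu_err, sd_err = gp_err.predict(X_candidates, return_std=True)
        mu_log_time = gp_time.predict(X_candidates)
        # lower is better: optimistic error estimate plus a CPU-cost penalty
        utility = (mu_err - sd_err) + runtime_weight * mu_log_time
        return X_candidates[np.argmin(utility)]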
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
James Bergstra
2011-12-06 15:26:24 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
Post by Alexandre Passos
Post by James Bergstra
I should probably not have scared ppl off speaking of a 250-job
budget.  My intuition would be that with 2-8 hyper-parameters, and 1-3
"significant" hyper-parameters, randomly sampling around 10-30 points
should be pretty reliable.
So perhaps the best implementation of this is to first generate a grid
(from the usual arguments to sklearn's GridSearch), randomly sort it,
and iterate over these points until the budget is exhausted?
Does sound reasonnable.
When doing grid searches, I find that an important aspect is that some
grid points take a fraction of the time of others. This is actually a big
motivation for doing things in parallel: with enough CPU (8) the time of
a grid search can be fully limited by the time of computing the fit for
the different folds on only one grid point.
Thus the notion of budget is relevant, but the right budget is not
exactly the number of fit points computed.
This is very true and I think that would be a great a area of future
work for James next papers: train 2 Gaussian processes, one to
estimate the expected cross validation error and the other to estimate
the expected runtime (CPU cost).
Then build a decision function that selects the next points to explore
from the estimated Pareto optimal front of those two objectives (low
cross validation error, low CPU cost).
You got me Olivier! I've definitely been thinking about this. Nothing
to report so far though. I suspect there may be some subtleties about
how to go about it but I haven't tried much.

- James
Yaser Martinez
2013-02-11 02:23:54 UTC
Permalink
Any further development on this? Is a "brute force" grid search the only
alternative to the problem of parameter selection for, let's say, SVMs?
Ronnie Ghose
2013-02-11 02:27:36 UTC
Permalink
afaik yes. Please tell me if i'm wrong, more experienced scikitters :)
Post by Yaser Martinez
Any further development on this? Is a "brute force" grid search the only
alternative to the problem of parameter selection for lets say SVMs?
a***@ais.uni-bonn.de
2013-02-11 03:03:47 UTC
Permalink
I have a pull request for randomized search but I need to update it as it is quite old...
Post by Ronnie Ghose
afaik yes. Please tell me if i'm wrong, more experienced scikitters :)
On Sun, Feb 10, 2013 at 9:23 PM, Yaser Martinez
Post by Yaser Martinez
Any further development on this? Is a "brute force" grid search the only
alternative to the problem of parameter selection for lets say SVMs?
--
This message was sent from my Android mobile phone with K-9 Mail.
Ronnie Ghose
2013-02-11 03:06:46 UTC
Permalink
Wei LI
2013-02-11 07:39:32 UTC
Permalink
In my view, the hyperparameters cannot be tuned with standard
optimization techniques (otherwise they would become ordinary parameters
and could no longer be set empirically?), so some heuristic on top of
brute-force searching may be a good idea. I am thinking of another
heuristic to accelerate the process: a warm start once we have already
trained some models. I do not have any sound theory for this, but for
SVMs in particular, since the global optimum is guaranteed, a warm start
might speed up convergence without biasing the trained model?

Best Regards,
Wei LI
Post by a***@ais.uni-bonn.de
I have a pull request for randomized search but I need to update it as it is quite old...
Post by Ronnie Ghose
afaik yes. Please tell me if i'm wrong, more experienced scikitters :)
Post by Yaser Martinez
Any further development on this? Is a "brute force" grid search the only
alternative to the problem of parameter selection for, let's say, SVMs?
Alexandre Gramfort
2013-02-11 08:30:05 UTC
Permalink
Indeed, SVM (libsvm / liblinear) could also benefit from a path strategy.

Alex
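libsvm does not expose warm starts in scikit-learn, so purely to
illustrate the path idea, here is a sketch using SGDClassifier (a linear
SVM trained by SGD), which does accept warm_start=True; X_train, y_train,
X_val, y_val are assumed to exist, and this is a stand-in, not the
libsvm/liblinear path strategy itself:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge", warm_start=True)
scores = []
for alpha in np.logspace(-1, -6, 6):   # from strong to weak regularization
    clf.set_params(alpha=alpha)
    clf.fit(X_train, y_train)          # reuses the previous coef_ as a starting point
    scores.append(clf.score(X_val, y_val))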
Post by Wei LI
In my view, the hyperparameters cannot be tuned with standard
optimization techniques (otherwise they would become ordinary parameters
and could no longer be set empirically?), so some heuristic on top of
brute-force searching may be a good idea. I am thinking of another
heuristic to accelerate the process: a warm start once we have already
trained some models. I do not have any sound theory for this, but for
SVMs in particular, since the global optimum is guaranteed, a warm start
might speed up convergence without biasing the trained model?
Best Regards,
Wei LI
Post by a***@ais.uni-bonn.de
I have a pull request for randomized search but I need to update it as it is quite old...
Post by Ronnie Ghose
afaik yes. Please tell me if i'm wrong, more experienced scikitters :)
On Sun, Feb 10, 2013 at 9:23 PM, Yaser Martinez
Post by Yaser Martinez
Any further development on this? Is a "brute force" grid search the only
alternative to the problem of parameter selection for, let's say, SVMs?
Mathieu Blondel
2013-02-11 09:12:25 UTC
Permalink
Post by Wei LI
In my view, the hyperparameters cannot be tuned with standard
optimization techniques (otherwise they would become ordinary parameters
and could no longer be set empirically?), so some heuristic on top of
brute-force searching may be a good idea. I am thinking of another
heuristic to accelerate the process: a warm start once we have already
trained some models. I do not have any sound theory for this, but for
SVMs in particular, since the global optimum is guaranteed, a warm start
might speed up convergence without biasing the trained model?
With respect to C, SVM can definitely be warm-started, although neither
libsvm nor our binding allows it at the moment. With respect to kernel
parameters, I doubt that warm-start helps, although I've never tried
(my intuition is that a small perturbation in a kernel parameter can
result in a radically different solution).

Warm-start is supported in some estimators like Lasso, for example:
from sklearn.linear_model import Lasso

# alphas, X_train, y_train, X_test, y_test are assumed to be defined
lasso = Lasso(warm_start=True)
scores = []
for alpha in alphas:
    lasso.set_params(alpha=alpha)
    lasso.fit(X_train, y_train)   # starts from the previous solution
    scores.append(lasso.score(X_test, y_test))

I created an issue for a warm-start aware grid search object:
https://github.com/scikit-learn/scikit-learn/issues/1674

Mathieu
unknown
1970-01-01 00:00:00 UTC
Permalink
...just an idea: what about a grid search using multi-dimensional
optimization? As in, compute a heuristic objective and try to converge to
an exact optimum...
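One way to read that suggestion is to treat the mean cross-validation
error as a black-box objective and hand it to a derivative-free optimizer.
A rough sketch, assuming X and y are already loaded and using the current
sklearn.model_selection import path; this is not an existing scikit-learn
search object:

import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cv_error(log_params):
    # search in log space so the optimizer sees a better-scaled problem
    C, gamma = np.exp(log_params)
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

result = minimize(cv_error, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
best_C, best_gamma = np.exp(result.x)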
Post by a***@ais.uni-bonn.de
I have a pull request for randomized search but I need to update it as it
is quite old...
Post by Ronnie Ghose
afaik yes. Please tell me if i'm wrong, more experienced scikitters :)
Post by Yaser Martinez
Any further development on this? Is a "brute force" grid search the only
alternative to the problem of parameter selection for lets say SVMs?