Discussion:
[Scikit-learn-general] partial_fit in SGD module
Mathieu Blondel
2011-03-31 11:39:32 UTC
As you may remember from a thread on the mailing-list (back a few
months ago), there was an agreement that online algorithms should
implement a partial_fit(X, y) method. The reason for adding a new
method was mainly a matter of semantics: partial_fit makes it clear
that the previous model is not erased when partial_fit is called
again.

I started to look into adding partial_fit to the SGD module. My
original idea was to rename the fit method in BaseSGD to _fit, add a
partial=True|False option and initialize the model parameters only
when partial=False or the parameters are not present yet. This way,
fit and partial_fit could easily be implemented in terms of _fit.
However, it is more difficult than I thought and I found potential
issues.

The first one is that the vector y may contain only a subset of the
classes (or in the extreme case, only one class). This is a problem
since SGD pre-allocates the coef_ matrix (n_classes x n_features). The
obvious solution is to use a dictionary to store the weight vectors of
each class instead of a numpy 2d-array. For compatibility with other
classifiers, we can implement coef_ as a property.
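A minimal sketch of the dict-based storage (the class and attribute
names are just for illustration, not actual scikits.learn code):

import numpy as np

class OnlineSGDSketch:
    # One weight vector per class, stored in a dict, so that classes
    # can appear for the first time in any partial_fit call.
    def __init__(self, n_features):
        self.n_features = n_features
        self._coef = {}  # class label -> 1d weight vector

    def _weights(self, label):
        # Lazily allocate a weight vector the first time a class is seen.
        if label not in self._coef:
            self._coef[label] = np.zeros(self.n_features)
        return self._coef[label]

    @property
    def coef_(self):
        # Present the usual (n_classes, n_features) 2d array on demand.
        labels = sorted(self._coef)
        return np.vstack([self._coef[l] for l in labels])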

The second potential problem is about the learning schedules. The
routines written in Cython need an n_iter argument. If the user makes
several passes over the dataset (see below) and calls partial_fit
repeatedly, we would need to save the state of the learning rate.
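For instance (a minimal sketch, assuming an inverse-scaling schedule;
the eta0 and t_ names are illustrative, not the module's actual
attributes):

class LearningRateState:
    # Keep a global update counter on the estimator so the schedule
    # resumes where it left off across partial_fit calls.
    def __init__(self, eta0=0.01):
        self.eta0 = eta0
        self.t_ = 0  # total number of SGD updates performed so far

    def next_eta(self):
        self.t_ += 1
        return self.eta0 / self.t_  # inverse scaling: eta_t = eta0 / t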

Peter, what areas of the code do you think need to be changed, and do
you have ideas on how to factor out as much common code as possible?

Another thing I was wondering: is it possible to extract reusable
utils from the SGD module, such as dense-sparse dot product,
dense-sparse addition, etc.? (I suppose we would need a .pxd header
file?) I was wondering about that because of custom loss functions
too.

Also, to put partial_fit into more context: although partial_fit can
potentially be used in a pure online setting, the plan was mainly to
use it for large-scale datasets, i.e., make several passes over the
dataset but load the data in blocks. The plan was to create an
iterator object which can be reset:

reader = SvmlightReader("file.txt", block_size=10000)
for n in range(n_iter):
    for X, y in reader:
        clf.partial_fit(X, y)
    reader.reset()

It could also be useful to have a method to generate a mini-batch
block randomly:
X, y = reader.random_minibatch(blocksize=1000)

A text-based file format like Svmlight's doesn't offer a direct way to
quickly retrieve a random line. We would need to build a "line => byte
offset" index (can be produced in memory when needed).

All in all, this made me think that if we want to start playing with
an online API, it would probably be easier to start with a good old
averaged perceptron than to try to modify the current SGD
module.

Mathieu
Peter Prettenhofer
2011-03-31 12:36:31 UTC
Hi Mathieu,
Post by Mathieu Blondel
[..]
The first one is that the vector y may contain only a subset of the
classes (or in the extreme case, only one class). This is a problem
since SGD pre-allocates the coef_ matrix (n_classes x n_features). The
obvious solution is to use a dictionary to store the weight vectors of
each class instead of a numpy 2d-array. For compatibility with other
classifiers, we can implement coef_ as a property.
I haven't thought about this... a quick and dirty way to solve it is
to specify the number of classes as a constructor argument (similar to
the TheanoSGD classifier in jaberg's image-patch branch).

Anyway, complete online multi-class classification requires a serious
refactoring of the current SGD code base!
Post by Mathieu Blondel
The second potential problem is about the learning schedules. The
routines written in Cython need an n_iter argument. If the user makes
several passes over the dataset (see below) and calls partial_fit
repeatedly, we would need to save the state of the learning rate.
That's true - the learning rate has to be stored.
Post by Mathieu Blondel
Peter, what areas of the code do you think need to be changed, and do
you have ideas on how to factor out as much common code as possible?
When you look at the current Cython code you will notice that it
pretty much relies on numpy ndarrays or scipy's sparse matrices.
However, we could change the loop over the training examples from row
indices [1] to something which returns a pair of x and y [2], where x
may be an ndarray for the dense case, or a sparse matrix with a single
row or a recarray (as in bolt) for the sparse case. This would require
only minor refactoring but would make the current code a little bit
slower (a factor of 2).

[1] https://github.com/pprett/scikit-learn/blob/master/scikits/learn/linear_model/sgd_fast.pyx#L310

[2] https://github.com/pprett/bolt/blob/master/bolt/trainer/sgd.pyx#L428
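To make the contrast concrete, here is a rough Python-level sketch of
the pair-based loop for the dense case (the real code would stay in
Cython; the hinge-loss update is just an example):

import numpy as np

def sgd_epoch(dataset, w, eta=0.01):
    # One SGD pass over any iterable of (x, y) pairs, instead of
    # indexing rows of a preallocated matrix.
    for x, y in dataset:
        x = np.asarray(x)
        if y * np.dot(w, x) < 1.0:  # hinge loss is violated
            w += eta * y * x
    return w

The same loop shape then works whether the pairs come from a dense
array, a one-row sparse matrix (with a suitable dot product), or a
bolt-style recarray.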
Post by Mathieu Blondel
Another thing I was wondering: is it possible to extract reusable
utils from the SGD module, such as dense-sparse dot product,
dense-sparse addition, etc.? (I suppose we would need a .pxd header
file?) I was wondering about that because of custom loss functions
too.
Mathieu Blondel
2011-03-31 14:28:20 UTC
Peter,

On Thu, Mar 31, 2011 at 9:36 PM, Peter Prettenhofer
Post by Peter Prettenhofer
I haven't thought about this... a quick and dirty way to solve it is
to specify the number of classes as a constructor argument (similar to
the TheanoSGD classifier in jaberg's image-patch branch).
I thought about this too but it implies that we have two distinct
classes: one for batch and one for online. It would make things
simpler if we could have fit and partial_fit in the same class (or
fit_iterable if we decide to go this way).
Alexandre Passos
2011-03-31 14:48:38 UTC
Post by Mathieu Blondel
Peter,
On Thu, Mar 31, 2011 at 9:36 PM, Peter Prettenhofer
Post by Peter Prettenhofer
I haven't thought about this... a quick and dirty way to solve it is
to specify the number of classes as a constructor argument (similar to
the TheanoSGD classifier in jaberg's image-patch branch).
I thought about this too but it implies that we have two distinct
classes: one for batch and one for online. It would make things
simpler if we could have fit and partial_fit in the same class (or
fit_iterable if we decide to go this way).
Peter Prettenhofer
2011-03-31 15:20:34 UTC
Wow - that sounds interesting indeed - now I know why Vowpal Wabbit has
such an odd formula for the weight update - multiplying the gradient
by the weight was too simple to be true...

But before you integrate it into the cython code we should merge my
learningrate branch which introduces constant and inverse scaling
learning rates.

best,
Peter
[..]
I was thinking of user-defined losses, for example to address
cost-sensitive learning (e.g., incur a stronger loss for some classes
than others).
I think it's simpler to adapt all standard losses to be
cost-sensitive (with an optional parameter which is a vector with the
cost of getting each example wrong) than it is to handle general
losses. There is even a way to adapt common online/non-online SGD to
this; see http://arxiv.org/pdf/1011.1576 .
I can implement a variant of this for the batch SGD tomorrow in the sprint
if you're interested.
--
 - Alexandre
--
Peter Prettenhofer
Alexandre Passos
2011-03-31 16:10:27 UTC
On Thu, Mar 31, 2011 at 12:20, Peter Prettenhofer
Post by Peter Prettenhofer
Wow - that sounds interesting indeed - now I know why Vowpal Wabbit has
such an odd formula for the weight update - multiplying the gradient
by the weight was too simple to be true...
But before you integrate it into the cython code we should merge my
learningrate branch which introduces constant and inverse scaling
learning rates.
Or I can fork from your branch, whichever way is easier.
--
 - Alexandre
Peter Prettenhofer
2011-03-31 16:51:47 UTC
Post by Alexandre Passos
[..]
Or I can fork from your branch, whichever way is easier.
This would be even better - then you could review the changes (API,
default settings, etc.) :-)

best,
Peter
--
Peter Prettenhofer
Mathieu Blondel
2011-04-02 10:02:12 UTC
So what do people think about giving up on partial_fit and using
fit_iterable instead?

reader = SvmlightReader("file.txt", block_size=10000)
clf.fit_iterable(reader, n_iter=10)

To solve the problem that the label set cardinality is not known in
advance, we can require any reader object to implement an n_classes
property.

reader.n_classes
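For example (a sketch; SvmlightReader is still the hypothetical reader
from earlier in the thread, and the property layout is an assumption):

class SvmlightReader:
    # The reader is told (or discovers while indexing the file) how
    # many classes it contains, and exposes that as a property.
    def __init__(self, path, block_size=10000, n_classes=None):
        self.path = path
        self.block_size = block_size
        self._n_classes = n_classes

    @property
    def n_classes(self):
        return self._n_classes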

BTW, ProbabilisticPCA can be fit with EM, which means that it could
implement a fit_iterable method too.

Mathieu
Gael Varoquaux
2011-04-02 10:10:45 UTC
Post by Mathieu Blondel
So what do people think about giving up on partial_fit and using
fit_iterable instead?
reader = SvmlightReader("file.txt", block_size=10000)
clf.fit_iterable(reader, n_iter=10)
I would rather prefer partial_fit, because it means that in this case it
is clearly the role of the framework (not in the scikit) to distribute
the computation to the algorithm. It is an inversion of control issue:
where do we want the control to be? Why would you prefer fit_iterable?

Iterables can also be nasty beasts, as Vincent discovered yesterday, as
consuming them is a side effect that modifies them. Things can therefore
happen behind your back. I guess that's why I prefer the code outside the
algorithm to be in charge of distributing the data.
Post by Mathieu Blondel
To solve the problem that the label set cardinality is not known in
advance, we can require any reader object to implement an n_classes
property.
reader.n_classes
That I would clearly frown upon, because it means that the contract with
the input object is no longer that it is a plain Python iterable, but
that it is a custom object that has a new interface that people must
learn and code to.
Post by Mathieu Blondel
BTW, ProbabilisticPCA can be fit with EM, which means that it could
implement a fit_iterable method too.
Yes, clearly, which would be great, as it would give us a data reduction
code that works with data not fitting in memory.

G
Mathieu Blondel
2011-04-02 10:29:09 UTC
On Sat, Apr 2, 2011 at 7:10 PM, Gael Varoquaux
Post by Gael Varoquaux
I would rather prefer partial_fit, because it means that in this case it
is clearly the role of the framework (not in the scikit) to distribute
the computation to the algorithm. It is an inversion of control issue:
where do we want the control to be? Why would you prefer fit_iterable?
Semantically, partial_fit is a pure online setting while fit_iterable
is a large-scale learning setting. With partial_fit, you lose the
notion of iteration over the entire dataset, which may be a problem
for algorithms which update the learning rate after each iteration.

I would like to hear Peter's opinion as he has already thought about
those problems for bolt.
Post by Gael Varoquaux
  reader.n_classes
That I would clearly frown upon, because it means that the contract with
the input object is no longer that it is a plain Python iterable, but
that it is a custom object that has a new interface that people must
learn and code to.
It will indeed be a custom iterable: as mentioned earlier in the
thread, the plan is to have a reset method too.

But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...

Mathieu
Mathieu Blondel
2011-04-02 10:36:21 UTC
Post by Mathieu Blondel
But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...
As Peter was saying, it seems hard to support both the pure-online
setting and the large-scale learning setting.

Mathieu
Alexandre Passos
2011-04-02 10:44:08 UTC
Post by Mathieu Blondel
On Sat, Apr 2, 2011 at 7:10 PM, Gael Varoquaux
Post by Gael Varoquaux
I would rather prefer partial_fit, because it means that in this case it
is clearly the role of the framework (not in the scikit) to distribute
the computation to the algorithm. It is an inversion of control issue:
where do we want the control to be? Why would you prefer fit_iterable?
Semantically, partial_fit is a pure online setting while fit_iterable
is a large-scale learning setting. With partial_fit, you lose the
notion of iteration over the entire dataset, which may be a problem
for algorithms which update the learning rate after each iteration.
I'm partial towards partial_fit. I think it can be used in more
settings, especially as a building block. If you do want something like
fit_iterable, it can be implemented with partial_fit and something
like finish_iteration() to update the learning rates or something
similar.
Post by Mathieu Blondel
It will indeed be a custom iterable: as mentioned earlier in the
thread, the plan is to have a reset method too.
But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...
I think this is probably going to end up as something highly confusing
to the average scikit user: that you can't just say
classifier.fit_iterable(parse_line_as_feature_vector(l) for l in
file(...)).
--
 - Alexandre
Mathieu Blondel
2011-04-02 11:11:31 UTC
Post by Alexandre Passos
I'm partial towards partial_fit. I think it can be used in more
settings, especially as a building block. If you do want something like
fit_iterable, it can be implemented with partial_fit and something
like finish_iteration() to update the learning rates or something
similar.
So to illustrate what it could look like:

class OnlineMixin:

    def fit_reader(self, reader, n_iter):
        for n in range(n_iter):
            for X, y in reader:
                self.partial_fit(X, y)
            self.finish_iteration()
            reader.reset()

reader = SvmlightReader("file.txt", block_size=10000)
clf.fit_reader(reader, n_iter=10)
Post by Alexandre Passos
I think this is probably going to end up as something highly confusing
to the average scikit user: that you can't just say
classifier.fit_iterable(parse_line_as_feature_vector(l) for l in
file(...)).
Indeed. We can call it fit_reader to prevent any confusion. The
motivation for having a reader object rather than a plain iterable is
that, by definition, iterables are for the pure online setting: they
make only one pass over the data.
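A minimal sketch of the difference (the class name is illustrative): a
plain generator is exhausted after one pass, while a reader can be
rewound.

class BlockReader:
    # Sketch over in-memory arrays; a real reader would rewind a
    # file handle instead.
    def __init__(self, X, y, block_size):
        self.X, self.y, self.block_size = X, y, block_size
        self._pos = 0

    def __iter__(self):
        while self._pos < len(self.y):
            stop = self._pos + self.block_size
            yield self.X[self._pos:stop], self.y[self._pos:stop]
            self._pos = stop

    def reset(self):
        self._pos = 0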

To solve the problem that the label set cardinality is not known in
advance, we would need to use suitable data structures such as a
dictionary of weight vectors instead of a 2d-matrix. (That seems like
a lot of refactoring for the SGD module.)

Mathieu
Gael Varoquaux
2011-04-02 10:48:28 UTC
Post by Mathieu Blondel
Post by Gael Varoquaux
Why would you prefer fit_iterable?
Semantically, partial_fit is a pure online setting while fit_iterable
is a large-scale learning setting. With partial_fit, you lose the
notion of iteration over the entire dataset, which may be a problem
for algorithms which update the learning rate after each iteration.
Fair enough. That's a good answer, we can keep it in mind.
Post by Mathieu Blondel
Post by Gael Varoquaux
  reader.n_classes
That I would clearly frown upon, because it means that the contract with
the input object is no longer that it is a plain Python iterable, but
that it is a custom object that has a new interface that people must
learn and code to.
It will indeed be a custom iterable: as mentioned earlier in the
thread, the plan is to have a reset method too.
But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...
Yes, I am not terribly happy about this.

Ideally, we'd like a generator function, and not an iterator. In other
words, something that knows how to create a new iterator, and not an
iterator by itself. The problem is that I am not sure of the best
way to achieve this with the standard library/standard objects.
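One way to stay within standard objects (a sketch, not a settled API):
accept a zero-argument callable that builds a fresh iterator, rather
than the iterator itself.

def fit_iterable(clf, make_iterator, n_iter=5):
    # make_iterator() is called once per epoch, so exhausting the
    # previous iterator is harmless.
    for _ in range(n_iter):
        for X, y in make_iterator():
            clf.partial_fit(X, y)
    return clf

# e.g. fit_iterable(clf, lambda: iter(list_of_blocks), n_iter=10)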

Note that you cannot know the length of a standard iterator in advance.
They do not pickle or copy either.

I am a bit worried about going down the iterator alley. I understand the
issue with partial_fit, and I understand that it must be addressed. I am
OK with an inversion of control. But I would like the issue of the data
provider object to be examined in detail, and I would really like us to
avoid creating an additional object that is not in the standard library.

G
Olivier Grisel
2011-04-02 11:39:31 UTC
Post by Mathieu Blondel
On Sat, Apr 2, 2011 at 7:10 PM, Gael Varoquaux
Post by Gael Varoquaux
I would rather prefer partial_fit, because it means that in this case it
is clearly the role of the framework (not in the scikit) to distribute
the computation to the algorithm. It is an inversion of control issue:
where do we want the control to be? Why would you prefer fit_iterable?
Semantically, partial_fit is a pure online setting while fit_iterable
is a large-scale learning setting. With partial_fit, you lose the
notion of iteration over the entire dataset, which may be a problem
for algorithms which update the learning rate after each iteration.
I agree with Gael. I would like the online-able estimators to only
implement partial_fit, and to have a generic large-scale learning
helper tool in the scikit that takes an iterable dataset reader and a
pipeline of online estimators as input and is in charge of the
large-scale learning schedule.

This tool could be in charge of managing online cross-validation and
parameter auto-tuning, epoch iterations (if the source dataset is
somehow resettable) and early stopping.

Basically I see this scheduler tool as some sort of large-scale dual
of the batch-oriented GridSearchCV. I don't want to have to put online
cross-validation logic and online parameter-tuning heuristics inside
the estimators themselves if possible.
Post by Mathieu Blondel
I would like to hear Peter's opinion as he has already thought about
those problems for bolt.
Post by Gael Varoquaux
  reader.n_classes
That I would clearly frown upon, because it means that the contract with
the input object is no longer that it is a plain Python iterable, but
that it is a custom object that has a new interface that people must
learn and code to.
It will indeed be a custom iterable: as mentioned earlier in the
thread, the plan is to have a reset method too.
Yes, we should support both: if we have an explicit n_classes and a
reset method we use them; if not, we do a bit of look-ahead to guess a
reasonable estimate of n_classes (and ignore later-discovered new
classes) and support a single epoch.
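The look-ahead could be done with itertools, e.g. (a sketch; the
function name and the five-block window are arbitrary):

import itertools

def peek_n_classes(block_iter, n_blocks=5):
    # Buffer the first few (X, y) blocks to estimate the label set,
    # then chain them back so no data is lost.
    head = list(itertools.islice(block_iter, n_blocks))
    classes = sorted({label for _, y in head for label in y})
    return len(classes), itertools.chain(head, block_iter)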
Post by Mathieu Blondel
But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...
It could be part of the dataset's metadata.

We could also provide a tool that converts an svmlight-formatted
dataset into a Python-friendly binary format (memmapped dense numpy
arrays and/or a memmapped CSR matrix for the sparse case) + metadata on
the number of seen classes and features: basically a large-scale Bunch
object.
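A sketch of the dense side of such a converter (the file name, dtype
and row iterator are assumptions):

import numpy as np

def rows_to_memmap(rows, n_samples, n_features, path="data.mmap"):
    # One offline pass: fill a disk-backed dense array from an
    # iterator of (indices, values) sparse rows.
    X = np.memmap(path, dtype=np.float64, mode="w+",
                  shape=(n_samples, n_features))
    for i, (indices, values) in enumerate(rows):
        X[i, indices] = values
    X.flush()  # make sure everything is written to disk
    return X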

This first preprocessing scan of the dataset would be offline (but
could probably be made multi-core friendly if CPU-bound). That would
result in some kind of two-pass learning. It would allow us to address
the use case of large-scale learning properly (which is the most
important in my opinion). For real offline learning, we can postpone
the discussion to after we address large-scale learning correctly.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2011-04-02 11:41:45 UTC
Post by Olivier Grisel
For real offline learning, we can postpone
the discussion to after we address large-scale learning correctly.
I meant: For real *online* learning, we can postpone...
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
James Bergstra
2011-04-02 16:06:21 UTC
Sorry if I'm jumping in from well out-of-the-loop, but it seems to me that
the online/partial questions are at different levels of analysis.

The common problem is the fit() method: for online learning it makes no
sense because the data isn't available at the time of the call, and in
large-data settings it is inappropriate because it may run unnecessarily
long.

At the same time, many algorithms are inherently iterative, and can usefully
be seen as anytime algorithms even if they also have natural stopping
conditions.

So it seems to me that an easy way to tackle online learning and large-data
learning with the algorithms that are already in place is to refactor them
into incremental pieces of work. This makes them consistent with an
iterator-like style of coding, even though they are not necessarily Python
iterators (formally).

So what about thinking along the lines of incremental fitting:

class IncrementalFitMixin(object):
    def fit(self):
        while self.incremental_fit():
            continue

This is not an API for online learning or for large-scale learning; it is
simply a way of presenting many existing algorithms in a way that makes them
useful for online learning or large-data learning.

Online learning can be done by a wrapper around an incremental learner if it
swaps the training data between calls to self.incremental_fit(), and
generally controls the stream of data used by the incremental_fit() call.
Even non-online learning like EM can be done this way, and this interface
would bring greater flexibility.
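A sketch of such a wrapper (the attribute names are made up for
illustration):

class OnlineAdapter:
    # Turn an incremental learner into an online one by swapping in
    # the new batch before each unit of work.
    def __init__(self, estimator):
        self.estimator = estimator

    def partial_fit(self, X, y):
        self.estimator.current_X = X  # swap the training data
        self.estimator.current_y = y
        self.estimator.incremental_fit()  # one incremental step on it
        return self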

I'm not sure if this way of looking at things is useful or interesting, but
I wanted to throw the idea out there. The learning algorithms we write
(RBMs, neural nets, deep nets) in our lab would fit more naturally into this
sort of package than the basic fit() interface.

James
Post by Olivier Grisel
Post by Olivier Grisel
For real offline learning, we can postpone
the discussion to after we adress large scale learning correctly.
I meant: For real *online* learning, we can postpone...
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
http://www-etud.iro.umontreal.ca/~bergstrj
Alexandre Passos
2011-04-02 16:34:39 UTC
Post by James Bergstra
Sorry if I'm jumping in from well out-of-the-loop, but it seems to me that
the online/partial questions are at different levels of analysis.
The common problem is the fit() method: for online learning it makes no
sense because the data isn't available at the time of the call, and in
large-data settings it is inappropriate because it may run unnecessarily
long.
At the same time, many algorithms are inherently iterative, and can usefully
be seen as anytime algorithms even if they also have natural stopping
conditions.
So it seems to me that an easy way to tackle online learning and large-data
learning with the algorithms that are already in place is to refactor them
into incremental pieces of work.  This makes them consistent with an
iterator-like style of coding, even though they are not necessarily Python
iterators (formally).
[..]
class IncrementalFitMixin(object):
    def fit(self):
        while self.incremental_fit():
            continue
I'm not sure my vote counts, but +1. This would be ideal, especially
for hairy problems where you might want to implement your own stopping
conditions.
--
 - Alexandre
Olivier Grisel
2011-04-02 16:55:50 UTC
Post by Alexandre Passos
Post by James Bergstra
So it seems to me that an easy way to tackle online learning and large-data
learning with the algorithms that are already in place is to refactor them
into incremental pieces of work.  This makes them consistent with an
iterator-like style of coding, even though they are not necessarily Python
iterators (formally).
[..]
class IncrementalFitMixin(object):
    def fit(self):
        while self.incremental_fit():
            continue
I'm not sure my vote counts, but +1. This would be ideal, especially
for hairy problems where you might want to implement your own stopping
conditions.
I am ok to have partial_fit(X_batch, y_batch) return False or raise
StopIteration when some internal stopping criterion is reached, but I
don't get how James' API is feeding the data to the model.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2011-04-02 17:20:20 UTC
Post by Olivier Grisel
Post by Alexandre Passos
Post by James Bergstra
So it seems to me that an easy way to tackle online learning and large-data
learning with the algorithms that are already in place is to refactor them
into incremental pieces of work.  This makes them consistent with an
iterator-like style of coding, even though they are not necessarily Python
iterators (formally).
[..]
class IncrementalFitMixin(object):
    def fit(self):
        while self.incremental_fit():
            continue
I'm not sure my vote counts, but +1. This would be ideal, especially
for hairy problems where you might want to implement your own stopping
conditions.
I am ok to have partial_fit(X_batch, y_batch) return False or raise
StopIteration when some internal stopping criterion is reached, but I
don't get how James' API is feeding the data to the model.
I think he meant to write
[..]
          continue
But then you delegate the mini-batch cutting work to the
self.incremental_fit method, or is X already a mini-batch? In a
large-scale setting you cannot expect that the complete X will fit in
memory (unless you use memmapped arrays).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
James Bergstra
2011-04-02 22:27:19 UTC
Post by Olivier Grisel
Post by Alexandre Passos
Post by James Bergstra
So it seems to me that an easy way to tackle online learning and
large-data learning with the algorithms that are already in place is
to refactor them into incremental pieces of work. This makes them
consistent with an iterator-like style of coding, even though they are
not necessarily Python iterators (formally).
[..]
class IncrementalFitMixin(object):
    def fit(self):
        while self.incremental_fit():
            continue
I'm not sure my vote counts, but +1. This would be ideal, especially
for hairy problems where you might want to implement your own stopping
conditions.
I am ok to have partial_fit(X_batch, y_batch) return False or raise
StopIteration when some internal stopping criterion is reached, but I
don't get how James' API is feeding the data to the model.
I think he meant to write
[..]
continue
But then you delegate the mini-batch cutting work to the
self.incremental_fit method, or is X already a mini-batch? In a
large-scale setting you cannot expect that the complete X will fit in
memory (unless you use memmapped arrays).
Hmm, good point. I think the idea of an incremental bit of computation goes
well with an incremental bit of data (as in online learning). So how about
we add one special flag to incremental_fit's arguments, a boolean that is
true if the data args have changed since the last call.

How about:

class IncrementalFitMixin(object):
    def fit(self, *args, **kwargs):
        if 'data_changed' in kwargs:
            raise TypeError()
        while self.incremental_fit(data_changed=False, *args, **kwargs):
            continue

BaseEstimators that are not suited to online learning can either fail or
compute nonsense if the caller changes the data from call to call of
incremental_fit... or, more ideally, they would raise an exception.
A more natural interface for this sort of call would have flags that serve
as "dirty bits" for each of the arguments to fit. I don't know if that
happens frequently enough to warrant a more complicated API, but if later we
want that kind of API, then the data_changed argument can simply serve as a
shorthand for "all bits dirty" and that future API could hopefully be a
backward-compatible extension.

I'm imagining a standard pattern for incremental_fit would be a switch
of the form:

if first_call:
    setup_stuff()
    return 1
elif data_changed:
    raise NotImplementedError()
elif self.not_done:
    normal_incremental_work()
    return 1
else:
    return 0

James
--
http://www-etud.iro.umontreal.ca/~bergstrj
Mathieu Blondel
2011-04-04 03:56:49 UTC
Post by Olivier Grisel
I agree with Gael. I would like the online-able estimators to only
implement partial_fit, and to have a generic large-scale learning
helper tool in the scikit that takes an iterable dataset reader and a
pipeline of online estimators as input and is in charge of the
large-scale learning schedule.
It's hard to tell if such a generic scheduler is possible without
actually implementing a bunch of algorithms, so let's postpone the
decision until we can make an informed choice. So, to summarize what
has been suggested so far:

- fit_reader(reader) method working with a reader object (extended iterator)
- partial_fit(X, y) working with a subset of the data in numpy array /
sparse matrix + finish_iteration() to handle learning rate
- partial_fit(X, y) + generic scheduler utility
- incremental fit

- use a dictionary or a growing list of parameter vectors
- require the reader object to provide an n_classes attribute

The decision should focus on large-scale learning, and the API should
take into account not only large-scale "online" algorithms (SGD,
Perceptron, fast k-means) but also iterative algorithms
(ProbabilisticPCA, ...).
Post by Olivier Grisel
Basically I see this scheduler tool as some sort of large-scale dual
of the batch-oriented GridSearchCV. I don't want to have to put online
cross-validation logic and online parameter-tuning heuristics inside
the estimators themselves if possible.
So far I found the GridSearchCV object only moderately useful. We've
seen that more and more objects have a *CV counterpart to handle
cross-validation efficiently and GridSearchCV has some limitations
when you start having specific needs (e.g., it can't use
sample_weights).
Post by Olivier Grisel
Yes, we should support both: if we have an explicit n_classes and a
reset method we use them; if not, we do a bit of look-ahead to guess a
reasonable estimate of n_classes (and ignore later-discovered new
classes) and support a single epoch.
Ignoring classes is unacceptable in my opinion...

Mathieu
Mathieu Blondel
2011-04-04 04:12:21 UTC
Post by Mathieu Blondel
- use a dictionary or a growing list of parameter vectors
- require the reader object to provide an n_classes attribute
If you use one-vs-all and your current example is of a newly seen class,
it means that all previously seen examples should have been used as
negative examples to update the weight vector of this new class. So
knowing n_classes in advance does make things infinitely easier.

Mathieu
Gael Varoquaux
2011-04-04 04:39:07 UTC
Post by Mathieu Blondel
So far I found the GridSearchCV object only moderately useful. We've
seen that more and more objects have a *CV counterpart to handle
cross-validation efficiently and GridSearchCV has some limitations
when you start having specific needs (e.g., it can't use
sample_weights).
At some point, we should think about a solution to that last problem. One
way to do it could be:

 1. removing **params from the fit method (I know that you don't like
them, Mathieu)

 2. having the convention that any argument to 'fit' must be iterable,
and the first dimension must represent samples (not sure if this will
hold for every estimator).

 3. adding a dispatcher mechanism to the cross-validation utilities that
knows how to split these arguments, and distribute them to the
estimators (a sketch follows below).
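Point 3 could look roughly like this (a sketch; the helper name is
hypothetical):

import numpy as np

def split_fit_params(train_index, **fit_params):
    # Slice every sample-aligned fit argument along its first axis
    # with the fold's training indices.
    return dict((name, np.asarray(value)[train_index])
                for name, value in fit_params.items())

# e.g. clf.fit(X[train], y[train], **split_fit_params(train, sample_weight=w))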

Anyhow, I think that the problem that you are raising is an important
one, and I'd really like it solved.
Post by Mathieu Blondel
Post by Olivier Grisel
Yes we should support both: if we have explicit n_classes and reset
method we use them, if not we do a bit of look ahead to guess a
reasonable estimate of n_classes (and ignore later discovered new
classes) and support a single epoch.
Ignoring classes is unacceptable in my opinion...
Yes, agreed. However, it should be fairly easy to pass the information
about the number of classes at some point.

G
Olivier Grisel
2011-04-04 09:57:37 UTC
Post by Gael Varoquaux
Post by Mathieu Blondel
So far I found the GridSearchCV object only moderately useful. We've
seen that more and more objects have a *CV counterpart to handle
cross-validation efficiently and GridSearchCV has some limitations
when you start having specific needs (e.g., it can't use
sample_weights).
At some point, we should think about a solution to that last problem. One
way to do it could be:
 1. removing **params from the fit method (I know that you don't like
them, Mathieu)
[..]
sample_weights are data-related: they have the same first dim as X and y.
They must hence be passed as a fit param, not as a constructor param. But
GridSearchCV allows passing fit params as well, if I am not mistaken.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Peter Prettenhofer
2011-04-04 10:27:28 UTC
I've recently tried GridSearchCV in combination with sample_weight and
it did not work (I haven't tried the latest development version
though).

There is another problem we have to keep in mind: some fit parameters
(sample_weight) are coupled to the current fold while others
(class_weight) are not.

best,
Peter
Post by Olivier Grisel
Post by Gael Varoquaux
Post by Mathieu Blondel
So far I found the GridSearchCV object only moderately useful. We've
seen that more and more objects have a *CV counterpart to handle
cross-validation efficiently and GridSearchCV has some limitations
when you start having specific needs (e.g., it can't use
sample_weights).
At some point, we should think about a solution to that last problem. One
way to do it could be:
 1. removing **params from the fit method (I know that you don't like
them, Mathieu)
[..]
sample_weights are data-related: they have the same first dim as X and y.
They must hence be passed as a fit param, not as a constructor param. But
GridSearchCV allows passing fit params as well, if I am not mistaken.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
Peter Prettenhofer
Olivier Grisel
2011-04-04 10:32:40 UTC
Post by Peter Prettenhofer
I've recently tried GridSearchCV in combination with sample_weight and
it did not work (I haven't tried the latest development version
tough).
There is another problem we have to keep in mind: some fit parameters
(sample_weight) are coupled to the current fold while others
(class_weight) are not.
Indeed, and both are data dependent at the same time... unless
class_weight is set to 'auto'.

This API thing is getting interesting...
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel