Mathieu Blondel
2011-03-31 11:39:32 UTC
As you may remember from a thread on the mailing list a few months
ago, there was an agreement that online algorithms should implement a
partial_fit(X, y) method. The reason for adding a new method was
mainly a matter of semantics: partial_fit makes it clear that the
previous model is not erased when partial_fit is called again.
I started to look into adding partial_fit to the SGD module. My
original idea was to rename the fit method in BaseSGD to _fit, add a
partial=True|False option and initialize the model parameters only
when partial=False or the parameters are not present yet. This way,
fit and partial_fit could easily be implemented in terms of _fit.
However, it turned out to be more difficult than I thought, and I
found some potential issues.
The first one is that the vector y may contain only a subset of the
classes (or, in the extreme case, only one class). This is a problem
since SGD pre-allocates the coef_ matrix (n_classes x n_features). The
obvious solution is to store the weight vectors in a dictionary keyed
by class, instead of a 2d numpy array. For compatibility with the
other classifiers, we can expose coef_ as a property.
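A minimal sketch of the dict-based storage with a coef_ property (the
class and method names are just illustrative, not actual scikit-learn
code):

```python
import numpy as np

class OnlineClassifierSketch:
    """Hypothetical sketch: keep one weight vector per class in a dict,
    so new classes can be allocated lazily as partial_fit sees them."""

    def __init__(self, n_features):
        self.n_features = n_features
        self._weights = {}  # class label -> 1d weight vector

    def _get_weights(self, label):
        # Allocate a weight vector the first time a class is seen.
        if label not in self._weights:
            self._weights[label] = np.zeros(self.n_features)
        return self._weights[label]

    @property
    def coef_(self):
        # Expose the usual (n_classes, n_features) array for
        # compatibility, stacking the per-class vectors in label order.
        labels = sorted(self._weights)
        return np.vstack([self._weights[l] for l in labels])
```

This way, a call to partial_fit with a previously unseen class only
touches the dict; coef_ is rebuilt on demand.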
The second potential problem concerns the learning schedules. The
routines written in Cython need an n_iter argument. If the user makes
several passes over the dataset (see below) and calls partial_fit
repeatedly, wouldn't we need to save the state of the learning rate
between calls?
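One option would be to keep a global step counter on the estimator so
the schedule resumes rather than restarts; a minimal sketch (the
eta0 / (1 + t) schedule and the class name are just illustrative):

```python
class ScheduleState:
    """Hypothetical sketch: persist the global step counter t across
    partial_fit calls, so the learning rate keeps decaying instead of
    restarting from eta0 on every call."""

    def __init__(self, eta0=1.0):
        self.eta0 = eta0
        self.t = 0  # total number of updates seen so far

    def next_eta(self):
        # Inverse-scaling schedule; any schedule depending on t
        # benefits from persisting t the same way.
        eta = self.eta0 / (1.0 + self.t)
        self.t += 1
        return eta
```

Each partial_fit call would then advance the same ScheduleState
instead of receiving a fresh n_iter-based schedule.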
Peter, what areas of the code do you think need to be changed and do
you have ideas how to factor as much code as possible?
Another thing I was wondering: is it possible to extract reusable
utilities from the SGD module, such as the dense-sparse dot product,
dense-sparse addition, etc.? (I suppose we would need a .pxd header
file?) I was also wondering about that because of custom loss
functions.
Also, to put partial_fit into more context: although partial_fit could
potentially be used in a pure online setting, the plan was mainly to
use it for large-scale datasets, i.e. make several passes over the
dataset but load the data in blocks. The plan was to create an
iterator object which can be reset:
reader = SvmlightReader("file.txt", block_size=10000)
for n in range(n_iter):
    for X, y in reader:
        clf.partial_fit(X, y)
    reader.reset()
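A minimal sketch of such a resettable reader, here over an in-memory
dataset (a real SvmlightReader would parse the file lazily; the class
name and layout are just illustrative):

```python
class BlockReader:
    """Hypothetical sketch of a resettable block iterator: yields
    (X, y) chunks of at most block_size rows, and reset() rewinds it
    for the next pass over the data."""

    def __init__(self, X, y, block_size):
        self.X, self.y, self.block_size = X, y, block_size
        self.reset()

    def reset(self):
        # Rewind to the beginning of the dataset.
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self.y):
            raise StopIteration
        start, self._pos = self._pos, self._pos + self.block_size
        return self.X[start:self._pos], self.y[start:self._pos]
```

After a full pass the reader is exhausted, so reset() must be called
before the next epoch, exactly as in the loop above.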
It could also be useful to have a method that draws a mini-batch
block at random:
X, y = reader.random_minibatch(blocksize=1000)
A text-based file format like Svmlight's doesn't offer a direct way to
quickly retrieve a random line. We would need to build a "line => byte
offset" index (can be produced in memory when needed).
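Such an index can be built in one sequential pass with file.tell();
a sketch (the function names are mine, not an existing API):

```python
import random

def build_line_index(path):
    """Hypothetical sketch: record the byte offset of each line so a
    random line can be fetched with one seek instead of a scan."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            offsets.append(f.tell())
            if not f.readline():
                offsets.pop()  # drop the offset recorded at EOF
                break
    return offsets

def read_random_line(path, offsets, rng=random):
    # One seek + one readline, regardless of file size.
    with open(path, "rb") as f:
        f.seek(rng.choice(offsets))
        return f.readline()
```

The index itself is just a list of integers, so it is cheap to keep
in memory even for large files.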
All in all, this made me think that if we want to start playing with
an online API, it would probably be easier to begin with a good old
averaged perceptron rather than trying to modify the current SGD
module.
Mathieu