Discussion:
[Scikit-learn-general] Classifying where some labels are not in dataset
Doug Coleman
2012-09-25 17:31:10 UTC
Permalink
Hi,

I'm making an ensemble of trees by hand for classification and trying
to merge their outputs with predict_proba. My labels are integers
-2..2. The problem is that -2 and 2 are rare labels. Now assume that I
train these trees with different but related data sets, some of which
don't even contain -2 or 2. The shape of predict_proba varies based on
number of unique labels in the input y, so instead of always getting 5
columns in predict_proba, I only get columns wherever there was a
label. So to merge predictions from the trees, now I have to do
bookkeeping to remember which trees had which labels in them, and it's
a mess.

Someone suggested I use sklearn.feature_extraction.DictVectorizer, but
that seems to be to track the X matrix instead of y. What I might end
up doing is unique/sorting the y labels for each tree, calling
predict_proba on each, adding column vectors of zeros to the
predictions, and then merging the results.
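
For reference, the zero-padding workaround described above might look roughly like the following. This is only a sketch in plain NumPy: `pad_proba` and `ALL_LABELS` are hypothetical names (not part of scikit-learn), and it assumes each tree's labels are sorted and form a subset of the full label set.

```python
import numpy as np

ALL_LABELS = np.array([-2, -1, 0, 1, 2])  # full, sorted label set (assumed known up front)


def pad_proba(proba, seen_labels, all_labels=ALL_LABELS):
    """Expand an (n_samples, n_seen) matrix to (n_samples, n_all).

    Columns for labels a tree never saw are filled with zeros.
    """
    proba = np.asarray(proba, dtype=float)
    padded = np.zeros((proba.shape[0], len(all_labels)))
    # Position of each seen label inside the full, sorted label list.
    columns = np.searchsorted(all_labels, seen_labels)
    padded[:, columns] = proba
    return padded


# Tree A was trained on labels -1, 0, 1 only; tree B saw all five labels.
proba_a = np.array([[0.2, 0.5, 0.3]])
proba_b = np.array([[0.1, 0.1, 0.4, 0.3, 0.1]])

# Once both matrices have five columns, merging is a plain average.
merged = (pad_proba(proba_a, [-1, 0, 1]) + pad_proba(proba_b, ALL_LABELS)) / 2
print(merged)
```

After padding, every tree's output has the same five columns in the same order, so no per-tree bookkeeping is needed at merge time.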

What I would prefer to do is call fit with a set of possible labels,
like so: clf.fit(X, y, labels=[-2,-1,0,1,2]) so scikit could do the
bookkeeping for me. Obviously some of the trees in my ensemble would
be useless at predicting the -2 or 2 labels, but that's expected.

An analogous example is randomly selecting and training on rows where
the y values are not all represented. This is taken care of for
DecisionTreeClassifiers by the max_features='auto' parameter already,
internally.

Maybe people don't usually use the library in this way so it doesn't come up?

Thanks,
Doug
Lars Buitinck
2012-09-25 18:22:51 UTC
Permalink
Post by Doug Coleman
label. So to merge predictions from the trees, now I have to do
bookkeeping to remember which trees had which labels in them, and it's
a mess.
You did discover the classes_ attribute, didn't you? That keeps track of
the classes found in y by fit and solves part of the bookkeeping
problem.
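
To illustrate the point about classes_, here is a small sketch with made-up toy data (the arrays `X` and `y` are assumptions for illustration only). Column i of predict_proba corresponds to classes_[i], so the attribute tells you which labels a fitted tree knows about.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data in which the rare labels -2 and 2 never occur.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(clf.classes_)                # labels seen during fit, in sorted order
print(clf.predict_proba(X).shape)  # one column per entry of classes_
```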
Post by Doug Coleman
Someone suggested I use sklearn.feature_extraction.DictVectorizer, but
that seems to be to track the X matrix instead of y. What I might end
up doing is unique/sorting the y labels for each tree, calling
predict_proba on each, adding column vectors of zeros to the
predictions, and then merging the results.
No, that's not what DictVectorizer is for. I guess it *could* be used
for tracking labels and probabilities, if you fit it on the trivial
"dataset"

[dict((str(label),0) for label in [-2, -1, 0, 1, 2])]

but then still, you have to convert from integers to strings all the time.
Post by Doug Coleman
What I would prefer to do is call fit with a set of possible labels,
like so: clf.fit(X, y, labels=[-2,-1,0,1,2]) so scikit could do the
bookkeeping for me. Obviously some of the trees in my ensemble would
be useless at predicting the -2 or 2 labels, but that's expected.
That would be nice. I think we actually put that argument on __init__
where appropriate (SGDClassifier) and call it classes, not labels.
Would you perhaps be willing to implement this for decision trees and
submit a pull request?
Post by Doug Coleman
Maybe people don't usually use the library in this way so it doesn't come up?
It only comes up in advanced use cases such as online learning, so
some estimators have this, but others don't.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Doug Coleman
2012-09-25 18:52:03 UTC
Permalink
I'd love to submit a patch.

Looking at SGDClassifier docs, the __init__ doesn't take a classes
parameter, but instead there's a partial_fit() that takes `classes`
exactly like I'd expect. However, the docs for partial_fit() are
exactly the same as for fit().

If you examine the code, fit() "warms up" the optimization with some
additional parameters, then calls _partial_fit(). partial_fit() just
calls _partial_fit() directly. So, it looks like fit() and
partial_fit() could take a `classes` parameter for SGDClassifier,
rather than __init__. It seems a bit confused, actually, since
SGDClassifier's __init__ takes a class_weight dict for doing
cost-sensitive learning but then partial_fit() takes a classes
vector--what if they contradict each other?

It seems like the `class_weight` parameter in __init__ could be either
a vector or a dict, where a vector would treat all weights equally and
the dict would have the weights for cost-sensitive learning. Then,
take the classes parameter out of partial_fit(). If the y vector ever
has a class not in the classes vector and one was supplied in
__init__, then you'd throw an error. Then do this for
DecisionTreeClassifiers.

What do you think?

Doug
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2012-09-25 22:19:00 UTC
Permalink
I think we could have a `classes=None` constructor parameter in
SGDClassifier and possibly many other classifiers. When provided we
would not use the traditional `self.classes_ = np.unique(y)` idiom
already implemented in some classifiers of the project (but not all).

+1 also for raising a ValueError exception when `classes is not None` and
the `y` provided at fit time has some values not in `classes`.
However we need to check with some benchmarks that this integrity
check is not too costly.
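
A minimal sketch of such an integrity check, assuming plain NumPy arrays; `check_labels` is a hypothetical helper, not an existing scikit-learn function. Its cost is dominated by the sort inside np.setdiff1d, which is the kind of overhead the benchmarks would need to measure.

```python
import numpy as np


def check_labels(y, classes):
    """Raise ValueError if y contains a value that is not listed in classes."""
    # Sorted unique values of y that do not appear in classes.
    unknown = np.setdiff1d(y, classes)
    if unknown.size:
        raise ValueError(f"y contains labels not in classes: {unknown.tolist()}")
    return np.asarray(classes)


# All labels are known: passes silently.
check_labels([0, 1, -1, 0], classes=[-2, -1, 0, 1, 2])

# The label 3 is not in classes: the check raises.
try:
    check_labels([0, 3], classes=[-2, -1, 0, 1, 2])
except ValueError as exc:
    print(exc)
```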

This constructor parameter could be overridden by a `fit_param` to
preserve backward compat, especially for classifier models with a
`partial_fit` method.

The expected behavior for a classifier that is passed a non-None
`classes` constructor param would be to never predict a class value
that was absent from `y` at fit time. In the case of the predict_proba
method, the probabilities of classes missing at fit time should be 0.0.

This protocol (including expected exception types and error messages)
should be formalized as a series of common tests in
sklearn/tests/test_common.py and redundant bookkeeping code should be
factored out into the ClassifierMixin class in sklearn/base.py IMHO.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas Mueller
2012-09-26 18:52:51 UTC
Permalink
Post by Olivier Grisel
I think we could have a `classes=None` constructor parameter in
SGDClassifier and possibly many other classifiers. When provided we
would not use the traditional `self.classes_ = np.unique(y)` idiom
already implemented in some classifiers of the project (but not all).
+1 also for raising a ValueError exception when `classes is not None` and
the `y` provided at fit time has some values not in `classes`.
However we need to check with some benchmarks that this integrity
check is not too costly.
This constructor parameter could be overridden by a `fit_param` to
preserve backward compat, especially for classifier models with a
`partial_fit` method.
Could you explain why this is necessary? Why wouldn't the
default value do the same as the current version?

Thanks,
Andy
Mathieu Blondel
2012-09-26 02:33:00 UTC
Permalink
Post by Doug Coleman
If you examine the code, fit() "warms up" the optimization with some
additional parameters, then calls _partial_fit(). partial_fit() just
calls _partial_fit() directly. So, it looks like fit() and
partial_fit() could take a `classes` parameter for SGDClassifier,
rather than __init__. It seems a bit confused, actually, since
SGDClassifier's __init__ takes a class_weight dict for doing
cost-sensitive learning but then partial_fit() takes a classes
vector--what if they contradict each other?
partial_fit should behave exactly like fit if you call it only once. So,
for your use case, I would just use partial_fit with the classes parameter.

# The difference between fit and partial_fit is that fit erases the
previous model on subsequent calls whereas partial_fit starts from the
previous model.
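
A minimal sketch of that suggestion, using random toy data (the arrays and seeds are assumptions for illustration). It relies on the `classes` argument of SGDClassifier.partial_fit, which declares the full label set even when the batch does not contain every label.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = np.array([-2, -1, 0, 1, 2])

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = rng.choice([-1, 0, 1], size=30)  # -2 and 2 never occur in this batch

clf = SGDClassifier(random_state=0)
# Declare the full label set on the first (here: only) call.
clf.partial_fit(X, y, classes=all_classes)

print(clf.classes_)                    # all five labels, including the unseen ones
print(clf.decision_function(X).shape)  # one score column per declared class
```

The output always has five columns, so results from models trained on different batches line up without extra bookkeeping.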

Mathieu
Gilles Louppe
2012-09-26 08:42:15 UTC
Permalink
Hi,

The ensemble classes handle the problem you describe already. Have a look
at the implementation of predict_proba of BaseForestClassifier in
ensemble.py if you want to do that yourself by hand.

Hope this helps.

Gilles
Gilles Louppe
2012-09-26 09:30:46 UTC
Permalink
@Doug: Sorry I was typing my previous response from my phone.

The snippet of code that I was talking about can be found at:
https://github.com/glouppe/scikit-learn/blob/master/sklearn/ensemble/forest.py#L93

Cheers,

Gilles
Doug Coleman
2012-09-26 18:26:21 UTC
Permalink
@Gilles,

Thanks for the link. Those classes basically implement a paper that
has a specific idea of RandomForests™ (no kidding, it's trademarked),
with bootstrapping, oob estimation, and n trees trained on the same
data.

I'm basically looking for something that takes pre-trained classifiers
and allows you to combine the predicted probabilities in custom ways,
like favoring some classifiers over others, etc.
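
One way such a combination could look, as a rough sketch in plain NumPy. `combine_proba` is a hypothetical helper, and it assumes every classifier's probability matrix has already been brought to the same five columns in the same label order.

```python
import numpy as np


def combine_proba(probas, weights):
    """Weighted average of per-classifier probability matrices.

    Every matrix must have the same shape and the same column order.
    """
    stacked = np.stack(probas)  # shape: (n_classifiers, n_samples, n_labels)
    return np.average(stacked, axis=0, weights=weights)


proba_a = np.array([[0.0, 0.2, 0.5, 0.3, 0.0]])
proba_b = np.array([[0.1, 0.1, 0.4, 0.3, 0.1]])

# Trust classifier B three times as much as classifier A.
combined = combine_proba([proba_a, proba_b], weights=[1, 3])
print(combined)
```

Because np.average normalizes by the sum of the weights, each combined row still sums to 1 when the input rows do.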

Not that RandomForests™ are not useful--they could be the building
block classifiers in such a system.

@Olivier's writeup would exactly solve my problem.

Cheers,
Doug
Gilles Louppe
2012-09-26 21:59:21 UTC
Permalink
Post by Doug Coleman
I'm basically looking for something that takes pre-trained classifiers
and allows you to combine the predicted probabilities in custom ways,
like favoring some classifiers over others, etc.
Not that RandomForests™ are not useful--they could be the building
block classifiers in such a system.
@Olivier's writeup would exactly solve my problem.
The code I pointed to also handles the situation you describe. The trees
in the forest can have different numbers of classes (because of
bootstrapping) and that snippet of code remaps them correctly. This
might help you to write your own system.

Gilles

Gael Varoquaux
2012-09-25 20:57:35 UTC
Permalink
Post by Doug Coleman
I'm making an ensemble of trees by hand for classification and trying
to merge their outputs with predict_proba. My labels are integers
-2..2. The problem is that -2 and 2 are rare labels. Now assume that I
train these trees with different but related data sets, some of which
don't even contain -2 or 2. The shape of predict_proba varies based on
number of unique labels in the input y, so instead of always getting 5
columns in predict_proba, I only get columns wherever there was a
label.
I hate to say it, but you are starting in a really difficult position for
learning. So far we do not have tools to work with very sparse output
classes. I think that such situations take a lot of care to get good
results.

For this reason, my own personal opinion is that I wouldn't favor having
a 'quick fix' landing in the scikit that wouldn't solve the core
statistical problems. I understand that the bookkeeping is tedious, but my
gut feeling is that solving it will just make other problems appear.

By the way, have you considered making 'stratified', or balanced
bootstraps, in which you would keep the class ratios constant? This would
help with bookkeeping, but might also help with the statistical learning
problem.
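
A stratified bootstrap of this kind could be sketched as follows. This is only one possible reading of the suggestion: `stratified_bootstrap` is a hypothetical helper that resamples with replacement inside each class, so every class keeps its original count.

```python
import numpy as np


def stratified_bootstrap(y, rng):
    """Return row indices resampled with replacement within each class.

    Every class keeps exactly its original count, so no label can
    vanish from a bootstrap sample.
    """
    y = np.asarray(y)
    parts = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)  # row indices belonging to this class
        parts.append(rng.choice(members, size=members.size, replace=True))
    return np.concatenate(parts)


y = np.array([-2, -1, -1, 0, 0, 0, 1, 1, 2])
idx = stratified_bootstrap(y, np.random.default_rng(0))

# Same labels and same per-class counts as the original y.
print(np.unique(y[idx], return_counts=True))
```

Since every bootstrap sample contains all five labels, each tree trained on one would produce the same predict_proba columns.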

Thanks for offering a patch, though, it is much appreciated,

Gaël
Doug Coleman
2012-09-25 21:43:02 UTC
Permalink
I'm not necessarily looking for a quick fix here, and anything I would
consider trying to contribute to scikit would be useful and correct.

You're right that there's not a good chance it can learn to predict
with sparse output classes, but if the problem were easy, then I
wouldn't need scikit at all. I just wanted to try out an idea and the
API is kind of getting in the way. If the output labels were not
collected out of the y vector but instead provided as a parameter to
tell the classifier what I'm looking for independently, as
SGDClassifier supports, then that would solve the problem.

Maybe the right thing to do is open up an issue about the discrepancy
in the API on github and either hope someone else wants to fix it or
submit patches myself eventually.

Just out of curiosity, what problems do you think could arise from
this other than ultimately the machine learning effort fails because
of sparsity?

Thanks,
Doug



