Doug Coleman
2012-09-25 17:31:10 UTC
Hi,
I'm making an ensemble of trees by hand for classification and trying
to merge their outputs with predict_proba. My labels are integers
-2..2. The problem is that -2 and 2 are rare labels. Now assume that I
train these trees with different but related data sets, some of which
don't even contain -2 or 2. The shape of predict_proba varies with the
number of unique labels in the input y, so instead of always getting 5
columns from predict_proba, I only get a column for each label that was
actually present. So to merge predictions from the trees, I now have to do
bookkeeping to remember which trees had which labels in them, and it's
a mess.
Someone suggested I use sklearn.feature_extraction.DictVectorizer, but
that seems to be for encoding the X matrix rather than y. What I might end
up doing is unique/sorting the y labels for each tree, calling
predict_proba on each, adding column vectors of zeros to the
predictions, and then merging the results.
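That zero-padding workaround, as a minimal sketch (pad_proba is just an
illustrative helper, not a scikit-learn function; it assumes I know the full
label set up front and that each tree's predict_proba columns are ordered
like its classes_):

```python
import numpy as np

# Hypothetical helper (not part of scikit-learn): expand one tree's
# predict_proba output to one column per label in the full label set.
def pad_proba(proba, tree_labels, all_labels):
    # proba: (n_samples, len(tree_labels)), columns ordered like clf.classes_
    full = np.zeros((proba.shape[0], len(all_labels)))
    col = {label: j for j, label in enumerate(all_labels)}
    for i, label in enumerate(tree_labels):
        full[:, col[label]] = proba[:, i]
    return full

all_labels = [-2, -1, 0, 1, 2]
tree_labels = [-1, 0, 1]             # what clf.classes_ reports for one tree
proba = np.array([[0.2, 0.5, 0.3]])  # that tree's predict_proba, one sample
padded = pad_proba(proba, tree_labels, all_labels)
# padded: [[0.0, 0.2, 0.5, 0.3, 0.0]] -- now every tree's output lines up
```

With every tree padded to the same five columns, merging is just an
elementwise sum or mean over the padded arrays.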
What I would prefer to do is call fit with the set of possible labels,
like so: clf.fit(X, y, labels=[-2,-1,0,1,2]), so scikit could do the
bookkeeping for me. Obviously some of the trees in my ensemble would
be useless at predicting the -2 or 2 labels, but that's expected.
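For illustration, the kind of bookkeeping I'd like fit to do can be sketched
as a wrapper (FixedLabelWrapper is hypothetical, not scikit-learn API; the
stub below stands in for a fitted DecisionTreeClassifier so the sketch is
self-contained):

```python
import numpy as np

class FixedLabelWrapper:
    """Hypothetical wrapper (not scikit-learn API): fix the label set at
    construction so predict_proba always returns one column per label."""

    def __init__(self, clf, labels):
        self.clf = clf
        self.labels = list(labels)

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self

    def predict_proba(self, X):
        # Pad the wrapped classifier's output out to the full label set.
        proba = self.clf.predict_proba(X)
        full = np.zeros((proba.shape[0], len(self.labels)))
        col = {label: j for j, label in enumerate(self.labels)}
        for i, label in enumerate(self.clf.classes_):
            full[:, col[label]] = proba[:, i]
        return full

class _StubTree:
    """Stand-in for a fitted tree that only ever saw labels -1, 0, 1."""
    classes_ = np.array([-1, 0, 1])
    def fit(self, X, y):
        return self
    def predict_proba(self, X):
        return np.tile([0.2, 0.5, 0.3], (len(X), 1))

wrapped = FixedLabelWrapper(_StubTree(), labels=[-2, -1, 0, 1, 2])
wrapped.fit(None, None)
p = wrapped.predict_proba([[0.0]])
# p: [[0.0, 0.2, 0.5, 0.3, 0.0]] -- five columns regardless of the tree
```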
An analogous example is randomly selecting and training on rows where
not all of the y values are represented. DecisionTreeClassifier already
takes care of the equivalent bookkeeping for features internally via the
max_features='auto' parameter.
Maybe people don't usually use the library in this way, so it doesn't come up?
Thanks,
Doug