Discussion:
[Scikit-learn-general] Classifier with binary features? (was: CountVectorizer followed by Binarizer doesn't work)
Lars Buitinck
2011-06-02 12:49:37 UTC
Permalink
Also, if you are going to create a special class for BernoulliNB, I would
do the binarization directly in that class.
Would that be the "scikitic" way of implementing a classifier that
wants binary features? I'm asking since there are several other
classifiers that want booleans (e.g. multinomial logit, aka MaxEnt),
none of which seem to be represented in scikit-learn.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Olivier Grisel
2011-06-02 13:40:49 UTC
Permalink
Post by Lars Buitinck
Also, if you are going to create a special class for BernoulliNB, I would
do the binarization directly in that class.
Would that be the "scikitic" way of implementing a classifier that
wants binary features? I'm asking since there are several other
classifiers that want booleans (e.g. multinomial logit, aka MaxEnt),
none of which seem to be represented in scikit-learn.
I could extract the content of the transform method of the Binarizer
into a reusable utility function (while keeping the Binarizer class
for custom-built pipelines).

WDYT?
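
Something like this, perhaps (a minimal sketch for dense input only;
the name and default threshold are just a proposal, and a real version
would also need a sparse-matrix code path like Binarizer.transform has):

import numpy as np

def binarize(X, threshold=0.0):
    # Map every value strictly above the threshold to 1.0 and
    # everything else to 0.0, as Binarizer.transform does.
    X = np.asarray(X, dtype=np.float64)
    return (X > threshold).astype(np.float64)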

BTW, I just did it for the normalize method in:

https://github.com/scikit-learn/scikit-learn/pull/193

This should be useful to turn a linear kernel into a full-fledged
cosine similarity.
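
For example (a pure numpy sketch of the idea, not the pull request's
actual code):

import numpy as np

def cosine_kernel(X, Y):
    # Row-normalize both matrices to unit L2 norm; the linear kernel
    # (plain dot product) of the normalized rows is then exactly the
    # cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return np.dot(Xn, Yn.T)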
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lars Buitinck
2011-06-02 14:07:23 UTC
Permalink
Post by Olivier Grisel
I could extract the content of the transform method of the Binarizer
into a reusable utility function (while keeping the Binarizer class
for custom-built pipelines).
WDYT?
 https://github.com/scikit-learn/scikit-learn/pull/193
This should be useful to turn a linear kernel into a full-fledged
cosine similarity.
Let me get this straight:

def binarize(X):
    assert X.shape[0] == 1
    return normalize(X, norm='l1')

?
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Olivier Grisel
2011-06-02 14:13:00 UTC
Permalink
Post by Olivier Grisel
I could extract the content of the transform method of the Binarizer
into a reusable utility function (while keeping the Binarizer class
for custom-built pipelines).
WDYT?
 https://github.com/scikit-learn/scikit-learn/pull/193
This should be useful to turn a linear kernel into a full fledged
cosine similarity.
Post by Lars Buitinck
def binarize(X):
    assert X.shape[0] == 1
    return normalize(X, norm='l1')
?
Hmm, I don't understand this. I was talking about extracting a
`binarize` function with the content of `Binarizer.transform`.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lars Buitinck
2011-06-02 14:18:56 UTC
Permalink
Post by Olivier Grisel
def binarize(X):
    assert X.shape[0] == 1
    return normalize(X, norm='l1')
?
Hmm, I don't understand this. I was talking about extracting a
`binarize` function with the content of `Binarizer.transform`.
Excuse me, a misunderstanding on my part. It might be useful to have a
binarize function in this style as well; however, if I'm going to
binarize in the class, then I might just as well add a threshold
parameter to __init__ and store a Binarizer that both fit and predict
can use.
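
Roughly like this, perhaps (a structural sketch only; the estimation
code is elided, and the import path is the current scikit-learn one):

from sklearn.preprocessing import Binarizer

class BernoulliNB(object):
    def __init__(self, threshold=0.0):
        # One Binarizer shared by fit and predict, so both see the
        # input binarized with the same threshold.
        self.threshold = threshold
        self._binarizer = Binarizer(threshold=threshold)

    def fit(self, X, y):
        X = self._binarizer.transform(X)
        # ... estimate per-class Bernoulli parameters from binary X
        return self

    def predict(self, X):
        X = self._binarizer.transform(X)
        # ... score each class with the parameters fitted above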
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Mathieu Blondel
2011-06-02 16:26:34 UTC
Permalink
Post by Lars Buitinck
Excuse me, a misunderstanding on my part. It might be useful to have a
binarize function in this style as well; however, if I'm going to
binarize in the class, then I might just as well add a threshold
parameter to __init__ and store a Binarizer that both fit and predict
can use.
In the docstring, you wrote that you don't actually check whether X
contains binary features or not. If that's so, I don't see the point
of having a BernoulliNB class, since the user can do:

X = Binarizer().transform(X)
clf = MultinomialNB()
clf.fit(X, y)

I guess I would rather not have BernoulliNB at all and add to the
docstring that if X contains binary features, it's a Bernoulli naive
Bayes. The information retrieval book has a nice summary table (13.3):
http://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html.

Concerning Maxent / Multinomial LR, real features CAN be used. It's
just that NLP people use binary features a lot.

Mathieu
Lars Buitinck
2011-06-03 11:01:59 UTC
Permalink
Post by Mathieu Blondel
Post by Lars Buitinck
Excuse me, misunderstanding on my part. It might be useful to have a
binarize in this style as well, however, if I'm going to binarize in
the class, then I might just as well add a threshold parameter to
__init__ and store a Binarizer that both fit and predict can use.
In the docstring, you wrote that you don't actually check whether X
contains binary features or not. If that's so, I don't see the point
X = Binarizer().transform(X)
clf = MultinomialNB()
clf.fit(X, y)
Not exactly; Bernoulli NB penalizes features with value 0. I will add
binarizing to the class.
Post by Mathieu Blondel
I guess I would rather not have BernoulliNB at all and add to the
docstring that if X contains binary features, it's a Bernoulli naive
Bayes.
http://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html.
I based the implementation on (the paper version of) that book. To
quote the section right before that one: "The models also differ in how
nonoccurring terms are used in classification. They do not affect the
classification decision in the multinomial model; but in the Bernoulli
model the probability of nonoccurrence is factored in when computing
P(c|d)."

(http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html)
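
Concretely, the difference amounts to something like this (a sketch;
the parameter layout is an assumption, not the actual patch):

import numpy as np

def bernoulli_joint_log_likelihood(x, class_log_prior, log_p, log_not_p):
    # x: binary feature vector of shape (n_features,)
    # log_p[c, i] = log P(feature i present | class c)
    # log_not_p[c, i] = log P(feature i absent | class c)
    # Absent features (x[i] == 0) still contribute through log_not_p;
    # the multinomial model would simply ignore them.
    return class_log_prior + np.dot(log_p, x) + np.dot(log_not_p, 1 - x)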
Post by Mathieu Blondel
Concerning Maxent / Multinomial LR, real features CAN be used. It's
just that NLP people use binary features a lot.
From what I understood of the MaxEnt literature, real-valued features
should only be used with binning or with custom algorithms, some of
which are patented.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Mathieu Blondel
2011-06-03 11:21:29 UTC
Permalink
Post by Lars Buitinck
Not exactly; Bernoulli NB penalizes features with value 0. I will add
binarizing to the class.
Ok. I'm +1 for adding a threshold parameter (defaulting to 1) and
saving the binarizer object as you suggested before.
Post by Lars Buitinck
Post by Mathieu Blondel
I guess I would rather not have BernoulliNB at all and add to the
docstring that if X contains binary features, it's a Bernoulli naive
Bayes.
http://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html.
I based the implementation on (the paper version of) that book. To
quote the section right before that one: "The models also differ in how
nonoccurring terms are used in classification. They do not affect the
classification decision in the multinomial model; but in the Bernoulli
model the probability of nonoccurrence is factored in when computing
P(c|d)."
Thanks for the clarification.
Post by Lars Buitinck
Post by Mathieu Blondel
Concerning Maxent / Multinomial LR, real features CAN be used. It's
just that NLP people use binary features a lot.
From what I understood of the MaxEnt literature, real-valued features
should only be used with binning or with custom algorithms, some of
which are patented.
I may be wrong but I think that nothing prevents MaxEnt from being
used with real features.

One motivation for using binning is to add non-linear features, as
explained by Alex Passos and Yoshua Bengio here:
http://metaoptimize.com/qa/questions/5621/what-is-the-advantage-of-creating-quantiles-in-datasets
and here:
http://metaoptimize.com/qa/questions/1927/real-valued-features-in-crfs

Mathieu
Olivier Grisel
2011-06-03 12:16:38 UTC
Permalink
Post by Mathieu Blondel
One motivation for using binning is to add non-linear features as
explained by Alex Passos and Yoshua Bengio here
http://metaoptimize.com/qa/questions/5621/what-is-the-advantage-of-creating-quantiles-in-datasets
and here http://metaoptimize.com/qa/questions/1927/real-valued-features-in-crfs
This discussion is interesting: we could add a new
BinningTransformer (maybe as a complement to the MidrangeScaler
discussed on the preprocessing-simplification pull request) so as to
make linear models able to capture non-linear structure in the data.
That should be easy to implement and could potentially make all the
linear models more expressive for a small computational overhead.
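
Something along these lines, maybe (a sketch; the class name is only
the proposal above, and quantile-based edges are just one possible
choice):

import numpy as np

class BinningTransformer(object):
    def __init__(self, n_bins=5):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        # Learn per-feature bin edges from the training data quantiles.
        q = np.linspace(0, 100, self.n_bins + 1)[1:-1]
        self.bin_edges_ = np.percentile(X, q, axis=0)
        return self

    def transform(self, X):
        # Replace each value by the index of the bin it falls into.
        X = np.asarray(X)
        out = np.empty(X.shape, dtype=np.intp)
        for j in range(X.shape[1]):
            out[:, j] = np.digitize(X[:, j], self.bin_edges_[:, j])
        return out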
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-06-03 12:35:01 UTC
Permalink
Post by Olivier Grisel
This discussion is interesting: we could add a new
BinningTransformer (maybe as a complement to the MidrangeScaler
discussed on the preprocessing-simplification pull request) so as to
make linear models able to capture non-linear structure in the data.
That should be easy to implement and could potentially make all the
linear models more expressive for a small computational overhead.
Sounds like a good idea (plus an option for specifying the range of
features on which we want to apply the binning). This transformer could
potentially be used to transform categorical features to binary
features too (for example, a 5-category variable needs to be mapped to
5 binary features).
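
For example (a bare numpy illustration of that mapping):

import numpy as np

# A 5-category variable, one integer code per sample.
categories = np.array([0, 3, 1, 4, 3])

# Map it to 5 binary indicator features, one column per category.
indicators = np.zeros((len(categories), 5))
indicators[np.arange(len(categories)), categories] = 1.0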

Mathieu
Olivier Grisel
2011-06-03 12:37:46 UTC
Permalink
Post by Mathieu Blondel
Post by Olivier Grisel
This discussion is interesting: we could add a new
BinningTransformer (maybe as a complement to the MidrangeScaler
discussed on the preprocessing-simplification pull request) so as to
make linear models able to capture non-linear structure in the data.
That should be easy to implement and could potentially make all the
linear models more expressive for a small computational overhead.
Sounds like a good idea (plus an option for specifying the range of
features on which we want to apply the binning). This transformer could
potentially be used to transform categorical features to binary
features too (for example, a 5-category variable needs to be mapped to
5 binary features).
I would prefer to stick to continuous (float) binned features rather
than binary features by default, as Y. Bengio explains (so as not to
lose information, just to increase the expressive power of the
downstream linear models). But all of these behaviors can of course be
made configurable through hyper-parameters.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-06-03 12:49:31 UTC
Permalink
Post by Olivier Grisel
I would prefer to stick to continuous (float) binned features rather
than binary features by default, as Y. Bengio explains (so as not to
lose information, just to increase the expressive power of the
downstream linear models). But all of these behaviors can of course be
made configurable through hyper-parameters.
Agreed. But note that the goal is a bit different here: categorical
features cannot be used as is.

Mathieu
Olivier Grisel
2011-06-03 12:55:16 UTC
Permalink
Post by Mathieu Blondel
Post by Olivier Grisel
I would prefer to stick to continuous (float) binned features rather
than binary features by default, as Y. Bengio explains (so as not to
lose information, just to increase the expressive power of the
downstream linear models). But all of these behaviors can of course be
made configurable through hyper-parameters.
Agreed. But note that the goal is a bit different here: categorical
features cannot be used as is.
Indeed. Anyway, this gives me even more incentive to finish the work
I started in:

https://github.com/scikit-learn/scikit-learn/pull/193

I will keep you posted when this is ready for merge so that we can
start the implementation discussion / proposal for binning.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-06-03 13:01:18 UTC
Permalink
Post by Olivier Grisel
Indeed. Anyway, this gives me even more incentive to finish the work
 https://github.com/scikit-learn/scikit-learn/pull/193
Merging this one quickly will make Lars's life easier for his
BernoulliNB: he needs a binarizer that works with dense and
sparse matrices out of the box.

Mathieu
Gael Varoquaux
2011-06-06 10:52:45 UTC
Permalink
Post by Mathieu Blondel
Post by Olivier Grisel
This discussion is interesting: we could add a new
BinningTransformer (maybe as a complement to the MidrangeScaler
discussed on the preprocessing-simplification pull request) so as to
make linear models able to capture non-linear structure in the data.
That should be easy to implement and could potentially make all the
linear models more expressive for a small computational overhead.
Sounds like an idea (+an option for specifying the range of features
on which we want to apply the binning). This transformer could
potentially be used to transform categorical features to binary
features too (for example a 5-category variable needs to be mapped to
5 binary features).
By the way, this might be off-topic, as this thread is discussing
problems I am not used to, and I read it a bit quickly. However, I
recently wrote some code to choose bin edges (in other words,
thresholds) to discretize univariate data, trying to get
equal-population bins. This can be useful if you want to convert a
continuous distribution to a set of states. It can actually get a bit
tricky when you have a mixture of scattered data and a few
macroscopically-occupied states.

I uploaded the code on:
https://gist.github.com/1010064
It has no tests :(.
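
The basic idea, leaving aside the macroscopically-occupied states, is
something like this (a minimal sketch, not the gist's actual code):

import numpy as np

def equal_population_edges(x, n_bins):
    # Put the thresholds at the empirical quantiles, so that each bin
    # receives roughly the same number of samples. Heavy ties (many
    # samples sharing a single value) are what make the real problem
    # trickier than this.
    q = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(x, q)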

It is only for univariate data. For multivariate data, one would need
to use the tree built by a KD-tree or a ball tree. However, dealing
with macroscopically-occupied states would get a bit trickier.

If it's of any use to other people, grab it. If it's of general use,
we should write tests and integrate it into the scikit.

Gael
Olivier Grisel
2011-06-06 12:39:42 UTC
Permalink
Post by Gael Varoquaux
Post by Mathieu Blondel
Post by Olivier Grisel
This is discussion is interesting: we could add a new
BinningTransformer (maybe as a a complement to the MidrangeScaler
discussed on the preprocessing-simplification pull request) so as to
make linear models able to capture non linear features in the data.
That should be easy to implement and could potentially make all the
linear model more expressive for a small computational overhead.
Sounds like an idea (+an option for specifying the range of features
on which we want to apply the binning). This transformer could
potentially be used to transform categorical features to binary
features too (for example a 5-category variable needs to be mapped to
5 binary features).
By the way, this might be off-topic, as this thread is discussing
problems I am not used to, and I read it a bit quickly. However, I
recently wrote some code to choose bin edges (in other words,
thresholds) to discretize univariate data, trying to get
equal-population bins.
I had the same idea after writing my last reply to that thread, but
was too lazy to send a new email. I am glad you already have some
working code for this.
Post by Gael Varoquaux
This can
be useful if you want to convert a continuous distribution to a set of
states. It can actually get a bit tricky when you have a mixture of
scattered data and a few macroscopically-occupied states.
https://gist.github.com/1010064
It has no tests :(.
It is only for univariate data. For multivariate data, one would need
to use the tree built by a KD-tree or a ball tree. However, dealing
with macroscopically-occupied states would get a bit trickier.
I think we can leave multivariate binning for later (we could always
use univariate binning after a PCA, a KMeans transform, NMF, or any
other kind of prototype extraction transformer to approximate this).
Post by Gael Varoquaux
If it's of any use to other people, grab it. If it's of general use,
we should write tests and integrate it into the scikit.
Yes.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel