Discussion: [Scikit-learn-general] feature extraction for text
Mathieu Blondel
2010-09-14 14:29:32 UTC
Hello,

I've been looking into the feature extraction code for text today and
I really like the split between the analyzer and the vectorizer. Nice!

I'm not familiar yet with hash representations so I have some questions:
- I see you can fix the number of dimensions; is it somehow related
to dimensionality reduction?
- Will the vector representation be the same if I vectorize my
training documents and, later, vectorize new test documents?
- Is it possible to get word counts only? This is often necessary for
models based on the multinomial distribution.

I thought it would be nice to have plain (non-hashing) vectorizers so
I created a draft here:

http://github.com/mblondel/scikit-learn/commits/textextract

Some remarks:
- I tried to use the same API as the HashingVectorizer.
- You can get vectors as word counts, term-frequencies (normalized
counts) or tf-idf frequencies.
- You can pass a vocabulary dictionary as parameter.
- The constructor doesn't have a use_idf option because the
computations are not made in vectorize(). Is it possible to do the
same in HashingVectorizer? This would allow us to get word counts, as
well as get rid of use_idf.

If this looks OK, I will create a SparseVectorizer object.

In addition, I've made a second commit to add a "filters" argument to
the WordNGramAnalyzer. This gives the user some freedom in the
preprocessing. I left stop word removal as is because it requires
tokenization, but it could be made a filter as well. If you like the
idea, I will do the same for CharNGramAnalyzer.

Another remark: just like in fit(), I think we should return self at
the end of vectorize() so we can chain calls to get_tfidf() or other
methods.

As it is a very common preprocessing step in NLP, I'm planning to add
some frequency-based transformers. For example, one can remove
dimensions corresponding to tokens that appear less/more than n times
in the whole corpus, or that appear in less/more than n% of the
documents, but this assumes that the matrix contains word counts.

Mathieu
Olivier Grisel
2010-09-14 14:43:39 UTC
Great to have some feedback on this. I will try to take some time
this evening (Paris time, i.e. in a couple of hours) to read your mail
carefully and give a complete reply.

Please know that I don't like the current API (esp. the fact that the
IDF estimate implies an implicit stateful behaviour in the hashing
vectorizer). I had a plan to rewrite them using the fit / transform
API to make the statefulness explicit in the fit model but did not
have the time to do so yet. The goal is to be able to have text
feature extractors and vectorizers pipelinable using the new pipeline
API, to be able to cross-validate both the feature extraction and the
classifier hyperparameters at the same time.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-14 14:55:40 UTC
On Tue, Sep 14, 2010 at 11:43 PM, Olivier Grisel
Post by Olivier Grisel
Please know that I don't like the current API (esp. the fact that the
IDF estimate implies an implicit stateful behaviour in the hashing
vectorizer). I had a plan to rewrite them using the fit / transform
API to make the statefulness explicit in the fit model but did not
have the time to do so yet. The goal is to be able to have text
feature extractors and vectorizers pipelinable using the new pipeline
API, to be able to cross-validate both the feature extraction and the
classifier hyperparameters at the same time.
I thought of exactly the same while making the vectorizer. So what is
your plan for the new API? In particular, how will the user be able to
choose between the various representations (word counts, tf, tf-idf)?

Mathieu
Olivier Grisel
2010-09-14 15:06:16 UTC
Post by Mathieu Blondel
On Tue, Sep 14, 2010 at 11:43 PM, Olivier Grisel
Post by Olivier Grisel
Please know that I don't like the current API (esp. the fact that the
IDF estimate implies an implicit stateful behaviour in the hashing
vectorizer). I had a plan to rewrite them using the fit / transform
API to make the statefulness explicit in the fit model but did not
have the time to do so yet. The goal is to be able to have text
feature extractors and vectorizers pipelinable using the new pipeline
API, to be able to cross-validate both the feature extraction and the
classifier hyperparameters at the same time.
I thought of exactly the same while making the vectorizer. So what is
your plan for the new API? In particular, how will the user be able to
choose between the various representations (word counts, tf, tf-idf)?
I need to review your code first, but we could either have:

- Analyzer implementations with a transform method that takes a
sequence of unicode or file-like instances as input and returns a
sequence of lists of tokens as output

- TermCountDictionary[Sparse|Dense]Vectorizer with fit + transform method
- TermFreqDictionary[Sparse|Dense]Vectorizer with fit + transform method
- TfIdfDictionary[Sparse|Dense]Vectorizer with fit + transform method

- TermCountHashing[Sparse|Dense]Vectorizer with transform method only
- TermFreqHashing[Sparse|Dense]Vectorizer with transform method only
- TfIdfHashing[Sparse|Dense]Vectorizer with fit + transform method

The fit method can be used, for instance, to build the dictionary of
the 100,000 most frequent features for the Dictionary vectorizers, and
the IDF estimate for the TfIdf vectorizers.
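
For instance, a minimal sketch of what such a fit could do for a
dictionary vectorizer (illustration only, the names are made up):

from collections import defaultdict

def fit_vocabulary(token_docs, n_features=100000):
    # Count token occurrences over the whole corpus and keep the
    # n_features most frequent tokens as the vocabulary.
    counts = defaultdict(int)
    for tokens in token_docs:
        for token in tokens:
            counts[token] += 1
    most_frequent = sorted(counts, key=counts.get, reverse=True)[:n_features]
    return dict((token, index) for index, token in enumerate(most_frequent))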

Alternatively we could have the TermCount / TermFreq / TfIdf variants
as init parameters of the Dictionary[Sparse|Dense]Vectorizer and
Hashing[Sparse|Dense]Vectorizer classes. That would make it easier to
select the best representation by grid search + cross-validation of
that param.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-14 16:23:42 UTC
On Wed, Sep 15, 2010 at 12:06 AM, Olivier Grisel
Post by Olivier Grisel
Alternatively we could have the TermCount / TermFreq / TfIdf variants
as init parameters of the Dictionary[Sparse|Dense]Vectorizer and
Hashing[Sparse|Dense]Vectorizer classes. That would make it easier to
select the best representation by grid search + cross-validation of
that param.
I tend to prefer this one: shorter class names (the ones above are
really long!), grid search and no need for inheritance.

I want to be able to call fit, transform my training data, save the
vectorizer with pickle, and later transform test data or new inputs.
This means that the vocabulary parameter in init is not needed
anymore. In the case of the dictionary-based vectorizer, I think that
fit() needs to learn the vocabulary and the idf weights. [Say, you
want to transform a single new document after fit. It's not possible
to compute idf weights so you need to use those learned during fit()]

However, the term frequencies must be computed during transform().
This means that the vectorizer does not need to store any matrix, only
a vocabulary dictionary and an idf-vector, which is nice for saving
it.

The only thing is that it may require two passes on the data: one to
extract the vocabulary dictionary in fit() and one in transform() to
get the term-frequencies. I hope the same holds in the hashing case.

Mathieu
Olivier Grisel
2010-09-14 16:40:42 UTC
Post by Mathieu Blondel
On Wed, Sep 15, 2010 at 12:06 AM, Olivier Grisel
Post by Olivier Grisel
Alternatively we could have the TermCount / TermFreq / TfIdf variants
as init parameters of the Dictionary[Sparse|Dense]Vectorizer and
Hashing[Sparse|Dense]Vectorizer classes. That would make it easier to
select the best representation by grid search + cross-validation of
that param.
I tend to prefer this one: shorter class names (the ones above are
really long!), grid search and no need for inheritance.
I want to be able to call fit, transform my training data, save the
vectorizer with pickle, and later transform test data or new inputs.
This means that the vocabulary parameter in init is not needed
anymore. In the case of the dictionary-based vectorizer, I think that
fit() needs to learn the vocabulary and the idf weights. [Say, you
want to transform a single new document after fit. It's not possible
to compute idf weights so you need to use those learned during fit()]
 However, the term frequencies must be computed during transform().
This means that the vectorizer does not need to store any matrix, only
a vocabulary dictionary and an idf-vector, which is nice for saving
it.
I think we globally agree :)
Post by Mathieu Blondel
The only thing is that it may require two passes on the data: one to
extract the vocabulary dictionary in fit() and one in transform() to
get the term-frequencies.
For the dictionary vectorizer, you only learn the dictionary in the
fit method: a python dict with tokens (words, word n-grams, char
n-grams) as keys and an arbitrary int index as values.

The transform method will then reuse this dictionary to turn inputs such as:

[
    ['this', 'is', 'document', 'number', 'one'],
    ['this', 'is', 'document', 'number', 'two'],
]

into a set of vectors (sparse or dense) such as:

np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1],
])

Assuming that the vocabulary dict learned by the fit method is:

dictionary = {
    'this': 0,
    'is': 1,
    'document': 2,
    'number': 3,
    'one': 4,
    'two': 5,
}
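
To make the mapping concrete, here is a minimal sketch of what such a
transform could do (illustration only, not the code in the branch):

import numpy as np

def transform(token_docs, dictionary):
    # Turn tokenized documents into a dense term-count matrix using
    # the token -> column index mapping learned during fit.
    X = np.zeros((len(token_docs), len(dictionary)))
    for row, tokens in enumerate(token_docs):
        for token in tokens:
            column = dictionary.get(token)
            if column is not None:  # tokens unseen during fit are dropped
                X[row, column] += 1
    return X
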
Post by Mathieu Blondel
I hope this translates the same in the hashing case.
For the hashing case, the fit method is only useful to learn the IDF
estimate, as there is no dictionary. The mapping from feature name to
feature int index is defined implicitly by the hashing function.
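
In other words, something along these lines (a sketch; Python's
builtin hash() stands in for whatever hash function the real
vectorizer uses):

import numpy as np

def hashing_transform(token_docs, n_features=2 ** 16):
    # Stateless vectorization: the column index of a token is derived
    # from the token itself, so no fitted dictionary is required.
    X = np.zeros((len(token_docs), n_features))
    for row, tokens in enumerate(token_docs):
        for token in tokens:
            X[row, hash(token) % n_features] += 1
    return X
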
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2010-09-14 16:53:50 UTC
Post by Mathieu Blondel
The only thing is that it may require two passes on the data: one to
extract the vocabulary dictionary in fit() and one in transform() to
get the term-frequencies.
Hum. I re-read your mail and I think I misunderstood in my first
reply. Yes we will need 2 passes: one for fit and one for transform,
but hopefully the pipeline code will not re-run the transform method
of the upstream analyzer twice. I need to check that. Otherwise we can
still use joblib to explicitly memoize intermediate results from the
upstream analyzer.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-09-14 16:56:45 UTC
Post by Olivier Grisel
Hum. I re-read your mail and I think I misunderstood in my first
reply. Yes we will need 2 passes: one for fit and one for transform,
but hopefully the pipeline code will not re-run the transform method
of the upstream analyzer twice. I need to check that. Otherwise we can
still use joblib to explicitly memoize intermediate results from the
upstream analyzer.
It helps joblib if the important data structures are stored as numpy
arrays (I say that because I think I saw large dictionaries
mentioned). I think it is possible to extend the fast code to anything
implementing the buffer protocol.

Gaël
Olivier Grisel
2010-09-14 17:01:00 UTC
Post by Gael Varoquaux
Post by Olivier Grisel
Hum. I re-read your mail and I think I misunderstood in my first
reply. Yes we will need 2 passes: one for fit and one for transform,
but hopefully the pipeline code will not re-run the transform method
of the upstream analyzer twice. I need to check that. Otherwise we can
still use joblib to explicitly memoize intermediate results from the
upstream analyzer.
It helps joblib if the important data structures are stored as numpy
arrays (I say that because I think I saw large dictionaries
mentioned). I think it is possible to extend the fast code to anything
implementing the buffer protocol.
In this case the intermediate data is not the dictionary but the list
of lists of unicode objects. We could use an np.array of lists of
unicode objects with variable sizes, but that probably won't help
much. Anyway, I am not even sure we need joblib memoization at all
yet. Need to have a look first :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-14 17:03:29 UTC
On Wed, Sep 15, 2010 at 1:53 AM, Olivier Grisel
Post by Olivier Grisel
Hum. I re-read your mail and I think I misunderstood in my first
reply. Yes we will need 2 passes: one for fit and one for transform,
but hopefully the pipeline code will not re-run the transform method
of the upstream analyzer twice. I need to check that. Otherwise we can
still use joblib to explicitly memoize intermediate results from the
upstream analyzer.
Yes, your example seems to confirm that we need two passes on the data
(in the training case). And I agree that if fit and transform take a
list of tokens as input, the analyzer won't need to be run twice.

Anyway, that seems like a cool new API.

Mathieu
Olivier Grisel
2010-09-14 17:22:02 UTC
Post by Mathieu Blondel
Yes, your example seems to confirm that we need two passes on the data
(in the training case). And I agree that if fit and transform take a
list of tokens as input, the analyzer won't need to be run twice.
Anyway, that seems like a cool new API.
The only problem I see with this API is that the list of text tokens
for the complete corpus now has to live in memory before being
transformed into freq vectors, which should be smaller. This might not
work on your personal laptop for wikipedia-sized corpora. But then I
am not sure we have any classifier able to deal reasonably well with
n_samples ~= 3e6 and n_features ~= 1e5 even with a density level of
1%. Such a freq-vectors dataset would weigh approximately 30GB
assuming double precision values + non-zero indices (using compressed
sparse rows for instance).

To treat such cases we will need online transformers and classifiers
first. I think we can keep such use cases for later.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-14 17:35:28 UTC
On Wed, Sep 15, 2010 at 2:22 AM, Olivier Grisel
Post by Olivier Grisel
The only problem I see with this API is that the list of text tokens
for the complete corpus now has to live in memory before being
transformed into freq vectors, which should be smaller. This might not
work on your personal laptop for wikipedia-sized corpora. But then I
am not sure we have any classifier able to deal reasonably well with
n_samples ~= 3e6 and n_features ~= 1e5 even with a density level of
1%. Such a freq-vectors dataset would weigh approximately 30GB
assuming double precision values + non-zero indices (using compressed
sparse rows for instance).
To treat such cases we will need online transformers and classifiers
first. I think we can keep such use cases for later.
I was writing a message about exactly this (!).

The fit / transform API is very nice for traditional algorithms but
it requires storing the entire dataset in memory. We could also have a
special-purpose vectorizer for online learning algorithms. In this
case, the vocabulary dictionary can be built incrementally and the
vector corresponding to a document can be "spit out" (e.g., with the
yield keyword) as soon as it is ready, so that the online learning
algorithm can readily "consume" it. But I agree that this kind of API
can be postponed until there are online classifiers in the scikit.
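
A minimal sketch of that idea (hypothetical, not code from my branch):

def online_count_vectors(token_docs):
    # Grow the vocabulary incrementally and yield one count vector
    # (as a {column: count} dict) per document, as soon as it is ready.
    vocabulary = {}
    for tokens in token_docs:
        counts = {}
        for token in tokens:
            column = vocabulary.setdefault(token, len(vocabulary))
            counts[column] = counts.get(column, 0) + 1
        yield counts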

Also, I remember a blog post (maybe in Hal Daume's blog), where it was
written that in online algorithms, if you have one thread for
transforming your documents to vectors and another for the learning,
the second basically spends most of its time waiting for the first to
deliver vectors to learn!

Mathieu
Olivier Grisel
2010-09-14 17:44:30 UTC
Post by Mathieu Blondel
Also, I remember a blog post (maybe in Hal Daume's blog), where it was
written that in online algorithms, if you have one thread for
transforming your documents to vectors and another for the learning,
the second basically spends most of its time waiting for the first to
deliver vectors to learn!
Yes this is what vowpal wabbit does IIRC. In fact feature extraction /
hashing is what takes most of the time when training linear model
classifiers. Once you have packed your vectors into dense blocks,
computing dot products on them is very fast.

Also if your documents live in a distributed DB such as Hadoop DFS or
HBase, it is trivial and more efficient to perform the feature
extraction in parallel using MapReduce and then train the classifier
on the resulting vectors (you can even move the online classifier from
one node to another instead of moving the data around, assuming you
have no way to parallelize the classifier fitting).

But again, cluster-friendly machine learning is yet another issue, not
to be addressed in the short term by the scikit I guess.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-09-14 18:40:09 UTC
This might not work on your personal laptop for wikipedia-sized
corpora. But then I am not sure we have any classifier able to deal
reasonably well with n_samples ~= 3e6 and n_features ~= 1e5 even with
a density level of 1%. Such a freq-vectors dataset would weigh
approximately 30GB assuming double precision values + non-zero indices
(using compressed sparse rows for instance).
If you can fit that in an array-like object, you can solve (partly) the
memory issue using memmapping and direct seek to the data you need. The
problem then becomes implementing algorithms that are local in memory in
order not to spend your time seeking... so you probably haven't solved
much :)

Gaël
Mathieu Blondel
2010-09-14 23:51:55 UTC
On Wed, Sep 15, 2010 at 12:06 AM, Olivier Grisel
Post by Olivier Grisel
Alternatively we could have the TermCount / TermFreq / TfIdf variants
as init parameters of the Dictionary[Sparse|Dense]Vectorizer and
Hashing[Sparse|Dense]Vectorizer classes. That would make it easier to
select the best representation by grid search + cross-validation of
that param.
I gave the API a second thought. Once the data has been vectorized
and you have a matrix of term counts, it's easy to write a TF-IDF
transformer that learns the IDF weight vector in fit() and applies
TF*IDF in transform(). In the pipeline, you would thus have:

Input: raw data
Pipeline: WordNGramAnalyzer => TermCountVectorizer /
TermCountHashingVectorizer => TfidfTransformer => ...

If we don't use a transformer API for the vectorizer and accept
starting the pipeline from the TfidfTransformer, then it will not be
necessary to make two passes on the data, and thus it doesn't have to
remain in memory.

Input: matrix of term counts
Pipeline: TfidfTransformer => ...
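
For illustration, here is a minimal sketch of such a transformer,
using one common IDF formula (the exact weighting in my draft may
differ):

import numpy as np

class TfidfTransformer(object):
    def fit(self, X):
        # X is a (n_documents, n_terms) matrix of term counts
        df = np.sum(X > 0, axis=0)  # document frequency of each term
        self.idf = np.log(float(X.shape[0]) / (1.0 + df)) + 1.0
        return self

    def transform(self, X):
        # term frequencies: counts normalized per document
        totals = np.maximum(X.sum(axis=1), 1.0)[:, np.newaxis]
        return (X / totals) * self.idf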

Or, I thought of a possible trick to avoid making two passes on the
data while keeping the transformer API. You could do nothing in fit().
Then, you would do the learning (e.g., IDF weights and vocabulary
dictionary) the first time transform() is called and return the
matrix. The second time transform() is called, for example on a new
input, the learning would already be done and it would just return the
matrix. transform() is not supposed to change the state of the object,
but this may be a possible workaround... Gael, would that work in the
pipeline?

Mathieu
Mathieu Blondel
2010-09-15 05:27:04 UTC
I've updated the dictionary-based vectorizer to follow the
transformer API. Hopefully this will help us get a better picture of
the problem.

http://github.com/mblondel/scikit-learn/commits/textextract

There are 3 classes:
- TermCountVectorizer: transform tokenized documents into a document x
term/token-count matrix. transform() changes the state of the object
the first time it's called. This way, only one pass on the data is
needed.
- TfidfTransformer: transform the document x term/token-count matrix
from above into a TF-IDF representation (IDF is optional).
- TfidfVectorizer: a shorthand that uses the pipeline API to do the
above two in one step, i.e., transform tokenized documents into a
TF-IDF representation.

I've also added a transform() method to the Pipeline object, as it was
needed to implement TfidfVectorizer.

I didn't change the analyzers yet so as to not break anything.

Comments welcome!
Mathieu
Gael Varoquaux
2010-09-15 05:47:50 UTC
Post by Mathieu Blondel
Or, I thought of a possible trick to avoid making two passes on the
data while keeping the transformer API. You could do nothing in fit().
Then, you would do the learning (e.g., IDF weights and vocabulary
dictionary) the first time transform() is called and return the
matrix. The second time transform() is called, for example on a new
input, the learning would already be done and it would just return the
matrix. transform() is not supposed to change the state of the object,
but this may be a possible workaround... Gael, would that work in the
pipeline?
I don't really have a good grasp of the application you are talking
about, so I fear I am going to talk nonsense.

I don't see an issue with not doing anything in fit. However, I am a bit
worried about doing the learning in transform: what happens if people
change the data? For instance, when cross validating, I spend my time
doing, with the same object but different X_train, X_test:

trans.fit(X_train).transform(X_test)

You will probably need a mechanism to detect that the data has not
changed. That is getting close to the reasons why we designed joblib.

I have the impression that the issues that you are mentioning here are
issues that I solve in my code using joblib. The reason being that it
enables me to separate the algorithmic logic from the dataflow
programming, and I find it helps me refactor my scientific code more
quickly.

My 2 cents,

Gaël
Mathieu Blondel
2010-09-15 06:16:44 UTC
On Wed, Sep 15, 2010 at 2:47 PM, Gael Varoquaux
Post by Gael Varoquaux
I don't see an issue with not doing anything in fit. However, I am a bit
worried about doing the learning in transform: what happens if people
change the data? For instance, when cross validating, I spend my time
   trans.fit(X_train).transform(X_test)
Indeed, this case would fail. I can imagine this scenario in a
dimensionality reduction setting but in a *feature extraction*
setting, this would mean that the user never uses the (vectorized)
training data... My code assumes that the user subsequently uses the
training data. For example,

X_train = trans.fit(docs_train).transform(docs_train)
classifier.fit(X_train, y)
X_test = trans.transform(docs_test)
y_test = classifier.predict(X_test)

Mathieu
Alexandre Gramfort
2010-09-15 07:12:55 UTC
Post by Mathieu Blondel
X_train = trans.fit(docs_train).transform(docs_train)
classifier.fit(X_train, y)
X_test = trans.transform(docs_test)
y_test = classifier.predict(X_test)
this reminds me of the manifold use case where basically
we compute an embedding of the train data that is exactly the
result of a transform of the train data. Let me explain

I could write:

X_train_embedded = trans.fit(X_train).transform(X_train)

however this is equivalent to

X_train_embedded = trans.fit(X_train).embedding_

Maybe one could imagine that if transform has no argument it returns
the transformed X_train, but this would break the way the pipeline
works... OK, that's just a random thought early in the morning...

Alex
Mathieu Blondel
2010-09-15 14:43:55 UTC
On Wed, Sep 15, 2010 at 4:12 PM, Alexandre Gramfort
Post by Alexandre Gramfort
this reminds me of the manifold use case where basically
we compute an embedding of the train data that is exactly the
result of a transform of the train data. Let me explain
X_train_embedded = trans.fit(X_train).transform(X_train)
however this is equivalent to
X_train_embedded = trans.fit(X_train).embedding_
Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation
would also be in that case, I think.

- If the matrix passed to transform is the same as the one passed to
fit (i.e., training data), the transformed matrix is the parameters
learned by fit()

- If the matrix is different (i.e., test data), additional
computations are needed

Conclusion: in transform, you need to be able to tell if the matrix is
the same as in fit or not.

Mathieu
Alexandre Gramfort
2010-09-15 14:45:58 UTC
Post by Mathieu Blondel
- If the matrix is different (i.e., test data), additional
computations are needed
Conclusion: in transform, you need to be able to tell if the matrix is
the same as in fit or not.
joblib is your friend :)

Alex
Olivier Grisel
2010-09-15 14:58:46 UTC
Post by Alexandre Gramfort
Post by Mathieu Blondel
- If the matrix is different (i.e., test data), additional
computations are needed
Conclusion: in transform, you need to be able to tell if the matrix is
the same as in fit or not.
joblib is your friend :)
That's fine when the data is a numpy array but that won't work for
lists of lists of unicode strings with inner lists of variable length.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-09-15 16:47:44 UTC
OK, I am done with some other work, so I'll reply to the thread :)
Post by Olivier Grisel
Post by Alexandre Gramfort
Post by Mathieu Blondel
- If the matrix is different (i.e., test data), additional
computations are needed
Conclusion: in transform, you need to be able to tell if the matrix is
the same as in fit or not.
joblib is your friend :)
That's fine when the data is a numpy array but that won't work for
lists of lists of unicode strings with inner lists of variable length.
I think your remark is interesting. My experience is that checking
whether input data has changed or not is actually a bit challenging
for arbitrary objects, and that's one of the reasons I forked joblib
out of my internal code and am putting some effort into it. For what
it's worth, joblib works perfectly on arbitrary objects, and I use it
that way every day, although its performance might not be as good as
with numpy arrays.

In general, I find that the only (somewhat) reliable way to check for
modifications is to rely on pickling and hashing. First of all, you
can't really rely on Python's equality, as it is often overridden
(think numpy arrays) and when it is not, '==' tends to check for
identity in the Python sense (id). You can't check only identity since
objects modified in place will fall through the cracks, and when you
work with big data (numpy arrays or dicts) you modify objects in
place.

So you need to recursively check for equality of the content in a
graph of objects (objects/subobjects/items in a list...). Doing this
amounts to walking this graph, avoiding cycles, and if you look at the
way the standard Python pickler is defined, that's pretty much how it
works. Second, you can't afford to keep shadow objects, because that
would amount to deep-copying, and that's a killer in terms of
performance. That's where the hashing comes in.

The way it is implemented in joblib is pretty much to use pickle and
take an md5 hash of the pickle, but I have subclassed the standard
pickler to take hashes of numpy arrays via their buffer interface
rather than via a pickle:
http://github.com/joblib/joblib/blob/master/joblib/hashing.py
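
The core idea fits in a few lines (a sketch; joblib's actual
implementation subclasses the pickler to special-case numpy arrays as
described above):

import hashlib
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle

def object_hash(obj):
    # Hash an arbitrary picklable object via its pickled byte stream.
    return hashlib.md5(pickle.dumps(obj, protocol=2)).hexdigest()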

This is efficient only for objects for which the heavy data is stored
in a numpy array. I believe that it would be fairly easy to extend it
to data stored in objects that expose a buffer protocol. How to go
further, I don't know. Suppose that you want to check that the content
of an SQLite object has not changed? Or simply a list of lists of
unicode strings (as you mention)? I don't know how to be efficient on
that (pickling can be a bit costly in terms of memory) and I'd really
be interested in suggestions on how to tackle this problem.

To conclude, there are quite a lot of subtleties in these problems
related to the Python object model (unless you special-case them on
certain objects). I must admit that I like this code living outside
the scikit (and being included as a bundled dependency) because it is
code that I believe machine learners are not especially qualified to
maintain, and because more eyes make bugs shallow. That's why I am not
sure why you are not convinced that joblib is a good option. To be
fair, I can see another one if you are only interested in
dictionaries: a fake dictionary that would either forbid
modifications, or record them. Traits has something like this, but it
is also hard work to implement.

My 2 euro cents,

Gaël
Olivier Grisel
2010-09-15 15:03:39 UTC
Post by Mathieu Blondel
On Wed, Sep 15, 2010 at 4:12 PM, Alexandre Gramfort
Post by Alexandre Gramfort
this reminds me of the manifold use case where basically
we compute an embedding of the train data that is exactly the
result of a transform of the train data. Let me explain
X_train_embedded = trans.fit(X_train).transform(X_train)
however this is equivalent to
X_train_embedded = trans.fit(X_train).embedding_
Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation
would also be in that case, I think.
- If the matrix passed to transform is the same as the one passed to
fit (i.e., training data), the transformed matrix is the parameters
learned by fit()
- If the matrix is different (i.e., test data), additional
computations are needed
Conclusion: in transform, you need to be able to tell if the matrix is
the same as in fit or not.
I prefer to make it explicit by using a new fit_transform() API as
explained earlier instead of implicit equality magic in transform (or
worse, an implicit stateful transform that does things that look like
estimating internal parameters).

Fabian, Gael, others: do you have an opinion on this?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-09-15 15:07:38 UTC
Post by Olivier Grisel
I prefer to make it explicit by using a new fit_transform() API as
explained earlier instead of implicit equality magic in transform (or
worse, an implicit stateful transform that does things that look like
estimating internal parameters).
Fabian, Gael, others: do you have an opinion on this?
I have a strong one, that's why I am trying to shut up and listen to what
others have to say in order to learn from them :). But I know that I'll
soon yield and reply to a few messages in this thread :).

Fabian is on holidays (as you can clearly see from the commit log :P), so
I don't expect a reply from him terribly soon.

Gaël
Olivier Grisel
2010-09-15 15:15:52 UTC
Post by Gael Varoquaux
Post by Olivier Grisel
I prefer to make it explicit by using a new fit_transform() API as
explained earlier instead of implicit equality magic in transform (or
worse, an implicit stateful transform that does things that look like
estimating internal parameters).
Fabian, Gael, others: do you have an opinion on this?
I have a strong one, that's why I am trying to shut up and listen to what
others have to say in order to learn from them :). But I know that I'll
soon yield and reply to a few messages in this thread :).
Fabian is on holidays (as you can clearly see from the commit log :P), so
I don't expect a reply from him terribly soon.
Alright. Any other opinion?

Maybe we should update the mlcomp document classification examples
and make them use the Pipeline object in Mathieu's branch. It would
also be great to update an existing example, or write a new one, for
the embedding pattern.

Having real-life worked examples makes it easier to decide which API
looks better from a user perspective.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2010-09-15 15:33:37 UTC
Post by Olivier Grisel
Maybe we should update the mlcomp document classification examples
and make them use the Pipeline object in Mathieu's branch. It would
also be great to update an existing example, or write a new one, for
the embedding pattern.
Having real-life worked examples makes it easier to decide which API
looks better from a user perspective.
+1

it will also be less abstract for me :)

Alex
Gael Varoquaux
2010-09-15 16:56:06 UTC
Post by Olivier Grisel
Maybe we should update the mlcomp document classification examples
and make them use the Pipeline object in Mathieu's branch. It would
also be great to update an existing example, or write a new one, for
the embedding pattern.
Having real-life worked examples makes it easier to decide which API
looks better from a user perspective.
+1
+1. It would greatly help me.

And if you can make the example work on something other than mlcomp's
data it would be great, so that the average Joe can run it.

Gaël
Gael Varoquaux
2010-09-15 16:57:33 UTC
Post by Olivier Grisel
I prefer to make it explicit by using a new fit_transform() API as
explained earlier instead of implicit equality magic in transform (or
worst an implicit stateful transform that does stuff that look like
estimating internal parameters).
I agree fully with every point above.

G.
Gael Varoquaux
2010-09-15 16:54:29 UTC
Post by Mathieu Blondel
Conclusion: in transform, you need to be able to tell if the matrix is
the same as in fit or not.
I'd rather avoid having such features in the scikit because:

- Such checks must often rely on non-trivial code, and when they
fail, you bite the dust (it has happened to me sooo often).
- They induce behavior that can be surprising.

I am not saying that there is no value in having on-demand computing,
or data-flow programming. There is a huge amount of value. I am just
saying that I would like these problems to be solved outside the
estimator objects, possibly joining forces with other projects.
Separating the algorithms from the execution engine opens the door to
improving them separately.

Gaël
Gael Varoquaux
2010-09-15 07:14:48 UTC
Post by Mathieu Blondel
Post by Gael Varoquaux
   trans.fit(X_train).transform(X_test)
Indeed, this case would fail. I can imagine this scenario in a
dimensionality reduction setting but in a *feature extraction*
setting, this would mean that the user never uses the (vectorized)
training data...
My worry is that it opens the door to mistakes in which people use some
of the test set to train the estimator. If you know what you are doing,
and know which step is independent of which, you can easily avoid these
mistakes. However, if you don't really understand what is going on behind
an estimator or a transform, the door is open to mistakes.
Post by Mathieu Blondel
My code assumes that the user subsequently uses the
training data. For example,
X_train = trans.fit(docs_train).transform(docs_train)
classifier.fit(X_train, y)
X_test = trans.transform(docs_test)
y_test = classifier.predict(X_test)
That seems to me absolutely fine, as long as you don't change docs_train.

I guess we are talking too much in the abstract for me, as I don't really
know the details of what you want to do, so you should just go ahead.
Olivier is the guy who should be looking at your code and merging it.

Thanks a lot for your contributions, I like the way this feature
extraction code is shaping.

Gaël
Mathieu Blondel
2010-09-15 08:51:51 UTC
On Wed, Sep 15, 2010 at 4:14 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Mathieu Blondel
Post by Gael Varoquaux
   trans.fit(X_train).transform(X_test)
Indeed, this case would fail. I can imagine this scenario in a
dimensionality reduction setting but in a *feature extraction*
setting, this would mean that the user never uses the (vectorized)
training data...
My worry is that it opens the door to mistakes in which people use some
of the test set to train the estimator. If you know what you are doing,
and know which step is independent of which, you can easily avoid these
mistakes. However, if you don't really understand what is going on behind
an estimator or a transform, the door is open to mistakes.
As I said before, I introduced a TfidfVectorizer which is a shorthand
for TermCountVectorizer followed by TfidfTransformer, using the
pipeline object.

Now let's have a look at how TfidfVectorizer behaves with the
problematic case pointed out by Gael:

X_test = TfidfVectorizer().fit(docs_train).transform(docs_test)

Since TfidfVectorizer is a pipeline of [TermCountVectorizer,
TfidfTransformer], TfidfVectorizer().fit(docs_train) will run:

# here fit does nothing, transform does the learning + the transform
Xt = TermCountVectorizer.fit(docs_train).transform(docs_train) # line 1
# here fit does something (and returns self, not a matrix)
TfidfTransformer.fit(Xt) # line 2

Next, we call transform(docs_test) on the fitted pipeline, which results in:

# here TermCountVectorizer was already trained by transform() at line 1
Xt = TermCountVectorizer.transform(docs_test)
# here TfidfTransformer was already trained by fit at line 2
Xt = TfidfTransformer.transform(Xt)

Conclusions:
1) doing the learning in transform() is not incompatible with the pipeline
2) the pipeline makes the whole thing look like a black box and
abstracts away the possible misconception that users may have

Mathieu
Olivier Grisel
2010-09-15 09:56:50 UTC
Post by Mathieu Blondel
1) doing the learning in transform() is not incompatible with the pipeline
2) the pipeline makes the whole thing look like a black box and
abstracts away the possible misconception that users may have
Hi,

I did not have the time to have a look at your code yet. Probably
won't have the time today either. But will do ASAP, probably this
weekend. In the meantime, just by reading the thread:

- I vote -1 for classes that have a transform() method that changes
the state of the instance. Statefulness should be wrapped explicitly
in the fit method. For text extraction this is useful for 2 use cases:
estimating the top N most frequent features to build the vocabulary /
dictionary for dictionary-based vectorizers, and estimating the IDF
vector for TF-IDF flavored vectorizers.

- I don't see how you can have a TermCountVectorizer that has no fit
method: you need to estimate the most frequent terms to build the
vocabulary. Unless you give a predefined dictionary in the __init__.
The only term count/frequency vectorizer that can be implemented in a
fully stateless way (no fit method) is the hashing variant (be it
sparse or dense, TermCount or TermFreq).

- For the case where it is too expensive to call fit and drop the
intermediate results, maybe we could add a
transformer.fit_transform(train_data) method that is equivalent to
transformer.fit(train_data).transform(train_data), and that the
pipeline can use as an optimization when available to avoid processing
train_data twice.
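
The pipeline-side dispatch could then be as simple as (a sketch):

def fit_transform_step(transformer, X):
    # Prefer the one-pass fit_transform when the transformer provides
    # it; otherwise fall back to the generic two-pass route.
    if hasattr(transformer, 'fit_transform'):
        return transformer.fit_transform(X)
    return transformer.fit(X).transform(X)
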
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-15 14:39:02 UTC
On Wed, Sep 15, 2010 at 6:56 PM, Olivier Grisel
Post by Olivier Grisel
Post by Mathieu Blondel
1) doing the learning in transform() is not incompatible with the pipeline
2) the pipeline makes the whole thing look like a black box and
abstracts away the possible misconception that users may have
Hi,
I did not have the time to have a look at your code yet. Probably
won't have the time today either. But will do ASAP, probably this
weekend.
- I vote -1 for classes that have a transform() method that changes
the state of the instance. Statefulness should be wrapped explicitly
in the fit method. For text extraction this is useful for 2 use cases:
estimating the top N most frequent features to build the vocabulary /
dictionary for dictionary-based vectorizers, and estimating the IDF
vector for TF-IDF flavored vectorizers.
- I don't see how you can have a TermCountVectorizer that has no fit
method: you need to estimate the most frequent terms to build the
vocabulary. Unless you give a predefined dictionary in the __init__.
The only term count/frequency vectorizer that can be implemented in a
fully stateless way (no fit method) is the hashing variant (be it
sparse or dense, TermCount or TermFreq).
This is done in transform(), which, as you said, is stateful! You can
think of transform() as doing implicit learning the first time it's
called.
Post by Olivier Grisel
- For the case where it is too expensive to call fit and drop the
intermediate results, maybe we could add a
transformer.fit_transform(train_data) method that is equivalent to
transformer.fit(train_data).transform(train_data), and that the
pipeline can use as an optimization when available to avoid processing
train_data twice.
As a general rule, doing the same processing twice should be PROSCRIBED.
Gael Varoquaux
2010-09-15 17:07:01 UTC
Post by Mathieu Blondel
Post by Olivier Grisel
- For the case where it is too expensive to call fit and drop the
intermediate results, maybe we could add a
transformer.fit_transform(train_data) method that is equivalent to
transformer.fit(train_data).transform(train_data), and that the
pipeline can use as an optimization when available to avoid processing
train_data twice.
As a general rule, doing the same processing twice should be PROSCRIBED.
I would be interested in how you think this could be implemented. I
find that avoiding doing the same computation twice is one of the
important problems that keeps being addressed in computing, and I
haven't seen any general solution. I personally keep fighting these
problems in my own code. I can sometimes resort to bottom-up dynamic
programming (as with joblib or specially-crafted memoization), but it
often has its own issues.

We have discussed these issues a few times during sprints or drinking
events (coffee or beer). We have somewhat come to the conclusion that, in
the spirit of moving forward and solving the core problem that the scikit
is trying to address, we would punt on these issues and focus on having
good algorithms with simple interfaces. These interfaces should make it
possible to do clever things like data flow programming, but not attempt
to solve these general problems.
Mathieu Blondel
2010-09-15 18:56:34 UTC
On Thu, Sep 16, 2010 at 2:07 AM, Gael Varoquaux
Post by Gael Varoquaux
I would be interested in how you think this could be implemented. I
find that avoiding doing the same computation twice is one of the
important problems that keeps being addressed in computing, and I
haven't seen any general solution. I personally keep fighting these
problems in my own code. I can sometimes resort to bottom-up dynamic
programming (as with joblib or specially-crafted memoization), but it
often has its own issues.
My point was just that algorithmically you should always strive not
to do the same calculations twice. Memoization comes to the rescue
when the algorithm makes it inherently difficult, but it is a poor
man's solution if you can optimize the original algorithm directly.
joblib also serves the purpose of saving computations across runs of a
program, i.e., object persistence, which is slightly different from
memoization.

The original motivation, which consisted in making the feature
extraction a transformer, was nice. But the fit / transform split
doesn't work well here because what fit and transform do is so close
that doing both at the same time is much more efficient. Ron also
mentioned that for some of the methods in the HMM code.

Traditional algorithms (as opposed to online ones) do need to store
the entire matrix in memory. However, as Olivier pointed out, the raw
document collection is typically much larger than a sparse matrix. So
we really want each raw document to be loaded one at a time with a
generator and to build the matrix from it.

I'm not particularly attached to the stateful transform() solution, I
just don't want to go over the dataset twice.

The fit_transform() solution seems to be a nice idea. As an
additional benefit, it seems to solve the problem pointed out by
Alexandre for embedding. The pipeline can detect fit_transform() and
use it, or fall back to fit().transform() when not available. Objects
with a fit_transform() will typically need a transform() method as
well so you can do:

X_train_new = trans.fit_transform(X_train)
X_test_new = trans.transform(X_test)

I've got some new ideas I would like to try out so I will continue to
improve the draft and eventually create an example with mlcomp.
Meanwhile, you can always have a look at the unit tests to get an idea
of the API in practice.

Mathieu
Gael Varoquaux
2010-09-15 19:01:28 UTC
Post by Mathieu Blondel
My point was just that algorithmically you should always strive not
to do the same calculations twice. Memoization comes to the rescue
when the algorithm makes it inherently difficult, but it is a poor
man's solution if you can optimize the original algorithm directly.
OK, so we are absolutely on the same page!
Post by Mathieu Blondel
Traditional algorithms (as opposed to online ones) do need to store
the entire matrix in memory. However, as Olivier pointed out, the raw
document collection is typically much larger than a sparse matrix. So
we really want each raw document to be loaded one at a time with a
generator and to build the matrix from it.
I'm not particularly attached to the stateful transform() solution, I
just don't want to go over the dataset twice.
The fit_transform() solution seems to be a nice idea. As an additional
benefit, it seems to solve the problem pointed out by Alexandre for
embedding. The pipeline can detect fit_transform() and use it, or
fall back to fit().transform() when not available. Objects with a
fit_transform() will typically need a transform() method as well so
X_train_new = trans.fit_transform(X_train)
X_test_new = trans.transform(X_test)
I've got some new ideas I would like to try out so I will continue to
improve the draft and eventually create an example with mlcomp.
Meanwhile, you can always have a look at the unit tests to get an idea
of the API in practice.
OK, it seems that we are converging. I like your suggestions.

Thanks a lot for discussing, it helps having different points of view.

Gaël
Mathieu Blondel
2010-09-16 09:32:48 UTC
I made another iteration over the draft, with the fit_transform idea:
http://github.com/mblondel/scikit-learn/commits/textextract

Various remarks:

- After giving it some thought, the analyzer (which does the
tokenization) should not be a transformer. This is because transform()
would have to return the whole tokenized document collection, which is
much larger than a sparse matrix. So, the analyzer should remain a
parameter of the vectorizer, as was the case in the old API. For the
hashing vectorizer, it would be even more of a pity if the whole
dataset had to be loaded in memory...

- For the above reason, I designed the vectorizer to use iterables,
which allows calling the analyzer on one document at a time. However,
cross-validation needs to know the number of samples. Thus if you want
to do grid search on a pipeline, you do need to convert the documents
to a list. See the unit tests for an example of grid search on a
pipeline (vectorizer + classifier).

- I patched pipeline.fit() to use fit_transform() when available and
fall back to fit().transform() when not.

- I patched the fit_grid_point to accept lists (no fancy indexing in
lists). Without this patch you can't build a pipeline of vectorizer +
transformers + classifier and optimize it with grid search.

- Vectorizer is a shortcut for CountVectorizer + TfidfTransformer.
This is useful when users quickly want to transform their raw
documents to a matrix with a one-liner, e.g. in the interactive
interpreter. You can choose between counts, TF, or TF-IDF.
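
Usage then boils down to something like this (assuming the
fit_transform entry point from the draft; otherwise fit() followed by
transform() does the same in two steps):

X = Vectorizer().fit_transform(raw_documents)  # TF-IDF matrix in one line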

Mathieu
Olivier Grisel
2010-09-16 09:47:57 UTC
Post by Mathieu Blondel
http://github.com/mblondel/scikit-learn/commits/textextract
- After giving it some thought, the analyzer (which does the
tokenization) should not be a transformer. This is because transform()
would have to return the whole tokenized document collection, which is
much larger than a sparse matrix. So, the analyzer should remain a
parameter of the vectorizer, as was the case in the old API. For the
hashing vectorizer, it would be even more of a pity if the whole
dataset had to be loaded in memory...
- For the above reason, I designed the vectorizer to use iterables,
which allows calling the analyzer on one document at a time. However,
cross-validation needs to know the number of samples. Thus if you want
to do grid search on a pipeline, you do need to convert the documents
to a list. See the unit tests for an example of grid search on a
pipeline (vectorizer + classifier).
- I patched pipeline.fit() to use fit_transform() when available and
fall back to fit().transform() when not.
- I patched the fit_grid_point to accept lists (no fancy indexing in
lists). Without this patch you can't build a pipeline of vectorizer +
transformers + classifier and optimize it with grid search.
- Vectorizer is a shortcut for CountVectorizer + TfidfTransformer.
This is useful when users quickly want to transform their raw
documents to a matrix with a one-liner, e.g. in the interactive
interpreter. You can choose between counts, TF, or TF-IDF.
Thanks for your work and the detailed report. All of this sounds very
nice to my ears :)

Before merging your work to master, we will have to update the way
the document classification dataset loaders work. They are used in
these examples:

http://github.com/mblondel/scikit-learn/blob/textextract/examples/mlcomp_sparse_document_classification.py
http://github.com/mblondel/scikit-learn/blob/textextract/examples/mlcomp_document_classification.py

Maybe we should change the dataset loader to spit out an iterator of
raw text files suitable for consumption by the vectorizers instead of
precomputed frequency matrices? The NLTK corpus handler can also be
used as a reference example of lazily loading text data on demand
without reading all the text into memory first.
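
For example, the loader could lazily stream files off disk, something
like this (a sketch with a made-up on-disk layout):

import os

def iter_raw_documents(folder):
    # Yield the raw text of one file at a time so the whole corpus
    # never has to sit in memory.
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if os.path.isfile(path):
            yield open(path).read()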

That would allow us to showcase the feature extraction + classifier
pipeline explicitly in the examples, instead of hiding the feature
extraction with fixed parameters in the dataset loader module.

Can't wait for this weekend to apply this chain to a real-life
Twitter classification problem.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-16 10:26:35 UTC
On Thu, Sep 16, 2010 at 6:47 PM, Olivier Grisel
Post by Olivier Grisel
Before merging your work to master, we will have to update the way
the document classification dataset loaders work. They are used in
There are also a number of things that I didn't do yet because I
wanted to wait for the API to be accepted. This includes:

- apply the same modifications as WordNGramAnalyzer to CharNGramAnalyzer
- sparse versions: SparseCountVectorizer, SparseTfidfTransformer,
SparseVectorizer
- hashing: HashingCountVectorizer, SparseHashingCountVectorizer (the
TfidfTransformer can be re-used, right?)

For the hashing classes, I will let Olivier do it :)

Before implementing that, I would like to wait a little bit more to be
sure that we have reached a sensible API.
Post by Olivier Grisel
http://github.com/mblondel/scikit-learn/blob/textextract/examples/mlcomp_sparse_document_classification.py
http://github.com/mblondel/scikit-learn/blob/textextract/examples/mlcomp_document_classification.py
Maybe we should change the dataset loader to spit out an iterator of
raw text files suitable for consumption by the vectorizers instead of
precomputed frequency matrices? The NLTK corpus handler can also be
used as a reference example of lazily loading text data on demand
without reading all the text into memory first.
That would allow us to showcase the feature extraction + classifier
pipeline explicitly in the examples, instead of hiding the feature
extraction with fixed parameters in the dataset loader module.
Indeed!

Mathieu
Olivier Grisel
2010-09-16 12:49:28 UTC
Post by Mathieu Blondel
On Thu, Sep 16, 2010 at 6:47 PM, Olivier Grisel
Post by Olivier Grisel
Before merging your work to master, we will have to update the way
the document classification dataset loaders work. They are used in
There are also a number of things that I didn't do yet because I wanted
- apply the same modifications as WordNGramAnalyzer to  CharNGramAnalyzer
- sparse versions: SparseCountVectorizer, SparseTfidfTransformer,
SparseVectorizer
- hashing: HashingCountVectorizer, SparseHashingCountVectorizer (the
TfidfTransformer can be re-used, right?)
Need to check the details but most probably yes.
Post by Mathieu Blondel
For the hashing classes, I will let Olivier do it :)
Alright :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-18 17:49:08 UTC
Post by Mathieu Blondel
- apply the same modifications as WordNGramAnalyzer to  CharNGramAnalyzer
- sparse versions: SparseCountVectorizer, SparseTfidfTransformer,
SparseVectorizer
Can I go ahead and implement those?

Mathieu
Olivier Grisel
2010-09-18 19:24:56 UTC
Post by Mathieu Blondel
Post by Mathieu Blondel
- apply the same modifications as WordNGramAnalyzer to  CharNGramAnalyzer
- sparse versions: SparseCountVectorizer, SparseTfidfTransformer,
SparseVectorizer
Can I go ahead and implement those?
Yes, please feel free to do so. I am sprinting with a potential new
contributor, but I have a couple of issues to resolve before actually
diving into the text feature extraction code. I'll watch your branch
before undertaking any significant code change.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2010-09-18 20:52:39 UTC
Post by Olivier Grisel
Post by Mathieu Blondel
Post by Mathieu Blondel
- apply the same modifications as WordNGramAnalyzer to  CharNGramAnalyzer
- sparse versions: SparseCountVectorizer, SparseTfidfTransformer,
SparseVectorizer
Can I go ahead and implement those?
Yes, please feel free to do so. I am sprinting with a potential new
contributor, but I have a couple of issues to resolve before actually
diving into the text feature extraction code. I'll watch your branch
before undertaking any significant code change.
BTW, please feel free to join the #scikit-learn IRC channel when
working on the scikit.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2010-09-22 09:24:50 UTC
Post by Mathieu Blondel
- apply the same modifications as WordNGramAnalyzer to  CharNGramAnalyzer
- sparse versions: SparseCountVectorizer, SparseTfidfTransformer,
SparseVectorizer
Hi Mathieu,

I will be offline for the next 10 days or so. If you plan to work on
this in the meantime, please do so, but merge my branch first:

http://github.com/ogrisel/scikit-learn/tree/textextract

It contains some early work aimed at removing the feature extraction
from the document classification API. All tests pass, but the
document classification examples need a rewrite similar to the test
you wrote with the pipeline combining a vectorizer with a classifier.

Please have a look at this changeset to grasp where I am heading:

http://github.com/ogrisel/scikit-learn/commit/41d3f72b7930f87bcccc9e3478c752006cdc2316

Best,
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2010-09-22 10:38:37 UTC
Also one remark on the CountVectorizer: it currently uses a dense
numpy array representation and the vocabulary is not thresholded. It
will probably be useless for most real-life use cases, where the
number of distinct words will be on the order of 1e6 (even without
n-grams). I think we should provide an additional parameter n_features
to the __init__ that selects the top n_features most frequent
vocabulary items, drops the remaining items, and then builds the
vocabulary out of those features only.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-09-22 12:38:36 UTC
On Wed, Sep 22, 2010 at 7:38 PM, Olivier Grisel
Post by Olivier Grisel
Also one remark on the CountVectorizer: it currently uses a dense
numpy array representation and the vocabulary is not thresholded. It
will probably be useless for most real-life use cases, where the
number of distinct words will be on the order of 1e6 (even without
n-grams). I think we should provide an additional parameter n_features
to the __init__ that selects the top n_features most frequent
vocabulary items, drops the remaining items, and then builds the
vocabulary out of those features only.
Regarding the matrix, as I wrote before, I'm planning to add a
SparseCountVectorizer (together with a SparseTfidfTransformer).

Regarding the vocabulary size, the most frequent terms are not
necessarily the most discriminative (e.g. "the"). I often see papers
where they remove very infrequent words (e.g. words that don't occur
more than x times in the corpus) and very frequent words (e.g. words
that occur in more than x percent of the documents). Those are
frequency-based kinds of unsupervised feature selection. So is
selecting the n most frequent features. Since the best choice is
likely to be application dependent, it seems reasonable to use a
transformer API.
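
A minimal sketch of such a frequency-based selection step
(hypothetical names; min_df is an absolute count, max_df a fraction of
the documents):

import numpy as np

def select_by_document_frequency(X, min_df=2, max_df=0.5):
    # Keep term columns whose document frequency lies between min_df
    # and max_df * n_documents; X is a dense (n_documents, n_terms)
    # count matrix.
    df = np.sum(X > 0, axis=0)
    mask = (df >= min_df) & (df <= max_df * X.shape[0])
    return X[:, mask], mask

Returning the boolean mask along with the reduced matrix is also one
way to keep track of which columns survived, which relates to the
problem below.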

One problem with feature selection through the transformer API,
though, is that you lose track of which features have been selected or
discarded. For example, in the case of supervised feature selection
for text, I would like to know which words/n-grams are useful for my
task. With the transformer API, I get a lower-dimensional matrix and
that's it: I'm unable to tell what the new dimensions correspond to.
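
To make both points concrete, here is a rough sketch (all names made
up) of a document-frequency filter that keeps a record of the
surviving columns:

import numpy as np

class FrequencyFilter(object):
    # hypothetical transformer: keeps terms whose document frequency
    # is at least min_df (an absolute count) and at most max_df (a
    # fraction of the documents); X is assumed to hold raw word counts

    def __init__(self, min_df=2, max_df=0.5):
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, X, y=None):
        df = (X > 0).sum(axis=0)   # number of documents containing each term
        self.support_ = (df >= self.min_df) & (df <= self.max_df * X.shape[0])
        return self

    def transform(self, X):
        return X[:, self.support_]  # keep only the selected columns

Keeping the boolean support_ mask around is what restores the
traceability: given the original term list, np.array(terms)[f.support_]
tells you exactly which words the reduced columns correspond to.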

Mathieu
Mathieu Blondel
2010-09-22 12:43:11 UTC
Permalink
On Wed, Sep 22, 2010 at 6:24 PM, Olivier Grisel
Post by Olivier Grisel
I will be offline for the next 10 days or so. If you plan to work on
 http://github.com/ogrisel/scikit-learn/tree/textextract
Will do! :) But I'm quite busy this week.

Mathieu
Mathieu Blondel
2010-09-30 07:27:46 UTC
Permalink
I've added SparseCountVectorizer, SparseTfidfTransformer and
SparseVectorizer to my branch.

http://github.com/mblondel/scikit-learn/commits/textextract

What still needs to be done:

HashingCountVectorizer
SparseHashingCountVectorizer
HashingVectorizer (shortcut for HashingCountVectorizer + TfidfTransformer)
SparseHashingVectorizer (shortcut for SparseHashingCountVectorizer +
SparseTfidfTransformer)

SparseHashingCountVectorizer is quite a long name, so if you come up
with better class names, please tell!

Mathieu
Mathieu Blondel
2010-10-15 10:41:25 UTC
Permalink
Olivier, in your opinion, what remains to be done before we can merge
the branch into master?

On my side, I've noticed 2 problems that need to be addressed:

1) SparseTfidfTransformer is very slow. This is an implementation
issue. Converting the matrix to an appropriate scipy.sparse format and
doing matrix-vector multiplication may speed things up. Any ideas
welcome.
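
One possible direction, assuming the counts live in a scipy.sparse
matrix and the IDF vector is precomputed (function name made up):

import scipy.sparse as sp

def idf_scale(counts, idf):
    # counts: (n_docs, n_terms) sparse count matrix
    # idf: 1-d numpy array of n_terms precomputed IDF weights
    X = counts.tocsr()                          # fast arithmetic format
    D = sp.spdiags(idf, 0, len(idf), len(idf))  # sparse diagonal IDF matrix
    return X * D                                # scales column j by idf[j]

This replaces any Python-level loop over rows or columns with a single
sparse matrix product, which is where I would expect the speed-up to
come from.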

2) The vectorizers are not picklable because of the reference to the
analyzer, which in turn keeps a reference to filter functions, which
are not picklable. I can think of 2 ways to fix this. One is to use a
filter object, as pickle can serialize instance objects; the user can
thus provide their own filter object. The other is to pass the
analyzer object to fit and transform rather than to __init__. The
former solution seems better.
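
For the record, the reason the first option works: pickle can
serialize an instance of a module-level class, but not a lambda or a
locally defined function. A minimal sketch (class name made up):

import pickle

class LowerCaseFilter(object):
    # a filter as a callable object: picklable, unlike a lambda
    def __call__(self, text):
        return text.lower()

f = LowerCaseFilter()
f2 = pickle.loads(pickle.dumps(f))  # round-trips fine
assert f2("ABC") == "abc"
# by contrast, pickle.dumps(lambda t: t.lower()) raises PicklingError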

Mathieu
Post by Mathieu Blondel
I've added SparseCountVectorizer, SparseTfidfTransformer and
SparseVectorizer to my branch.
http://github.com/mblondel/scikit-learn/commits/textextract
HashingCountVectorizer
SparseHashingCountVectorizer
HashingVectorizer (shortcut for HashingCountVectorizer + TfidfTransformer)
SparseHashingVectorizer (shortcut for SparseHashingCountVectorizer +
SparseTfidfTransformer)
SparseHashingCountVectorizer is quite a long name, so if you come up
with better class names, please tell!
Mathieu
Olivier Grisel
2010-10-15 12:11:25 UTC
Permalink
Post by Mathieu Blondel
Olivier, in your opinion, what remains to be done before we can merge
the branch into master?
I haven't had the time to work on scikit-learn recently, and won't
next week either. I don't remember exactly what's remaining. The best
way to find out is to finish the update of the 2 examples for text
classification to use the new API.

http://github.com/mblondel/scikit-learn/blob/textextract/examples/mlcomp_document_classification.py
http://github.com/mblondel/scikit-learn/blob/textextract/examples/mlcomp_sparse_document_classification.py

I gave a shot at the new dataset API for text documents in this changeset:

http://github.com/mblondel/scikit-learn/commit/41d3f72b7930f87bcccc9e3478c752006cdc2316

But those changes are not yet taken into account in the examples.
Post by Mathieu Blondel
1) SparseTfidfTransformer is very slow. This is an implementation
issue. Converting the matrix to an appropriate scipy.sparse format and
doing matrix-vector multiplication may speed things up. Any ideas
welcome.
This needs some profiling love using http://packages.python.org/line_profiler/
I don't know the quick answer, but we can merge to master and do the
optimization later, as long as the output of the transform is correct.
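
For reference, the usual line_profiler workflow, as far as I know:
decorate the suspect function and run the benchmark through kernprof,
which injects the profile decorator as a builtin (the function body
below is made up):

@profile                       # provided by kernprof at run time
def scale_rows(X, idf):
    rows = []
    for row in X:              # a candidate hot spot to time line by line
        rows.append([v * w for v, w in zip(row, idf)])
    return rows

# shell usage (the exact script name may differ between versions):
#   kernprof.py -l -v bench_tfidf.py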
Post by Mathieu Blondel
2) The vectorizers are not picklable because of the reference to the
analyzer, which in turn keeps a reference to filter functions, which
are not picklable. I can think of 2 ways to fix this. One is to use a
filter object, as pickle can serialize instance objects; the user can
thus provide their own filter object. The other is to pass the
analyzer object to fit and transform rather than to __init__. The
former solution seems better.
Good catch: let's turn the filters into picklable objects.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-10-17 14:25:20 UTC
Permalink
Post by Olivier Grisel
Good catch: let's turn the filters into picklable objects.
Yes, we should try to have everything picklable in the scikit. In
general, I find that code with objects that are not picklable is
unpythonic.

Gaël
Mathieu Blondel
2010-10-21 11:10:03 UTC
Permalink
Hello,

I've done quite a bit of work on my textextract branch today
(http://github.com/mblondel/scikit-learn/commits/textextract). Here's
a summary:

- Moved Sparse* classes to feature_extract.sparse.text. This is
consistent with the rest of the scikit and makes the names shorter to
import: SparseCountVectorizer => sparse.CountVectorizer

- Moved preprocessing.py to preprocessing/__init__.py (so the API is
not broken) and added preprocessing/sparse/__init__.py

- Added Normalizer, LengthNormalizer and Binarizer to preprocessing
(dense and sparse). The sparse ones are coded in Cython; I tried to
optimize as much as I could without using Cython, but the performance
was not so good due to some limitations in scipy.sparse matrices.
sparse.TfidfTransformer now uses Normalizer and is 20x faster than
before. I think this can still be improved. Note that fit() in
Normalizer, LengthNormalizer and Binarizer doesn't do anything, but I
thought it would be nice to be able to use those in a Pipeline, hence
the transformer-like API (see the sketch after this list).

- Fixed the non-picklable filter issue.
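
The no-op-fit pattern mentioned above is just the following (minimal
sketch; the actual classes do more):

import numpy as np

class Binarizer(object):
    # stateless transformer: fit() estimates nothing but returns self,
    # so the object can be dropped into a Pipeline like any other step

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (np.asarray(X) > 0).astype(np.float64)  # nonzero -> 1.0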

Mathieu
Olivier Grisel
2010-10-21 11:54:54 UTC
Permalink
Post by Mathieu Blondel
Hello,
I've done quite a bit of work on my textextract branch today
(http://github.com/mblondel/scikit-learn/commits/textextract). Here's
a summary:
- Moved Sparse* classes to feature_extract.sparse.text. This is
consistent with the rest of the scikit and makes the names shorter to
import: SparseCountVectorizer => sparse.CountVectorizer
- Moved preprocessing.py to preprocessing/__init__.py (so the API is
not broken) and added preprocessing/sparse/__init__.py
- Added Normalizer, LengthNormalizer and Binarizer to preprocessing
(dense and sparse). The sparse ones are coded in Cython; I tried to
optimize as much as I could without using Cython, but the performance
was not so good due to some limitations in scipy.sparse matrices.
sparse.TfidfTransformer now uses Normalizer and is 20x faster than
before. I think this can still be improved. Note that fit() in
Normalizer, LengthNormalizer and Binarizer doesn't do anything, but I
thought it would be nice to be able to use those in a Pipeline, hence
the transformer-like API.
- Fixed the non-picklable filter issue.
That's amazing news. Thank you so much! I'll check out your branch
to play with it on the flight from Istanbul back to Paris :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2010-10-21 23:06:19 UTC
Permalink
Post by Olivier Grisel
That's amazing news. Thank you so much! I'll check out your branch
to play with it on the flight from Istanbul back to Paris :)
Here is the result of my flight (low battery after 2 hours, unfortunately):

http://github.com/ogrisel/scikit-learn/commits/textextract

Please merge it into your branch (I cannot issue a pull request to
your GitHub account for some reason: probably GitHub did not detect
your branch as a fork of the official scikit-learn repo).

I plan to work more on it, probably this weekend, especially to
refactor the hashing variants. But please feel free to go on without
waiting for my input :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2010-10-21 23:33:53 UTC
Permalink
On Fri, Oct 22, 2010 at 8:06 AM, Olivier Grisel
 http://github.com/ogrisel/scikit-learn/commits/textextract
Please merge it into your branch (I cannot issue a pull request to
your GitHub account for some reason: probably GitHub did not detect
your branch as a fork of the official scikit-learn repo).
Thanks for the follow-up, Olivier.
I plan to work more on it, probably this weekend, especially to
refactor the hashing variants.
If I'm not mistaken, TfidfTransformer can be reused, so you just need
to update the hashing classes to produce counts.
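
Just to sketch what "produce counts" could mean on the hashing side
(a rough, dense-for-clarity sketch; the class name follows the naming
proposed earlier in the thread, not existing code):

import numpy as np

class HashingCountVectorizer(object):
    # hash each token into one of dim fixed columns and accumulate raw
    # counts; a TfidfTransformer can then be applied on the output as-is

    def __init__(self, dim=2 ** 16):
        self.dim = dim

    def transform(self, token_lists):
        X = np.zeros((len(token_lists), self.dim))
        for i, tokens in enumerate(token_lists):
            for tok in tokens:
                X[i, hash(tok) % self.dim] += 1.0  # counts only, no idf
        return X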

Mathieu
Olivier Grisel
2010-10-21 23:38:10 UTC
Permalink
Post by Mathieu Blondel
On Fri, Oct 22, 2010 at 8:06 AM, Olivier Grisel
 http://github.com/ogrisel/scikit-learn/commits/textextract
Please merge it into your branch (I cannot issue a pull request to
your GitHub account for some reason: probably GitHub did not detect
your branch as a fork of the official scikit-learn repo).
Thanks for the follow-up, Olivier.
I plan to work more on it, probably this weekend, especially to
refactor the hashing variants.
If I'm not mistaken, TfidfTransformer can be reused, so you just need
to update the hashing classes to produce counts.
Yes I think so.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-09-22 13:52:44 UTC
Permalink
Post by Olivier Grisel
I will be offline for the next 10 days or so.
Welcome to China, my friend. It's a bit colder than I expected, and it
can rain a lot; keep this in mind when packing :)

G