Mathieu Blondel
2010-09-14 14:29:32 UTC
Hello,
I've been looking into the feature extraction code for text today and
I really like the split between the analyzer and the vectorizer. Nice!
I'm not familiar yet with hash representations so I have some questions:
- I see you can fix the number of dimensions; is it somehow related
to dimensionality reduction?
- Will the vector representation be the same if I vectorize my
training documents and, later, vectorize new test documents?
- Is it possible to get word counts only? This is often necessary for
models based on the multinomial distribution.
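To make my questions concrete, here is a minimal sketch (not scikit-learn's actual implementation) of how I understand the hashing trick: the number of dimensions is fixed up front, and because the token-to-index mapping is a pure function, vectorizing test documents later gives consistent indices. The function name and use of crc32 are illustrative only.

```python
import zlib

def hash_vectorize(tokens, n_features=16):
    # crc32 is chosen only because it is deterministic across runs;
    # each token hashes to a bucket in a fixed-size count vector
    counts = [0] * n_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        counts[idx] += 1
    return counts

train = hash_vectorize("the cat sat on the mat".split())
test = hash_vectorize("the cat".split())
```

If this matches what HashingVectorizer does, it would answer my second question: the representation stays the same for new documents because no vocabulary is learned.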
I thought it would be nice to have plain (non-hashing) vectorizers so
I created a draft here:
http://github.com/mblondel/scikit-learn/commits/textextract
Some remarks:
- I tried to use the same API as the HashingVectorizer.
- You can get vectors as word counts, term-frequencies (normalized
counts) or tf-idf frequencies.
- You can pass a vocabulary dictionary as parameter.
- The constructor doesn't have a use_idf option because the tf-idf
computation is not done in vectorize(). Is it possible to do the same
in HashingVectorizer? This would make it possible to get word counts,
as well as get rid of use_idf.
If this looks OK, I will create a SparseVectorizer object.
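For clarity, the counts / term-frequency / tf-idf progression I have in mind can be sketched as follows (the helper names and the exact idf formula are illustrative, not the draft's actual API):

```python
import math

docs = [["a", "b", "a"], ["b", "c"]]
vocab = {"a": 0, "b": 1, "c": 2}

def counts(doc):
    # raw word counts against a fixed vocabulary
    v = [0] * len(vocab)
    for tok in doc:
        v[vocab[tok]] += 1
    return v

count_matrix = [counts(d) for d in docs]

# term frequencies: counts normalized per document
tf = [[c / sum(row) for c in row] for row in count_matrix]

# idf: log(n_docs / document frequency) for each vocabulary entry
n_docs = len(docs)
df = [sum(1 for row in count_matrix if row[j] > 0)
      for j in range(len(vocab))]
idf = [math.log(n_docs / d) for d in df]

tfidf = [[t * i for t, i in zip(row, idf)] for row in tf]
```

Keeping the count matrix as the base representation is what lets the idf weighting happen outside vectorize().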
In addition, I've made a second commit to add a "filters" argument to
the WordNGramAnalyzer. This gives the user some freedom in the
preprocessing. I left stop word removal as is because it requires
tokenization, but it could be made a filter as well. If you like the
idea, I will do the same for CharNGramAnalyzer.
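To illustrate the idea, a filter would just be a callable applied to the text before tokenization, in order (these particular filter functions are hypothetical examples, not part of the commit):

```python
def lowercase(text):
    return text.lower()

def strip_digits(text):
    return "".join(ch for ch in text if not ch.isdigit())

def analyze(text, filters=()):
    # apply each filter in order, then tokenize on whitespace
    for f in filters:
        text = f(text)
    return text.split()

tokens = analyze("Call Me 42 Times", filters=[lowercase, strip_digits])
```

The appeal is that users can inject their own preprocessing (accent stripping, markup removal, etc.) without subclassing the analyzer.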
Another remark: just like in fit(), I think we should return self at
the end of vectorize() so that calls to get_tfidf() and other methods
can be chained.
As it is a very common preprocessing step in NLP, I'm planning to add
some frequency-based transformers. For example, one could remove the
dimensions corresponding to tokens that appear less/more than n times
in the whole corpus, or in less/more than n% of the documents, but
this assumes that the matrix contains word counts.
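As a sketch of the pruning I mean (function name and parameters are illustrative): compute each column's document frequency on the count matrix, then keep only the columns inside the given bounds.

```python
def prune(count_matrix, min_docs=1, max_frac=1.0):
    # document frequency: in how many documents each token appears
    n_docs = len(count_matrix)
    n_feat = len(count_matrix[0])
    df = [sum(1 for row in count_matrix if row[j] > 0)
          for j in range(n_feat)]
    # keep columns whose df falls within [min_docs, max_frac * n_docs]
    keep = [j for j in range(n_feat)
            if df[j] >= min_docs and df[j] / n_docs <= max_frac]
    return [[row[j] for j in keep] for row in count_matrix], keep

X = [[3, 1, 0], [2, 0, 0], [1, 1, 0]]
# column 0 appears in 3/3 docs, column 1 in 2/3, column 2 in none
Xp, kept = prune(X, min_docs=1, max_frac=0.9)
```

This is why the transformer has to assume word counts: document frequencies cannot be recovered from tf-idf-weighted values.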
Mathieu