Discussion:
[Scikit-learn-general] Converting NLTK feature set representation to scikits.learn feature set representation
Denzil Correa
2011-04-30 21:42:07 UTC
Hi all,

I would like to convert an NLTK feature set (a list in which each data point
is a 2-tuple: the first element is the feature set and the second is the
class label) to scikits.learn numpy array feature sets. My NLTK feature sets
consist of a combination of multiple feature sets including word unigrams,
word bigrams, word trigrams, character unigrams, character bigrams, character
trigrams, frequency of punctuation marks, frequency of function words,
frequency of letters, frequency of special characters and 80-100 more such
features.
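
For concreteness, the representation looks roughly like this (the feature
names and labels below are only illustrative):

    # Each data point is a (feature_dict, label) 2-tuple; the data set is a list.
    train_set = [
        ({'word(the)': True, 'char_bigram(th)': 3, 'num_punct': 2}, 'classA'),
        ({'word(hello)': True, 'num_punct': 0}, 'classB'),
    ]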

There are multiple issues, including index-feature mapping and order
preservation, since the target labels need to be stored in a separate array.

Is there a quick & efficient way to convert to the feature set
representation in scikits.learn? I moved over to scikits.learn to test the
accuracy of SVMs on my text classification task. It would also be really
helpful to the community to be able to move quickly between these two
frameworks/libraries.

Thanks!
--
Regards,

Denzil Correa
Ph.D Scholar
Indraprastha Institute of Information Technology, Delhi
http://www.iiitd.ac.in/
Olivier Grisel
2011-04-30 22:46:15 UTC
Post by Denzil Correa
Hi all,
I would like to convert an NLTK feature set (a list in which each data point
is a 2-tuple: the first element is the feature set and the second is the
class label) to scikits.learn numpy array feature sets.
My NLTK feature sets consist of a combination of multiple feature sets
including word unigrams, word bigrams, word trigrams, character unigrams,
character bigrams, character trigrams, frequency of punctuations, frequency
of function words, frequency of letters, frequency of special characters and
80-100 more such features.
There are multiple issues, including index-feature mapping and order
preservation, since the target labels need to be stored in a separate array.
I don't see the issue: just don't re-shuffle the samples and the labels.
Post by Denzil Correa
Is there a quick & efficient way to convert to the feature set
representation in scikits.learn? I moved over to scikits.learn to test the
accuracy of SVMs on my text classification task. It would also be really
helpful to the community to be able to move quickly between these two
frameworks/libraries.
Jacob Perkins started some work to use scikit-learn as a classifier for nltk here:

https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/classification/sci.py

Note this should work with the latest stable release of scikit-learn
(0.7.1). In the current master of scikit-learn, the
feature_extraction.text package has changed a bit, so this code would
need a bit of adaptation.

As for the use of SVMs, you should use the sparse LinearSVC (and not the
kernel SVCs, which do not scale to problems with many samples and many
features, as in text classification, and would probably over-fit anyway).
Don't expect a miracle though: training linear models with the SVM
objective (hinge loss + l2 regularizer) or the logistic regression
objective (log loss + l2 regularizer) generally gives comparable results
for text classification.
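
For example, a minimal sketch of that setup. The import path below is an
assumption based on the 0.7-era package layout discussed in this thread (in
recent scikit-learn it is simply "from sklearn.svm import LinearSVC", which
accepts scipy.sparse input directly):

    import numpy as np
    import scipy.sparse as sp
    from scikits.learn.svm.sparse import LinearSVC  # assumed 0.7-era path

    # Toy sparse term matrix: 4 documents x 3 features; the labels are kept
    # in the same row order as X_train, so nothing needs to be re-shuffled.
    X_train = sp.csr_matrix(np.array([[1., 0., 2.],
                                      [0., 1., 0.],
                                      [3., 0., 0.],
                                      [0., 2., 1.]]))
    y_train = np.array([0, 1, 0, 1])

    clf = LinearSVC(C=1.0)
    clf.fit(X_train, y_train)
    print(clf.predict(X_train))
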
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Denzil Correa
2011-05-01 10:12:02 UTC
Dear Olivier,
Thanks for the reply. I assume I need to add missing feature values as 0.0s (and also convert True to 1.0 and False to 0.0) in the scikits.learn feature representation. The same isn't the case in NLTK. I will probably proceed to write my own function and post it on the mailing list to receive feedback on efficiency and correctness.
I am aware of Jacob's work, but as of now it doesn't allow adding custom feature sets. I did clone the nltk-trainer repository and try to understand the code (sci.py under the classification folder), but I got lost at a function call which I couldn't locate in the source.
I am also aware of LinearSVC being comparable to logit (called Maxent in NLTK) under the conditions you mentioned. However, I am not sure whether NLTK implements an L2 regularizer in Maxent. I believe it exists in scikits.learn and hence will try it out.
--
Sent from my Nokia N900 using Nokia Messaging
Olivier Grisel
2011-05-01 11:18:37 UTC
Post by Denzil Correa
Dear Olivier,
Thanks for the reply. I assume I need to add missing feature values as 0.0s
(and also convert True to 1.0 and False to 0.0) in scikits.learn feature
representation. The same isn't the case in NLTK. I would probably proceed to
write my own function and post it on the mailing list to receive feedback on
efficiency and correctness.
No, you should not: some scikit-learn classifiers are able to use
scipy.sparse matrices, where missing values are treated as zeros and are
not physically stored in memory; this saves space and speeds up
computation for problems where the majority of the feature values of a
given sample are zero. The CountVectorizer and TfidfTransformer classes of
scikits.learn.feature_extraction.text are specifically written to
leverage this sparse representation so as to be able to scale to real-life
text classification tasks.

If you are not familiar with numpy and scipy.sparse (e.g. the
coo_matrix and csr_matrix constructors), I would strongly recommend
having a look at the following document, which will get you up to speed
on those matters:

http://scipy-lectures.github.com/_downloads/PythonScientific2.pdf
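
To make this concrete, here is a small, self-contained sketch of building a
coo_matrix from NLTK-style feature dicts (the feature names and the
vocabulary mapping below are made up for illustration):

    import scipy.sparse as sp

    # One feature dict per sample, plus a fixed feature-name -> column mapping.
    feature_dicts = [{'f1': 2.0, 'f3': 1.0}, {'f2': 5.0}]
    vocabulary = {'f1': 0, 'f2': 1, 'f3': 2}

    rows, cols, values = [], [], []
    for i, feats in enumerate(feature_dicts):
        for name, value in feats.items():
            rows.append(i)
            cols.append(vocabulary[name])
            values.append(float(value))

    # Only the non-zero entries are stored; missing features are implicit zeros.
    X = sp.coo_matrix((values, (rows, cols)),
                      shape=(len(feature_dicts), len(vocabulary)))
    X = X.tocsr()  # CSR is the usual format to feed to the estimators
    print(X.toarray())
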
Post by Denzil Correa
I am aware of Jacob's work but as of now it doesn't allow to add custom
feature sets. I did clone the nltk-trainer git and try to understand the
code (sci.py under the classification folder) but I got lost at a function
call which I couldn't locate inside the source.
What is the function name?
Denzil Correa
2011-05-01 14:33:15 UTC
Olivier,

Thanks. Let me look into numpy and scipy.parse. It will help my
understanding better.


With respect to nltk-trainer, the file is args.py under
nltk-trainer/nltk_trainer/classification.

The lambda function is

    return lambda(train_feats): classifier_train(train_feats, **classifier_train_kwargs)

in the make_classifier_builder(args) function in args.py.

Where does the classifier_train call go to?
Olivier Grisel
2011-05-01 14:53:28 UTC
Post by Denzil Correa
Olivier,
Thanks. Let me look into numpy and scipy.parse. It will help my
understanding better,
It's scipy.sparse, as in sparse matrix representation.
Post by Denzil Correa
With respect to nltk-trainer, the file is args.py under
nltk-trainer/nltk_trainer/classification.
The lambda function is
return lambda(train_feats): classifier_train(train_feats,
**classifier_train_kwargs)
in the make_classifier_builder(args) function in args.py.
Where does the classifier_train call go to?
classifier_train is defined less than 10 lines before, depending on the
value of the command-line arguments:

classifier_train = ScikitsClassifier.train

https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/classification/args.py#L48
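
In other words, the pattern is roughly the following (a simplified,
self-contained sketch with illustrative names, not the actual nltk-trainer
code):

    def train_naive_bayes(train_feats, **kwargs):   # stand-in trainer
        return ('naive_bayes_model', len(train_feats))

    def train_sklearn(train_feats, **kwargs):       # stand-in for ScikitsClassifier.train
        return ('sklearn_model', len(train_feats))

    def make_classifier_builder(classifier_name):
        # classifier_train is assigned a few lines above the return,
        # based on the parsed command-line arguments.
        if classifier_name == 'sklearn':
            classifier_train = train_sklearn
        else:
            classifier_train = train_naive_bayes
        classifier_train_kwargs = {}
        # The returned lambda captures classifier_train; calling it later
        # dispatches to whichever trainer was selected above.
        return lambda train_feats: classifier_train(train_feats,
                                                    **classifier_train_kwargs)

    build = make_classifier_builder('sklearn')
    print(build([({'f': 1}, 'pos'), ({'g': 2}, 'neg')]))
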
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Denzil Correa
2011-05-01 15:52:22 UTC
Thanks for correcting me.

Is this the function which converts an NLTK feature set to a
scipy.sparse.coo_matrix in a format acceptable to scikits?

https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/classification/sci.py#L19
--
Regards,

Denzil Correa
Ph.D Scholar
Indraprastha Institute of Information Technology, Delhi
http://www.iiitd.ac.in/
Olivier Grisel
2011-05-01 16:01:13 UTC
Post by Denzil Correa
Thanks for correcting me.
Is this the function which converts an NLTK feature set to a
scipy.sparse.coo_matrix in a format acceptable to scikits?
https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/classification/sci.py#L19
Yes.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Denzil Correa
2011-05-10 14:01:54 UTC
Olivier :

Is there any particular advantage of using a COO sparse matrix
representation over a DOK representation? Wouldn't it be relatively easy to
convert NLTK feature sets (dictionary key:value pairs) to a DOK
representation?

In NLTK feature set representations, only the non-zero values need to be
added.
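
So a DOK-based conversion might look like this minimal sketch (the feature
names and the vocabulary mapping are made up, and booleans are cast to
1.0/0.0):

    import scipy.sparse as sp

    # Hypothetical NLTK-style (feature_dict, label) pairs.
    labeled_feats = [({'f1': 2.0, 'f3': True}, 'pos'), ({'f2': 5.0}, 'neg')]
    vocabulary = {'f1': 0, 'f2': 1, 'f3': 2}

    X = sp.dok_matrix((len(labeled_feats), len(vocabulary)))
    y = []
    for i, (feats, label) in enumerate(labeled_feats):
        y.append(label)
        for name, value in feats.items():
            X[i, vocabulary[name]] = float(value)  # True -> 1.0, False -> 0.0

    X = X.tocsr()  # convert once at the end before fitting an estimator
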
Post by Denzil Correa
Post by Denzil Correa
Thanks for correcting me.
Is this the function which converts NLTK feature set to a
scipy.sparse.coo_matrix in a format acceptable to scikits?
https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/classification/sci.py#L19
Yes.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
Regards,

Denzil Correa
Ph.D Scholar
Indraprastha Institute of Information Technology, Delhi
http://www.iiitd.ac.in/
Olivier Grisel
2011-05-10 23:01:05 UTC
Post by Denzil Correa
Is there any particular advantage of using a COO sparse matrix
representation over a DOK representation? Wouldn't it be relatively easy to
convert NLTK feature sets (dictionary key:value pairs) to a DOK
representation?
I think I tried DOK in the past and it was slower, but I am not 100%
sure. Give it a try.
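
A rough way to try it (an illustrative micro-benchmark only; note COO sums
duplicate entries while DOK overwrites them, which is fine for a timing
comparison):

    import time, random
    import scipy.sparse as sp

    n_samples, n_features, nnz_per_row = 5000, 20000, 30
    entries = [(i, random.randrange(n_features), 1.0)
               for i in range(n_samples) for _ in range(nnz_per_row)]

    t0 = time.time()
    rows, cols, vals = zip(*entries)
    X_coo = sp.coo_matrix((vals, (rows, cols)),
                          shape=(n_samples, n_features)).tocsr()
    print('coo + tocsr:', time.time() - t0)

    t0 = time.time()
    X_dok = sp.dok_matrix((n_samples, n_features))
    for i, j, v in entries:
        X_dok[i, j] = v
    X_dok = X_dok.tocsr()
    print('dok + tocsr:', time.time() - t0)
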
Post by Denzil Correa
In NLTK feature set representations, only the non-zero values need to be
added.
Sure.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel