Discussion:
[Scikit-learn-general] Feature selection
Herbert Schulz
2015-05-28 12:32:38 UTC
Permalink
Hello,
I'm using scikit-learn for machine learning.
I have 800 samples with 2048 features, so I want to reduce the number of
features in the hope of getting better accuracy.

It is a multiclass problem (classes 0-5), and the features consist of 1s
and 0s: [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]

I'm using the Random Forest Classifier.

Should I perform feature selection only on the training data? And is it
enough if I'm using this code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)


clf = RandomForestClassifier(n_estimators=200, warm_start=True,
                             criterion='gini', max_depth=13)
clf.fit(X_train, y_train).transform(X_train)

predicted = clf.predict(X_test)
expected = y_test
confusionMatrix = metrics.confusion_matrix(expected, predicted)

I ask because the accuracy didn't get any higher. Is everything OK in the
code, or am I doing something wrong?

I'll be very grateful for your help.
Andreas Mueller
2015-05-28 15:59:53 UTC
Permalink
Hi Herbert.
1) Reducing the feature space often does not help with accuracy; using
a regularized classifier usually leads to better results.
2) To do feature selection, you need two methods: one to reduce the set
of features, another that does the actual supervised task
(classification here).

Have you tried just using the standard classifiers? Clearly you tried
the RF, but I'd also try a linear method like
LinearSVC/LogisticRegression or a kernel SVC.
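For a quick baseline sweep along those lines, something like the following sketch could work (the data is a small synthetic stand-in for the real 800 x 2048 binary matrix, and the current `model_selection` import is assumed):

```python
# Baseline comparison of a few standard classifiers via cross-validation.
# Synthetic 0/1 data stands in for the real binary feature matrix.
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(300, 30))  # binary features
y = rng.randint(0, 6, size=300)        # six classes, 0-5

results = {}
for clf in (LinearSVC(dual=False),
            LogisticRegression(max_iter=1000),
            SVC(kernel='rbf')):
    scores = cross_val_score(clf, X, y, cv=3)
    results[type(clf).__name__] = scores.mean()
    print(type(clf).__name__, round(scores.mean(), 3))
```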

If you want to do feature selection, what you need to do is something
like this:

feature_selector = LinearSVC(penalty='l1', dual=False)  # or maybe start
# with SelectKBest(); dual=False is required with the l1 penalty
feature_selector.fit(X_train, y_train)

X_train_reduced = feature_selector.transform(X_train)
X_test_reduced = feature_selector.transform(X_test)

classifier = RandomForestClassifier().fit(X_train_reduced, y_train)

prediction = classifier.predict(X_test_reduced)


Or you can use a pipeline, as in this example:
http://scikit-learn.org/dev/auto_examples/feature_selection/feature_selection_pipeline.html
Maybe we should add a version without the pipeline to the examples?
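With a current scikit-learn release, the pipeline version of that recipe would look roughly like this (note that `SelectFromModel` now wraps the L1 selector, since the old `LinearSVC.transform` was removed; the data below is synthetic, with a bit of signal planted so the L1 selector keeps something):

```python
# Sketch of the pipeline variant: an L1-penalized LinearSVC selects
# features, then a random forest classifies on the reduced set.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 40))
y = (X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2]) % 6  # labels depend on 3 features

pipe = Pipeline([
    ('select', SelectFromModel(LinearSVC(penalty='l1', dual=False))),
    ('forest', RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```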

Cheers,
Andy
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Sebastian Raschka
2015-05-28 16:21:27 UTC
Permalink
I agree with Andreas. Typically, a large number of features shouldn't be a
big problem for random forests in my experience; however, it of course
depends on the number of trees and training samples.

If you suspect that overfitting might be a problem with unregularized
classifiers, also consider "dimensionality reduction"/"feature extraction"
techniques to compress the feature space, e.g., linear or kernel PCA, or
other methods listed in the manifold learning section of the scikit-learn
website.

However, there are scenarios where you'd want to keep the "original"
features (in contrast to, e.g., principal components), and there are
scenarios where linear methods such as LinearSVC(penalty='l1') may not
work so well (e.g., for non-linear problems). The optimal solution would
be to exhaustively test all feature combinations to see which works best;
however, this can be quite costly. For demonstration purposes, I
implemented "sequential backward selection"
(http://rasbt.github.io/mlxtend/docs/sklearn/sequential_backward_selection/)
some time ago, a simple greedy alternative to the exhaustive search; maybe
you are lucky and it works well in your case. When I find time after my
summer projects, I am planning to implement some genetic algorithms for
feature selection...
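For readers who want the idea without pulling in mlxtend, a rough greedy sketch of sequential backward selection (not Sebastian's actual implementation) might look like:

```python
# Greedy sequential backward selection: repeatedly drop the feature whose
# removal hurts cross-validated accuracy the least.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sbs(estimator, X, y, k_features, cv=3):
    """Shrink the feature set down to k_features, greedily."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > k_features:
        best_score, worst = -np.inf, None
        for f in remaining:
            subset = [c for c in remaining if c != f]
            score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
            if score > best_score:
                best_score, worst = score, f
        remaining.remove(worst)  # dropping `worst` costs the least accuracy
    return remaining

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(120, 8))
y = rng.randint(0, 3, size=120)
kept = sbs(LogisticRegression(max_iter=500), X, y, k_features=4)
print(kept)
```

Note this is quadratic in the number of features, so for 2048 features it would need a faster inner estimator or a coarser schedule.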

Best,
Sebastian
Herbert Schulz
2015-05-28 16:44:52 UTC
Permalink
Thanks to both of you! I really appreciate it! I will try everything this
weekend.

Best regards,

Herb
JAGANADH G
2015-06-01 10:16:56 UTC
Permalink
Hi

I have listed sklearn feature selection methods with minimal examples here:


http://nbviewer.ipython.org/github/jaganadhg/data_science_notebooks/blob/master/sklearn/scikit_learn_feature_selection.ipynb

Jagan
--
**********************************
JAGANADH G
http://jaganadhg.in
*ILUGCBE*
http://ilugcbe.org.in
Herbert Schulz
2015-06-01 11:38:07 UTC
Permalink
Cool, thanks for that!


Herb
Herbert Schulz
2015-06-02 09:04:11 UTC
Permalink
Does anyone know why this error occurs?

ValueError: Unsupported set of arguments: loss='l1' and
penalty='squared_hinge' are not supported when dual=True. Parameters:
penalty='l1', loss='squared_hinge', dual=True

I'm using a LinearSVC (in Andreas's example code).
Michael Eickenberg
2015-06-02 09:19:35 UTC
Permalink
Some configurations are not implemented, or are difficult to evaluate, in
the dual. Setting dual=True/False doesn't change the result, so please
don't vary it as you would vary other parameters; it can, however,
sometimes yield a speed-up. Here you should try setting dual=False as a
first means of debugging.
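A minimal illustration of that fix, with synthetic data standing in for the real feature matrix:

```python
# The same L1 LinearSVC that raised the ValueError fits once dual=False
# is passed, i.e. the problem is solved in the primal instead.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 20))
y = rng.randint(0, 6, size=100)

try:
    LinearSVC(penalty='l1', dual=True).fit(X, y)  # unsupported combination
except ValueError as e:
    print('dual=True failed:', e)

clf = LinearSVC(penalty='l1', dual=False).fit(X, y)  # works in the primal
print('dual=False fitted, n_features:', clf.coef_.shape[1])
```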

Michael
Herbert Schulz
2015-06-02 15:25:38 UTC
Permalink
Thanks that helped.

But I just can't get a higher accuracy than 45%... I don't know why; the
same goes for logistic regression and so on.

Is there a way to combine, for example, an SVM with a decision tree?

Herb
Post by Michael Eickenberg
Some configurations are not implemented or difficult to evaluate in the
dual. Setting dual=True/False doesn't change the result, so please don't
vary it as you would vary other parameters. It can however sometimes yield
a speed-up. Here you should try setting dual=False as a first means of
debugging.
Michael
Post by Herbert Schulz
Does anyone know why this failure occurs?
ValueError: Unsupported set of arguments: loss='l1' and
penalty='l1', loss='squared_hinge', dual=True
I'm using a Linear SVC ( in andreas example code).
Post by Herbert Schulz
Cool, thx for that!
Herb
Post by JAGANADH G
Hi
I have listed sklearn feature selection with minimal examples here
http://nbviewer.ipython.org/github/jaganadhg/data_science_notebooks/blob/master/sklearn/scikit_learn_feature_selection.ipynb
Jagan
Post by Herbert Schulz
Thank's to both of you!!! I realy appreciate it! I will try everything
this weekend.
Best regards,
Herb
Post by Sebastian Raschka
I agree with Andreas,
typically, a large number of features also shouldn't be a big problem
for random forests in my experience; however, it of course depends on the
number of trees and training samples.
If you suspect that overfitting might be a problem using
unregularized classifiers, also consider "dimensionality
reduction"/"feature exctraction" techniques to compress the feature space,
e.g., linear or kernel PCA, or other methods listed in the manifold
learning section on the scikit-website.
However, there are scenarios where you'd want to keep the "original"
features (in contrast to e.g., principal components), and there are
scenarios where linear methods such as LinearSVC(penalty='l1') may not work
so well (e.g., for non-linear problems). The optimal solution would be to
exhaustively test all feature combinations to see which works best,
however, this can be quite costly. For demonstration purposes, I
implemented "sequential backward selection" (
http://rasbt.github.io/mlxtend/docs/sklearn/sequential_backward_selection/)
some time ago; a simple greedy alternative to the exhaustive search. Maybe
you are lucky and it works well in your case. When I find time after my
summer projects, I am planning to implement some genetic algos for feature
selection...
Best,
Sebastian
Hi Herbert.
1) Often reducing the feature space does not help with accuracy, and
using a regularized classifier leads to better results.
2) To do feature selection, you need two methods: one to reduce the
set of features, another that does the actual supervised task
(classification here).
Have you tried just using the standard classifiers? Clearly you tried
the RF, but I'd also try a linear method like LinearSVC/LogisticRegression
or a kernel SVC.
feature_selector = LinearSVC(penalty='l1') #or maybe start with SelectKBest()
feature_selector.fit(X_train, y_train)
X_train_reduced = feature_selector.transform(X_train)
X_test_reduced = feature_selector.transform(X_test)
classifier = RandomForestClassifier().fit(X_train_reduced, y_train)
prediction = classifier.predict(X_test_reduced)
http://scikit-learn.org/dev/auto_examples/feature_selection/feature_selection_pipeline.html
Maybe we should add a version without the pipeline to the examples?
Cheers,
Andy
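The pipeline variant Andy links to can be sketched like this, with hypothetical synthetic data of the same shape as Herbert's standing in for the real features; the selector and the classifier are chained into one estimator, so cross-validation and grid search handle both steps together:

```python
# Feature selection + classification as a single Pipeline estimator.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(800, 2048))  # 800 samples, 2048 binary features
y = rng.randint(0, 6, size=800)          # classes 0-5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=100)),  # keep the 100 highest-scoring features
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipe.fit(X_train, y_train)       # selection is fit on the training split only
score = pipe.score(X_test, y_test)
print(score)
```

Because the selector lives inside the pipeline, it is re-fit on each training fold during cross-validation, which avoids leaking test information into the feature selection.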
Hello,
I'm using scikit-learn for machine learning.
I have 800 samples with 2048 features, therefore I want to reduce my
features to hopefully get a better accuracy.
It is a multiclass problem (class 0-5), and the features consist of
1's and 0's: [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]
I'm using the Random Forest Classifier.
Should I just feature select the training data? And is it enough if I'm using this code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
clf=RandomForestClassifier(n_estimators=200,warm_start=True,criterion='gini',
max_depth=13)
clf.fit(X_train, y_train).transform(X_train)
predicted=clf.predict(X_test)
expected=y_test
confusionMatrix=metrics.confusion_matrix(expected,predicted)
Because the accuracy didn't get higher. Is everything OK in the code or
am I doing something wrong?
I'll be very grateful for your help.
--
**********************************
JAGANADH G
http://jaganadhg.in
*ILUGCBE*
http://ilugcbe.org.in
Sebastian Raschka
2015-06-02 15:41:12 UTC
Permalink
Hi, Herbert,
I can't help you with the accuracy problem since this can be due to many different things. However, there is now a way to combine different classifiers for majority-rule voting, the sklearn.ensemble.VotingClassifier. It is not in the current stable release yet, but you could get it from the scikit-learn dev version on GitHub.

Alternatively, if you don't want to install the scikit-learn dev version, you could use the EnsembleClassifier from mlxtend until the next stable release of scikit-learn -- slightly different syntax but the same principle: http://rasbt.github.io/mlxtend/docs/sklearn/ensemble_classifier/ (this is basically the original implementation that was later ported to scikit-learn).
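For what it's worth, combining an SVM with a decision tree via majority voting can be sketched roughly like this with the scikit-learn VotingClassifier (synthetic data stands in for the real features; the chosen estimators and their parameters are just illustrative):

```python
# Majority-rule voting over heterogeneous classifiers.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 50)).astype(float)  # binary features
y = rng.randint(0, 6, size=200)                      # classes 0-5

vote = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf")),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # each classifier casts one vote per sample
)
vote.fit(X, y)
pred = vote.predict(X)
print(pred.shape)
```

With voting="soft" the averaged class probabilities are used instead, which requires every estimator to support predict_proba.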

Hope that helps.

Best,
Sebastian
Herbert Schulz
2015-07-08 10:03:17 UTC
Permalink
Hey,

the mlxtend library worked great on my computer.

Now I installed it on a server.

import mlxtend works fine,

but if I want to import the EnsembleClassifier it gives me an error:

from mlxtend.sklearn import EnsembleClassifier :

"No module named sklearn"

import sklearn also works.

Does anyone know why? I installed mlxtend with "python setup.py install".
I think it is version 0.28.
Herbert Schulz
2015-07-08 10:06:09 UTC
Permalink
Ah, the API changed...

but now I'm getting something like:

import mlxtend.classifier.EnsembleClassifier
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mlxtend/classifier/__init__.py", line 8, in <module>
from .ensemble import EnsembleClassifier
File "mlxtend/classifier/ensemble.py", line 15, in <module>
from sklearn.pipeline import _name_estimators
ImportError: cannot import name _name_estimators