Post by Sebastian Raschka
I agree with Andreas; typically, a large number of features shouldn't be a
big problem for random forests in my experience. However, it of course
depends on the number of trees and training samples.
If you suspect that overfitting might be a problem with unregularized
classifiers, also consider "dimensionality reduction"/"feature extraction"
techniques to compress the feature space, e.g., linear or kernel PCA, or
other methods listed in the manifold learning section of the scikit-learn
website.
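To make this concrete, a minimal sketch of that approach could look like
the following (n_components=50 and the RBF kernel are arbitrary
placeholders to tune, and X_train/X_test/y_train/y_test are assumed to
come from a train/test split):

from sklearn.decomposition import KernelPCA
from sklearn.ensemble import RandomForestClassifier

# compress the binary feature space before classification
kpca = KernelPCA(n_components=50, kernel='rbf')
X_train_compressed = kpca.fit_transform(X_train)
X_test_compressed = kpca.transform(X_test)

clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_train_compressed, y_train)
print(clf.score(X_test_compressed, y_test))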
However, there are scenarios where you'd want to keep the "original"
features (in contrast to, e.g., principal components), and there are
scenarios where linear methods such as LinearSVC(penalty='l1') may not work
so well (e.g., for non-linear problems). The optimal solution would be to
exhaustively test all feature combinations to see which works best;
however, this can be quite costly. For demonstration purposes, I
implemented "sequential backward selection" (
http://rasbt.github.io/mlxtend/docs/sklearn/sequential_backward_selection/)
some time ago; it is a simple greedy alternative to the exhaustive search.
Maybe you are lucky and it works well in your case? When I find time after
my summer projects, I am planning to implement some genetic algorithms for
feature selection...
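In case the API behind that link changes, the core idea also fits in a few
lines; here is a rough, self-contained sketch of greedy backward selection
(not the mlxtend implementation; the function name and the cv=5 default
are made up for illustration) using plain scikit-learn:

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

def sequential_backward_selection(estimator, X, y, k_features, cv=5):
    # start with all feature indices and greedily drop the one whose
    # removal hurts cross-validated accuracy the least
    indices = list(range(X.shape[1]))
    while len(indices) > k_features:
        scores = []
        for i in indices:
            remaining = [j for j in indices if j != i]
            score = cross_val_score(estimator, X[:, remaining], y, cv=cv).mean()
            scores.append((score, i))
        _, worst = max(scores)  # feature whose removal gives the best score
        indices.remove(worst)
    return indices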
Best,
Sebastian
Hi Herbert.
1) Often, reducing the feature space does not help with accuracy,
and using a regularized classifier leads to better results.
2) To do feature selection, you need two methods: one to reduce the
set of features, another that does the actual supervised task
(classification here).
Have you tried just using the standard classifiers? Clearly you
tried the RF, but I'd also try a linear method like
LinearSVC/LogisticRegression or a kernel SVC.
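For a quick baseline sweep, something like the following sketch would do
(C=1.0 is just a placeholder to tune, and X_train/X_test/y_train/y_test
come from your train/test split):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

# compare a few regularized classifiers on the untouched feature space
for clf in [LogisticRegression(C=1.0), LinearSVC(C=1.0), SVC(kernel='rbf', C=1.0)]:
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__, clf.score(X_test, y_test))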
If you want to do feature selection, what you need to do is

from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# L1-penalized linear SVM as the feature selector (dual=False is required with penalty='l1')
feature_selector = LinearSVC(penalty='l1', dual=False)  # or maybe start with SelectKBest()
feature_selector.fit(X_train, y_train)
# keep only the features with non-zero coefficients (newer scikit-learn: wrap in SelectFromModel)
X_train_reduced = feature_selector.transform(X_train)
X_test_reduced = feature_selector.transform(X_test)
# fit the actual classifier on the reduced feature set
classifier = RandomForestClassifier().fit(X_train_reduced, y_train)
prediction = classifier.predict(X_test_reduced)
http://scikit-learn.org/dev/auto_examples/feature_selection/feature_selection_pipeline.html
Maybe we should add a version without the pipeline to the examples?
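For reference, the same two-step idea chained in a Pipeline might look
roughly like this (SelectKBest with chi2 and k=100 are placeholder
choices, not a recommendation):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

# univariate selection followed by the classifier, fit as one estimator
pipe = Pipeline([
    ('select', SelectKBest(chi2, k=100)),
    ('clf', RandomForestClassifier(n_estimators=200)),
])
pipe.fit(X_train, y_train)
prediction = pipe.predict(X_test)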
Cheers,
Andy
Hello,
I'm using scikit-learn for machine learning.
I have 800 samples with 2048 features; therefore I want to reduce my
features in the hope of getting better accuracy.
It is a multiclass problem (classes 0-5), and the features consist of
1's and 0's: [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]
I'm using the Random Forest Classifier.
Should I just feature-select the training data? And is it enough if I do
the following:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
clf = RandomForestClassifier(n_estimators=200, warm_start=True,
                             criterion='gini', max_depth=13)
clf.fit(X_train, y_train).transform(X_train)
predicted = clf.predict(X_test)
expected = y_test
confusionMatrix = metrics.confusion_matrix(expected, predicted)
The accuracy didn't get any higher. Is everything OK in the code,
or am I doing something wrong?
I'll be very grateful for your help.
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general