Andreas
2012-01-02 15:24:50 UTC
Hi everybody.
Recently, I started working with the RandomForest modules and there are a
couple of things I noticed
that I would like to change.
So this goes out in particular to @glouppe, who is the expert in the
field :)
1)
The narrative docs say that max_features=n_features is a good value for
RandomForests.
As far as I know, Breiman 2001 suggests max_features =
log_2(n_features). I also
saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I
couldn't find that in the paper.
I just tried "digits" and max_features = log_2(n_features) works better than
max_features = n_features. Of course that is by no means conclusive
evidence ;)
Is there any reference that says max_features = n_features is good?
Also, I think this default value contradicts the beginning of the
narrative docs a bit,
since that claims "In addition, when splitting a node during the
construction of the tree,
the split that is chosen is no longer the best split among all features.
Instead, the split that is picked is the best split among a random
subset of the features."
Later, a recommendation on using max_features = n_features is made, but
no connection to the explanation above is given.
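
For reference, here is a small sketch of what the three candidate settings from point 1 come to on "digits", which has 64 features (8x8 pixel images). The helper name candidate_max_features is made up for illustration and is not part of the library; it only does the arithmetic and makes no claim about which value performs best.

```python
import math


def candidate_max_features(n_features):
    """Return the three max_features candidates discussed above.

    Hypothetical helper for illustration only; values are rounded down
    and clamped to at least 1.
    """
    return {
        "all": n_features,
        "sqrt": max(1, int(math.sqrt(n_features))),
        "log2": max(1, int(math.log2(n_features))),
    }


# "digits" has 64 features per sample.
digits_candidates = candidate_max_features(64)
print(digits_candidates)
```

With "all", every split sees every feature, so the random-subset step described at the beginning of the narrative docs is effectively switched off.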
2)
I noticed max_depth defaults to 10 in RandomForests, while the narrative
docs say
that max_depth = None yields the best results. Is the default value chosen
because
"None" might take too long?
3)
In the RandomForest docs, it's not clear to me which
parameters are parameters of the ensemble and which are parameters of the
base estimator. I think that should be made more explicit.
4) Understanding the parameter "min_density" took me some time,
in particular because I didn't see that it was a parameter of the
base estimator, not the ensemble. I think the docstring should start with
"This parameter trades runtime against the memory requirements of the
base decision tree." or similar.
5) I think an explanation of "bootstrap" should go in the docs.
The docstring just states "Whether bootstrap samples are used when
building trees."
I don't think this is very helpful since "bootstrap" is quite hard to
look up for
an outsider.
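
To make the term concrete, here is a minimal sketch of what a bootstrap sample is, assuming the usual definition: draw as many rows as the training set has, uniformly at random and with replacement. The function name bootstrap_indices is hypothetical and this is not the library's implementation.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
n_samples = 10
in_bag = bootstrap_indices(n_samples, rng)

# Rows that were never drawn are "out of bag" for this tree.
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))

print("in bag:    ", sorted(in_bag))
print("out of bag:", out_of_bag)
```

Because the draw is with replacement, some rows typically appear more than once and others not at all, so each tree sees a slightly different training set.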
6) As far as I can see, it is possible to set "bootstrap" to 'False' and
still
have max_features = n_features.
This would build n_estimators estimators that are all identical, right?
I think this combination should somehow be excluded.
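
The concern in point 6 can be illustrated with a toy sketch. It assumes a fully deterministic base learner (a one-split "stump" that always keeps the first best split it finds); fit_stump and fit_forest are hypothetical names, not the library's code. Whether real trees come out identical also depends on how the tree builder breaks ties between equally good splits, which this sketch does not model.

```python
import random


def fit_stump(X, y, feature_indices):
    """Return (feature, threshold) with the fewest misclassifications.

    Deterministic: the first best split found is kept on ties.
    """
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in X}):
            # Rule: predict class 1 when the feature value exceeds the threshold.
            errors = sum((row[f] > threshold) != bool(label)
                         for row, label in zip(X, y))
            errors = min(errors, len(y) - errors)  # also allow the flipped rule
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    return best[1], best[2]


def fit_forest(X, y, n_estimators, bootstrap, max_features, seed=0):
    """Fit n_estimators stumps; randomness enters only via rows and features."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    stumps = []
    for _ in range(n_estimators):
        if bootstrap:
            rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        else:
            rows = list(range(n_samples))  # every stump sees the same data
        # With max_features == n_features this is always the full feature set.
        features = sorted(rng.sample(range(n_features), max_features))
        stumps.append(fit_stump([X[i] for i in rows],
                                [y[i] for i in rows], features))
    return stumps


X = [[0.1, 5.0], [0.4, 3.0], [0.35, 4.0], [0.8, 1.0], [0.9, 2.0], [0.7, 6.0]]
y = [0, 0, 0, 1, 1, 1]

stumps = fit_forest(X, y, n_estimators=5, bootstrap=False, max_features=2)
print(stumps)
```

With both sources of randomness switched off, every stump in this sketch is the same, so the ensemble adds cost but no diversity.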
Minor remarks that I'll fix if no one objects:
- All Forest classifiers should list Trees in the "See also" section
Answers / comments welcome :)
Cheers,
Andy