Discussion:
[Scikit-learn-general] Question and comments on RandomForests
Andreas
2012-01-02 15:24:50 UTC
Hi everybody.
Recently, I started working with the RandomForest modules and there are a
couple of things that I noticed that I would like to change.
So this in particular goes out to @glouppe, who is the expert in the field :)

1)
The narrative docs say that max_features=n_features is a good value for
RandomForests.
As far as I know, Breiman 2001 suggests max_features = log_2(n_features).
I also saw a claim that Breiman 2001 suggests max_features =
sqrt(n_features), but I couldn't find that in the paper.
I just tried "digits", and max_features = log_2(n_features) works better
than max_features = n_features. Of course that is definitely no
conclusive evidence ;)
Is there any reference that says max_features = n_features is good?

Also, I think this default value contradicts the beginning of the
narrative docs a bit, since they claim: "In addition, when splitting a
node during the construction of the tree, the split that is chosen is no
longer the best split among all features. Instead, the split that is
picked is the best split among a random subset of the features."
Later, a recommendation to use max_features = n_features is made, but
no connection to the explanation above is given.
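For concreteness, here is roughly the kind of comparison I mean (a
minimal sketch, not the exact script I ran; the split and n_estimators
are arbitrary):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier

    digits = load_digits()
    X, y = digits.data, digits.target
    perm = np.random.RandomState(0).permutation(len(y))
    train, test = perm[:1000], perm[1000:]

    n_features = X.shape[1]  # 64 for digits
    for k in [int(np.log2(n_features)), int(np.sqrt(n_features)), n_features]:
        clf = RandomForestClassifier(n_estimators=50, max_features=k,
                                     random_state=0)
        clf.fit(X[train], y[train])
        print(k, clf.score(X[test], y[test]))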

2)
I noticed max_depth defaults to 10 in RandomForests, while the narrative
docs say that max_depth = None yields the best results. Was the default
value chosen because "None" might take too long?

3)
In the RandomForest docs, it's not clear to me which parameters are
parameters of the ensemble and which are parameters of the base
estimator. I think that should be made more explicit.

4) Understanding the parameter "min_density" took me some time,
in particular because I didn't see that it was a parameter of the
base estimator, not the ensemble. I think the docstring should start with
"This parameter trades runtime against the memory requirement of the
base decision tree." or similar.

5) I think an explanation of "bootstrap" should go in the docs.
The docstring just states "Whether bootstrap samples are used when
building trees."
I don't think this is very helpful since "bootstrap" is quite hard to
look up for
an outsider.

6) As far as I can see, it is possible to set "bootstrap" to 'False'
and still have max_features = n_features.
This would build n_estimators identical estimators, right?
I think this option should somehow be excluded.


Minor remarks that I'll fix if no-one objects:

- All Forest classifiers should have Trees in the "See also" section


Answers / comments welcome :)

Cheers,
Andy
Gilles Louppe
2012-01-02 15:48:48 UTC
Hi Andy!
Post by Andreas
1)
The narrative docs say that max_features=n_features is a good value for
RandomForests.
As far as I know, Breiman 2001 suggests max_features = log_2(n_features).
I also saw a claim that Breiman 2001 suggests max_features =
sqrt(n_features), but I couldn't find that in the paper.
I just tried "digits", and max_features = log_2(n_features) works better
than max_features = n_features. Of course that is definitely no
conclusive evidence ;)
Is there any reference that says max_features = n_features is good?
Also, I think this default value contradicts the beginning of the
narrative docs a bit, since they claim: "In addition, when splitting a
node during the construction of the tree, the split that is chosen is no
longer the best split among all features. Instead, the split that is
picked is the best split among a random subset of the features."
Later, a recommendation to use max_features = n_features is made, but
no connection to the explanation above is given.
Short answer: the optimal value of max_features is problem-specific.

In [1], it was found experimentally that max_features=sqrt(n_features)
works well for classification problems, and max_features=n_features for
regression problems. This is at least the case for extra-trees. For
random forests, I am no longer sure; I will check with my advisor.

[1] http://orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

Anyway, I agree that this paragraph may be confusing w.r.t. the default
values in the code and would need some changes.
Post by Andreas
2)
I noticed max_depth defaults to 10 in RandomForests, while the narrative
docs say that max_depth = None yields the best results. Was the default
value chosen because "None" might take too long?
This was set to 10 because in our first implementations it may indeed
have taken too long. It has not been changed since.

I nearly always use max_depth=None in practice and instead use
min_split to control the depth of the tree. I agree we should make
max_depth=None by default.
Post by Andreas
3)
In the RandomForest docs, it's not clear to me which parameters are
parameters of the ensemble and which are parameters of the base
estimator. I think that should be made more explicit.
Agreed, we should make that more explicit.
Post by Andreas
4) Understanding the parameter "min_density" took me some time,
in particular because I didn't see that it was a parameter of the
base estimator, not the ensemble. I think the docstring should start with
"This parameter trades runtime against the memory requirement of the
base decision tree." or similar.
Agreed.
Post by Andreas
5) I think an explanation of "bootstrap" should go in the docs.
The docstring just states "Whether bootstrap samples are used when
building trees."
I don't think this is very helpful since "bootstrap" is quite hard to
look up for
an outsider.
Okay, we should make that more explicit. I didn't realize it was obscure.
Post by Andreas
6) As far as I can see, it is possible to set "bootstrap" to 'False'
and still have max_features = n_features.
This would build n_estimators identical estimators, right?
I think this option should somehow be excluded.
Using random forests, yes, they would be identical. They wouldn't be for extra-trees.
Post by Andreas
- All Forest classifiers should have Trees in the "See also" section
Agreed.
Olivier Grisel
2012-01-02 16:04:23 UTC
Post by Gilles Louppe
Post by Andreas
5) I think an explanation of "bootstrap" should go in the docs.
The docstring just states "Whether bootstrap samples are used when
building trees."
I don't think this is very helpful since "bootstrap" is quite hard to
look up for
an outsider.
Okay, we should make that more explicit. I didn't realize it was obscure.
You can just say that it means "randomly sampling the training set with
replacement" and add a link to
http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29
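To illustrate with plain numpy (a toy sketch, not the actual tree code):

    import numpy as np

    rng = np.random.RandomState(0)
    X = np.arange(20).reshape(10, 2)  # toy data: 10 samples
    # a bootstrap sample draws n_samples indices *with replacement*,
    # so some samples appear several times and others not at all
    indices = rng.randint(0, X.shape[0], X.shape[0])
    X_boot = X[indices]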

Thanks Andy for making the effort to make the doc and default value
consistent and Gilles for answering :)

Looking forward to reviewing the pull request.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas
2012-01-02 16:08:54 UTC
Post by Gilles Louppe
Post by Andreas
6) As far as I can see, it is possible to set "bootstrap" to 'False'
and still have max_features = n_features.
This would build n_estimators identical estimators, right?
I think this option should somehow be excluded.
Using random forests, yes, they would be identical. They wouldn't be for extra-trees.
I haven't looked at extra trees yet (and probably won't in the near future).
Do you think that in the case of random forests the user should be alerted
to this fact, or do you think they should know what they are doing?
Post by Olivier Grisel
You can just say that it means "randomly sampling the training set with
replacement" and add a link to
http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29
+1 that's what I had in mind.
Post by Olivier Grisel
Thanks Andy for making the effort to make the doc and default value
consistent and Gilles for answering :)
You're welcome
Post by Olivier Grisel
Looking forward to reviewing the pull request.
:P Maybe tonight.
Gilles Louppe
2012-01-02 17:58:18 UTC
Post by Gilles Louppe
Short answer: the optimal value of max_features is problem-specific.
In [1], it was found experimentally that max_features=sqrt(n_features)
works well for classification problems, and max_features=n_features for
regression problems. This is at least the case for extra-trees. For
random forests, I am no longer sure; I will check with my advisor.
Back to you.

In the random forest manual [2], it is recommended to use
max_features=sqrt(n_features), with some warnings though:

"mtry0 = the number of variables to split on at each node. Default is
the square root of mdim. ATTENTION! DO NOT USE THE DEFAULT VALUES OF
MTRY0 IF YOU WANT TO OPTIMIZE THE PERFORMANCE OF RANDOM FORESTS. TRY
DIFFERENT VALUES-GROW 20-30 TREES, AND SELECT THE VALUE OF MTRY THAT
GIVES THE SMALLEST OOB ERROR RATE."

[2]: http://oz.berkeley.edu/users/breiman/RandomForests/cc_manual.htm

I don't know why I had in mind that RFs should have
max_features=n_features by default. My bad.

My advisor says that log2 was indeed recommended at first in Breiman's
paper, but sqrt was later preferred by Breiman, as [2] indeed
indicates.

What I suggest is to add a string value max_features="auto" such that
max_features=sqrt(n_features) on classification problems and
max_features=n_features on regression. In the same way, we could add
max_features="sqrt" or max_features="log2" and let the user decide.
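Something along these lines (just a sketch of the heuristic; the helper
name is made up):

    import numpy as np

    def resolve_max_features(max_features, n_features, classification):
        # map a string setting to a concrete number of features
        if max_features == "auto":
            return int(np.sqrt(n_features)) if classification else n_features
        elif max_features == "sqrt":
            return int(np.sqrt(n_features))
        elif max_features == "log2":
            return int(np.log2(n_features))
        return max_features  # assume an explicit int was given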

@amueller If you like, I can take care of all these changes (in that
case, I'll do it tomorrow).

Gilles
Andreas Mueller
2012-01-02 23:22:40 UTC
Post by Gilles Louppe
What I suggest is to add a string value max_features="auto" such that
max_features=sqrt(n_features) on classification problems and
max_features=n_features on regression. In the same way, we could add
max_features="sqrt" or max_features="log2" and let the user decide.
Thanks for checking. I didn't know about the random forest manual;
will check that one.
I am +1 on having an "auto" keyword with the square-root default
and user-specified values otherwise.
Post by Gilles Louppe
@amueller If you like, I can take care of all these changes (in that
case, I'll do it tomorrow).
I thought I could do it today but didn't get to it.
Too tired to do it now. I haven't started, so feel free ;)

Cheers Andy

ps: Happy new year also from me :)
Andreas
2012-01-03 08:00:47 UTC
One other question:
I tried to run a forest on MNIST that actually consisted of only one tree.
That gave me a memory error. I only have 2 GB of RAM in this machine
(this is my desktop at IST Austria !?), which is obviously not that much.
Still, this kind of surprised me. Is it expected that a tree takes
this "much" RAM? Should I change "min_density"?

Thanks :)

Andy
b***@gmail.com
2012-01-03 08:06:43 UTC
Hi Andy,

IIRC MNIST is 60000 samples, each with dimension 28x28, so hitting the 2 GB limit doesn't seem unreasonable (especially since you don't have all of that at your disposal). Does the dataset fit in memory?

Brian

Andreas
2012-01-03 08:27:46 UTC
Hi Brian.
The dataset itself is 60000 * 784 * 8 bytes (I converted from uint8 to
float, which is 8 bytes in Numpy I guess), which is ~360 MB (also, I can
load it ;).
I trained linear SVMs and neural networks without much trouble. I haven't
really studied the decision tree code (which I know you made quite an
effort to optimize), so I don't really have an idea of how the
construction works. Maybe I just had a misconception of the memory usage
of the algorithm. I just started playing with it.

Thanks for any comments :)

Cheers,
Andy
Gilles Louppe
2012-01-03 08:30:51 UTC
Hi Andreas,

Try setting min_split=10 or higher. With a dataset of that size, there
is no point in using min_split=1: you will 1) indeed consume too much
memory and 2) overfit.
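For instance (a sketch with dummy data; min_samples_split is the name
the min_split parameter later got in scikit-learn):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.random.rand(1000, 784)        # stand-in for MNIST
    y_train = np.random.randint(0, 10, 1000)

    clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=10)
    clf.fit(X_train, y_train)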

Gilles

PS: I have just started to change the doc. Expect a PR later today :)
Andreas
2012-01-03 08:34:48 UTC
Hi Gilles.
Thanks! Will try that.

Also thanks for working on the docs! :)

Cheers,
Andy
Andreas
2012-01-03 08:41:06 UTC
I also get the same error when using max_depth=1.
It's here:
File "/home/amueller/checkout/scikit-learn/sklearn/tree/tree.py", line
357, in _build_tree
np.argsort(X.T, axis=1).astype(np.int32).T)

The parameters of my forest are:
RandomForestClassifier(bootstrap=True, compute_importances=False,
criterion='gini', max_depth=1, max_features=10,
min_density=0.1, min_split=1, n_estimators=1, n_jobs=1,
random_state=<mtrand.RandomState object at 0x7f451ac2b0d8>)

It's quite possible that this is some stupid mistake on my part.

I just want to understand what is happening. I would like to use
this code on much larger datasets (with more RAM ;) where data
copies can really be an issue.

Thanks again for your help.
Andy
Peter Prettenhofer
2012-01-03 08:41:27 UTC
Hi Andy,

I'll investigate the issue with an artificial dataset of comparable
size - to be honest I suspect that we focused on speed at the cost of
memory usage...

As a quick fix you could set `min_density=1`, which will result in fewer
memory copies at the cost of runtime.
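That is (min_density was a real parameter of this era's trees; it has
since been removed from scikit-learn):

    from sklearn.tree import DecisionTreeClassifier

    # min_density=1 keeps X/y intact instead of compacting them via
    # fancy indexing, trading runtime for memory
    clf = DecisionTreeClassifier(min_density=1.0)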

best,
Peter
--
Peter Prettenhofer
Gilles Louppe
2012-01-03 08:45:04 UTC
Note also that when using bootstrap=True, copies of X have to be
created for each tree.

But this should work anyway since you only build 1 tree... Hmmm.

Gilles

Peter Prettenhofer
2012-01-03 08:52:28 UTC
Hi,

I just checked DecisionTreeClassifier - it basically requires the same
amount of memory for its internal data structures (= `X_sorted`, which
is also 60000 x 784 x 4 bytes). I haven't checked RandomForest, but
you have to make sure that joblib does not fork a new process. If so,
the new process will have the same memory footprint as the parent
process (which is 2x the input size, because X_sorted is precomputed).
Furthermore, because of Python's memory management, I assume that the
data will be copied once more due to copy-on-write (actually, we don't
write X or X_sorted, but we increment their reference counts, which
should be enough to trigger a copy).
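A quick back-of-envelope for the MNIST case (assuming a 64-bit machine;
the int64 intermediate below is my reading of numpy's argsort default):

    n_samples, n_features = 60000, 784

    X_bytes = n_samples * n_features * 8         # float64 input: ~376 MB
    X_sorted_bytes = n_samples * n_features * 4  # int32 indices: ~188 MB
    # np.argsort itself returns 64-bit indices on a 64-bit machine, so
    # while .astype(np.int32) runs the int64 and int32 arrays coexist
    argsort_tmp_bytes = n_samples * n_features * 8

    print(X_bytes / 1e6, X_sorted_bytes / 1e6, argsort_tmp_bytes / 1e6)
    # -> 376.32 188.16 376.32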

best
--
Peter Prettenhofer
Gael Varoquaux
2012-01-03 08:53:52 UTC
Post by Peter Prettenhofer
I just checked DecisionTreeClassifier - it basically requires the same
amount of memory for its internal data structures (= `X_sorted`, which
is also 60000 x 784 x 4 bytes).
Would it be an option to allow sorting in place to avoid the memory copy?
It would be the pattern 'copy=False' that we have in a few places in the
scikit.
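Roughly this pattern (sketch only; the helper name is illustrative):

    import numpy as np

    def _prepare_X(X, copy=True):
        # with copy=False the caller's array may later be modified
        # (e.g. sorted) in place, saving one n_samples x n_features copy
        return np.array(X, dtype=np.float64, copy=copy)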

Gael
Peter Prettenhofer
2012-01-03 08:56:16 UTC
Post by Gael Varoquaux
Post by Peter Prettenhofer
I just checked DecisionTreeClassifier - it basically requires the same
amount of memory for its internal data structures (= `X_sorted`, which
is also 60000 x 784 x 4 bytes).
Would it be an option to allow sorting in place to avoid the memory copy?
Personally, I wouldn't want to do it but other people might have a
different opinion?

best,
peter
Post by Gael Varoquaux
It would be the pattern 'copy=False' that we have in a few places in the
scikit.
Gael
--
Peter Prettenhofer
Andreas
2012-01-03 09:13:11 UTC
Hi.
I just switched to DecisionTreeClassifier to make analysis easier.
There should be no joblib there, right?
One thing I noticed is that there is often
``np.argsort(X.T, axis=1).astype(np.int32)``
which always does a copy.
Still, as these should be garbage collected, I don't really see
where all the memory goes...
I'll give it a closer look later, but I'll move to another box for now.
Thanks everybody for the help!
And sorry for keeping you busy, @peter.
Cheers,
Andy
Peter Prettenhofer
2012-01-03 10:53:03 UTC
Post by Andreas
Hi.
I just switched to DecisionTreeClassifier to make analysis easier.
There should be no joblib there, right?
Correct.
Post by Andreas
One thing I noticed is that there is often
``np.argsort(X.T, axis=1).astype(np.int32)``
which always does a copy.
This is done once at the beginning of build_tree, and each time the
sample mask gets too sparse. In the latter case, both `X` and `y` are
copied too. To see the memory overhead of ``np.argsort(X.T,
axis=1).astype(np.int32)`` at the beginning, we should test with
`min_density=1.0`, which turns off fancy indexing of `X` and `y` and
re-computation of `X_sorted`.
Post by Andreas
Still, as these should be garbage collected, I don't really see
where all the memory goes...
I'll give it a closer look later but I'll move to another box
for now.
Thanks everybody for the help!
Cheers,
Andy
Post by Peter Prettenhofer
Hi,
I just checked DecisionTreeClassifier - it basically requires the same
amout of memory for its internal data structures (= `X_sorted` which
is also 60.000 x 786 x 4 bytes). I haven't checked RandomForest but
you have to make sure that joblib does not fork a new process. If so,
the new process will have the same memory footprint as the parent
process (which is 2x the input size because X_sorted is precomputed).
Furthermore, because of pythons memory management I assume that the
data will be copied once more due to copy on write (actually, we dont
write X or X_sorted but we increment their reference counts which
should be enough to trigger a copy).
best
Post by Gilles Louppe
Note also that when using bootstrap=True, copies of X have to be
created for each tree.
But this should work anyway since you only build 1 tree... Hmmm.
Gilles
On 3 January 2012 09:41, Peter Prettenhofer
Post by b***@gmail.com
Hi Andy,
I'll investigate the issue with an artificial dataset of comparable
size - to be honest I suspect that we focused on speed at the cost of
memory usage...
As a quick fix you could set `min_density=1` which will result in less
memory copies at the cost of runtime.
best,
  Peter
Post by Andreas
Hi Gilles.
Thanks! Will try that.
Also thanks for working on the docs! :)
Cheers,
Andy
Post by Gilles Louppe
Hi Andreas,
Try setting min_split=10 or higher. With a dataset of that size, there
is no point in using min_split=1: you will 1) indeed consume too much
memory and 2) overfit.
Gilles
PS: I have just started to change the doc. Expect a PR later today :)
Post by Andreas
Hi Brian.
The dataset itself is 60000 * 784 * 8 bytes (I converted from uint8 to
float, which is 8 bytes in NumPy I guess),
which is ~360 MB (also, I can load it ;).
I trained linear SVMs and Neural networks without much trouble. I
haven't really studied the
decision tree code (which I know you made quite an effort to optimize)
so I don't really
have an idea how the construction works. Maybe I just had a
misconception of the memory
usage of the algorithm. I just started playing with it.
Thanks for any comments :)
Cheers,
Andy
Post by b***@gmail.com
Hi Andy,
IIRC MNIST is 60000 samples, each with dimension 28x28, so the 2GB limit doesn't seem unreasonable (especially since you don't have all of that at your disposal). Does the dataset fit in mem?
Brian
-----Original Message-----
Date: Tue, 03 Jan 2012 09:00:47
Subject: Re: [Scikit-learn-general] Question and comments on RandomForests
I tried to run a forest on MNIST that actually consisted of only one tree.
That gave me a memory error. I only have 2 GB of RAM in this machine
(this is my desktop at IST Austria !?) which is obviously not that much.
Still, this kind of surprised me. Is it expected that a tree takes
this "much" RAM? Should I change "min_density"?
Thanks :)
Andy
--
Peter Prettenhofer
Andreas
2012-01-03 08:48:05 UTC
Permalink
Thanks for the help, Peter!
As I guess the memory error occurs before the actual construction
of the tree, setting `min_density=1` didn't help.

I'll try going to another box, but when I have time I'll try
to dig more into the code.
Cheers,
Andy
Post by b***@gmail.com
Hi Andy,
I'll investigate the issue with an artificial dataset of comparable
size - to be honest I suspect that we focused on speed at the cost of
memory usage...
As a quick fix you could set `min_density=1` which will result in fewer
memory copies at the cost of runtime.
best,
Peter
Post by Andreas
Hi Gilles.
Thanks! Will try that.
Also thanks for working on the docs! :)
Cheers,
Andy
Post by Gilles Louppe
Hi Andreas,
Try setting min_split=10 or higher. With a dataset of that size, there
is no point in using min_split=1: you will 1) indeed consume too much
memory and 2) overfit.
Gilles
PS: I have just started to change the doc. Expect a PR later today :)
Post by Andreas
Hi Brian.
The dataset itself is 60000 * 784 * 8 bytes (I converted from uint8 to
float, which is 8 bytes in NumPy I guess),
which is ~360 MB (also, I can load it ;).
I trained linear SVMs and Neural networks without much trouble. I
haven't really studied the
decision tree code (which I know you made quite an effort to optimize)
so I don't really
have an idea how the construction works. Maybe I just had a
misconception of the memory
usage of the algorithm. I just started playing with it.
Thanks for any comments :)
Cheers,
Andy
Post by b***@gmail.com
Hi Andy,
IIRC MNIST is 60000 samples, each with dimension 28x28, so the 2GB limit doesn't seem unreasonable (especially since you don't have all of that at your disposal). Does the dataset fit in mem?
Brian
-----Original Message-----
Date: Tue, 03 Jan 2012 09:00:47
Subject: Re: [Scikit-learn-general] Question and comments on RandomForests
I tried to run a forest on MNIST that actually consisted of only one tree.
That gave me a memory error. I only have 2 GB of RAM in this machine
(this is my desktop at IST Austria !?) which is obviously not that much.
Still, this kind of surprised me. Is it expected that a tree takes
this "much" RAM? Should I change "min_density"?
Thanks :)
Andy
Andreas
2012-01-03 10:51:03 UTC
Permalink
I noticed one more thing in the random forest code:
The random forest averages the probabilities in the leaves.
This is in contrast to Breiman 2001, where trees vote with
hard class decisions, afaik.
As far as I can tell, that is not documented.

Has anyone tried both methods and @glouppe:
why did you choose averaged probabilities over the hard method?

Should the hard voting also be implemented in sklearn
or would that overcomplicate things?
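To make the two aggregation schemes concrete, here is a minimal NumPy
sketch; `probas` stands for the list of per-tree probability outputs, and
the function names are illustrative, not anything in sklearn:

    import numpy as np

    def soft_vote(probas):
        # Average the per-tree class probabilities, then take the argmax
        # (the scheme the sklearn forests use).
        return np.mean(probas, axis=0).argmax(axis=1)

    def hard_vote(probas):
        # Each tree casts one vote for its most probable class
        # (the scheme described in Breiman 2001).
        votes = np.array([p.argmax(axis=1) for p in probas])  # (n_trees, n_samples)
        n_classes = probas[0].shape[1]
        counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
        return counts.argmax(axis=0)

With confident, well-separated leaves the two schemes usually agree; they
can differ on borderline samples, where soft voting keeps the per-tree
confidence information that hard voting throws away.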

Cheers,
Andy
Andreas
2012-01-10 08:39:09 UTC
Permalink
Hey everybody.
Looking a bit into my memory problems, the first thing I noticed
is that the "fit" method of BaseDecisionTree converts
the data into float32 and Fortran-Style ordering.
I haven't paid much attention to that but is that a common pattern?
Having float32 seems reasonable and this conversion is also
done in the SVMs. But there it is also mentioned explicitly in the docs.

Converting the input to Fortran-style ordering seems a bit weird to
me, as I thought the convention was that the data is assumed
to be C-ordered.
Is there an efficiency reason for doing this that can not be
circumvented?

Another thing I don't like so much is that the conversion
into float32 is not very explicit, since a type is defined
in the _tree.pyx, then imported into tree.py and used
for conversion there.
If no-one disagrees I would very much prefer to make
the type explicit in tree.py.

Any hints on the Fortran style ordering would be very welcome.

Thanks,
Andy
Gilles Louppe
2012-01-10 08:42:45 UTC
Permalink
It is converted to Fortran order for efficiency reasons. The most
repeated and consuming operation is the search for split thresholds,
which is performed column-wise, hence the Fortran ordering.
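Concretely, the conversion amounts to something like the following (a
minimal NumPy sketch of the effect, not the exact code in tree.py):

    import numpy as np

    X = np.random.rand(1000, 50)                     # C-ordered float64 by default
    Xf = np.asarray(X, dtype=np.float32, order="F")  # roughly what fit() does

    print(Xf.flags['F_CONTIGUOUS'])  # True: each column is now contiguous
    col = Xf[:, 3]                   # a column scan reads consecutive bytes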

Gilles
Post by Andreas
Hey everybody.
Looking a bit into my memory problems, the first thing I noticed
is that the "fit" method of BaseDecisionTree converts
the data into float32 and Fortran-Style ordering.
I haven't paid much attention to that but is that a common pattern?
Having float32 seems reasonable and this conversion is also
done in the SVMs. But there it is also mentioned explicitly in the docs.
Converting the input to Fortran-style ordering seems a bit weird to
me, as I thought the convention was that the data is assumed
to be C-ordered.
Is there an efficiency reason for doing this that can not be
circumvented?
Another thing I don't like so much is that the conversion
into float32 is not very explicit, since a type is defined
in the _tree.pyx, then imported into tree.py and used
for conversion there.
If no-one disagrees I would very much prefer to make
the type explicit in tree.py.
Any hints on the Fortran style ordering would be very welcome.
Thanks,
Andy
Andreas
2012-01-10 08:53:12 UTC
Permalink
Post by Gilles Louppe
It is converted to Fortran order for efficiency reasons. The most
repeated and consuming operation is the search for split thresholds,
which is performed column-wise, hence the Fortran ordering.
Gilles
Thanks Gilles and Mathieu for the fast answers :)

@Gilles: Is this the operation in ``smallest_sample_larger_than``?

@both: This might be a stupid question but is there really so much
difference
in indexing continuously or with stride over a C pointer?

I didn't do much CPU optimization in the past so sorry for asking stupid
stuff ;)

Cheers,
Andy
Post by Gilles Louppe
Post by Andreas
Hey everybody.
Looking a bit into my memory problems, the first thing I noticed
is that the "fit" method of BaseDecisionTree converts
the data into float32 and Fortran-Style ordering.
I haven't paid much attention to that but is that a common pattern?
Having float32 seems reasonable and this conversion is also
done in the SVMs. But there it is also mentioned explicitly in the docs.
Converting the input to Fortran-style ordering seems a bit weird to
me, as I thought the convention was that the data is assumed
to be C-ordered.
Is there an efficiency reason for doing this that can not be
circumvented?
Another thing I don't like so much is that the conversion
into float32 is not very explicit, since a type is defined
in the _tree.pyx, then imported into tree.py and used
for conversion there.
If no-one disagrees I would very much prefer to make
the type explicit in tree.py.
Any hints on the Fortran style ordering would be very welcome.
Thanks,
Andy
Gilles Louppe
2012-01-10 09:22:22 UTC
Permalink
Post by Andreas
@both: This might be a stupid question but is there really so much
difference
in indexing continuously or with stride over a C pointer?
I didn't do much CPU optimization in the past so sorry for asking stupid
stuff ;)
Yes, the speedup can be quite significant.

To sum up, when the CPU accesses the central memory, it prefetches a
whole contiguous block of bytes from the location pointed by the
pointer. That block is put into the CPU cache(s), which allow much
faster accesses to the subsequent bytes within that block (i.e., the
next elements in the array). If the array was C-ordered, then one
couldn't benefit from the CPU cache in our case (because the next
value in some column j would most likely not be within the block
fetched into the cache (the next values in the block would be the
values on the same line, not on the same column)).

Gilles
Andreas
2012-01-10 09:43:42 UTC
Permalink
Post by Gilles Louppe
Post by Andreas
@both: This might be a stupid question but is there really so much
difference
in indexing continuously or with stride over a C pointer?
I didn't do much CPU optimization in the past so sorry for asking stupid
stuff ;)
Yes, the speedup can be quite significant.
To sum up, when the CPU accesses the central memory, it prefetches a
whole contiguous block of bytes from the location pointed by the
pointer. That block is put into the CPU cache(s), which allow much
faster accesses to the subsequent bytes within that block (i.e., the
next elements in the array). If the array was C-ordered, then one
couldn't benefit from the CPU cache in our case (because the next
value in some column j would most likely not be within the block
fetched into the cache (the next values in the block would be the
values on the same line, not on the same column)).
I thought that modern compilers coupled with modern
hardware that uses all kinds of branch prediction
and prefetchers wouldn't really rely on such simple
mechanisms any more.
(see for example
http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers/)

But if you say there is still a difference then I believe you ;)

Still I would be interested in a C program benchmark.
Maybe I'll whip one up right now...
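In lieu of a C benchmark, the effect is already visible from NumPy (a
rough sketch; the absolute numbers are of course machine-dependent):

    import numpy as np
    from timeit import timeit

    Xc = np.random.rand(60000, 784).astype(np.float32)  # C-ordered
    Xf = np.asfortranarray(Xc)                           # same values, F-ordered

    # Repeatedly sum one column: contiguous reads for Xf, strided for Xc.
    t_c = timeit(lambda: Xc[:, 100].sum(), number=1000)
    t_f = timeit(lambda: Xf[:, 100].sum(), number=1000)
    print("C order: %.4fs  F order: %.4fs" % (t_c, t_f))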

Cheers,
Andy
Gilles Louppe
2012-01-10 09:57:11 UTC
Permalink
Well, not everyone is using modern architectures ;)
Post by Andreas
Post by Gilles Louppe
Post by Andreas
@both: This might be a stupid question but is there really so much
difference
in indexing continuously or with stride over a C pointer?
I didn't do much CPU optimization in the past so sorry for asking stupid
stuff ;)
Yes, the speedup can be quite significant.
To sum up, when the CPU accesses the central memory, it prefetches a
whole contiguous block of bytes from the location pointed by the
pointer. That block is put into the CPU cache(s), which allow much
faster accesses to the subsequent bytes within that block (i.e., the
next elements in the array). If the array was C-ordered, then one
couldn't benefit from the CPU cache in our case (because the next
value in some column j would most likely not be within the block
fetched into the cache (the next values in the block would be the
values on the same line, not on the same column)).
I thought that modern compilers coupled with modern
hardware that uses all kinds of branch prediction
and prefetchers wouldn't really rely on such simple
mechanisms any more.
(see for example
http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers/)
But if you say there is still a difference then I believe you ;)
Still I would be interested in a C program benchmark.
Maybe I'll whip one up right now...
Cheers,
Andy
Andreas
2012-01-10 09:59:48 UTC
Permalink
Post by Gilles Louppe
Well, not everyone is using modern architectures ;)
"modern" here means i368 ;)
Andreas
2012-01-10 12:55:15 UTC
Permalink
Next question about DecisionTrees:
I am not sure if I understand the documentation correctly. It says:
"Setting min_density to 0 will always use the sample mask to select the
subset of samples at each node.
This results in little to no additional memory being allocated, making
it appropriate for massive datasets or within ensemble learners,
but at the expense of being slower when training deep trees. "

This sounds to me as if "min_density=0" is slowest but takes the least
memory. Is that what is meant?

When doing benchmarking, I found "min_density=0" to be the fastest
version on my dataset.
It has set n_samples = 6180, n_features = 2000, n_class=10,

Then I tried with MNIST (n_samples=60000, n_features=784, n_class=10) and
found
min_density=0 to be slower than .1 (twice as long), but .5 to be slower
than .1 as well.


On digits, since training is very fast, it was hard to measure any real
difference.
Still, min_density=0 was fastest and min_density=1 was slowest (1.5
times as slow).

I use the default settings from RandomForest, which has max_depth=None,
max_features=auto,
and I am using only one tree (n_estimators=1).
On which data sets did the statement in the documentation hold?

It seems to me that there is some sweet spot for each dataset and that
on the datasets
I tested, low values seem faster. Setting min_density=1 was often very slow.

What are your experiences?

While .1 seems a good default value, it doesn't seem to be a tradeoff
between
time and memory on the datasets I tested. Rather, it seems to be
the value that makes the algorithm run fastest.
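For anyone who wants to repeat the measurements, the benchmarking loop is
essentially this (a sketch assuming a scikit-learn version that still
exposes `min_density`, as discussed in this thread; the data is a random
stand-in for the sizes mentioned above):

    import time
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.random.rand(6180, 2000).astype(np.float32)
    y = np.random.randint(0, 10, size=6180)

    for density in (0.0, 0.1, 0.5, 1.0):
        clf = DecisionTreeClassifier(min_density=density)
        t0 = time.time()
        clf.fit(X, y)
        print("min_density=%.1f: %.2fs" % (density, time.time() - t0))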

Any help / comment / remarks very welcome!

Thanks,
Andy
Brian Holt
2012-01-10 13:22:48 UTC
Permalink
Hi Andy,

The best way to understand the min_density parameter is to think of it as
'the minimum subset population density'. The idea is that if the density
of the active subset drops below this threshold, then the program should
copy those points and proceed to split using the copied subset.

As an example, assume that there are 1000 datapoints in a set. At depth 5
at a given node, assume that there are only 90 datapoints that are to be
used to find the split point. This works out at a density of 0.09. If the
min_density parameter is set to 0.1, then the program will not pass in the
original dataset with the mask, but will construct a new dataset with only
these 90 values to pass in to the split routine.

The reasoning behind the idea is that there are 2 extremes with a sliding
scale between them. At the one extreme, we always copy the data as we
descend the tree to create the recursive partition. The copying itself
costs time and can use a lot of memory, but each split search then only
touches relevant datapoints.
At the other extreme, we always retain a single dataset and simply modify
the mask to indicate which datapoints should be considered at a particular
node. This avoids the copying, but the split search becomes slower because
the Cython code that computes the split needs to check potentially lots
of irrelevant datapoints for inclusion in the split operation.

The solution we came up with was to allow the user to specify at which
density the data should be copied, to satisfy the user's requirements and
resource trade-offs.
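In schematic terms, the rule amounts to the following (a sketch of the
logic, not the actual Cython):

    import numpy as np

    def maybe_compact(X, y, sample_mask, min_density):
        # Copy the active subset to new, smaller arrays once its density
        # (fraction of unmasked samples) drops below min_density.
        n_active = sample_mask.sum()
        density = n_active / float(len(sample_mask))
        if density < min_density:
            X = X[sample_mask]          # fancy indexing: allocates a copy
            y = y[sample_mask]
            sample_mask = np.ones(n_active, dtype=bool)
        return X, y, sample_mask

With the numbers above, density = 90/1000 = 0.09 < 0.1, so the 90 points
would be copied.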

Hope it helps
Brian
Peter Prettenhofer
2012-01-10 13:24:37 UTC
Permalink
Post by Andreas
"Setting min_density to 0 will always use the sample mask to select the
subset of samples at each node.
This results in little to no additional memory being allocated, making it
appropriate for massive datasets or within ensemble learners,
but at the expense of being slower when training deep trees. "
This sounds to me as if "min_density=0" is slowest but takes the least
memory. Is that what is meant?
Correct, but if you grow your trees deep then the runtime overhead
should be significant.
Post by Andreas
When doing benchmarking, I found "min_density=0" to be the fastest version
on my dataset.
It has set n_samples = 6180, n_features = 2000, n_class=10,
Then I tried with MNIST (n_samples=60000, n_features=784, n_class=10) and
found
min_density=0 to be slower than .1 (twice as long), but .5 to be slower than
.1 as well.
that sounds reasonable - 0.5 triggers a fancy indexing op (=copy)
whenever more than 50% of the samples are out of the (current) mask.
Which means that basically whenever you make a split either the left
or the right child will be fancy indexed. Fancy indexing itself is
costly and must be amortized by less time spent traversing the sample
mask.
Post by Andreas
On digits, since training is very fast, it was hard to measure any real
difference.
Still, min_density=0 was fastest and min_density=1 was slowest (1.5 times as
slow).
I use the default settings from RandomForest, which has max_depth=None,
max_features=auto,
and I am using only one tree (n_estimators=1).
On which data sets did the statement in the documentation hold?
we chose the default parameter (min_density=0.1) on covertype
(large number of samples, few features), madelon (and arcene) - none
of these has a significant number of features.
Post by Andreas
It seems to me that there is some sweet spot for each dataset and that on
the datasets
I tested, low values seem faster. Setting min_density=1 was often very slow
What are your experiences?
For shallow trees (in particular base learners for boosting) I use
min_density=0 too.
Post by Andreas
While .1 seems a good default value, it doesn't seem to be a tradeoff
between
time and memory on the datasets I tested. Rather, it seems to be
the value that makes the algorithm run fastest.
Agreed - the memory tradeoff is negligible since we basically build
the tree in a depth-first fashion, and the deeper you get, the smaller
the arrays become, thus the less memory they consume. The time it
takes to traverse the sample_mask is the major limiting factor.

Thanks for your analysis that was really useful - we should modify the
docstrings to make this more clear.

best,
Peter
Post by Andreas
Any help / comment / remarks very welcome!
Thanks,
Andy
--
Peter Prettenhofer
Andreas
2012-01-10 13:44:38 UTC
Permalink
Thanks for your comments Peter and Brian.
Post by Peter Prettenhofer
Post by Andreas
"Setting min_density to 0 will always use the sample mask to select the
subset of samples at each node.
This results in little to no additional memory being allocated, making it
appropriate for massive datasets or within ensemble learners,
but at the expense of being slower when training deep trees. "
This sounds to me as if "min_density=0" is slowest but takes the least
memory. Is that what is meant?
Correct, but if you grow your trees deep then the runtime overhead
should be significant.
I have not explicitly looked at the depth of the trees but
by default max_depth=None, meaning the trees are grown
to full depth. I haven't tried with max_features=num_features,
though.
Post by Peter Prettenhofer
Post by Andreas
When doing benchmarking, I found "min_density=0" to be the fastest version
on my dataset.
It has set n_samples = 6180, n_features = 2000, n_class=10,
Then I tried with MNIST (n_samples=60000, n_features=784, n_class=10) and
found
min_density=0 to be slower than .1 (twice as long), but .5 to be slower than
.1 as well.
that sounds reasonable - 0.5 triggers a fancy indexing op (=copy)
whenever more than 50% of the samples are out of the (current) mask.
Which means that basically whenever you make a split either the left
or the right child will be fancy indexed. Fancy indexing itself is
costly and must be amortized by less time spent traversing the sample
mask.
Yeah, it does seem reasonable. Still, I feel it is hard
to judge when the fancy indexing cost is amortized.

The decision to let the user make the choice seems good,
though from what is in the documentation, it was not
clear what the choices of the parameter meant.
After reading through the Cython code and doing some profiling
I got there ;)
Post by Peter Prettenhofer
Thanks for your analysis that was really useful - we should modify the
docstrings to make this more clear.
That would be great. Maybe some form of Brian's explanation
could be included.


A related question:
I know you spent a lot of time tweaking the decision tree code.
Do you think there might still be some potential there?

In particular, would it be possible to use lists of pointers/indices
instead of the
masks to avoid iterating over masked-out items? Or do you think
that would create too much overhead?

The current code works great for me (thanks for contributing!!!!),
still it would mean a lot if I could make it even faster. At the moment
it takes me
about 8 hours to grow a tree with only a subset of the features
that I actually want to use.... I have a 128 core cluster here but then
building
a forest with 1000 trees would still take roughly 6 days....

Thanks,
Andy
Gilles Louppe
2012-01-10 14:21:29 UTC
Permalink
Post by Andreas
The current code works great for me (thanks for contributing!!!!),
still it would mean a lot if I could make it even faster. At the moment
it takes me
about 8 hours to grow a tree with only a subset of the features
that I actually want to use.... I have a 128 core cluster here but then
building
a forest with 1000 trees would still take roughly 6 days....
Did you stick to random forests? They are much slower than extra-trees
(because they look for the best splits, while in extra-trees splits
are drawn at random), yet the two are comparable in terms of
accuracy. In addition, from experience, on large to big datasets,
bootstrap doesn't help. You can turn it off (as long as max_features
<< n_features, with RFs).
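A quick way to try this side by side (a sketch; the max_features value of
28, roughly sqrt(784), is just one example of keeping it well below
n_features):

    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

    # Bootstrap off and max_features << n_features, per the advice above.
    rf = RandomForestClassifier(n_estimators=100, max_features=28, bootstrap=False)
    et = ExtraTreesClassifier(n_estimators=100, max_features=28, bootstrap=False)
    # rf.fit(X, y); et.fit(X, y)  # then compare wall time and test accuracy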

Gilles
Andreas
2012-01-10 14:25:31 UTC
Permalink
Post by Gilles Louppe
Post by Andreas
The current code works great for me (thanks for contributing!!!!),
still it would mean a lot if I could make it even faster. At the moment
it takes me
about 8 hours to grow a tree with only a subset of the features
that I actually want to use.... I have a 128 core cluster here but then
building
a forest with 1000 trees would still take roughly 6 days....
Did you stick to random forests? They are much slower than extra-trees
(because they look for the best splits, while in extra-trees splits
are drawn at random), yet the two are comparable in terms of
accuracy. In addition, from experience, on large to big datasets,
bootstrap doesn't help. You can turn it off (as long as max_features
<< n_features, with RFs).
Up to now I used RandomForests.
Thanks for the tips. I'll give it a try.
Olivier Grisel
2012-01-10 14:48:31 UTC
Permalink
Post by Andreas
Post by Gilles Louppe
Post by Andreas
The current code works great for me (thanks for contributing!!!!),
still it would mean a lot if I could make it even faster. At the moment
it takes me
about 8 hours to grow a tree with only a subset of the features
that I actually want to use.... I have a 128 core cluster here but then
building
a forest with 1000 trees would still take roughly 6 days....
Did you stick to random forests? They are much slower than extra-trees
(because they look for the best splits, while in extra-trees splits
are drawn at random), yet the two are comparable in terms of
accuracy. In addition, from experience, on large to big datasets,
bootstrap doesn't help. You can turn it off (as long as max_features
<< n_features, with RFs).
Up to now I used RandomForests.
Thanks for the tips. I'll give it a try.
Out of curiosity can you please report comparative timings on your data?

Also I think Gilles' remark should be added (and made prominent) to
the narrative documentation and also in the "See also" section of the
docstrings of RandomForests.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Peter Prettenhofer
2012-01-10 14:56:38 UTC
Permalink
Post by Olivier Grisel
Post by Andreas
[..]
Up to now I used RandomForests.
Thanks for the tips. I'll give it a try.
Out of curiosity can you please report comparative timings on your data?
Also I think Gilles' remark should be added (and made prominent) to
the narrative documentation and also in the "See also" section of the
docstrings of RandomForests.
Absolutely - I think this was also mentioned in the MSR technical
report that Brian announced some months ago.

@andy have you tried R's random forest package (via rpy2)? I'd be
curious how that compares to our code...
Post by Olivier Grisel
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
Peter Prettenhofer
Andreas
2012-01-10 14:58:55 UTC
Permalink
Post by Peter Prettenhofer
Post by Olivier Grisel
Out of curiosity can you please report comparative timings on your data?
Also I think Gilles' remark should be added (and made prominent) to
the narrative documentation and also in the "See also" section of the
docstrings of RandomForests.
Absolutely - I think this was also mentioned in the MSR technical
report that Brian announced some months ago.
Could you please give me the reference again? That would be awesome :)
Post by Peter Prettenhofer
@andy have you tried R's random forest package (via rpy2)? I'd be
curious how that compares to our code...
Not yet. Will do :)
Brian Holt
2012-01-10 15:46:30 UTC
Permalink
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
Peter Prettenhofer
2012-01-10 16:45:29 UTC
Permalink
Post by Andreas
I know you spent a lot of time tweaking the decision tree code.
Do you think there might still be some potential there?
There's certainly room for improvement! I don't know if I speak for
Brian too but this was the first time I wrote a decision tree and I
did not look extensively at other implementations (well... I looked at
R's gbm package afterwards to find that it looks remarkably similar -
but it treats categorical and real valued features differently).

You can gain a lot if you can make some assumptions on the input data.
The current code is designed for data with a large number of potential
split points (real valued features) - if you expect few potential
split points (e.g. mostly categorical features) then you would use
fundamentally different data structures (don't iterate over the sample
mask but use some form of skip list).

I remember a paper about scaling regression trees (and ultimately
gradient boosting for learning to rank applications) where the authors
proposed an adaptive binning procedure to make the (dominant) real
valued features discrete which allows more efficient search for split
points.
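The binning idea can be sketched in a few lines (illustrative only: the
paper's procedure is adaptive, while this uses fixed quantiles):

    import numpy as np

    def quantile_bin(X, n_bins=32):
        # Discretize each real-valued feature into quantile bins, so the
        # split search only considers n_bins - 1 thresholds per feature
        # instead of up to n_samples - 1.
        Xb = np.empty(X.shape, dtype=np.int32)
        for j in range(X.shape[1]):
            edges = np.percentile(X[:, j], np.linspace(0, 100, n_bins + 1)[1:-1])
            Xb[:, j] = np.searchsorted(edges, X[:, j])
        return Xb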
Post by Andreas
In particular, would it be possible to use lists of pointers/indices
instead of the
masks to avoid iterating over masked out items? Or do you think
that would create to much overhead?
The convenient point about the sample mask is that you need not update
your primary data structure (i.e. X_argsorted).

I think any significant performance leap in the decision tree would
ultimately require changing the data structure, which is not
low-hanging fruit.
Post by Andreas
The current code works great for me (thanks for contributing!!!!),
still it would mean a lot if I could make it even faster. At the moment
it takes me
about 8 hours to grow a tree with only a subset of the features
that I actually want to use.... I have a 128 core cluster here but then
building
a forest with 1000 trees would still take roughly 6 days....
8 hours is quite a lot... there must be some room for improvement -
maybe we should schedule a profiling session :-)

But I'd also like to point out that I benched our code (not the latest
version though) against WEKA and R's rpart -> the first didn't even
allow me to load the benchmark dataset (covertype) while the latter
was at least 10 times slower.
--
Peter Prettenhofer
Mathieu Blondel
2012-01-10 08:48:25 UTC
Permalink
Post by Andreas
Another thing I don't like so much is that the conversion
into float32 is not very explicit, since a type is defined
in the _tree.pyx, then imported into tree.py and used
for conversion there.
If no-one disagrees I would very much prefer to make
the type explicit in tree.py.
In the future, we could support other types by using a template system.
Post by Andreas
Any hints on the Fortran style ordering would be very welcome.
Fortran-style (in a sense, the dense equivalent of CSC matrices) is
useful when you want fast access to one particular feature in all
samples (as is the case in e.g. coordinate descent).

Mathieu
Gael Varoquaux
2012-01-10 08:52:24 UTC
Permalink
I am sure that there are good technical reasons for these choices (and
that they are constrained by the current lack of a templating system). If
the user knows them, he can engineer his data to avoid copies.

To help the user, I think that it would be great to try and
systematically document them in the 'notes' section of the docstring. I
myself keep forgetting to do it, but it can save our users (or our
developers, as in the case of Andy) a lot of time.

Gael
Gael Varoquaux
2012-01-02 17:07:06 UTC
Permalink
Post by Andreas
1)
The narrative docs say that max_features=n_features is a good value for
RandomForests.
As far as I know, Breiman 2001 suggests max_features =
log_2(n_features). I also
saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I
couldn't find that in the paper.
I just tried "digits" and max_features = log_2(n_features) works better than
max_featurs = n_features. Of course that is definitely no conclusive
evidence ;)
Is there any reference that says max_features = n_features is good?
Actually, I believe consistency can be shown for random forests greedily
grown (as they are in the standard implementations) if there are many
samples per leaf:
http://jmlr.csail.mit.edu/papers/volume9/biau08a/biau08a.pdf,
theorem 9: the number of leaves k goes as k = o(sqrt(n/log(n)))

For me, this makes sense intuitively: overfitting is prevented by some sort
of averaging. This averaging works better if each leaf has more than one
sample.

Now for better rules of thumb, I have no references :).

Thanks for the discussion, Andreas and Gilles; having different people
hammering on the code and the docs definitely helps make it accessible
to everybody.

Gael

PS: happy new year.