[Scikit-learn-general] Random forest low score on testing data

Discussion:

muhammad waseem

2016-02-05 16:00:32 UTC

Dear All,
I am trying to train a model using scikit-learn's random forest
(regression) and have used GridSearch with cross-validation (CV=5)
to tune the hyperparameters. I fixed n_estimators=2000 for all cases. Below
are the searches that I performed.

1) max_features: [1,3,5], max_depth: [1,5,10,15],
min_samples_split: [2,6,8,10], bootstrap: [True, False]
The best were max_features=5, max_depth=15, min_samples_split=10,
bootstrap=True
Best score = 0.8724

Then I searched close to the parameters that were best:
2) max_features: [3,5,6], max_depth: [10,20,30,40],
min_samples_split: [8,16,20,24], bootstrap: [True, False]
The best were max_features=5, max_depth=30, min_samples_split=20,
bootstrap=True
Best score = 0.8722

Again, I searched close to the parameters that were best:
3) max_features: [2,4,6], max_depth: [25,35,40,50],
min_samples_split: [22,28,34,40], bootstrap: [True, False]
The best were max_features=4, max_depth=25, min_samples_split=22,
bootstrap=True
Best score = 0.8725

Then I ran GridSearch over the best parameters found in the above runs
and found the best combination to be max_features=4, max_depth=15,
min_samples_split=10.
Best score = 0.8729

Then I used these parameters to predict on an unseen dataset but got a
very low score (around 0.72).

My questions are:

1) Am I doing the hyperparameter tuning correctly, or am I
missing something?

2) Why is my testing score so low compared to my training and
validation scores, and how can I improve it so that I get good predictions
out of my model?

Sorry if these are basic questions; I am new to scikit-learn and ML.

Thanks!
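
For readers following along, the workflow described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the data is synthetic (make_regression), the grid is a reduced version of search 1, and n_estimators is set to 30 instead of 2000 so that it runs quickly. It uses sklearn.model_selection, which is where GridSearchCV lives in scikit-learn 0.18 and later.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): the real dataset is not shown in the thread.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Reduced grid for illustration; the original searches were larger.
param_grid = {
    "max_features": [1, 3, 5],
    "max_depth": [5, 15],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X_train, y_train)

# Mean R^2 over the 5 validation folds for the best parameter combination.
cv_score = search.best_score_
# R^2 of the refitted best model on the held-out test set.
test_score = search.score(X_test, y_test)

print(search.best_params_)
print(cv_score, test_score)
```

Comparing cv_score with test_score is the same comparison as 0.87 versus 0.72 in the post. A large gap on data drawn from the same source would point toward overfitting or toward a test set that differs from the training data.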

Luca Puggini

2016-02-05 16:13:51 UTC

To me the score is not so low. The model is slightly overfitting. Try to
repeat the same process with extremely randomized trees instead of random
forest, and try to keep the depth low.

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

muhammad waseem

2016-02-05 16:27:21 UTC

Hi Luca,
Could you please explain how I can use these randomized trees in scikit-learn?
So you suggest I should be using Random forest?

Luca Puggini

2016-02-05 17:00:07 UTC

Here are the extra trees:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor

It works similarly to random forest. In my experience, RF often tends to
overfit.
I suggest you start with the default parameters and cross-validate only
on the max_depth parameter. Start with small values of max_depth [2, 3, 5,
7, 10] and check how the performance of the model changes.

Good Luck.
Luca
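
A minimal sketch of this suggestion, assuming synthetic data from make_regression in place of the real dataset: ExtraTreesRegressor is left at its defaults (apart from random_state, fixed here only for repeatability) and only max_depth is cross-validated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Cross-validate only max_depth, keeping every other parameter at its default.
search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    {"max_depth": [2, 3, 5, 7, 10]},
    cv=5,
)
search.fit(X, y)

# Mean validation R^2 for each depth, to see how performance changes.
depths = list(search.cv_results_["param_max_depth"])
scores = list(search.cv_results_["mean_test_score"])
for depth, score in zip(depths, scores):
    print(depth, round(float(score), 4))

print("best depth:", search.best_params_["max_depth"])
```

If the validation score stops improving beyond some depth, the smaller depth is usually the safer choice.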

muhammad waseem

2016-02-05 17:13:21 UTC

Thanks Luca, I will give it a try. When you say extremely randomised, does
this mean using a large number of n_estimators?

Also, any idea how to solve the overfitting problem for random forest?

Regards
Waseem

Luca Puggini

2016-02-05 20:46:23 UTC

The number of trees (n_estimators) should be as large as possible. It
does not cause overfitting. In random forest, overfitting is usually
caused by the depth and by variables with several unique values. I
suggest you start with randomized trees with low depth. If you want to
use RF, you can try to reduce the number of variables used at each split.

Note that if you use OOB to estimate the prediction error, it may be
biased when the number of trees is large.

In addition, I suggest you shuffle the data at the beginning if you
can.
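
One way to look at these points empirically is to fit forests of increasing size and compare the OOB score with the score on a held-out set. The sketch below assumes synthetic data and arbitrary tree counts; it makes no claim about which way any bias goes on a real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = []
for n_trees in [50, 200, 500]:
    rf = RandomForestRegressor(
        n_estimators=n_trees,
        bootstrap=True,   # OOB estimates require bootstrap sampling
        oob_score=True,   # compute R^2 on the out-of-bag samples
        random_state=0,
    )
    rf.fit(X_train, y_train)
    results.append((n_trees, rf.oob_score_, rf.score(X_test, y_test)))

for n_trees, oob, held_out in results:
    print(n_trees, round(oob, 4), round(held_out, 4))
```

The held-out score is the reference here. If it stays flat or improves as n_estimators grows, more trees are not the source of the overfitting.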

muhammad waseem

2016-02-05 21:32:39 UTC

Hi Luca,
Thanks for your time and answer. I will try this with a lower max_depth (for
both randomised trees and RF, to see what happens).
By number of variables used at each split, you mean min_samples_split, right?

I did not use the OOB score.
I will also try shuffling my data.

Thanks again.

Jacob Schreiber

2016-02-05 21:43:01 UTC

I'm a bit unclear what you expect shuffling the data to do, Luca, since you
end up taking a random sample if you bootstrap and re-ordering it anyway.

Jacob

Luca Puggini

2016-02-06 01:45:37 UTC

If I understood correctly, he is using a train set for model
identification and training. A test set is then used to evaluate the
results. If he gets good performance on the train set and bad performance
on the test set, it may be because the test set contains different
information from the train set. This is common in time series, for
example.

Luca Puggini

2016-02-06 01:50:05 UTC

@muhammad by number of variables at each split I mean 'max_features'.
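
To make the distinction concrete, here is a small sketch on synthetic data (an assumption, as are the specific values 2 and 10). max_features limits how many variables are considered at each split, while min_samples_split is the minimum number of samples a node must hold before it may be split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=50,
    max_features=2,        # number of variables considered at each split
    min_samples_split=10,  # minimum samples in a node for it to be split
    random_state=0,
)
rf.fit(X, y)

# Each fitted tree carries both settings.
first_tree = rf.estimators_[0]
print(first_tree.max_features_)      # resolved number of features per split
print(first_tree.min_samples_split)  # node-size threshold passed to the tree
```

Lowering max_features makes the individual trees less correlated with each other, which is the lever being suggested here for reducing overfitting with RF.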

Jacob Schreiber

2016-02-06 02:25:32 UTC

Luca, I'm not sure I understand what you're saying. All test sets have
different information than their training sets--why does that mean
shuffling would help? Algorithmically, the tree re-sorts the data anyway
without caring about the order they were in originally.

Luca Puggini

2016-02-06 02:51:40 UTC

Permalink

Suppose you have a medical dataset where the first 500 people are from
population A and patients 501 to 1000 are from population B.
People in pop A can be very different from those in pop B. If you
train only on the first half of the data, the model may miss important
information about pop B. If you shuffle the data at the beginning,
both the train and test sets will contain samples from pop A and pop B.
I do not know if this will help muhammad, as it is difficult to judge without
the data. It is worth trying, as it is one line of code.

I hope this clarifies things.
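
A minimal sketch of the two-population example, not code from the thread: the `population` array and the placeholder feature matrix are hypothetical, and `train_test_split` from `sklearn.model_selection` is assumed to be available (scikit-learn 0.19 or later for the `shuffle` argument). It only illustrates what shuffling changes in the composition of the two halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: rows 0-499 are population A, rows 500-999 population B.
population = np.array(["A"] * 500 + ["B"] * 500)
X = np.arange(1000).reshape(-1, 1)  # placeholder feature column


def count_b(labels):
    """Number of population-B rows in a label array."""
    return int(np.sum(labels == "B"))


# Ordered split: train on the first half, test on the second half.
train_ordered, test_ordered = population[:500], population[500:]

# Shuffled split: rows are permuted before being divided in two.
_, _, train_shuffled, test_shuffled = train_test_split(
    X, population, test_size=0.5, shuffle=True, random_state=0
)

print("B rows in ordered train set: ", count_b(train_ordered))
print("B rows in shuffled train set:", count_b(train_shuffled))
```

With the ordered split the training half contains no rows from population B at all, whereas the shuffled split puts rows from both populations on each side. This helps when the rows are exchangeable; it says nothing yet about data with a time order.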

Post by Jacob Schreiber
Luca, I'm not sure I understand what you're saying. All test sets have
different information than their training sets--why does that mean
shuffling would help? Algorithmically the tree re-sorts the data anyway
without caring about the order it was in originally.

Post by Luca Puggini
@muhammad by number of variables at each split I mean 'max_features'.

Post by Luca Puggini
The number of trees (n_estimators) should be as large as
possible. It does not cause overfitting. In random forest, overfitting
is usually caused by the depth and by variables with several unique
values. I suggest you start with randomized trees with low depth.
If you want to use RF, you can try reducing the number of variables used at
each split.
Note that if you use OOB to estimate the prediction error, it may
be biased when the number of trees is large.
In addition, I suggest you shuffle the data at the beginning if
you can.
On Fri, Feb 5, 2016, 5:14 PM muhammad waseem <

Post by muhammad waseem
Thanks Luca, I will give it a try. When you say extremely
randomised, does this mean using a large number of n_estimators?
Also, any idea how to solve the overfitting problem for random forest?
Regards
Waseem

Post by Luca Puggini
Here are the extra trees:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor
They work similarly to random forest. In my experience RF often
tends to overfit.
I suggest you start with the default parameters and cross-validate
only on the max_depth parameter. Start with small values of
max_depth [2, 3, 5, 7, 10] and check how the performance of the model
changes.
Good Luck.
Luca
On Fri, Feb 5, 2016 at 4:28 PM muhammad waseem <

Post by muhammad waseem
Hi Luca,
Could you please explain how I can use these randomized trees in
scikit-learn? So you suggest I should be using Random forest?

Post by Luca Puggini
To me the score is not so low. The model is slightly
overfitting. Try repeating the same process with extremely randomized trees
instead of random forest, and try to keep the depth low.

muhammad waseem

2016-02-08 15:23:29 UTC

Permalink

Hi Luca,
Thanks for your help. I have tried shuffling my data (which made sense in
my case, as it was ordered by day, month and hour). I have also tried
lowering max_depth with a smaller number of features, but it did not work for me.
I have also tried ExtraTreesRegressor, but without any luck.
Using feature_importances_, I found that one of the features was not
very important, so I removed it, but that did not help with either random
forest or extra trees. Any ideas what I could try?

Thanks
Regards
Waseem

Post by Luca Puggini
Suppose you have a medical dataset where the first 500 people are from
population A and patients 501 to 1000 are from population B.
People in pop A can be very different from those in pop B. If you
train only on the first half of the data, the model may miss important
information about pop B. If you shuffle the data at the beginning,
both the train and test sets will contain samples from pop A and pop B.
I do not know if this will help muhammad, as it is difficult to judge
without the data. It is worth trying, as it is one line of code.
I hope this clarifies things.

Post by Luca Puggini
@muhammad by number of variables at each split I mean 'max_features'.

Post by Luca Puggini
If I understood correctly, he is using a train set for
model identification and training. A test set is then used to evaluate the
results. If he gets good performance on the train set and bad performance
on the test set, it may be because the test set contains different
information from the train set. This is common, for example, in time
series.

Post by Jacob Schreiber
I'm a bit unclear what you expect shuffling the data to do, Luca,
since you end up taking a random sample if you bootstrap and re-ordering it
anyway.
Jacob

Andreas Mueller

2016-02-09 20:16:06 UTC

Permalink

You should probably use a different cross-validation strategy if your
data is ordered. This will give you more realistic cross-validation results.
There was a time series CV object somewhere, and by now I think we
should include it (this is the third time this comes up in the last 3 days)

muhammad waseem

2016-02-09 20:22:56 UTC

Permalink

Hi Andreas,
Thanks for your reply. I have already shuffled my data, so it is no
longer ordered, but still no luck. Any other suggestions?

Post by Andreas Mueller
You should probably use a different cross-validation strategy if your
data is ordered. This will give you more realistic cross-validation results.
There was a time series CV object somewhere, and by now I think we
should include it (this is the third time this comes up in the last 3 days)

Andreas Mueller

2016-02-09 21:01:05 UTC

Permalink

How did you create the hold-out test data? Before or after shuffling?

Post by muhammad waseem
Hi Andreas,
Thanks for your reply. I have already shuffled my data, so it is no
longer ordered, but still no luck. Any other suggestions?

muhammad waseem

2016-02-09 21:23:17 UTC

Permalink

I have it in a separate file (csv). Actually, I have four years of weather data
(hourly values in two files). I use three years' worth of data (first file) for
training and one year's worth of data (second file) for testing.

Am I doing it correctly? Any ideas?

Post by Andreas Mueller
How did you create the hold-out test data? Before or after shuffling?

Andreas Mueller

2016-02-09 21:40:29 UTC

Permalink

Yes. Exactly what Luca said and what I said earlier.

There is temporal structure in your data. If you use k-fold cross
validation (or even shuffle the data), that destroys the temporal structure.
You want to make predictions for the future (the second file). You
should use a cross-validation method that tries to predict from the past
to the future, not one that tries to predict arbitrary time points.
Otherwise, your results will be too optimistic, as you found.
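
As an illustration of "predict from the past to the future", here is a small sketch using `TimeSeriesSplit`, a splitter that was added to `sklearn.model_selection` in scikit-learn 0.18, after this thread. The 12-row array is a made-up stand-in for time-ordered data; nothing here comes from the poster's files.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n_samples = 12  # stand-in for rows already sorted by time
X = np.arange(n_samples).reshape(-1, 1)

# Forward-chaining folds: every training index precedes every test index.
forward_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in forward_folds:
    print("train", train_idx, "-> test", test_idx)

# Shuffled k-fold, for contrast: training folds contain rows that come
# later in time than some of the rows being predicted.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
mixed = [train_idx.max() > test_idx.min() for train_idx, test_idx in kfold.split(X)]
print("k-fold folds that train on later rows:", sum(mixed), "of", len(mixed))
```

Any object of this kind can be passed as the `cv` argument of a grid search, so the tuning procedure itself does not need to change, only the way the folds are built.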

Post by muhammad waseem
I have it in a separate file (csv). Actually, I have four years of weather
data (hourly values in two files). I use three years' worth of data (first
file) for training and one year's worth of data (second file) for testing.
Am I doing it correctly? Any ideas?

Luca Puggini

2016-02-09 22:00:38 UTC

Permalink

Personally I think that random forest should not be used for time series
data unless the data is expected to have some sort of periodicity. This is
because random forest is a sort of local estimator. It is not effective if
new samples fall outside the hypercube defined by the training data.
This is quite common in time series. If I were you I would try something
like linear regression or an extreme learning machine. If you are interested
in extreme learning machines, there should be a PR on scikit-learn. (I wrote a
simple paper with a simple introduction to ELM: "Extreme learning machines
for virtual metrology and etch rate prediction". Maybe this can help you.)
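
The point about samples outside the hypercube of the training data can be sketched on synthetic data. The linear trend `y = 2x`, the training range and the query point below are all invented for illustration; they are not the weather data discussed here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic trend y = 2x, observed only for x in [0, 10).
X_train = np.linspace(0, 10, 200, endpoint=False).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# A query point well outside the range covered by the training data.
X_new = np.array([[20.0]])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

forest_pred = forest.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

# Each tree predicts an average of training targets, so the forest cannot
# go beyond the largest target it has seen; the linear model follows the trend.
print("largest training target:", y_train.max())
print("random forest at x=20:  ", forest_pred)
print("linear model at x=20:   ", linear_pred)
```

The forest prediction stays at or below the largest training target, while the linear model continues the trend. Whether this matters in practice depends on whether future rows really fall outside the range seen in training.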

Post by Andreas Mueller
Yes. Exactly what Luca said and what I said earlier.
There is temporal structure in your data. If you use k-fold cross
validation (or even shuffle the data), that destroys the temporal structure.
You want to make predictions for the future (the second file). You should
use a cross-validation method that tries to predict from the past
to the future, not one that tries to predict arbitrary time points. Otherwise,
your results will be too optimistic, as you found.

muhammad waseem

2016-02-10 04:47:04 UTC

Permalink

Thanks Luca and Andreas. The idea behind this is to predict a weather
parameter using some other parameters. Do you still think it will be difficult
to solve with Random Forest, given that it is not really a time series problem?
I get good training results (with high max_depth) but not very good results
on the testing dataset, meaning the regressor is unable to generalise.

What about a gradient boosting regressor? Is this suitable?

Thanks
Kindest Regards
Waseem

Post by Luca Puggini
Personally I think that random forest should not be used for time series
data unless the data is expected to have some sort of periodicity. This is
because random forest is a sort of local estimator. It is not effective if
new samples fall outside the hypercube defined by the training data.
This is quite common in time series. If I were you I would try something
like linear regression or an extreme learning machine. If you are interested
in extreme learning machines, there should be a PR on scikit-learn. (I wrote a
simple paper with a simple introduction to ELM: "Extreme learning machines
for virtual metrology and etch rate prediction". Maybe this can help you.)


Andreas Mueller

2016-02-10 16:26:09 UTC

Permalink

The problem is really how you do cross-validation.

Post by muhammad waseem
Thanks Luca and Andreas. The idea behind this is to predict a weather
parameter using some other parameters. Do you still think it will be
difficult to solve with Random Forest, given that it is not really a time
series problem? I get good training results (with high max_depth) but not
very good results on the testing dataset, meaning the regressor is unable
to generalise.
What about a gradient boosting regressor? Is this suitable?
Thanks
Kindest Regards
Waseem

muhammad waseem

2016-02-10 17:43:00 UTC

Permalink

Hi Andreas,
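Here is the code showing the way I am currently doing it:

# In releases before 0.18, train_test_split and GridSearchCV lived in
# sklearn.cross_validation and sklearn.grid_search respectively.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# X and Y are the features and target loaded earlier (not shown).
# Random 70/30 split of the rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=10)

model = RandomForestRegressor(random_state=10, n_estimators=3000)
param_grid = {"max_features": [2, 3, 4, 5],
              "max_depth": [5, 50, 100, 150, 200],
              "min_samples_split": [2, 10, 20, 30],
              "bootstrap": [True, False]}
# 5-fold cross-validated grid search over param_grid, using all CPU cores.
grid_search = GridSearchCV(model, param_grid, n_jobs=-1, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)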

Is this the correct way of doing it?

Regards
Waseem
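
For comparison, a hedged sketch of the same search with a chronological hold-out and forward-chaining folds, along the lines suggested earlier in the thread. The data is synthetic, the grid and `n_estimators` are shrunk so that it runs quickly, and `TimeSeriesSplit` is assumed to be available (scikit-learn 0.18 or later); none of these choices come from the original code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for hourly weather rows, assumed already sorted by time.
rng = np.random.RandomState(10)
X = rng.rand(400, 5)
Y = X[:, 0] + 0.1 * rng.randn(400)

# Chronological hold-out: the last 30% of the rows play the role of the future.
split = int(len(X) * 0.7)
X_train, X_test = X[:split], X[split:]
y_train, y_test = Y[:split], Y[split:]

model = RandomForestRegressor(n_estimators=50, random_state=10)
param_grid = {"max_features": [2, 3], "max_depth": [3, 5]}

# Each validation fold consists of rows that come after its training rows.
grid_search = GridSearchCV(model, param_grid, cv=TimeSeriesSplit(n_splits=3), n_jobs=1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print("CV score:      ", grid_search.best_score_)
print("hold-out score:", grid_search.score(X_test, y_test))
```

With this setup the cross-validation score and the hold-out score are measured in the same way (past rows predicting later rows), so a large gap between them is more informative than it is with a random split.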

Post by Andreas Mueller
The problem is really how you do cross-validation.