Discussion:
[Scikit-learn-general] Sparse Gradient Boosting & Fully Corrective Gradient Boosting
c TAKES
2014-09-16 00:16:50 UTC
Is anyone working on making Gradient Boosting Regressor work with sparse
matrices?

Or is anyone working on adding an option for fully corrective gradient
boosting, i.e., all trees in the ensemble are re-weighted at each iteration?

These are things I would like to see and may be able to help with if no one
is currently working on them.
Arnaud Joly
2014-09-16 07:44:18 UTC
Hi,

There is a very advanced pull request which adds sparse matrix support to
decision trees: https://github.com/scikit-learn/scikit-learn/pull/3173

Based on this, it should be possible to get gradient tree boosting working
on sparse data. Note that AdaBoost already supports sparse matrices
with non-tree-based estimators.
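
Once that pull request is merged, fitting a tree on scipy.sparse input should
look roughly like this (just a sketch of the intended usage, not something you
can run on a current release):

import scipy.sparse as sp
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Dense data converted to CSR only for illustration; the interesting use
# cases start from genuinely sparse features (text, one-hot encodings, ...).
X, y = make_regression(n_samples=200, n_features=50, random_state=0)
X_sparse = sp.csr_matrix(X)

# With sparse support in the tree code, CSR/CSC input is accepted directly.
reg = DecisionTreeRegressor(max_depth=4).fit(X_sparse, y)
print(reg.score(X_sparse, y))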

Best regards,
Arnaud
Mathieu Blondel
2014-09-16 13:15:25 UTC
Could you give a reference for gradient boosting with fully corrective
updates?

Since the philosophy of gradient boosting is to fit each tree against the
residuals (or negative gradient) so far, I am wondering how such a fully
corrective update would work...

Mathieu
Peter Prettenhofer
2014-09-16 16:16:46 UTC
The only reference I know of is the Regularized Greedy Forest paper by Johnson
and Zhang [1].
I haven't read the primary source (also by Zhang).

[1] http://arxiv.org/abs/1109.0887
--
Peter Prettenhofer
c TAKES
2014-09-16 19:02:00 UTC
Yes - in fact my real goal is ultimately to implement RGF, though I had
considered building/forking off the current GradientBoostingRegressor
code as a starting point, (a) because I'm new to developing for scikit-learn
and (b) to maintain as much consistency as possible with the rest of the
package.

Upon further review though, I don't think there's much point in adding
fully corrective updates to GBR, because without the regularization (the
rest of RGF) it is probably useless and leads to overfitting, as per
http://stat.rutgers.edu/home/tzhang/software/rgf/tpami14-rgf.pdf.

So it would likely be more useful to go ahead and create RGF as an entirely
separate module. But that could take some time, of course.

Is anyone working on RGF for sklearn?

Arnaud, thanks for directing me to your sparse implementation, I will have
a look! It would certainly be excellent to have this work for all decision
tree algorithms.

Ken
Mathieu Blondel
2014-09-20 15:04:50 UTC
I read the RGF paper. Interesting method but definitely too early to
include it in scikit-learn (we focus on mature algorithms). This also looks
more complicated to implement than gradient boosting, since tree induction
and boosting are interleaved.

The paper also clarified what "fully corrective" means, thanks. To
summarize, here are different strategies for setting the weights in
gradient boosting:
1. using the same constant value (`learning_rate`) for all estimators
2. setting the weight of the last estimator by line search (other weights
are kept fixed)
3. setting one separate weight per leaf node of the last estimator, by line
search
4. refitting all estimators weights (including the past ones)
5. refitting all leaf nodes of all estimators?

Some authors [1] argued that option 1 is sufficient in practice to get good
performance since our goal is to do well on test data, not training data.
Option 3 is what scikit-learn implements. Option 4 is the fully corrective
case. It basically amounts to least squares for the squared loss or to
logistic regression for the log loss (using each estimator's predictions as
features). One thing though is that our implementation doesn't store the
estimator weights explicitly (leaves are updated directly). Setting a
penalty (l1 or l2) on the estimator weights (i.e., on the functional)
should prevent overfitting. Option 5 sounds pretty computationally
expensive, although the update doesn't need to be done at every iteration.
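
For concreteness, here is a rough sketch of what option 4 could look like for
the squared loss: treat each tree's predictions as features and refit all the
weights by (ridge-penalized) least squares after each boosting round. This is
only an illustration I put together, not scikit-learn code; the function name
and the alpha penalty are made up.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge

def fully_corrective_gbr(X, y, n_estimators=10, max_depth=3, alpha=1.0):
    trees = []
    y_pred = np.zeros_like(y, dtype=float)
    for _ in range(n_estimators):
        # Fit the next tree on the residuals (negative gradient of the
        # squared loss).
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y - y_pred)
        trees.append(tree)
        # Fully corrective step: refit ALL estimator weights jointly on the
        # stacked tree predictions. The l2 penalty on the weights is what
        # should keep this from overfitting.
        H = np.column_stack([t.predict(X) for t in trees])
        weights = Ridge(alpha=alpha, fit_intercept=False).fit(H, y).coef_
        y_pred = np.dot(H, weights)
    return trees, weights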

I recently re-implemented gradient boosting [2]. One difference in my
implementation is that it is possible to use any base learner (not just
trees). So my implementation stores estimator weights explicitly and uses
option 2 above. The fully corrective updates (option 4) might be easier to
add to my implementation.

BTW, the notion of fully corrective updates is also present in the matching
pursuit (-> orthogonal matching pursuit) and Frank-Wolfe literatures.

[1] "Boosting Algorithms: Regularization, Prediction and Model Fitting",
Peter B ̈uhlmann and Torsten Hothorn (thanks to Peter for telling me about
this paper)

[2]
https://github.com/mblondel/ivalice/blob/master/ivalice/impl/gradient_boosting.py

Mathieu
Olivier Grisel
2014-09-20 16:55:08 UTC
On a related note, here is an implementation of Logistic Regression
applied to one-hot features obtained from the leaf membership info of a
GBRT model:

http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb#Using-the-boosted-trees-to-extract-features-for-a-Logistic-Regression-model

This is inspired by this paper from Facebook:
https://www.facebook.com/publications/329190253909587/ .

It's easy to implement and seems to work quite well.
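
The core of the trick fits in a few lines. Here is a self-contained sketch on
synthetic data using a recent scikit-learn API; it assumes the ensemble
exposes an `apply` method returning per-tree leaf indices, and it is not the
exact code from the notebook:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbrt.fit(X_train, y_train)

# Leaf index of each sample in each tree -> one-hot encode -> logistic regression.
leaves_train = gbrt.apply(X_train).reshape(X_train.shape[0], -1)
leaves_test = gbrt.apply(X_test).reshape(X_test.shape[0], -1)
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_train)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves_train), y_train)

print("GBRT alone:", gbrt.score(X_test, y_test))
print("GBRT + LR :", lr.score(enc.transform(leaves_test), y_test))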
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2014-09-21 08:46:31 UTC
What is the advantage of this method over using GBRT directly?

Mathieu
Olivier Grisel
2014-09-22 18:48:56 UTC
Post by Mathieu Blondel
What is the advantage of this method over using GBRT directly?
A significant improvement in F1 score for the positive/minority
class, and in ROC AUC, on this dataset (Adult Census binarized income
prediction with integer encoding of the categorical variables).

Apparently the Facebook ads team reported the same kind of improvement
on their own data.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Peter Prettenhofer
2014-09-22 19:01:08 UTC
The key advantage of RuleFit [1] -- striking that they didn't cite it, btw
-- is that if you add the original features, your model can a) better
incorporate additive effects and b) extrapolate, which purely tree-based
methods like GBRT or RF cannot do on their own.

[1] http://statweb.stanford.edu/~jhf/R-RuleFit.html
--
Peter Prettenhofer
Olivier Grisel
2014-09-23 18:06:15 UTC
I need to update my code to add the one-hot-encoded original categorical
features, a la RuleFit, to see whether it improves the F1 score further or not.
--
Olivier
Olivier Grisel
2014-09-20 17:04:50 UTC
Post by Mathieu Blondel
I recently re-implemented gradient boosting [2].
I am interested in your feedback on implementing trees with numba. Is
it easy to reach the speed of the scikit-learn Cython-generated code?
What are the trade-offs and traps you discovered while working on
this?

In general, would you recommend numba over Cython for Python users
implementing their own estimators with non-vectorizable routines?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2014-09-21 09:07:35 UTC
My implementation of trees is still very much a work in progress and I
still have to implement a bunch of optimizations on the algorithmic level
before we can compare it to the implementation in scikit-learn. My
implementation is not a port of scikit-learn, I started from scratch
(although I followed most of the implementation techniques detailed in
Gilles' thesis).

I really love Numba. It is really great to be able to accelerate routines
directly in the same file, and type inference makes the code much more
readable. I usually get 20x to 50x speed-ups compared to pure Python.
However, there are still a few limitations. A few language constructs,
like recursion and returning tuples, are not yet supported, and working
around these limitations sometimes makes the code ugly. Compilation is
also slow. Here's an example:

$ time nosetests ivalice/impl/tests/test_tree.py
........
----------------------------------------------------------------------
Ran 8 tests in 0.416s

OK
nosetests ivalice/impl/tests/test_tree.py 5,50s user 0,20s system 99% cpu
5,696 total

So the tests take 0.4s to run but compilation of the jitted functions takes
5s. It's ok for large scale learning but quite annoying for small scale
interactive work. So compilation caching would be a nice feature to have (
https://github.com/numba/numba/issues/224).

Last, error reporting is a bit confusing at times.

To summarize, in its current state, I'd recommend Numba for small projects
but perhaps not for scikit-learn. Also this makes me realize that Cython
has reached a level of maturity which is pretty impressive!
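
For anyone curious what that workflow looks like, here is a toy jitted routine
(nothing to do with ivalice, just an illustration; the first call pays the
compilation cost):

import numpy as np
from numba import njit

@njit
def sum_of_squares(x):
    # A plain Python loop that numba compiles to machine code.
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

x = np.random.rand(1000000)
print(sum_of_squares(x))  # the first call triggers the (slow) JIT compilation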


Mathieu
c TAKES
2014-09-20 19:16:21 UTC
Mathieu - thanks so much for this in-depth response. It clarified a fair
amount of what I was seeing in the gradient boosting source code that
differs from my prior understanding of standard GBMs. It also became clear
from reviewing the code that to implement RGF I would have to modify both
tree.py and _tree.pyx in order to accomplish, as you described it, the
interleaving of tree induction and boosting. I would need to make it so a
DecisionTreeRegressor - or perhaps a separate class implementing
BaseDecisionTree - can add one new split at a time, and also add a
function to return the loss reductions that would result from each possible
split.
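
Roughly, I am imagining a hypothetical interface along these lines (pure
illustration on my part - none of these names exist in scikit-learn):

class IncrementalTreeRegressor:
    """Hypothetical tree that can be grown one split at a time."""

    def candidate_splits(self, X, residuals):
        """Return (leaf_id, feature, threshold, loss_reduction) tuples,
        one per candidate split of each current leaf."""
        raise NotImplementedError

    def apply_split(self, leaf_id, feature, threshold):
        """Turn the given leaf into an internal node with two children."""
        raise NotImplementedError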

Understandable that scikit-learn wants to focus on more mature algorithms,
so perhaps I'll spend my efforts more on writing a python wrapper for
Johnson and Zhang's implementation of RGF, at least for now. Personally I
do think it is fairly well proven though, as Johnson and Zhang have been
quite successful with it in competitions. To a lesser extent so have I for
that matter :).

One question I do have - you said that Option 3 is the one scikit-learn
implements (setting one separate weight per leaf node of the last
estimator, by line search). Perhaps I'm overlooking something, but to me
the code looks like it is implementing Option 1 (using the same constant
value (`learning_rate`) for all estimators).

Take the loss function LeastSquaresError for simplicity. Unless I'm
mistaken, the predictions are simply updated by this statement:

y_pred[:, k] += learning_rate * tree.predict(X).ravel()

where learning_rate is a constant, and that's it. Isn't that Option 1? I
don't see any line search.

Best,
Ken
Mathieu Blondel
2014-09-21 08:41:10 UTC
Hi Ken,
Post by c TAKES
Understandable that scikit-learn wants to focus on more mature algorithms,
so perhaps I'll spend my efforts more on writing a python wrapper for
Johnson and Zhang's implementation of RGF, at least for now. Personally I
do think it is fairly well proven though, as Johnson and Zhang have been
quite successful with it in competitions. To a lesser extent so have I for
that matter :).
I am not saying that their method isn't useful, just that it is too early
to add it to scikit-learn. Anyway, let us know if you come up with a
scikit-learn compatible implementation or wrapper :)
Post by c TAKES
One question I do have - you said that Option 3 is the one scikit-learn
implements (setting one separate weight per leaf node of the last
estimator, by line search). Perhaps I'm overlooking something, but to me
the code looks like it is implementing Option 1 (using the same constant
value (`learning_rate`) for all estimators).
I've just checked and the optimal step size for the squared loss is:

step_size = \sum_i (y_i - f_i) h_i / \sum_i h_i^2

where y_i is the correct label of instance i, f_i is the prediction of the
ensemble so far and h_i is the prediction of the model to be added to the
ensemble. Recall that the tree is fitted against the residuals. So if the
tree perfectly fits the residuals, y_i - f_i = h_i and thus the optimal
step size is 1.0. This might not be the case if the tree is too shallow to
fit the data, though. This is for option 2; for option 3, the sums are
taken only over the instances that reach a particular leaf.
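
If you want to convince yourself numerically, here is a quick check of that
formula (my own throwaway snippet, nothing from scikit-learn):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.RandomState(0)
y = rng.randn(100)   # true targets
f = rng.randn(100)   # predictions of the ensemble so far
h = rng.randn(100)   # predictions of the tree to be added

# Closed-form step size for the squared loss (the formula above).
step_closed = np.sum((y - f) * h) / np.sum(h ** 2)

# One-dimensional line search over the step size, for comparison.
step_numeric = minimize_scalar(lambda g: np.sum((y - f - g * h) ** 2)).x

print(step_closed, step_numeric)  # the two agree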

If you have a look at other loss functions it is clear that scikit-learn
implements option 3.

Here's the loop over the leaves:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L191

Here's the line search for one leaf in the case of the log loss:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L377

To be more precise, scikit-learn finds the step size by a single
Newton-Raphson step, as suggested in the original gradient boosting paper.
So, the line search is only approximate.

Mathieu