Post by Olivier Grisel
This discussion is interesting: we could add a new
BinningTransformer (maybe as a complement to the MidrangeScaler
discussed on the preprocessing-simplification pull request) so as to
make linear models able to capture nonlinear features in the data.
That should be easy to implement and could potentially make all the
linear models more expressive for a small computational overhead.

Post by Mathieu Blondel
Sounds like an idea (plus an option for specifying the range of features
on which we want to apply the binning). This transformer could
potentially be used to transform categorical features to binary
features too (for example, a 5-category variable needs to be mapped to
5 binary features).
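
To make the proposal concrete, here is a rough sketch of what such a binning step could do for a single feature. It is not an existing scikit API: the function name `bin_to_binary` and its `edges` argument are made up for illustration, and it relies only on NumPy's `np.digitize`.

```python
import numpy as np

def bin_to_binary(x, edges):
    """One-hot encode a 1-D array by the bin each value falls into.

    `edges` are the interior thresholds, sorted ascending; k edges give
    k + 1 bins. (Hypothetical helper, not part of any library.)
    """
    x = np.asarray(x, dtype=float)
    idx = np.digitize(x, edges)  # bin index in 0..len(edges)
    out = np.zeros((x.shape[0], len(edges) + 1), dtype=int)
    out[np.arange(x.shape[0]), idx] = 1  # one active column per sample
    return out

# A continuous feature cut into 4 bins.
x = np.array([0.1, 0.4, 0.6, 0.9])
B = bin_to_binary(x, edges=[0.25, 0.5, 0.75])

# A 5-category variable coded 0..4 maps to 5 binary features.
C = bin_to_binary([0, 1, 2, 3, 4], edges=[0.5, 1.5, 2.5, 3.5])
```

A linear model fitted on the binary columns learns one weight per bin, which is how it can pick up a nonlinear dependence on the original feature.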
By the way, this might be off-topic, as this thread is talking about
problems I am not used to, and I read it a bit quickly. However, I
recently wrote some code to choose bin edges (in other words,
thresholds) to discretize univariate data while trying to keep the bins
equally populated. This can be useful if you want to convert a
continuous distribution to a set of states. It can actually get a bit
tricky when you have a mixture of scattered data and a few
macroscopically-occupied states.
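
For readers who want the general idea without opening the gist, a minimal quantile-based sketch follows. It is not the code from the gist: the function name `equal_population_edges` is hypothetical, and dropping duplicated thresholds with `np.unique` is just one simple way to cope with heavily repeated values.

```python
import numpy as np

def equal_population_edges(x, n_bins):
    """Interior thresholds that split x into roughly equal-population bins.

    Heavily repeated values can make several quantiles coincide, so
    duplicates are dropped and fewer than n_bins - 1 edges may be returned.
    (Hypothetical helper, not the gist's implementation.)
    """
    x = np.asarray(x, dtype=float)
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]  # interior quantile levels
    edges = np.quantile(x, qs)
    return np.unique(edges)  # sorted, duplicates removed

# Scattered data mixed with one heavily occupied state at 0.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=1000), np.zeros(1000)])

edges = equal_population_edges(x, n_bins=4)
counts = np.bincount(np.digitize(x, edges), minlength=len(edges) + 1)
```

Inspecting `counts` on data like this shows the difficulty: the occupied state cannot be split by a threshold, so the bin that contains it ends up much fuller than the others.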
I uploaded the code on:
https://gist.github.com/1010064
It has no tests :(.
It is only for univariate data. For multivariate data, one would need
to use the tree built by a KD-tree or a ball-tree. However, dealing with
macroscopically-occupied states would get a bit trickier.
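
As a sketch of the multivariate idea, here is a KD-tree-style partition built from recursive median splits, written with plain NumPy rather than any existing tree class. The function name `kd_partition` and its `depth` argument are hypothetical.

```python
import numpy as np

def kd_partition(X, depth):
    """Label each row of X with one of up to 2**depth cells.

    Cells come from recursive median splits that cycle through the
    features, as in a KD-tree. (Hypothetical helper for illustration.)
    """
    X = np.asarray(X, dtype=float)
    labels = np.zeros(X.shape[0], dtype=int)

    def split(idx, level, label):
        if level == depth or idx.size < 2:
            # Shift so that cells closed early never share a label.
            labels[idx] = label << (depth - level)
            return
        ax = level % X.shape[1]  # feature used at this level
        order = idx[np.argsort(X[idx, ax], kind="stable")]
        half = order.size // 2
        split(order[:half], level + 1, 2 * label)
        split(order[half:], level + 1, 2 * label + 1)

    split(np.arange(X.shape[0]), 0, 0)
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
labels = kd_partition(X, depth=3)  # 8 cells
cell_sizes = np.bincount(labels, minlength=8)
```

Note that this splits by rank, so identical points can land in different cells. That is exactly where macroscopically-occupied states cause trouble, and this sketch does nothing to handle it.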
If it's of any use to other people, grab it. If it's of general use, we
should write tests and integrate it into the scikit.
Gael