Discussion:
[Scikit-learn-general] cross-validation and indices=False
Joel Nothman
2013-07-30 23:14:15 UTC
Hi,

I'm sure you're all burnt out from what looks like a great sprint; thanks
for all that work and congratulations on the RC! So I apologise for the bad
timing.

I am wondering why there is a need to support the indices=False case in
cross_validation. Indices are superior in that they can be used with
np.take and with sparse matrices. And most of the standard cv
implementations output indices that are converted into boolean masks and
back to indices.
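
A minimal sketch of the mask/index round trip described above, assuming plain NumPy arrays. The array names and fold contents are made up for illustration. A boolean mask and its integer indices select the same rows, and np.take is given the integer form.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)          # toy data: 10 samples, 2 features
test_mask = np.zeros(10, dtype=bool)
test_mask[[1, 4, 7]] = True               # boolean mask for a hypothetical test fold

test_idx = np.flatnonzero(test_mask)      # mask -> indices
round_trip = np.zeros(10, dtype=bool)
round_trip[test_idx] = True               # indices -> mask

rows_from_mask = X[test_mask]                    # boolean indexing
rows_from_take = np.take(X, test_idx, axis=0)    # integer indexing via np.take
```

Both selections contain the same rows, so nothing is lost by keeping only the index form.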

Moreover, generic tools that take cv implementations as input need
to handle both cases (or make assumptions).
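
To illustrate the burden on generic tools, here is a sketch of the kind of normalising step they end up carrying. The helper name as_indices and its signature are hypothetical, not part of scikit-learn.

```python
import numpy as np

def as_indices(selection, n_samples):
    """Return integer indices for either a boolean mask or an index array.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    selection = np.asarray(selection)
    if selection.dtype == bool:
        # A mask must cover every sample exactly once.
        if selection.shape != (n_samples,):
            raise ValueError("boolean mask must have length n_samples")
        return np.flatnonzero(selection)
    # Already indices: just make sure the dtype is suitable for indexing.
    return selection.astype(np.intp, copy=False)

from_mask = as_indices(np.array([True, False, True, False]), n_samples=4)
from_idx = as_indices(np.array([0, 2]), n_samples=4)
```

If cv objects only ever yielded indices, a step like this would not be needed.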

What is the intention behind indices=False; why not deprecate it and
simplify the API and code? (And speed up indexing by using np.take.)

- Joel
Lars Buitinck
2013-07-31 06:08:46 UTC
Post by Joel Nothman
I am wondering why there is a need to support the indices=False case in
cross_validation. Indices are superior in that they can be used with np.take
and with sparse matrices. And most of the standard cv implementations output
indices that are converted into boolean masks and back to indices.
Moreover, generic tools that take cv implementations as input need
to handle both cases (or make assumptions).
What is the intention behind indices=False; why not deprecate it and
simplify the API and code? (And speed up indexing by using np.take.)
Funny, I was wondering the same thing yesterday. IIRC, we originally
used only masks and indices were added to please the sparse
matrix-pushing crowd (yours truly). Then safe_mask got introduced to
accept both at the consumer side.

Arguably, masks are easier to interpret, though, esp. in feature
selection code; you can multiply them with your coef_ before plotting
it to see which features are deactivated.
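
A small sketch of the mask-times-coefficients idea, assuming a fitted linear model and a feature-selection mask. The numbers below are made up and stand in for a real coef_ and support mask.

```python
import numpy as np

coef = np.array([0.8, -1.2, 0.3, 2.0, -0.5])            # made-up coefficients
support = np.array([True, False, True, True, False])    # made-up selection mask

masked_coef = coef * support             # deselected features become zero
support_idx = np.flatnonzero(support)    # the same selection expressed as indices
```

The mask lines up with coef_ position by position, which is what makes it convenient for plotting. The index form needs one extra step to get back to that shape.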

Do you have any timings for np.take?
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Joel Nothman
2013-07-31 06:17:50 UTC
Post by Lars Buitinck
Arguably, masks are easier to interpret, though, esp. in feature
selection code; you can multiply them with your coef_ before plotting
it to see which features are deactivated.
But that isn't really meaningful for cv.

Post by Lars Buitinck
Do you have any timings for np.take?
See http://wesmckinney.com/blog/?p=215
Ideally this is a bug that will disappear from numpy anyway -- for all I
know it already has -- so it should be less of the focus than a simplified
API.

- Joel
Alexandre Gramfort
2013-07-31 07:07:37 UTC
hi,

indeed we could stick to indices and use np.take whenever possible.

In [33]: A = np.random.randn(500, 500)
In [34]: idx = np.unique(np.random.randint(0, 499, 400))
In [35]: mask = np.zeros(500, dtype=bool)
In [36]: mask[idx] = True
In [37]: %timeit A[idx]
1000 loops, best of 3: 1.79 ms per loop
In [38]: %timeit A[mask]
1000 loops, best of 3: 1.77 ms per loop
In [39]: %timeit A.take(idx, axis=0)
10000 loops, best of 3: 103 us per loop
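
For anyone wanting to rerun this outside IPython, a standalone sketch of the same comparison follows. The seed, the repeat counts and the output format are arbitrary choices. No timings are quoted here because they depend on the NumPy version and the machine, and the gap shown above may not reproduce on a newer NumPy.

```python
import timeit

import numpy as np

rng = np.random.RandomState(0)              # fixed seed, arbitrary
A = rng.randn(500, 500)
idx = np.unique(rng.randint(0, 500, 400))   # sorted unique row indices
mask = np.zeros(500, dtype=bool)
mask[idx] = True

# The three expressions must select the same rows before timing them.
assert np.array_equal(A[idx], A[mask])
assert np.array_equal(A[idx], A.take(idx, axis=0))

candidates = [
    ("A[idx]", lambda: A[idx]),
    ("A[mask]", lambda: A[mask]),
    ("A.take(idx, axis=0)", lambda: A.take(idx, axis=0)),
]

timings = {}
for label, func in candidates:
    # Best of 3 runs of 200 calls each, reported per call.
    best = min(timeit.repeat(func, number=200, repeat=3)) / 200
    timings[label] = best
    print(f"{label:<22} {best * 1e6:8.1f} us per call")
```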

Alex
Gael Varoquaux
2013-07-31 07:11:31 UTC
Post by Joel Nothman
What is the intention behind indices=False;
Old design oversight (aka historical reasons).
Post by Joel Nothman
why not deprecate it and simplify the API and code? (And speed up
indexing by using np.take.)
+1! Making things simpler is always better.

G
Olivier Grisel
2013-07-31 07:53:57 UTC
+1 for deprecating boolean mask for CV as well.
Jaques Grobler
2013-07-31 08:08:20 UTC
Makes sense to me to deprecate here +1
Post by Olivier Grisel
+1 for deprecating boolean mask for CV as well.