[Scikit-learn-general] cross-validation and indices=False

Discussion:

Joel Nothman

2013-07-30 23:14:15 UTC

Hi,

I'm sure you're all burnt out from what looks like a great sprint; thanks
for all that work and congratulations on the RC! So I apologise for the bad
timing.

I am wondering why there is a need to support the indices=False case in
cross_validation. Indices are superior in that they can be used with
np.take and with sparse matrices. And most of the standard cv
implementations output indices that are converted into boolean masks and
back to indices.

Moreover, generic tools that take cv implementations as input need
to handle both cases (or make assumptions).

What is the intention behind indices=False; why not deprecate it and
simplify the API and code? (And speed up indexing by using np.take.)
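
A minimal sketch of the two representations discussed above, assuming plain NumPy arrays. The fold, the variable names and the `as_indices` helper are hypothetical illustrations, not part of scikit-learn:

```python
import numpy as np

n_samples = 10
X = np.arange(n_samples * 2).reshape(n_samples, 2)

# A test fold expressed as integer indices (hypothetical fold).
test_idx = np.array([2, 5, 7])

# The same fold expressed as a boolean mask over all samples.
test_mask = np.zeros(n_samples, dtype=bool)
test_mask[test_idx] = True

# Converting a mask back to (sorted) indices.
idx_from_mask = np.flatnonzero(test_mask)

# All three spellings select the same rows.
rows_fancy = X[test_idx]
rows_mask = X[test_mask]
rows_take = np.take(X, test_idx, axis=0)


def as_indices(selector):
    """Normalise a selector to integer indices (hypothetical helper).

    A generic consumer that accepts both forms needs a step like this
    before it can treat every cv output uniformly.
    """
    selector = np.asarray(selector)
    if selector.dtype == bool:
        return np.flatnonzero(selector)
    return selector
```

Note that a mask cannot express ordering or repeated samples, so the round trip is only lossless for sorted, unique indices.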

- Joel

Lars Buitinck

2013-07-31 06:08:46 UTC

Post by Joel Nothman
I am wondering why there is a need to support the indices=False case in
cross_validation. Indices are superior in that they can be used with np.take
and with sparse matrices. And most of the standard cv implementations output
indices that are converted into boolean masks and back to indices.
Moreover, generic tools that take cv implementations as input need
to handle both cases (or make assumptions).
What is the intention behind indices=False; why not deprecate it and
simplify the API and code? (And speed up indexing by using np.take.)

Funny, I was wondering the same thing yesterday. IIRC, we originally
used only masks and indices were added to please the sparse
matrix-pushing crowd (yours truly). Then safe_mask got introduced to
accept both at the consumer side.

Arguably, masks are easier to interpret, though, esp. in feature
selection code; you can multiply them with your coef_ before plotting
it to see which features are deactivated.
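
A rough sketch of the mask-times-coefficients idea above, with made-up values. `coef` stands in for an estimator's `coef_` and `support_mask` for a feature-selection mask; both are assumptions for illustration only:

```python
import numpy as np

# Hypothetical coefficient vector and feature-selection mask.
coef = np.array([0.5, -1.2, 0.0, 2.0, 0.3])
support_mask = np.array([True, False, True, True, False])

# With a mask, deactivated features are zeroed by a plain product.
masked_coef = coef * support_mask

# With indices, the same result needs an explicit scatter step.
support_idx = np.flatnonzero(support_mask)
masked_via_idx = np.zeros_like(coef)
masked_via_idx[support_idx] = coef[support_idx]
```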

Do you have any timings for np.take?

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

Joel Nothman

2013-07-31 06:17:50 UTC

Post by Joel Nothman
I am wondering why there is a need to support the indices=False case in
cross_validation. Indices are superior in that they can be used with np.take
and with sparse matrices. And most of the standard cv implementations output
indices that are converted into boolean masks and back to indices.
Moreover, generic tools that take cv implementations as input need
to handle both cases (or make assumptions).
What is the intention behind indices=False; why not deprecate it and
simplify the API and code? (And speed up indexing by using np.take.)

Post by Lars Buitinck
Funny, I was wondering the same thing yesterday. IIRC, we originally
used only masks and indices were added to please the sparse
matrix-pushing crowd (yours truly). Then safe_mask got introduced to
accept both at the consumer side.
Arguably, masks are easier to interpret, though, esp. in feature
selection code; you can multiply them with your coef_ before plotting
it to see which features are deactivated.

But that isn't really meaningful for cv.

Post by Lars Buitinck
Do you have any timings for np.take?

See http://wesmckinney.com/blog/?p=215
Ideally this is a bug that will disappear from numpy anyway -- for all I
know it already has -- so it should be less of the focus than a simplified
API.

- Joel

Alexandre Gramfort

2013-07-31 07:07:37 UTC

hi,

indeed we could stick to indices and use np.take whenever possible.

In [33]: A = np.random.randn(500, 500)
In [34]: idx = np.unique(np.random.randint(0, 499, 400))
In [35]: mask = np.zeros(500, dtype=bool)
In [36]: mask[idx] = True
In [37]: %timeit A[idx]
1000 loops, best of 3: 1.79 ms per loop
In [38]: %timeit A[mask]
1000 loops, best of 3: 1.77 ms per loop
In [39]: %timeit A.take(idx, axis=0)
10000 loops, best of 3: 103 us per loop
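
For anyone wanting to repeat the comparison outside IPython, here is a standalone sketch of the same measurement using the standard `timeit` module. The array sizes follow the session above; the seed, repeat counts and the `results` dictionary are arbitrary choices, and the numbers will depend on the NumPy version and machine, so the gap shown above may not reproduce:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))

# Sorted, unique row indices and the equivalent boolean mask.
idx = np.unique(rng.integers(0, 500, size=400))
mask = np.zeros(500, dtype=bool)
mask[idx] = True

# Sanity check: all three spellings select the same rows.
assert np.array_equal(A[idx], A[mask])
assert np.array_equal(A[idx], A.take(idx, axis=0))

candidates = {
    "A[idx]": lambda: A[idx],
    "A[mask]": lambda: A[mask],
    "A.take(idx, axis=0)": lambda: A.take(idx, axis=0),
}

n_calls = 100
results = {}
for label, func in candidates.items():
    # Best of three runs, reported as seconds per call.
    best = min(timeit.repeat(func, number=n_calls, repeat=3))
    results[label] = best / n_calls
    print(f"{label:>22}: {results[label] * 1e6:.1f} us per call")
```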

Alex

Gael Varoquaux

2013-07-31 07:11:31 UTC

Post by Joel Nothman
What is the intention behind indices=False;

Old design oversight (aka historical reasons).

Post by Joel Nothman
why not deprecate it and simplify the API and code? (And speed up
indexing by using np.take.)

+1! Making things simpler is always better.

G

Olivier Grisel

2013-07-31 07:53:57 UTC

+1 for deprecating boolean mask for CV as well.

Jaques Grobler

2013-07-31 08:08:20 UTC

Makes sense to me to deprecate here +1

Post by Olivier Grisel
+1 for deprecating boolean mask for CV as well.