Discussion:
[SciPy-Dev] Proposal for Scikit-Signal - a SciPy toolbox for signal processing
Jaidev Deshpande
2011-12-26 17:39:29 UTC
Permalink
Hi

I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.

So with this list let us start working out what the project should be like.

For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151

Here's the link to my SciPy talk:
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy

I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.

Thanks

PS: I'm ccing a few people who might already be on the scipy-dev list.
Sorry for the inconvenience.
a***@ajackson.org
2011-12-27 00:12:30 UTC
Permalink
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
I've been playing with Empirical Mode Decomposition, and coded it up in
numpy, and it is pretty neat, but I do believe that NASA has patented it, which
would probably preclude distributing it in a scikit, without a lot of legal
effort.

From the Nasa website :
"An example of successfully brokering NASA technology through a no-cost
brokerage partnership was the exclusive license for the Hilbert-Huang
Transform, composed of 10 U.S. patents and one domestic patent application,
which was part of a lot auctioned by Ocean Tomo Federal Services LLC, in
October 2008."

Alan
--
-----------------------------------------------------------------------
| Alan K. Jackson | To see a World in a Grain of Sand |
| ***@ajackson.org | And a Heaven in a Wild Flower, |
| www.ajackson.org | Hold Infinity in the palm of your hand |
| Houston, Texas | And Eternity in an hour. - Blake |
-----------------------------------------------------------------------
Jaidev Deshpande
2011-12-27 17:55:59 UTC
Permalink
Hi Alan
Post by a***@ajackson.org
I've been playing with Empirical Mode Decomposition, and coded it up in
numpy, and it is pretty neat, but I do believe that NASA has patented it, which
would probably preclude distributing it in a scikit, without a lot of legal
effort.
EMD and HHT are nowhere in the scikit plan right now (unless others
decide to put it up), I simply mentioned it because it's one of my
interests.

Let's start talking about a basic signal processing scikit first. I
guess there will be enough room for adaptive methods like the HHT
later.

Regards
Alexandre Gramfort
2011-12-28 16:33:24 UTC
Permalink
Hi all,

I think a scikit-signal would be a nice project in the scipy ecosystem.
My gut feeling is that there is already a lot of great pieces of code for
signal processing in Python but it's too fragmented. Most of it is in
scipy.signal but one may need some pieces in scipy.ndimage and
external projects like for example for wavelets. I would be neat to have
a main entry point for signal processing in Python. As demonstrated
with scikit-learn a small and great project can emerge from scipy/numpy.
The benefit is that the entry cost can be much lower for a developer
compared to contributing directly to scipy and a small project can release
more often and eventually back port new stuff from scipy core.

As I already said to Jaydev, I think one should start by defining the
scope of such a project and list/review existing codes to bootstrap
the project.

Best,
Alex



On Tue, Dec 27, 2011 at 6:55 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi Alan
Post by a***@ajackson.org
I've been playing with Empirical Mode Decomposition, and coded it up in
numpy, and it is pretty neat, but I do believe that NASA has patented it, which
would probably preclude distributing it in a scikit, without a lot of legal
effort.
EMD and HHT are nowhere in the scikit plan right now (unless others
decide to put it up), I simply mentioned it because it's one of my
interests.
Let's start talking about a basic signal processing scikit first. I
guess there will be enough room for adaptive methods like the HHT
later.
Regards
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
David Cournapeau
2011-12-28 18:16:14 UTC
Permalink
Hi Jaidev,

On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
I think it would be a good addition to the ecosystem. We are for
example missing a lot of core algorithms even for linear signal
processing, and scipy.signal itself would benefit from some
refactoring.

I myself started something (the talkbox scikit), but realistically, I
won't have time to work on a full toolbox, so we can consolidate here.
I would be willing to work on merging what I already have in what you
have in mind:
- Linear prediction coding with Levinson Durbin implementation
- A start of periodogram function

I could spend time to implement a few more things like MUSIC/PENCIL,
and some basic matching pursuit algorithms.

cheers,

David
j***@gmail.com
2011-12-28 18:46:43 UTC
Permalink
Post by David Cournapeau
Hi Jaidev,
On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
I think it would be a good addition to the ecosystem. We are for
example missing a lot of core algorithms even for linear signal
processing, and scipy.signal itself would benefit from some
refactoring.
I myself started something (the talkbox scikit), but realistically, I
won't have time to work on a full toolbox, so we can consolidate here.
I would be willing to work on merging what I already have in what you
 - Linear prediction coding with Levinson Durbin implementation
 - A start of periodogram function
 I could spend time to implement a few more things like MUSIC/PENCIL,
and some basic matching pursuit algorithms.
depending on your scope, nitime will also be interesting which has
much more in terms of multi-dimensional signals, e.g. a multivariate
Levinson Durbin and various cross-spectral functions.

statsmodels has quite a bit of time series analysis now, but the focus
and datasets are pretty different, although I benefited from reading
the scipy.signal, talkbox and matplotlib codes for some basic tools.

Josef
Post by David Cournapeau
cheers,
David
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Alexandre Gramfort
2011-12-28 18:53:20 UTC
Permalink
Post by j***@gmail.com
depending on your scope, nitime will also be interesting which has
much more in terms of multi-dimensional signals, e.g. a multivariate
Levinson Durbin and various cross-spectral functions.
I fully agree. nitime hides some great pieces of code like slepian tapers and
various cross-sprectrum/coherence tools. I feel this code should be visible
to a broader audience which would favor cross-fertilization between
scientific domains and help nitime reach a critical mass of
users/contributors.

Alex
Neal Becker
2011-12-29 14:52:47 UTC
Permalink
Post by David Cournapeau
Hi Jaidev,
On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-
a34360a2cc1f/view/search/tag%3Ascipy
Post by David Cournapeau
Post by Jaidev Deshpande
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
I think it would be a good addition to the ecosystem. We are for
example missing a lot of core algorithms even for linear signal
processing, and scipy.signal itself would benefit from some
refactoring.
I myself started something (the talkbox scikit), but realistically, I
won't have time to work on a full toolbox, so we can consolidate here.
I would be willing to work on merging what I already have in what you
- Linear prediction coding with Levinson Durbin implementation
- A start of periodogram function
I could spend time to implement a few more things like MUSIC/PENCIL,
and some basic matching pursuit algorithms.
cheers,
David
I have code for periodogram that you can use. I'm using my own fft wrapper
around fftw (via pyublas), but you can easily modify this to use some other fft
library.
David Cournapeau
2012-01-03 07:14:40 UTC
Permalink
On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
Thanks
PS: I'm ccing a few people who might already be on the scipy-dev list.
Sorry for the inconvenience.
Jaidev,

at this point, I think we should just start with actual code. Could
you register a scikit-signal organization on github ? I could then
start populating a project skeleton, and then everyone can start
adding actual code

regards,

David
Warren Weckesser
2012-01-03 08:14:59 UTC
Permalink
Post by David Cournapeau
On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be
like.
Post by Jaidev Deshpande
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
Post by Jaidev Deshpande
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
Thanks
PS: I'm ccing a few people who might already be on the scipy-dev list.
Sorry for the inconvenience.
Jaidev,
at this point, I think we should just start with actual code. Could
you register a scikit-signal organization on github ? I could then
start populating a project skeleton, and then everyone can start
adding actual code
This sounds like a great idea.

Given that the 'learn', 'image' and 'statsmodels' projects have dropped (or
will soon drop) the 'scikits' namespace, should the 'signal' project not
bother using the 'scikits' namespace? Maybe you've already thought about
this, but if not, it is something to consider.

Warren
Alexandre Gramfort
2012-01-03 08:18:38 UTC
Permalink
Post by Warren Weckesser
Given that the 'learn', 'image' and 'statsmodels' projects have dropped (or
will soon drop) the 'scikits' namespace, should the 'signal' project not
bother using the 'scikits' namespace?  Maybe you've already thought about
this, but if not, it is something to consider.
I would still vote for sksignal as import name (like sklearn) and
scikit-signal for the brand name.

It's convient to go sk->tab to get the scikit's list with ipython autocomplete

Alex
Jaidev Deshpande
2012-01-03 08:58:34 UTC
Permalink
Hi David,
Could you register a scikit-signal organization on github ? I could then
start populating a project skeleton, and then everyone can start
adding actual code
The organization's up at https://github.com/scikit-signal

I've never done this before, by the way. So just let me know if you
want any changes. Also, who'd like to be owners?

Thanks
David Cournapeau
2012-01-03 20:21:13 UTC
Permalink
On Tue, Jan 3, 2012 at 8:58 AM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi David,
Could you register a scikit-signal organization on github ? I could then
start populating a project skeleton, and then everyone can start
adding actual code
The organization's up at https://github.com/scikit-signal
I've never done this before, by the way. So just let me know if you
want any changes. Also, who'd like to be owners?
My github account: cournape. I will start a scikit-learn package as
soon as you give me the privileges,

cheers,

David
Travis Oliphant
2012-01-03 09:00:09 UTC
Permalink
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.

I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.

I disagree with Gael that there should be a scikits-signal package. There are too many scikits already that should just be scipy projects (with scipy available in modular form). In my mind, almost every scikits- project should just be a scipy- project. There really was no need for the scikits namespace in the first place.

Signal processing was the main thing I started writing SciPy for in the first place. These are the tools that made Matlab famous and I've always wanted Python to have the best-of-breed algorithms for. To me SciPy as a project has failed if general signal processing tools are being written in other high-level packages. I've watched this trend away from common development in SciPy in image processing, machine learning, optimization, and differential equation solution with some sadness over the past several years. Frankly, it makes me want to just pull out all of the individual packages I wrote that originally got pulled together into SciPy into separate projects and develop them individually from there. Leaving it to packaging and distribution issues to pull them together again.

Hmm.. perhaps that is not such a bad idea. What do others think? What should really be in core SciPy and what should be in other packages? Perhaps it doesn't matter now and SciPy should just be maintained as it is with new features added in other packages? A lot has changed in the landscape since Pearu, Eric, and I released SciPy. Many people have contributed to the individual packages --- but the vision has waned for the project has a whole. The SciPy community is vibrant and alive, but the SciPy project does not seem to have a coherent goal. I'd like to see that changed this year if possible.

In working on SciPy for .NET, I did a code.google search for open source packages that were relying on scipy imports. What I found was that almost all cases of scipy were: linalg, optimize, stats, special. It makes the case that scipy as a packages should be limited to that core set of tools (and their dependencies). All the other modules should just be distributed as separate projects / packages.

What is your experience? what packages in scipy do you use?

Thanks,

-Travis
Post by David Cournapeau
On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande
Post by Jaidev Deshpande
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the
Hilbert-Huang Transform that I was working on. The HHT is a method
used as an alternative to Fourier and Wavelet analyses of nonlinear
and nonstationary data. Following the talk Gael Varoquaux said that
there's room for a separate scikit for signal processing. He also gave
a lightning talk about bootstrapping a SciPy community project soon
after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the
link to a gist he made about it - https://gist.github.com/1433151
http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
I personally am researching nonlinear and nonstationary signal
processing, I'd love to know what others can bring to this project.
Also, let's talk about the limitations of the current signal
processing tools available in SciPy and other scikits. I think there's
a lot of documentation to be worked out, and there is also a lack of
physically meaningful examples in the documentation.
Thanks
PS: I'm ccing a few people who might already be on the scipy-dev list.
Sorry for the inconvenience.
Jaidev,
at this point, I think we should just start with actual code. Could
you register a scikit-signal organization on github ? I could then
start populating a project skeleton, and then everyone can start
adding actual code
regards,
David
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Christopher Felton
2012-01-03 11:39:05 UTC
Permalink
Post by Travis Oliphant
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
I disagree with Gael that there should be a scikits-signal package. There are too many scikits already that should just be scipy projects (with scipy available in modular form). In my mind, almost every scikits- project should just be a scipy- project. There really was no need for the scikits namespace in the first place.
Signal processing was the main thing I started writing SciPy for in the first place. These are the tools that made Matlab famous and I've always wanted Python to have the best-of-breed algorithms for. To me SciPy as a project has failed if general signal processing tools are being written in other high-level packages. I've watched this trend away from common development in SciPy in image processing, machine learning, optimization, and differential equation solution with some sadness over the past several years. Frankly, it makes me want to just pull out all of the individual packages I wrote that originally got pulled together into SciPy into separate projects and develop them individually from there. Leaving it to packaging and distribution issues to pull them together again.
Hmm.. perhaps that is not such a bad idea. What do others think? What should really be in core SciPy and what should be in other packages? Perhaps it doesn't matter now and SciPy should just be maintained as it is with new features added in other packages? A lot has changed in the landscape since Pearu, Eric, and I released SciPy. Many people have contributed to the individual packages --- but the vision has waned for the project has a whole. The SciPy community is vibrant and alive, but the SciPy project does not seem to have a coherent goal. I'd like to see that changed this year if possible.
In working on SciPy for .NET, I did a code.google search for open source packages that were relying on scipy imports. What I found was that almost all cases of scipy were: linalg, optimize, stats, special. It makes the case that scipy as a packages should be limited to that core set of tools (and their dependencies). All the other modules should just be distributed as separate projects / packages.
What is your experience? what packages in scipy do you use?
Thanks,
-Travis
My experience, I have not used scikits and I mainly use the scipy.signal
package. I don't have a strong opinion if .signal should be part of the
core scipy or an independent package. But it seems that there should be
one package! And hopefully, one development effort. In general
extending and enhancing the current .signal (regardless if it is part of
scipy or not) not fragmenting the signal processing related code across
multiple packages.

Regards,
Chris
Robert Kern
2012-01-03 11:47:55 UTC
Permalink
I don't know if this has already been discussed or not.   But, I really don't understand the reasoning behind "yet-another-project" for signal processing.   That is the whole-point of the signal sub-project under the scipy namespace.   Why not just develop there?  Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace.  I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place.   If you don't want to do that, then just pick a top-level name and use it.
I disagree with Gael that there should be a scikits-signal package.   There are too many scikits already that should just be scipy projects (with scipy available in modular form).    In my mind, almost every scikits- project should just be a scipy- project.   There really was no need for the scikits namespace in the first place.
To be fair, the idea of the scikits namespace formed when the
landscape was quite different and may no longer be especially
relevant, but it had its reasons. Some projects can't go into the
monolithic scipy-as-it-is for license, build, or development cycle
reasons. Saying that scipy shouldn't be monolithic then is quite
reasonable by itself, but no one has stepped up to do the work (I took
a stab at it once). It isn't a reasonable response to someone who
wants to contribute something. Enthusiasm isn't a fungible quantity.
Someone who just wants to contribute his wrapper for whatever and is
told to first go refactor a mature package with a lot of users is
going to walk away. As they should.

Instead, we tried to make it easier for people to contribute their
code to the Python world. At the time, project hosting was limited, so
Enthought's offer of sharing scipy's SVN/Trac/mailing list
infrastructure was useful. Now, not so much.

At the time, namespace packages seemed like a reasonable technology.
Experience both inside and outside scikits has convinced most of us
otherwise. One thing that does not seemed to have changed is that some
people still want some kind of branding to demonstrate that their
package belongs to this community. We used the name "scikits" instead
of "scipy" because we anticipated confusion about what was in
scipy-the-monolithic-package and what was available in separate
packages (and since we were using namespace packages, technical issues
with namespace packages and the non-empty scipy/__init__.py file).

You don't say what you think "being a scipy- project" means, so it's
hard to see what you are proposing as an alternative.
Signal processing was the main thing I started writing SciPy for in the first place.   These are the tools that made Matlab famous and I've always wanted Python to have the best-of-breed algorithms for.     To me SciPy as a project has failed if general signal processing tools are being written in other high-level packages.   I've watched this trend away from common development in SciPy in image processing, machine learning, optimization, and differential equation solution with some sadness over the past several years.    Frankly, it makes me want to just pull out all of the individual packages I wrote that originally got pulled together into SciPy into separate projects and develop them individually from there.   Leaving it to packaging and distribution issues to pull them together again.
Hmm.. perhaps that is not such a bad idea.   What do others think?  What should really be in core SciPy and what should be in other packages?   Perhaps it doesn't matter now and SciPy should just be maintained as it is with new features added in other packages?   A lot has changed in the landscape since Pearu, Eric, and I released SciPy.    Many people have contributed to the individual packages --- but the vision has waned for the project has a whole.     The SciPy community is vibrant and alive, but the SciPy project does not seem to have a coherent goal.   I'd like to see that changed this year if possible.
In working on SciPy for .NET, I did a code.google search for open source packages that were relying on scipy imports.   What I found was that almost all cases of scipy were:  linalg, optimize, stats, special.   It makes the case that scipy as a packages should be limited to that core set of tools (and their dependencies).   All the other modules should just be distributed as separate projects / packages.
As you say, the landscape has changed significantly. Monolithic
packages are becoming less workable as the number of things we want to
build/wrap is increasing. Building multiple packages that you want has
also become marginally easier. At least, easier than trying to build a
single package that wraps everything you don't want. It was a lot
easier to envision everything under the sun being in scipy proper back
in 2000.

I think it would be reasonable to remake scipy as a slimmed-down core
package (with deprecated compatibility stubs for a while) with a
constellation of top-level packages around it. We could open up the
github.com/scipy organization to those other projects who want that
kind of branding, though that still does invite the potential
confusion that we tried to avoid with the "scikits" name. That said,
since we don't need to fit it into a valid namespace package name,
just using the branding of calling them a "scipy toolkit" or "scipy
addon" would be fine. Breaking up scipy might help the individual
packages develop and release at their own pace.

But mostly, I would like to encourage the idea that one should not be
sad or frustrated when people contribute open source code to our
community just because it's not in scipy or any particular package (or
for that matter using the "right" DVCS). The important thing is that
it is available to the Python community and that it works with the
other tools that we have (i.e. talks with numpy). If your emotional
response is anything but gratitude, then it's unworthy of you.
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
Travis Oliphant
2012-01-03 16:39:26 UTC
Permalink
Post by Robert Kern
Post by Travis Oliphant
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
I disagree with Gael that there should be a scikits-signal package. There are too many scikits already that should just be scipy projects (with scipy available in modular form). In my mind, almost every scikits- project should just be a scipy- project. There really was no need for the scikits namespace in the first place.
To be fair, the idea of the scikits namespace formed when the
landscape was quite different and may no longer be especially
relevant, but it had its reasons. Some projects can't go into the
monolithic scipy-as-it-is for license, build, or development cycle
reasons. Saying that scipy shouldn't be monolithic then is quite
reasonable by itself, but no one has stepped up to do the work (I took
a stab at it once). It isn't a reasonable response to someone who
wants to contribute something. Enthusiasm isn't a fungible quantity.
Someone who just wants to contribute his wrapper for whatever and is
told to first go refactor a mature package with a lot of users is
going to walk away. As they should.
This is an excellent point. I think SciPy suffers from the same issues that also affect the Python standard library. Like any organization, there is a dynamic balance between "working together" and "communication overhead" / dealing with legacy issues. I'm constantly grateful and inspired by the code that gets written and contributed by individuals. I would just like to see all of this code get more traction (and simple entry points are key for that). It's the main reason for my desire to see a Foundation that can sponsor the community.

My previously mentioned sadness comes from my inability to contribute meaningfully over the past couple of years, and the missing full time effort that would help keep the SciPy project more cohesive. I'm hopeful this can change either directly or indirectly this year.

Just to be clear, any sadness and frustration I feel is not with anyone in the community of people who are spending their free time writing code and contributing organizational efforts to making SciPy (both the package and the community) what it is. My frustration is directed squarely at myself for not being able to do more, both personally and in funding and sponsoring more.

In the end, I would just like to see more resources devoted to these efforts.

-Travis
Gael Varoquaux
2012-01-03 15:44:22 UTC
Permalink
Hi Travis,

It is good that you are asking these questions. I think that they are
important. Let me try to give my view on some of the points you raise.
Post by Travis Oliphant
There are too many scikits already that should just be scipy projects
I used to think pretty much as you did: I don't want to have to depend on
too many packages. In addition we are a community, so why so many
packages? My initial vision when investing in the scikit-learn was that
we would merge it back to scipy after a while. The dynamic of the project
has changed a bit my way of seeing things, and I now think that it is a
good thing to have scikits-like packages that are more specialized than
scipy for the following reasons:

1. Development is technically easier in smaller packages

A developer working on a specific package does not need to tackle
complexity of the full scipy suite. Building can be made easier, as scipy
must (for good reasons) depend on Fortran and C++ packs. It is well known
that the complexity of developing a project grows super-linearly with the
number of lines of code.

It's also much easier to achieve short release cycles. Short
release cycles are critical to the dynamic of a community-driven
project (and I'd like to thanks our current release manager, Ralf
Gommers, for his excellent work).

2. Narrowing the application domain helps developers and users

It is much easier to make entry points, in the code and in the
documentation, with a given application in mind. Also, best practices and
conventions may vary between communities. While this is (IMHO) one of the
tragedies of contemporary science, it such domain specialization
helps people feeling comfortable.

Computational trade offs tend to be fairly specific to a given
context. For instance machine learning will more often be interested in
datasets with a large number of features and a (comparatively) small
number of samples, whereas in statistics it is the opposite. Thus the
same algorithm might be implemented differently. Catering for all needs
tends to make the code much more complex, and may confuse the user by
presenting him too many options.

Developers cannot be expert in everything. If I specialize in machine
learning, and follow the recent developments in literature, chances are
that I do not have time to competitive in numerical integration. Having
too wide a scope in a project means that each developer understands well
a small fraction of the code. It makes things really hard for the release
manager, but also for day to day work, e.g. what to do with a new broken
test.

3. It is easier to build an application-specific community

An application specific library is easier to brand. One can tailor a
website, a user manual, and conference presentation or papers to an
application. As a result the project gains visibility in the community
of scientists and engineers it target.

Also, having more focused mailing lists helps building enthusiasm, a they
have less volume, and are more focused on on questions that people
are interested in.

Finally, a sad but true statement, is that people tend to get more credo
when working on an application-specific project than on a core layer.
Similarly, it is easier for me to get credit to fund development of an
application-specific project.

On a positive note, I would like to stress that I think that the
scikit-learn has had a general positive impact on the scipy ecosystem,
including for those who do not use it, or who do not care at all about
machine learning. First, it is drawing more users in the community, and
as a result, there is more interest and money flying around. But more
importantly, when I look at the latest release of scipy, I see many of
the new contributors that are also scikit-learn contributors (not only
Fabian). This can be partly explained by the fact that getting involved
in the scikit-learn was an easy and high-return-on-investment move for
them, but they quickly grew to realize that the base layer could be
improved. We have always had the vision to push in scipy any improvement
that was general-enough to be useful across application domains.
Remember, David Cournapeau was lured in the scipy business by working on
the original scikit-learn.
Post by Travis Oliphant
Frankly, it makes me want to pull out all of the individual packages I
wrote that originally got pulled together into SciPy into separate
projects and develop them individually from there.
What you are proposing is interesting, that said, I think that the
current status quo with scipy is a good one. Having a core collection of
numerical tools is, IMHO, a key element of the Python scientific
community for two reasons:

* For the user, knowing that he will find the answer to most of his
simple questions in a single library makes it easy to start. It also
makes it easier to document.

* Different packages need to rely on a lot of common generic tools.
Linear algebra, sparse linear algebra, simple statistics and signal
processing, simple black-box optimizer, interpolation ND-image-like
processing. Indeed You ask what package in scipy do people use.
Actually, in scikit-learn we use all sub-packages apart from
'integrate'. I checked, and we even use 'io' in one of the examples.
Any code doing high-end application-specific numerical computing will
need at least a few of the packages of scipy. Of course, a package
may need an optimizer tailored to a specific application, in which
case they will roll there own, an this effort might be duplicated a
bit. But having the common core helps consolidating the ecosystem.

So the setup that I am advocating is a core library, with many other
satellite packages. Or rather a constellation of packages that use each
other rather then a monolithic universe. This is a common strategy of
breaking a package up into parts that can be used independently to make
them lighter and hopefully ease the development of the whole. For
instance, this is what was done to the ETS (Enthought Tool Suite). And we
have all seen this strategy gone bad, for instance in the situation of
'dependency hell', in which case all packages start depending on each
other, the installation becomes an issue and there is a grid lock of
version-compatibility bugs. This is why any such ecosystem must have an
almost tree-like structure in its dependency graph. Some packages must be
on top of the graph, more 'core' than others, and as we descend the
graph, packages can reduce their dependencies. I think that we have more
or less this situation with scipy, and I am quite happy about it.

Now I hear your frustration when this development happens a bit in the
wild with no visible construction of an ecosystem. This ecosystem does
get constructed via the scipy mailing-lists, conferences, and in general
the community, but it may not be very clear to the external observer. One
reason why my group decided to invest in the scikit-learn was that it was
the learning package that seemed the closest in terms of code and
community connections. This was the virtue of the 'scikits' branding. For
technical reasons, the different scikits have started getting rid of this
namespace in the module import. You seem to think that the branding name
'scikits' does not reflect accurately the fact that they are tight
members of the scipy constellationhile I must say that I am not a huge
fan of the name 'scikits', we have now invested in it, and I don't think
that we can easily move away.

If the problem is a branding issue, it may be partly addressed with
appropriate communication. A set of links across the different web pages
of the ecosystem, and a central document explaining the relationships
between the packages might help. But this idea is not completely new and
it simply is waiting for someone to invest time in it. For instance,
there was the project of reworking the scipy.org homepage.

Another important problem is the question of what sits 'inside' this
collection of tools, and what is outside. The answer to this question
will pretty much depend on who you ask. In practice, for the end user, it
is very much conditioned by what meta-package they can download. EPD,
Sage, Python(x,y), and many others give different answers.

To conclude, I'd like to stress that, in my eyes, what really matters is
a solution that gives us a vibrant community, with a good production of
quality code and documentation. I think that the current set of small
projects makes it easier to gather developers and users, and that it
work well as long as they talk to each other and do not duplicate too
much each-other's functionality. If on top of that they are BSD-licensed
and use numpy as their data model, I am a happy man.

What I am pushing for is a Bazar-like development model, in which it is
easy for various approaches answering different needs to develop in
parallel with different compromises. In such a context, I think that
Jaidev could kick start a successful and useful scikit-signal. Hopefully
this would not preclude improvements to the docs, examples, and existing
code in scipy.signal.

Sorry for the long post, and thank you for reading.

Gael
Travis Oliphant
2012-01-04 06:37:23 UTC
Permalink
Hi Gael,

Thanks for your email. I appreciate the detailed response. Please don't mis-interpret my distaste for the scikit namespace as anything more than organizational. I'm very impressed with most of the scikits themselves: scikit-learn being a particular favorite. It is very clear to most people that smaller teams and projects is useful for diverse collaboration and very effective to involve more people in development. This is all very good and I'm very encouraged by this development. Even in the SciPy package itself, active development happens on only a few packages which have received attention from small teams.

Of course, the end user wants integration so the more packages exist the more we need tools like EPD, ActivePython, Python(X,Y), and Sage (and corresponding repositories like CRAN). The landscape is much better in this direction than it was earlier, but packaging and distribution is still a major weak-point in Python. I think the scientific computing community should continue to just develop it's own packaging solutions. I've been a big fan of David Cournapeau's work in this area (bento being his latest effort).

Your vision of a bazaar model is a good one. I just think we need to get scipy itself more into that model. I agree it's useful to have a core-set of common functionality, but I am quite in favor of moving to a more tight-knit core for the main scipy package with additional scipy-*named* packages (e.g. scipy-odr), etc. These can install directly into the scipy package infrastructure (or use whatever import mechanisms the distributions desire). This move to more modular packages for SciPy itself, has been in my mind for a long time which is certainly why I see the scikits name-space as superfluous. But, I understand that branding means something.

So, my (off the top of my head) take on what should be core scipy is:

fftpack
stats
io
special
optimize]
linalg
lib.blas
lib.lapack
misc

I think the other packages should be maintained, built and distributed as

scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave (actually I think weave should be installed separately and/or merged with other foreign code integration tools like fwrap, f2py, etc.)

Then, we could create a scipy superpack to install it all together. What issues do people see with a plan like this?

Obviously it takes time and effort to do this. But, I'm hoping to find time or sponsor people who will have time to do this work. Thus, I'd like to have the conversation to find out what people think *should* be done. There also may be students looking for a way to get involved or people interested in working on Google Summer of Code projects.

Thanks,

-Travis
Post by Gael Varoquaux
Hi Travis,
It is good that you are asking these questions. I think that they are
important. Let me try to give my view on some of the points you raise.
Post by Travis Oliphant
There are too many scikits already that should just be scipy projects
I used to think pretty much as you did: I don't want to have to depend on
too many packages. In addition we are a community, so why so many
packages? My initial vision when investing in the scikit-learn was that
we would merge it back to scipy after a while. The dynamic of the project
has changed a bit my way of seeing things, and I now think that it is a
good thing to have scikits-like packages that are more specialized than
1. Development is technically easier in smaller packages
A developer working on a specific package does not need to tackle
complexity of the full scipy suite. Building can be made easier, as scipy
must (for good reasons) depend on Fortran and C++ packs. It is well known
that the complexity of developing a project grows super-linearly with the
number of lines of code.
It's also much easier to achieve short release cycles. Short
release cycles are critical to the dynamic of a community-driven
project (and I'd like to thanks our current release manager, Ralf
Gommers, for his excellent work).
2. Narrowing the application domain helps developers and users
It is much easier to make entry points, in the code and in the
documentation, with a given application in mind. Also, best practices and
conventions may vary between communities. While this is (IMHO) one of the
tragedies of contemporary science, it such domain specialization
helps people feeling comfortable.
Computational trade offs tend to be fairly specific to a given
context. For instance machine learning will more often be interested in
datasets with a large number of features and a (comparatively) small
number of samples, whereas in statistics it is the opposite. Thus the
same algorithm might be implemented differently. Catering for all needs
tends to make the code much more complex, and may confuse the user by
presenting him too many options.
Developers cannot be expert in everything. If I specialize in machine
learning, and follow the recent developments in literature, chances are
that I do not have time to competitive in numerical integration. Having
too wide a scope in a project means that each developer understands well
a small fraction of the code. It makes things really hard for the release
manager, but also for day to day work, e.g. what to do with a new broken
test.
3. It is easier to build an application-specific community
An application specific library is easier to brand. One can tailor a
website, a user manual, and conference presentation or papers to an
application. As a result the project gains visibility in the community
of scientists and engineers it target.
Also, having more focused mailing lists helps building enthusiasm, a they
have less volume, and are more focused on on questions that people
are interested in.
Finally, a sad but true statement, is that people tend to get more credo
when working on an application-specific project than on a core layer.
Similarly, it is easier for me to get credit to fund development of an
application-specific project.
On a positive note, I would like to stress that I think that the
scikit-learn has had a general positive impact on the scipy ecosystem,
including for those who do not use it, or who do not care at all about
machine learning. First, it is drawing more users in the community, and
as a result, there is more interest and money flying around. But more
importantly, when I look at the latest release of scipy, I see many of
the new contributors that are also scikit-learn contributors (not only
Fabian). This can be partly explained by the fact that getting involved
in the scikit-learn was an easy and high-return-on-investment move for
them, but they quickly grew to realize that the base layer could be
improved. We have always had the vision to push in scipy any improvement
that was general-enough to be useful across application domains.
Remember, David Cournapeau was lured in the scipy business by working on
the original scikit-learn.
Post by Travis Oliphant
Frankly, it makes me want to pull out all of the individual packages I
wrote that originally got pulled together into SciPy into separate
projects and develop them individually from there.
What you are proposing is interesting, that said, I think that the
current status quo with scipy is a good one. Having a core collection of
numerical tools is, IMHO, a key element of the Python scientific
* For the user, knowing that he will find the answer to most of his
simple questions in a single library makes it easy to start. It also
makes it easier to document.
* Different packages need to rely on a lot of common generic tools.
Linear algebra, sparse linear algebra, simple statistics and signal
processing, simple black-box optimizer, interpolation ND-image-like
processing. Indeed You ask what package in scipy do people use.
Actually, in scikit-learn we use all sub-packages apart from
'integrate'. I checked, and we even use 'io' in one of the examples.
Any code doing high-end application-specific numerical computing will
need at least a few of the packages of scipy. Of course, a package
may need an optimizer tailored to a specific application, in which
case they will roll there own, an this effort might be duplicated a
bit. But having the common core helps consolidating the ecosystem.
So the setup that I am advocating is a core library, with many other
satellite packages. Or rather a constellation of packages that use each
other rather then a monolithic universe. This is a common strategy of
breaking a package up into parts that can be used independently to make
them lighter and hopefully ease the development of the whole. For
instance, this is what was done to the ETS (Enthought Tool Suite). And we
have all seen this strategy gone bad, for instance in the situation of
'dependency hell', in which case all packages start depending on each
other, the installation becomes an issue and there is a grid lock of
version-compatibility bugs. This is why any such ecosystem must have an
almost tree-like structure in its dependency graph. Some packages must be
on top of the graph, more 'core' than others, and as we descend the
graph, packages can reduce their dependencies. I think that we have more
or less this situation with scipy, and I am quite happy about it.
Now I hear your frustration when this development happens a bit in the
wild with no visible construction of an ecosystem. This ecosystem does
get constructed via the scipy mailing-lists, conferences, and in general
the community, but it may not be very clear to the external observer. One
reason why my group decided to invest in the scikit-learn was that it was
the learning package that seemed the closest in terms of code and
community connections. This was the virtue of the 'scikits' branding. For
technical reasons, the different scikits have started getting rid of this
namespace in the module import. You seem to think that the branding name
'scikits' does not reflect accurately the fact that they are tight
members of the scipy constellationhile I must say that I am not a huge
fan of the name 'scikits', we have now invested in it, and I don't think
that we can easily move away.
If the problem is a branding issue, it may be partly addressed with
appropriate communication. A set of links across the different web pages
of the ecosystem, and a central document explaining the relationships
between the packages might help. But this idea is not completely new and
it simply is waiting for someone to invest time in it. For instance,
there was the project of reworking the scipy.org homepage.
Another important problem is the question of what sits 'inside' this
collection of tools, and what is outside. The answer to this question
will pretty much depend on who you ask. In practice, for the end user, it
is very much conditioned by what meta-package they can download. EPD,
Sage, Python(x,y), and many others give different answers.
To conclude, I'd like to stress that, in my eyes, what really matters is
a solution that gives us a vibrant community, with a good production of
quality code and documentation. I think that the current set of small
projects makes it easier to gather developers and users, and that it
work well as long as they talk to each other and do not duplicate too
much each-other's functionality. If on top of that they are BSD-licensed
and use numpy as their data model, I am a happy man.
What I am pushing for is a Bazar-like development model, in which it is
easy for various approaches answering different needs to develop in
parallel with different compromises. In such a context, I think that
Jaidev could kick start a successful and useful scikit-signal. Hopefully
this would not preclude improvements to the docs, examples, and existing
code in scipy.signal.
Sorry for the long post, and thank you for reading.
Gael
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Robert Kern
2012-01-04 14:10:11 UTC
Permalink
Post by Travis Oliphant
fftpack
stats
io
special
optimize]
linalg
lib.blas
lib.lapack
misc
I think the other packages should be maintained, built and distributed as
scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave  (actually I think weave should be installed separately and/or merged with other foreign code integration tools like fwrap, f2py, etc.)
Then, we could create a scipy superpack to install it all together.     What issues do people see with a plan like this?
The main technical issue/decision is how to split up the "physical"
packages themselves. Do we use namespace packages, such that
scipy.signal will still be imported as "from scipy import signal", or
do we rename the packages such that each one is its own top-level
package? It's important to specify this when making a proposal because
each imposes different costs that we may want to factor into how we
divide up the packages.

I think the lesson we've learned from scikits (and ETS, for that
matter) is that this community at least does not want to use namespace
packages. Some of this derives from a distate of setuptools, which is
used in the implementation, but a lot of it derives from the very
concept of namespace packages independent of any implementation.
Monitoring the scikit-learn and pystatsmodels mailing lists, I noticed
that a number of installation problems stemmed just from having the
top-level package being "scikits" and shared between several packages.
This is something that can only be avoided by not using namespace
packages altogether.

There are also technical issues that cut across implementations.
Namely, the scipy/__init__.py files need to be identical between all
of the packages. Maintaining non-empty identical __init__.py files is
not feasible. We don't make many changes to it these days, but we
won't be able to make *any* changes ever again. We could empty it out,
if we are willing to make this break with backwards compatibility
once.

Going with unique top-level packages, do we use a convention like
"scipy_signal", at least for the packages being broken out from the
current monolithic scipy? Do we provide a proxy package hierarchy for
backwards compatibility (e.g. having proxy modules like
scipy/signal/signaltools.py that just import everything from
scipy_signal/signaltools.py) like Enthought does with etsproxy after
we split up ETS?
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
Skipper Seabold
2012-01-04 14:30:55 UTC
Permalink
On Wed, Jan 4, 2012 at 1:37 AM, Travis Oliphant <***@continuum.io> wrote:
<snip>
Post by Travis Oliphant
fftpack
stats
io
special
optimize]
linalg
lib.blas
lib.lapack
misc
I think the other packages should be maintained, built and distributed as
scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave  (actually I think weave should be installed separately and/or merged with other foreign code integration tools like fwrap, f2py, etc.)
Then, we could create a scipy superpack to install it all together.     What issues do people see with a plan like this?
My first thought is that what is 'core' could use a little more
discussion. We are using parts of integrate and signal in statsmodels
so our dependencies almost double if these are split off as a separate
installation. I'd suspect others might feel the same. This isn't a
deal breaker though, and I like the idea of being more modular,
depending on how it's implemented and how easy it is for users to grab
and install different parts.

Skipper
j***@gmail.com
2012-01-04 14:53:45 UTC
Permalink
Post by Skipper Seabold
<snip>
Post by Travis Oliphant
fftpack
stats
io
special
optimize]
linalg
lib.blas
lib.lapack
misc
I think the other packages should be maintained, built and distributed as
scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave  (actually I think weave should be installed separately and/or merged with other foreign code integration tools like fwrap, f2py, etc.)
Then, we could create a scipy superpack to install it all together.     What issues do people see with a plan like this?
My first thought is that what is 'core' could use a little more
discussion. We are using parts of integrate and signal in statsmodels
so our dependencies almost double if these are split off as a separate
installation. I'd suspect others might feel the same. This isn't a
deal breaker though, and I like the idea of being more modular,
depending on how it's implemented and how easy it is for users to grab
and install different parts.
I think that breaking up scipy just gives us a lot more installation
problems, and if it's merged together again into a superpack, then it
wouldn't change a whole lot, but increase the work of the release
management.
I wouldn't mind if weave is split out, since it crashes and I never use it.

The splitup is also difficult because of interdependencies,
stats is a final usage sub package and doesn't need to be in the core,
it's not used by any other part, AFAIK
it uses at least also integrate.

optimize uses sparse is at least one other case I know.

I've been in favor of cleaning up imports for a long time, but
splitting up scipy means we can only rely on a smaller set of
functions without increasing the number of packages that need to be
installed.

What if stats wants to use spatial or signal?

Josef
Post by Skipper Seabold
Skipper
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Denis Laxalde
2012-01-04 15:56:06 UTC
Permalink
Post by j***@gmail.com
The splitup is also difficult because of interdependencies,
stats is a final usage sub package and doesn't need to be in the core,
it's not used by any other part, AFAIK
it uses at least also integrate.
optimize uses sparse is at least one other case I know.
There could then be another level of split-up, per module, to circumvent
these dependency problems. For instance the core optimize module would
not include the nonlin module (the one depending on sparse) which would
in turn be in scipy-optimize-nonlin, part of the "contrib" meta package.
Also, somebody developing a new optimization solver would name their
package scipy-optimize-$SOLVER so that it could be included in the
contrib area.
Post by j***@gmail.com
What if stats wants to use spatial or signal?
The same would apply here. The bits from stats that want to use spatial
would stay in the contrib area until spatial moves to core.
--
Denis
j***@gmail.com
2012-01-04 16:53:25 UTC
Permalink
Post by Denis Laxalde
Post by j***@gmail.com
The splitup is also difficult because of interdependencies,
stats is a final usage sub package and doesn't need to be in the core,
it's not used by any other part, AFAIK
it uses at least also integrate.
and interpolate I think
Post by Denis Laxalde
Post by j***@gmail.com
optimize uses sparse is at least one other case I know.
There could then be another level of split-up, per module, to circumvent
these dependency problems. For instance the core optimize module would
not include the nonlin module (the one depending on sparse) which would
in turn be in scipy-optimize-nonlin, part of the "contrib" meta package.
Also, somebody developing a new optimization solver would name their
package scipy-optimize-$SOLVER so that it could be included in the
contrib area.
Post by j***@gmail.com
What if stats wants to use spatial or signal?
The same would apply here. The bits from stats that want to use spatial
would stay in the contrib area until spatial moves to core.
That sounds like it will be difficult to keep track of things.

I don't see any clear advantages that would justify the additional
installation problems.

The advantage of the current scipy is that it is a minimal common set
of functionality that we can assume a user has installed when we
require scipy.

scipy.stats, statsmodels and sklearn load large parts of scipy, but
maybe not fully overlapping. If I want to use sklearn additional to
statsmodels, I don't have to worry about additional dependencies,
since we try to stick with numpy and scipy as required dependencies
(statsmodels also has pandas now).

If we break up scipy, then we have to think which additional sub- or
sub-sub-packages users need to install before they can use the
scikits, unless we require users to install a super-super-package that
includes (almost) all of the current scipy.

The next stage will be keeping track of versions. It sounds a lot of
fun if there are changes, and we not only have to check for numpy and
scipy version, but also the version of each sub-package.

Nothing is impossible, I just don't see the advantage of moving away
from the current one-click install that works very well on Windows.

Josef
Post by Denis Laxalde
--
Denis
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Ralf Gommers
2012-01-04 18:24:22 UTC
Permalink
Post by Travis Oliphant
Post by Skipper Seabold
<snip>
Post by Travis Oliphant
fftpack
stats
io
special
optimize]
linalg
lib.blas
lib.lapack
misc
I think the other packages should be maintained, built and distributed
as
Post by Skipper Seabold
Post by Travis Oliphant
scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave (actually I think weave should be installed separately
and/or merged with other foreign code integration tools like fwrap, f2py,
etc.)
Post by Skipper Seabold
Post by Travis Oliphant
Then, we could create a scipy superpack to install it all together.
What issues do people see with a plan like this?
Post by Skipper Seabold
My first thought is that what is 'core' could use a little more
discussion. We are using parts of integrate and signal in statsmodels
so our dependencies almost double if these are split off as a separate
installation. I'd suspect others might feel the same. This isn't a
deal breaker though, and I like the idea of being more modular,
depending on how it's implemented and how easy it is for users to grab
and install different parts.
I think that breaking up scipy just gives us a lot more installation
problems, and if it's merged together again into a superpack, then it
wouldn't change a whole lot, but increase the work of the release
management.
I wouldn't mind if weave is split out, since it crashes and I never use it.
The splitup is also difficult because of interdependencies,
stats is a final usage sub package and doesn't need to be in the core,
it's not used by any other part, AFAIK
it uses at least also integrate.
optimize uses sparse is at least one other case I know.
I've been in favor of cleaning up imports for a long time, but
splitting up scipy means we can only rely on a smaller set of
functions without increasing the number of packages that need to be
installed.
What if stats wants to use spatial or signal?
I agree with Josef that splitting scipy will be difficult, and I suspect
it's (a) not worth the pain and (b) that it doesn't solve the issue that I
think Travis hopes it will solve (more development of the sub-packages).
Installation, dependency problems and effort of releasing will probably get
worse.

Looking at Travis' list of non-core packages I'd say that sparse certainly
belongs in the core and integrate probably too. Looking at what's left:
- constants : very small and low cost to keep in core. Not much to improve
there.
- cluster : low maintenance cost, small. not sure about usage, quality.
- ndimage : difficult one. hard to understand code, may not see much
development either way.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
- odr : quite small, low cost to keep in core. pretty much done as far as I
can tell.
- maxentropy : is deprecated, will disappear.
- signal : not in great shape, could be viable independent package. On the
other hand, if scikits-signal takes off and those developers take care to
improve and build on scipy.signal when possible, that's OK too.
- weave : no point spending any effort on it. keep for backwards
compatibility only, direct people to Cython instead.

Overall, I don't see many viable independent packages there. So here's an
alternative to spending a lot of effort on reorganizing the package
structure:
1. Formulate a coherent vision of what in principle belongs in scipy
(current modules + what's missing).
2. Focus on making it easier to contribute to scipy. There are many ways to
do this; having more accessible developer docs, having a list of "easy
fixes", adding info to tickets on how to get started on the reported
issues, etc. We can learn a lot from Sympy and IPython here.
3. Recognize that quality of code and especially documentation is
important, and fill the main gaps.
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove
them for scipy 1.0. I think that this applies only to maxentropy and weave.
5. Find a clear (group of) maintainer(s) for each sub-module. For people
familiar with one module, responding to tickets and pull requests for that
module would not cost so much time.

In my opinion, spending effort on improving code/documentation quality and
attracting new developers (those go hand in hand) instead of reorganizing
will have both more impact and be more beneficial for our users.

Cheers,
Ralf
Travis Oliphant
2012-01-05 01:43:45 UTC
Permalink
Thanks for the feedback. My point was to generate discussion and start the ball rolling on exactly the kind of conversation that has started.

Exactly as Ralf mentioned, the point is to get development on sub-packages --- something that the scikits effort and other individual efforts have done very, very well. In fact, it has worked so well, that it taught me a great deal about what is important in open source. My perhaps irrational dislike for the *name* "scikits" should not be interpreted as anything but a naming taste preference (and I am not known for my ability to choose names well anyway). I very much like and admire the community around scikits. I just would have preferred something easier to type (even just sci_* would have been better in my mind as high-level packages: sci_learn, sci_image, sci_statsmodels, etc.). I didn't feel like I was able to fully participate in that discussion when it happened, so you can take my comments now as simply historical and something I've been wanting to get off my chest for a while.

Without better packaging and dependency management systems (especially on Windows and Mac), splitting out code doesn't help those who are not distribution dependent (who themselves won't be impacted much). There are scenarios under which it could make sense to split out SciPy, but I agree that right now it doesn't make sense to completely split everything. However, I do think it makes sense to clean things up and move some things out in preparation for SciPy 1.0

One thing that would be nice is what is the view of documentation and examples for the different packages. Where is work there most needed?
- constants : very small and low cost to keep in core. Not much to improve there.
Agreed.
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
Good to hear maintenance cost is low.
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
Agreed.
- maxentropy : is deprecated, will disappear.
Great.
- signal : not in great shape, could be viable independent package. On the other hand, if scikits-signal takes off and those developers take care to improve and build on scipy.signal when possible, that's OK too.
What are the needs of this package? What needs to be fixed / improved? It is a broad field and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as It was one of the areas I was most interested in originally.
- weave : no point spending any effort on it. keep for backwards compatibility only, direct people to Cython instead.
Agreed. Anyway we can deprecate this for SciPy 1.0?
1. Formulate a coherent vision of what in principle belongs in scipy (current modules + what's missing).
O.K. so SciPy should contain "basic" modules that are going to be needed for a lot of different kinds of analysis to be a dependency for other more advanced packages. This is somewhat vague, of course.

What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?

http://gams.nist.gov/cgi-bin/serve.cgi
2. Focus on making it easier to contribute to scipy. There are many ways to do this; having more accessible developer docs, having a list of "easy fixes", adding info to tickets on how to get started on the reported issues, etc. We can learn a lot from Sympy and IPython here.
Definitely!
3. Recognize that quality of code and especially documentation is important, and fill the main gaps.
Is there a write-up of recognized gaps here that we can start with?
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove them for scipy 1.0. I think that this applies only to maxentropy and weave.
I think it also applies to cluster as described above.
5. Find a clear (group of) maintainer(s) for each sub-module. For people familiar with one module, responding to
tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
In my opinion, spending effort on improving code/documentation quality and attracting new developers (those go hand in hand) instead of reorganizing will have both more impact and be more beneficial for our users.
Agreed. Thanks for the feedback.

Best,

-Travis
Fernando Perez
2012-01-05 02:22:16 UTC
Permalink
Hi all,
What do others think is missing?  Off the top of my head:   basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy.   Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
point; summarized here they are:

Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines

Descriptions of each can be found here:
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
here:

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).

But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.

One area that hasn't been directly mentioned too much is the situation
with statistical tools. On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense). But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved. I know we have
scipy.stats, but it seems like it needs some love.

Cheers,

f
j***@gmail.com
2012-01-05 02:50:30 UTC
Permalink
Post by Alexandre Gramfort
Hi all,
What do others think is missing?  Off the top of my head:   basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy.   Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
   Dense Linear Algebra
   Sparse Linear Algebra [1]
   Spectral Methods
   N-Body Methods
   Structured Grids
   Unstructured Grids
   MapReduce
   Combinational Logic
   Graph Traversal
   Dynamic Programming
   Backtrack and Branch-and-Bound
   Graphical Models
   Finite State Machines
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
One area that hasn't been directly mentioned too much is the situation
with statistical tools.  On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense).  But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved.  I know we have
scipy.stats, but it seems like it needs some love.
(I didn't send something like the first part earlier, because I didn't
want to talk so much.)

Every new code and sub-package need additional topic specific maintainers.

Pauli, Warren and Ralf are doing a great job as default, general
maintainers, and especially Warren and Ralf have been pushing
bug-fixes and enhancements into stats (and I have been reviewing
almost all of it).

If there is a well defined set of enhancements that could go into
stats, then I wouldn't mind, but I don't see much reason in
duplicating code and maintenance work with statsmodels.

Of course there are large parts that statsmodels doesn't cover either,
and it is useful to extend the coverage of statistics in either
package.

However, adding code that is not low maintenance (because it's fully
tested) or doesn't have committed maintainers doesn't make much sense
in my opinion.

Cheers,

Josef
Post by Alexandre Gramfort
Cheers,
f
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Travis Oliphant
2012-01-05 03:29:59 UTC
Permalink
Post by Alexandre Gramfort
Hi all,
Post by Travis Oliphant
What do others think is missing? Off the top of my head: basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the situation
with statistical tools. On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense). But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved. I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work. A big question to me is the impact of data-frames as the underlying data-representation of the algorithms and the relationship between the data-frame and a NumPy array.

-Travis
Post by Alexandre Gramfort
Cheers,
f
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Fernando Perez
2012-01-05 03:46:03 UTC
Permalink
It seems like scipy stats has received quite a bit of attention.   There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Well, I recently needed to do some simple linear modeling, and the
stats glm docstring isn't very encouraging:

Docstring:
Calculates a linear model fit ...
anova/ancova/lin-regress/t-test/etc. Taken from:

Peterson et al. Statistical limitations in functional neuroimaging
I. Non-inferential methods and statistical models. Phil Trans Royal Soc
Lond B 354: 1239-1260.

Returns
-------
statistic, p-value ???

### END of docstring

I turned to statsmodels, which had great examples and it was very easy
to use (for an ignoramus on the matter like myself).

But perhaps that happens to be an isolated point. I have to admit,
I've just been using the pandas/statsmodels/sklearn combo directly.
Part of that has to do also with the nice, long-form examples
available for them, something which I think we still lack in
numpy/scipy but where some of the new focused projects have done a
great job (the matplotlib gallery blazed that trail, and others have
followed with excellent results).

Cheers,

f
j***@gmail.com
2012-01-05 04:11:15 UTC
Permalink
Post by Fernando Perez
It seems like scipy stats has received quite a bit of attention.   There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Well, I recently needed to do some simple linear modeling, and the
Calculates a linear model fit ...
Peterson et al. Statistical limitations in functional neuroimaging
I. Non-inferential methods and statistical models.  Phil Trans Royal Soc
Lond B 354: 1239-1260.
Returns
-------
statistic, p-value ???
### END of docstring
glm should have been removed a long time ago, since it doesn't make much sense.

a basic OLS class might not be bad for scipy, also from some of the
questions that I have seen on stackoverflow of users that use the
cookbook class.
Post by Fernando Perez
I turned to statsmodels, which had great examples and it was very easy
to use (for an ignoramus on the matter like myself).
But perhaps that happens to be an isolated point.  I have to admit,
I've just been using the pandas/statsmodels/sklearn combo directly.
Part of that has to do also with the nice, long-form examples
available for them, something which I think we still lack in
numpy/scipy but where some of the new focused projects have done a
great job (the matplotlib gallery blazed that trail, and others have
followed with excellent results).
I'm not exactly unhappy about this :), especially once we get to the
stage where you can type
print modelresults.summary()
and we print diagnostic checks why you shouldn't trust your model
results, or we print no warning comments and the diagnostic checks
don't indicate anything is wrong.

Of course I'm not so happy about the lack of examples in scipy.

Josef
Post by Fernando Perez
Cheers,
f
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Warren Weckesser
2012-01-05 06:02:19 UTC
Permalink
Post by Alexandre Gramfort
Post by Alexandre Gramfort
Hi all,
Post by Travis Oliphant
What do others think is missing? Off the top of my head: basic
wavelets
Post by Alexandre Gramfort
Post by Travis Oliphant
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed
in
Post by Alexandre Gramfort
Post by Travis Oliphant
SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the situation
with statistical tools. On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense). But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved. I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There
is always more to do, of course, but I'm not sure what specifically you
think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect
skewness and kurtosis formulas for some distributions, and I now have very
little confidence that any of the other distributions are correct. Of
course, most of them probably *are* correct, but without tests, all are in
doubt.

Warren


A big question to me is the impact of data-frames as the underlying
Post by Alexandre Gramfort
data-representation of the algorithms and the relationship between the
data-frame and a NumPy array.
-Travis
Post by Alexandre Gramfort
Cheers,
f
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Travis Oliphant
2012-01-05 06:26:05 UTC
Permalink
Post by Travis Oliphant
Post by Alexandre Gramfort
Hi all,
Post by Travis Oliphant
What do others think is missing? Off the top of my head: basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the situation
with statistical tools. On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense). But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved. I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect skewness and kurtosis formulas for some distributions, and I now have very little confidence that any of the other distributions are correct. Of course, most of them probably *are* correct, but without tests, all are in doubt.
There is such a thing as *over-reliance* on tests as well. Tests help but it is not a black or white kind of thing as seems to come across in many of the messages on this list about what part of scipy is in "good shape" or "easy to maintain" or "has love." Just because tests exist doesn't mean that you can trust the code --- you also then have to trust the tests. Ultimately, trust is built from successful *usage*. Tests are only a pseudo-subsitute for that usage. It so happens that usage that comes along with the code itself makes it easier to iterate on changes and catch some of the errors that can happen on re-factoring.

In summary, tests are good! But, they also add overhead and themselves must be maintained, and I don't think it helps to disparage working code. I've seen a lot of terrible code that has *great* tests and seen projects fail because developers focus too much on the tests and not enough on what the code is actually doing. Great tests can catch many things but they cannot make up for not paying attention when writing the code.

-Travis
Post by Travis Oliphant
Warren
A big question to me is the impact of data-frames as the underlying data-representation of the algorithms and the relationship between the data-frame and a NumPy array.
-Travis
Post by Alexandre Gramfort
Cheers,
f
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Ralf Gommers
2012-01-05 06:47:13 UTC
Permalink
Post by Warren Weckesser
Post by Alexandre Gramfort
Post by Alexandre Gramfort
Hi all,
Post by Travis Oliphant
What do others think is missing? Off the top of my head: basic
wavelets
Post by Alexandre Gramfort
Post by Travis Oliphant
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed
in
Post by Alexandre Gramfort
Post by Travis Oliphant
SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the situation
with statistical tools. On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense). But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved. I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There
is always more to do, of course, but I'm not sure what specifically you
think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect
skewness and kurtosis formulas for some distributions, and I now have very
little confidence that any of the other distributions are correct. Of
course, most of them probably *are* correct, but without tests, all are in
doubt.
There is such a thing as *over-reliance* on tests as well.
True in principle, but we're so far from that point that you don't have to
worry about that for the foreseeable future.
Post by Warren Weckesser
Tests help but it is not a black or white kind of thing as seems to come
across in many of the messages on this list about what part of scipy is in
"good shape" or "easy to maintain" or "has love." Just because tests
exist doesn't mean that you can trust the code --- you also then have to
trust the tests. Ultimately, trust is built from successful *usage*.
Tests are only a pseudo-subsitute for that usage. It so happens that usage
that comes along with the code itself makes it easier to iterate on changes
and catch some of the errors that can happen on re-factoring.
In summary, tests are good! But, they also add overhead and themselves
must be maintained, and I don't think it helps to disparage working code.
I've seen a lot of terrible code that has *great* tests and seen projects
fail because developers focus too much on the tests and not enough on what
the code is actually doing. Great tests can catch many things but they
cannot make up for not paying attention when writing the code.
Certainly, but besides giving more confidence that code is correct, a major
advantage is that it is a massive help when working on existing code -
especially for new developers. Now we have to be extremely careful in
reviewing patches to check nothing gets broken (including backwards
compatibility). Tests in that respect are not a maintenance burden, but a
time saver.

As an example, last week I wanted to add a way to easily adjust the
bandwidth of gaussian_kde. This was maybe 10 lines of code, didn't take
long at all. Then I spent some time adding tests and improving the docs,
and thought I was done. After sending the PR, I spent at least an equal
amount of time reworking everything a couple of times to not break any of
the existing subclasses that could be found. In addition it took a lot of
Josef's time to review it all and convince me of the error of my way. A few
tests could have saved us a lot of time.

Ralf
j***@gmail.com
2012-01-05 14:10:20 UTC
Permalink
On Thu, Jan 5, 2012 at 1:47 AM, Ralf Gommers
Post by Ralf Gommers
Post by Travis Oliphant
Post by Alexandre Gramfort
Hi all,
What do others think is missing?  Off the top of my head:   basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy.   Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
   Dense Linear Algebra
   Sparse Linear Algebra [1]
   Spectral Methods
   N-Body Methods
   Structured Grids
   Unstructured Grids
   MapReduce
   Combinational Logic
   Graph Traversal
   Dynamic Programming
   Backtrack and Branch-and-Bound
   Graphical Models
   Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the situation
with statistical tools.  On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense).  But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved.  I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention.   There
is always more to do, of course, but I'm not sure what specifically you
think is missing or needs work.
Test coverage, for example.  I recently fixed several wildly incorrect
skewness and kurtosis formulas for some distributions, and I now have very
little confidence that any of the other distributions are correct.  Of
course, most of them probably *are* correct, but without tests, all are in
doubt.
There is such a thing as *over-reliance* on tests as well.
True in principle, but we're so far from that point that you don't have to
worry about that for the foreseeable future.
Tests help but it is not a black or white kind of thing as seems to come
across in many of the messages on this list about what part of scipy is in
"good shape" or "easy to maintain" or "has love."    Just because tests
exist doesn't mean that you can trust the code --- you also then have to
trust the tests.   Ultimately, trust is built from successful *usage*.
Tests are only a pseudo-subsitute for that usage.  It so happens that usage
that comes along with the code itself makes it easier to iterate on changes
and catch some of the errors that can happen on re-factoring.
In summary, tests are good!  But, they also add overhead and themselves
must be maintained, and I don't think it helps to disparage working code.
I've seen a lot of terrible code that has *great* tests and seen projects
fail because developers focus too much on the tests and not enough on what
the code is actually doing.   Great tests can catch many things but they
cannot make up for not paying attention when writing the code.
Certainly, but besides giving more confidence that code is correct, a major
advantage is that it is a massive help when working on existing code -
especially for new developers. Now we have to be extremely careful in
reviewing patches to check nothing gets broken (including backwards
compatibility). Tests in that respect are not a maintenance burden, but a
time saver.
Overall I also think that adding sufficient tests at the time of
adding the code is a big time saver in the long run. It is a lot more
difficult to figure out later why something is wrong and how to fix
it.

Without sufficient tests it's also difficult to tell whether code that
looks good works as advertised, (my last mistake was a misplaced
bracket that only showed up in cases that were not covered by the
tests).

And of course as Ralf mentioned, refactoring without test coverage is
dangerous business even if the change looks "innocent.

Josef
Post by Ralf Gommers
As an example, last week I wanted to add a way to easily adjust the
bandwidth of gaussian_kde. This was maybe 10 lines of code, didn't take long
at all. Then I spent some time adding tests and improving the docs, and
thought I was done. After sending the PR, I spent at least an equal amount
of time reworking everything a couple of times to not break any of the
existing subclasses that could be found. In addition it took a lot of
Josef's time to review it all and convince me of the error of my way. A few
tests could have saved us a lot of time.
Ralf
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Charles R Harris
2012-01-05 14:45:12 UTC
Permalink
Post by j***@gmail.com
On Thu, Jan 5, 2012 at 1:47 AM, Ralf Gommers
Post by Ralf Gommers
Post by Warren Weckesser
Post by Travis Oliphant
Post by Alexandre Gramfort
Hi all,
Post by Travis Oliphant
What do others think is missing? Off the top of my head: basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things
needed
Post by Ralf Gommers
Post by Warren Weckesser
Post by Travis Oliphant
Post by Alexandre Gramfort
Post by Travis Oliphant
in
SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at
least
Post by Ralf Gommers
Post by Warren Weckesser
Post by Travis Oliphant
Post by Alexandre Gramfort
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the
situation
Post by Ralf Gommers
Post by Warren Weckesser
Post by Travis Oliphant
Post by Alexandre Gramfort
with statistical tools. On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense). But it would probably be valuable to have enough of
a
Post by Ralf Gommers
Post by Warren Weckesser
Post by Travis Oliphant
Post by Alexandre Gramfort
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved. I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention.
There
Post by Ralf Gommers
Post by Warren Weckesser
Post by Travis Oliphant
is always more to do, of course, but I'm not sure what specifically you
think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect
skewness and kurtosis formulas for some distributions, and I now have
very
Post by Ralf Gommers
Post by Warren Weckesser
little confidence that any of the other distributions are correct. Of
course, most of them probably *are* correct, but without tests, all are
in
Post by Ralf Gommers
Post by Warren Weckesser
doubt.
There is such a thing as *over-reliance* on tests as well.
True in principle, but we're so far from that point that you don't have
to
Post by Ralf Gommers
worry about that for the foreseeable future.
Post by Warren Weckesser
Tests help but it is not a black or white kind of thing as seems to come
across in many of the messages on this list about what part of scipy is
in
Post by Ralf Gommers
Post by Warren Weckesser
"good shape" or "easy to maintain" or "has love." Just because tests
exist doesn't mean that you can trust the code --- you also then have to
trust the tests. Ultimately, trust is built from successful *usage*.
Tests are only a pseudo-subsitute for that usage. It so happens that
usage
Post by Ralf Gommers
Post by Warren Weckesser
that comes along with the code itself makes it easier to iterate on
changes
Post by Ralf Gommers
Post by Warren Weckesser
and catch some of the errors that can happen on re-factoring.
In summary, tests are good! But, they also add overhead and themselves
must be maintained, and I don't think it helps to disparage working
code.
Post by Ralf Gommers
Post by Warren Weckesser
I've seen a lot of terrible code that has *great* tests and seen
projects
Post by Ralf Gommers
Post by Warren Weckesser
fail because developers focus too much on the tests and not enough on
what
Post by Ralf Gommers
Post by Warren Weckesser
the code is actually doing. Great tests can catch many things but they
cannot make up for not paying attention when writing the code.
Certainly, but besides giving more confidence that code is correct, a
major
Post by Ralf Gommers
advantage is that it is a massive help when working on existing code -
especially for new developers. Now we have to be extremely careful in
reviewing patches to check nothing gets broken (including backwards
compatibility). Tests in that respect are not a maintenance burden, but a
time saver.
Overall I also think that adding sufficient tests at the time of
adding the code is a big time saver in the long run. It is a lot more
difficult to figure out later why something is wrong and how to fix
it.
Without sufficient tests it's also difficult to tell whether code that
looks good works as advertised, (my last mistake was a misplaced
bracket that only showed up in cases that were not covered by the
tests).
And of course as Ralf mentioned, refactoring without test coverage is
dangerous business even if the change looks "innocent.
And sufficient means test everything. I always turn up bugs when I increase
test coverage. It can be embarrassing.

Chuck
j***@gmail.com
2012-01-05 13:51:02 UTC
Permalink
On Thu, Jan 5, 2012 at 1:02 AM, Warren Weckesser
Post by Travis Oliphant
Post by Alexandre Gramfort
Hi all,
What do others think is missing?  Off the top of my head:   basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy.   Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy
one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
from Berkeley' paper on parallel computing is not a bad starting
   Dense Linear Algebra
   Sparse Linear Algebra [1]
   Spectral Methods
   N-Body Methods
   Structured Grids
   Unstructured Grids
   MapReduce
   Combinational Logic
   Graph Traversal
   Dynamic Programming
   Backtrack and Branch-and-Bound
   Graphical Models
   Finite State Machines
This is a nice list, thanks!
Post by Alexandre Gramfort
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in
supercomputing environments, and some of the topics are probably
beyond the scope of scipy (say structured/unstructured grids, at least
for now).
But it can be a decent guiding outline to reason about what are the
'big areas' of scientific computing, so that scipy at least provides
building blocks that would be useful in these directions.
Thanks for the links.
Post by Alexandre Gramfort
One area that hasn't been directly mentioned too much is the situation
with statistical tools.  On the one hand, we have the phenomenal work
of pandas, statsmodels and sklearn, which together are helping turn
python into a great tool for statistical data analysis (understood in
a broad sense).  But it would probably be valuable to have enough of a
statistical base directly in numpy/scipy so that the 'out of the box'
experience for statistical work is improved.  I know we have
scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention.   There
is always more to do, of course, but I'm not sure what specifically you
think is missing or needs work.
Test coverage, for example.  I recently fixed several wildly incorrect
skewness and kurtosis formulas for some distributions, and I now have very
little confidence that any of the other distributions are correct.  Of
course, most of them probably *are* correct, but without tests, all are in
doubt.
Actually for this part it's not so much the test coverage, I have
written some imperfect tests, but they are disabled because skew,
kurtosis (3rd and 4th moments) and entropy still have several bugs for
sure.
One problem is that they are statistical tests with some false alarms,
especially for distributions that are far away from the normal.

But the main problem is that it requires a lot of work fixing those
bugs, find the correct formulas (which is not so easy for some more
exotic distributions) and then finding out where the current
calculations are wrong.
As you have seen for the cases that you recently fixed.

variances (2nd moments) might be ok, but I'm not completely convinced
anymore since I discovered that the corresponding test was a dummy.

Better tests would be useful, but statistical tests based on random
samples were the only once I could come up with at the time that
(mostly) worked across all 100 distributions.

Josef
Warren
Post by Travis Oliphant
   A big question to me is the impact of data-frames as the underlying
data-representation of the algorithms and the relationship between the
data-frame and a NumPy array.
-Travis
Post by Alexandre Gramfort
Cheers,
f
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Charles R Harris
2012-01-05 02:33:38 UTC
Permalink
Post by Travis Oliphant
Thanks for the feedback. My point was to generate discussion and
start the ball rolling on exactly the kind of conversation that has
started.
Exactly as Ralf mentioned, the point is to get development on sub-packages
--- something that the scikits effort and other individual efforts have
done very, very well. In fact, it has worked so well, that it taught me a
great deal about what is important in open source. My perhaps irrational
dislike for the *name* "scikits" should not be interpreted as anything but
a naming taste preference (and I am not known for my ability to choose
names well anyway). I very much like and admire the community around
scikits. I just would have preferred something easier to type (even just
sci_* would have been better in my mind as high-level packages: sci_learn,
sci_image, sci_statsmodels, etc.). I didn't feel like I was able to
fully participate in that discussion when it happened, so you can take my
comments now as simply historical and something I've been wanting to get
off my chest for a while.
Without better packaging and dependency management systems (especially on
Windows and Mac), splitting out code doesn't help those who are not
distribution dependent (who themselves won't be impacted much). There are
scenarios under which it could make sense to split out SciPy, but I agree
that right now it doesn't make sense to completely split everything.
However, I do think it makes sense to clean things up and move some things
out in preparation for SciPy 1.0
One thing that would be nice is what is the view of documentation and
examples for the different packages. Where is work there most needed?
Looking at Travis' list of non-core packages I'd say that sparse certainly
- constants : very small and low cost to keep in core. Not much to improve there.
Agreed.
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically
contains a K-means vector quantization code with functionality that I
suspect exists in scikits-learn. I would recommend deprecation and
removal while pointing people to scikits-learn for equivalent functionality
(or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like
that? It is hardly specific to machine learning. Same with various matrix
factorizations.
Post by Travis Oliphant
- ndimage : difficult one. hard to understand code, may not see much
development either way.
This overlaps with scikits-image but has quite a bit of useful
functionality on its own. The package is fairly mature and just needs
maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to
scikits-image since it *is* image specific and might be better maintained.
Post by Travis Oliphant
- spatial : kdtree is widely used, of good quality. low maintenance cost.
Indexing of all sorts tends to be fundamental. But not everyone knows they
want it ;)

Good to hear maintenance cost is low.
Post by Travis Oliphant
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
Agreed.
- maxentropy : is deprecated, will disappear.
Great.
- signal : not in great shape, could be viable independent package. On the
other hand, if scikits-signal takes off and those developers take care to
improve and build on scipy.signal when possible, that's OK too.
What are the needs of this package? What needs to be fixed / improved?
It is a broad field and I could see fixing scipy.signal with a few simple
algorithms (the filter design, for example), and then pushing a separate
package to do more advanced signal processing algorithms. This sounds
fine to me. It looks like I can put attention to scipy.signal then, as It
was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that
works for complex filter design that belongs somewhere.
Post by Travis Oliphant
- weave : no point spending any effort on it. keep for backwards
compatibility only, direct people to Cython instead.
Agreed. Anyway we can deprecate this for SciPy 1.0?
Overall, I don't see many viable independent packages there. So here's an
alternative to spending a lot of effort on reorganizing the package
1. Formulate a coherent vision of what in principle belongs in scipy
(current modules + what's missing).
O.K. so SciPy should contain "basic" modules that are going to be needed
for a lot of different kinds of analysis to be a dependency for other more
advanced packages. This is somewhat vague, of course.
What do others think is missing? Off the top of my head: basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy. Are there other relevant taxonomies these days?
http://gams.nist.gov/cgi-bin/serve.cgi
2. Focus on making it easier to contribute to scipy. There are many ways
to do this; having more accessible developer docs, having a list of "easy
fixes", adding info to tickets on how to get started on the reported
issues, etc. We can learn a lot from Sympy and IPython here.
Definitely!
3. Recognize that quality of code and especially documentation is
important, and fill the main gaps.
Is there a write-up of recognized gaps here that we can start with?
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove
them for scipy 1.0. I think that this applies only to maxentropy and weave.
I think it also applies to cluster as described above.
5. Find a clear (group of) maintainer(s) for each sub-module. For people
familiar with one module, responding to
tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
In my opinion, spending effort on improving code/documentation quality and
attracting new developers (those go hand in hand) instead of reorganizing
will have both more impact and be more beneficial for our users.
Chuck
Travis Oliphant
2012-01-05 03:07:28 UTC
Permalink
Post by Travis Oliphant
Post by Ralf Gommers
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like that? It is hardly specific to machine learning. Same with various matrix factorizations.
What is basic and what is not basic is the whole point of the discussion. I'm not sure that the functionality in cluster.vq and cluster.hierarchy can be considered "basic". But, it will certainly depend on the kinds of problems you tend to solve. I also don't understand your reference to matrix factorizations in this context.

But, this isn't a big-deal to me, either, so if there are strong opinions wanting to keep it, then great.
Post by Travis Oliphant
What are the needs of this package? What needs to be fixed / improved? It is a broad field and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as It was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that works for complex filter design that belongs somewhere.
It seems like this should go into scipy.signal next to the remez algorithm that is already there.

-Travis
Charles R Harris
2012-01-05 03:53:18 UTC
Permalink
Post by Charles R Harris
Post by Ralf Gommers
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically
contains a K-means vector quantization code with functionality that I
suspect exists in scikits-learn. I would recommend deprecation and
removal while pointing people to scikits-learn for equivalent functionality
(or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like
that? It is hardly specific to machine learning. Same with various matrix
factorizations.
What is basic and what is not basic is the whole point of the discussion.
I'm not sure that the functionality in cluster.vq and cluster.hierarchy
can be considered "basic". But, it will certainly depend on the kinds of
problems you tend to solve. I also don't understand your reference to
matrix factorizations in this context.
But, this isn't a big-deal to me, either, so if there are strong opinions
wanting to keep it, then great.
Clustering is pretty basic to lots of things. That said, K-means might not
be the one to keep.

There are various matrix factorizations beyond the basic svd that are less
common, but potentially useful, such as that in partial least squares and
positive matrix factorization. I think the scikits-learn folks use some of
these and they might have and idea as to how useful they have been. ISTR
someone posting about doing PLS for scipy a while back.
Post by Charles R Harris
Post by Ralf Gommers
What are the needs of this package? What needs to be fixed / improved?
It is a broad field and I could see fixing scipy.signal with a few simple
algorithms (the filter design, for example), and then pushing a separate
package to do more advanced signal processing algorithms. This sounds
fine to me. It looks like I can put attention to scipy.signal then, as It
was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that
works for complex filter design that belongs somewhere.
It seems like this should go into scipy.signal next to the remez algorithm
that is already there.
I'd actually like it to replace the current one since it it is readable --
mostly python with a bit of Cython for finding extrema -- and does
hermitean filters, which covers both the symmetric and anti-symmetric
filters that the current version does.

Chuck
Travis Oliphant
2012-01-05 04:02:09 UTC
Permalink
Post by Travis Oliphant
Post by Travis Oliphant
Post by Ralf Gommers
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like that? It is hardly specific to machine learning. Same with various matrix factorizations.
What is basic and what is not basic is the whole point of the discussion. I'm not sure that the functionality in cluster.vq and cluster.hierarchy can be considered "basic". But, it will certainly depend on the kinds of problems you tend to solve. I also don't understand your reference to matrix factorizations in this context.
But, this isn't a big-deal to me, either, so if there are strong opinions wanting to keep it, then great.
Clustering is pretty basic to lots of things. That said, K-means might not be the one to keep.
There are various matrix factorizations beyond the basic svd that are less common, but potentially useful, such as that in partial least squares and positive matrix factorization. I think the scikits-learn folks use some of these and they might have and idea as to how useful they have been. ISTR someone posting about doing PLS for scipy a while back.
Post by Travis Oliphant
What are the needs of this package? What needs to be fixed / improved? It is a broad field and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as It was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that works for complex filter design that belongs somewhere.
It seems like this should go into scipy.signal next to the remez algorithm that is already there.
I'd actually like it to replace the current one since it it is readable -- mostly python with a bit of Cython for finding extrema -- and does hermitean filters, which covers both the symmetric and anti-symmetric filters that the current version does.
Cool! That sounds even better :-)

-Travis
Post by Travis Oliphant
Chuck
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Zachary Pincus
2012-01-05 03:16:01 UTC
Permalink
Just one point here: one of the current shortcomings in scipy from my perspective is interpolation, which is spread between interpolate, signal, and ndimage, each package with strengths and inexplicable (to a new user) weaknesses.

One trouble spot is the fact that it's not clear that ndimage is where one ought to turn for general interpolation/resampling of gridded data (a topic which comes up at least once every couple months on the list).
Post by Travis Oliphant
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
See above. The interpolation stuff is pretty useful for a lot of tasks that aren't really "imaging" per se, but which involve gridded data. (GIS, e.g.) Similarly, the code for convolutions and similar (median filtering, e.g.) seems pretty generally useful and in many ways better than what's in scipy.signal for certain tasks.

I'm less certain about the morphological operations and the connected-components labeling, which might be more task-specific and fit better with scikits-image? (Probably after a re-write in Cython?)

Zach
Travis Oliphant
2012-01-05 03:36:38 UTC
Permalink
Great points.

I agree that interpolation still needs love. I've had the exact same concern multiple times before. It comes up quite a bit in classes.

It looks like interpolate and signal are still areas that I can spend some free time. I know Warren has spent time in signal. Is anyone else working on interpolate --- I can check this of course myself, but just in case someone is following this conversation who is interested in coordinating.

We may need to continue the conversation about ndimage.

I appreciate the patience with me after my being silent for a while. I'm technically between jobs as I recently left Enthought. I just re-did my mail account setup so now I see all scipy-dev and numpy-discussion mails instead of having to remember to go look at the conversations.

Thanks,

-Travis
Post by Zachary Pincus
Just one point here: one of the current shortcomings in scipy from my perspective is interpolation, which is spread between interpolate, signal, and ndimage, each package with strengths and inexplicable (to a new user) weaknesses.
One trouble spot is the fact that it's not clear that ndimage is where one ought to turn for general interpolation/resampling of gridded data (a topic which comes up at least once every couple months on the list).
Post by Travis Oliphant
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
See above. The interpolation stuff is pretty useful for a lot of tasks that aren't really "imaging" per se, but which involve gridded data. (GIS, e.g.) Similarly, the code for convolutions and similar (median filtering, e.g.) seems pretty generally useful and in many ways better than what's in scipy.signal for certain tasks.
I'm less certain about the morphological operations and the connected-components labeling, which might be more task-specific and fit better with scikits-image? (Probably after a re-write in Cython?)
Zach
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
j***@gmail.com
2012-01-05 05:32:58 UTC
Permalink
Post by Travis Oliphant
Great points.
I agree that interpolation still needs love.  I've had the exact same concern multiple times before.  It comes up quite a bit in classes.
It looks like interpolate and signal are still areas that I can spend some free time.     I know Warren has spent time in signal.   Is anyone else working on interpolate --- I can check this of course myself, but just in case someone is following this conversation who is interested in coordinating.
There have been several starts on a control system toolbox that has
some overlap with scipy.signal, but I haven't heard of any discussion
in a while.

The scipy wavelets look like a complete mystery, the docs are sparse,
and with a google search I found only a single example of it's usage.

Josef
Post by Travis Oliphant
We may need to continue the conversation about ndimage.
I appreciate the patience with me after my being silent for a while.    I'm technically between jobs as I recently left Enthought.     I just re-did my mail account setup so now I see all scipy-dev and numpy-discussion mails instead of having to remember to go look at the conversations.
Thanks,
-Travis
Post by Zachary Pincus
Just one point here: one of the current shortcomings in scipy from my perspective is interpolation, which is spread between interpolate, signal, and ndimage, each package with strengths and inexplicable (to a new user) weaknesses.
One trouble spot is the fact that it's not clear that ndimage is where one ought to turn for general interpolation/resampling of gridded data (a topic which comes up at least once every couple months on the list).
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own.   The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
See above. The interpolation stuff is pretty useful for a lot of tasks that aren't really "imaging" per se, but which involve gridded data. (GIS, e.g.) Similarly, the code for convolutions and similar (median filtering, e.g.) seems pretty generally useful and in many ways better than what's in scipy.signal for certain tasks.
I'm less certain about the morphological operations and the connected-components labeling, which might be more task-specific and fit better with scikits-image? (Probably after a re-write in Cython?)
Zach
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Pauli Virtanen
2012-01-09 11:37:02 UTC
Permalink
Post by Zachary Pincus
Just one point here: one of the current shortcomings in scipy
from my perspective is interpolation, which is spread between
interpolate, signal, and ndimage, each package with strengths
and inexplicable (to a new user) weaknesses.
Interpolation and splines are indeed a weak point currently.

What's missing is:

- interface for interpolating gridded data (unifying ndimage,
RectBivariateSpline, and scipy.spline routines)

- the interface for `griddata` could be simplified a bit
(-> allow variable number of arguments). Also, no natural neighbor
interpolation so far.

- FITPACK is a quirky beast, especially its 2D-routines (apart from
RectBivariateSpline) which very often don't work for real data.
I'm also not fully sure how far it and its smoothing can be trusted
in 1D (see stackoverflow)

- There are two sets of incompatible spline routines in
scipy.interpolate, which should be cleaned up.

The *Spline class interfaces are also not very pretty, as there is
__class__ changing magic going on.

The interp2d interface is somewhat confusing, and IMO would be best
deprecated.

- There is also a problem with large 1D data sets: FITPACK is slow, and
the other set of spline routines try to invert a dense matrix,
rather than e.g. using the band matrix routines.

- RBF sort of works, but uses dense matrices and is not suitable for
large data sets. IDW interpolation could be an useful addition here.

And probably more: making a laundry list of what to fix could be helpful.
Zachary Pincus
2012-01-09 13:02:12 UTC
Permalink
Also, as long as a list is being made:
scipy.signal has matched functions [cq]spline1d() and [cq]spline1d_eval(), but only [cq]spline2d(), with no matching _eval function.

And as far as FITPACK goes, I agree can be extremely, and possibly dangerously, "quirky" -- it's prone to almost arbitrarily bad ringing artifacts when the smoothing coefficient isn't large enough, and is very (very) sensitive to initial conditions in terms of what will and won't provoke the ringing. It has its uses, but it seems to me odd enough that it really shouldn't be the "default" 1D spline tool to direct people to.

Zach
Post by Pauli Virtanen
Post by Zachary Pincus
Just one point here: one of the current shortcomings in scipy
from my perspective is interpolation, which is spread between
interpolate, signal, and ndimage, each package with strengths
and inexplicable (to a new user) weaknesses.
Interpolation and splines are indeed a weak point currently.
- interface for interpolating gridded data (unifying ndimage,
RectBivariateSpline, and scipy.spline routines)
- the interface for `griddata` could be simplified a bit
(-> allow variable number of arguments). Also, no natural neighbor
interpolation so far.
- FITPACK is a quirky beast, especially its 2D-routines (apart from
RectBivariateSpline) which very often don't work for real data.
I'm also not fully sure how far it and its smoothing can be trusted
in 1D (see stackoverflow)
- There are two sets of incompatible spline routines in
scipy.interpolate, which should be cleaned up.
The *Spline class interfaces are also not very pretty, as there is
__class__ changing magic going on.
The interp2d interface is somewhat confusing, and IMO would be best
deprecated.
- There is also a problem with large 1D data sets: FITPACK is slow, and
the other set of spline routines try to invert a dense matrix,
rather than e.g. using the band matrix routines.
- RBF sort of works, but uses dense matrices and is not suitable for
large data sets. IDW interpolation could be an useful addition here.
And probably more: making a laundry list of what to fix could be helpful.
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
j***@gmail.com
2012-01-09 15:46:28 UTC
Permalink
Post by Zachary Pincus
scipy.signal has matched functions [cq]spline1d() and [cq]spline1d_eval(), but only [cq]spline2d(), with no matching _eval function.
And as far as FITPACK goes, I agree can be extremely, and possibly dangerously, "quirky" -- it's prone to almost arbitrarily bad ringing artifacts when the smoothing coefficient isn't large enough, and is very (very) sensitive to initial conditions in terms of what will and won't provoke the ringing. It has its uses, but it seems to me odd enough that it really shouldn't be the "default" 1D spline tool to direct people to.
Do you have an example of "arbitrarily" bad ringing?
Zachary Pincus
2012-01-09 19:06:13 UTC
Permalink
Post by j***@gmail.com
Post by Zachary Pincus
scipy.signal has matched functions [cq]spline1d() and [cq]spline1d_eval(), but only [cq]spline2d(), with no matching _eval function.
And as far as FITPACK goes, I agree can be extremely, and possibly dangerously, "quirky" -- it's prone to almost arbitrarily bad ringing artifacts when the smoothing coefficient isn't large enough, and is very (very) sensitive to initial conditions in terms of what will and won't provoke the ringing. It has its uses, but it seems to me odd enough that it really shouldn't be the "default" 1D spline tool to direct people to.
Do you have an example of "arbitrarily" bad ringing?
Post by Zachary Pincus
From what I was reading up on splines in the last weeks, I got the
impression was that this is a "feature" of interpolating splines, and
to be useful with a larger number of points we always need to smooth
sufficiently (reduce knots or penalize).
(I just read a comment that R with 5000 points only chooses about 200 knots).
Example below; it's using parametric splines because I have a simple interactive tool to draw them and notice occasional "blowing up" like what you see below. I *think* I've seen similar issues with normal splines, but haven't used them a lot lately. (For the record, treating the x and y values as separate and using the non-parametric spline fitting does NOT yield these crazy errors on *these data*...)

As far as the smoothing parameter, the "good" data will go crazy if s=3, but is fine with s=0.25 or s=4; similarly the "bad" data isn't prone to ringing if s=0.25 or s=5. So there's serious sensitivity both to the x,y positions of the data (as below) and to the smoothing parameter in a fairly small range.

I could probably come up with an example that goes crazy with even fewer input points, but this was the first thing I came up with. Small modifications to the input data seem to make it go even crazier, but the below illustrates the general point.

Zach


import numpy
import scipy.interpolate as interp
good = numpy.array(
[[ 24.21162868, 28.75056713, 32.64108579, 36.85581434,
41.07054289, 46.582111 , 52.417889 , 55.17367305,
57.92945711, 61.00945105, 64.89996971, 72.19469221,
75.76100098, 83.21782842, 83.21782842, 88.56729158,
86.29782236, 90.18834103, 86.62203225],
[ 70.57364276, 71.22206254, 69.27680321, 72.5189021 ,
65.06207466, 70.89785265, 67.33154388, 68.62838343,
69.92522299, 67.00733399, 77.21994548, 68.30417354,
71.38416748, 71.38416748, 64.25154993, 70.08732793,
61.00945105, 63.44102521, 56.47051261]])
bad = good.copy()
# now make a *small* change
bad[:,-1] = 87.432556973542049, 55.984197773255048

good_tck, good_u = interp.splprep(good, s=4)
bad_tck, bad_u = interp.splprep(bad, s=4)
print good.ptp(axis=1)
print numpy.array(interp.splev(numpy.linspace(good_u[0], good_u[-1], 300), good_tck)).ptp(axis=1)
print numpy.array(interp.splev(numpy.linspace(bad_u[0], bad_u[-1], 300), bad_tck)).ptp(axis=1)

And the output on my machine is:
[ 65.97671235 20.74943287]
[ 67.69845281 20.52518913]
[ 2868.98673621 450984.86622631]
j***@gmail.com
2012-01-09 20:30:17 UTC
Permalink
Post by j***@gmail.com
Post by Zachary Pincus
scipy.signal has matched functions [cq]spline1d() and [cq]spline1d_eval(), but only [cq]spline2d(), with no matching _eval function.
And as far as FITPACK goes, I agree can be extremely, and possibly dangerously, "quirky" -- it's prone to almost arbitrarily bad ringing artifacts when the smoothing coefficient isn't large enough, and is very (very) sensitive to initial conditions in terms of what will and won't provoke the ringing. It has its uses, but it seems to me odd enough that it really shouldn't be the "default" 1D spline tool to direct people to.
Do you have an example of "arbitrarily" bad ringing?
Pauli Virtanen
2012-01-10 10:14:56 UTC
Permalink
09.01.2012 21:30, ***@gmail.com kirjoitti:
[clip]
One impression I had when I tried this out a few weeks ago, is that
the spline smoothing factor s is imposed with equality not inequality.
In the examples that I tried with varying s, the reported error sum of
squares always matched s to a few decimals. (I don't know how because
I didn't see the knots change in some examples.)
As far as I understand the FITPACK code, it starts with a low number of
knots in the spline, and then inserts new knots until the criterion
given with `s` is satisfied for the LSQ spline. Then it adjusts k-th
derivative discontinuities until the sum of squares of errors is equal
to `s`.

Provided I understood this correctly (at least this is what was written
in fppara.f): I'm not so sure that using k-th derivative discontinuity
as the smoothness term in the optimization is what people actually
expect from "smoothing". A more likely candidate would be the curvature.
However, the default value for the splines is k=3, cubic, which yields a
somewhat strange "smoothness" constraint.

If this is indeed what FITPACK does, then it seems to me that the
approach to smoothing is somewhat flawed. (However, it'd probably best
to read the book before making judgments here.)

Pauli
Pauli Virtanen
2012-01-11 10:05:01 UTC
Permalink
09.01.2012 20:06, Zachary Pincus kirjoitti:
[clip]
Post by Zachary Pincus
good_tck, good_u = interp.splprep(good, s=4)
bad_tck, bad_u = interp.splprep(bad, s=4)
print good.ptp(axis=1)
print numpy.array(interp.splev(numpy.linspace(good_u[0], good_u[-1], 300), good_tck)).ptp(axis=1)
print numpy.array(interp.splev(numpy.linspace(bad_u[0], bad_u[-1], 300), bad_tck)).ptp(axis=1)
[ 65.97671235 20.74943287]
[ 67.69845281 20.52518913]
[ 2868.98673621 450984.86622631]
After a closer look at this, it seems to me that there could also be a
numerical problem (or perhaps a bug) in the fitpack algorithm, i.e., the
bad results are not necessarily due to a "wrong" smoothness metric. In
the "bad" case it seems that the 3rd derivative discontinuities also
explode.
--
Pauli Virtanen
j***@gmail.com
2012-01-05 03:30:30 UTC
Permalink
On Wed, Jan 4, 2012 at 9:33 PM, Charles R Harris
Post by Charles R Harris
Thanks for the feedback.      My point was to generate discussion and
start the ball rolling on exactly the kind of conversation that has started.
Exactly as Ralf mentioned, the point is to get development on sub-packages
--- something that the scikits effort and other individual efforts have done
very, very well.   In fact, it has worked so well, that it taught me a great
deal about what is important in open source.   My perhaps irrational dislike
for the *name* "scikits" should not be interpreted as anything but a naming
taste preference (and I am not known for my ability to choose names well
anyway).     I very much like and admire the community around scikits.  I
just would have preferred something easier to type (even just sci_* would
have been better in my mind as high-level packages:  sci_learn, sci_image,
sci_statsmodels, etc.).    I didn't feel like I was able to fully
participate in that discussion when it happened, so you can take my comments
now as simply historical and something I've been wanting to get off my chest
for a while.
Without better packaging and dependency management systems (especially on
Windows and Mac), splitting out code doesn't help those who are not
distribution dependent (who themselves won't be impacted much).   There are
scenarios under which it could make sense to split out SciPy, but I agree
that right now it doesn't make sense to completely split everything.
However, I do think it makes sense to clean things up and move some things
out in preparation for SciPy 1.0
One thing that would be nice is what is the view of documentation and
examples for the different packages.   Where is work there most needed?
Looking at Travis' list of non-core packages I'd say that sparse certainly
- constants : very small and low cost to keep in core. Not much to improve there.
Agreed.
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit.   It basically
contains a K-means vector quantization code with functionality that I
suspect  exists in scikits-learn.   I would recommend deprecation and
removal while pointing people to scikits-learn for equivalent functionality
(or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like
that? It is hardly specific to machine learning. Same with various matrix
factorizations.
- ndimage : difficult one. hard to understand code, may not see much
development either way.
This overlaps with scikits-image but has quite a bit of useful
functionality on its own.   The package is fairly mature and just needs
maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to
scikits-image since it *is* image specific and might be better maintained.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
Indexing of all sorts tends to be fundamental. But not everyone knows they
want it ;)
Good to hear maintenance cost is low.
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
Agreed.
- maxentropy : is deprecated, will disappear.
Great.
- signal : not in great shape, could be viable independent package. On the
other hand, if scikits-signal takes off and those developers take care to
improve and build on scipy.signal when possible, that's OK too.
What are the needs of this package?  What needs to be fixed / improved?
It is a broad field and I could see fixing scipy.signal with a few simple
algorithms (the filter design, for example), and then pushing a separate
package to do more advanced signal processing algorithms.    This sounds
fine to me.   It looks like I can put attention to scipy.signal then, as It
was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that
works for complex filter design that belongs somewhere.
ltisys was pretty neglected, but Warren, I think, made quite big improvements.
There was several times the discussion whether MIMO works or should
work, similar there was a discrete time proposal but I didn't keep up
with what happened to it.

In statsmodels we are very happy with signal.lfilter but I wished
there were a multi input version of it.
Other things that are basic, periodograms, burg and levinson_durbin
are scipy algorithms I think, but having them in a scikits.signal
would be good also.

Josef
Post by Charles R Harris
- weave : no point spending any effort on it. keep for backwards
compatibility only, direct people to Cython instead.
Agreed.   Anyway we can deprecate this for SciPy 1.0?
Overall, I don't see many viable independent packages there. So here's an
alternative to spending a lot of effort on reorganizing the package
1. Formulate a coherent vision of what in principle belongs in scipy
(current modules + what's missing).
O.K.  so SciPy should contain "basic" modules that are going to be needed
for a lot of different kinds of analysis to be a dependency for other more
advanced packages.  This is somewhat vague, of course.
What do others think is missing?  Off the top of my head:   basic wavelets
(dwt primarily) and more complete interpolation strategies (I'd like to
finish the basic interpolation approaches I started a while ago).
Originally, I used GAMS as an "overview" of the kinds of things needed in
SciPy.   Are there other relevant taxonomies these days?
http://gams.nist.gov/cgi-bin/serve.cgi
2. Focus on making it easier to contribute to scipy. There are many ways
to do this; having more accessible developer docs, having a list of "easy
fixes", adding info to tickets on how to get started on the reported issues,
etc. We can learn a lot from Sympy and IPython here.
Definitely!
3. Recognize that quality of code and especially documentation is
important, and fill the main gaps.
Is there a write-up of recognized gaps here that we can start with?
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove
them for scipy 1.0. I think that this applies only to maxentropy and weave.
I think it also applies to cluster as described above.
5. Find a clear (group of) maintainer(s) for each sub-module. For people
familiar with one module, responding to
tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
In my opinion, spending effort on improving code/documentation quality and
attracting new developers (those go hand in hand) instead of reorganizing
will have both more impact and be more beneficial for our users.
Chuck
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Charles R Harris
2012-01-05 04:11:56 UTC
Permalink
Post by j***@gmail.com
On Wed, Jan 4, 2012 at 9:33 PM, Charles R Harris
<snip>
Post by j***@gmail.com
ltisys was pretty neglected, but Warren, I think, made quite big improvements.
There was several times the discussion whether MIMO works or should
work, similar there was a discrete time proposal but I didn't keep up
with what happened to it.
In statsmodels we are very happy with signal.lfilter but I wished
there were a multi input version of it.
Other things that are basic, periodograms, burg and levinson_durbin
are scipy algorithms I think, but having them in a scikits.signal
would be good also.
Those all sound like good additions. Burg and Levinson_Durbin would also be
useful for folks making a maximum entropy package and would be a natural
fit with lfilter. I've seen various approaches to image interpolation that
could also make use of the lfilter functionality.

<snip>

Chuck
Neal Becker
2012-01-05 15:32:15 UTC
Permalink
Some comments on signal processing:

Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate. Efficiency is very
important in my work, so I implement many optimized variations.

Most of the time, FIR filters are used. These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion). Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.

IIR filters are seperate.

FFT based FIR filters are another type, and include both complex in/out as well
as scalar in/out (taking advantage of the 'two channel' trick for fft).
j***@gmail.com
2012-01-05 16:00:39 UTC
Permalink
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate.  Efficiency is very
important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used.  These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion).  Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as well
as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?

It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.

Josef
Post by Neal Becker
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Travis Oliphant
2012-01-05 16:14:45 UTC
Permalink
Post by j***@gmail.com
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate. Efficiency is very
important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion). Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as well
as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?
It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.
I think he is talking about filter design, correct?

lfilter can be used to implement FIR and IIR filters -- although an FIR filter is easily computed with convolve/correlate as well.

FIR filter design is usually done in the FFT-domain. But, this picks the coefficients for the actual filtering itself done with something like convolve

If you *do* filtering in the FFT-domain than it's usually going to be IIR. What are you referring to when you say "small change in the implementation"

-Travis
Post by j***@gmail.com
Josef
Post by Neal Becker
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
j***@gmail.com
2012-01-05 16:48:39 UTC
Permalink
Post by Travis Oliphant
Post by j***@gmail.com
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate.  Efficiency is very
important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used.  These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion).  Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as well
as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?
It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.
I think he is talking about filter design, correct?
lfilter can be used to implement FIR and IIR filters -- although an FIR filter is easily computed with convolve/correlate as well.
FIR filter design is usually done in the FFT-domain.   But, this picks the coefficients for the actual filtering itself done with something like convolve
If you *do* filtering in the FFT-domain than it's usually going to be IIR.   What are you referring to when you say "small change in the implementation"
maybe I'm interpreting things wrongly since I'm not so familiar with
the signal processing terminology

as far as I understand fftconvolve(in1, in2) applies a FIR filter in2
to in1, however it is possible to divide by the fft of an in3, that
would have both IIR filter terms as in lfilter.
(I tried out different versions of fft based time series analysis in
the statsmodels sandbox.)

I never looked very closely at filter design itself, because that is
very different from the estimation procedures we use in time series
analysis.

Josef
Post by Travis Oliphant
-Travis
Post by j***@gmail.com
Josef
Post by Neal Becker
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Neal Becker
2012-01-05 19:19:33 UTC
Permalink
Post by Travis Oliphant
Post by j***@gmail.com
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate. Efficiency is
very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion). Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as
well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?
It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about
filter design.

About FFT-based IIR filter, I never heard of it. I was talking about the fact
that fft can be used to efficiently implement a linear convolution exactly (for
the case of convolution of a finite or short sequence - the impulse response of
the filter - with a long or infinite sequence, the overlap-add or overlap-save
techniques are used).
Post by Travis Oliphant
lfilter can be used to implement FIR and IIR filters -- although an FIR filter
is easily computed with convolve/correlate as well.
FIR filter design is usually done in the FFT-domain. But, this picks the
coefficients for the actual filtering itself done with something like convolve
If you *do* filtering in the FFT-domain than it's usually going to be IIR.
What are you referring to when you say "small change in the implementation"
-Travis
Post by j***@gmail.com
Josef
Post by Neal Becker
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Travis Oliphant
2012-01-05 21:48:29 UTC
Permalink
Post by Neal Becker
Post by Travis Oliphant
Post by j***@gmail.com
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate. Efficiency is
very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion). Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as
well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?
It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about
filter design.
About FFT-based IIR filter, I never heard of it. I was talking about the fact
that fft can be used to efficiently implement a linear convolution exactly (for
the case of convolution of a finite or short sequence - the impulse response of
the filter - with a long or infinite sequence, the overlap-add or overlap-save
techniques are used).
Sure, of course. It's hard to know the way people are using terms. I agree that people don't usually use the term IIR when talking about an FFT-based filter (but there is an "effective" time-domain response for every filtering operation done in the Fourier domain --- as you noted). That's what I was referring to.

It's been a while since I wrote lfilter, but it transposes the filtering operation into Direct Form II, and then does a straightforward implementation of the feed-back and feed-forward equations.

Here is some information on the approach:
https://ccrma.stanford.edu/~jos/fp/Direct_Form_II.html

IIR filters implemented in the time-domain need something like lfilter. FIR filters are "just" convolution in the time domain --- and there are different approaches to doing that discrete-time convolution as you've noted. IIR filters are *just* convolution as well (but convolution with an infinite sequence). Of course, if you use the FFT-domain to implement the filter, then you can just as well design in that space the filtering-function you want to multiply the input signal with (it's just important to keep in mind the impact in the time-domain of what you are doing in the frequency domain --- i.e. sharp-edges result in ringing, the basic time-frequency product limitations, etc.)

These same ideas come under different names and have different emphasis in different disciplines.

-Travis
Neal Becker
2012-01-05 22:30:44 UTC
Permalink
Post by Travis Oliphant
Post by Neal Becker
Post by Travis Oliphant
Post by j***@gmail.com
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate. Efficiency
is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion). Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as
well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?
It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about
filter design.
About FFT-based IIR filter, I never heard of it. I was talking about the
fact that fft can be used to efficiently implement a linear convolution
exactly (for the case of convolution of a finite or short sequence - the
impulse response of the filter - with a long or infinite sequence, the
overlap-add or overlap-save techniques are used).
Sure, of course. It's hard to know the way people are using terms. I agree
that people don't usually use the term IIR when talking about an FFT-based
filter (but there is an "effective" time-domain response for every filtering
operation done in the Fourier domain --- as you noted). That's what I was
referring to.
It's been a while since I wrote lfilter, but it transposes the filtering
operation into Direct Form II, and then does a straightforward implementation
of the feed-back and feed-forward equations.
https://ccrma.stanford.edu/~jos/fp/Direct_Form_II.html
IIR filters implemented in the time-domain need something like lfilter. FIR
filters are "just" convolution in the time domain --- and there are different
approaches to doing that discrete-time convolution as you've noted. IIR
filters are *just* convolution as well (but convolution with an infinite
sequence). Of course, if you use the FFT-domain to implement the filter,
then you can just as well design in that space the filtering-function you want
to multiply the input signal with (it's just important to keep in mind the
impact in the time-domain of what you are doing in the frequency domain ---
i.e. sharp-edges result in ringing, the basic time-frequency product
limitations, etc.)
These same ideas come under different names and have different emphasis in
different disciplines.
-Travis
Here, I claim the best approach is to realize that
1. Just making the coefficients in the freq domain be samples of a desired
response gives you no exact result (as you noted), but
2. On the other hand, fft can be used to perform fast convolution, which is (can
be) mathematically exactly the same as time domain convolution. Therefore, just
realize that
* use your favorite FIR filter design tool (e.g., remez) to design the filter
Now the only approximation is in the fir filter design step, and you should know
precisely what is the nature of any approximation
j***@gmail.com
2012-01-05 23:04:18 UTC
Permalink
Post by Neal Becker
Post by Neal Becker
Post by Travis Oliphant
Post by j***@gmail.com
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate.  Efficiency
is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used.  These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion).  Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as
well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT base IIR filter?
It looks like a small change in the implementation, but it is slower
than lfilter for shorter time series so I mostly dropped fft based
filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about
filter design.
About FFT-based IIR filter, I never heard of it.  I was talking about the
fact that fft can be used to efficiently implement a linear convolution
exactly (for the case of convolution of a finite or short sequence - the
impulse response of the filter - with a long or infinite sequence, the
overlap-add or overlap-save techniques are used).
Sure, of course.   It's hard to know the way people are using terms.   I agree
that people don't usually use the term IIR when talking about an FFT-based
filter (but there is an "effective" time-domain response for every filtering
operation done in the Fourier domain --- as you noted).   That's what I was
referring to.
It's been a while since I wrote lfilter, but it transposes the filtering
operation  into Direct Form II, and then does a straightforward implementation
of the feed-back and feed-forward equations.
https://ccrma.stanford.edu/~jos/fp/Direct_Form_II.html
IIR filters implemented in the time-domain need something like lfilter.   FIR
filters are "just" convolution in the time domain --- and there are different
approaches to doing that discrete-time convolution as you've noted.   IIR
filters are *just* convolution as well (but convolution with an infinite
sequence).   Of course, if you use the FFT-domain to implement the filter,
then you can just as well design in that space the filtering-function you want
to multiply the input signal with (it's just important to keep in mind the
impact in the time-domain of what you are doing in the frequency domain ---
i.e. sharp-edges result in ringing, the basic time-frequency product
limitations, etc.)
These same ideas come under different names and have different emphasis in
different disciplines.
-Travis
Here, I claim the best approach is to realize that
1. Just making the coefficients in the freq domain be samples of a desired
response gives you no exact result (as you noted), but
2. On the other hand, fft can be used to perform fast convolution, which is (can
be) mathematically exactly the same as time domain convolution.  Therefore, just
realize that
* use your favorite FIR filter design tool (e.g., remez) to design the filter
Now the only approximation is in the fir filter design step, and you should know
precisely what is the nature of any approximation
Thanks, if I understand both of you correctly, then the difference
comes down to whether we want to have a parsimonious IIR
parameterization, with only a few parameters that can be estimated as
in time series analysis (Box-Jenkins), or whether you want to design a
filter where having a "long" FIR representation doesn't have any
disadvantages (in frequency domain, FFT, the filter might be full
length anyway).

Josef
Post by Neal Becker
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Christopher Felton
2012-01-06 13:40:05 UTC
Permalink
Post by Neal Becker
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a
general purpose filter, which is an IIR filter, single rate. Efficiency is very
important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for
single rate, interpolation, and decimation (there is also another design for
rational rate conversion). Then these have variants for scalar/complex
input/output, as well as complex in/out with scalar coefficients.
IIR filters are seperate.
FFT based FIR filters are another type, and include both complex in/out as well
as scalar in/out (taking advantage of the 'two channel' trick for fft).
This link, http://www.scipy.org/Cookbook/ApplyFIRFilter, describes the
different "filter" methods currently implemented in scipy. Not just
lfilter.

Regards,
Chris
Neal Becker
2012-01-05 15:25:31 UTC
Permalink
Charles R Harris wrote:

...
Post by Charles R Harris
Filter design could use improvement. I also have a remez algorithm that
works for complex filter design that belongs somewhere.
Can I get a copy of this please??
Charles R Harris
2012-01-05 16:23:58 UTC
Permalink
Post by Neal Becker
...
Post by Charles R Harris
Filter design could use improvement. I also have a remez algorithm that
works for complex filter design that belongs somewhere.
Can I get a copy of this please??
Sure, it's attached. It's pretty old at this point and I don't consider it
finished. If you want to work on it I could put a repository up on github.
I experimented with both fft and barycentric Lagrange for interpolation
(ala the original), and ended up using barycentric interpolation to
generate evenly spaced sample points and then an fft for finer
interpolation, allowing fine grids with less computation. Along with that
the band edges are all rounded to grid points whereas the original used the
exact values.

I haven't looked at this for two years and it needs tests, a filter design
front end, and probably some cleanup/refactoring.

Chuck
Ralf Gommers
2012-01-05 07:48:12 UTC
Permalink
Post by Ralf Gommers
5. Find a clear (group of) maintainer(s) for each sub-module. For people
familiar with one module, responding to
tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
Not really. The only way you can tell a little bit right now is the way
Trac tickets get assigned. For example Pauli gets documentation, Josef gets
stats tickets.

We could have a list on Trac, linked to from the developers page on
scipy.org, where we have a list of modules with for each module a (group
of) people listed who are interested in it and would respond to tickets and
PRs for that module. Not necessarily to fix everything asap, but at least
to review patches, respond to tickets and outline how bugs should be fixed
or enhancements could best be added.

For PRs I think everyone can follow the RSS feed that Pauli set up. For
Trac I'm not sure it's possible to send notifications to more than one
person. If not, at least the tickets should get assigned to one person who
could then forward them, until there's a better solution.

As administrative points I would propose:
- People should be able to add and remove themselves from this list.
- Commit rights are not necessary to be on the list (but of course can be
asked for).
- Add a recommendation that no one person should be the Trac assignee for
more than two modules, and preferably only one if it's a large one.

The group of people interested in a module could also compile a list of
things to do to improve the quality of the module, and add tickets to an
"easy fixes" list.

Ralf
Benny Malengier
2012-01-05 09:11:14 UTC
Permalink
I'll jump in the discussion.

As author of the odes scikit, I'd like to note that we moved
development to github for the normal reasons,
https://github.com/bmcage/odes
We work on a cython implementation of the sundials solvers we need (I
discussed with the pysundials author, and they effectively have no
more time to work on that except to keep it doing for what they use
it), and are experimenting with the API. When we finalize this work,
I'll ask to remove the svn version from the old servers. My co-worker
on this hates the scikit namespace, but for now, it is still in.

The reason for the scikit and not patches to integrate are as before:
dependency on sundials. I do think the (c)vode solver in scipy is too
old-fashioned, and should better be replaced by the current vode
solver of sundials. So I would urge that some thoughts are given if
those parts of scipy.integrate really should make it in a 1.0 version.

Another issue with the odes scikit is that nobody seems to know how
the API for ODE or DAE is best done, different fields have their own
typically workflow. So just doing it as it is usefull for my
applicatoins seems like the fastest way forward, and if a broader
community is interested, we can discuss. Also, I can change the API of
my own things, but to find time to change ode class in scipy.integrate
would be difficult (I don't have a fixed position).

Benny

PS: For those interested, you can see the API for DAE from
https://github.com/bmcage/odes/blob/master/scikits/odes/sundials/ida.pyx
. I would think the main annoyance would be that the equations must be
passed to the init method as a class ResFunction due to
performance/technical reasons, which is not very scipy like. That
however would be for another mail thread, which I'll do at another
time. Odes does not have it's own mailing list at the moment.
Denis Laxalde
2012-01-05 13:36:48 UTC
Permalink
Post by Ralf Gommers
For PRs I think everyone can follow the RSS feed that Pauli set up. For
Trac I'm not sure it's possible to send notifications to more than one
person.
Trac generates RSS feeds as well in the "Custom Query" tab based on
filters (e.g. by component, status).
--
Denis
Mathieu Blondel
2012-01-17 09:01:52 UTC
Permalink
I would like to give some feedback on my experience as a contributor
to scikit-learn. Here are a few things I like:

- Contributing and following the project allows me to improve my
knowledge of the field (I'm a graduate student in machine learning).
The signal-to-noise ratio on the mailing-list is high, as the threads
are usually directly related to my interest. It's also a valuable
addition to my CV.

- The barrier to entry is very low: the code base is not too big, the
code is clear and the API is simple. This explains partly why we get
so many pull-requests from occasional contributors.

- Contributors get push privilege (become part of the scikit-learn
github organization) after just a few pull requests and are fully
credited in the changelogs and file headers. We never had any problem
with this policy: people usually know when a commit can be pushed to
master directly and when it warrants a pull-request / review first.

- All important decisions are taken democratically and we now have
well-identified workflows. The small of size of the project probably
helps a lot.

- The project is very dynamic and is moving fast!


I like the idea of a core scipy library and an ecosystem of scikits
with a well-identified scope around it! The success of scikit-learn
could be used as a model, so as to reproduce the successes and not
repeat the failures (see Gael's document on bootstrapping a
community). This is already happening in scikit-image, as far as I can
see.

Why use the prefix scikit- rather than a top-level package name?
Because scikit should be a brand name and should be a guarantee of
quality.

My 2 cents,
Mathieu

David Cournapeau
2012-01-03 20:18:49 UTC
Permalink
I don't know if this has already been discussed or not.   But, I really don't understand the reasoning behind "yet-another-project" for signal processing.   That is the whole-point of the signal sub-project under the scipy namespace.   Why not just develop there?  Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace.  I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place.   If you don't want to do that, then just pick a top-level name and use it.
As mentioned by other, there are multiple reasons why one may not want
to put something in scipy. I would note that putting something in
scikits today means it cannot be integrated into scipy later. But
putting things in scipy has (implicitly at least) much stronger
requirements around API stability than a scikit, and a much slower
release process (I think on average, we made one release year).

cheers,

David
Robert Kern
2012-01-03 20:33:01 UTC
Permalink
Post by David Cournapeau
I don't know if this has already been discussed or not.   But, I really don't understand the reasoning behind "yet-another-project" for signal processing.   That is the whole-point of the signal sub-project under the scipy namespace.   Why not just develop there?  Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace.  I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place.   If you don't want to do that, then just pick a top-level name and use it.
As mentioned by other, there are multiple reasons why one may not want
to put something in scipy. I would note that putting something in
scikits today means it cannot be integrated into scipy later.
Why not? We incorporate pre-existing code all of the time. What makes
a scikits project any different from others?
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
David Cournapeau
2012-01-04 08:17:42 UTC
Permalink
Post by Robert Kern
Post by David Cournapeau
I don't know if this has already been discussed or not.   But, I really don't understand the reasoning behind "yet-another-project" for signal processing.   That is the whole-point of the signal sub-project under the scipy namespace.   Why not just develop there?  Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace.  I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place.   If you don't want to do that, then just pick a top-level name and use it.
As mentioned by other, there are multiple reasons why one may not want
to put something in scipy. I would note that putting something in
scikits today means it cannot be integrated into scipy later.
Why not? We incorporate pre-existing code all of the time. What makes
a scikits project any different from others?
Sorry, I meant the contrary from what I wrote: of course, putting
something in scikits does not prevent it from being integrated in
scipy later.

David
Ralf Gommers
2012-01-03 20:37:10 UTC
Permalink
Post by Travis Oliphant
Post by Travis Oliphant
I don't know if this has already been discussed or not. But, I really
don't understand the reasoning behind "yet-another-project" for signal
processing. That is the whole-point of the signal sub-project under the
scipy namespace. Why not just develop there? Github access is easy to
grant.
Post by Travis Oliphant
I must admit, I've never been a fan of the scikits namespace. I would
prefer that we just stick with the scipy namespace and work on making scipy
more modular and easy to distribute as separate modules in the first place.
If you don't want to do that, then just pick a top-level name and use it.
As mentioned by other, there are multiple reasons why one may not want
to put something in scipy. I would note that putting something in
scikits today means it cannot be integrated into scipy later. But
putting things in scipy has (implicitly at least) much stronger
requirements around API stability than a scikit, and a much slower
release process (I think on average, we made one release year).
Integrating code into scipy after initially developing it as a separate
package is something that is not really happening right now though. In
cases like scikits.image/learn/statsmodels, which are active, growing
projects, that of course doesn't make sense, but for packages that are
stable and see little active development it should happen more imho.

Example 1: numerical differentiation. Algopy and numdifftools are two
mature packages that are general enough that it would make sense to
integrate them. Especially algopy has quite good docs. Not much active
development, and the respective authors would be in favor, see
http://projects.scipy.org/scipy/ticket/1510.

Example 2: pywavelets. Nice complete package with good docs, much better
than scipy.signal.wavelets. Very little development activity for the
package, and wavelets are of interest for a wide variety of applications.
Would have helped with the recent peak finding additions by Jacob Silterra
for example. (Not sure how the author of pywavelets would feel about this,
it's just an example).

I'm sure it's not difficult to find more examples. Scipy is getting
released more frequently now than before, and I hope we can keep it that
way. Perhaps there are simple reasons that integrating code doesn't happen,
like lack of time of the main developer. But on the other hand, maybe we as
scipy developers aren't as welcoming as we should be, or should just go and
ask developers how they would feel about incorporating their mature code?

Ralf
Travis Oliphant
2012-01-03 21:07:38 UTC
Permalink
Perhaps that is a concrete thing that I can do over the next few months: Follow-up with different developers of packages that might be interested in incorporating their code into ScIPy as a module or as part of another module.

Longer term, I would like to figure out how to make SciPy more modular.

-Travis
Post by David Cournapeau
Post by Travis Oliphant
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
As mentioned by other, there are multiple reasons why one may not want
to put something in scipy. I would note that putting something in
scikits today means it cannot be integrated into scipy later. But
putting things in scipy has (implicitly at least) much stronger
requirements around API stability than a scikit, and a much slower
release process (I think on average, we made one release year).
Integrating code into scipy after initially developing it as a separate package is something that is not really happening right now though. In cases like scikits.image/learn/statsmodels, which are active, growing projects, that of course doesn't make sense, but for packages that are stable and see little active development it should happen more imho.
Example 1: numerical differentiation. Algopy and numdifftools are two mature packages that are general enough that it would make sense to integrate them. Especially algopy has quite good docs. Not much active development, and the respective authors would be in favor, see http://projects.scipy.org/scipy/ticket/1510.
Example 2: pywavelets. Nice complete package with good docs, much better than scipy.signal.wavelets. Very little development activity for the package, and wavelets are of interest for a wide variety of applications. Would have helped with the recent peak finding additions by Jacob Silterra for example. (Not sure how the author of pywavelets would feel about this, it's just an example).
I'm sure it's not difficult to find more examples. Scipy is getting released more frequently now than before, and I hope we can keep it that way. Perhaps there are simple reasons that integrating code doesn't happen, like lack of time of the main developer. But on the other hand, maybe we as scipy developers aren't as welcoming as we should be, or should just go and ask developers how they would feel about incorporating their mature code?
Ralf
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Gael Varoquaux
2012-01-03 21:30:24 UTC
Permalink
Post by Travis Oliphant
Integrating code into scipy after initially developing it as a separate
package is something that is not really happening right now though.
I would look to respectfully disagree :). With regards to large
contributions, Jake VanderPlas's work on arpack started in the
scikit-learn. The discussion that we had recently on integrating the
graph algorithmic shows that such an integration will continue. In
addition, if I look at the commits in scipy, I see plenty that were
initiated in the scikit-learn (I see them, because I look at the
contributions of scikit-learn developers).

That said, I know what you mean: a lot of worthwhile code is just
developed on its own, and never gets merged into a major package. It's a
pity, as it would be more useful. That said, it is also easy to see why
it doesn't happen: the authors implemented that code to scratch an itch,
and once that itch scratched, there are done.
Post by Travis Oliphant
Example 1: numerical differentiation. Algopy and numdifftools are two
mature packages that are general enough that it would make sense to
integrate them. Especially algopy has quite good docs. Not much active
development, and the respective authors would be in favor, see
http://projects.scipy.org/scipy/ticket/1510.
OK, this sounds like an interesting project that could/should get
funding. Time to make a list for next year's GSOC, if we can find
somebody willing to mentor it.
Post by Travis Oliphant
Example 2: pywavelets. Nice complete package with good docs, much better
than scipy.signal.wavelets. Very little development activity for the
package, and wavelets are of interest for a wide variety of applications.
Yes, pywavelet is high on my list of code that should live in a biggest
package. I find that it's actually fairly technical code, and I would be
weary of merging it in if there is not somebody with good expertise to
maintain it.

[snip (reordered quoting of Ralf's email)]
Post by Travis Oliphant
In cases like scikits.image/learn/statsmodels, which are active,
growing projects, that of course doesn't make sense
Well, actually, if people think that some of the algorithms that we have
in scikit-learn should be merged back in scipy, we are open to it. A few
things to keep in mind:

- We have gathered a significant experience on some techniques relative
to stochastic algorithms and big data. I wouldn't like to merge in
scipy too technical code, for the fear of it 'dying' there. Some people
say that code goes to the Python standard library to die [1] :).

- For the reasons explained in my previous mail (i.e. pros of having
domain specific packages when it comes to highly specialized features)
I don't think that it is desirable to see in the long run the full
codebase of scikit-learn merged in scipy.
Post by Travis Oliphant
Scipy is getting released more frequently now than before, and I hope
we can keep it that way.
This, plus the move to github, does make it much easier to contribute. I
think that it is having a noticeable impact.
Post by Travis Oliphant
or should just go and ask developers how they would feel about
incorporating their mature code?
That might actually be useful.

Gael

[1]
http://frompythonimportpodcast.com/episode-004-dave-hates-decorators-where-code-goes-to-die
Gaël Varoquaux
2012-01-03 11:50:34 UTC
Permalink
Hi Jaidev, hi list,

I am resending a mail that I sent a few weeks ago, as I am not sure why,
but I haven't been able to send to the list recently. This e-mail is a
bit out of context with the current discussion, but I'd just like to get
it out for the record, and because I originally wrote it to support the
idea. I am writing a new mail to address the current discussion.

-- Original mail --

Indeed, at the scipy India, Jaidev gave a great talk about the empirical
mode decomposition, and the Hilbert-Huang Transform. Given that I have
absolutely formal training in signal processing, one thing that I really
appreciated in his talk, is that I was able to sit back and actually
learn useful practical signal processing. Not many people go through the
work of making code and examples understandable to none experts.

That got me thinking that we, the scipy community, could really use a
signal processing toolkit, that non experts like me could use. There is a
lot of code lying around, in different toolkits (to list only
MIT/BSD-licensed code: nitime, talkbox, mne-python, some in matplotlib),
without mentioning code scattered on people's computer.

I think that such a project can bring value only if it manages to do more
than lumping individual code together. Namely it needs code quality,
consistency across functionality and good documentation and examples.
This value comes from the community dynamics that build around it. A
project with a low bus factor is a project that I am weary of. In
addition, once people start feeling excited and proud of it, the quality
of the contributions increases.

I do not have the time, nor the qualifications to drive a scikit-signal.
Jaidev is not very experimented in building scipy packages, but he has
the motivation and, I think, the skills. At scipy India, we pushed him to
give it a go. Hopefully, he will find the time to try, and walk down the
recipe I cooked up to create a project [1], but for the project to be
successful in the long run, it needs interest from other contributors of
the scipy ecosystem.


In the mean time, better docs and examples for scipy.signal would also
help. For instance, hilbert transform is in there, but because I don't
know signal processing, I do not know how to make a good use of it.
Investing time on that is a investment with little risks: it is editable
on line at http://docs.scipy.org/scipy/docs/scipy-docs/index.rst/

My 2 euro cents

Gaël

[1] https://gist.github.com/1433151

PS: sorry if you receive this message twice.
j***@gmail.com
2012-01-03 14:54:59 UTC
Permalink
On Tue, Jan 3, 2012 at 6:50 AM, Gaël Varoquaux
Post by Gaël Varoquaux
Hi Jaidev, hi list,
I am resending a mail that I sent a few weeks ago, as I am not sure why,
but I haven't been able to send to the list recently. This e-mail is a
bit out of context with the current discussion, but I'd just like to get
it out for the record, and because I originally wrote it to support the
idea. I am writing a new mail to address the current discussion.
-- Original mail --
Indeed, at the scipy India, Jaidev gave a great talk about the empirical
mode decomposition, and the Hilbert-Huang Transform. Given that I have
absolutely formal training in signal processing, one thing that I really
appreciated in his talk, is that I was able to sit back and actually
learn useful practical signal processing. Not many people go through the
work of making code and examples understandable to none experts.
That got me thinking that we, the scipy community, could really use a
signal processing toolkit, that non experts like me could use. There is a
lot of code lying around, in different toolkits (to list only
MIT/BSD-licensed code: nitime, talkbox, mne-python, some in matplotlib),
without mentioning code scattered on people's computer.
I think that such a project can bring value only if it manages to do more
than lumping individual code together. Namely it needs code quality,
consistency across functionality and good documentation and examples.
This value comes from the community dynamics that build around it. A
project with a low bus factor is a project that I am weary of. In
addition, once people start feeling excited and proud of it, the quality
of the contributions increases.
I do not have the time, nor the qualifications to drive a scikit-signal.
Jaidev is not very experimented in building scipy packages, but he has
the motivation and, I think, the skills. At scipy India, we pushed him to
give it a go. Hopefully, he will find the time to try, and walk down the
recipe I cooked up to create a project [1], but for the project to be
successful in the long run, it needs interest from other contributors of
the scipy ecosystem.
In the mean time, better docs and examples for scipy.signal would also
help. For instance, hilbert transform is in there, but because I don't
know signal processing, I do not know how to make a good use of it.
Investing time on that is a investment with little risks: it is editable
on line at http://docs.scipy.org/scipy/docs/scipy-docs/index.rst/
My 2 euro cents
Gaël
[1] https://gist.github.com/1433151
PS: sorry if you receive this message twice.
I think scipy as a central toolbox has still a very valuable role. For
example statsmodels uses linalg, stats, optimize, interpolate,
special, signal, fft and some sparse, and I might have forgotten
something.

sklearn (Fabian) brought several improvements to linalg back to scipy,
the recent discussion on sparse graph algorithms show there are
enhancements that are useful to have centrally across applications and
across scikits.
(another example Lomb-Scargle looks interesting as general tool, but I
haven't seen any other code for unevenly space time yet, and haven't
used it yet..)

The advantage of the slow and backwards compatible pace of scipy is
that we don't have to keep up with the much faster changes in the
early stages of scikits development.

One advantage of a scikits is that it is possible to figure out a more
useful class structure for extended work, than the more "almost
everything is a function approach" in scipy.

I also agree with Gael that having some usage documentation, like the
examples in statsmodels, sklearn and matplotlib, are very useful. My
recent examples, figuring out how to use the quadrature weights and
points (I managed), and how to use the signal wavelets or pywavelets
for function approximation (no clue yet).
Some parts are well covered in the scipy tutorials, others we are on our own.

Josef
Post by Gaël Varoquaux
_______________________________________________
SciPy-Dev mailing list
http://mail.scipy.org/mailman/listinfo/scipy-dev
Loading...