Hi Travis,
It is good that you are asking these questions. I think that they are
important. Let me try to give my view on some of the points you raise.
Post by Travis Oliphant
There are too many scikits already that should just be scipy projects
I used to think pretty much as you did: I don't want to have to depend on
too many packages. In addition we are a community, so why so many
packages? My initial vision when investing in the scikit-learn was that
we would merge it back to scipy after a while. The dynamics of the project
have changed my way of seeing things a bit, and I now think that it is a
good thing to have scikits-like packages that are more specialized than
scipy for the following reasons:
1. Development is technically easier in smaller packages
A developer working on a specific package does not need to tackle the
complexity of the full scipy suite. Building can be made easier, as scipy
must (for good reasons) depend on Fortran and C++ packages. It is well known
that the complexity of developing a project grows super-linearly with the
number of lines of code.
It's also much easier to achieve short release cycles. Short
release cycles are critical to the dynamic of a community-driven
project (and I'd like to thank our current release manager, Ralf
Gommers, for his excellent work).
2. Narrowing the application domain helps developers and users
It is much easier to make entry points, in the code and in the
documentation, with a given application in mind. Also, best practices and
conventions may vary between communities. While this is (IMHO) one of the
tragedies of contemporary science, such domain specialization
helps people feel comfortable.
Computational trade-offs tend to be fairly specific to a given
context. For instance machine learning will more often be interested in
datasets with a large number of features and a (comparatively) small
number of samples, whereas in statistics it is the opposite. Thus the
same algorithm might be implemented differently. Catering for all needs
tends to make the code much more complex, and may confuse users by
presenting them with too many options.
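As an illustration of such a trade-off, here is a minimal sketch (my own
example, not code taken from scipy or scikit-learn). Ridge regression can
be solved through a linear system whose size is either the number of
features or the number of samples. Both give the same coefficients, but
the cheaper one depends on the shape of the data. The function names
below are hypothetical and chosen only for this example.

```python
import numpy as np


def ridge_primal(X, y, alpha):
    # Solve (X^T X + alpha * I) w = X^T y.
    # The system is n_features x n_features: cheap when samples dominate.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def ridge_dual(X, y, alpha):
    # Solve (X X^T + alpha * I) c = y, then map back with w = X^T c.
    # The system is n_samples x n_samples: cheap when features dominate.
    n_samples = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + alpha * np.eye(n_samples), y)


def ridge(X, y, alpha=1.0):
    # Hypothetical dispatcher: pick the smaller linear system.
    n_samples, n_features = X.shape
    if n_features > n_samples:
        return ridge_dual(X, y, alpha)
    return ridge_primal(X, y, alpha)


# A "machine learning" shaped problem: few samples, many features.
rng = np.random.RandomState(0)
X_wide = rng.randn(20, 500)
y = rng.randn(20)

w_primal = ridge_primal(X_wide, y, 1.0)  # solves a 500 x 500 system
w_dual = ridge_dual(X_wide, y, 1.0)      # solves a 20 x 20 system
```

A general-purpose library would have to expose this choice, or guess it,
whereas a domain-specific one can simply default to the formulation that
suits its typical data.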
Developers cannot be expert in everything. If I specialize in machine
learning, and follow the recent developments in literature, chances are
that I do not have time to be competitive in numerical integration. Having
too wide a scope in a project means that each developer understands well
a small fraction of the code. It makes things really hard for the release
manager, but also for day to day work, e.g. what to do with a new broken
test.
3. It is easier to build an application-specific community
An application-specific library is easier to brand. One can tailor a
website, a user manual, and conference presentations or papers to an
application. As a result the project gains visibility in the community
of scientists and engineers it targets.
Also, having more focused mailing lists helps build enthusiasm, as they
have less volume, and are more focused on questions that people
are interested in.
Finally, a sad but true statement is that people tend to get more credit
when working on an application-specific project than on a core layer.
Similarly, it is easier for me to get credit to fund development of an
application-specific project.
On a positive note, I would like to stress that I think that the
scikit-learn has had a general positive impact on the scipy ecosystem,
including for those who do not use it, or who do not care at all about
machine learning. First, it is drawing more users into the community, and
as a result, there is more interest and money flying around. But more
importantly, when I look at the latest release of scipy, I see that many of
the new contributors are also scikit-learn contributors (not only
Fabian). This can be partly explained by the fact that getting involved
in the scikit-learn was an easy and high-return-on-investment move for
them, but they quickly grew to realize that the base layer could be
improved. We have always had the vision to push into scipy any improvement
that was general enough to be useful across application domains.
Remember, David Cournapeau was lured into the scipy business by working on
the original scikit-learn.
Post by Travis Oliphant
Frankly, it makes me want to pull out all of the individual packages I
wrote that originally got pulled together into SciPy into separate
projects and develop them individually from there.
What you are proposing is interesting. That said, I think that the
current status quo with scipy is a good one. Having a core collection of
numerical tools is, IMHO, a key element of the Python scientific
community for two reasons:
* For the user, knowing that he will find the answer to most of his
simple questions in a single library makes it easy to start. It also
makes it easier to document.
* Different packages need to rely on a lot of common generic tools.
Linear algebra, sparse linear algebra, simple statistics and signal
processing, simple black-box optimizers, interpolation, ND-image-like
processing. Indeed, you ask which packages in scipy people use.
Actually, in scikit-learn we use all sub-packages apart from
'integrate'. I checked, and we even use 'io' in one of the examples.
Any code doing high-end application-specific numerical computing will
need at least a few of the packages of scipy. Of course, a package
may need an optimizer tailored to a specific application, in which
case its developers will roll their own, and this effort might be
duplicated a bit. But having the common core helps consolidate the
ecosystem.
So the setup that I am advocating is a core library, with many other
satellite packages. Or rather a constellation of packages that use each
other rather than a monolithic universe. This is a common strategy of
breaking a package up into parts that can be used independently to make
them lighter and hopefully ease the development of the whole. For
instance, this is what was done to the ETS (Enthought Tool Suite). And we
have all seen this strategy go bad, for instance in the situation of
'dependency hell', in which all packages start depending on each
other, the installation becomes an issue and there is a gridlock of
version-compatibility bugs. This is why any such ecosystem must have an
almost tree-like structure in its dependency graph. Some packages must be
on top of the graph, more 'core' than others, and as we descend the
graph, packages can reduce their dependencies. I think that we have more
or less this situation with scipy, and I am quite happy about it.
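To make the 'almost tree-like' requirement concrete, here is a minimal
sketch of a check that a set of packages has no circular dependencies,
by computing an order in which they could be installed. The function
name is hypothetical, and 'scikit-signal' with its listed dependencies
is an assumption made purely for illustration.

```python
def install_order(dependencies):
    """Return packages ordered so that each comes after its dependencies.

    `dependencies` maps a package name to the packages it depends on.
    Raises ValueError if the dependency graph contains a cycle.
    """
    # Copy the graph, adding packages that only appear as a dependency.
    remaining = {pkg: set(deps) for pkg, deps in dependencies.items()}
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    order = []
    while remaining:
        # Packages whose dependencies have all been placed already.
        ready = sorted(pkg for pkg, deps in remaining.items() if not deps)
        if not ready:
            # Everything left waits on something else left: a cycle.
            raise ValueError(
                "circular dependency among: %s" % sorted(remaining))
        for pkg in ready:
            order.append(pkg)
            del remaining[pkg]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


# The most 'core' packages come first, the specialized ones last.
ecosystem = {
    "numpy": [],
    "scipy": ["numpy"],
    "scikit-learn": ["numpy", "scipy"],
    "scikit-signal": ["numpy", "scipy"],  # hypothetical package
}
order = install_order(ecosystem)
```

If two packages started depending on each other, no such order would
exist and the function would raise an error, which is the gridlock
described above in miniature.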
Now I hear your frustration when this development happens a bit in the
wild with no visible construction of an ecosystem. This ecosystem does
get constructed via the scipy mailing-lists, conferences, and in general
the community, but it may not be very clear to the external observer. One
reason why my group decided to invest in the scikit-learn was that it was
the learning package that seemed the closest in terms of code and
community connections. This was the virtue of the 'scikits' branding. For
technical reasons, the different scikits have started getting rid of this
namespace in the module import. You seem to think that the branding name
'scikits' does not accurately reflect the fact that they are tight
members of the scipy constellation. While I must say that I am not a huge
fan of the name 'scikits', we have now invested in it, and I don't think
that we can easily move away.
If the problem is a branding issue, it may be partly addressed with
appropriate communication. A set of links across the different web pages
of the ecosystem, and a central document explaining the relationships
between the packages might help. But this idea is not completely new and
it simply is waiting for someone to invest time in it. For instance,
there was the project of reworking the scipy.org homepage.
Another important problem is the question of what sits 'inside' this
collection of tools, and what is outside. The answer to this question
will pretty much depend on who you ask. In practice, for the end user, it
is very much conditioned by what meta-package they can download. EPD,
Sage, Python(x,y), and many others give different answers.
To conclude, I'd like to stress that, in my eyes, what really matters is
a solution that gives us a vibrant community, with a good production of
quality code and documentation. I think that the current set of small
projects makes it easier to gather developers and users, and that it
works well as long as they talk to each other and do not duplicate too
much of each other's functionality. If on top of that they are BSD-licensed
and use numpy as their data model, I am a happy man.
What I am pushing for is a Bazaar-like development model, in which it is
easy for various approaches answering different needs to develop in
parallel with different compromises. In such a context, I think that
Jaidev could kick-start a successful and useful scikit-signal. Hopefully
this would not preclude improvements to the docs, examples, and existing
code in scipy.signal.
Sorry for the long post, and thank you for reading.
Gael