[Scikit-learn-general] GSoC 2012

Discussion:

Bala Subrahmanyam Varanasi

2012-01-18 06:12:44 UTC

Dear all,

I would like to participate in Google Summer of Code this year. Please let
me know which ideas you would like to see implemented in scikit-learn for
GSoC 2012.

Also, I'm taking Stanford's online courses (ML class and NLP class). I
believe this is the right time to discuss, because I can learn new things
before the start of GSoC and then work on challenging implementations in
scikit-learn.

Thank you.

Bala Subrahmanyam Varanasi
IV B.Tech, Information Technology
Vishnu Institute of Technology
e-mail: ***@gmail.com
contact number: +919985415959

Andreas

2012-01-18 10:11:32 UTC

Hi Bala.
I'm not sure how this usually goes but here is my current wish list.
We'd have to discuss whether any of that actually fits into the scikits,
though ;)

- Multilayer Perceptron and Multinomial Logistic regression
I have been working on that so maybe there is not enough
left to do there for a GSoC. Not sure, though

- Graph Cut Energy minimization
This is an inference technique so I'm not totally sure
if this should go into scikit-learn. Could also be a
candidate for scikit-image.
The main work would be to implement an efficient
max flow algorithm and then do graph constructions
for alpha expansion and alpha-beta swaps.

- Averaged gradient descent
I think this is on everybody's wish list. Not
sure how much work this will be.
I'm sure lots of people will have something to say about that ;)
See issue #543: https://github.com/scikit-learn/scikit-learn/issues/543

- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.
At the moment, the most commonly used implementation
is Joachims' SVMstruct.
This has licensing issues but talking to him might help.
Another option is implementing optimization via SGD
or, if you want to go crazy, cutting plane techniques
or bundle methods yourself.
Designing the interface is also non-trivial.
One would have to think about whether / how it
is possible to use structured SVMs just from Python,
without writing Cython functions.

- Low rank kernel approximations (Nystrom methods)
This is mainly interesting for SVMs.
The idea is to approximate the kernel matrix with
a low rank factorization and use this to construct
a linear SVM problem.
This is related to the current kernel approximation
module but takes a somewhat different approach.
This method makes large-scale SVMs fast / possible.

- Kernel Perceptron
There is a (I think) pure Python implementation
by Mathieu that could be Cythonized.
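
A minimal sketch of the Nystrom idea from the list above, written in plain NumPy. The RBF kernel, the gamma value, the uniformly sampled landmark points and the helper names (rbf_kernel, nystroem_features) are all assumptions made for illustration; nothing in this thread fixes them. The feature map Z is built so that Z @ Z.T approximates the full kernel matrix, which is what lets a linear SVM stand in for the kernel SVM.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def nystroem_features(X, landmarks, gamma=0.1, eps=1e-10):
    """Map X to features Z such that Z @ Z.T approximates rbf_kernel(X, X)."""
    K_mm = rbf_kernel(landmarks, landmarks, gamma)  # small m x m kernel block
    K_nm = rbf_kernel(X, landmarks, gamma)          # n x m cross-kernel block
    eigvals, eigvecs = np.linalg.eigh(K_mm)
    eigvals = np.maximum(eigvals, eps)              # guard against tiny or negative values
    # K is approximated by K_nm @ inv(K_mm) @ K_nm.T, so Z = K_nm @ K_mm^(-1/2).
    return K_nm @ (eigvecs / np.sqrt(eigvals)) @ eigvecs.T


rng = np.random.RandomState(0)
X = rng.randn(200, 5)
# Illustrative choice: 50 landmarks sampled uniformly from the data.
landmarks = X[rng.choice(len(X), size=50, replace=False)]
Z = nystroem_features(X, landmarks)

K_exact = rbf_kernel(X, X)
relative_error = np.linalg.norm(K_exact - Z @ Z.T) / np.linalg.norm(K_exact)
print(Z.shape, relative_error)
```

The rows of Z can then be passed to any linear classifier. Later scikit-learn releases ship a Nystroem transformer in sklearn.kernel_approximation that plays this role.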

That's it for the moment, I think.
I'd be happy to mentor any of the above projects
if the others agree that they are sensible.

Maybe we should update the wiki for the next GSoC?

Cheers,
Andy

Lars Buitinck

2012-01-18 10:28:52 UTC

Post by Andreas
- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.

+100 on this one...

Post by Andreas
Designing the interface is also non-trivial.

Indeed. I suspect different APIs would be needed for different cases
(linear-chain case, tree case, general case). Having just one of these
would be great, but I agree this might be a *big* project.

One more thing on my wishlist is semi-supervised meta-algorithms, like
self-training, co-training, co-boosting. These should not be
incredibly hard to implement, but they're still far from trivial.
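
To make the "meta-algorithm" idea concrete, here is a rough sketch of self-training wrapped around an arbitrary probabilistic classifier. The function name self_train, the 0.9 confidence threshold, the round limit and the toy two-blob data are all assumptions for illustration, not an API proposed in this thread; LogisticRegression is used only because it exposes predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(base_estimator, X_labeled, y_labeled, X_unlabeled,
               threshold=0.9, max_rounds=10):
    """Fit, pseudo-label the confidently predicted unlabeled points, and repeat."""
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    remaining = X_unlabeled.copy()
    for _ in range(max_rounds):
        if len(remaining) == 0:
            break
        base_estimator.fit(X_train, y_train)
        proba = base_estimator.predict_proba(remaining)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is predicted confidently enough to add
        pseudo_labels = base_estimator.classes_[proba.argmax(axis=1)]
        X_train = np.vstack([X_train, remaining[confident]])
        y_train = np.concatenate([y_train, pseudo_labels[confident]])
        remaining = remaining[~confident]
    # Final fit on the labeled points plus everything that was pseudo-labeled.
    return base_estimator.fit(X_train, y_train), len(y_train)


rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 2.0, rng.randn(100, 2) + 2.0])
y = np.array([0] * 100 + [1] * 100)
labeled = np.array([0, 1, 2, 100, 101, 102])  # only six labeled points
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

model, n_used = self_train(LogisticRegression(), X[labeled], y[labeled], X[unlabeled])
print(n_used, model.score(X, y))
```

Co-training and co-boosting follow the same wrap-an-estimator pattern but need two views or two learners, which is part of why the interface is less trivial than the loop suggests.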

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

Gael Varoquaux

2012-01-18 22:26:23 UTC

Post by Lars Buitinck

Post by Andreas
- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.

+100 on this one...

For this, do we need to have our own SVM solver? This is a naive
question, I have never looked at structured SVM.

This seems to me a fairly challenging project.

Post by Lars Buitinck
One more thing on my wishlist is semi-supervised meta-algorithms, like
self-training, co-training, co-boosting. These should not be
incredibly hard to implement, but they're still far from trivial.

Yes, I think that having a good semi-supervised codebase (including model
selection) would be an interesting project. What I like about it is that
the difficulty seems to increase gradually.

Of course, other suggestions are welcome. Maybe we should start a wiki
page.

Gael

Andreas

2012-01-18 22:37:15 UTC

Post by Gael Varoquaux

Post by Lars Buitinck

Post by Andreas
- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.

+100 on this one...

For this, do we need to have our own SVM solver? This is a naive
question, I have never looked at structured SVM.
This seems to me a fairly challenging project.

This is quite definitely a challenging project.
This should only be given to someone with a fair understanding
of the topic.

There are several options, as I tried to say in my initial post:
1) bindings for an existing structured SVM.
2) bindings for a smart solver, with the structured SVM code written by us
3) using SGD for solving. This means "having our own SVM solver"
-- but we already have one in SGDClassifier.
4) write a solver using cutting plane or bundle methods (not quite
sure if this is a good idea)
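
For readers unfamiliar with option 3, the sketch below shows only the ordinary, non-structured case: SGDClassifier with the hinge loss and an L2 penalty optimizes the linear SVM objective by stochastic gradient descent. It does not do structured prediction. The toy data and the parameter values (alpha, max_iter, tol) are assumptions chosen for illustration, and the keyword names follow recent scikit-learn releases rather than the version current when this thread was written.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy two-class problem: two Gaussian blobs (illustrative data only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 2.0, rng.randn(100, 2) + 2.0])
y = np.array([0] * 100 + [1] * 100)

# loss="hinge" with an L2 penalty is the linear SVM objective, solved by SGD.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                    max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

A structured SVM trained by SGD would replace the per-sample hinge subgradient with one computed from a loss-augmented inference step, which is where the interface questions raised above come in.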

Gael Varoquaux

2012-01-18 22:44:46 UTC

Having this feature might get us a LOT of attention.
But this is really not a simple project.

Before trying to jump to the super fancy features, I'd rather have a
polished and versatile version of the scikit. There are many things that I
find we haven't explored properly. For instance, these are my personal
pain points:

* we don't have an online learning framework.

* Our model selection framework is still weak

- see
https://github.com/scikit-learn/scikit-learn/pull/443#issuecomment-3231270

- also, the difficulty of doing nested cross-validation with a specific
cross-validation strategy.

* we are light on the semi-supervised API

* our parameter naming is not uniform enough across models.
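
As an illustration of the nested cross-validation point in the list above, here is a sketch of what it can look like with an explicit splitting strategy at both levels. It uses GridSearchCV, cross_val_score and StratifiedKFold from sklearn.model_selection, a module that belongs to later scikit-learn releases and was not available when this was written. The iris data and the parameter grid are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two independent splitters: the inner one tunes hyperparameters,
# the outer one estimates generalization of the whole tuning procedure.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Each outer fold refits the grid search on its own training part.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```

Passing the search object itself as the estimator is what makes the procedure nested: the test part of each outer fold is never seen during parameter selection.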

All these are points that I'd like to see addressed, because I fear that
they could all induce a change in API or conventions. And I'd like API
and conventions to be stabilized, to be able to push out a 1.0 (I am
talking 6 months to 1 year horizon).

G

PS: On the down side, I am having crazy days these last weeks. I spend
the whole day at work looking at other people's problems, and when I get
home in the evening, instead of hacking on open source, I answer email
and try to get done the work that I should have done during the day...
Sorry lads, just mumbling about my own lack of productivity.

Andreas

2012-01-18 23:08:41 UTC

Post by Gael Varoquaux

Having this feature might get us a LOT of attention.
But this is really not a simple project.

Before trying to jump to the super fancy features, I'd rather have a
polished and versatile version of the scikit.

I totally agree - I have tried to do as much polishing as I can over the
last couple of weeks.
There is still a lot to do. I opened some issues today and
yesterday to track stuff that seemed important to me.

I have no experience with GSoC and I will totally bow
to your wisdom there. My thinking was that single
algorithms are more "project-like" than doing polishing here and
there.

There is important refactoring being done by Lars and Mathieu
at the moment which is really great. But I wouldn't give that
to someone as a project.

Post by Gael Varoquaux
There are many things that I
find we haven't explored properly. For instance, these are my personal
pain points:
* we don't have an online learning framework.
* Our model selection framework is still weak
- see
https://github.com/scikit-learn/scikit-learn/pull/443#issuecomment-3231270
- also, the difficulty of doing nested cross-validation with a specific
cross-validation strategy.
* we are light on the semi-supervised API
* our parameter naming is not uniform enough across models.
All these are points that I'd like to see addressed, because I fear that
they could all induce a change in API or conventions.

I noticed some cross-validation issues but not all that you mentioned.
We should maybe plan a bit more on that.

About online and semi-supervised learning:
I feel these are two specific sub-fields that many people are interested
in but that are not central to machine learning.
I am not sure I would want the scikit api to focus on these.
If you go to a machine learning conference, I'm pretty sure
there will be more people working on structured learning
than on semi-supervised and online learning.

Don't get me wrong. I don't want to quickly force structured
learning into the scikit. It is a long-term goal of mine to
have this in a nice accessible form. I just wanted to mention
it as an option.

Post by Gael Varoquaux
And I'd like API
and conventions to be stabilized, to be able to push out a 1.0 (I am
talking 6 months to 1 year horizon).

I couldn't agree more!

Cheers,
Andy

Gael Varoquaux

2012-01-19 06:59:08 UTC

Post by Andreas
I have no experience with GSoC and I will totally bow
to your wisdom there. My thinking was that single
algorithms are more "project-like" than doing polishing here and
there.

Yes. My point was that I'd like to see projects that help us close these
gaps, rather than open new ones.

Multi-task learning or semi-supervised learning are two projects that
would probably help us find the limits of our cross-validation scheme.
Structured output, as much as I personally am interested in it, seems to
be of the kind to open a gap, and it should probably be tackled by
someone with good experience with the scikit.

Vlad's idea of exploring trace norm would be interesting, as it would be
relevant for both multi-task and unsupervised learning.

G

Mathieu Blondel

2012-01-18 23:24:09 UTC

On Thu, Jan 19, 2012 at 7:44 AM, Gael Varoquaux

Post by Gael Varoquaux

Having this feature might get us a LOT of attention.
But this is really not a simple project.

Also the scikit has a bias towards dense data. It would be nice if
more estimators could work with sparse data too.

Mathieu

Peter Prettenhofer

2012-01-19 07:38:16 UTC

Post by Andreas

Post by Gael Varoquaux

Post by Lars Buitinck
[..]

Post by Andreas
- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.

+100 on this one...

For this, do we need to have our own SVM solver? This is a naive
question, I have never looked at structured SVM.
This seems to me as a fairly challenging project.

This is quite definitely a challenging project.
This should only be given to someone with a fair understanding
of the topic.
1) bindings for an existing structured SVM.
2) bindings for a smart solver with structured svm code by us
3) using SGD for solving. This means "having our own SVM solver"
-- but we already got one in SGDClassifier.
4) write a solver using cutting plane or bundle methods (not quite
sure if this is a good idea)

Alexandre Gramfort

2012-01-19 08:03:22 UTC

I've created the wiki page to organize what was suggested, so that
people can volunteer for mentoring.

https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012

Alex

Vincent Michel

2012-01-19 08:10:05 UTC

Hi list,

I'm more than +1 for online learning; it could be a killer feature of the
scikit!
I also like Andreas's first suggestion, about multinomial logistic
regression. I think there is interesting work to do at the junction with
Bayesian statistics and priors.

Vincent

Mathieu Blondel

2012-01-18 10:37:12 UTC

On Wed, Jan 18, 2012 at 3:12 PM, Bala Subrahmanyam Varanasi

Also, I'm taking Stanford's online courses (ML class and NLP class).
I believe this is the right time to discuss, because I can learn new things
before the start of GSoC and then work on challenging implementations in
scikit-learn.

It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize yourself with the code base and us to evaluate your potential.
If I remember correctly, it is also a requirement from the PSF.

Mathieu

Bala Subrahmanyam Varanasi

2012-01-18 10:47:15 UTC

Dear Mathieu,

Post by Mathieu Blondel
It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize with the code base, us to evaluate your potential and, if
I remember correctly, this is actually a requirement from the PSF.

I would like to contribute to scikit-learn. I'm going through the source
code. As I'm a newbie to ML, I'm trying to learn the concepts and going
through the documentation.

Up to now, I have made two commits regarding the documentation. I hope I can
do more in the coming days. Here are my commits.

https://github.com/Balu-Varanasi/scikit-learn/commit/36d0adb8c14b8105b9ba690073d0501955bce328

https://github.com/Balu-Varanasi/scikit-learn/commit/f28ee57cc637ed1d36de9a4a322aba6fd641d478

Thank you.

Post by Mathieu Blondel
Mathieu
------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Mathieu Blondel

2012-01-18 11:10:58 UTC

On Wed, Jan 18, 2012 at 7:47 PM, Bala Subrahmanyam Varanasi

Post by Bala Subrahmanyam Varanasi
Up to now, I have made two commits regarding the documentation. I hope I can
do more in the coming days. Here are my commits.
https://github.com/Balu-Varanasi/scikit-learn/commit/36d0adb8c14b8105b9ba690073d0501955bce328
https://github.com/Balu-Varanasi/scikit-learn/commit/f28ee57cc637ed1d36de9a4a322aba6fd641d478

Thank you! When you have small changes like this, you can send us a
pull request. We usually merge it quite fast.

Another good way to familiarize yourself with the code base is to try
to tackle open issues:
https://github.com/scikit-learn/scikit-learn/issues

Cheers,
Mathieu

Andreas

2012-01-18 12:23:29 UTC

You might start on this one:
https://github.com/scikit-learn/scikit-learn/issues/559
It should be fairly easy to do.

Bala Subrahmanyam Varanasi

2012-01-18 12:34:36 UTC

Hi :)

Post by Andreas
https://github.com/scikit-learn/scikit-learn/issues/559
It should be fairly easy to do.

Okay... Sure! I'll try to do this.

Jaidev Deshpande

2012-01-18 15:31:47 UTC

Hi Bala,

Well, the two of us do have a busy summer coming up, but a word of
caution - Google hasn't decided yet whether they will hold GSoC this
year. Please join the GSoC mailing list too.

We'll talk more tonight if you are free...

Cheers

Bala Subrahmanyam Varanasi

2012-01-18 15:39:55 UTC

Hi Jaidev,

Post by Jaidev Deshpande
Well, the two of us do have a busy summer coming up, but a word of
caution - Google hasn't decided yet whether they will hold GSoC this
year. Please join the GSoC mailing list too.

Hm... Let us hope for the best.

Post by Jaidev Deshpande
We'll talk more tonight if you are free...

Sure.

See you soon.

Gael Varoquaux

2012-01-18 22:23:47 UTC

Post by Mathieu Blondel
It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize with the code base, us to evaluate your potential and, if
I remember correctly, this is actually a requirement from the PSF.

I would like to stress this point. A GSOC is supposed to be an ambitious
and challenging project. This means that you should prove to yourself and
to your mentors that you can undertake it. Also, getting familiar with
the codebase of the host project, and the stakes of your specific project
is a great way of making sure that the application is well scoped and
realistic.

In my opinion, something that made Vlad's GSOC especially successful was
that he had already started on the project before writing the
application. As a result the application was clear and to the point, and
everybody was convinced that he could get the job done (well done Vlad!).

Keep in mind that there is a lot of competition for GSOC.

Gael

Vlad Niculae

2012-01-18 23:00:48 UTC

Post by Gael Varoquaux

I would like to stress this point. A GSOC is supposed to be an ambitious
and challenging project. This means that you should prove to yourself and
to your mentors that you can undertake it. Also, getting familiar with
the codebase of the host project, and the stakes of your specific project
is a great way of making sure that the application is well scoped and
realistic.
In my opinion, something that made Vlad's GSOC especially successful was
that he had already started on the project before writing the
application. As a result the application was clear and to the point, and
everybody was convinced that he could get the job done (well done Vlad!).

Thank you, Gael. I came here to congratulate Bala on choosing a great
team with whom to do a GSoC, based on my great experience last summer.

An online framework could be a good idea for a GSoC but I don't like it as a project for someone new entering the team because it's much too oriented towards interface engineering and not enough towards algorithms. Same goes for a project on wrapping some structured SVM solver or something like that.

How about trace (nuclear) norm minimization, low rank + sparse decompositions?
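
To give a flavour of what trace norm minimization involves, here is a small NumPy sketch of singular value thresholding, the proximal operator of the trace norm, which proximal-gradient style solvers apply at each iteration. The function name, the threshold tau=2.0 and the synthetic rank-3 data are assumptions chosen for the example, not a proposed design.

```python
import numpy as np


def singular_value_threshold(M, tau):
    """Proximal operator of tau * (trace norm): shrink every singular value by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # soft-threshold the spectrum
    return (U * s_shrunk) @ Vt


rng = np.random.RandomState(0)
low_rank = rng.randn(30, 3) @ rng.randn(3, 20)  # rank-3 signal
noisy = low_rank + 0.1 * rng.randn(30, 20)      # full rank once noise is added
denoised = singular_value_threshold(noisy, tau=2.0)

print(np.linalg.matrix_rank(noisy), np.linalg.matrix_rank(denoised))
```

Shrinking the spectrum zeroes out the small singular values contributed by the noise, so the result is low rank again; the same step is the building block for low rank + sparse decompositions, where it alternates with elementwise soft-thresholding.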

Vlad
