Discussion:
[Scikit-learn-general] GSoC 2012
Bala Subrahmanyam Varanasi
2012-01-18 06:12:44 UTC
Permalink
Dear all,

I would like to participate in Google Summer of Code this year. Please let
me know which ideas you would like to see implemented in scikit-learn
during GSoC 2012.

Also, I'm taking Stanford's online courses (the ML class and the NLP
class). I believe this is the right time to start the discussion, because
I can learn new things before GSoC starts and then work on challenging
implementations in scikit-learn.

Thank you.

Bala Subrahmanyam Varanasi
IV B.Tech, Information Technology
Vishnu Institute of Technology
e-mail: ***@gmail.com
contact number: +919985415959
Andreas
2012-01-18 10:11:32 UTC
Permalink
Hi Bala.
I'm not sure how this usually goes, but here is my current wish list.
We'd have to discuss whether any of that actually fits into the scikit,
though ;)

- Multilayer Perceptron and Multinomial Logistic regression
I have been working on that so maybe there is not enough
left to do there for a GSoC. Not sure, though

- Graph Cut Energy minimization
This is an inference technique so I'm not totally sure
if this should go into scikit-learn. Could also be a
candidate for scikit-image.
The main work would be to implement an efficient
max flow algorithm and then do graph constructions
for alpha expansion and alpha-beta swaps.

- Averaged gradient descent
I think this is on everybody's wish list. Not
sure how much work this will be.
I'm sure lots of people will have something to say about that ;)
See issue #543: https://github.com/scikit-learn/scikit-learn/issues/543

- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.
At the moment, the most commonly used implementation
is Joachims' SVMstruct.
This has licensing issues but talking to him might help.
Another option is implementing optimization via SGD
or, if you want to go crazy, cutting plane techniques
or bundle methods yourself.
Designing the interface is also non-trivial.
One would have to think about whether / how it
is possible to use structured SVMs just from Python,
without writing Cython functions.

- Low rank kernel approximations (Nystrom methods)
This is mainly interesting for SVMs.
The idea is to approximate the kernel matrix with
a low rank factorization and use this to construct
a linear SVM problem.
This is related to the current kernel approximation
module but takes a somewhat different approach.
This method makes large-scale SVMs fast / possible.

- Kernel Perceptron
There is a (I think) pure Python implementation
by Mathieu that could be Cythonized.
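
To make the last item concrete, here is a minimal pure-Python sketch of a
kernel perceptron. It is not Mathieu's implementation; the function names,
the RBF kernel and the XOR toy data are all assumptions made up for this
illustration. It only shows the dual update (count mistakes per training
sample) that a Cython version would speed up.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def fit_kernel_perceptron(X, y, kernel=rbf_kernel, n_epochs=10):
    """Return per-sample mistake counts (dual coefficients).

    Labels are assumed to be -1 or +1.
    """
    n = len(X)
    # Precompute the Gram matrix once; fine for small toy problems.
    gram = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    for _ in range(n_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * gram[j][i] for j in range(n))
            if y[i] * score <= 0:  # misclassified (or on the boundary)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: stop early
            break
    return alpha


def predict_kernel_perceptron(X_train, y_train, alpha, x, kernel=rbf_kernel):
    """Sign of the kernel expansion over samples with non-zero alpha."""
    score = sum(a * t * kernel(xi, x)
                for a, t, xi in zip(alpha, y_train, X_train) if a)
    return 1 if score > 0 else -1


# XOR is not linearly separable, so a plain perceptron cannot fit it.
X_xor = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y_xor = [-1, -1, 1, 1]
alpha_xor = fit_kernel_perceptron(X_xor, y_xor)
```

The inner sum over all training samples is the part that dominates the
run time, which is presumably why Cythonizing it would pay off.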


That's it for the moment, I think.
I'd be happy to mentor any of the above projects
if the others agree that they are sensible.

Maybe we should update the wiki for the next GSoC?

Cheers,
Andy
Lars Buitinck
2012-01-18 10:28:52 UTC
Permalink
Post by Andreas
- Structured SVM / CRF learning
    This is a big one. Not sure what other people think of it.
    I think having a structured SVM would be great.
+100 on this one...
Post by Andreas
    Designing the interface is also non-trivial.
Indeed. I suspect different APIs would be needed for different cases
(linear-chain case, tree case, general case). Having just one of these
would be great, but I agree this might be a *big* project.

One more thing on my wishlist is semisupervised meta-algorithms, like
self-training, co-training, co-boosting. These should not be
incredibly hard to implement, but they're still far from trivial.
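
As a rough illustration of what such a meta-algorithm looks like, here is
a minimal self-training sketch in pure Python. Everything in it is an
assumption for the sake of the example: the nearest-centroid base
classifier, the margin-based confidence rule, the names and the toy data.
A real version would wrap an arbitrary estimator instead.

```python
def fit_centroids(X, y):
    """Nearest-centroid base classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_with_margin(centroids, x):
    """Return (label, margin): margin is the distance gap between the
    closest and the second-closest centroid."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, mu)) ** 0.5, c)
        for c, mu in centroids.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=10):
    """Repeatedly move confidently predicted points into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        confident, rest = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:  # nothing new to pseudo-label: stop
            break
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for x, _ in rest]
    return fit_centroids(X_lab, y_lab), pool


X_lab = [(0.0, 0.0), (4.0, 4.0)]
y_lab = [0, 1]
X_unlab = [(0.5, 0.0), (3.5, 4.0), (1.0, 1.0), (3.0, 3.0), (2.0, 2.1)]
centroids, still_unlabeled = self_train(X_lab, y_lab, X_unlab)
```

The point near the class boundary never clears the margin threshold, so
it stays in the unlabeled pool instead of being pseudo-labeled.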
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Gael Varoquaux
2012-01-18 22:26:23 UTC
Permalink
Post by Lars Buitinck
Post by Andreas
- Structured SVM / CRF learning
    This is a big one. Not sure what other people think of it.
    I think having a structured SVM would be great.
+100 on this one...
For this, do we need to have our own SVM solver? This is a naive
question, I have never looked at structured SVM.

This seems to me to be a fairly challenging project.
Post by Lars Buitinck
One more thing one my wishlist is semisupervised meta-algorithms, like
self-training, co-training, co-boosting. These should not be
incredibly hard to implement, but they're still far from trivial.
Yes, I think that having a good semi-supervised codebase (including model
selection) would be an interesting project. What I like about it is that
its difficulty seems to increase gradually.

Of course, other suggestions are welcome. Maybe we should start a wiki
page.

Gael
Andreas
2012-01-18 22:37:15 UTC
Permalink
Post by Gael Varoquaux
Post by Lars Buitinck
Post by Andreas
- Structured SVM / CRF learning
This is a big one. Not sure what other people think of it.
I think having a structured SVM would be great.
+100 on this one...
For this, do we need to have our own SVM solver? This is a naive
question, I have never looked at structured SVM.
This seems to me as a fairly challenging project.
This is quite definitely a challenging project.
This should only be given to someone with a fair understanding
of the topic.

There are several options, as I tried to say in my initial post:
1) bindings for an existing structured SVM.
2) bindings for a smart solver, with the structured SVM code written by us
3) using SGD for solving. This means "having our own SVM solver"
-- but we already have one in SGDClassifier.
4) write a solver using cutting plane or bundle methods (not quite
sure if this is a good idea)
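
To give an idea of what option 3 involves, here is a minimal sketch of
stochastic subgradient descent on the margin-rescaled structured hinge
loss for a tiny sequence-labeling problem. This is not SGDClassifier and
not a proposed interface; the feature map, the Hamming loss, the
brute-force inference over all labelings and every name below are
assumptions chosen to keep the example short. A real implementation would
replace the brute-force search with something like Viterbi.

```python
import itertools

LABELS = ("A", "B")


def joint_features(x, y):
    """Sparse joint feature map psi(x, y): emission and transition counts."""
    feats = {}
    for token, label in zip(x, y):
        key = ("emit", token, label)
        feats[key] = feats.get(key, 0.0) + 1.0
    for prev, cur in zip(y, y[1:]):
        key = ("trans", prev, cur)
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats


def score(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def hamming(y_true, y_pred):
    return sum(a != b for a, b in zip(y_true, y_pred))


def argmax_labeling(w, x, y_true=None):
    """Brute-force inference; loss-augmented when y_true is given."""
    best, best_val = None, float("-inf")
    for y in itertools.product(LABELS, repeat=len(x)):
        val = score(w, joint_features(x, y))
        if y_true is not None:
            val += hamming(y_true, y)
        if val > best_val:
            best, best_val = y, val
    return best


def fit_structured_sgd(data, n_epochs=20, lr=0.1, reg=0.01):
    w = {}
    for _ in range(n_epochs):
        for x, y_true in data:
            # Most violating labeling under the current weights.
            y_hat = argmax_labeling(w, x, y_true)
            # Gradient step on the L2 regularizer (weight shrinkage).
            for k in w:
                w[k] *= 1.0 - lr * reg
            # Subgradient step on the hinge term, only if it is active.
            if y_hat != y_true:
                for k, v in joint_features(x, y_true).items():
                    w[k] = w.get(k, 0.0) + lr * v
                for k, v in joint_features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - lr * v
    return w


train = [
    (("a", "b", "a"), ("A", "B", "A")),
    (("b", "b"), ("B", "B")),
    (("a", "a"), ("A", "A")),
]
w = fit_structured_sgd(train)
```

The only problem-specific pieces are joint_features, the loss and the
argmax, which hints at why designing a general interface is the hard part.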
Gael Varoquaux
2012-01-18 22:44:46 UTC
Permalink
Having this feature might get us a LOT of attention.
But this is really not a simple project.
Before trying to jump to the super fancy features, I'd rather have a
polished and versatile version of the scikit. There are many things that
I find we haven't explored properly. For instance, these are my personal
pain points:

* we don't have an online learning framework.

* Our model selection framework is still weak

- see
https://github.com/scikit-learn/scikit-learn/pull/443#issuecomment-3231270

- also, the difficulty of doing nested cross-validation with a specific
cross-validation strategy,

* we are light on the semi-supervised API

* our parameter naming is not uniform enough across models.
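
To illustrate the nested cross-validation point, here is a minimal
pure-Python sketch in which both the outer and the inner loop take an
arbitrary splitting strategy as a callable. It does not use the
scikit-learn API; the contiguous k-fold splitter, the 1-D k-nearest-
neighbours model and all names are assumptions invented for this example.

```python
from functools import partial


def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest 1-D training points (labels 0/1)."""
    order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
    nearest = order[:k]
    votes = sum(y_train[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(X_train, y_train, X_test, y_test, k):
    hits = sum(knn_predict(X_train, y_train, x, k) == t
               for x, t in zip(X_test, y_test))
    return hits / len(X_test)


def nested_cv(X, y, param_grid, outer_split, inner_split):
    """Return one (chosen parameter, held-out score) pair per outer fold."""
    results = []
    for train, test in outer_split(len(X)):
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]

        def inner_score(k):
            # Model selection sees only the outer-training part.
            scores = []
            for in_tr, in_val in inner_split(len(X_tr)):
                scores.append(accuracy(
                    [X_tr[i] for i in in_tr], [y_tr[i] for i in in_tr],
                    [X_tr[i] for i in in_val], [y_tr[i] for i in in_val],
                    k))
            return sum(scores) / len(scores)

        best_k = max(param_grid, key=inner_score)
        results.append((best_k, accuracy(X_tr, y_tr, X_te, y_te, best_k)))
    return results


X_toy = [0.0, 1.0, 0.1, 1.1, 0.2, 1.2, 0.3, 1.3, 0.4, 1.4, 0.5, 1.5]
y_toy = [0, 1] * 6
results = nested_cv(X_toy, y_toy, param_grid=(1, 3),
                    outer_split=partial(kfold_indices, n_folds=3),
                    inner_split=partial(kfold_indices, n_folds=2))
```

Passing the splitters as callables is one possible way to let the inner
and outer loops use different strategies without changing nested_cv.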

All these are points that I'd like to see addressed, because I fear that
they could all induce a change in API or conventions. And I'd like API
and conventions to be stabilized, to be able to push out a 1.0 (I am
talking 6 months to 1 year horizon).

G

PS: On the downside, I have been having crazy days these last weeks. I
spend the whole day at work looking at other people's problems, and when I
get home in the evening, instead of hacking on open source, I answer email
and try to get done the work that I should have done during the day...
Sorry lads, just mumbling about my own lack of productivity.
Andreas
2012-01-18 23:08:41 UTC
Permalink
Post by Gael Varoquaux
Having this feature might get us a LOT of attention.
But this is really not a simple project.
Before trying to jump to the super fancy features, I'd rather have a
polished and versatile version of the scikit.
I totally agree - I tried to do as much polishing as I can the
last couple of weeks.
There is still a lot to do. I opened some issues today and
yesterday to track stuff that seemed important to me.

I have no experience with GSoC and I will totally bow
to your wisdom there. My thinking was that single
algorithms are more "project-like" than doing polishing here and
there.

There is important refactoring being done by Lars and Mathieu
at the moment which is really great. But I wouldn't give that
to someone as a project.
Post by Gael Varoquaux
They are many things that I
find that we haven't explored right. For instance these are my personal
* we don't have an online learning framework.
* Our model selection framework is still weak
- see
https://github.com/scikit-learn/scikit-learn/pull/443#issuecomment-3231270
- also, it the difficulty to do nested cross-validation with a specific
cross-validation strategy,
* we are light on the semi-supervised API
* our parameter naming is not uniform-enough across models.
All these are points that I'd like to see addressed, because I fear that
they could all induce a change in API or conventions.
I noticed some cross-validation issues but not all that you mentioned.
We should maybe plan a bit more on that.

About online and semi-supervised learning:
I feel these are two specific sub-fields that many people are interested
in but that are not central to machine learning.
I am not sure I would want the scikit api to focus on these.
If you go to a machine learning conference, I'm pretty sure
there will be more people working on structured learning
than on semi-supervised and online learning.

Don't get me wrong. I don't want to quickly force structured
learning into the scikit. It is a long-term goal of mine to
have this in a nice accessible form. I just wanted to mention
it as an option.
Post by Gael Varoquaux
And I'd like API
and conventions to be stabilized, to be able to push out a 1.0 (I am
talking 6 months to 1 year horizon).
I couldn't agree more!

Cheers,
Andy
Gael Varoquaux
2012-01-19 06:59:08 UTC
Permalink
Post by Andreas
I have no experience with GSoC and I will totally bow
to you wisdom there. My thinking was that single
algorithms are more "project-like" than doing polishing here and
there.
Yes. My point was that I'd like to see projects that help us close these
gaps, rather than open new ones.

Multi-task learning or semi-supervised learning are two projects that
would probably help us find the limits of our cross-validation scheme.
Structured output, as much as I am personally interested in it, seems to
be the kind of project that opens a gap, and it should probably be tackled
by someone with good experience with the scikit.

Vlad's idea of exploring trace norm would be interesting, as it would be
relevant for both multi-task and unsupervised learning.

G
Mathieu Blondel
2012-01-18 23:24:09 UTC
Permalink
On Thu, Jan 19, 2012 at 7:44 AM, Gael Varoquaux
Post by Gael Varoquaux
Having this feature might get us a LOT of attention.
But this is really not a simple project.
Before trying to jump to the super fancy features, I'd rather have a
polished and versatile version of the scikit. They are many things that I
find that we haven't explored right. For instance these are my personal
 * we don't have an online learning framework.
 * Our model selection framework is still weak
   - see
     https://github.com/scikit-learn/scikit-learn/pull/443#issuecomment-3231270
   - also, it the difficulty to do nested cross-validation with a specific
      cross-validation strategy,
 * we are light on the semi-supervised API
 * our parameter naming is not uniform-enough across models.
Also the scikit has a bias towards dense data. It would be nice if
more estimators could work with sparse data too.

Mathieu
Peter Prettenhofer
2012-01-19 07:38:16 UTC
Permalink
Post by Andreas
Post by Gael Varoquaux
Post by Lars Buitinck
[..]
Post by Andreas
- Structured SVM / CRF learning
     This is a big one. Not sure what other people think of it.
     I think having a structured SVM would be great.
+100 on this one...
For this, do we need to have our own SVM solver? This is a naive
question, I have never looked at structured SVM.
This seems to me as a fairly challenging project.
This is quite definitely a challenging project.
This should only be given to someone with a fair understanding
of the topic.
1) bindings for an existing structured SVM.
2) bindings for a smart solver with structured svm code by us
3) using SGD for solving. This means "having our own SVM solver"
-- but we already got one in SGDClassifier.
4) write a solver using cutting plane or bundle methods (not quite
sure if this is a good idea)
 
Alexandre Gramfort
2012-01-19 08:03:22 UTC
Permalink
I've created the wiki page to organize what was suggested, so that
people can volunteer for mentoring.

https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012

Alex

Vincent Michel
2012-01-19 08:10:05 UTC
Permalink
Hi list,

I'm more than +1 for online learning, it could be a killer feature of the
scikit!
I also like the first suggestion of Andreas, about Multinomial Logistic
regression. I think there is interesting work to do at the junction with
Bayesian statistics and priors.


Vincent
Mathieu Blondel
2012-01-18 10:37:12 UTC
Permalink
On Wed, Jan 18, 2012 at 3:12 PM, Bala Subrahmanyam Varanasi
Also... I'm attending to Stanford's Online courses - ML class and NLP class.
I believe this is the right time to discuss. Because, I can learn new things
before the start of GSoC and can work on challenging implementations in
scikit-learn.
It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize yourself with the code base and allow us to evaluate your
potential. Also, if I remember correctly, this is actually a requirement
from the PSF.

Mathieu
Bala Subrahmanyam Varanasi
2012-01-18 10:47:15 UTC
Permalink
Dear Mathieu,

Post by Mathieu Blondel
It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize with the code base, us to evaluate your potential and, if
I remember correctly, this is actually a requirement from the PSF.
I would like to contribute to scikit-learn, and I'm going through the
source code. As I'm a newbie to ML, I'm also learning the concepts and
reading the documentation.

Up to now, I have made two commits regarding the documentation. I hope I
can do more in the coming days. Here are my commits.

https://github.com/Balu-Varanasi/scikit-learn/commit/36d0adb8c14b8105b9ba690073d0501955bce328

https://github.com/Balu-Varanasi/scikit-learn/commit/f28ee57cc637ed1d36de9a4a322aba6fd641d478

Thank you.
------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Mathieu Blondel
2012-01-18 11:10:58 UTC
Permalink
On Wed, Jan 18, 2012 at 7:47 PM, Bala Subrahmanyam Varanasi
Post by Bala Subrahmanyam Varanasi
Upto now, I pulled two commits regarding the documentation. I hope I could
do more in the coming days. Here are my commits.
https://github.com/Balu-Varanasi/scikit-learn/commit/36d0adb8c14b8105b9ba690073d0501955bce328
https://github.com/Balu-Varanasi/scikit-learn/commit/f28ee57cc637ed1d36de9a4a322aba6fd641d478
Thank you! When you have small changes like this, you can send us a
pull request. We usually merge them quite fast.

Another good way to familiarize yourself with the code base is to try
to tackle open issues:
https://github.com/scikit-learn/scikit-learn/issues

Cheers,
Mathieu
Andreas
2012-01-18 12:23:29 UTC
Permalink
You might start on this one:
https://github.com/scikit-learn/scikit-learn/issues/559
It should be fairly easy to do.
Bala Subrahmanyam Varanasi
2012-01-18 12:34:36 UTC
Permalink
Hi :)
Post by Andreas
https://github.com/scikit-learn/scikit-learn/issues/559
It should be fairly easy to do.
Okay... Sure ! I'll try to do this.
Jaidev Deshpande
2012-01-18 15:31:47 UTC
Permalink
Hi Bala,

Well, the two of us do have a busy summer coming up, but a word of
caution - Google hasn't decided yet whether they will hold GSoC this
year. Please join the GSoC mailing list too.

We'll talk more tonight if you are free...

Cheers
Bala Subrahmanyam Varanasi
2012-01-18 15:39:55 UTC
Permalink
Hi Jaidev,

Post by Jaidev Deshpande
Well, the two of us do have a busy summer coming up, but a word of
caution - Google hasn't decided yet whether they will hold GSoC this
year. Please join the GSoC mailing list too.
Hm... Let us hope for the best.
Post by Jaidev Deshpande
We'll talk more tonight if you are free...
Sure.

See you soon.
Gael Varoquaux
2012-01-18 22:23:47 UTC
Permalink
Post by Mathieu Blondel
It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize with the code base, us to evaluate your potential and, if
I remember correctly, this is actually a requirement from the PSF.
I would like to stress this point. A GSOC is supposed to be an ambitious
and challenging project. This means that you should prove to yourself and
to your mentors that you can undertake it. Also, getting familiar with
the codebase of the host project, and the stakes of your specific project
is a great way of making sure that the application is well scoped and
realistic.

In my opinion, something that made Vlad's GSOC especially successful was
that he had already started on the project before writing the
application. As a result the application was clear and to the point, and
everybody was convinced that he could get the job done (well done Vlad!).

Keep in mind that there is a lot of competition for GSOC.

Gael
Vlad Niculae
2012-01-18 23:00:48 UTC
Permalink
Post by Gael Varoquaux
Post by Mathieu Blondel
It would be nice if you could make a few contributions to scikit-learn
before the application process starts. This will allow you to
familiarize with the code base, us to evaluate your potential and, if
I remember correctly, this is actually a requirement from the PSF.
I would like to stress this point. A GSOC is supposed to be an ambitious
and challenging project. This means that you should prove to yourself and
to your mentors that you can undertake it. Also, getting familiar with
the codebase of the host project, and the stakes of your specific project
is a great way of making sure that the application is well scoped and
realistic.
In my opinion, something that made Vlad's GSOC especially succesful, was
that he had already started on the project before writing the
application. As a result the application was clear and to the point, and
everybody was convinced that he could get the job done (well done Vlad!).
Thank you Gael. I came here to congratulate Bala on choosing a great
team with which to do a GSoC, based on my great experience last summer.

An online framework could be a good idea for a GSoC, but I don't like it
as a project for someone newly joining the team, because it is oriented
much more towards interface engineering than towards algorithms. The same
goes for a project on wrapping some structured SVM solver or something
like that.

How about trace (nuclear) norm minimization, low rank + sparse decompositions?

Vlad