Discussion:
[Scikit-learn-general] [GSoC2015 metric learning]
Michael Eickenberg
2015-05-04 11:02:36 UTC
Permalink
Dear Artem,

congratulations on the acceptance of your GSoC proposal! I am certain there
will be a very interesting summer ahead of us. Kyle and I are excited to be
mentors and will do our best to provide all the guidance necessary for your
project to succeed. It is very rich and will be a great addition to the
codebase.

Your blog post <http://barmaley-exe.blogspot.ru/2015/05/introduction.html>
on the gist of the methods is written in a very understandable way and
gives a good overview of the topics you are going to address in depth. It
shows that you have the right intuitions, and are ready to delve into the
intricacies of the methods [1]. Take advantage of the next weeks to do so!
Let's make sure we hit the ground running at the end of this warm-up phase.

As for your next plans, sketching the algorithms in very high-level
pseudo-code is of course an excellent idea and could be the next blog post.
After this, you can zoom in on the details of how each pseudo-code step can
be implemented. If you get the level of detail right, I recommend the
Python language to describe your algorithms ;) -- what I mean is that
getting a minimal version of the algorithm to work, just as a function, not
a sklearn estimator, is a valuable baseline to have, and it usually deepens
the understanding as well.

As for the API questions, it is of course quite essential to remain
conscious at all times of the issues that have been identified in prior
discussion and to think of ways to add a metric learning module without
succumbing to excessive feature creep. My hunch is that given some working
minimal versions of the algorithms, we can perhaps crystallize out what is
absolutely necessary in terms of additions, so I would prefer that order of
priorities. There is also some work to be done in identifying other parts
of scikit-learn that already deal with (dis-)similarity-type data (cf. e.g.
the kernels defined in the PR for Gaussian processes) and see how these can
be made to work in a consistent way.

A crucial aspect that we need to figure out is "what is a good outcome?":
Of course we would like to have some PRs merged at the end of summer, yes.
But what makes a concrete implementation good? Speed (with respect to
what)? Readability? Maintainability (yes please!)? Elegance (what does that
even mean?)?

It may be helpful if you could devise a more fine-grained timeline
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module#timeline>
for the community bonding period than what is currently stated on the wiki
page. How about blogging your progress in understanding? Writing things
down for others to understand is a very good way of identifying any doubts
you may have on particular aspects. A mini blog-entry at the end of each
week, simply recounting what has been done and what has been discussed,
will also give an effective overview of ongoing topics.

In the meantime, please don't hesitate to bombard us, the other mentors,
and the list with any questions you may have.

Michael

[1] http://i.imgur.com/at3vIHh.png
Yao
2015-05-05 10:59:52 UTC
Permalink
Hi all,
Congratulations on the acceptance of your GSoC proposal!
I'm a Master's student at Peking University, and I have a great interest in scikit-learn because I use it frequently.
I would also like to contribute to the community, or try to fix some bugs. Could you count me in?


By the way, I ran into something confusing. The documentation says: "Unlike most other scores, R^2 score may be negative (it need not actually be the square of a quantity R)."
In my understanding, R^2 can be thought of as the square of the correlation coefficient, so it can't be negative. Is there something wrong? Can someone
explain?


Thanks,
Yao.
Andreas Mueller
2015-05-05 18:34:56 UTC
Permalink
In extreme cases, R^2 can be negative.
If you'd like to contribute, have a look at the guidelines here:
http://scikit-learn.org/dev/developers/index.html
Sebastian Raschka
2015-05-05 18:54:33 UTC
Permalink
Post by Yao
In my understanding, R^2 can be thought of as the square of the correlation coefficient, so it can't be negative.
I agree that this can be confusing. However, R^2 is not always calculated as R*R, e.g., see http://biomet.oxfordjournals.org/content/78/3/691.full.pdf+html
There is also something called "adjusted R^2" which can result in negative values. But as far as I know those are typically set to 0 in most applications.
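
For reference, the usual adjusted form is adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1) (n samples, p predictors), which can drop below zero even when the plain R^2 is not.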

Best,
Sebastian
Chris Holdgraf
2015-05-05 18:55:25 UTC
Permalink
To my knowledge, R2 is basically a comparison of your model's fit to the
model obtained by simply drawing a flat line through the mean of your output
variable. So if your fitted model does worse than that "mean fit",
then R2 will be negative. E.g., check out:

http://randomanalyses.blogspot.com/2011/11/coefficient-of-determination-r2.html
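
A quick illustration with sklearn.metrics.r2_score (just a sketch):

    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([1.0, 2.0, 3.0, 4.0])

    # predicting the mean of y_true gives R2 = 0 by definition ...
    print(r2_score(y_true, np.full_like(y_true, y_true.mean())))   # 0.0

    # ... and a model that does worse than the mean gives a negative R2
    print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))        # -3.0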
Michael Eickenberg
2015-05-05 19:17:50 UTC
Permalink
Under Gaussian (and most other) noise, with a model that has no real
explanatory power, the distribution of the r^2 score has a negative mode.
As a consequence, sometimes an r^2 score of 0 already implies significant
predictive capacity... happy publishing!
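
A quick way to see this (a minimal simulation sketch with purely uninformative features):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)   # features carry no information about y
    y = rng.randn(200)

    scores = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=10)
    print(scores.mean())    # typically a bit below zero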
Yao
2015-05-06 13:29:13 UTC
Permalink
Thanks for the information, I understand R^2 now.
Artem
2015-05-28 10:25:09 UTC
Permalink
Hello everyone,

I finally published my API blogpost
<http://barmaley-exe.blogspot.ru/2015/05/api-design.html>. It mostly
summarizes the ideas discussed during the project proposal stage.

For now I'd like to discuss the NCA implementation. I already wrote
<http://barmaley-exe.blogspot.ru/2015/05/nca.html> a small prototype that
works, but there are issues I want to talk about:
1. Efficiency. Should I aim for a Cython implementation, or is having
everything vectorized fine? In general, a rule of thumb for when to use
Cython would be nice.
2. Memory issues. If we cache the outer products, there might not be enough
memory. What's your opinion on my idea (see the blogpost) of a cache of
fixed size? There's a cache_size parameter for SVC, but I haven't looked into
that yet. Is it of any help for this problem?
3. Optimization. At the moment I use plain batch gradient ascent with a
fixed learning rate.

- I think it'd be better to use something more advanced like Momentum /
AdaGrad (line search seems too impractical)
- Another point to consider is stochastic (minibatch?) gradient ascent.
- What do you think about "cell-wise" optimization (fixing all cells of
the matrix but one)?
- Or maybe it's enough to use scipy's conjugate gradient optimizer? (A rough
sketch of what that could look like is below.)
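
For concreteness, here is a rough sketch of a fully vectorized first-order
oracle for the standard NCA objective (Goldberger et al., 2004), written so
it can be handed directly to scipy.optimize.minimize -- only an illustration
of the idea, not the prototype from the blog post:

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.datasets import make_classification

    def nca_oracle(A_flat, X, y, n_components):
        """Return the negative NCA objective -sum_i p_i and its gradient
        w.r.t. the flattened transform A, for use with scipy.optimize."""
        n_samples, n_features = X.shape
        A = A_flat.reshape(n_components, n_features)

        AX = X @ A.T                                  # (n, c) transformed points
        diff = AX[:, None, :] - AX[None, :, :]        # (n, n, c)
        dist = (diff ** 2).sum(axis=-1)               # squared distances
        np.fill_diagonal(dist, np.inf)                # enforces p_ii = 0

        logits = -dist
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)             # p_ij
        same = (y[:, None] == y[None, :])
        np.fill_diagonal(same, False)
        p_i = (p * same).sum(axis=1)                  # prob. of correct classification

        # d(sum_i p_i)/dA = 2 A sum_ij w_ij (x_i - x_j)(x_i - x_j)^T,
        # with w_ij = p_ij * (p_i - [y_i == y_j])
        W = p * (p_i[:, None] - same)
        Xdiff = X[:, None, :] - X[None, :, :]         # (n, n, d): memory-hungry for large n
        G = np.einsum('ij,ijk,ijl->kl', W, Xdiff, Xdiff)
        grad = 2.0 * A @ G
        return -p_i.sum(), -grad.ravel()

    # toy usage
    X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                               random_state=0)
    n_components = 2
    A0 = np.random.RandomState(0).randn(n_components, X.shape[1])
    res = minimize(nca_oracle, A0.ravel(), args=(X, y, n_components),
                   jac=True, method='L-BFGS-B')
    A_learned = res.x.reshape(n_components, X.shape[1])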


Andreas Mueller
2015-05-28 16:09:24 UTC
Permalink
Hi Artem.
Thanks for sharing the post.
For 1.: Always start without Cython. Then profile to identify the
bottlenecks. Then we can talk about whether the bottleneck can be helped
with Cython.
For this kind of algorithm, I'd expect the bottleneck to be in some
matrix multiplication (though I don't know the actual details), so
Cython is unlikely to help.
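
For instance, a minimal profiling sketch (with a stand-in for the real
objective, just to show the workflow):

    import cProfile
    import pstats
    import numpy as np

    X = np.random.RandomState(0).randn(500, 30)

    def objective(X):
        # stand-in for the NCA objective/gradient computation
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-D).sum()

    cProfile.run('objective(X)', 'nca.prof')
    pstats.Stats('nca.prof').sort_stats('cumulative').print_stats(10)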

For the rest I have to look more into it.

Cheers,
Andy
Andreas Mueller
2015-05-28 18:33:56 UTC
Permalink
I talked to my office mate Brian McFee for some feedback.
Apparently there are some memory and computation saving methods, but
unfortunately none of them are well published.
I think you should try to create some benchmarks for your current
implementation, using classification with 1 nearest neighbor; maybe
start with iris, digits and make_classification.

You should probably start by comparing precomputing all outer products
(which will not scale) with recomputing them at every iteration.
And when recomputing at every iteration, try to compare SGD and L-BFGS
(from scipy).
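
A skeleton for such a benchmark could look like this (only a sketch;
nca_fit is a hypothetical placeholder for whatever function the prototype
exposes, and the identity matrix below just keeps the snippet runnable):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    print("plain Euclidean 1-NN:", knn.score(X_te, y_te))

    # L would come from the NCA fit on the training split, e.g.
    # L = nca_fit(X_tr, y_tr)      # hypothetical helper
    L = np.eye(X.shape[1])         # identity placeholder
    knn_nca = KNeighborsClassifier(n_neighbors=1).fit(X_tr @ L.T, y_tr)
    print("1-NN in learned space:", knn_nca.score(X_te @ L.T, y_te))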

Brian suggested trying a rather radical optimization, where you only
consider p_ik for pairs that are close in the original space. It is
unclear how bad an approximation that would be.
The problem with tracking the large p_ik is that computing the p_ik is
already pretty expensive (n ** 2).
Maybe doing mini-batch SGD over the i with AdaGrad would work.

Btw, there are some formatting issues in your blogpost code.
Michael Eickenberg
2015-05-28 21:11:36 UTC
Permalink
Hi Artem,

thanks for your blog posts and summary email. Andy has already given the
right direction to work along, but here are my $\varepsilon$:

1. What Andy says: no cython unless absolutely necessary, and if so, at the
end of the development process. It is great that you already have a working
version. As you indicate, there are some obvious vectorizations that can be
done e.g. as in
https://github.com/RolT/NCA-python/blob/master/nca_cost.py#L54 .

2. Once these changes are made you can start benching and get a feeling for
the tradeoffs between memory and speed that you have brought up.

3. For the extreme of recomputing at every iteration, you already have a
baseline to compare other optimization algorithms to, which is Roland
Thiollière's code from which I cite the line above. It uses an optimizer
from scipy, which you can also exchange for other ones that have been
mentioned by you and Andy.
Post by Artem
What do you think about "cell-wise" optimization (fixing all cells of
the matrix but one)?


Do you mean like coordinate descent/ascent as in e.g. the Lasso estimator?
You can try writing the optimization problem, but the implementation will
probably be useless without cython. So I'd keep that for the end if there
is time (and if I understand the idea correctly, because maybe there is a
vectorizable axis that I overlooked that sufficiently compensates for
looping over the others).
Post by Andreas Mueller
Brian suggested trying a rather radical optimization, where you only
consider p_ik for pairs that are close in the original space. It is unclear
how bad an approximation that would be.

This goes in the direction of the idea behind LMNN, so should probably be
considered at the time when LMNN is attacked.

As for example data, I really like the one you show in your blog post. It
would be great to have a few more low-dimensional examples like that. Maybe
you can find something useful in the already existing ones, or think up
some extra ones that are more or less difficult to resolve for your methods.

Code-wise, I would attack the problem as a function first. Write a function
that takes X and y (plus maybe some options) and gives back L. You can put
a skeleton of a sklearn estimator around it by calling this function from
fit (see the sketch below).
Please keep your code either in a sklearn WIP PR or a public gist, so it
can be reviewed. Writing benchmarks can be framed as writing examples, i.e.
plot_* functions (maybe Andy or Olivier have a comment on how benchmarks
have been handled in the past?).
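
For illustration, that skeleton could be as simple as this (a sketch only;
nca_fit is a stand-in for the actual solver):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    def nca_fit(X, y, n_components=None, max_iter=100):
        """Stand-alone solver (hypothetical): returns the learned transform L."""
        n_features = X.shape[1]
        n_components = n_components or n_features
        L = np.eye(n_components, n_features)
        # ... gradient-based optimization of the NCA objective goes here ...
        return L

    class NCA(BaseEstimator, TransformerMixin):
        """Thin estimator wrapper that delegates to the stand-alone function."""
        def __init__(self, n_components=None, max_iter=100):
            self.n_components = n_components
            self.max_iter = max_iter

        def fit(self, X, y):
            self.components_ = nca_fit(np.asarray(X), np.asarray(y),
                                       self.n_components, self.max_iter)
            return self

        def transform(self, X):
            return np.asarray(X) @ self.components_.T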

Michael
Andreas Mueller
2015-05-28 21:20:50 UTC
Permalink
Post by Michael Eickenberg
Writing benchmarks can be framed as writing examples, i.e. plot_* functions
(maybe Andy or Olivier have a comment on how benchmarks have been handled in
the past?).
There is a "benchmarks" folder, which is in horrible shape.
Basically there are three ways to do it: examples (with or without plot
depending on the runtime), a script in the benchmark folder, or a gist.
Often we just use a gist and the PR person posts the output. Not that
great for reproducibility, though.

Aurélien Bellet
2015-05-28 22:05:02 UTC
Permalink
Hi everyone,

A few additional things to consider for scaling up NCA to large datasets:

- Take a look at the t-SNE implementations (a visualization/dimensionality
reduction technique very similar to NCA); I think they have a few speed-up
tricks that you could potentially re-use:
http://lvdmaaten.github.io/tsne/

- Like you said, SGD can help reduce the computational cost - you could
also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc.

- Similarly to what was suggested in previous replies, a general idea is
to only consider a neighborhood around each point (either fixed in
advance, or updated every now and then during the course of
optimization): since the probabilities decrease very fast with distance,
farther points can safely be ignored in the computation (see the sketch
below). This is explored for instance in:
http://dl.acm.org/citation.cfm?id=2142432

- Another related idea is to construct class representatives (for
instance using k-means), and to model the distribution only wrt these
points instead of the entire dataset. This is especially useful if some
classes are very large. An extreme version of this is to reframe NCA for
a Nearest Class Mean classifier, where each class is only modeled by its
center:
https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
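
To make the neighborhood idea concrete, a very rough sketch (a hypothetical
helper, not taken from any existing implementation) of restricting the NCA
softmax to the k nearest points in the original space:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def truncated_neighbors_softmax(X, L, k=20):
        """For each point, compute p_ij only over its k nearest neighbors
        in the ORIGINAL space; the index could be refreshed every few
        iterations of the optimizer."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        # drop the first column: each point is its own nearest neighbor
        idx = nn.kneighbors(X, return_distance=False)[:, 1:]    # (n, k)
        LX = X @ L.T
        d = ((LX[:, None, :] - LX[idx]) ** 2).sum(-1)           # (n, k)
        p = np.exp(-(d - d.min(axis=1, keepdims=True)))
        p /= p.sum(axis=1, keepdims=True)
        return idx, p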

Hope this helps.

Aurelien
Michael Eickenberg
2015-05-29 07:51:11 UTC
Permalink
Hi Aurélien,

thanks for these very good pointers!
(Now we also know who else to bug periodically for opinions ;))

Michael


Artem
2015-05-29 16:24:49 UTC
Permalink
So, I created a WIP PR dedicated to NCA:
https://github.com/scikit-learn/scikit-learn/pull/4789

As suggested by Michael, I refactored "the meat" into a function. I also
rewrote it as a first-order oracle, so I can (and do) use scipy's
optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
sometimes stop at some weird point (a local minimum / saddle point?),
whereas gradient descent seems to always converge. I didn't test either
of them extensively, though.

I also fully vectorized the function and gradient calculations; no loops
involved.

So, what's the consensus on benchmarks? I can share IPython notebooks via
gist, for example.

Michael Eickenberg
2015-05-29 21:33:22 UTC
Permalink
Thanks for the update.
Post by Artem
So, what's the consensus on benchmarks? I can share ipython notebooks via
gist, for example.

My (weak) preference would be to have a script within the sklearn repo,
just to keep stuff in one place for easy future reference.

Michael
Artem
2015-05-31 17:25:33 UTC
Permalink
I added a simple benchmark
<https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py>
that compares NCA-assisted 1-NN with the default one (with Euclidean
distance) on the Wine dataset (one of the datasets reported in the NCA
paper). See my output here
<https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd>.

It also compares the semivectorized and fully vectorized implementations:
surprisingly, the semivectorized one is about 2 times faster. I think this
might be a reason to throw the fully vectorized implementation
(nca_vectorized_oracle) away.

On Sat, May 30, 2015 at 12:33 AM, Michael Eickenberg <
Post by Michael Eickenberg
Thanks for the update.
Post by Artem
So, what's the consensus on benchmarks? I can share ipython notebooks
via gist, for example.
My (weak) preference would be to have a script within the sklearn repo,
just to keep stuff in one place for easy future reference.
Michael
Post by Artem
https://github.com/scikit-learn/scikit-learn/pull/4789
As suggested by Michael, I refactored "the meat" into a function. I also
rewrote it as a first-order oracle, so I can (and do) use scipy's
optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
sometimes stop at some weird point (a local minimum / saddle point?),
whereas gradient descent seems to always converge. I didn't test either of
them extensively, though.
I also fully vectorized the function and gradient calculations; no loops
involved.
So, what's the consensus on benchmarks? I can share ipython notebooks via
gist, for example.
On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg <
Post by Michael Eickenberg
Hi Aurélien,
thanks for these very good pointers!
(Now we also know who else to bug periodically for opinions ;))
Michael
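As a side note on the first-order-oracle setup mentioned in the quoted
message above, this is the generic pattern for plugging a function that
returns both the objective value and its gradient into scipy's optimizers
(a toy least-squares objective here, not the NCA one):

    import numpy as np
    from scipy.optimize import minimize

    def oracle(w, A, b):
        # Returns (value, gradient) in one call, as scipy expects with jac=True.
        residual = A.dot(w) - b
        value = 0.5 * np.dot(residual, residual)
        grad = A.T.dot(residual)
        return value, grad

    rng = np.random.RandomState(0)
    A, b = rng.rand(20, 5), rng.rand(20)
    result = minimize(oracle, np.zeros(5), args=(A, b), jac=True, method="L-BFGS-B")
    print(result.x, result.fun)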
Michael Eickenberg
2015-05-31 18:29:11 UTC
Permalink
Post by Artem
I added a simple benchmark
<https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py>
that compares NCA-assisted 1NN with the default one (plain Euclidean
distance) on the Wine dataset (one of the datasets reported in the NCA
paper). See my output here
<https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd>.
It also compares the semivectorized and fully vectorized implementations:
surprisingly, the semivectorized one is about 2 times faster. I think this
might be a reason to throw the fully vectorized implementation
(nca_vectorized_oracle) away.
This is very interesting. I like the way you built the semi-vectorized
version, and this seems to show that you are still benefiting from the
remaining vectorizations you have.
However, I am surprised to see such a difference between the two methods.
Do you have any idea why this could be the case? Where do you think the
vectorized version loses all this time?
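One generic way to track this down is line-by-line profiling (this is just
an illustration of the recipe, not a profiling run anyone actually did):
decorate the suspect function with line_profiler's @profile and run the
script through kernprof with "kernprof -l -v bench_nca.py".

    import numpy as np

    try:                      # kernprof injects `profile` into builtins;
        profile               # fall back to a no-op so the script also
    except NameError:         # runs as plain Python.
        def profile(func):
            return func

    @profile
    def gradient_term(proba, outer):
        # The kind of line that tends to show up as hot in a vectorized oracle.
        return np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)

    if __name__ == "__main__":
        n, d = 300, 5
        gradient_term(np.random.rand(n, n), np.random.rand(n, n, d, d))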
Artem
2015-06-01 00:18:48 UTC
Permalink
Apparently, the problem was with multiplying big 4D tensors:

grad = np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)

I don't know all the details, but the line profiler said that most of the
time was spent on this line (and the other similar one), probably due to
copies. I recalled your words on np.einsum, replaced the sums with it and
got quite a speedup (about 4x)!
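For reference, that broadcast-multiply-then-sum can indeed be collapsed
into a single einsum call (the exact call in the branch may differ; the
shapes below are made up for illustration), which avoids allocating the
extra (n, n, d, d) product array before the two sums:

    import numpy as np

    n, d = 150, 4
    proba = np.random.rand(n, n)      # p_ij weights
    diffs = np.random.rand(n, n, d)   # pairwise differences x_i - x_j
    outer = diffs[:, :, :, np.newaxis] * diffs[:, :, np.newaxis, :]  # (n, n, d, d)

    grad_slow = np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)
    grad_fast = np.einsum('ij,ijkl->kl', proba, outer)   # same result, fewer temporaries
    assert np.allclose(grad_slow, grad_fast)

    # Going further, the outer products themselves need not be materialized:
    grad_direct = np.einsum('ij,ijk,ijl->kl', proba, diffs, diffs)
    assert np.allclose(grad_slow, grad_direct)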

New output log: link
<https://gist.github.com/Barmaley-exe/d5ca99d7b94fb665ad03>. Now the fully
vectorized version takes ~2s per iteration and is faster than the
semivectorized one.

P.S. I made the same optimization in the semivectorized version, but its
speedup isn't as significant.