Discussion:
[Scikit-learn-general] [GSoC] Metric Learning
Artem
2015-03-18 12:39:26 UTC
Permalink
Hello everyone

Recently I mentioned metric learning as one of the possible projects for
this year's GSoC, and I would like to hear your comments.

Metric learning, as the name suggests, is about learning distance
functions. Usually the metric that is learned is a Mahalanobis metric, so
the problem reduces to finding a PSD matrix A that minimizes some
functional.

Metric learning is usually done in a supervised way, that is, a user tells
which points should be closer and which should be more distant. This can be
expressed either in the form of "similar" / "dissimilar" pairs, or "A is
closer to B than to C".

Since metric learning is (mostly) about a PSD matrix A, one can perform a
Cholesky decomposition on it to obtain a matrix G that transforms the data.
This could lead to something like guided clustering, where we first
transform the data space according to our prior knowledge of similarity.
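
A minimal sketch of that idea, assuming a learned PSD matrix A and the data
X are already available as NumPy arrays:

import numpy as np

# A = G G^T via Cholesky; mapping X -> X G makes the plain Euclidean
# distance in the new space equal the Mahalanobis distance under A.
G = np.linalg.cholesky(A)
X_transformed = X.dot(G)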

Metric learning seems to be quite an active field of research ([1
<http://www.icml2010.org/tutorials.html>], [2
<http://www.ariel.ac.il/sites/ofirpele/DFML_ECCV2010_tutorial/>], [3
<http://nips.cc/Conferences/2011/Program/event.php?ID=2543>]). There are
two somewhat up-to-date surveys: [1
<http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf>] and [2
<http://arxiv.org/abs/1306.6709>].

The three seemingly most-cited methods (according to Google Scholar) are:


- MMC by Xing et al.
<http://papers.nips.cc/paper/2164-distance-metric-learning-with-application-to-clustering-with-side-information.pdf>
This is a pioneering work and, according to survey #2, "The algorithm used
to solve (1) is a simple projected gradient approach requiring the full
eigenvalue decomposition of M at each iteration. This is typically
intractable for medium and high-dimensional problems."
- Large Margin Nearest Neighbor by Weinberger et al.
<http://papers.nips.cc/paper/2795-distance-metric-learning-for-large-margin-nearest-neighbor-classification.pdf>
Survey #2 acknowledges this method as "one of the most widely-used
Mahalanobis distance learning methods", noting that "LMNN generally
performs very well in practice, although it is sometimes prone to
overfitting due to the absence of regularization, especially in high
dimension".
- Information-theoretic metric learning (ITML) by Davis et al.
<http://dl.acm.org/citation.cfm?id=1273523> This one features a special
kind of regularizer called LogDet.
- There are many other methods. If you know of other methods that rock,
let me know.


So the project I'm proposing is to implement the 2nd and/or 3rd algorithm,
along with a relevant transformer.
Andreas Mueller
2015-03-18 14:53:22 UTC
Permalink
Hey.
I am not very familiar with the literature on metric learning, but I think
one thing that we need to think about beforehand is what the interface
would be.
We really want something that works in a .fit().predict() or
.fit().transform() way.
I guess you could do "transform" to get the distances to the training data
(is that what one would want?)
But what would the labels for "fit" look like?

Cheers,
Andy
Gael Varoquaux
2015-03-18 15:27:09 UTC
Permalink
Simple, efficient and robust metric learning that learns on a supervised
set and can do a transform that applies the metric? Do you think that
would be useful? It seems to me that it would.

If people agree that it would be useful with such a very simple API, I
would be in favor of a GSoC proposal on this. As I don't think that we have
mentors who are experts in the algorithms involved, the student would need
to show in his proposal that he has a good understanding of the algorithms
and use cases.

Importantly, when introducing a new type of algorithm to scikit-learn,
simpler is always better: the API, the examples, and the use cases must be
tuned on simple algorithms.

Cheers,

Gaël
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
Andreas Mueller
2015-03-18 15:32:05 UTC
Permalink
Do you have an idea of what y would look like?
Also +1 on what you said (but you knew that ;)
Gael Varoquaux
2015-03-18 15:32:53 UTC
Permalink
Post by Andreas Mueller
Do you have an idea of what y would look like?
Me. Not sure, no. I haven't looked at the corresponding literature.

G
Artem
2015-03-18 16:21:18 UTC
Permalink
Yeah, the API is the most important question of the implementation.

These learners are not classifiers (though there exist metric-adapting
algorithms like Neighbourhood Components Analysis
<http://en.wikipedia.org/wiki/Neighbourhood_components_analysis>), so they
don't fit into the usual estimator-like fit + predict scheme.
Another thing to take into consideration is that we might want to use the
learned metric, say, in KNN, so it would be helpful to have a way to get a
DistanceMetric corresponding to it.
With this in mind, a Transformer with a y-aware fit and an attribute like
`metric_` should work.
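
For concreteness, a rough sketch of what such a transformer could look like
(the class name and the identity stand-in for the learned matrix are purely
illustrative; a real fit would run an actual solver):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import DistanceMetric

class MahalanobisTransformer(BaseEstimator, TransformerMixin):
    # Hypothetical sketch: a real fit would learn the PSD matrix A_ from
    # (X, y); the identity matrix stands in for the missing solver here.
    def fit(self, X, y):
        X = np.asarray(X)
        self.A_ = np.eye(X.shape[1])
        self.metric_ = DistanceMetric.get_metric('mahalanobis', VI=self.A_)
        return self

    def transform(self, X):
        # Map X so that Euclidean distances in the new space equal the
        # learned Mahalanobis distances.
        return np.asarray(X).dot(np.linalg.cholesky(self.A_))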

As to what y should look like, it depends on what we'd like the algorithm
to do. We can go with usual y vector consisting of feature labels.
Actually, LMNN is done this way: the optimization objective depends only on
the equality of labels. For ITML (and many others) we need sets of
(S)imilar and (D)issimilar pairs, which can also be inferred from labels.

This is a bit less generic, since it implies that similarity is
transitive, which is not true in the general case. For the general case
we'd need a way to feed in actual pairs. This could be done by giving fit
two optional arguments (similar and dissimilar) defaulting to None, which
are inferred from y when absent.
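
For illustration, inferring S and D from class labels inside fit could look
roughly like this (sketch only):

import itertools
import numpy as np

def pairs_from_labels(y):
    # Points that share a label form S (similar); the rest form D
    # (dissimilar).
    similar, dissimilar = [], []
    for i, j in itertools.combinations(range(len(y)), 2):
        (similar if y[i] == y[j] else dissimilar).append((i, j))
    return np.array(similar), np.array(dissimilar)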

So, the interface would be the usual fit(X, y) if we only want to
facilitate non-linear methods (like KNN):

space_warper = ITMLTransformer(...)
new_X = space_warper.fit(X, y).transform(X)

Or, more sophisticated:

new_X = space_warper.fit(X, similar=S, dissimilar=D).transform(X)

Not all methods support the latter scheme, so the former would be the
default, whereas (S, D)-aware methods will infer those sets from the
labels y.

P.S. I didn't consider the case where prior knowledge is given in the form
of "X is closer to A than to B", but it can be treated the same way: the
set of relations could be inferred from labels as
R = {(X, A, B) : y(A) = y(X), y(X) != y(B)}.
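
A sketch of that inference (illustrative only):

def triplets_from_labels(y):
    # Relations (X, A, B): X and A share a label, B does not, so X should
    # end up closer to A than to B.
    n = len(y)
    return [(x, a, b)
            for x in range(n)
            for a in range(n) if a != x and y[a] == y[x]
            for b in range(n) if y[b] != y[x]]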
Gael Varoquaux
2015-03-18 16:26:17 UTC
Permalink
For now, I don't think that we want to add new variants of the
scikit-learn API.

G
Artem
2015-03-18 16:55:22 UTC
Permalink
Well, we could go with fit(X, y), but since the algorithms use S and D,
it'd be better to give the user a way to specify them directly if they
want to. Either way, LMNN works with raw labels, so it doesn't require any
changes to the existing API.

Andreas Mueller
2015-03-18 18:36:53 UTC
Permalink
The issue is that having anything other than fit(X, y) would break
cross_val_score, GridSearchCV and Pipeline.
I agree that more control is good, but having functions that don't work
well with the rest of the package is not great.

Only being able to "transform" to a distance to the training set is a
bit limiting, but I don't see a different way to do
it within the current API.
Can you explain this statement a bit more: "We can go with usual y
vector consisting of feature labels"?

Thanks,
Andy
Artem
2015-03-18 18:53:42 UTC
Permalink
I mean that if we were solving classification, we would have y that tells
us which class each example belongs to. So if we pass this classification's
ground truth vector y to metric learning's fit, we can form S and D inside
by saying that observations from the same class should be similar.

Only being able to "transform" to a distance to the training set is a bit
limiting
Sorry, I don't understand what you mean by this. Can you elaborate?
The metric does not memorize training samples; it finds a (linear unless
kernelized) transformation that makes similar examples cluster together.
Moreover, since the metric is completely determined by a PSD matrix, we can
compute its square root and use it to transform new data without any
supervision.
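
For instance, a rough sketch of that transform, assuming the learned PSD
matrix A and some new data X_new are available as NumPy arrays:

import numpy as np

# Symmetric square root of the learned PSD matrix A; new samples X_new are
# mapped without needing the training data or labels again.
w, V = np.linalg.eigh(A)
A_sqrt = V.dot(np.diag(np.sqrt(np.clip(w, 0, None)))).dot(V.T)
X_new_transformed = X_new.dot(A_sqrt)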
Andreas Mueller
2015-03-18 19:07:07 UTC
Permalink
Post by Artem
I mean that if we were solving classification, we would have y that
tells us which class each example belongs to. So if we pass this
classification's ground truth vector y to metric learning's fit, we
can form S and D inside by saying that observations from the same
class should be similar.
Ah, I got it now.
Post by Artem
Only being able to "transform" to a distance to the training set is a bit
limiting
Sorry, I don't understand what you mean by this. Can you elaborate?
The metric does not memorize training samples; it finds a (linear unless
kernelized) transformation that makes similar examples cluster together.
Moreover, since the metric is completely determined by a PSD matrix, we can
compute its square root and use it to transform new data without any
supervision.
Ah, I think I misunderstood your proposal for the transformer interface.
Never mind.


Do you think this interface would be useful enough? I can think of a
couple of applications.
It would definitely fit well into the current scikit-learn framework.

Do you think it would make sense to use such a transformer in a pipeline
with a KNN classifier?
I feel that training both on the same labels might be a bit of an issue
with overfitting.
Artem
2015-03-18 21:14:48 UTC
Permalink
Post by Andreas Mueller
Do you think this interface would be useful enough?
One of the mentioned methods (LMNN) actually uses prior knowledge in
exactly the same way, by comparing label equality, though it was designed
to facilitate KNN.
The authors of the other one (ITML) explicitly mention in the paper that
one can construct the sets S and D from labels.

Post by Andreas Mueller
Do you think it would make sense to use such a transformer in a pipeline
with a KNN classifier?
I feel that training both on the same labels might be a bit of an issue
with overfitting
Pipelining looks like a good way to combine these methods, but overfitting
could be a problem, indeed.
Not sure how severe it can be.
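
For what it's worth, a sketch of such a pipeline (LMNNTransformer is a
placeholder name for whatever metric-learning transformer gets
implemented):

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# LMNNTransformer is a placeholder; both steps see the same labels y during
# fit, which is where the overfitting concern comes from.
pipe = Pipeline([('metric', LMNNTransformer()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)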
Andreas Mueller
2015-03-18 21:47:39 UTC
Permalink
In summary, I think this does look like a good basis for a proposal :)
Joel Nothman
2015-03-19 02:35:47 UTC
Permalink
I don't know a lot about metric learning either, but from your initial
statement it sounded like fit(X, D), where D is the target/known distance
between each pair of points in X, might be appropriate. I have no idea if
this is how it is formulated in the literature (your mention of asymmetric
metrics means it might be), but it seems an intuitive representation of the
problem.

Your suggestion of "similar" and "dissimilar" groups could be represented
by D being a symmetric matrix with some distances 1 (dissimilar) and others
0 (similar), but you imply that some or the majority of cells would be
unknown (in which case a sparse D interpreting all non-explicit values as
unknown may be appropriate).

I would have thought in the case of Mahalanobis distances that transform
would transform each feature such that the resulting feature space was
Euclidean.
Artem
2015-03-19 15:13:44 UTC
Permalink
Yes, your suggestion is viable, but I haven't seen any algorithms in
sklearn that use y like that in the fit method.

Post by Joel Nothman
I would have thought in the case of Mahalanobis distances that transform
would transform each feature such that the resulting feature space was
Euclidean.
Exactly. Thus, methods that use the usual L2 distance (like KMeans) will
effectively be using those custom metrics.

Also, one can apply the kernel trick to get a metric for a non-linear
transformation.
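
As a sketch of that guided-clustering use (metric_transformer stands for a
hypothetical fitted metric-learning transformer):

from sklearn.cluster import KMeans

# KMeans uses plain Euclidean distance, so clustering the transformed data
# is equivalent to clustering the original data under the learned metric.
X_new = metric_transformer.fit(X, y).transform(X)
cluster_labels = KMeans(n_clusters=3).fit_predict(X_new)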
Zay Maung Maung Aye
2015-03-20 00:50:37 UTC
Permalink
Neighborhood Component Analysis is more cited than ITML.
--
*孟泜*
Meng Ze ( Zay Maung Maung Aye)
Gael Varoquaux
2015-03-20 07:01:45 UTC
Permalink
Post by Zay Maung Maung Aye
Neighborhood Component Analysis is more cited than ITML.
There is already a pull request on neighborhood component analysis
https://github.com/scikit-learn/scikit-learn/issues/3213

A first step of the GSoC could be to complete it.

Gaël
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
Artem
2015-03-22 00:54:37 UTC
Permalink
Are there any objections to Joel's variant of y? It serves my needs, but it
is quite different from what one usually finds in scikit-learn.
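To make the question concrete, here is a minimal sketch of what such a
pairwise y could look like; MetricLearner is a purely hypothetical name for
whatever estimator would consume it:

    import numpy as np
    from scipy.sparse import coo_matrix

    # Toy data: 5 samples, 3 features.
    X = np.random.RandomState(0).randn(5, 3)

    # Pairwise supervision as a sparse symmetric matrix:
    # +1 for "similar" pairs, -1 for "dissimilar" pairs, absent = unknown.
    rows = [0, 1, 0, 2, 3, 4]
    cols = [1, 0, 2, 0, 4, 3]
    vals = [+1, +1, -1, -1, +1, +1]
    Y = coo_matrix((vals, (rows, cols)), shape=(5, 5))

    # Hypothetical estimator learning a Mahalanobis matrix from (X, Y):
    # metric_learner = MetricLearner().fit(X, Y)
    # X_new = metric_learner.transform(X)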

------

Another point I want to bring up is metric-aware KMeans. Currently it works
with the Euclidean distance only, which is not a problem for a Mahalanobis
distance, but as (and if) we move towards kernel metrics, it becomes
impossible to transform the data in a way that the Euclidean distance between
the transformed points accurately reflects the distance between the points in
the space with the learned metric.

I think it'd be nice to have "non-linear" metrics, too. One possible approach
(widely recognized among metric learning researchers) is to apply KernelPCA
before learning the metric. This would work really well with sklearn's
Pipelines.
But not all methods are justified for use with Kernel PCA. Namely, ITML uses
a special kind of regularization that breaks all theoretical guarantees.
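A rough sketch of that combination, again assuming a purely hypothetical
MetricLearner transformer:

    from sklearn.decomposition import KernelPCA
    from sklearn.pipeline import make_pipeline

    # Kernelize a linear metric learner: first map the data into a truncated
    # kernel feature space, then learn a Mahalanobis metric there.
    # MetricLearner is hypothetical; any linear metric learner would slot in.
    # pipeline = make_pipeline(
    #     KernelPCA(n_components=50, kernel="rbf", gamma=0.1),
    #     MetricLearner(),
    # )
    # pipeline.fit(X, Y)            # Y is the pairwise supervision from above
    # X_new = pipeline.transform(X)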

Also, it's a bit weird that something called metric learning actually does a
space transformation. Maybe we should also add something like factories of
metrics, whose sole result is a DistanceMetric (in particular for those
kernel metrics)?

Joel Nothman
2015-03-22 00:59:17 UTC
Permalink
Post by Artem
Are there any objections to Joel's variant of y? It serves my needs, but it
is quite different from what one usually finds in scikit-learn.
FWIW It'll require some changes to cross-validation routines.
Mathieu Blondel
2015-03-22 03:42:26 UTC
Permalink
I skimmed through this survey:
http://arxiv.org/abs/1306.6709

For methods that learn a Mahalanobis distance, as Artem said, we can indeed
compute the Cholesky decomposition of the learned precision matrix and use
it to transform the data. Thus in this case metric learning can be seen as
supervised dimensionality reduction, where supervision comes in the form of
sample similarity / dissimilarity.

So yes, fit(X, Y) where Y is a sparse symmetric matrix should work, but as
mentioned by Joel this would need modifications to cross-validation.
Maybe we can add a _pairwise_y property like the _pairwise property that we
use for kernel methods:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py#L122
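Something along these lines, mirroring the existing _pairwise property; the
name _pairwise_y and the estimator below are only a suggestion:

    from sklearn.base import BaseEstimator, TransformerMixin

    class MetricLearner(BaseEstimator, TransformerMixin):
        """Hypothetical metric learner taking pairwise supervision in fit(X, Y)."""

        @property
        def _pairwise_y(self):
            # Analogous to the existing _pairwise property for kernel methods:
            # tells cross-validation that Y is an (n_samples, n_samples) matrix
            # that must be sliced along both axes when the data is split.
            return True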

The method of Xing is highly cited but seems limited. For scikit-learn, we
would prefer methods which work and scale well in practice.
Any suggestions?

For parametric forms other than Mahalanobis distance, we would need a way
to get the learned similarity matrix and plug it into clustering or
classification/regression algorithms.

In any case, a core developer needs to step up to mentor this project.
Gael, you seem excited about metric learning :)?

Mathieu
Artem
2015-03-22 10:42:48 UTC
Permalink
I suppose Xing's MMC is highly cited because it pioneered the field. Still,
having Ng and Jordan as co-authors looks impressive. Either way, it requires
performing an eigendecomposition at each step, which has cubic (in the number
of features) complexity.

Post by Mathieu Blondel
we would need a way to get the learned similarity matrix and plug it into
clustering or classification/regression algorithms
Would the current implementation of KMeans support this? From what I know,
KMeans uses distances to centroids in the E step and then sets each centroid
to the mean of the vectors in its cluster in the M step, so a centroid won't
be present in that matrix, and there is no way to measure the distance to it
at the next E step.
I have heard, though, that one can still do KMeans in such a setting, but I
don't know the details. Is that what is done when *precompute_distances* is
on?
Gael Varoquaux
2015-03-22 08:26:09 UTC
Permalink
Post by Joel Nothman
FWIW It'll require some changes to cross-validation routines.
I'd rather we try not to add new needs and use cases to these before we
release 1.0. We are already having a hard time covering all the possible
options in a homogeneous way.

Gaël
Andreas Mueller
2015-03-23 14:07:28 UTC
Permalink
Post by Artem
I think it'd be nice to have "non-linear" metrics, too. One possible approach
(widely recognized among metric learning researchers) is to apply KernelPCA
before learning the metric. This would work really well with sklearn's
Pipelines.
But not all methods are justified for use with Kernel PCA. Namely, ITML uses
a special kind of regularization that breaks all theoretical guarantees.
This can also be done using the Nystroem kernel approximation class,
which just transforms data into the subspace of the Hilbert space
spanned by the training examples (or a subset of these).
Artem
2015-03-23 22:03:56 UTC
Permalink
The theoretical justification for using kernel PCA is that the data needs to
be projected onto the span of the eigenvectors of a covariance matrix
(section 3.1.4 of Kulis' survey
<http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf>). Does
kernel approximation whiten the data?

Either way, there is apparently no justification for using kernel
approximation with ITML, since even the regular KPCA trick doesn't apply to
it.
Andreas Mueller
2015-03-23 22:09:44 UTC
Permalink
Hi Artem.
I think the overall feedback on your proposal was positive.
Did you get the chance to write it up yet?
Please submit your proposal on melange https://www.google-melange.com
(deadline is this Friday)
and mention / link it in our wiki:
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2015

Btw, what is your github name?

Andy
Artem
2015-03-23 22:31:26 UTC
Permalink
Hi Andreas

My GitHub name is Barmaley-exe. I put a draft
<https://github.com/scikit-learn/scikit-learn/wiki/%5BWIP%5D-GSoC-2015-Proposal:-Metric-Learning-module>
of my proposal on the wiki, but there are still several unanswered questions:

1. One of the applications of metric learning I envision is a
   "somewhat-supervised" clustering, where a user can seed in some knowledge
   and then use the resultant metric in clustering. To get it working, the
   following is needed:
   1. DistanceMetric-aware clustering. It turned out there are already
      methods that can do clustering on a similarity matrix, but should I
      generalize KMeans / hierarchical clustering?
   2. The general scheme of training would require a matrix-like y (like the
      one proposed by Joel). What is the consensus on that?
2. Though 2 of the 3 methods I plan to implement are kernelizable by KPCA,
   the last one (ITML) is not. So if I implement it (ITML with a kernel
   trick), it'd be impossible to transform the data space; thus it won't work
   as a Transformer. This problem could be fixed by making it not a
   Transformer but an Estimator that would predict a similarity matrix (a
   rough sketch of that interface follows below). What do you think?
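A rough sketch of what I mean by that second option, with all names purely
illustrative:

    from sklearn.base import BaseEstimator
    from sklearn.metrics.pairwise import pairwise_kernels

    class KernelITML(BaseEstimator):
        """Hypothetical kernelized ITML-like estimator.

        Instead of acting as a Transformer, it only exposes the learned
        similarity between points, e.g. for clustering methods that accept
        a precomputed affinity matrix.
        """

        def fit(self, X, Y):
            # Y: (n_samples, n_samples) matrix of pairwise constraints.
            self.X_fit_ = X
            # ... learn the coefficients of the kernelized metric here ...
            return self

        def predict_similarity(self, X):
            # Placeholder: a plain RBF kernel against the training data; a
            # real implementation would return the learned similarity instead.
            return pairwise_kernels(X, self.X_fit_, metric="rbf")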
Andreas Mueller
2015-03-23 22:43:12 UTC
Permalink
Hi Artem.
I thought that was you, but I wasn't sure.
Great, I linked to your draft from the wiki overview page, otherwise it
is hard to find.
I haven't looked at it in detail yet, though.

1.1: no, generalizing K-Means is out of scope. Hierarchical should work
with arbitrary metrics.
1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.

2. I'd have to look into it. I don't understand why KPCA wouldn't work.
It should work for all metrics, right? Having something produce a
similarity matrix is not ideal, but I think it could be made to work.
I'd still call it ``transform`` probably, though. It would be a bit
confusing because it uses the squared transform, but it would make it
possible to build pipelines with clustering algorithms.
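Roughly what I have in mind, with MetricLearner again only a placeholder
name:

    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline

    # A transformer-style metric learner can feed straight into clustering:
    # the learned linear map is applied first, then plain Euclidean KMeans
    # runs in the transformed space.
    # pipeline = make_pipeline(MetricLearner(), KMeans(n_clusters=3))
    # labels = pipeline.fit(X, Y).predict(X)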

Best,
Andy
Aurélien Bellet
2015-03-23 23:20:59 UTC
Permalink
Hi everyone,

I don't know a lot about scikit-learn but perhaps I can help answer some
of the questions about metric learning:

- Like someone mentioned, any Mahalanobis distance metric can be used to
linearly project the data into a new space (based on the square root of the
learned PSD matrix) where the Euclidean distance is equivalent. This can be
used as a transformer in scikit-learn.

- LMNN, NCA and ITML are indeed the most standard algorithms in metric
learning and work well in practice (although they may not scale too
well). Starting with these makes sense.

- Someone said it would be nice to have a more scalable method. I would
recommend OASIS
(http://www.jmlr.org/papers/volume11/chechik10a/chechik10a.pdf), which
scales well to large datasets thanks to its simple online algorithm. Note
that it learns a similarity function, not a distance. However, it can
still be used to transform the data if the learned matrix is projected
onto the PSD cone - then the dot product in the new space is equivalent
to the learned similarity (see the discussion in Section 6 of the paper,
and the small sketch after this list).
Other popular online methods include LEGO
(https://www.cs.utexas.edu/~pjain/pubs/online_nips.pdf) and RDML
(http://www.cse.msu.edu/~rongjin/publications/nips10-dist-learn.pdf)

- The KPCA trick is a convenient method to make a metric learning
algorithm nonlinear. The theoretical justification does not hold for all
algorithms, but in practice it is a preprocessing step applied to the data
before running the metric learning algorithm, so it can be used together
with any method. There are also methods that directly learn a nonlinear
distance, for instance GB-LMNN
(http://www-bcf.usc.edu/~feisha/pubs/chi2.pdf) or some approaches based
on deep neural nets.
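A small numerical sketch of the PSD-cone projection and the induced linear
map mentioned above (plain NumPy, independent of any particular
implementation):

    import numpy as np

    def project_to_psd(W):
        """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
        eigvals, eigvecs = np.linalg.eigh(W)
        return eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T

    rng = np.random.RandomState(0)
    W = rng.randn(4, 4)
    W = (W + W.T) / 2.0                 # symmetrized "learned" similarity matrix

    M = project_to_psd(W)               # PSD part of W
    L = np.linalg.cholesky(M + 1e-10 * np.eye(4))   # square root -> linear map
    # x @ M @ y == (x @ L) @ (y @ L): the dot product after mapping with L
    # recovers (approximately) the projected similarity.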

Aurélien
Artem
2015-03-24 00:13:46 UTC
Permalink
Hi Aurélien

Thanks for your comments! Can you say anything about kernelization as part
of the model, rather than via KPCA? I'm especially interested in a
kernelized version of ITML. I think kernel metric learning methods don't
scale well, since one has to work with a huge matrix of size n_samples x
n_samples, which quickly becomes impractical.

Post by Aurélien Bellet
some approaches based on deep neural nets
You mean NNCA from Hinton's lab? I got scared by the amount of NN-specific
voodoo magic happening there; I don't think it's worth implementing. The
other method I'm aware of (by Chopra et al.) relies on convolutional
networks, which are not present in sklearn.

I can't say anything against GB-LMNN (or chi-squared LMNN), though. I
think they can be added later, if needed.


On Tue, Mar 24, 2015 at 2:20 AM, Aurélien Bellet <
Post by Aurélien Bellet
Hi everyone,
I don't know a lot about scikit-learn but perhaps I can help answer some
- Like someone mentioned, any Mahalanobis distance metric can be used to
linearly project data into a new space (based on the square root of the
learned PSD matrix) where the Euclidean distance is equivalent. This can
used as a transformer in scikit-learn.
- LMNN, NCA and ITML are indeed the most standard algorithms in metric
learning and work well in practice (although they may not scale too
well). Starting with these makes sense.
- Someone said it would be nice to have a more scalable method. I would
recommend OASIS
(http://www.jmlr.org/papers/volume11/chechik10a/chechik10a.pdf), which
scales well to large datasets due to its simple online algorithm. Note
that it learns a similarity function, not a distance. However it can
still be used to transform the data if the learned matrix is projected
onto the PSD cone - then the dot product in the new space is equivalent
to the learned similarity (see discussion in Section 6 of the paper).
Other popular online methods include LEGO
(https://www.cs.utexas.edu/~pjain/pubs/online_nips.pdf) and RDML
(http://www.cse.msu.edu/~rongjin/publications/nips10-dist-learn.pdf)
- The KPCA trick is a convenient method to make a metric learning
algorithm nonlinear. Theoretical justification does not hold for all
algorithms but in practice it is a preprocessing applied to the data
before running the metric learning algorithm so it can be used together
with any method. There are also methods that directly learn a nonlinear
distance, for instance GB-LMNN
(http://www-bcf.usc.edu/~feisha/pubs/chi2.pdf) or some approaches based
on deep neural nets.
Aurélien
Post by Andreas Mueller
Hi Artem.
I thought that was you, but I wasn't sure.
Great, I linked to your draft from the wiki overview page, otherwise it
is hard to find.
I haven't looked at it in detail yet, though.
1.1: no, generalizing K-Means is out of scope. Hierarchical should work
with arbitrary metrics.
1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
2. I'd have to look into it. I don't understand why KPCA wouldn't work.
It should work for all metrics, right? Having something produce a
similarity matrix is not ideal, but I think it could be made to work.
I'd still call it ``transform`` probably, though. It would be a bit
confusing because it uses the squared transform, but it would make it
possible to build pipelines with clustering algorithms.
Best,
Andy
Post by Artem
Hi Andreas
My GitHub's name is Barmaley-exe. I put a draft
<
https://github.com/scikit-learn/scikit-learn/wiki/%5BWIP%5D-GSoC-2015-Proposal:-Metric-Learning-module
Post by Andreas Mueller
Post by Artem
of my proposal on wiki, but there are still several unanswered
1. One of the applications of metric learning I envision is a
"somewhat-supervised" clustering, where user can seed in some
knowledge, and then use the resultant metric in clustering. To get
1. DistanceMetric-aware Clustering. Turned out, there are already
methods that can do clustering on a similarity matrix, but
should I generalize KMeans / Hierarchical clustering?
2. General scheme of training would require matrix-like y (Like
the one proposed by Joel). What is the consensus on that?
2. Though 2 of 3 methods that are planned to implement are
kernelizable by KPCA, the last one (ITML) is not. So if I
implement it (ITML with a kernel trick), it'd be impossible to
transform the data space. Thus, it won't work as a Transformer.
This problem can be fixed by making it not a Transformer, but an
Estimator that would predict a similarity matrix. What do you think?
Hi Artem.
I think the overall feedback on your proposal was positive.
Did you get the chance to write it up yet?
Please submit your proposal on melange
https://www.google-melange.com (deadline is this Friday)
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2015
Post by Andreas Mueller
Post by Artem
Btw, what is your github name?
Andy
Post by Artem
Hello everyone
Recently I mentioned metric learning as one of possible projects
for this years' GSoC, and would like to hear your comments.
Metric learning, as follows from the name, is about learning
distance functions. Usually the metric that is learned is a
Mahalanobis metric, thus the problem reduces to finding a PSD
matrix A that minimizes some functional.
Metric learning is usually done in a supervised way, that is, a
user tells which points should be closer and which should be more
distant. It can be expressed either in form of "similar" /
"dissimilar", or "A is closer to B than to C".
Since metric learning is (mostly) about a PSD matrix A, one can
do Cholesky decomposition on it to obtain a matrix G to transform
the data. It could lead to something like guided clustering,
where we first transform the data space according to our prior
knowledge of similarity.
Metric learning seems to be quite an active field of research ([1
<http://www.icml2010.org/tutorials.html>], [2
<http://www.ariel.ac.il/sites/ofirpele/DFML_ECCV2010_tutorial/>],
[3 <http://nips.cc/Conferences/2011/Program/event.php?ID=2543>]).
There are 2 somewhat up-to date surveys: [1
<
http://web.cse.ohio-state.edu/%7Ekulis/pubs/ftml_metric_learning.pdf>]
Post by Andreas Mueller
Post by Artem
Post by Artem
and [2 <http://arxiv.org/abs/1306.6709>].
Top 3 seemingly most cited methods (according to Google Scholar)
are
Post by Andreas Mueller
Post by Artem
Post by Artem
* MMC by Xing et al.
<
http://papers.nips.cc/paper/2164-distance-metric-learning-with-application-to-clustering-with-side-information.pdf>
This
Post by Andreas Mueller
Post by Artem
Post by Artem
is a pioneering work and, according to the survey #2
The algorithm used to solve (1) is a simple projected
gradient approach requiring the full
​ ​
eigenvalue decomposition of
​ ​
M
​ ​
at each iteration. This is typically intractable for medium
​ ​
and high-dimensional problems
* ​Large Margin Nearest Neighbor by Weinberger et al
<
http://papers.nips.cc/paper/2795-distance-metric-learning-for-large-margin-nearest-neighbor-classification.pdf
Post by Andreas Mueller
.
Post by Artem
Post by Artem
The survey 2 acknowledges this method as "one of the most
widely-used Mahalanobis distance learning methods"
LMNN generally performs very well in practice, although
it is sometimes prone to overfitting due to the absence
of regularization, especially in high dimension
* Information-theoretic metric learning by Davis et al.
<http://dl.acm.org/citation.cfm?id=1273523> This one features
a special kind of regularizer called logDet.
* There are many other methods. If you guys know that other
methods rock, let me know.
So the project I'm proposing is about implementing 2nd or 3rd (or
both?) algorithms along with a relevant transformer.
------------------------------------------------------------------------------
Post by Andreas Mueller
Post by Artem
Post by Artem
Dive into the World of Parallel Programming The Go Parallel
Website, sponsored
Post by Andreas Mueller
Post by Artem
Post by Artem
by Intel and developed in partnership with Slashdot Media, is your
hub for all
Post by Andreas Mueller
Post by Artem
Post by Artem
things parallel software development, from weekly thought
leadership blogs to
Post by Andreas Mueller
Post by Artem
Post by Artem
news, videos, case studies, tutorials and more. Take a look and
join the
Post by Andreas Mueller
Post by Artem
Post by Artem
conversation now.http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Andreas Mueller
Post by Artem
Dive into the World of Parallel Programming The Go Parallel
Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your
hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Andreas Mueller
Post by Artem
Dive into the World of Parallel Programming The Go Parallel Website,
sponsored
Post by Andreas Mueller
Post by Artem
by Intel and developed in partnership with Slashdot Media, is your hub
for all
Post by Andreas Mueller
Post by Artem
things parallel software development, from weekly thought leadership
blogs to
Post by Andreas Mueller
Post by Artem
news, videos, case studies, tutorials and more. Take a look and join the
conversation now.http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Andreas Mueller
Dive into the World of Parallel Programming The Go Parallel Website,
sponsored
Post by Andreas Mueller
by Intel and developed in partnership with Slashdot Media, is your hub
for all
Post by Andreas Mueller
things parallel software development, from weekly thought leadership
blogs to
Post by Andreas Mueller
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Aurélien Bellet
2015-03-24 07:42:08 UTC
Permalink
Post by Artem
Thanks for your comments! Can you say anything about kernelization as part
of the model, rather than via KPCA? I'm especially interested in a
kernelized version of ITML. I think kernel metric learning methods don't
scale well, since one has to work with a huge matrix of size n_samples x
n_samples, which quickly becomes impractical.
ITML can be formally kernelized, see Section 4.3 of the paper (see also
http://www.cs.utexas.edu/users/inderjit/public_papers/metric_kernel_learning_jmlr12.pdf).
However you will indeed face the problem of having to construct an n x n
kernel matrix and learn an n x n matrix.
Post by Artem
You mean NNCA from Hinton's lab? I got scared by the amount of NN-specific
voodoo magic happening there; I don't think it's worth implementing. The
other method I'm aware of (by Chopra et al.) relies on convolutional
networks, which are not present in sklearn.
You're right: it will involve some neural nets black magic ;-)

Two things I forgot to mention:

-
http://www.cs.cornell.edu/people/tj/publications/schultz_joachims_03a.pdf is
also popular and can be solved using a linear SVM solver, so it can
scale rather well and is easy to implement since SVM solvers are already
part of scikit-learn.

- Generally, it might be interesting, for all algorithms, to have an
option to restrict the learned matrix to be diagonal when dealing with
high-dimensional data. This is a simple trick that avoids the O(d^2) or
O(d^3) time/memory cost and performs fairly well when d is large.
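
As a tiny illustration (sketch only): with M = diag(w) the learned distance
reduces to a weighted Euclidean distance, so only d nonnegative weights
have to be learned instead of a full d x d matrix:

import numpy as np

def diagonal_mahalanobis(x, y, w):
    # w holds the (nonnegative) diagonal of M; the distance is
    # sqrt((x - y)^T diag(w) (x - y))
    diff = np.asarray(x) - np.asarray(y)
    return np.sqrt(np.sum(w * diff ** 2))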

Aurelien
Joel Nothman
2015-03-24 11:34:48 UTC
Permalink
Hi Artem, I've taken a look at your proposal. I think this is an
interesting contribution, but I suspect your proposal is far too ambitious:

- The proposal doesn't adequately account for the need to receive reviews
and alter the PR accordingly. This is especially so because you are
developing a new variant of the API, which means that even if the algorithm
works perfectly you won't get a free green light.
- With an implementation of one or two algorithms, it would be much better
to add good examples of their utility and their features to the example
gallery than to implement more algorithms. Developing good examples takes
time too (and the reviewers are just as picky).
- You will need to package your contributions into manageable PRs, and
ideally after each is merged, the overall project should still be usable
(well-tested, documented, etc.). So the documentation will, at least in
some measure, need to be integrated.
- As Gaël suggested, there's some cause for concern in that it requires
developing a new variant of the general API. This means everything is
slower, with more need for sanity and integration testing than other
projects may entail.
Artem
2015-03-24 11:52:53 UTC
Permalink
Hi Joel. Thanks for your input!

I understand that I put a lot into my proposal, but it's hard to estimate
the timeline exactly. Thus, I suggest thinking of it as ordered by
priority: the most important things go first, and the least important
(like kernel ITML) may be dropped in favor of documentation / testing /
etc. if I run out of time.

The point about self-contained PRs is valid; I agree with you. Though I'd
like to write a tutorial once the base is established (say, once there are
at least 2 working algorithms). I'll see if I can reschedule my timeline
accordingly.

I understand that it takes some time to receive feedback on PRs. I think
my current timeline works pretty well in that sense: after each 2-week
"iteration" (preferably even earlier) I should have something ready for
review. Then I'll work on the next piece of the project while waiting for
review feedback.

I'll elaborate on my proposal later today.
Artem
2015-03-23 23:45:35 UTC
Permalink
2. From a mathematical point of view, applying KPCA first would be
equivalent to doing kernelized metric learning for some of the methods
(those with no regularization, or with a special kind of regularization).
ITML has a LogDet regularizer, which doesn't fit into that class of models.
Artem
2015-03-24 12:54:20 UTC
Permalink
Post by Andreas Mueller
I'd still call it ``transform`` probably, though. It would be a bit
confusing because it uses the squared transform, but it would make it
possible to build pipelines with clustering algorithms.
It's unfortunate that we already have a `transform` for "linear" metric
learners. One could also want to use such a linear learner to get a
similarity matrix.

So I just thought: what if metric learners had an attribute `metric` which
behaves as a DistanceMetric and a Transformer at the same time? Its
`transform` would return a similarity matrix, and its `fit` would trigger
the "parent's" `fit` (an instance of this distance-metric-transformer would
hold a reference to the containing metric learner instance and call its
fit). This is a bit non-standard for sklearn, but it looks good when
combined with pipelines. Like this:

ml = MetricLearner()
sc = SpectralClustering(affinity='precomputed')
pipeline = Pipeline([('ml', ml.metric), ('sc', sc)])
pipeline.fit(X_train, y_train)    # ml.metric.fit(X, y) calls ml.fit(X, y)
pipeline.predict(X_test)          # ml.metric.transform returns a similarity matrix, using ml's data

One would also be able to pipeline it with KNN:

ml = MetricLearner()
knn = KNN()
pipeline = Pipeline([('ml', ml), ('knn', knn)])
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)          # ml.transform returns the transformed data
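
A very rough sketch of how such a `metric` proxy could be wired up
(hypothetical names; cloning and get_params/set_params handling are
deliberately ignored here):

from sklearn.base import BaseEstimator, TransformerMixin

class LearnedMetricProxy(BaseEstimator, TransformerMixin):
    # Hypothetical sketch only: fitting the proxy fits the containing
    # metric learner; transform returns similarities of X to the training
    # data under the learned metric.
    def __init__(self, learner):
        self.learner = learner

    def fit(self, X, y=None):
        self.learner.fit(X, y)
        return self

    def transform(self, X):
        return self.learner.similarity_to_train(X)  # hypothetical parent method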
Gael Varoquaux
2015-03-24 12:56:56 UTC
Permalink
Post by Artem
So I just thought: what if metric learners had an attribute `metric`
Before adding features and API entries, I'd really like to focus on
having a 1.0 release, with a fixed API that really solves the problems
that we are currently trying to solve.

In other words, I would like to get into an "API freeze" state where we
add/modify only essential stuff in the API.

Gaël
Joel Nothman
2015-03-24 12:59:36 UTC
Permalink
Post by Gael Varoquaux
Post by Artem
So I just thought: what if metric learners had an attribute `metric`
Before adding features and API entries, I'd really like to focus on
having a 1.0 release, with a fixed API that really solves the problems
that we are currently trying to solve.
In other words, I would like to get into an "API freeze" state where we
add/modify only essential stuff in the API.
Gaël
To make this more concrete, the MetricLearner().metric_ estimator would
require specialised set_params or clone behaviour, I assume. I.e. it
involves hacking API fundamentals.
Gael Varoquaux
2015-03-24 13:01:28 UTC
Permalink
Post by Joel Nothman
To make this more concrete, the MetricLearner().metric_ estimator would
require specialised set_params or clone behaviour, I assume. I.e. it
involves hacking API fundamentals.
It's more a general principle of "freeze": to be able to settle down on
something that we _know_ works and is robust, understandable, bugless...
we need to stop changing or adding things.

Gaël
Joel Nothman
2015-03-24 13:06:43 UTC
Permalink
Post by Gael Varoquaux
Post by Joel Nothman
To make this more concrete, the MetricLearner().metric_ estimator would
require specialised set_params or clone behaviour, I assume. I.e. it
involves hacking API fundamentals.
It's more a general principle of "freeze": to be able to settle down on
something that we _know_ works and is robust, understandable, bugless...
we need to stop changing or adding things.
Yes, I get that too. GSoC tends to pull in the opposite direction by way of
being project-oriented.
Artem
2015-03-24 19:15:09 UTC
Permalink
Post by Gael Varoquaux
In other words, I would like to get into an "API freeze" state where we
add/modify only essential stuff in the API.
OK, then I suppose the easiest way would be to create 2 kinds of
transformers for each method: one that transforms the space so that the
Euclidean distance acts like the Mahalanobis one, and another that
transforms the data into a similarity matrix. Any objections?
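
Roughly, usage of the two flavours would look like this (placeholder class
names, not a proposed API):

space_tf = NCATransformer().fit(X_train, y_train)   # placeholder name
X_new = space_tf.transform(X_test)                  # Euclidean distance here = learned metric

sim_tf = NCASimilarity().fit(X_train, y_train)      # placeholder name
S = sim_tf.transform(X_test)                        # similarities to the training data, shape (n_test, n_train)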

I removed kernel ITML from the timeline (though it's still there in case I
have time left), and added a week to each iteration for documentation and
tests as I go.

New version:
https://github.com/scikit-learn/scikit-learn/wiki/%5BWIP%5D-GSoC-2015-Proposal:-Metric-Learning-module
Olivier Grisel
2015-03-24 22:44:19 UTC
Permalink
I also share Gael's concerns with respect to extending our API in yet
another direction at a time when we are trying to focus on ironing
out consistency issues...
--
Olivier
Artem
2015-03-25 00:25:40 UTC
Permalink
You mean matrix-like y?

Gael said:
FWIW it'll require some changes to cross-validation routines. I'd rather
we try not to add new needs and use cases to these before we release 1.0.
We are already having a hard time covering all the possible options in a
homogeneous way.

Then Andreas:
1.2: matrix-like Y should actually be fine with cross-validation. I think
it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.

So if we don't want to alter the API, I suppose this feature should be
postponed until after 1.0?
Vlad Niculae
2015-03-25 01:04:28 UTC
Permalink
Hi Artem, hi everybody,

There were two API issues and I think both need thought. The first is the
matrix-like Y, which at the moment overlaps semantically with multilabel
and multioutput-multiclass (though I think it could be seen as a form of
multi-target regression…).

The second is `estimator.metric`, which would be a new convention. The
problem here is proxying fit/predict/{set|get}_params calls to the parent,
as Joel noted.

IMHO the first is slightly less scary than the second, but I'm not sure
where we should draw the line.

A few thoughts and questions about your proposal, on top of the excellent
comments the others gave so far:

The matrix-like Y links to a question I had: you say it only holds -1s, 1s
and 0s. But don't metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?

Overall the proposal would benefit from including a bit more detail on the
metric learning methods and the relationships/differences/tradeoffs
between them.

Would metric learning be useful for regression in any way? My question was triggered by your saying that it could be used in the KNN classifier, which made me wonder why not in the regressor. E.g. one could bin the `y`.

Nitpicks:

* what does SWE stand for?
* missing articles: "equivalent to (linear)" -> "equivalent to a (linear)", "as if trained kernelized" -> "as if we trained a kernelized", "Core contribution" -> "The core contribution", "expect integration phase" -> "expect the integration phase".
* I think ITML skips from review #1 to review #3.

Hope this helps,

Yours,
Vlad
Artem
2015-03-25 01:38:31 UTC
Permalink
Hi Vlad

1. Usually metric learning uses supervision in one of two forms: either two
sets of similar (distance less than some predefined value u) and dissimilar
(distance greater than l) pairs, or a set of triplets (x, y, z) such that
d(x, y) < d(x, z). Though I think it's possible to generalize the former to
a case where we have control over the thresholds u and l for each pair, I'm
not sure it'd be useful.

The drawback of a classification-like y is that it induces transitivity on
the notion of similarity, which may not be a good idea.

2. I mentioned KNN because it was the first distance-based algorithm I
thought of. Also, the existing literature mostly deals with applications to
classification. One way to approach regression is to use kernel regression
(also known as the Nadaraya-Watson method) with an RBF-like kernel where the
Euclidean distance is replaced by the Mahalanobis distance (see the sketch
after this list).

I think one can, indeed, bin the target ys, learn a metric on top of these
bins, and then use any distance-based regression algorithm.

3. Each algorithm (NCA, LMNN, ITML) will have a separate pull request and
will be reviewed separately. I expect to finish the first PR (NCA) before
submitting the last one (ITML). By the end of the 10th week I might still
not have the second review completed, but that's okay: there are 2+ more
weeks to get it done.
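
(For illustration only, a minimal sketch of the kernel-regression idea from
point 2, assuming a Mahalanobis matrix A has already been learned; the
function names are hypothetical and not part of the proposal:)

import numpy as np

def mahalanobis_sq(X, x, A):
    # squared Mahalanobis distances d(x_i, x)^2 = (x_i - x)^T A (x_i - x)
    diff = np.asarray(X) - np.asarray(x)
    return np.einsum('ij,jk,ik->i', diff, A, diff)

def nadaraya_watson_predict(X_train, y_train, x, A, bandwidth=1.0):
    # RBF-like kernel with the Euclidean distance replaced by the learned
    # Mahalanobis distance, followed by a weighted average of the targets
    w = np.exp(-mahalanobis_sq(X_train, x, A) / (2 * bandwidth ** 2))
    return np.dot(w, y_train) / w.sum()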
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is the
matrix-like Y which at the moment overlaps semantically with multilabel and
multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention. The
problem here is proxying fit/predict/{set|get}_params calls to the parent,
as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not sure
where we should draw the line.
A few thoughts and questions about your proposal, on top of the excellent
The matrix-like Y links to a question I had: you say it only has -1, 1s
and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on the
metric learning methods and the relationship/differences/tradeoffs between
them.
Would metric learning be useful for regression in any way? My question was
triggered by your saying that it could be used in the KNN classifier, which
made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a (linear), as
if trained kernelized -> as if we trained a kernelized, Core contribution->
The core contribution, expect integration phase -> expect the integration
phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before we​
​release 1.0. We are already having a hard time covering in a homogeneous​
​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Mathieu Blondel
2015-03-25 02:05:13 UTC
Permalink
I think the problem with matrix-like Y is that Y would be symmetric. Thus
for doing cross-validation one would need to select both rows and columns.
This is why I suggested to add a _pairwise_y property like the _pairwise
property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
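
(For context, a rough sketch of what such a flag could look like, modelled
on the existing _pairwise property; the class and helper below are
hypothetical illustrations, not an agreed API:)

import numpy as np

class PairwiseYEstimator:
    # hypothetical estimator whose y is an (n_samples, n_samples) matrix
    @property
    def _pairwise_y(self):
        # tells cross-validation utilities that y must be indexed on both
        # axes (rows and columns), like a precomputed kernel matrix X
        return True

def _split_y(estimator, Y, train):
    # how a CV helper could slice Y when the flag is set
    if getattr(estimator, "_pairwise_y", False):
        return np.asarray(Y)[np.ix_(train, train)]
    return np.asarray(Y)[train]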

This project could potentially be developed in a separate git repo. This
would take off the pressure of having to design a perfect API. This will of
course depend on how many slots we get and how we want to prioritize them.

M.
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is the
matrix-like Y which at the moment overlaps semantically with multilabel and
multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention. The
problem here is proxying fit/predict/{set|get}_params calls to the parent,
as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not sure
where we should draw the line.
A few thoughts and questions about your proposal, on top of the excellent
The matrix-like Y links to a question I had: you say it only has -1, 1s
and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on the
metric learning methods and the relationship/differences/tradeoffs between
them.
Would metric learning be useful for regression in any way? My question was
triggered by your saying that it could be used in the KNN classifier, which
made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a (linear), as
if trained kernelized -> as if we trained a kernelized, Core contribution->
The core contribution, expect integration phase -> expect the integration
phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before we​
​release 1.0. We are already having a hard time covering in a homogeneous​
​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Gael Varoquaux
2015-03-25 06:28:19 UTC
Permalink
I think the problem with matrix-like Y is that Y would be symmetric. Thus for
doing cross-validation one would need to select both rows and columns.
Correct. Then indeed it's off limits. These are specifically the kind of
problems I would like not to have to worry about. The combination of all
the various constraints related to the different specific uses makes the
API choice interlocked and intractable. We need to release 1.0 first.
This project could potentially be developed in a separate git repo. This would
take off the pressure of having to design a perfect API. This will of course
depend on how many slots we get and how we want to prioritize them.
Indeed. I am going to be less excited about adding resources to code that
will not get merged.

Gaël
Artem
2015-03-25 20:18:52 UTC
Permalink
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.

This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.

Also, if some of the previous objections were not addressed, please repeat
them; I might have missed something.
Post by Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric. Thus
for doing cross-validation one would need to select both rows and columns.
This is why I suggested to add a _pairwise_y property like the _pairwise
property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git repo. This
would take off the pressure of having to design a perfect API. This will of
course depend on how many slots we get and how we want to prioritize them.
M.
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is the
matrix-like Y which at the moment overlaps semantically with multilabel and
multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention. The
problem here is proxying fit/predict/{set|get}_params calls to the parent,
as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not sure
where we should draw the line.
A few thoughts and questions about your proposal, on top of the excellent
The matrix-like Y links to a question I had: you say it only has -1, 1s
and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on the
metric learning methods and the relationship/differences/tradeoffs between
them.
Would metric learning be useful for regression in any way? My question
was triggered by your saying that it could be used in the KNN classifier,
which made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a (linear),
as if trained kernelized -> as if we trained a kernelized, Core
contribution-> The core contribution, expect integration phase -> expect
the integration phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before we​
​release 1.0. We are already having a hard time covering in a homogeneous​
​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Andreas Mueller
2015-03-25 20:22:01 UTC
Permalink
You can always amend your melange proposal, so there is no reason not to
submit an early version.
Post by Artem
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.
This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.
Also, if some of previous objections were not addressed, please repeat
them. ​I might have missed something.
I think the problem with matrix-like Y is that Y would be
symmetric. Thus for doing cross-validation one would need to
select both rows and columns. This is why I suggested to add a
_pairwise_y property like the _pairwise property that we use in
kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git
repo. This would take off the pressure of having to design a
perfect API. This will of course depend on how many slots we get
and how we want to prioritize them.
M.
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The
first is the matrix-like Y which at the moment overlaps
semantically with multilabel and multioutput-multiclass
(though I think it could be seen as a form of multi-target
regression
)
The second is the `estimator.metric` which would be a new
convention. The problem here is proxying
fit/predict/{set|get}_params calls to the parent, as Joel noted.
IMHO the first is slightly less scary that the second, but I’m
not sure where we should draw the line.
A few thoughts and questions about your proposal, on top of
The matrix-like Y links to a question I had: you say it only
has -1, 1s and 0s. But don’t metric learning methods support
more fine-grained (continuous) values there? Otherwise the
expressiveness gain over just having a classification y is not
that big, is it?
Overall the proposal would benefit by including a bit more
detail on the metric learning methods and the
relationship/differences/tradeoffs between them.
Would metric learning be useful for regression in any way? My
question was triggered by your saying that it could be used in
the KNN classifier, which made me wonder why not in the
regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a
(linear), as if trained kernelized -> as if we trained a
kernelized, Core contribution-> The core contribution, expect
integration phase -> expect the integration phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these
before we​ ​release 1.0. We are already having a hard time
covering in a homogeneous​ ​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with
cross-validation. I think it would be nice if we could get
some benefit by having a classification-like y, but I'm not
opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature
should be postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Michael Eickenberg
2015-03-25 22:05:16 UTC
Permalink
FWIW, although the NCA conversation on github (
https://github.com/scikit-learn/scikit-learn/issues/3213) is only an issue,
Roland (https://github.com/RolT) actually has a full implementation of NCA,
which is almost (up to a few details, such as the **kwargs, the class
inheritance and some camel casing) scikit-learn compatible:
https://github.com/RolT/NCA-python . It would be good to leverage this by
either building upon it or testing against it, after having evaluated it.

Michael
Post by Andreas Mueller
You can always amend your melange proposal, so there is no reason not to
submit an early version.
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.
This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.
Also, if some of previous objections were not addressed, please repeat
them. ​I might have missed something.
Post by Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric.
Thus for doing cross-validation one would need to select both rows and
columns. This is why I suggested to add a _pairwise_y property like the
_pairwise property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git repo.
This would take off the pressure of having to design a perfect API. This
will of course depend on how many slots we get and how we want to
prioritize them.
M.
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is
the matrix-like Y which at the moment overlaps semantically with multilabel
and multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention.
The problem here is proxying fit/predict/{set|get}_params calls to the
parent, as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not sure
where we should draw the line.
A few thoughts and questions about your proposal, on top of the
The matrix-like Y links to a question I had: you say it only has -1, 1s
and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on the
metric learning methods and the relationship/differences/tradeoffs between
them.
Would metric learning be useful for regression in any way? My question
was triggered by your saying that it could be used in the KNN classifier,
which made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a (linear),
as if trained kernelized -> as if we trained a kernelized, Core
contribution-> The core contribution, expect integration phase -> expect
the integration phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before
we​ ​release 1.0. We are already having a hard time covering in a
homogeneous​ ​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Artem
2015-03-25 22:08:35 UTC
Permalink
Yes, I saw the repo. I didn't know, though, that it's almost complete;
thanks for checking!

On Thu, Mar 26, 2015 at 1:05 AM, Michael Eickenberg <
Post by Michael Eickenberg
FWIW, although the NCA conversation on github (
https://github.com/scikit-learn/scikit-learn/issues/3213) is only an
issue, Roland (https://github.com/RolT) actually has a full
implementation of NCA, which is almost (up to a few details, such as the
**kwargs, the class inheritance and some camel casing) scikit-learn
compatible: https://github.com/RolT/NCA-python . It would be good to
leverage this by either building upon it or testing against it, after
having evaluated it.
Michael
Post by Andreas Mueller
You can always amend your melange proposal, so there is no reason not to
submit an early version.
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.
This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.
Also, if some of previous objections were not addressed, please repeat
them. ​I might have missed something.
Post by Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric.
Thus for doing cross-validation one would need to select both rows and
columns. This is why I suggested to add a _pairwise_y property like the
_pairwise property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git repo.
This would take off the pressure of having to design a perfect API. This
will of course depend on how many slots we get and how we want to
prioritize them.
M.
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is
the matrix-like Y which at the moment overlaps semantically with multilabel
and multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention.
The problem here is proxying fit/predict/{set|get}_params calls to the
parent, as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not sure
where we should draw the line.
A few thoughts and questions about your proposal, on top of the
The matrix-like Y links to a question I had: you say it only has -1, 1s
and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on
the metric learning methods and the relationship/differences/tradeoffs
between them.
Would metric learning be useful for regression in any way? My question
was triggered by your saying that it could be used in the KNN classifier,
which made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a (linear),
as if trained kernelized -> as if we trained a kernelized, Core
contribution-> The core contribution, expect integration phase -> expect
the integration phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before
we​ ​release 1.0. We are already having a hard time covering in a
homogeneous​ ​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Michael Eickenberg
2015-03-25 22:12:56 UTC
Permalink
I do not know the exact state of the algorithm, but the author was working
on sklearn compatibility at a sklearn sprint last summer. It seemed like
the algorithmic side had been pretty much taken care of, but this needs to
be checked.

Michael
Post by Artem
Yes, I saw the repo. Didn't know, though, that it's almost completed,
thanks for checking!
On Thu, Mar 26, 2015 at 1:05 AM, Michael Eickenberg <
Post by Michael Eickenberg
FWIW, although the NCA conversation on github (
https://github.com/scikit-learn/scikit-learn/issues/3213) is only an
issue, Roland (https://github.com/RolT) actually has a full
implementation of NCA, which is almost (up to a few details, such as the
**kwargs, the class inheritance and some camel casing) scikit-learn
compatible: https://github.com/RolT/NCA-python . It would be good to
leverage this by either building upon it or testing against it, after
having evaluated it.
Michael
Post by Andreas Mueller
You can always amend your melange proposal, so there is no reason not
to submit an early version.
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.
This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.
Also, if some of previous objections were not addressed, please
repeat them. ​I might have missed something.
Post by Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric.
Thus for doing cross-validation one would need to select both rows and
columns. This is why I suggested to add a _pairwise_y property like the
_pairwise property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git repo.
This would take off the pressure of having to design a perfect API. This
will of course depend on how many slots we get and how we want to
prioritize them.
M.
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is
the matrix-like Y which at the moment overlaps semantically with multilabel
and multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention.
The problem here is proxying fit/predict/{set|get}_params calls to the
parent, as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not
sure where we should draw the line.
A few thoughts and questions about your proposal, on top of the
The matrix-like Y links to a question I had: you say it only has -1,
1s and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on
the metric learning methods and the relationship/differences/tradeoffs
between them.
Would metric learning be useful for regression in any way? My question
was triggered by your saying that it could be used in the KNN classifier,
which made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a
(linear), as if trained kernelized -> as if we trained a kernelized, Core
contribution-> The core contribution, expect integration phase -> expect
the integration phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before
we​ ​release 1.0. We are already having a hard time covering in a
homogeneous​ ​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Mathieu Blondel
2015-03-26 00:20:27 UTC
Permalink
Each of them is a transformer that utilizes y during fit, where y is a
usual vector of labels of training samples, just like in case of
classification.

I am actually confused by this. How are you going to encode the
similarities / dissimilarities between samples if y is a vector?
Another possible application is getting a similarity matrix according to
the metric learned. Thus, there will be 2 transformers for each algorithm:
one maps input data from the original space into a linearly transformed
one, and the other maps input data into a square similarity matrix, that
can be used for clustering, for example.

Please give a code example in your proposal to see how this would look.

M.
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.
This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.
Also, if some of previous objections were not addressed, please repeat
them. ​I might have missed something.
Post by Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric. Thus
for doing cross-validation one would need to select both rows and columns.
This is why I suggested to add a _pairwise_y property like the _pairwise
property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git repo. This
would take off the pressure of having to design a perfect API. This will of
course depend on how many slots we get and how we want to prioritize them.
M.
Post by Vlad Niculae
Hi Artem, hi everybody,
There were two API issues and I think both need thought. The first is
the matrix-like Y which at the moment overlaps semantically with multilabel
and multioutput-multiclass (though I think it could be seen as a form of
multi-target regression
)
The second is the `estimator.metric` which would be a new convention.
The problem here is proxying fit/predict/{set|get}_params calls to the
parent, as Joel noted.
IMHO the first is slightly less scary that the second, but I’m not sure
where we should draw the line.
A few thoughts and questions about your proposal, on top of the
The matrix-like Y links to a question I had: you say it only has -1, 1s
and 0s. But don’t metric learning methods support more fine-grained
(continuous) values there? Otherwise the expressiveness gain over just
having a classification y is not that big, is it?
Overall the proposal would benefit by including a bit more detail on the
metric learning methods and the relationship/differences/tradeoffs between
them.
Would metric learning be useful for regression in any way? My question
was triggered by your saying that it could be used in the KNN classifier,
which made me wonder why not in the regressor. E.g. one could bin the `y`.
* what does SWE stand for?
* missing articles: equivalent to (linear) -> equivalent to a (linear),
as if trained kernelized -> as if we trained a kernelized, Core
contribution-> The core contribution, expect integration phase -> expect
the integration phase.
* I think ITML skips from review #1 to review #3.
Hope this helps,
Yours,
Vlad
Post by Artem
You mean matrix-like y?
Gael said
FWIW It'll require some changes to cross-validation routines.​
I'd rather we try not to add new needs and usecases to these before
we​ ​release 1.0. We are already having a hard time covering in a
homogeneous​ ​way all the possible options.​
Post by Artem
​Then Andreas
​1.2: matrix-like Y should actually be fine with cross-validation. I
think it would be nice if we could get some benefit by having a
classification-like y, but I'm not opposed to also allowing matrix Y.
Post by Artem
So if we don't want to alter API, I suppose this feature should be
postponed until 1.0?​​
Post by Artem
On Wed, Mar 25, 2015 at 1:44 AM, Olivier Grisel <
I also share Gael's concerns with respect to extending our API in yet
another direction at a time where we are trying to focus on ironing
out consistency issues...
--
Olivier
Artem
2015-03-26 07:50:52 UTC
Permalink
Sorry, apparently I clicked reply and my previous message went to Mathieu
only. Repeating it here:

In the case of a vector y there's no other way but to assume transitivity of
similarity. That is not fully general, but it should work in a classification
setting. After all, many of these methods are designed to aid KNN, so thanks
to transitivity we can assign the same class to similar objects and,
conversely, infer similarity / dissimilarity from the class labels. This is
the way LMNN is supposed to work, for example, and the authors of other
methods mention this approach too.
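
To make that assumption concrete, here is a tiny illustration (plain NumPy,
not the proposed API): two samples are taken to be similar iff they share a
class label.

    import numpy as np

    # Pairwise similarity labels implied by a class vector y under the
    # transitivity assumption: +1 if same class, -1 otherwise.
    y = np.array([0, 0, 1, 1, 2])
    S = np.where(y[:, None] == y[None, :], 1, -1)   # shape [n_samples, n_samples]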

Added an example to the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module#api>.
Names are a bit awkward, but I couldn't think of better ones.
Post by Mathieu Blondel
Each of them is a transformer that utilizes y during fit, where y is a
usual vector of labels of training samples, just like in case of
classification.
I am actually confused by this. How are you going to encode the
similarities / dissimilarities between samples if y is a vector?
Another possible application is getting a similarity matrix according to
the metric learned. Thus, there will be 2 transformers for each algorithm:
one maps input data from the original space into a linearly transformed
one, and the other maps input data into a square similarity matrix, that
can be used for clustering, for example.
Please give a code example in your proposal to see how this would look like.
M.
​Ok, so I removed matrix y from the proposal
<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module>.
Therefore I also shortened the first iteration by one week, since no
changes to the current code are needed.
This allowed me to extend the last iteration by one week, which makes
kernel ITML a bit more probable.
I'm going to send this proposal to melange tomorrow, so if you have
comments — please reply.
Also, if some of the previous objections were not addressed, please repeat
them. I might have missed something.
Post by Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric.
Thus for doing cross-validation one would need to select both rows and
columns. This is why I suggested to add a _pairwise_y property like the
_pairwise property that we use in kernel methods, e.g.,
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/kernel_pca.py
This project could potentially be developed in a separate git repo. This
would take off the pressure of having to design a perfect API. This will of
course depend on how many slots we get and how we want to prioritize them.
M.
Mathieu Blondel
2015-03-26 08:39:07 UTC
Permalink
- Spectral clustering uses similarities rather than distances and needs
affinity="precomputed" (otherwise, it assumes that X is [n_samples,
n_features]); see the sketch below.
- Instead of duplicating each class, you could create a generic transformer
that outputs a similarity / distance matrix from X.

M.
Artem
2015-03-26 08:49:21 UTC
Permalink
1. Right, I forgot to add that parameter. Well, I can apply an RBF kernel to
get a similarity matrix from a distance matrix inside transform (see the
sketch at the end of this message).

2. The usual transformer returns neither a distance nor a similarity, but
transforms the input space so that the plain Euclidean distance acts like the
learned Mahalanobis one. I don't see an easy way to combine these 2 modes.

Actually, I think the *Similarity classes could be moved out into their own
module, like `similarity_learning`.
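
To spell out point 1, the conversion I have in mind is just an RBF applied to
the distances computed under the learned metric (the names below are
placeholders, not the proposed API):

    import numpy as np

    def distances_to_similarities(D, gamma=1.0):
        # RBF conversion: distance 0 -> similarity 1, large distance -> ~0.
        # D stands for the pairwise distance matrix under the learned metric.
        return np.exp(-gamma * D ** 2)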
Mathieu Blondel
2015-03-26 09:08:36 UTC
Permalink
Post by Artem
1. Right, forgot to add that parameter. Well, I can apply an RBF kernel to
get a similarity matrix from a distance matrix inside transform.
2. Usual transformer returns neither distance, nor similarity, but
transforms the input space so that usual Euclidean distance acts like the
learned Mahalanobis.
I'd really try to avoid duplicating all the classes. As you said, the Euclidean
distance can be used on the transformed data, so we can get a similarity
matrix in just two lines:

X_transformed = LMNN().fit_transform(X, y)
S = -euclidean_distances(X_transformed)

The only benefit I see of being able to transform to a similarity matrix is
for pipelines. This can be done, as I said, using a generic transformer X ->
S. However, I am not completely sure this is even needed, since all our
algorithms work on X of shape [n_samples, n_features] by default.

M.
Artem
2015-03-26 09:29:54 UTC
Permalink
Oops, missed "Reply all" once again. Copying the message

Yes, the only need for such similarity learners is to use them in a
pipeline. It's especially convenient if one wants to do non-linear metric
learning using the kernel PCA trick: then it'd be just another step in the
pipeline, as in the sketch below.
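
A hypothetical sketch (LMNN here would be the metric-learning transformer this
proposal adds, so a trivial identity placeholder is used to keep the snippet
self-contained and runnable):

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.decomposition import KernelPCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    class IdentityMetric(BaseEstimator, TransformerMixin):
        # Placeholder for the proposed LMNN / ITML transformer: fit() would
        # learn the Mahalanobis matrix, transform() would apply its Cholesky
        # factor to the data.
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X

    pipe = Pipeline([
        ("kpca", KernelPCA(n_components=5, kernel="rbf")),
        ("metric", IdentityMetric()),               # stand-in for LMNN / ITML
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])
    # pipe.fit(X_train, y_train); pipe.predict(X_test)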

What do you mean by a generic transformer? In order to be usable in a
pipeline, it needs to be fit-able. Do you mean a wrapper like
OneVsRestClassifier?

The reason I included similarities is that I want to bring some supervision
into clustering by introducing a meaningful metric. AFAIK, at the moment only
`AgglomerativeClustering` works well with a custom metric, while Spectral
Clustering and Affinity Propagation can work with an [n_samples, n_samples]
affinity matrix.
Mathieu Blondel
2015-03-26 09:36:33 UTC
Permalink
Something like this:

    from sklearn.base import TransformerMixin
    from sklearn.metrics.pairwise import euclidean_distances

    class SimilarityTransformer(TransformerMixin):
        def fit(self, X, y=None):
            self.X_ = X          # remember the training samples
            return self

        def transform(self, X):
            # negated distances to the training samples act as similarities
            return -euclidean_distances(X, self.X_)
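
For what it's worth, such a transformer could then feed a clustering estimator
that accepts a precomputed affinity matrix, e.g. (just a sketch; Affinity
Propagation treats larger values as more similar, so negated distances are an
acceptable input):

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    X = np.random.RandomState(0).randn(40, 3)
    S = SimilarityTransformer().fit(X).transform(X)   # [n_samples, n_samples]
    labels = AffinityPropagation(affinity="precomputed").fit_predict(S)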
Artem
2015-03-26 09:42:22 UTC
Permalink
Hm, but similarity-based clustering works with inter-data similarities,
doesn't it? The result's shape would be [n_samples_in_transform,
n_samples_in_train], which is not what we want.
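
For instance (a toy illustration of the shape issue, using the same negated
Euclidean distances as in the transformer sketched above):

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances

    X_train = np.random.RandomState(0).randn(100, 5)
    X_new = np.random.RandomState(1).randn(10, 5)
    S = -euclidean_distances(X_new, X_train)
    print(S.shape)   # (10, 100) -- rectangular, not the square matrix clustering expects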
Gael Varoquaux
2015-03-25 06:25:30 UTC
Permalink
Post by Vlad Niculae
There were two API issues and I think both need thought. The first is the matrix-like Y which at the moment overlaps semantically with multilabel and multioutput-multiclass (though I think it could be seen as a form of multi-target regression…)
I would see it as multi-target regression and not worry too much about
the overlap. We will have to trust the user a bit to know what he is
doing, but I don't have the feeling that it adds an inconsistency.
Post by Vlad Niculae
The second is the `estimator.metric` which would be a new convention.
The problem here is proxying fit/predict/{set|get}_params calls to the
parent, as Joel noted.
That one is off limits for me.

G
Gael Varoquaux
2015-03-25 06:23:42 UTC
Permalink
Post by Artem
You mean matrix-like y?
Matrix-like y (i.e. y 2D: [n_samples, n_features]) is already covered in our
API, so I see no problem with it.