Discussion:
[Scikit-learn-general] Specify rather than learn sparse coding dictionary?
Ian Goodfellow
2011-12-05 21:42:18 UTC
Permalink
I'm interested in doing sparse coding with scikits.learn. It looks
like the way to do this is with
sklearn.decomposition.MiniBatchDictionaryLearning. Am I correct about
that?

If so:
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
I thought I could do this by passing the desired dictionary to the
dict_init parameter of the __init__
method and then never calling a fit method. However, when I attempted
this, the transform method
seems broken. It complains there is no components_ field. Patching it
to work without having called
fit doesn't seem trivial; simply filling in the components_ field with
the desired dictionary doesn't work
either. Is there a way to do what I want?

Thanks,
Ian
Alexandre Gramfort
2011-12-05 21:50:58 UTC
Permalink
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.

Alex
Olivier Grisel
2011-12-05 22:03:38 UTC
Permalink
Post by Alexandre Gramfort
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
There are slightly higher level utility functions too here:

http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode_parallel.html
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Ian Goodfellow
2011-12-05 22:07:52 UTC
Permalink
On Mon, Dec 5, 2011 at 4:50 PM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
I do not understand. I have the dictionary already, so what is being estimated?
Post by Alexandre Gramfort
Alex
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2011-12-05 22:14:00 UTC
Permalink
Post by Ian Goodfellow
On Mon, Dec 5, 2011 at 4:50 PM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
I do not understand. I have the dictionary already, so what is being estimated?
The loadings.

- original data matrix has shape (n_samples, n_features)
- dictionary has shape (n_components, n_features)
- loadings matrix has shape (n_samples, n_components)

The loadings are the results of the sparse encoded samples in the
dictionary space.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2011-12-05 22:14:40 UTC
Permalink
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.

X = dico x coefs

Alex
Andreas Mueller
2011-12-05 22:21:01 UTC
Permalink
Post by Alexandre Gramfort
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.
X = dico x coefs
I think there is a little misunderstanding here.
As I understand Ian, he has estimated a dictionary on some dataset
and wants to use this dictionary to encode some "new" data.

You do not need to estimate anything to get the "model".

What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.

Hope I got it right and this clears more things up than confuses.

Cheers,
Andy
Olivier Grisel
2011-12-05 22:31:55 UTC
Permalink
Post by Andreas Mueller
Post by Alexandre Gramfort
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.
X = dico x coefs
I think there is a little misunderstanding here.
As I understand Ian, he has estimated a dictionary on some dataset
and wants to use this dictionary to encode some "new" data.
You do not need to estimate anything to get the "model".
What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
Yes.
Post by Andreas Mueller
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.
Yes and this method is just a wrapper for what Alexandre is explaining
(about LARS and OMP). Both replies are consistent.

Sparse coding is not simple linear projection (as in the transform
method of PCA for instance) as there is a penalty on the non-zero
loadings. Hence there is some "estimator fitting" that still occur at
the "transform" time. Unless of course who use fixed arbitrary
thresholding that might be enough for some tasks (e.g. sparse feature
extraction for image classification).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Ian Goodfellow
2011-12-05 22:50:46 UTC
Permalink
Post by Olivier Grisel
Post by Andreas Mueller
Post by Alexandre Gramfort
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.
X = dico x coefs
I think there is a little misunderstanding here.
As I understand Ian, he has estimated a dictionary on some dataset
and wants to use this dictionary to encode some "new" data.
You do not need to estimate anything to get the "model".
What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
Yes.
Post by Andreas Mueller
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.
Yes and this method is just a wrapper for what Alexandre is explaining
(about LARS and OMP). Both replies are consistent.
Sparse coding is not simple linear projection (as in the transform
method of PCA for instance) as there is a penalty on the non-zero
loadings. Hence there is some "estimator fitting" that still occur at
the "transform" time. Unless of course who use fixed arbitrary
thresholding that might be enough for some tasks (e.g. sparse feature
extraction for image classification).
I think I was mostly confused by the terminology-- I don't consider the code
to be part of a sparse coding model, nor to be estimated (I am aware that
sparse coding involves iterative optimization but I don't consider the optimizer
to be solving an estimation problem).

I don't understand exactly what interface Alexandre is saying to use.

To use the sparse_encode interface, should I pass a dictionary of shape
(num_data_features, num_code_elements) for X and a data matrix of shape
(num_data_features, num_examples) for Y?

I have tried doing that, but for alpha = 1. or alpha = 0.1 it returns
a matrix of
all zeros, and for alpha = .01 it returns a code with NaNs in it.
Post by Olivier Grisel
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
David Warde-Farley
2011-12-05 23:24:59 UTC
Permalink
Post by Ian Goodfellow
I think I was mostly confused by the terminology-- I don't consider the code
to be part of a sparse coding model, nor to be estimated (I am aware that
sparse coding involves iterative optimization but I don't consider the optimizer
to be solving an estimation problem).
I don't understand exactly what interface Alexandre is saying to use.
To use the sparse_encode interface, should I pass a dictionary of shape
(num_data_features, num_code_elements) for X and a data matrix of shape
(num_data_features, num_examples) for Y?
I have tried doing that, but for alpha = 1. or alpha = 0.1 it returns
a matrix of
all zeros, and for alpha = .01 it returns a code with NaNs in it.
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?

David
Alexandre Gramfort
2011-12-06 08:04:22 UTC
Permalink
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?

it should ultimately use the duality gap to stop the iterations but
there might be a corner case …

let me know

Alex
David Warde-Farley
2011-12-06 18:34:21 UTC
Permalink
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)

In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))

In [36]: signal = rng.normal(size=100) / 1000

In [37]: from sklearn.linear_model import Lasso

In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)

In [39]: lasso.fit(dictionary, signal)
Out[39]:
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
normalize=False, precompute='auto', tol=1e-08)

In [40]: max(abs(lasso.coef_))
Out[40]: 0.0

In [41]: from pylearn2.optimization.feature_sign import feature_sign_search

In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)

In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018

And I'm pretty sure the latter result is the right one, since

In [45]: def gradient(coefs):
....: gram = np.dot(dictionary.T, dictionary)
....: corr = np.dot(dictionary.T, signal)
....: return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
....:


and

In [51]: max(abs(gradient(coef)[coef == 0]))
Out[51]: 9.9849794040467979e-05

In [52]: max(abs(gradient(lasso.coef_)))
Out[52]: 0.0065556004140218012

meaning that whatever that maximum gradient coefficient is should have been
activated, since the gradient from the linear portion should override the
sparsity penalty.
Olivier Grisel
2011-12-06 19:43:06 UTC
Permalink
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Ian Goodfellow
2011-12-06 20:17:47 UTC
Permalink
ok, decreasing alpha by a factor of n_samples (5000 in my case) makes
sparse_encode behave much more reasonably.

However I still have two bugs to report:

1. The default algorithm returns this error:

Traceback (most recent call last):
File "s3c_sparsity_scale_plot.py", line 86, in <module>
HS = sparse_encode( model.W.get_value(), X.T, alpha = 1./5000.).T
File "/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/decomposition/dict_learning.py",
line 117, in sparse_encode
method='lasso')
File "/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/linear_model/least_angle.py",
line 249, in lars_path
arrayfuncs.cholesky_delete(L[:n_active, :n_active], idx)
File "arrayfuncs.pyx", line 104, in
sklearn.utils.arrayfuncs.cholesky_delete
(sklearn/utils/arrayfuncs.c:1516)
TypeError: only length-1 arrays can be converted to Python scalars


2. The lasso_lars algorithm tells me I am not using enough iterations,
but as far as I can tell the sparse_encode interface does not expose
any way for me to increase the number of iterations that cd uses.

/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/linear_model/coordinate_descent.py:173:
UserWarning: Objective did not converge, you might want to increase
the number of iterations
warnings.warn('Objective did not converge, you might want'
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Alexandre Gramfort
2011-12-06 21:25:47 UTC
Permalink
regarding the scaling by n_samples using estimators I am convinced the right
thing to do cf. my current PR to do this also on SVM models

regarding the convergence pb and potential error, can you put a gist on github
to make the pb more easily reproducible.

Alex
Post by Ian Goodfellow
ok, decreasing alpha by a factor of n_samples (5000 in my case) makes
sparse_encode behave much more reasonably.
 File "s3c_sparsity_scale_plot.py", line 86, in <module>
   HS = sparse_encode( model.W.get_value(), X.T, alpha = 1./5000.).T
 File "/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/decomposition/dict_learning.py",
line 117, in sparse_encode
   method='lasso')
 File "/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/linear_model/least_angle.py",
line 249, in lars_path
   arrayfuncs.cholesky_delete(L[:n_active, :n_active], idx)
 File "arrayfuncs.pyx", line 104, in
sklearn.utils.arrayfuncs.cholesky_delete
(sklearn/utils/arrayfuncs.c:1516)
TypeError: only length-1 arrays can be converted to Python scalars
2. The lasso_lars algorithm tells me I am not using enough iterations,
but as far as I can tell the sparse_encode interface does not expose
any way for me to increase the number of iterations that cd uses.
UserWarning: Objective did not converge, you might want to increase
the number of iterations
 warnings.warn('Objective did not converge, you might want'
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Ian Goodfellow
2011-12-06 21:52:04 UTC
Permalink
On Tue, Dec 6, 2011 at 4:25 PM, Alexandre Gramfort
Post by Alexandre Gramfort
regarding the scaling by n_samples using estimators I am convinced the right
thing to do cf. my current PR to do this also on SVM models
I am not going to get involved in the discussion of whether to
normalize the coefficient or not, but in all cases the objective
function should be clearly documented.
Post by Alexandre Gramfort
regarding the convergence pb and potential error, can you put a gist on github
to make the pb more easily reproducible.
Problem #1 seems to be dependent on the dataset and/or weights. Is
there a good way to transfer some .npy files to you?

Problem #2 is just part of the definition of the interface; I can't
really write code to demonstrate that the interface is missing a
feature.
Post by Alexandre Gramfort
Alex
Post by Ian Goodfellow
ok, decreasing alpha by a factor of n_samples (5000 in my case) makes
sparse_encode behave much more reasonably.
 File "s3c_sparsity_scale_plot.py", line 86, in <module>
   HS = sparse_encode( model.W.get_value(), X.T, alpha = 1./5000.).T
 File "/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/decomposition/dict_learning.py",
line 117, in sparse_encode
   method='lasso')
 File "/u/goodfeli/python_modules/lib/python2.7/site-packages/scikit_learn-0.9-py2.7-linux-x86_64.egg/sklearn/linear_model/least_angle.py",
line 249, in lars_path
   arrayfuncs.cholesky_delete(L[:n_active, :n_active], idx)
 File "arrayfuncs.pyx", line 104, in
sklearn.utils.arrayfuncs.cholesky_delete
(sklearn/utils/arrayfuncs.c:1516)
TypeError: only length-1 arrays can be converted to Python scalars
2. The lasso_lars algorithm tells me I am not using enough iterations,
but as far as I can tell the sparse_encode interface does not expose
any way for me to increase the number of iterations that cd uses.
UserWarning: Objective did not converge, you might want to increase
the number of iterations
 warnings.warn('Objective did not converge, you might want'
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Alexandre Gramfort
2011-12-06 22:00:08 UTC
Permalink
Post by Ian Goodfellow
I am not going to get involved in the discussion of whether to
normalize the coefficient or not, but in all cases the objective
function should be clearly documented.
+1

if it's not done before the NIPS sprint that will be an easy first task

Alex
David Warde-Farley
2011-12-07 00:43:26 UTC
Permalink
Post by Alexandre Gramfort
regarding the scaling by n_samples using estimators I am convinced the right
thing to do cf. my current PR to do this also on SVM models
I think that scaling by n_samples makes sense in the supervised learning
context (we often do the equivalent thing where we take the mean, rather than
the sum, over the unregularized training objective, making the regularization
invariant to the size of the training set), however there is a disconnect
between the dictionary learning notion of n_samples and the supervised
estimator notion of n_samples, and the conflation of these two because one
can be implemented by the other.

To be precise, (and I hope I got this right lest I confuse things further), a
sparse coding problem with K different training examples and L different
input features and M sparse components corresponds to K independent lasso
problems with L training examples each and M input features.

In this case, scaling the penalty by the Lasso "n_samples" corresponds to
scaling by the sparse coding "n_features", which I think you'll agree is a
bit weird.

David
Gael Varoquaux
2011-12-07 04:02:46 UTC
Permalink
Post by David Warde-Farley
I think that scaling by n_samples makes sense in the supervised learning
context (we often do the equivalent thing where we take the mean, rather than
the sum, over the unregularized training objective, making the regularization
invariant to the size of the training set), however there is a disconnect
between the dictionary learning notion of n_samples and the supervised
estimator notion of n_samples, and the conflation of these two because one
can be implemented by the other.
+1. We may need to have a different convention in Lasso and
sparse_encode. This should be well documented.

G
Olivier Grisel
2011-12-07 08:40:50 UTC
Permalink
Post by Gael Varoquaux
Post by David Warde-Farley
I think that scaling by n_samples makes sense in the supervised learning
context (we often do the equivalent thing where we take the mean, rather than
the sum, over the unregularized training objective, making the regularization
invariant to the size of the training set), however there is a disconnect
between the dictionary learning notion of n_samples and the supervised
estimator notion of n_samples, and the conflation of these two because one
can be implemented by the other.
+1. We may need to have a different convention in Lasso and
sparse_encode. This should be well documented.
+1 as Vlad said we just need to divide alpha by the size of the
dictionary before calling Lasso from the sparse_encode function to
nullify the normalization for this specific case only.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-12-13 09:31:06 UTC
Permalink
On Wed, Dec 7, 2011 at 9:43 AM, David Warde-Farley
Post by David Warde-Farley
To be precise, (and I hope I got this right lest I confuse things further), a
sparse coding problem with K different training examples and L different
input features and M sparse components corresponds to K independent lasso
problems with L training examples each and M input features.
In the doc, it would probably be easier to understand if we make an
analogy with regression.

A regression problem consists in learning w given X and y:

argmin_w ||y - X^T w||^2

w: [n_features, ]
X: [n_samples, n_features]
y: [n_samples, ]

A sparse coding problem consists in learning alpha given D and x:

argmin_alpha ||x - D^T alpha||^2

alpha: [n_components, ]
D: [n_features, n_components]
x: [n_features, ]

Therefore, to encode the entire dataset X, we need to to solve
n_samples regression problems each with n_features instances and
n_components features.
When using the square loss, each problem is independent (not when
using the hinge loss).

Mathieu

David Warde-Farley
2011-12-06 21:27:16 UTC
Permalink
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
Seems like there's an added factor of 2 in there as well,
though this is a little more standard:

In [94]: lasso = Lasso(alpha=0.0001 / (2 * dictionary.shape[0]),
max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)

In [95]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [96]: allclose(lasso.coef_, coef, atol=1e-7)
Out[96]: True

I think you're right that the precise cost function definitely ought to be
documented in the front-facing classes rather than just the low-level Cython
routines.

I also think that scaling the way Lasso/ElasticNet does in the context of
sparse coding may be very confusing, since in sparse coding it corresponds
not to a number of training samples in a regression problem but to the number
of input dimensions.

The docstring of sparse_encode is quite confusing in that X, the dictionary,
says "n_samples, n_components". The number of samples (in the context of
sparse coding) should have no influence over the shape of the dictionary;
this seems to have leaked over from the Lasso documentation.

The shape and mathematical definition of cov doesn't make much sense to me
given this change, though (or to begin with, for that matter): In the case of
a single problem, the desired covariance is X^T y, with y a column vector,
yielding another column vector of (n_components, 1). So the shape, if you
have multiple examples you're precomputing for, should end up being
(n_components, n_samples), and given the shape of Y that would be achieved by
X^T Y^T.

David
Olivier Grisel
2011-12-06 21:44:56 UTC
Permalink
Post by David Warde-Farley
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
Seems like there's an added factor of 2 in there as well,
In [94]: lasso = Lasso(alpha=0.0001 / (2 * dictionary.shape[0]),
max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
In [95]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [96]: allclose(lasso.coef_, coef, atol=1e-7)
Out[96]: True
I think you're right that the precise cost function definitely ought to be
documented in the front-facing classes rather than just the low-level Cython
routines.
I also think that scaling the way Lasso/ElasticNet does in the context of
sparse coding may be very confusing, since in sparse coding it corresponds
not to a number of training samples in a regression problem but to the number
of input dimensions.
The docstring of sparse_encode is quite confusing in that X, the dictionary,
says "n_samples, n_components". The number of samples (in the context of
sparse coding) should have no influence over the shape of the dictionary;
this seems to have leaked over from the Lasso documentation.
The shape and mathematical definition of cov doesn't make much sense to me
given this change, though (or to begin with, for that matter): In the case of
a single problem, the desired covariance is X^T y, with y a column vector,
yielding another column vector of (n_components, 1). So the shape, if you
have multiple examples you're precomputing for, should end up being
(n_components, n_samples), and given the shape of Y that would be achieved by
X^T Y^T.
Yes to me we should have:

encoded_data = sparse_encode(dictionary, original_data, **params)

- original_data with shape (n_samples, n_features)
- dictionary with shape (n_components, n_features)
- encoded_data with shape (n_samples, n_components)

and original_data ~= np.dot(encoded_data, dictionary)

This is the most intuitive API. Internally this will call Lasso in an
unusual manner as we pass the dictionary as input data and the
original_data array as target values.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2011-12-06 21:46:52 UTC
Permalink
I do confirm that Lasso and LassoLars both minimize

1/2n || y - Xw || + alpha ||w||_1

and that the n should not be present in the sparse coding context.

it means :

http://scikit-learn.org/stable/modules/linear_model.html#lasso

is not correct. I don't know if this also affects the doc of the SGD.
I would also vote for writing the cost function minimized in the Lasso
(etc.) docstrings.

regarding the shapes using sparse_encode I'll let Vlad comment.

Alex

On Tue, Dec 6, 2011 at 10:27 PM, David Warde-Farley
Post by David Warde-Farley
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
Seems like there's an added factor of 2 in there as well,
In [94]: lasso = Lasso(alpha=0.0001 / (2 * dictionary.shape[0]),
max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
In [95]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [96]: allclose(lasso.coef_, coef, atol=1e-7)
Out[96]: True
I think you're right that the precise cost function definitely ought to be
documented in the front-facing classes rather than just the low-level Cython
routines.
I also think that scaling the way Lasso/ElasticNet does in the context of
sparse coding may be very confusing, since in sparse coding it corresponds
not to a number of training samples in a regression problem but to the number
of input dimensions.
The docstring of sparse_encode is quite confusing in that X, the dictionary,
says "n_samples, n_components". The number of samples (in the context of
sparse coding) should have no influence over the shape of the dictionary;
this seems to have leaked over from the Lasso documentation.
The shape and mathematical definition of cov doesn't make much sense to me
given this change, though (or to begin with, for that matter): In the case of
a single problem, the desired covariance is X^T y, with y a column vector,
yielding another column vector of (n_components, 1). So the shape, if you
have multiple examples you're precomputing for, should end up being
(n_components, n_samples), and given the shape of Y that would be achieved by
X^T Y^T.
David
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Vlad Niculae
2011-12-06 21:53:02 UTC
Permalink
On Tue, Dec 6, 2011 at 11:46 PM, Alexandre Gramfort
Post by Alexandre Gramfort
I do confirm that Lasso and LassoLars both minimize
1/2n || y - Xw || + alpha ||w||_1
and that the n should not be present in the sparse coding context.
http://scikit-learn.org/stable/modules/linear_model.html#lasso
is not correct. I don't know if this also affects the doc of the SGD.
I would also vote for writing the cost function minimized in the Lasso
(etc.) docstrings.
regarding the shapes using sparse_encode I'll let Vlad comment.
At first sight I agree with Olivier re: the shapes. The alpha issue is
a semantics one and we should simply multiply it back by the
appropriate dimension in order to expose a clear interface.

How about I address these issues in the pull request I opened earlier today?

I just remember discussing with Alex that the alpha values for the
MiniBatch versions of the algorithms didn't correspond to the batch
versions. I now realize that this might be the reason; the scaling for
a mini-batch was different than for the full batch.

Vlad
Post by Alexandre Gramfort
Alex
On Tue, Dec 6, 2011 at 10:27 PM, David Warde-Farley
Post by David Warde-Farley
Post by Olivier Grisel
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
can you give us a sample X and y that shows the pb?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [39]: lasso.fit(dictionary, signal)
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
  normalize=False, precompute='auto', tol=1e-08)
In [40]: max(abs(lasso.coef_))
Out[40]: 0.0
In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018
And I'm pretty sure the latter result is the right one, since
  ....:     gram = np.dot(dictionary.T, dictionary)
  ....:     corr = np.dot(dictionary.T, signal)
  ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 *
np.sign(coefs)
Actually, alpha in scikit-learn is multiplied by n_samples. I agree
this is misleading and not documented in the docstring.
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
max(abs(lasso.coef_))
0.0027627270397484554
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
max(abs(gradient(lasso.coef_)))
0.00019687294269977963
Seems like there's an added factor of 2 in there as well,
In [94]: lasso = Lasso(alpha=0.0001 / (2 * dictionary.shape[0]),
max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
In [95]: coef = feature_sign_search(dictionary, signal, 0.0001)
In [96]: allclose(lasso.coef_, coef, atol=1e-7)
Out[96]: True
I think you're right that the precise cost function definitely ought to be
documented in the front-facing classes rather than just the low-level Cython
routines.
I also think that scaling the way Lasso/ElasticNet does in the context of
sparse coding may be very confusing, since in sparse coding it corresponds
not to a number of training samples in a regression problem but to the number
of input dimensions.
The docstring of sparse_encode is quite confusing in that X, the dictionary,
says "n_samples, n_components". The number of samples (in the context of
sparse coding) should have no influence over the shape of the dictionary;
this seems to have leaked over from the Lasso documentation.
The shape and mathematical definition of cov doesn't make much sense to me
given this change, though (or to begin with, for that matter): In the case of
a single problem, the desired covariance is X^T y, with y a column vector,
yielding another column vector of (n_components, 1). So the shape, if you
have multiple examples you're precomputing for, should end up being
(n_components, n_samples), and given the shape of Y that would be achieved by
X^T Y^T.
David
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2011-12-07 08:35:05 UTC
Permalink
Post by Vlad Niculae
On Tue, Dec 6, 2011 at 11:46 PM, Alexandre Gramfort
Post by Alexandre Gramfort
I do confirm that Lasso and LassoLars both minimize
1/2n || y - Xw || + alpha ||w||_1
and that the n should not be present in the sparse coding context.
http://scikit-learn.org/stable/modules/linear_model.html#lasso
is not correct. I don't know if this also affects the doc of the SGD.
I would also vote for writing the cost function minimized in the Lasso
(etc.) docstrings.
regarding the shapes using sparse_encode I'll let Vlad comment.
At first sight I agree with Olivier re: the shapes. The alpha issue is
a semantics one and we should simply multiply it back by the
appropriate dimension in order to expose a  clear interface.
How about I address these issues in the pull request I opened earlier today?
+1, also make it possible to expose the ability to up the max_iter of
LARS from the sparse_encode API as reported by Ian.
Post by Vlad Niculae
I just remember discussing with Alex that the alpha values for the
MiniBatch versions of the algorithms didn't correspond to the batch
versions. I now realize that this might be the reason; the scaling for
a mini-batch was different than for the full batch.
Indeed.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-12-07 08:39:00 UTC
Permalink
Post by Olivier Grisel
Post by Vlad Niculae
How about I address these issues in the pull request I opened earlier today?
+1, also make it possible to expose the ability to up the max_iter of
LARS from the sparse_encode API as reported by Ian.
+1.
Post by Olivier Grisel
Post by Vlad Niculae
I just remember discussing with Alex that the alpha values for the
MiniBatch versions of the algorithms didn't correspond to the batch
versions. I now realize that this might be the reason; the scaling for
a mini-batch was different than for the full batch.
Good catch. It does take quite a while to iron out all the details, but
it's very important in terms of quality of the package.

G
Mathieu Blondel
2011-12-12 16:10:07 UTC
Permalink
On Wed, Dec 7, 2011 at 3:34 AM, David Warde-Farley
Post by David Warde-Farley
In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
David, I see you have implemented the feature sign search algorithm of
Lee et al. (NIPS 2007). Do you think there would be value in adding it
as an encoder for the dictionary learning module in scikit-learn? (in
addition to Lars and Lasso)

Mathieu
Gael Varoquaux
2011-12-06 03:57:53 UTC
Permalink
Post by Andreas Mueller
What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.
We would need an intermediate object that does the tranform and not the
fit, and from which the diferent dictionary learning object could
inherit.

G
Vlad Niculae
2011-12-06 07:41:56 UTC
Permalink
On Tue, Dec 6, 2011 at 5:57 AM, Gael Varoquaux
Post by Gael Varoquaux
Post by Andreas Mueller
What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.
We would need an intermediate object that does the tranform and not the
fit, and from which the diferent dictionary learning object could
inherit.
This is actually exactly how the module is designed. We have
BaseDictionaryLearning which only implements transforms. I didn't try
but you should be able to instantiate a BaseDictionaryLearning object,
set its components_ manually, and use its transform. Then again, I
think that it should work to set the components_ by hand for any
dictionary learning object. Ian, could you send us the code that you
used when you noticed that this didn't work?

Best
Vlad
Post by Gael Varoquaux
G
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2011-12-06 09:04:40 UTC
Permalink
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.

G
Vlad Niculae
2011-12-06 09:12:54 UTC
Permalink
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.

Vlad
Post by Gael Varoquaux
G
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2011-12-06 10:07:29 UTC
Permalink
Post by Vlad Niculae
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.
Also vlad can you check the shape of the output of:

http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
(and its' parallel variant) ?

It looks wrong to me. I would have expected `(n_samples, n_components)` instead.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Vlad Niculae
2011-12-06 10:45:07 UTC
Permalink
On Tue, Dec 6, 2011 at 12:07 PM, Olivier Grisel
Post by Olivier Grisel
Post by Vlad Niculae
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
(and its' parallel variant) ?
It looks wrong to me. I would have expected `(n_samples, n_components)` instead.
There's no really wrong or right here, but indeed it's backwards than
the dictionary learning framework, but it's shaped like the linear
estimators.

I was lead to the same problem when implementing the sparse coder
discussed above. Also I think that the defaults should be more
consistent between sparse_coding functions and DictionaryLearning
objects.

I also ran into some missing documentation.

I'll put this in a form of a pull request in a couple of minutes, so
we can discuss there.

Vlad
Post by Olivier Grisel
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2011-12-06 10:59:20 UTC
Permalink
Post by Vlad Niculae
On Tue, Dec 6, 2011 at 12:07 PM, Olivier Grisel
Post by Olivier Grisel
Post by Vlad Niculae
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
(and its' parallel variant) ?
It looks wrong to me. I would have expected `(n_samples, n_components)` instead.
There's no really wrong or right here, but indeed it's backwards than
the dictionary learning framework, but it's shaped like the linear
estimators.
Ok I think I understand: the dictionary X is currently documented as shape:

(n_samples, n_components)

Whereas, IMHO it should be (n_components, n_features) : each atom (row
of the dictionary) should have the dimensionality of the input data
space which it is supposed to be a summary of.

In that case the output of the sparse encoder for a new dataset of
shape (n_samples, n_features) would be shaped (n_samples,
n_components).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Ian Goodfellow
2011-12-06 15:26:04 UTC
Permalink
I was initially confused by the specification of the dictionary size
for sparse_encode. It makes sense if you think of it as solving
multiple lasso problems, but as Vlad said it is different from the
dictionary learning setup. As Vlad said there is no right or wrong,
but personally I think it is confusing to use the lasso terminology to
describe the shape of the arguments when the method has a name from
the dictionary learning point of view ("sparse_encode").

As for my code the line that fails is:

from sklearn.decomposition import sparse_encode
HS = sparse_encode( model.W.get_value(), X.T, alpha = 0.01).T

model and X are both fairly complicated, but basically X is a 5000x108
design matrix consisting of whitened 6x6 image patches from the STL-10
dataset. model.W.get_value() is a 108x1600 dictionary matrix. Both
arguments are represented as 2 dimensional float32 numpy ndarrays.

I agree with David that it seems like the optimizer is broken, but I
disagree that the problem is the termination criterion. There should
not be any NaNs anywhere in the course of optimization.
Post by Vlad Niculae
On Tue, Dec 6, 2011 at 12:07 PM, Olivier Grisel
Post by Olivier Grisel
Post by Vlad Niculae
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
(and its' parallel variant) ?
It looks wrong to me. I would have expected `(n_samples, n_components)` instead.
There's no really wrong or right here, but indeed it's backwards than
the dictionary learning framework, but it's shaped like the linear
estimators.
 (n_samples, n_components)
Whereas, IMHO it should be (n_components, n_features) : each atom (row
of the dictionary) should have the dimensionality of the input data
space which it is supposed to be a summary of.
In that case the output of the sparse encoder for a new dataset of
shape (n_samples, n_features) would be shaped (n_samples,
n_components).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Vlad Niculae
2011-12-06 15:31:45 UTC
Permalink
Post by Ian Goodfellow
I was initially confused by the specification of the dictionary size
for sparse_encode. It makes sense if you think of it as solving
multiple lasso problems, but as Vlad said it is different from the
dictionary learning setup. As Vlad said there is no right or wrong,
but personally I think it is confusing to use the lasso terminology to
describe the shape of the arguments when the method has a name from
the dictionary learning point of view ("sparse_encode").
I think we can agree to switch around the dimensions so that
sparse_encode fits with dictionary learning, right?
I was specifically refering to when you attempted to manually plug in
the components_ attribute, because I feel that that should work. Was
the problem regarding invalid output, or a formal error?

Vlad
Post by Ian Goodfellow
Post by Vlad Niculae
On Tue, Dec 6, 2011 at 12:07 PM, Olivier Grisel
Post by Olivier Grisel
Post by Vlad Niculae
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
(and its' parallel variant) ?
It looks wrong to me. I would have expected `(n_samples, n_components)` instead.
There's no really wrong or right here, but indeed it's backwards than
the dictionary learning framework, but it's shaped like the linear
estimators.
 (n_samples, n_components)
Whereas, IMHO it should be (n_components, n_features) : each atom (row
of the dictionary) should have the dimensionality of the input data
space which it is supposed to be a summary of.
In that case the output of the sparse encoder for a new dataset of
shape (n_samples, n_features) would be shaped (n_samples,
n_components).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Ian Goodfellow
2011-12-06 15:41:04 UTC
Permalink
Post by Vlad Niculae
Post by Ian Goodfellow
I was initially confused by the specification of the dictionary size
for sparse_encode. It makes sense if you think of it as solving
multiple lasso problems, but as Vlad said it is different from the
dictionary learning setup. As Vlad said there is no right or wrong,
but personally I think it is confusing to use the lasso terminology to
describe the shape of the arguments when the method has a name from
the dictionary learning point of view ("sparse_encode").
I think we can agree to switch around the dimensions so that
sparse_encode fits with dictionary learning, right?
I was specifically refering to when you attempted to manually plug in
the components_ attribute, because I feel that that should work. Was
the problem regarding invalid output, or a formal error?
from sklearn.decomposition import MiniBatchDictionaryLearning
m = MiniBatchDictionaryLearning(n_atoms = model.nhid,
fit_algorithm='lars', transform_algorithm = 'lasso_lars',
dict_init = model.W.get_value().T)
m.components_ = model.W.get_value().T

HS = m.transform(X)

Before I patched m.components_ there was an AttributeError.
After I patched it, it was just incorrect output (all zeros, or NaNs,
depending on alpha). I thought that it was not working because the
object was not set up properly, but now that I have tried
sparse_encode it seems more likely that the underlying optimizer just
doesn't work.
Post by Vlad Niculae
Vlad
Post by Ian Goodfellow
Post by Vlad Niculae
On Tue, Dec 6, 2011 at 12:07 PM, Olivier Grisel
Post by Olivier Grisel
Post by Vlad Niculae
Post by Gael Varoquaux
Post by Vlad Niculae
This is actually exactly how the module is designed.
Great design! I should have looked at it closer before writing my mail.
Post by Vlad Niculae
We have BaseDictionaryLearning which only implements transforms. I
didn't try but you should be able to instantiate a
BaseDictionaryLearning object, set its components_ manually, and use
its transform.
Maybe we need a subclass of this object, for instance 'sparse_coder' that
takes as __init__ argument the dictionnary to be used.
Sounds good, this way it can be used in pipelines. I'll make a pull request.
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
(and its' parallel variant) ?
It looks wrong to me. I would have expected `(n_samples, n_components)` instead.
There's no really wrong or right here, but indeed it's backwards than
the dictionary learning framework, but it's shaped like the linear
estimators.
 (n_samples, n_components)
Whereas, IMHO it should be (n_components, n_features) : each atom (row
of the dictionary) should have the dimensionality of the input data
space which it is supposed to be a summary of.
In that case the output of the sparse encoder for a new dataset of
shape (n_samples, n_features) would be shaped (n_samples,
n_components).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of
discussion for anyone considering optimizing the pricing and packaging model
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2011-12-07 03:58:14 UTC
Permalink
Post by Ian Goodfellow
I agree with David that it seems like the optimizer is broken, but I
disagree that the problem is the termination criterion. There should
not be any NaNs anywhere in the course of optimization.
I have also seen problems with NaNs in the case of very correlated
dictionnaries/designs.

I think that this is due to the conditionning of the problem, and
numerical erros that accumulate. I looked a bit, but really couldn't find
where the problem was. I had too leave for travels, but Virgile, Fabian
and I where hoping to work on this more.

Can people confirm that some other solvers (e.g. GLMnet) do not have the
same problem? In which case, we need to figure out how they do it.

G
Olivier Grisel
2011-12-07 08:38:32 UTC
Permalink
Post by Gael Varoquaux
Post by Ian Goodfellow
I agree with David that it seems like the optimizer is broken, but I
disagree that the problem is the termination criterion. There should
not be any NaNs anywhere in the course of optimization.
I have also seen problems with NaNs in the case of very correlated
dictionnaries/designs.
I think that this is due to the conditionning of the problem, and
numerical erros that accumulate. I looked a bit, but really couldn't find
where the problem was. I had too leave for travels, but Virgile, Fabian
and I where hoping to work on this more.
Can people confirm that some other solvers (e.g. GLMnet) do not have the
same problem? In which case, we need to figure out how they do it.
Are we still talking about the Lasso with Coordinate Descent or
another estimator? Would be great to have a reproduction case in a
gist and a matching issue ticket on github not to forget about it.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-12-07 08:41:37 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
Can people confirm that some other solvers (e.g. GLMnet) do not have the
same problem? In which case, we need to figure out how they do it.
Are we still talking about the Lasso with Coordinate Descent
Yes.
Post by Olivier Grisel
Would be great to have a reproduction case in a gist
Virgile? I thought that you had almost isolated a simple test case.

G
Virgile Fritsch
2011-12-07 09:51:12 UTC
Permalink
Post by Gael Varoquaux
Virgile? I thought that you had almost isolated a simple test case.
Not really, I could have the graph_lasso crash but I was using a
configuration in which my n_features-dimensional observations actually lied
in a (n_features - 1)-dimensional subspace.

Currently, I am using the graph_lasso to estimate covariance matrix on n x
p datasets with p / n > 1 (and matrix condition number up to 1000). I get a
lot a warning in the GraphLassoCV, but it can complete the computation
anyway. Whenever I get an exception, I catch it, ignore it, and rely on the
number of simulations that I perform to compensate the failure.
We may still find cases that yield an error, but it has not been a problem
anymore for me.
I guess I will go back to that issue when I work with real data.

On Wed, Dec 7, 2011 at 9:41 AM, Gael Varoquaux <
Post by Gael Varoquaux
Post by Olivier Grisel
Post by Gael Varoquaux
Can people confirm that some other solvers (e.g. GLMnet) do not have
the
Post by Olivier Grisel
Post by Gael Varoquaux
same problem? In which case, we need to figure out how they do it.
Are we still talking about the Lasso with Coordinate Descent
Yes.
Post by Olivier Grisel
Would be great to have a reproduction case in a gist
Virgile? I thought that you had almost isolated a simple test case.
G
Loading...