Discussion:
[PATCH] pm_ops: add irq enable/disable hooks
Johannes Berg
2007-04-05 21:54:14 UTC
Permalink
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.irq_off and pm_ops.irq_on which will be called instead
of disabling/enabling irqs so platforms can do a bit more work there if
necessary.

Signed-off-by: Johannes Berg <***@sipsolutions.net>

---
My previous patches moving to the pm_ops interface for powermac were
buggy due to exactly this issue, but the bug never surfaced on my
machine, only on machines with an Apple Desktop Bus (or an emulation
thereof)

Does this look ok to you? Would you want different names? Should I stick
in a BUG_ON(interrupts_enabled()) or such?

include/linux/pm.h | 10 ++++++++++
kernel/power/main.c | 10 ++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)

--- wireless-dev.orig/include/linux/pm.h 2007-04-05 18:14:07.948549941 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-05 23:15:29.934829459 +0200
@@ -131,9 +131,17 @@ typedef int __bitwise suspend_disk_metho
* @prepare: Prepare the platform for the given suspend state. Can return a
* negative error code if necessary.
*
+ * @irq_off: If assigned, the generic suspend code does not turn off IRQs
+ * but relies on this callback instead. It is currently not called for
+ * %PM_SUSPEND_DISK.
+ *
* @enter: Enter the given suspend state, must be assigned. Can return a
* negative error code if necessary.
*
+ * @irq_on: If assigned, the generic suspend code does not turn on IRQs
+ * but relies on this callback instead. It is currently not called for
+ * %PM_SUSPEND_DISK.
+ *
* @finish: Called when the system has left the given state and all devices
* are resumed. The return value is ignored.
*
@@ -152,7 +160,9 @@ typedef int __bitwise suspend_disk_metho
struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
+ void (*irq_off)(suspend_state_t state);
int (*enter)(suspend_state_t state);
+ void (*irq_on)(suspend_state_t state);
int (*finish)(suspend_state_t state);
suspend_disk_method_t pm_disk_mode;
};
--- wireless-dev.orig/kernel/power/main.c 2007-04-05 18:14:07.988549941 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-05 18:25:21.108549941 +0200
@@ -117,7 +117,10 @@ int suspend_enter(suspend_state_t state)
int error = 0;
unsigned long flags;

- local_irq_save(flags);
+ if (pm_ops->irq_off)
+ pm_ops->irq_off(state);
+ else
+ local_irq_save(flags);

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -126,7 +129,10 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ if (pm_ops->irq_on)
+ pm_ops->irq_on(state);
+ else
+ local_irq_restore(flags);
return error;
}
Rafael J. Wysocki
2007-04-05 23:30:52 UTC
Permalink
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.irq_off and pm_ops.irq_on which will be called instead
of disabling/enabling irqs so platforms can do a bit more work there if
necessary.
---
My previous patches moving to the pm_ops interface for powermac were
buggy due to exactly this issue, but the bug never surfaced on my
machine, only on machines with an Apple Desktop Bus (or an emulation
thereof)
Does this look ok to you? Would you want different names? Should I stick
in a BUG_ON(interrupts_enabled()) or such?
Well is this possible to do something like

if (pm_ops->do_something_before_disabling_irqs)
pm_ops->do_something_before_disabling_irqs()
local_irq_save(flags);
if (pm_ops->do_something_after_disabling_irqs)
pm_ops->do_something_after_disabling_irqs()

and analogously for enabling the IRQs?

Rafael
Johannes Berg
2007-04-05 23:28:57 UTC
Permalink
Post by Rafael J. Wysocki
Well is this possible to do something like
if (pm_ops->do_something_before_disabling_irqs)
pm_ops->do_something_before_disabling_irqs()
local_irq_save(flags);
if (pm_ops->do_something_after_disabling_irqs)
pm_ops->do_something_after_disabling_irqs()
and analogously for enabling the IRQs?
Ultimately yes, but is it worth the added complexity? Somebody mucking
with pm_ops has to know what he's doing anyway.

johannes
Rafael J. Wysocki
2007-04-06 00:02:13 UTC
Permalink
Post by Johannes Berg
Post by Rafael J. Wysocki
Well is this possible to do something like
if (pm_ops->do_something_before_disabling_irqs)
pm_ops->do_something_before_disabling_irqs()
local_irq_save(flags);
if (pm_ops->do_something_after_disabling_irqs)
pm_ops->do_something_after_disabling_irqs()
and analogously for enabling the IRQs?
Ultimately yes, but is it worth the added complexity? Somebody mucking
with pm_ops has to know what he's doing anyway.
Well, this seems to be more natural ("if you want to do something before/after
disabling the IRQs, define it" instead of "you can do something instead of
calling local_irq_save(), but please remember to disable the IRQs yourself
in that case").

BTW, it need not be in pm_ops (actually, why should it be there?).
Alternatively, you can define something like arch_prepare_for_disabling_irqs()
and/or arch_prepare_device_power_down() that will be empty on x86, for
example.

BTW2, I think BUG_ON(irqs_enabled) is needed after the arch does something
instead of or after local_irq_save().

Rafael
Johannes Berg
2007-04-06 00:09:18 UTC
Permalink
Post by Rafael J. Wysocki
Well, this seems to be more natural ("if you want to do something before/after
disabling the IRQs, define it" instead of "you can do something instead of
calling local_irq_save(), but please remember to disable the IRQs yourself
in that case").
Heh. Yeah, I guess. It just didn't seem worth it. I personally don't
care, I just need to be able to get at those spots.
Post by Rafael J. Wysocki
BTW, it need not be in pm_ops (actually, why should it be there?).
Alternatively, you can define something like arch_prepare_for_disabling_irqs()
and/or arch_prepare_device_power_down() that will be empty on x86, for
example.
Not sure, it might be different for different suspend methods. We
actually need to do some platform-function stuff inbetween, and if we
ever want some S4-like state then we might need to do it differently.
Post by Rafael J. Wysocki
BTW2, I think BUG_ON(irqs_enabled) is needed after the arch does something
instead of or after local_irq_save().
Sure, I can add that. Probably also in the resume path.

johannes
Rafael J. Wysocki
2007-04-06 00:17:50 UTC
Permalink
Post by Johannes Berg
Post by Rafael J. Wysocki
Well, this seems to be more natural ("if you want to do something before/after
disabling the IRQs, define it" instead of "you can do something instead of
calling local_irq_save(), but please remember to disable the IRQs yourself
in that case").
Heh. Yeah, I guess. It just didn't seem worth it. I personally don't
care, I just need to be able to get at those spots.
Post by Rafael J. Wysocki
BTW, it need not be in pm_ops (actually, why should it be there?).
Alternatively, you can define something like arch_prepare_for_disabling_irqs()
and/or arch_prepare_device_power_down() that will be empty on x86, for
example.
Not sure, it might be different for different suspend methods. We
actually need to do some platform-function stuff inbetween, and if we
ever want some S4-like state then we might need to do it differently.
Ah, OK

Rafael
Johannes Berg
2007-04-06 08:48:42 UTC
Permalink
Post by Rafael J. Wysocki
Post by Johannes Berg
Not sure, it might be different for different suspend methods. We
actually need to do some platform-function stuff inbetween, and if we
ever want some S4-like state then we might need to do it differently.
Ah, OK
Keep in mind that I don't know that yet, and am not totally sure I ever
will implement something S4-like (it would probably require kexec or
similar tricks). Also, these handlers are not even called fro the
suspend to disk case right now (and documented that way.)

I will repost with some BUG_ON() assertions, but should I change it to
have 4 handlers before_irq_off/after_irq_off/before_irq_on/after_irq_on
instead of the two I have now?

johannes
Rafael J. Wysocki
2007-04-06 09:41:31 UTC
Permalink
Post by Johannes Berg
Post by Rafael J. Wysocki
Post by Johannes Berg
Not sure, it might be different for different suspend methods. We
actually need to do some platform-function stuff inbetween, and if we
ever want some S4-like state then we might need to do it differently.
Ah, OK
Keep in mind that I don't know that yet, and am not totally sure I ever
will implement something S4-like (it would probably require kexec or
similar tricks). Also, these handlers are not even called fro the
suspend to disk case right now (and documented that way.)
I will repost with some BUG_ON() assertions, but should I change it to
have 4 handlers before_irq_off/after_irq_off/before_irq_on/after_irq_on
instead of the two I have now?
Frankly, I'm not sure.

For practical purposes the BUG_ON() assertions will suffice, so I think you
can keep the two handlers. I'd change the names, though, to something
like quiesce() and activate(), for example.

[Hm, it feels more appropriate to define them for all platforms and make them
call local_irq_save() on the platforms that don't need to do anything more.]

BTW, please remember to update the SNAPSHOT_S2RAM ioctl accordingly (well,
I think we should move the common code to a separate function).

Greetings,
Rafael
Johannes Berg
2007-04-06 09:44:12 UTC
Permalink
Post by Rafael J. Wysocki
Frankly, I'm not sure.
For practical purposes the BUG_ON() assertions will suffice, so I think you
can keep the two handlers. I'd change the names, though, to something
like quiesce() and activate(), for example.
Sure.
Post by Rafael J. Wysocki
[Hm, it feels more appropriate to define them for all platforms and make them
call local_irq_save() on the platforms that don't need to do anything more.]
Is there much point in that? It seems to make implementing new pm_ops a
bit more complex seeing that nobody but us seems to require such a thing
yet.
Post by Rafael J. Wysocki
BTW, please remember to update the SNAPSHOT_S2RAM ioctl accordingly (well,
I think we should move the common code to a separate function).
Good point. I'll take a look.

johannes
Rafael J. Wysocki
2007-04-06 10:02:41 UTC
Permalink
Post by Johannes Berg
Post by Rafael J. Wysocki
Frankly, I'm not sure.
For practical purposes the BUG_ON() assertions will suffice, so I think you
can keep the two handlers. I'd change the names, though, to something
like quiesce() and activate(), for example.
Sure.
Post by Rafael J. Wysocki
[Hm, it feels more appropriate to define them for all platforms and make them
call local_irq_save() on the platforms that don't need to do anything more.]
Is there much point in that? It seems to make implementing new pm_ops a
bit more complex seeing that nobody but us seems to require such a thing
yet.
That's why I said "it feels". :-)

Still, we can, for example, define default_quiesce() and default_activate()
that will only call local_irq_save/restore() to be used by the architectures
that don't need anything more etc.

Greetings,
Rafael
Johannes Berg
2007-04-06 10:00:57 UTC
Permalink
Post by Rafael J. Wysocki
That's why I said "it feels". :-)
:)
Post by Rafael J. Wysocki
Still, we can, for example, define default_quiesce() and default_activate()
that will only call local_irq_save/restore() to be used by the architectures
that don't need anything more etc.
Like I did with pm_ops_only_mem or whatever it was...

I don't feel strongly either way. But the last big discussion when I
changed all pm_ops users sort of discouraged me ;)

johannes
Pavel Machek
2007-04-06 19:19:03 UTC
Permalink
Hi!
Post by Johannes Berg
Post by Rafael J. Wysocki
Frankly, I'm not sure.
For practical purposes the BUG_ON() assertions will suffice, so I think you
can keep the two handlers. I'd change the names, though, to something
like quiesce() and activate(), for example.
Sure.
Post by Rafael J. Wysocki
[Hm, it feels more appropriate to define them for all platforms and make them
call local_irq_save() on the platforms that don't need to do anything more.]
Is there much point in that? It seems to make implementing new pm_ops a
bit more complex seeing that nobody but us seems to require such a thing
yet.
And why do _you_ need it? Unfortunately I do not know what the
decrementer is...?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-06 21:59:27 UTC
Permalink
Post by Pavel Machek
And why do _you_ need it? Unfortunately I do not know what the
decrementer is...?
The decrementer is a sort of timer that counts down and once it
overflows raises an exception in the CPU. Like a timer interrupt, but
not coming from anything external. We need to be able to make sure we
don't get a decrementer exception at some spots.

Also, we need to call some platform function things at that point to
prepare the system for suspend, and it needs to be done with interrupts
enabled but after all devices are suspended.

johannes
Johannes Berg
2007-04-10 13:42:32 UTC
Permalink
Hi,
Hmm, and can't you simply create sysdev for decrementer and special
platform handling? sysdevs should be suspended last...
In theory, yes.
In practise, however, it seems to be impossible to get a sysdev into the
queue that is suspended before any other sysdevs are suspended (i.e.
right after interrupts are disabled)
And then there are the platform functions. In theory, they could be done
with a regular struct device, but in practice they need to be the very
last thing before interrupts are disabled, and that again is impossible
to achieve.
Is it feasible to improve sysdev handling to allow this?
Actually, I think I spoke too soon when I wrote above.
device_power_down() calls sysdev's last so even being first in the
sysdev queue probably wouldn't help here. Then again, I'm not 100%
certain that we do need the decrementer right after disabling IRQs, it
seems to me on casual inspections that it isn't necessary but I'd have
to take another look.

Pushing aside the decrementer, the second issue with the platform
functions remain. I'm not opposed to working on something to allow me to
order the device tree suspend, but I have absolutely no idea how that
could be done.

johannes
Johannes Berg
2007-04-10 11:45:07 UTC
Permalink
Hmm, and can't you simply create sysdev for decrementer and special
platform handling? sysdevs should be suspended last...
In theory, yes.

In practise, however, it seems to be impossible to get a sysdev into the
queue that is suspended before any other sysdevs are suspended (i.e.
right after interrupts are disabled)

And then there are the platform functions. In theory, they could be done
with a regular struct device, but in practice they need to be the very
last thing before interrupts are disabled, and that again is impossible
to achieve.

johannes
Pavel Machek
2007-04-10 12:00:28 UTC
Permalink
Hi!
Hmm, and can't you simply create sysdev for decrementer and special
platform handling? sysdevs should be suspended last...
In theory, yes.
In practise, however, it seems to be impossible to get a sysdev into the
queue that is suspended before any other sysdevs are suspended (i.e.
right after interrupts are disabled)
And then there are the platform functions. In theory, they could be done
with a regular struct device, but in practice they need to be the very
last thing before interrupts are disabled, and that again is impossible
to achieve.
Is it feasible to improve sysdev handling to allow this?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Benjamin Herrenschmidt
2007-04-11 11:22:13 UTC
Permalink
Hi!
Hmm, and can't you simply create sysdev for decrementer and special
platform handling? sysdevs should be suspended last...
In theory, yes.
In practise, however, it seems to be impossible to get a sysdev into the
queue that is suspended before any other sysdevs are suspended (i.e.
right after interrupts are disabled)
And then there are the platform functions. In theory, they could be done
with a regular struct device, but in practice they need to be the very
last thing before interrupts are disabled, and that again is impossible
to achieve.
Is it feasible to improve sysdev handling to allow this?
It would be an absolutely ugly hack imho...

The trick to get the decrementer right is to really do it at the same
time as we turn irqs off... hence the idea to make the whole "turning
irqs off" platform specific to get the platform a chance to perform
whatever trickery is necessary after normal driver suspend and around
IRQ disabling.

(In the decrementer case, it looks approx like:
1- set decrementer to max value
2- switch irqs off
3- set decrementer to max value (again)
)

Along with that, we have all sort of platform specific hackery we need
to perform just before disabling IRQs on suspend and just after
re-enabling them on resume. That's really the simplest/easiest way to do
it.

In fact, I'm pretty sure ACPI would love to use such a hook into as
well, in order to run all that motherboard stuff that needs to be able
to sleep, take semaphores, etc... before IRQs are off but after all
devices have suspended.

sysdev's are just a pain in the neck if you ask me... Most of the
problems I've seen with cpufreq are due to the fact that it's a sysdev.
They were a bad idea in the first place and keep being abused :-)

Ben.
Alan Stern
2007-04-11 14:07:44 UTC
Permalink
Post by Benjamin Herrenschmidt
Is it feasible to improve sysdev handling to allow this?
It would be an absolutely ugly hack imho...
The trick to get the decrementer right is to really do it at the same
time as we turn irqs off... hence the idea to make the whole "turning
irqs off" platform specific to get the platform a chance to perform
whatever trickery is necessary after normal driver suspend and around
IRQ disabling.
1- set decrementer to max value
2- switch irqs off
3- set decrementer to max value (again)
)
Along with that, we have all sort of platform specific hackery we need
to perform just before disabling IRQs on suspend and just after
re-enabling them on resume. That's really the simplest/easiest way to do
it.
In fact, I'm pretty sure ACPI would love to use such a hook into as
well, in order to run all that motherboard stuff that needs to be able
to sleep, take semaphores, etc... before IRQs are off but after all
devices have suspended.
sysdev's are just a pain in the neck if you ask me... Most of the
problems I've seen with cpufreq are due to the fact that it's a sysdev.
They were a bad idea in the first place and keep being abused :-)
Have you guys considered using a notifier chain for this? The platform
code could register itself on the chain at boot time. When suspending you
send out notifications just before and just after disabling interrupts,
and inversely on resume.

Alan Stern
Johannes Berg
2007-04-11 16:39:13 UTC
Permalink
Post by Alan Stern
Have you guys considered using a notifier chain for this? The platform
code could register itself on the chain at boot time. When suspending you
send out notifications just before and just after disabling interrupts,
and inversely on resume.
Not really, but I don't see the point. There shouldn't be anything but
the platform using it so why have a chain? And we'd have to register two
things since it's not exactly in the inverse order here.

johannes
Benjamin Herrenschmidt
2007-04-11 21:40:33 UTC
Permalink
Post by Alan Stern
Have you guys considered using a notifier chain for this? The platform
code could register itself on the chain at boot time. When suspending you
send out notifications just before and just after disabling interrupts,
and inversely on resume.
Except that in the case of the decrementer, I really need to be just
around the actual switching of irqs off ...

Ben.
Pavel Machek
2007-04-10 11:36:24 UTC
Permalink
Hi!
Post by Johannes Berg
Post by Pavel Machek
And why do _you_ need it? Unfortunately I do not know what the
decrementer is...?
The decrementer is a sort of timer that counts down and once it
overflows raises an exception in the CPU. Like a timer interrupt, but
not coming from anything external. We need to be able to make sure we
don't get a decrementer exception at some spots.
Also, we need to call some platform function things at that point to
prepare the system for suspend, and it needs to be done with interrupts
enabled but after all devices are suspended.
Hmm, and can't you simply create sysdev for decrementer and special
platform handling? sysdevs should be suspended last...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-11 11:15:18 UTC
Permalink
Hmm, and can't you simply create sysdev for decrementer and special
platform handling? sysdevs should be suspended last...
So I looked again and noticed that my earlier statement was wrong.

The special platform handling can well be done with platform devices, I
can simply insert them into the power management list early by using a
quite early initcall when registering them, that way they'll be
suspended last. So contrary to what I said, those aren't a problem.

The decrementer, however, I looked at again and noticed that I had a
wrong assumption. Let me give a short description of how the decrementer
works first:

The decrementer is a 32-bit register that counts down based on an
implementation-dependent clock, on my powerbook it's somewhere around
18MHz, on my powermac 32MHz (IIRC). When the most significant bit
transitions from 0 to 1 the decrementer exception is raised, when
exceptions are enabled it's taken. However, when exceptions are
disabled, it stays pending until it is taken when exceptions are enabled
again.

Because the exception stays pending, we need to set the decrementer to
the highest possible value before interrupts are disabled so we can be
sure that the exception is not pending. This is to ensure that the
processor can later go to sleep properly to be turned off.

Since we only have a limited amount of time after setting the
decrementer to such a high value (granted, it's at least a minute in all
cases I've seen so far) we like to do it right before disabling
interrupts.

johannes
Pavel Machek
2007-04-06 19:16:48 UTC
Permalink
Hi!
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.irq_off and pm_ops.irq_on which will be called instead
of disabling/enabling irqs so platforms can do a bit more work there if
necessary.
---
My previous patches moving to the pm_ops interface for powermac were
buggy due to exactly this issue, but the bug never surfaced on my
machine, only on machines with an Apple Desktop Bus (or an emulation
thereof)
Does this look ok to you? Would you want different names? Should I stick
in a BUG_ON(interrupts_enabled()) or such?
include/linux/pm.h | 10 ++++++++++
kernel/power/main.c | 10 ++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)
--- wireless-dev.orig/include/linux/pm.h 2007-04-05 18:14:07.948549941 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-05 23:15:29.934829459 +0200
@@ -131,9 +131,17 @@ typedef int __bitwise suspend_disk_metho
* negative error code if necessary.
*
+ * but relies on this callback instead. It is currently not called for
+ * %PM_SUSPEND_DISK.
Well, I guess you should also describe what irq_off's responsibilities
are...? Obviously it needs to disable irqs, but if you need something
special on ppc, does that mean it needs to do more?
Post by Johannes Berg
* negative error code if necessary.
*
+ * but relies on this callback instead. It is currently not called for
+ * %PM_SUSPEND_DISK.
+ *
* are resumed. The return value is ignored.
*
@@ -152,7 +160,9 @@ typedef int __bitwise suspend_disk_metho
struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
+ void (*irq_off)(suspend_state_t state);
int (*enter)(suspend_state_t state);
+ void (*irq_on)(suspend_state_t state);
int (*finish)(suspend_state_t state);
suspend_disk_method_t pm_disk_mode;
};
--- wireless-dev.orig/kernel/power/main.c 2007-04-05 18:14:07.988549941 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-05 18:25:21.108549941 +0200
@@ -117,7 +117,10 @@ int suspend_enter(suspend_state_t state)
int error = 0;
unsigned long flags;
- local_irq_save(flags);
+ if (pm_ops->irq_off)
+ pm_ops->irq_off(state);
+ else
+ local_irq_save(flags);
I'd just unconditionally call irq_off, and set up irq_off to do
local_irq_save on all the other archs... This is just ugly.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-11 15:54:27 UTC
Permalink
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.

Since those two new calls are required, this patch also introduces default
methods that can be assigned and updates all pm_ops users to use these
defaults.

Signed-off-by: Johannes Berg <***@sipsolutions.net>

---
Rafael: I don't need to change SNAPSHOT_S2RAM since it calls suspend_enter
which is where the change is.

I'm not sure if it makes sense to have this for PM_SUSPEND_DISK as well so
unless somebody comes up with a reason for now don't.

arch/arm/common/sharpsl_pm.c | 2 ++
arch/arm/mach-at91/pm.c | 2 ++
arch/arm/mach-omap1/pm.c | 2 ++
arch/arm/mach-omap2/pm.c | 2 ++
arch/arm/mach-pnx4008/pm.c | 2 ++
arch/arm/mach-pxa/pm.c | 2 ++
arch/arm/mach-sa1100/pm.c | 2 ++
arch/arm/plat-s3c24xx/pm.c | 2 ++
arch/powerpc/platforms/powermac/setup.c | 2 ++
arch/sh/boards/hp6xx/pm.c | 2 ++
drivers/acpi/sleep/main.c | 7 +++++++
include/linux/pm.h | 12 ++++++++++++
kernel/power/main.c | 29 ++++++++++++++++++++++++++---
13 files changed, 65 insertions(+), 3 deletions(-)

--- wireless-dev.orig/include/linux/pm.h 2007-04-11 17:02:22.969481358 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-11 17:02:42.609481358 +0200
@@ -135,9 +135,17 @@ typedef int __bitwise suspend_disk_metho
* @prepare: Prepare the platform for the given suspend state. Can return a
* negative error code if necessary.
*
+ * @quiesce: This callback must be assigned and at least turn off IRQs. If
+ * your platform requires nothing else, assign %pm_default_quiesce.
+ * It is currently not called for %PM_SUSPEND_DISK.
+ *
* @enter: Enter the given suspend state, must be assigned. Can return a
* negative error code if necessary.
*
+ * @activate: This callback must be assigned and at least turn on IRQS. If
+ * your platform requires nothing else, assign %pm_default_activate.
+ * It is currently not called for %PM_SUSPEND_DISK.
+ *
* @finish: Called when the system has left the given state and all devices
* are resumed. The return value is ignored.
*
@@ -155,7 +163,9 @@ typedef int __bitwise suspend_disk_metho
struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
+ void (*quiesce)(suspend_state_t state);
int (*enter)(suspend_state_t state);
+ void (*activate)(suspend_state_t state);
int (*finish)(suspend_state_t state);
suspend_disk_method_t pm_disk_mode;
};
@@ -169,6 +179,8 @@ extern struct pm_ops *pm_ops;
extern int pm_suspend(suspend_state_t state);

extern int pm_valid_only_mem(suspend_state_t state);
+extern void pm_default_quiesce(suspend_state_t state);
+extern void pm_default_activate(suspend_state_t state);

/*
* Device power management
--- wireless-dev.orig/kernel/power/main.c 2007-04-11 17:02:23.149481358 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-11 17:18:14.879481358 +0200
@@ -60,6 +60,28 @@ int pm_valid_only_mem(suspend_state_t st
return state == PM_SUSPEND_MEM;
}

+/**
+ * pm_default_quiesce - default quiesce callback
+ *
+ * pm_ops drivers that need to do nothing but disable
+ * IRQs use this instead of their own code.
+ */
+void pm_default_quiesce(suspend_state_t state)
+{
+ local_irq_disable();
+}
+
+/**
+ * pm_default_activate - default activate callback
+ *
+ * pm_ops drivers that need to do nothing but enable
+ * IRQs use this instead of their own code.
+ */
+void pm_default_activate(suspend_state_t state)
+{
+ local_irq_enable();
+}
+

static inline void pm_finish(suspend_state_t state)
{
@@ -132,9 +154,9 @@ static int suspend_prepare(suspend_state
int suspend_enter(suspend_state_t state)
{
int error = 0;
- unsigned long flags;

- local_irq_save(flags);
+ pm_ops->quiesce(state);
+ BUG_ON(!irqs_disabled());

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +165,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ pm_ops->activate(state);
+ BUG_ON(irqs_disabled());
return error;
}

--- wireless-dev.orig/arch/arm/common/sharpsl_pm.c 2007-04-11 17:07:50.109481358 +0200
+++ wireless-dev/arch/arm/common/sharpsl_pm.c 2007-04-11 17:08:28.439481358 +0200
@@ -767,7 +767,9 @@ static void sharpsl_apm_get_power_status

static struct pm_ops sharpsl_pm_ops = {
.prepare = pxa_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = corgi_pxa_pm_enter,
+ .activate = pm_default_activate,
.finish = pxa_pm_finish,
.valid = pm_valid_only_mem,
};
--- wireless-dev.orig/arch/arm/mach-at91/pm.c 2007-04-11 17:07:50.889481358 +0200
+++ wireless-dev/arch/arm/mach-at91/pm.c 2007-04-11 17:10:17.399481358 +0200
@@ -203,7 +203,9 @@ error:
static struct pm_ops at91_pm_ops ={
.valid = at91_pm_valid_state,
.prepare = at91_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = at91_pm_enter,
+ .activate = pm_default_activate,
};

static int __init at91_pm_init(void)
--- wireless-dev.orig/arch/arm/mach-omap1/pm.c 2007-04-11 17:07:50.219481358 +0200
+++ wireless-dev/arch/arm/mach-omap1/pm.c 2007-04-11 17:08:48.149481358 +0200
@@ -699,7 +699,9 @@ static struct irqaction omap_wakeup_irq

static struct pm_ops omap_pm_ops ={
.prepare = omap_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = omap_pm_enter,
+ .activate = pm_default_activate,
.finish = omap_pm_finish,
.valid = pm_valid_only_mem,
};
--- wireless-dev.orig/arch/arm/mach-omap2/pm.c 2007-04-11 17:07:50.349481358 +0200
+++ wireless-dev/arch/arm/mach-omap2/pm.c 2007-04-11 17:09:08.369481358 +0200
@@ -371,7 +371,9 @@ static int omap2_pm_finish(suspend_state

static struct pm_ops omap_pm_ops = {
.prepare = omap2_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = omap2_pm_enter,
+ .activate = pm_default_activate,
.finish = omap2_pm_finish,
.valid = pm_valid_only_mem,
};
--- wireless-dev.orig/arch/arm/mach-pnx4008/pm.c 2007-04-11 17:07:50.539481358 +0200
+++ wireless-dev/arch/arm/mach-pnx4008/pm.c 2007-04-11 17:09:23.759481358 +0200
@@ -116,7 +116,9 @@ static int pnx4008_pm_enter(suspend_stat
}

static struct pm_ops pnx4008_pm_ops = {
+ .quiesce = pm_default_quiesce,
.enter = pnx4008_pm_enter,
+ .activate = pm_default_activate,
.valid = pm_valid_only_mem,
};

--- wireless-dev.orig/arch/arm/mach-pxa/pm.c 2007-04-11 17:07:50.639481358 +0200
+++ wireless-dev/arch/arm/mach-pxa/pm.c 2007-04-11 17:09:45.959481358 +0200
@@ -225,7 +225,9 @@ EXPORT_SYMBOL_GPL(pxa_pm_finish);

static struct pm_ops pxa_pm_ops = {
.prepare = pxa_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = pxa_pm_enter,
+ .activate = pm_default_activate,
.finish = pxa_pm_finish,
.valid = pm_valid_only_mem,
};
--- wireless-dev.orig/arch/arm/mach-sa1100/pm.c 2007-04-11 17:07:50.729481358 +0200
+++ wireless-dev/arch/arm/mach-sa1100/pm.c 2007-04-11 17:09:59.759481358 +0200
@@ -132,7 +132,9 @@ unsigned long sleep_phys_sp(void *sp)
}

static struct pm_ops sa11x0_pm_ops = {
+ .quiesce = pm_default_quiesce,
.enter = sa11x0_pm_enter,
+ .activate = pm_default_activate,
.valid = pm_valid_only_mem,
};

--- wireless-dev.orig/arch/arm/plat-s3c24xx/pm.c 2007-04-11 17:07:50.959481358 +0200
+++ wireless-dev/arch/arm/plat-s3c24xx/pm.c 2007-04-11 17:10:36.039481358 +0200
@@ -613,7 +613,9 @@ static int s3c2410_pm_enter(suspend_stat
}

static struct pm_ops s3c2410_pm_ops = {
+ .quiesce = pm_default_quiesce,
.enter = s3c2410_pm_enter,
+ .activate = pm_default_activate,
.valid = pm_valid_only_mem,
};

--- wireless-dev.orig/arch/powerpc/platforms/powermac/setup.c 2007-04-11 17:07:51.029481358 +0200
+++ wireless-dev/arch/powerpc/platforms/powermac/setup.c 2007-04-11 17:10:53.659481358 +0200
@@ -475,7 +475,9 @@ static int pmac_pm_valid(suspend_state_t
static struct pm_ops pmac_pm_ops = {
.pm_disk_mode = PM_DISK_SHUTDOWN,
.prepare = pmac_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = pmac_pm_enter,
+ .activate = pm_default_activate,
.finish = pmac_pm_finish,
.valid = pmac_pm_valid,
};
--- wireless-dev.orig/arch/sh/boards/hp6xx/pm.c 2007-04-11 17:07:51.099481358 +0200
+++ wireless-dev/arch/sh/boards/hp6xx/pm.c 2007-04-11 17:11:06.269481358 +0200
@@ -68,7 +68,9 @@ static int hp6x0_pm_enter(suspend_state_
}

static struct pm_ops hp6x0_pm_ops = {
+ .quiesce = pm_default_quiesce,
.enter = hp6x0_pm_enter,
+ .activate = pm_default_activate,
.valid = pm_valid_only_mem,
};

--- wireless-dev.orig/drivers/acpi/sleep/main.c 2007-04-11 17:07:51.339481358 +0200
+++ wireless-dev/drivers/acpi/sleep/main.c 2007-04-11 17:12:49.679481358 +0200
@@ -182,10 +182,17 @@ static int acpi_pm_state_valid(suspend_s
}
}

+/*
+ * If .quiesce or .activate are changed, keep in mind
+ * that they aren't currently called for S4. That can
+ * be changed if needed, of course.
+ */
static struct pm_ops acpi_pm_ops = {
.valid = acpi_pm_state_valid,
.prepare = acpi_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = acpi_pm_enter,
+ .activate = pm_default_activate,
.finish = acpi_pm_finish,
};
Dmitry Krivoschekov
2007-04-11 20:47:09 UTC
Permalink
Hi Johannes,
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer.
I doubt your reasoning is cogent enough. If you add such a global thing,
you should prove it, or clearly explain what is the thing for.
I don't believe that it's not possible to solve that decrementer case
without this change. Can't you add appropriate platform_device which
would register early enough, to meet that timing requirement when it
gets suspended?
Post by Johannes Berg
--- wireless-dev.orig/arch/powerpc/platforms/powermac/setup.c 2007-04-11 17:07:51.029481358 +0200
+++ wireless-dev/arch/powerpc/platforms/powermac/setup.c 2007-04-11 17:10:53.659481358 +0200
@@ -475,7 +475,9 @@ static int pmac_pm_valid(suspend_state_t
static struct pm_ops pmac_pm_ops = {
.pm_disk_mode = PM_DISK_SHUTDOWN,
.prepare = pmac_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = pmac_pm_enter,
+ .activate = pm_default_activate,
.finish = pmac_pm_finish,
.valid = pmac_pm_valid,
};
So, where is that callbacks that are so desirable for powermac?


Thanks,
Dmitry
Johannes Berg
2007-04-12 08:39:38 UTC
Permalink
Post by Dmitry Krivoschekov
I doubt your reasoning is cogent enough. If you add such a global thing,
you should prove it, or clearly explain what is the thing for.
I don't believe that it's not possible to solve that decrementer case
without this change. Can't you add appropriate platform_device which
would register early enough, to meet that timing requirement when it
gets suspended?
I think Ben explained that.
Post by Dmitry Krivoschekov
Post by Johannes Berg
static struct pm_ops pmac_pm_ops = {
.pm_disk_mode = PM_DISK_SHUTDOWN,
.prepare = pmac_pm_prepare,
+ .quiesce = pm_default_quiesce,
.enter = pmac_pm_enter,
+ .activate = pm_default_activate,
.finish = pmac_pm_finish,
.valid = pmac_pm_valid,
};
So, where is that callbacks that are so desirable for powermac?
If you'd looked at the current pmac_pm_valid, you'd see that it allows
nothing. Currently, pm_ops are totally broken on powermac. The callback
that is so desirable for powermac will be added when I actually make
powermac use the pm_ops infrastructure which currently isn't possible
due to this problem.

johannes
Benjamin Herrenschmidt
2007-04-12 08:42:32 UTC
Permalink
Post by Dmitry Krivoschekov
I doubt your reasoning is cogent enough. If you add such a global thing,
you should prove it, or clearly explain what is the thing for.
I don't believe that it's not possible to solve that decrementer case
without this change. Can't you add appropriate platform_device which
would register early enough, to meet that timing requirement when it
gets suspended?
That's would be absolutely disgusting... (relying on magic ordering of
platform devices).

The fact is, the architecture needs to do various things around the
IRQ disabling and enabling, and thus we should have a hook for it.
I fail to understand why there is so much resistance here...

Ben.
Pavel Machek
2007-04-12 10:16:53 UTC
Permalink
Hi!
Post by Benjamin Herrenschmidt
Post by Dmitry Krivoschekov
I doubt your reasoning is cogent enough. If you add such a global thing,
you should prove it, or clearly explain what is the thing for.
I don't believe that it's not possible to solve that decrementer case
without this change. Can't you add appropriate platform_device which
would register early enough, to meet that timing requirement when it
gets suspended?
That's would be absolutely disgusting... (relying on magic ordering of
platform devices).
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)

Anyway, platform devices may need specific ordering on more than power
pc, so perhaps we should just make ordering more stable so that
relying on it is no longer a hack?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Rafael J. Wysocki
2007-04-12 10:45:58 UTC
Permalink
Post by Pavel Machek
Hi!
Post by Benjamin Herrenschmidt
Post by Dmitry Krivoschekov
I doubt your reasoning is cogent enough. If you add such a global thing,
you should prove it, or clearly explain what is the thing for.
I don't believe that it's not possible to solve that decrementer case
without this change. Can't you add appropriate platform_device which
would register early enough, to meet that timing requirement when it
gets suspended?
That's would be absolutely disgusting... (relying on magic ordering of
platform devices).
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)
Anyway, platform devices may need specific ordering on more than power
pc, so perhaps we should just make ordering more stable so that
relying on it is no longer a hack?
I second that, although I think it also would be a global change.

Greetings,
Rafael
Johannes Berg
2007-04-12 10:47:27 UTC
Permalink
Post by Pavel Machek
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)
But the hack is classifying the decrementer as a platform device in the
first place. And then we'd still need to have one platform device and
one sysdev since we want to wake up the decrementer before enabling IRQs
but set it to a high value before disabling IRQs. Or we'd need some sort
of special foo device that has hooks both before and after and comes
last/first, just for one user?

Besides, we can always rewrite it later when somebody introduces a foo
device that fits above model. Since this patch essentially changes
nothing, I'd been hoping to get it into .22 so we can finally start
having a sane userspace interface for suspend on powermac. I can dream,
right?

Besides, disabling IRQs really seems like a platform issue and as such
pm_ops should be able to control it. Why are you so much opposed to that
idea? Would a patch that says

Introduce pm_ops.quiesce and pm_ops.activate to make platforms
control IRQ disabling and enabling (respectively).

and doesn't mention powermac be more acceptable?

johannes
Pavel Machek
2007-04-13 21:00:23 UTC
Permalink
Hi!
Post by Johannes Berg
Post by Pavel Machek
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)
But the hack is classifying the decrementer as a platform device in the
first place. And then we'd still need to have one platform device and
one sysdev since we want to wake up the decrementer before enabling IRQs
Can decrementer be waked a little later?
Post by Johannes Berg
but set it to a high value before disabling IRQs. Or we'd need some sort
of special foo device that has hooks both before and after and comes
last/first, just for one user?
Have two devices, then.
Post by Johannes Berg
Besides, we can always rewrite it later when somebody introduces a foo
device that fits above model. Since this patch essentially changes
nothing, I'd been hoping to get it into .22 so we can finally start
No. Hacks like this are impossible to get rid of.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-13 21:11:31 UTC
Permalink
Post by Pavel Machek
Have two devices, then.
I'm definitely *not* going to do hacks like that.

johannes
Pavel Machek
2007-04-13 21:43:55 UTC
Permalink
Hi!
Post by Johannes Berg
Post by Pavel Machek
Have two devices, then.
I'm definitely *not* going to do hacks like that.
Come on, that's still better than introducing new pm_op with unclear
semantic, then calling it from half the places that need it.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Paul Mackerras
2007-04-13 21:11:31 UTC
Permalink
Post by Pavel Machek
Post by Johannes Berg
Besides, we can always rewrite it later when somebody introduces a foo
device that fits above model. Since this patch essentially changes
nothing, I'd been hoping to get it into .22 so we can finally start
No. Hacks like this are impossible to get rid of.
OK, we'll just stick with our separate code path that does what we
need, then.

Paul.
Benjamin Herrenschmidt
2007-04-13 21:15:41 UTC
Permalink
Post by Pavel Machek
Post by Johannes Berg
Besides, we can always rewrite it later when somebody introduces a foo
device that fits above model. Since this patch essentially changes
nothing, I'd been hoping to get it into .22 so we can finally start
No. Hacks like this are impossible to get rid of.
it's the cleanest solution, it's you who is a hack

Ben.
Benjamin Herrenschmidt
2007-04-12 11:23:37 UTC
Permalink
Post by Pavel Machek
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)
Anyway, platform devices may need specific ordering on more than power
pc, so perhaps we should just make ordering more stable so that
relying on it is no longer a hack?
It would be pretty hard to do cleanly without raping the core driver
suspend/resume core.

I have experience implementing suspend/resume in all sorts of
environments, and I do strongly beleive that an arch hook right at this
point will be useful to more than that anyway.

There are actions the arch code might need to take after drivers and
before going to "atomic" mode (irqs off mean you can no longer sleep or
take semaphores etc...), and in a similar vein, actions to take on
resume just after interrupts are back and before all the drivers get
woken up.

The decrementer is one example, there might be other processor specific
bits or arch specific bits that are better done at that stage.

I think the way to go for 2.6.22 is to have that hook unless you can
propose me a way to cleanly provide a platform device that is guaranteed
to suspend last and resume first in all cases (and even then, I wouldn't
be that happy since the proper decrementer stop procedure involves
really wrapping the "irqs off" operation).

Why not give this added flexibility ? Archs who don't care don't need to
bother and it will make us happy... it's not like we are about to -add-
burden to other architectures.

Ben.
Rafael J. Wysocki
2007-04-12 15:03:15 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Pavel Machek
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)
Anyway, platform devices may need specific ordering on more than power
pc, so perhaps we should just make ordering more stable so that
relying on it is no longer a hack?
It would be pretty hard to do cleanly without raping the core driver
suspend/resume core.
Agreed.
Post by Benjamin Herrenschmidt
I have experience implementing suspend/resume in all sorts of
environments, and I do strongly beleive that an arch hook right at this
point will be useful to more than that anyway.
There are actions the arch code might need to take after drivers and
before going to "atomic" mode (irqs off mean you can no longer sleep or
take semaphores etc...), and in a similar vein, actions to take on
resume just after interrupts are back and before all the drivers get
woken up.
The decrementer is one example, there might be other processor specific
bits or arch specific bits that are better done at that stage.
I think the way to go for 2.6.22 is to have that hook unless you can
propose me a way to cleanly provide a platform device that is guaranteed
to suspend last and resume first in all cases (and even then, I wouldn't
be that happy since the proper decrementer stop procedure involves
really wrapping the "irqs off" operation).
Why not give this added flexibility ? Archs who don't care don't need to
bother and it will make us happy... it's not like we are about to -add-
burden to other architectures.
Well, I think it would be reasonable to add the quiesce()/activate() hooks for
all of the above reasons. However, once we've done that, it'll be quite
difficult to remove them, so we should better be sure they are really really
needed and there's no other way to implement what you need (or all of the
alternative ways are far worse).

Greetings,
Rafael
David Brownell
2007-04-12 16:32:41 UTC
Permalink
Post by Rafael J. Wysocki
Post by Benjamin Herrenschmidt
Why not give this added flexibility ? Archs who don't care don't need to
bother and it will make us happy... it's not like we are about to -add-
burden to other architectures.
Well, I think it would be reasonable to add the quiesce()/activate() hooks for
all of the above reasons.
Makes sense to me. Though I'd rather see pm_set_ops() patch the default
hooks than expect all platforms to change ... that would shrink the size
of the patch adding these, as well as the potential cost of removing them.
Post by Rafael J. Wysocki
However, once we've done that, it'll be quite
difficult to remove them, so we should better be sure they are really really
needed and there's no other way to implement what you need (or all of the
alternative ways are far worse).
In general I think too much of the way PM "works" now is a bit more
due to convenient side effects than by clear intent. So while I agree
that adding hooks should be done with care, this one certain seems to
be a step in the right direction.

- Dave
Johannes Berg
2007-04-13 06:52:49 UTC
Permalink
Post by David Brownell
Makes sense to me. Though I'd rather see pm_set_ops() patch the default
hooks than expect all platforms to change ... that would shrink the size
of the patch adding these, as well as the potential cost of removing them.
I was thinking about that, but wasn't sure if we could do that. I like
interfaces that don't change global variables they are passed. But if I
document it properly I can change the patch to do this easily.

johannes
Johannes Berg
2007-04-13 07:59:53 UTC
Permalink
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.

If not assigned, these two calls will be assigned with defaults during
set_pm_ops.

Signed-off-by: Johannes Berg <***@sipsolutions.net>

---
This doesn't make sense for suspend to disk since there the arch gets
much more control over the process anyway, and using pm_ops for suspend
to disk is a hack in the first place (only makes sense for ACPI).
Therefore, this patch doesn't affect suspend to disk.

You can now chose either version 2 or 3 of this patch :)

include/linux/pm.h | 10 ++++++++++
kernel/power/main.c | 35 ++++++++++++++++++++++++++++++++---
2 files changed, 42 insertions(+), 3 deletions(-)

--- wireless-dev.orig/include/linux/pm.h 2007-04-13 09:39:42.465356880 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-13 09:43:42.985356880 +0200
@@ -135,9 +135,17 @@ typedef int __bitwise suspend_disk_metho
* @prepare: Prepare the platform for the given suspend state. Can return a
* negative error code if necessary.
*
+ * @quiesce: If assigned, this callback must at least turn off IRQs. If
+ * left unassigned, a default callback that does nothing but turn
+ * off IRQs is assigned during pm_set_ops().
+ *
* @enter: Enter the given suspend state, must be assigned. Can return a
* negative error code if necessary.
*
+ * @activate: If assigned, this callback must at least turn on IRQs. If
+ * left unassigned, a default callback that does nothing but turn
+ * on IRQs is assigned during pm_set_ops().
+ *
* @finish: Called when the system has left the given state and all devices
* are resumed. The return value is ignored.
*
@@ -155,7 +163,9 @@ typedef int __bitwise suspend_disk_metho
struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
+ void (*quiesce)(suspend_state_t state);
int (*enter)(suspend_state_t state);
+ void (*activate)(suspend_state_t state);
int (*finish)(suspend_state_t state);
suspend_disk_method_t pm_disk_mode;
};
--- wireless-dev.orig/kernel/power/main.c 2007-04-13 09:39:42.465356880 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-13 09:47:38.645356880 +0200
@@ -33,6 +33,28 @@ struct pm_ops *pm_ops;
suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;

/**
+ * pm_default_quiesce - default quiesce callback
+ *
+ * pm_ops drivers that need to do nothing but disable IRQs
+ * leave .quiesce unassigned and pm_set_ops() assigns this.
+ */
+static void pm_default_quiesce(suspend_state_t state)
+{
+ local_irq_disable();
+}
+
+/**
+ * pm_default_activate - default activate callback
+ *
+ * pm_ops drivers that need to do nothing but enable IRQs
+ * leave .activate unassigned and pm_set_ops assigns this.
+ */
+static void pm_default_activate(suspend_state_t state)
+{
+ local_irq_enable();
+}
+
+/**
* pm_set_ops - Set the global power method table.
* @ops: Pointer to ops structure.
*/
@@ -45,6 +67,12 @@ void pm_set_ops(struct pm_ops * ops)
pm_disk_mode = ops->pm_disk_mode;
} else
pm_disk_mode = PM_DISK_SHUTDOWN;
+
+ if (!ops->quiesce)
+ ops->quiesce = pm_default_quiesce;
+ if (!ops->activate)
+ ops->activate = pm_default_activate;
+
mutex_unlock(&pm_mutex);
}

@@ -132,9 +160,9 @@ static int suspend_prepare(suspend_state
int suspend_enter(suspend_state_t state)
{
int error = 0;
- unsigned long flags;

- local_irq_save(flags);
+ pm_ops->quiesce(state);
+ BUG_ON(!irqs_disabled());

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +171,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ pm_ops->activate(state);
+ BUG_ON(irqs_disabled());
return error;
}
Dmitry Krivoschekov
2007-04-12 17:36:07 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Pavel Machek
Adding platform hook to disable interrupts just because we need one
platform device last is not nice, either. (And does decrementer even
need to come last?)
Anyway, platform devices may need specific ordering on more than power
pc, so perhaps we should just make ordering more stable so that
relying on it is no longer a hack?
It would be pretty hard to do cleanly without raping the core driver
suspend/resume core.
I have experience implementing suspend/resume in all sorts of
environments, and I do strongly beleive that an arch hook right at this
point will be useful to more than that anyway.
There are actions the arch code might need to take after drivers and
before going to "atomic" mode (irqs off mean you can no longer sleep or
take semaphores etc...), and in a similar vein, actions to take on
resume just after interrupts are back and before all the drivers get
woken up.
The decrementer is one example, there might be other processor specific
bits or arch specific bits that are better done at that stage.
It'd be nice adding some examples to patch header, and accurate
explanation what is possible to do at this new level, this would
sign to other people that they don't have to invent theirs own hacks
anymore.
Post by Benjamin Herrenschmidt
I think the way to go for 2.6.22 is to have that hook unless you can
propose me a way to cleanly provide a platform device that is guaranteed
to suspend last and resume first in all cases (and even then, I wouldn't
be that happy since the proper decrementer stop procedure involves
really wrapping the "irqs off" operation).
Why not give this added flexibility ? Archs who don't care don't need to
bother and it will make us happy... it's not like we are about to -add-
burden to other architectures.
Actually, I personally two hands up for adding the flexibility,
but you should define what is supposed to do on this level and
what is don't, or not desirable.

For example, I'd like to enter back to suspend mode
right from "activate" stage, because I've woken up just
to update some data and I do not want to resume all devices
for that, is it ok for "activate"?


Thanks,
Dmitry
Benjamin Herrenschmidt
2007-04-12 20:51:16 UTC
Permalink
Post by Dmitry Krivoschekov
Actually, I personally two hands up for adding the flexibility,
but you should define what is supposed to do on this level and
what is don't, or not desirable.
For example, I'd like to enter back to suspend mode
right from "activate" stage, because I've woken up just
to update some data and I do not want to resume all devices
for that, is it ok for "activate"?
That's an interesting approach... might be something we can define via
activate result code...

Ben.
Johannes Berg
2007-04-13 06:54:54 UTC
Permalink
Post by Dmitry Krivoschekov
It'd be nice adding some examples to patch header, and accurate
explanation what is possible to do at this new level, this would
sign to other people that they don't have to invent theirs own hacks
anymore.
That's pretty hard to do generically since it'll most likely differ for
all platforms. I can paste some powermac specific code but that's
unlikely to help anybody.
Post by Dmitry Krivoschekov
For example, I'd like to enter back to suspend mode
right from "activate" stage, because I've woken up just
to update some data and I do not want to resume all devices
for that, is it ok for "activate"?
I'd think that better be done within ->enter itself, no? Like putting a
loop into ->enter:

do {
go to sleep;
while (should sleep);

johannes
David Brownell
2007-04-13 08:04:22 UTC
Permalink
Post by Johannes Berg
Post by Dmitry Krivoschekov
For example, I'd like to enter back to suspend mode
right from "activate" stage, because I've woken up just
to update some data and I do not want to resume all devices
for that, is it ok for "activate"?
That came up on the list a while back in some email from MontaVista, ISTR.
And ISTR some scientific or industrial system needed it too.
Post by Johannes Berg
I'd think that better be done within ->enter itself, no? Like putting a
do {
go to sleep;
while (should sleep);
Better answer, yes. Plus remember that the "activate" stage is at a
weird spot in the resume sequence: some of the devices have been
resumed, but not all of them.

Surely if this "partial wakeup" scenario is needed, it's going to
be highly platform-specific which devices need to be resumed so
the relvant work can be done?


In fact, I wonder if a better way to look at this might not be
that it's not a real "system sleep" state so much as just a deep
runtime PM state ... the sort of thing that the idle loop ought
to be able to enter, in a system that's been quiesced far enough
because of NO_HZ, power-aware drivers, and general inactivity.
For some systems I'm relatively sure that'd be a good perspective:
ones where the runtime PM is good enough.


Another solution recognized that the only work that was needed
was updating certain counters and GPIOs. So the code that went
to sleep knew about various levels of wakeup; if it was just the
counting update, it left most clocks (and the DRAM!) off, and
went right back to sleep. The real intelligence was in the
sleep logic called by enter(). The generalization was in that
down'n'dirty platform code; it had callouts to other code which
also lived in the SRAM, and not every wake event caused return
from enter().

- Dave
Johannes Berg
2007-04-13 08:59:16 UTC
Permalink
Post by Dmitry Krivoschekov
For example, I'd like to enter back to suspend mode
right from "activate" stage, because I've woken up just
to update some data and I do not want to resume all devices
for that, is it ok for "activate"?
Come to think of it, this sounds more like a confusion of terms instead
of actually thinking "activate" would be good.

quiesce here is mostly to turn off IRQs and activate mostly to turn them
back on. The enter callback is what should actually enter suspend. I can
see how you can confused "enter" suspend and "activate" suspend though.

The docs in the patch I posted should make it clear enough, but if
somebody can come up with a better name for "activate" I'm all ears.

johannes
Benjamin Herrenschmidt
2007-04-13 09:07:53 UTC
Permalink
Post by Johannes Berg
The docs in the patch I posted should make it clear enough, but if
somebody can come up with a better name for "activate" I'm all ears.
I personally quite like your original irq_off/irq_on :-)

But I won't put up a fight for it if others disagree..

Ben.
Rafael J. Wysocki
2007-04-13 11:47:29 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Johannes Berg
The docs in the patch I posted should make it clear enough, but if
somebody can come up with a better name for "activate" I'm all ears.
I personally quite like your original irq_off/irq_on :-)
But I won't put up a fight for it if others disagree..
Well, I didn't like them, because they should refer to the local CPU only (the
other ones are supposed to be disabled at that point). Also, since we are
going to do some things apart from disabling the (local) IRQs in there, I
thought some other names would be better. Still, if that's confusing, I'm not
going to fight either. :-)

Greetings,
Rafael
Johannes Berg
2007-04-13 12:58:00 UTC
Permalink
Post by Rafael J. Wysocki
Well, I didn't like them, because they should refer to the local CPU only (the
other ones are supposed to be disabled at that point). Also, since we are
going to do some things apart from disabling the (local) IRQs in there, I
thought some other names would be better.
True.
Post by Rafael J. Wysocki
Still, if that's confusing, I'm not
going to fight either. :-)
Well it seemed to me that Dmitry's question might have been due to a
slight confusion here. Then again, the docs are there and clearly
describe what to do... I've clarified the description again and will
resend in a minute.

Do we want the patch that munges pm_ops when registering or the one that
modifies all current users?

johannes
Johannes Berg
2007-04-13 13:26:59 UTC
Permalink
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.

If not assigned, these two calls will be assigned with defaults during
set_pm_ops.

Signed-off-by: Johannes Berg <***@sipsolutions.net>

---
include/linux/pm.h | 12 ++++++++++++
kernel/power/main.c | 35 ++++++++++++++++++++++++++++++++---
2 files changed, 44 insertions(+), 3 deletions(-)

--- wireless-dev.orig/include/linux/pm.h 2007-04-13 10:09:17.835356880 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-13 14:57:01.525356880 +0200
@@ -135,9 +135,19 @@ typedef int __bitwise suspend_disk_metho
* @prepare: Prepare the platform for the given suspend state. Can return a
* negative error code if necessary.
*
+ * @quiesce: This callback is called after devices are suspended but before
+ * they are powered down. If assigned, this callback must at least turn
+ * off local IRQs. If left unassigned, a default callback that does
+ * nothing but turn off local IRQs is assigned during pm_set_ops().
+ *
* @enter: Enter the given suspend state, must be assigned. Can return a
* negative error code if necessary.
*
+ * @activate: This callback is called after devices are powered up but
+ * before they resume. If assigned, this callback must at least turn
+ * on local IRQs. If left unassigned, a default callback that does
+ * nothing but turn on local IRQs is assigned during pm_set_ops().
+ *
* @finish: Called when the system has left the given state and all devices
* are resumed. The return value is ignored.
*
@@ -155,7 +165,9 @@ typedef int __bitwise suspend_disk_metho
struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
+ void (*quiesce)(suspend_state_t state);
int (*enter)(suspend_state_t state);
+ void (*activate)(suspend_state_t state);
int (*finish)(suspend_state_t state);
suspend_disk_method_t pm_disk_mode;
};
--- wireless-dev.orig/kernel/power/main.c 2007-04-13 10:09:17.835356880 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-13 14:53:44.975356880 +0200
@@ -33,6 +33,28 @@ struct pm_ops *pm_ops;
suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;

/**
+ * pm_default_quiesce - default quiesce callback
+ *
+ * pm_ops drivers that need to do nothing but disable IRQs
+ * leave .quiesce unassigned and pm_set_ops() assigns this.
+ */
+static void pm_default_quiesce(suspend_state_t state)
+{
+ local_irq_disable();
+}
+
+/**
+ * pm_default_activate - default activate callback
+ *
+ * pm_ops drivers that need to do nothing but enable IRQs
+ * leave .activate unassigned and pm_set_ops assigns this.
+ */
+static void pm_default_activate(suspend_state_t state)
+{
+ local_irq_enable();
+}
+
+/**
* pm_set_ops - Set the global power method table.
* @ops: Pointer to ops structure.
*/
@@ -45,6 +67,12 @@ void pm_set_ops(struct pm_ops * ops)
pm_disk_mode = ops->pm_disk_mode;
} else
pm_disk_mode = PM_DISK_SHUTDOWN;
+
+ if (!ops->quiesce)
+ ops->quiesce = pm_default_quiesce;
+ if (!ops->activate)
+ ops->activate = pm_default_activate;
+
mutex_unlock(&pm_mutex);
}

@@ -132,9 +160,9 @@ static int suspend_prepare(suspend_state
int suspend_enter(suspend_state_t state)
{
int error = 0;
- unsigned long flags;

- local_irq_save(flags);
+ pm_ops->quiesce(state);
+ BUG_ON(!irqs_disabled());

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +171,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ pm_ops->activate(state);
+ BUG_ON(irqs_disabled());
return error;
}
Rafael J. Wysocki
2007-04-13 20:43:59 UTC
Permalink
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.
If not assigned, these two calls will be assigned with defaults during
set_pm_ops.
I have no objections. Pavel, what do you think?

Rafael
Post by Johannes Berg
---
include/linux/pm.h | 12 ++++++++++++
kernel/power/main.c | 35 ++++++++++++++++++++++++++++++++---
2 files changed, 44 insertions(+), 3 deletions(-)
--- wireless-dev.orig/include/linux/pm.h 2007-04-13 10:09:17.835356880 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-13 14:57:01.525356880 +0200
@@ -135,9 +135,19 @@ typedef int __bitwise suspend_disk_metho
* negative error code if necessary.
*
+ * they are powered down. If assigned, this callback must at least turn
+ * off local IRQs. If left unassigned, a default callback that does
+ * nothing but turn off local IRQs is assigned during pm_set_ops().
+ *
* negative error code if necessary.
*
+ * before they resume. If assigned, this callback must at least turn
+ * on local IRQs. If left unassigned, a default callback that does
+ * nothing but turn on local IRQs is assigned during pm_set_ops().
+ *
* are resumed. The return value is ignored.
*
@@ -155,7 +165,9 @@ typedef int __bitwise suspend_disk_metho
struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
+ void (*quiesce)(suspend_state_t state);
int (*enter)(suspend_state_t state);
+ void (*activate)(suspend_state_t state);
int (*finish)(suspend_state_t state);
suspend_disk_method_t pm_disk_mode;
};
--- wireless-dev.orig/kernel/power/main.c 2007-04-13 10:09:17.835356880 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-13 14:53:44.975356880 +0200
@@ -33,6 +33,28 @@ struct pm_ops *pm_ops;
suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;
/**
+ * pm_default_quiesce - default quiesce callback
+ *
+ * pm_ops drivers that need to do nothing but disable IRQs
+ * leave .quiesce unassigned and pm_set_ops() assigns this.
+ */
+static void pm_default_quiesce(suspend_state_t state)
+{
+ local_irq_disable();
+}
+
+/**
+ * pm_default_activate - default activate callback
+ *
+ * pm_ops drivers that need to do nothing but enable IRQs
+ * leave .activate unassigned and pm_set_ops assigns this.
+ */
+static void pm_default_activate(suspend_state_t state)
+{
+ local_irq_enable();
+}
+
+/**
* pm_set_ops - Set the global power method table.
*/
@@ -45,6 +67,12 @@ void pm_set_ops(struct pm_ops * ops)
pm_disk_mode = ops->pm_disk_mode;
} else
pm_disk_mode = PM_DISK_SHUTDOWN;
+
+ if (!ops->quiesce)
+ ops->quiesce = pm_default_quiesce;
+ if (!ops->activate)
+ ops->activate = pm_default_activate;
+
mutex_unlock(&pm_mutex);
}
@@ -132,9 +160,9 @@ static int suspend_prepare(suspend_state
int suspend_enter(suspend_state_t state)
{
int error = 0;
- unsigned long flags;
- local_irq_save(flags);
+ pm_ops->quiesce(state);
+ BUG_ON(!irqs_disabled());
if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +171,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
- local_irq_restore(flags);
+ pm_ops->activate(state);
+ BUG_ON(irqs_disabled());
return error;
}
Pavel Machek
2007-04-13 20:58:37 UTC
Permalink
Hi!
Post by Rafael J. Wysocki
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.
If not assigned, these two calls will be assigned with defaults during
set_pm_ops.
I have no objections. Pavel, what do you think?
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
Post by Rafael J. Wysocki
Post by Johannes Berg
--- wireless-dev.orig/include/linux/pm.h 2007-04-13 10:09:17.835356880 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-13 14:57:01.525356880 +0200
@@ -135,9 +135,19 @@ typedef int __bitwise suspend_disk_metho
* negative error code if necessary.
*
+ * they are powered down. If assigned, this callback must at least turn
+ * off local IRQs. If left unassigned, a default callback that does
+ * nothing but turn off local IRQs is assigned during pm_set_ops().
Is it called for s2ram? swsusp? uswsusp?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-13 21:06:08 UTC
Permalink
Post by Pavel Machek
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
No, you cannot disable it from sysdev handler because you have to be
able to take interrupts. Can we *please* move on to some more productive
discussion?
Post by Pavel Machek
Post by Johannes Berg
--- wireless-dev.orig/include/linux/pm.h 2007-04-13 10:09:17.835356880 +0200
+++ wireless-dev/include/linux/pm.h 2007-04-13 14:57:01.525356880 +0200
@@ -135,9 +135,19 @@ typedef int __bitwise suspend_disk_metho
* negative error code if necessary.
*
+ * they are powered down. If assigned, this callback must at least turn
+ * off local IRQs. If left unassigned, a default callback that does
+ * nothing but turn off local IRQs is assigned during pm_set_ops().
Is it called for s2ram? swsusp? uswsusp?
Whoops, I had that in an earlier version, it's not supposed to be called
for (u)swsusp because pm_ops involvement there is a hack anyway. I'll
fix the description.

johannes
Pavel Machek
2007-04-13 21:12:32 UTC
Permalink
Hi!
Post by Johannes Berg
Post by Pavel Machek
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
No, you cannot disable it from sysdev handler because you have to be
able to take interrupts. Can we *please* move on to some more productive
discussion?
So what happens if you disable it as a platform device? There has to
be better solution than this. What is so special about decrementer?

We were able to handle MTRR registers on PC. Timer setup fit into the
model too...
Post by Johannes Berg
Post by Pavel Machek
Post by Johannes Berg
@@ -135,9 +135,19 @@ typedef int __bitwise suspend_disk_metho
* negative error code if necessary.
*
+ * they are powered down. If assigned, this callback must at least turn
+ * off local IRQs. If left unassigned, a default callback that does
+ * nothing but turn off local IRQs is assigned during pm_set_ops().
Is it called for s2ram? swsusp? uswsusp?
Whoops, I had that in an earlier version, it's not supposed to be called
for (u)swsusp because pm_ops involvement there is a hack anyway. I'll
fix the description.
But (u)swsusp needs to shut down the devices, and it does both suspend
and powerdown, too...?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-13 21:18:05 UTC
Permalink
Post by Pavel Machek
Post by Johannes Berg
Whoops, I had that in an earlier version, it's not supposed to be called
for (u)swsusp because pm_ops involvement there is a hack anyway. I'll
fix the description.
But (u)swsusp needs to shut down the devices, and it does both suspend
and powerdown, too...?
Yes, but the special S4 state is just hacked onto pm_ops, nothing but
ACPI S4 will ever sanely use pm_ops. If I'd invented pm_ops I'd made
pm_ops solely for suspend to ram/standby/... and added a *separate* hook
for pm_disk_mode and S4 because logically that really doesn't belong to
pm_ops which does almost exclusively things for suspend to ram/standby.

johannes
Pavel Machek
2007-04-13 21:33:49 UTC
Permalink
Post by Johannes Berg
Post by Pavel Machek
Post by Johannes Berg
Whoops, I had that in an earlier version, it's not supposed to be called
for (u)swsusp because pm_ops involvement there is a hack anyway. I'll
fix the description.
But (u)swsusp needs to shut down the devices, and it does both suspend
and powerdown, too...?
Yes, but the special S4 state is just hacked onto pm_ops, nothing but
ACPI S4 will ever sanely use pm_ops. If I'd invented pm_ops I'd made
pm_ops solely for suspend to ram/standby/... and added a *separate* hook
for pm_disk_mode and S4 because logically that really doesn't belong to
pm_ops which does almost exclusively things for suspend to ram/standby.
Still, swsusp shuts down the devices:

if ((error = arch_prepare_suspend()))
return error;

local_irq_disable();
/* At this point, device_suspend() has been called, but *not*
* device_power_down(). We *must* device_power_down() now.
* Otherwise, drivers for some devices (e.g. interrupt
controllers)
* become desynchronized with the actual state of the hardware
* at resume time, and evil weirdness ensues.
*/
if ((error = device_power_down(PMSG_FREEZE))) {
printk(KERN_ERR "Some devices failed to power down, aborting suspend\n");
goto Enable_irqs;
}

This local_irq_disable() is very similar to the one you are doing in
suspend-to-RAM. According to your description, it needs to play with
the decrementor, too. But I'd really prefer not to have pm_ops call
here -- exactly because swsusp has nothing to do with pm_ops in
shutdown mode.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-13 21:45:48 UTC
Permalink
Post by Pavel Machek
if ((error = arch_prepare_suspend()))
return error;
local_irq_disable();
/* At this point, device_suspend() has been called, but *not*
* device_power_down(). We *must* device_power_down() now.
* Otherwise, drivers for some devices (e.g. interrupt
controllers)
* become desynchronized with the actual state of the hardware
* at resume time, and evil weirdness ensues.
*/
if ((error = device_power_down(PMSG_FREEZE))) {
printk(KERN_ERR "Some devices failed to power down, aborting suspend\n");
goto Enable_irqs;
}
This local_irq_disable() is very similar to the one you are doing in
suspend-to-RAM. According to your description, it needs to play with
the decrementor, too. But I'd really prefer not to have pm_ops call
here -- exactly because swsusp has nothing to do with pm_ops in
shutdown mode.
Right. But we don't actually care because we just power down, we don't
have to put the CPU into a proper state.

johannes
Pavel Machek
2007-04-13 21:52:41 UTC
Permalink
Post by Johannes Berg
Post by Pavel Machek
if ((error = arch_prepare_suspend()))
return error;
local_irq_disable();
/* At this point, device_suspend() has been called, but *not*
* device_power_down(). We *must* device_power_down() now.
* Otherwise, drivers for some devices (e.g. interrupt
controllers)
* become desynchronized with the actual state of the hardware
* at resume time, and evil weirdness ensues.
*/
if ((error = device_power_down(PMSG_FREEZE))) {
printk(KERN_ERR "Some devices failed to power down, aborting suspend\n");
goto Enable_irqs;
}
This local_irq_disable() is very similar to the one you are doing in
suspend-to-RAM. According to your description, it needs to play with
the decrementor, too. But I'd really prefer not to have pm_ops call
here -- exactly because swsusp has nothing to do with pm_ops in
shutdown mode.
Right. But we don't actually care because we just power down, we don't
have to put the CPU into a proper state.
Look again, this sequence is before taking system snapshot, and then
we have to recover and save the snapshot, too.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-13 21:59:07 UTC
Permalink
Post by Pavel Machek
Look again, this sequence is before taking system snapshot, and then
we have to recover and save the snapshot, too.
I know. But that doesn't make us care.

As I explained previously, the decrementer exception is pending until it
is taken. It doesn't matter one bit whether it is pending during the
snapshot, but it does matter if we need the CPU to go to sleep later
because it won't when the exception is pending.

johannes
Rafael J. Wysocki
2007-04-13 22:18:06 UTC
Permalink
Post by Johannes Berg
Post by Pavel Machek
Look again, this sequence is before taking system snapshot, and then
we have to recover and save the snapshot, too.
I know. But that doesn't make us care.
As I explained previously, the decrementer exception is pending until it
is taken. It doesn't matter one bit whether it is pending during the
snapshot, but it does matter if we need the CPU to go to sleep later
because it won't when the exception is pending.
Technically, they may be not needed by (u)swsusp, but since (u)swsusp already
uses pm_ops in the platform mode, it formally should use these new hooks in
the platform mode too. Otherwise it won't make sense to use pm_ops in
(u)swsusp at all, IMHO.

Greetings,
Rafael
Johannes Berg
2007-04-13 22:20:07 UTC
Permalink
Post by Rafael J. Wysocki
Technically, they may be not needed by (u)swsusp, but since (u)swsusp already
uses pm_ops in the platform mode, it formally should use these new hooks in
the platform mode too.
Sure. I'll do that.
Post by Rafael J. Wysocki
Otherwise it won't make sense to use pm_ops in
(u)swsusp at all, IMHO.
IMHO it never has and never will be, it's just to make it all fit the
ACPI model ;)

johannes
Rafael J. Wysocki
2007-04-13 22:49:34 UTC
Permalink
Post by Johannes Berg
Post by Rafael J. Wysocki
Technically, they may be not needed by (u)swsusp, but since (u)swsusp already
uses pm_ops in the platform mode, it formally should use these new hooks in
the platform mode too.
Sure. I'll do that.
Post by Rafael J. Wysocki
Otherwise it won't make sense to use pm_ops in
(u)swsusp at all, IMHO.
IMHO it never has and never will be, it's just to make it all fit the
ACPI model ;)
I'm not against changing that, as you might have noticed. :-)

Greetings,
Rafael
Johannes Berg
2007-04-13 22:55:17 UTC
Permalink
Post by Rafael J. Wysocki
Post by Johannes Berg
IMHO it never has and never will be, it's just to make it all fit the
ACPI model ;)
I'm not against changing that, as you might have noticed. :-)
Yeah, but I can happily ignore it ;) Maybe if I really have a lot of
time at some point...

johannes
Rafael J. Wysocki
2007-04-13 22:09:32 UTC
Permalink
Post by Pavel Machek
Post by Johannes Berg
Post by Pavel Machek
Post by Johannes Berg
Whoops, I had that in an earlier version, it's not supposed to be called
for (u)swsusp because pm_ops involvement there is a hack anyway. I'll
fix the description.
But (u)swsusp needs to shut down the devices, and it does both suspend
and powerdown, too...?
Yes, but the special S4 state is just hacked onto pm_ops, nothing but
ACPI S4 will ever sanely use pm_ops. If I'd invented pm_ops I'd made
pm_ops solely for suspend to ram/standby/... and added a *separate* hook
for pm_disk_mode and S4 because logically that really doesn't belong to
pm_ops which does almost exclusively things for suspend to ram/standby.
if ((error = arch_prepare_suspend()))
return error;
local_irq_disable();
/* At this point, device_suspend() has been called, but *not*
* device_power_down(). We *must* device_power_down() now.
* Otherwise, drivers for some devices (e.g. interrupt
controllers)
* become desynchronized with the actual state of the hardware
* at resume time, and evil weirdness ensues.
*/
if ((error = device_power_down(PMSG_FREEZE))) {
printk(KERN_ERR "Some devices failed to power down, aborting suspend\n");
goto Enable_irqs;
}
This local_irq_disable() is very similar to the one you are doing in
suspend-to-RAM. According to your description, it needs to play with
the decrementor, too. But I'd really prefer not to have pm_ops call
here -- exactly because swsusp has nothing to do with pm_ops in
shutdown mode.
Hmm, I missed that. :-(

I think we need to make things clear: Either we add the additional hooks to
pm_ops in which case they should be taken into account in the (u)swsusp code
too, or we don't add them at all.

Well, there also is one more solution. Namely, we can add the hooks to pm_ops
and make (u)swsusp use something else instead of pm_ops, but that would require
some more consideration.

Greetings,
Rafael
Johannes Berg
2007-04-13 22:13:26 UTC
Permalink
Post by Rafael J. Wysocki
Hmm, I missed that. :-(
I think we need to make things clear: Either we add the additional hooks to
pm_ops in which case they should be taken into account in the (u)swsusp code
too, or we don't add them at all.
I can do that. I don't care one bit since I will never use pm_ops for
suspend to disk anyway.

johannes
Pavel Machek
2007-04-13 22:16:52 UTC
Permalink
Post by Johannes Berg
Post by Rafael J. Wysocki
Hmm, I missed that. :-(
I think we need to make things clear: Either we add the additional hooks to
pm_ops in which case they should be taken into account in the (u)swsusp code
too, or we don't add them at all.
I can do that. I don't care one bit since I will never use pm_ops for
suspend to disk anyway.
Please don't do that just yet.

I see ppc as being really special here.

Normally, preparation for s2ram and preparation for snapshot is really
similar, but you have exception to that rule. I do think clean
solution is redesigning plaform/sysdevs to have stable order.

If you added one platform and one sysdev, would that _work_? AFAICT
you say it would and Ben says it would not. (Lets leave uglyness
debate for later).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Paul Mackerras
2007-04-14 16:55:13 UTC
Permalink
Post by Pavel Machek
I see ppc as being really special here.
It is, I guess. The salient point is that once the decrementer has an
interrupt pending, the only way to clear it is to take the interrupt.
There is no explicit access to the interrupt pending bit and no
separate enable bit for its interrupt, only the global interrupt
enable for the whole CPU. We have to turn on the global interrupt
enable when finally putting the system into the suspend state (for
suspend to RAM), otherwise on some machines we'll never wake up again.

Effectively, when dealing with the decrementer during preparation for
suspend-to-RAM, we may need to have interrupts enabled briefly in
order to clear a pending decrementer interrupt.

Paul.
Benjamin Herrenschmidt
2007-04-13 22:25:22 UTC
Permalink
Post by Rafael J. Wysocki
Hmm, I missed that. :-(
I think we need to make things clear: Either we add the additional hooks to
pm_ops in which case they should be taken into account in the (u)swsusp code
too, or we don't add them at all.
Well, there also is one more solution. Namely, we can add the hooks to pm_ops
and make (u)swsusp use something else instead of pm_ops, but that would require
some more consideration.
Well, what would be nice would be if swsusp wasn't just such a gross
contraption completely bypassing the rest of the suspend/resume
framework...

Ben.
Pavel Machek
2007-04-13 22:39:22 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Rafael J. Wysocki
Hmm, I missed that. :-(
I think we need to make things clear: Either we add the additional hooks to
pm_ops in which case they should be taken into account in the (u)swsusp code
too, or we don't add them at all.
Well, there also is one more solution. Namely, we can add the hooks to pm_ops
and make (u)swsusp use something else instead of pm_ops, but that would require
some more consideration.
Well, what would be nice would be if swsusp wasn't just such a gross
contraption completely bypassing the rest of the suspend/resume
framework...
Great, so:

* you claim that current code is hacked up

* I claim, that if I apply the patch, the result will be ugly hack.

Good; I think you agree that Johaness' patch certainly does not fix
the uglyness of current code.

=> we both agree that after applying the patch, we'll have hacked-up
code. So... can we do some cleanups before hacking on it more?

If sysdevs need ordering, lets add ordering. If platform devices need
ordering, lets do it. If acpi S4 should not be using pm_ops, we should
probably fix that. If swsusp should not be called from pm_ops, we can
probably change that (but we need to keep userland interface).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Benjamin Herrenschmidt
2007-04-13 23:19:43 UTC
Permalink
Ok, PowerPC Decrementer 101

The processor contains a special register, the decrementer, which
keeps ... decrementing. It can be set to any arbitrary value at any time
and will decrement in sync with the processor timebase.

There are some subtle differences between implementations regarding what
happens when reaching 0, but the basic idea is that you get an interrupt
(depending on the processor, that interrupt is somewhat a level
interrupt asserted when the decrementer is negative or it can be a kind
of edge interrupt queued up when the dec transitions from 0 to -1).

This decrementer is used as the main timer. Thus it needs to be
operating normally at all time until interrupts are off or the scheduler
will stop working properly, kernel timers will not fire, etc...

(and saying that platforms devices should use mdelay instead is just
gross, I won't even go there. Interrupts are still on -> the core kernel
should operate normally and that includes the main timer source).

Now what happens when we put the processors (well, most desktop
processors, at least the one that concern us in that discussion) to
sleep is that they get out of sleep when an interrupt occur, for
example ... a decrementer interrupt. This is not good for STR for
various reasons related to the way STR works in hardware (the
northbridge snoops that the CPU is going to sleep and starts putting
things down, ultimately shutting the CPU off, it can't really cope if
the CPU wakes up right away and start doing things). Unfortunately, for
other reasons, the procedure of putting the CPU to sleep involves
turning interrupts on. For all external interrupts, that isn't a problem
as we have previously shut them all down on the main PIC, but it is with
the DEC.

The "trick" is that once interrupts are off, we want the DEC to be set
to such a high value that it won't tick anytime soon (that is actually
several seconds, enough in practice). But if we do that after IRQs have
been turned off (from a sysdev), we have the risk that it might have
ticked between turning IRQs off and our sysdev, and thus a DEC
interrupts is already "queued up" (especially on CPUs where it acts as
an edge interrupt) and will screw up our attempt to put the CPU to sleep
later on.

The procedure we use is to set it to 0x7fffffff with IRQs on, then turn
IRQs off, then set it back to 0x7fffffff in case it kicked in just
before and the timer interrupt set it back to a short value. As you can
imagine, thoseh have to be done close together as part of the main irq
disabling procedure, after platform devices have run (that is we can
consider the scheduler as "off") and before sysdev's etc...

Now, in addition to that, we have some weird motherboard stuff we need
to turn off/on, which has to be done after drivers (because it renders
various busses inaccessible in some cases, and might cause DMA snooping
to stop working, I'm not 100% sure, but I know for sure it has to be
done late) but can't be done as a sysdev because we need some
infrastructure like the i2c stuff (and others) that requires semaphores
and timers. It's based on something remotely akin to AML in that we have
to execute "scripts" provided by the firmware and the code to do so need
to run in an environment where scheduler & timers are operating.

That later thing could be dealt with using a platform device if we could
guarantee that platform device is put to sleep last of all devices in
the tree and woken up first. Right now, we have no such guarantee and no
mecanism for it, and I don't see a solution showing up for 2.6.22

In the long run, we might be able to break up that phase to have each
individual device that has such functions associated have ways to call
into them after the device has been put to sleep, but that involves more
complication, probably hook in the generic PCI code etc... and more
ordering issues vs. some motherboard foo so it's definitely not on the
short term radar.

For all those reasons, I do think that the proper, clean and incremental
approach to get our stuff working is to have that pair of hooks allowing
us to "replace" the local_irq_disable/enable calls...

Now it does not need to be pm_ops. I'm fine with arch_pm_irq_quiesce()
kind of thing (or find a better name if you can, maybe
arch_pm_after_devices_suspend() arch_pm_before_device_wakeup() ?) and
have the default implementation of these just do
local_irq_disable/enable.

It's basically about quiescing the scheduler/timers, which on powerpc
(bcs of the way the DEC operates) requires a little bit more than just a
call to local_irq_disable. And once the hook is there, use it for some
other arch specific bits that we can't quite fit anywhere else at the
moment.

Ben.
Rafael J. Wysocki
2007-04-14 09:14:55 UTC
Permalink
Post by Benjamin Herrenschmidt
Ok, PowerPC Decrementer 101
The processor contains a special register, the decrementer, which
keeps ... decrementing. It can be set to any arbitrary value at any time
and will decrement in sync with the processor timebase.
There are some subtle differences between implementations regarding what
happens when reaching 0, but the basic idea is that you get an interrupt
(depending on the processor, that interrupt is somewhat a level
interrupt asserted when the decrementer is negative or it can be a kind
of edge interrupt queued up when the dec transitions from 0 to -1).
This decrementer is used as the main timer. Thus it needs to be
operating normally at all time until interrupts are off or the scheduler
will stop working properly, kernel timers will not fire, etc...
(and saying that platforms devices should use mdelay instead is just
gross, I won't even go there. Interrupts are still on -> the core kernel
should operate normally and that includes the main timer source).
Now what happens when we put the processors (well, most desktop
processors, at least the one that concern us in that discussion) to
sleep is that they get out of sleep when an interrupt occur, for
example ... a decrementer interrupt. This is not good for STR for
various reasons related to the way STR works in hardware (the
northbridge snoops that the CPU is going to sleep and starts putting
things down, ultimately shutting the CPU off, it can't really cope if
the CPU wakes up right away and start doing things). Unfortunately, for
other reasons, the procedure of putting the CPU to sleep involves
turning interrupts on. For all external interrupts, that isn't a problem
as we have previously shut them all down on the main PIC, but it is with
the DEC.
The "trick" is that once interrupts are off, we want the DEC to be set
to such a high value that it won't tick anytime soon (that is actually
several seconds, enough in practice). But if we do that after IRQs have
been turned off (from a sysdev), we have the risk that it might have
ticked between turning IRQs off and our sysdev, and thus a DEC
interrupts is already "queued up" (especially on CPUs where it acts as
an edge interrupt) and will screw up our attempt to put the CPU to sleep
later on.
The procedure we use is to set it to 0x7fffffff with IRQs on, then turn
IRQs off, then set it back to 0x7fffffff in case it kicked in just
before and the timer interrupt set it back to a short value. As you can
imagine, thoseh have to be done close together as part of the main irq
disabling procedure, after platform devices have run (that is we can
consider the scheduler as "off") and before sysdev's etc...
Now, in addition to that, we have some weird motherboard stuff we need
to turn off/on, which has to be done after drivers (because it renders
various busses inaccessible in some cases, and might cause DMA snooping
to stop working, I'm not 100% sure, but I know for sure it has to be
done late) but can't be done as a sysdev because we need some
infrastructure like the i2c stuff (and others) that requires semaphores
and timers. It's based on something remotely akin to AML in that we have
to execute "scripts" provided by the firmware and the code to do so need
to run in an environment where scheduler & timers are operating.
That later thing could be dealt with using a platform device if we could
guarantee that platform device is put to sleep last of all devices in
the tree and woken up first. Right now, we have no such guarantee and no
mecanism for it, and I don't see a solution showing up for 2.6.22
In the long run, we might be able to break up that phase to have each
individual device that has such functions associated have ways to call
into them after the device has been put to sleep, but that involves more
complication, probably hook in the generic PCI code etc... and more
ordering issues vs. some motherboard foo so it's definitely not on the
short term radar.
For all those reasons, I do think that the proper, clean and incremental
approach to get our stuff working is to have that pair of hooks allowing
us to "replace" the local_irq_disable/enable calls...
Now it does not need to be pm_ops. I'm fine with arch_pm_irq_quiesce()
kind of thing (or find a better name if you can, maybe
arch_pm_after_devices_suspend() arch_pm_before_device_wakeup() ?) and
have the default implementation of these just do
local_irq_disable/enable.
I like this idea.

Greetings,
Rafael
Johannes Berg
2007-04-14 09:19:29 UTC
Permalink
Post by Rafael J. Wysocki
Post by Benjamin Herrenschmidt
Now it does not need to be pm_ops. I'm fine with arch_pm_irq_quiesce()
kind of thing (or find a better name if you can, maybe
arch_pm_after_devices_suspend() arch_pm_before_device_wakeup() ?) and
have the default implementation of these just do
local_irq_disable/enable.
I like this idea.
I don't really. There can possibly be multiple pm_ops for one arch, who
knows that they all need to do the same thing here? (It would probably
be true for us right now though.)

Also, all other things suspend to ram does go through pm_ops so IMHO
adding an arch hook here now would just unnecessarily complicate the
whole thing since suddenly you may need to do more than just pm_ops for
suspend to ram.

On the other hand, if doing an arch hook here and something like
CONFIG_ARCH_HAS_SUSPEND_IRQ_HOOKS would actually allow us to end this
pointless discussion, then why not.

johannes
Benjamin Herrenschmidt
2007-04-15 00:19:42 UTC
Permalink
Post by Johannes Berg
On the other hand, if doing an arch hook here and something like
CONFIG_ARCH_HAS_SUSPEND_IRQ_HOOKS would actually allow us to end this
pointless discussion, then why not.
The preferred approach for that sort of thing nowadays is to have
the generic code contain

#ifndef arch_foo_bar()
#define arch_foo_bar() do { } while(0)
#endif

And have the arch code do

void arch_foo_bar(void)
{
xxx
}
#define arch_foo_bar arch_foo_bar

Ben.
Pavel Machek
2007-04-16 07:32:49 UTC
Permalink
Hi!
Post by Benjamin Herrenschmidt
Ok, PowerPC Decrementer 101
(Thanks for explanation).
Post by Benjamin Herrenschmidt
Now, in addition to that, we have some weird motherboard stuff we need
to turn off/on, which has to be done after drivers (because it renders
various busses inaccessible in some cases, and might cause DMA snooping
to stop working, I'm not 100% sure, but I know for sure it has to be
done late) but can't be done as a sysdev because we need some
infrastructure like the i2c stuff (and others) that requires semaphores
and timers. It's based on something remotely akin to AML in that we have
to execute "scripts" provided by the firmware and the code to do so need
to run in an environment where scheduler & timers are operating.
Does the "weird motherboard" stuff need to be suspended/resumed for
swsusp memory snapshot?
Post by Benjamin Herrenschmidt
For all those reasons, I do think that the proper, clean and incremental
approach to get our stuff working is to have that pair of hooks allowing
us to "replace" the local_irq_disable/enable calls...
Now it does not need to be pm_ops. I'm fine with arch_pm_irq_quiesce()
kind of thing (or find a better name if you can, maybe
Well, I guess arch_pm_irq_quiesce_for_s2ram() would be
acceptable... but that would be only called for s2ram... which should
be enough for decrementer AFAICT.
Post by Benjamin Herrenschmidt
It's basically about quiescing the scheduler/timers, which on powerpc
(bcs of the way the DEC operates) requires a little bit more than just a
call to local_irq_disable. And once the hook is there, use it for some
other arch specific bits that we can't quite fit anywhere else at the
moment.
As decrementer is special for s2ram, we can add the hook. (It is
single hook). If we need to do something for snapshots, too... well,
we can still add arch_pm_irq_quiesce_for_snapshot() and
arch_pm_irq_quiesce_for_powerdown() etc, but it would get ugly fast.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-16 08:37:13 UTC
Permalink
Post by Pavel Machek
As decrementer is special for s2ram, we can add the hook. (It is
single hook).
No, it's two, for enabling as well.

johannes
Pavel Machek
2007-04-16 12:47:33 UTC
Permalink
Hi!
Post by Johannes Berg
Post by Pavel Machek
As decrementer is special for s2ram, we can add the hook. (It is
single hook).
No, it's two, for enabling as well.
Two is ok, six would not be.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Paul Mackerras
2007-04-17 04:58:29 UTC
Permalink
Post by Johannes Berg
No, it's two, for enabling as well.
I'm not sure we actually need a special enable hook. I assume we have
a sysdev for the timekeeping stuff, whose resume function gets called
before interrupts are enabled. That can set the decrementer to a
suitable value. If we get an extra decrementer interrupt when
interrupts are enabled with local_irq_enable, it doesn't matter.

Paul.
Benjamin Herrenschmidt
2007-04-18 07:50:56 UTC
Permalink
Post by Pavel Machek
Does the "weird motherboard" stuff need to be suspended/resumed for
swsusp memory snapshot?
I don't think so... I need to get into the details to verify, it might
make sense to run some of the "resume" bits of it when resuming a STD
kernel but I doubt it. It won't harm tho.
Post by Pavel Machek
Well, I guess arch_pm_irq_quiesce_for_s2ram() would be
acceptable... but that would be only called for s2ram... which should
be enough for decrementer AFAICT.
That's a pretty ugly name :-0 I'd be fine just having
arch_pm_irq_quiesce() and only call it from the right s2ram call sites,
or pass it an argument indicating the type of suspend and let the arch
decide what to do. I fail to see the point in being so overly
restrictive on something which will in the long run have little to no
impact.
Post by Pavel Machek
As decrementer is special for s2ram, we can add the hook. (It is
single hook). If we need to do something for snapshots, too... well,
we can still add arch_pm_irq_quiesce_for_snapshot() and
arch_pm_irq_quiesce_for_powerdown() etc, but it would get ugly fast.
Which is why I think we should stick with the pair

arch_pm_irq_quiesce() and arch_pm_irq_activate() or whatever name you
come up with, and either have them only called for s2ram or pass the
suspend type to them, I don't like the over long weird names you are
coming up with but that's a detail :-)

Ben.
Rafael J. Wysocki
2007-04-13 22:47:56 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Rafael J. Wysocki
Hmm, I missed that. :-(
I think we need to make things clear: Either we add the additional hooks to
pm_ops in which case they should be taken into account in the (u)swsusp code
too, or we don't add them at all.
Well, there also is one more solution. Namely, we can add the hooks to pm_ops
and make (u)swsusp use something else instead of pm_ops, but that would require
some more consideration.
Well, what would be nice would be if swsusp wasn't just such a gross
contraption completely bypassing the rest of the suspend/resume
framework...
IMHO, swsusp is a really special case, especially when it doesn't use pm_ops.

On a PC it may look like a "typical" suspend/resume operation, but I think it
generally is not one. Although we try to use as many pieces of the framework
in swsusp as we can, we're still running into problems with that. There are
just too many _practical_ differences.

Greetings,
Rafael
Benjamin Herrenschmidt
2007-04-13 21:18:50 UTC
Permalink
Post by Pavel Machek
Hi!
Post by Johannes Berg
Post by Pavel Machek
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
No, you cannot disable it from sysdev handler because you have to be
able to take interrupts. Can we *please* move on to some more productive
discussion?
So what happens if you disable it as a platform device? There has to
be better solution than this. What is so special about decrementer?
That you are dead since once it's disabled, kernel timers etc... stop
working.
Post by Pavel Machek
We were able to handle MTRR registers on PC. Timer setup fit into the
model too...
I don't care what you do on x86. If I tell you I need to take control at
that point, then I need. Beside I really can't understand why you are so
dead set against an arch hook there. It's not like this will do anything
to you... it can't propagate to drivers, it's purely arch stuff, so
what ?

Ben.
Pavel Machek
2007-04-13 21:56:09 UTC
Permalink
Hi!
Post by Benjamin Herrenschmidt
Post by Pavel Machek
Post by Johannes Berg
Post by Pavel Machek
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
No, you cannot disable it from sysdev handler because you have to be
able to take interrupts. Can we *please* move on to some more productive
discussion?
So what happens if you disable it as a platform device? There has to
be better solution than this. What is so special about decrementer?
That you are dead since once it's disabled, kernel timers etc... stop
working.
Post by Pavel Machek
We were able to handle MTRR registers on PC. Timer setup fit into the
model too...
I don't care what you do on x86. If I tell you I need to take control at
that point, then I need. Beside I really can't understand why you are so
dead set against an arch hook there. It's not like this will do anything
to you... it can't propagate to drivers, it's purely arch stuff, so
what ?
There are about 7 places where we do stop_platform_devices()
local_irq_disable() stop_sysdevs(). You call your hook from one of
them and I suspect you need to call it from more than one. Plus, the
hook needs to have documented semantics, so people know if they should
do local_irq_disable() or if they should call your hook.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-13 22:01:47 UTC
Permalink
Post by Pavel Machek
There are about 7 places where we do stop_platform_devices()
local_irq_disable() stop_sysdevs().
I can only find three of those.
Post by Pavel Machek
You call your hook from one of
them and I suspect you need to call it from more than one. Plus, the
hook needs to have documented semantics, so people know if they should
do local_irq_disable() or if they should call your hook.
All but one are related to suspend to disk. I don't see anybody
inventing yet another entry point to the kernel that actually uses
pm_ops internally, but if they do surely they'll follow the code that is
there.

johannes
Benjamin Herrenschmidt
2007-04-13 22:24:10 UTC
Permalink
Post by Pavel Machek
There are about 7 places where we do stop_platform_devices()
local_irq_disable() stop_sysdevs(). You call your hook from one of
them and I suspect you need to call it from more than one. Plus, the
hook needs to have documented semantics, so people know if they should
do local_irq_disable() or if they should call your hook.
The simple fact that these sequence is copied in 7 different places is a
pretty good illustration of what's wrong with the core pm code ...

Ben.
Benjamin Herrenschmidt
2007-04-13 21:15:10 UTC
Permalink
Post by Pavel Machek
Hi!
Post by Rafael J. Wysocki
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.
If not assigned, these two calls will be assigned with defaults during
set_pm_ops.
I have no objections. Pavel, what do you think?
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
You don't know what you're talking about... so please stop being an ass
and be constructive for once.

We need to deal with the DEC -there-, and with other arch issues while
at it. A sysdev will not do. A non-sydev device will not do due to
ordering issues among other things, at least not without some
significant rework of the driver core PM code.

Thus this is the solution to make it work. Now, if you have decided that
you will refuse making it work on powermacs, then I'll have no choice
but try to push that patch through anyway since you aren't proposing any
alternative solution that would work for us.

(BTW. I still wonder how ACPI can get away without such a hook either
since they motherboard suspend/resume code needs to run with irqs on and
needs to run after all devices are suspended... if they rely on magic
ordering of devices, that would explain why the stuff isn't reliable.
Ordering of devices at the toplevel is -not- reliable in the current
model).

Ben.
Pavel Machek
2007-04-13 21:50:45 UTC
Permalink
Hi!
Post by Benjamin Herrenschmidt
Post by Pavel Machek
Post by Rafael J. Wysocki
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
introduces pm_ops.quiesce and pm_ops.activate which will be called instead
of disabling/enabling irqs so platforms get control over this step.
If not assigned, these two calls will be assigned with defaults during
set_pm_ops.
I have no objections. Pavel, what do you think?
That it is same old hack it used to be. There's nothing magic about
decrementer, and I do not think it is _neccessary_ to touch it just
before cli. Just disable it from sysdev handler or something.
You don't know what you're talking about... so please stop being an ass
and be constructive for once.
It is quite hard to be constructive when you call me body parts. I do
not mind being called a horse, but I'm definitely not for rent.
Post by Benjamin Herrenschmidt
We need to deal with the DEC -there-, and with other arch issues while
You still did not tell me what other arch issues.
Post by Benjamin Herrenschmidt
at it. A sysdev will not do. A non-sydev device will not do due to
ordering issues among other things, at least not without some
significant rework of the driver core PM code.
Lets do the rework of core PM code, then.
Post by Benjamin Herrenschmidt
Thus this is the solution to make it work. Now, if you have decided that
you will refuse making it work on powermacs, then I'll have no choice
but try to push that patch through anyway since you aren't proposing any
alternative solution that would work for us.
There's more than one local_irq_disable() in suspend code. I need to
know what your magic does, to be able to decide when to call it. (See
the other mail).

You seem to be pushing "this absolutely needs to be done for 2.6.22,
and lets screw the design, we need it now". I disagree. I will not
accept "You don't know what you are talking about, so just merge our
code, we will not tell you". Sorry.

If you absolutely must, just do #ifdef CONFIG_PPC at the specific
part... that way it is at least clear who is responsible, and the
braindamage does not spread.
Post by Benjamin Herrenschmidt
(BTW. I still wonder how ACPI can get away without such a hook either
since they motherboard suspend/resume code needs to run with irqs on and
needs to run after all devices are suspended... if they rely on magic
What motherboard code? We do not have anything like that on PC.

We have "drivers" for few chips every PC has, like i8259 timer,
interrupt controller, APIC, etc. Sysdevs and platform devices are
designed for hardware like that.

If you want to have a rule "platform device is not supposed to rely on
timers still working", I think you can get that. Such low-level
devices should really use udelay()/mdelay() that spins and does not
need timer hw.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Benjamin Herrenschmidt
2007-04-13 22:23:25 UTC
Permalink
Post by Pavel Machek
We have "drivers" for few chips every PC has, like i8259 timer,
interrupt controller, APIC, etc. Sysdevs and platform devices are
designed for hardware like that.
If you want to have a rule "platform device is not supposed to rely on
timers still working", I think you can get that. Such low-level
devices should really use udelay()/mdelay() that spins and does not
need timer hw.
That's nonsense.

Ben.
David Brownell
2007-04-14 22:10:23 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Pavel Machek
We have "drivers" for few chips every PC has, like i8259 timer,
interrupt controller, APIC, etc. Sysdevs and platform devices are
designed for hardware like that.
If you want to have a rule "platform device is not supposed to rely on
timers still working", I think you can get that. Such low-level
devices should really use udelay()/mdelay() that spins and does not
need timer hw.
That's nonsense.
Agreed. Platform devices are the *primary* device types on
most embedded systems. Their suspend() and resume() methods
are allowed to use timers and sleep() just like they are with
any other bus type.

We do know that timers aren't available during the suspend_late()
and resume_early() calls, for any bus type.
Pavel Machek
2007-04-13 21:05:11 UTC
Permalink
Hi!
Post by Dmitry Krivoschekov
Post by Benjamin Herrenschmidt
Why not give this added flexibility ? Archs who don't care don't need to
bother and it will make us happy... it's not like we are about to -add-
burden to other architectures.
Actually, I personally two hands up for adding the flexibility,
but you should define what is supposed to do on this level and
what is don't, or not desirable.
For example, I'd like to enter back to suspend mode
right from "activate" stage, because I've woken up just
to update some data and I do not want to resume all devices
for that, is it ok for "activate"?
No, I do not think we can do that w/o major surgery.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Benjamin Herrenschmidt
2007-04-12 08:44:21 UTC
Permalink
Post by Dmitry Krivoschekov
So, where is that callbacks that are so desirable for powermac?
One thing at a time... better to separate the patch adding the callback
from the powermac side implementation. The current powermac code does
not work with the generic pm_ops framework yet anyway (that's why
Johannes is trying to fix), it uses it's own ioctl and doesn't go
through the core PM stuff yet.

Ben
Johannes Berg
2007-04-17 17:18:10 UTC
Permalink
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
allows architectures to define arch_s2ram_{en,dis}able_irqs in their
asm/suspend.h to have control over this step.

Signed-off-by: Johannes Berg <***@sipsolutions.net>

---
Even though I don't like it, here it is. I still don't see why this
shouldn't be part of pm_ops when pm_ops is responsible for everything
else related to s2ram. But it at least lets me do what I need to.

kernel/power/main.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

--- wireless-dev.orig/kernel/power/main.c 2007-04-17 18:52:24.536830941 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-17 19:09:40.966830941 +0200
@@ -128,13 +128,22 @@ static int suspend_prepare(suspend_state
return error;
}

+#ifndef arch_s2ram_disable_irqs
+#define arch_s2ram_disable_irqs(flags) local_irq_save(*flags)
+#endif
+
+#ifndef arch_s2ram_enable_irqs
+#define arch_s2ram_enable_irqs(flags) local_irq_restore(*flags)
+#endif
+

int suspend_enter(suspend_state_t state)
{
int error = 0;
unsigned long flags;

- local_irq_save(flags);
+ arch_s2ram_disable_irqs(&flags);
+ BUG_ON(!irqs_disabled());

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +152,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ arch_s2ram_enable_irqs(&flags);
+ BUG_ON(irqs_disabled());
return error;
}
Pavel Machek
2007-04-18 11:27:39 UTC
Permalink
Hi!
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
allows architectures to define arch_s2ram_{en,dis}able_irqs in their
asm/suspend.h to have control over this step.
Looks ok to me.



Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Johannes Berg
2007-04-19 11:00:11 UTC
Permalink
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
allows architectures to define arch_s2ram_{en,dis}able_irqs in their
asm/suspend.h to have control over this step.

Signed-off-by: Johannes Berg <***@sipsolutions.net>

---
kernel/power/main.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

--- wireless-dev.orig/kernel/power/main.c 2007-04-17 18:52:24.536830941 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-17 19:09:40.966830941 +0200
@@ -128,13 +128,22 @@ static int suspend_prepare(suspend_state
return error;
}

+#ifndef arch_s2ram_disable_irqs
+#define arch_s2ram_disable_irqs(flags) local_irq_save(*flags)
+#endif
+
+#ifndef arch_s2ram_enable_irqs
+#define arch_s2ram_enable_irqs(flags) local_irq_restore(*flags)
+#endif
+

int suspend_enter(suspend_state_t state)
{
int error = 0;
unsigned long flags;

- local_irq_save(flags);
+ arch_s2ram_disable_irqs(&flags);
+ BUG_ON(!irqs_disabled());

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +152,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ arch_s2ram_enable_irqs(&flags);
+ BUG_ON(irqs_disabled());
return error;
}
Andrew Morton
2007-04-19 21:55:08 UTC
Permalink
On Thu, 19 Apr 2007 13:00:11 +0200
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
allows architectures to define arch_s2ram_{en,dis}able_irqs in their
asm/suspend.h to have control over this step.
what on earth is a decrementer?
Post by Johannes Berg
---
kernel/power/main.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- wireless-dev.orig/kernel/power/main.c 2007-04-17 18:52:24.536830941 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-17 19:09:40.966830941 +0200
@@ -128,13 +128,22 @@ static int suspend_prepare(suspend_state
return error;
}
+#ifndef arch_s2ram_disable_irqs
+#define arch_s2ram_disable_irqs(flags) local_irq_save(*flags)
+#endif
+
+#ifndef arch_s2ram_enable_irqs
+#define arch_s2ram_enable_irqs(flags) local_irq_restore(*flags)
+#endif
+
int suspend_enter(suspend_state_t state)
{
int error = 0;
unsigned long flags;
- local_irq_save(flags);
+ arch_s2ram_disable_irqs(&flags);
hm, ugly. We could use attribute(weak) here, so we don't force
architectures to implement this with a macro. But whatever.


It might be nice to identify which arch header is supposed to define
arch_s2ram_disable_irqs, perhaps via an explicit inclusion of that header.
Johannes Berg
2007-04-20 15:34:25 UTC
Permalink
Post by Andrew Morton
what on earth is a decrementer?
BenH explained it here:
http://thread.gmane.org/gmane.linux.power-management.general/5088/focus=5230
Post by Andrew Morton
It might be nice to identify which arch header is supposed to define
arch_s2ram_disable_irqs, perhaps via an explicit inclusion of that header.
It's supposed to go into asm/suspend.h which is included from
linux/suspend.h. I thought about doing it explicitly but opted against
it in the end. But if I change it to be a weak symbol (somebody tell me
how to please) then that will be moot.

johannes
Greg KH
2007-04-20 06:57:31 UTC
Permalink
Post by Johannes Berg
For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
allows architectures to define arch_s2ram_{en,dis}able_irqs in their
asm/suspend.h to have control over this step.
---
kernel/power/main.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- wireless-dev.orig/kernel/power/main.c 2007-04-17 18:52:24.536830941 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-17 19:09:40.966830941 +0200
@@ -128,13 +128,22 @@ static int suspend_prepare(suspend_state
return error;
}
+#ifndef arch_s2ram_disable_irqs
+#define arch_s2ram_disable_irqs(flags) local_irq_save(*flags)
+#endif
+
+#ifndef arch_s2ram_enable_irqs
+#define arch_s2ram_enable_irqs(flags) local_irq_restore(*flags)
+#endif
Can't you just make this a "weak" symbol and override it in your ppc
code if you really need it, and make a "real" function here in this file
for all other arches?

thanks,

greg k-h
Johannes Berg
2007-04-20 15:31:33 UTC
Permalink
Post by Greg KH
Can't you just make this a "weak" symbol and override it in your ppc
code if you really need it, and make a "real" function here in this file
for all other arches?
I'm not opposed to that. How do I do that?

johannes
Greg KH
2007-04-20 17:39:51 UTC
Permalink
Post by Johannes Berg
Post by Greg KH
Can't you just make this a "weak" symbol and override it in your ppc
code if you really need it, and make a "real" function here in this file
for all other arches?
I'm not opposed to that. How do I do that?
Look at the other examples in the kernel, pcibios_disable_device is one
such example.

thanks,

greg k-h
Johannes Berg
2007-04-20 19:37:18 UTC
Permalink
Thanks. Does this look good?

Haven't tested it yet but I don't see a reason it shouldn't work for me.

Sparse will probably complain about this, should I define them
somewhere?

---
kernel/power/main.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)

--- wireless-dev.orig/kernel/power/main.c 2007-04-20 21:34:01.531107385 +0200
+++ wireless-dev/kernel/power/main.c 2007-04-20 21:34:02.141107385 +0200
@@ -128,13 +128,35 @@ static int suspend_prepare(suspend_state
return error;
}

+/**
+ * arch_s2ram_disable_irqs - disable IRQs for suspend
+ *
+ * Disables IRQs (in the default case) and allows architectures
+ * to override it if more needs to be done.
+ */
+void __attribute__ ((weak)) arch_s2ram_disable_irqs(unsigned long *flags)
+{
+ local_irq_save(*flags);
+}
+
+/**
+ * arch_s2ram_enable_irqs - enable IRQs after suspend
+ *
+ * Enables IRQs (in the default case) and allows architectures
+ * to override it if more needs to be done.
+ */
+void __attribute__ ((weak)) arch_s2ram_enable_irqs(unsigned long *flags)
+{
+ local_irq_restore(*flags);
+}

int suspend_enter(suspend_state_t state)
{
int error = 0;
unsigned long flags;

- local_irq_save(flags);
+ arch_s2ram_disable_irqs(&flags);
+ BUG_ON(!irqs_disabled());

if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +165,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
Done:
- local_irq_restore(flags);
+ arch_s2ram_enable_irqs(&flags);
+ BUG_ON(irqs_disabled());
return error;
}
Andrew Morton
2007-04-20 19:44:48 UTC
Permalink
On Fri, 20 Apr 2007 21:37:18 +0200
Post by Johannes Berg
int suspend_enter(suspend_state_t state)
{
int error = 0;
unsigned long flags;
- local_irq_save(flags);
+ arch_s2ram_disable_irqs(&flags);
+ BUG_ON(!irqs_disabled());
if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "Some devices failed to power down\n");
@@ -143,7 +165,8 @@ int suspend_enter(suspend_state_t state)
error = pm_ops->enter(state);
device_power_up();
- local_irq_restore(flags);
+ arch_s2ram_enable_irqs(&flags);
+ BUG_ON(irqs_disabled());
return error;
}
The second BUG_ON() looks a bit fishy to me. If someone calls
suspend_enter() with local interrupts disabled, it will trigger.
Johannes Berg
2007-04-20 19:51:53 UTC
Permalink
Post by Andrew Morton
Post by Johannes Berg
- local_irq_restore(flags);
+ arch_s2ram_enable_irqs(&flags);
+ BUG_ON(irqs_disabled());
return error;
}
The second BUG_ON() looks a bit fishy to me. If someone calls
suspend_enter() with local interrupts disabled, it will trigger.
Huh, hm, I don't think it's valid to do that. I'm not sure why the code
does irq_save/restore instead of just disable/enable. The only users of
this function are in the power management core code (main.c and user.c)
and they certainly always call it with interrupts enabled, the entry
point for other users is pm_suspend() and that surely cannot be called
with interrupts disabled. Rafael?

johannes
Rafael J. Wysocki
2007-04-20 20:36:28 UTC
Permalink
Post by Johannes Berg
Post by Andrew Morton
Post by Johannes Berg
- local_irq_restore(flags);
+ arch_s2ram_enable_irqs(&flags);
+ BUG_ON(irqs_disabled());
return error;
}
The second BUG_ON() looks a bit fishy to me. If someone calls
suspend_enter() with local interrupts disabled, it will trigger.
Huh, hm, I don't think it's valid to do that. I'm not sure why the code
does irq_save/restore instead of just disable/enable.
Well, I can't say why exactly suspend_enter() usues irq_save/restore, but it
looks like that's not needed. For the suspend to disk we call
local_irq_disable/enable() in the corresponding places and, for example,
acpi_pm_enter apparently uses irq_save/restore() itself too.
Post by Johannes Berg
The only users of this function are in the power management core code (main.c
and user.c) and they certainly always call it with interrupts enabled, the
entry point for other users is pm_suspend() and that surely cannot be called
with interrupts disabled. Rafael?
That's correct, plus pm_suspend() also uses enter_state().

In fact suspend_enter() has been made extern so that we can call it from
kernel/power/user.c and no one else is supposed to use it.

Loading...