Discussion:
[PATCH] specific do_timer_cpu value for nohz off mode
Dimitri Sivanich
2011-11-08 19:11:49 UTC
Resending this.


Allow manual override of the tick_do_timer_cpu.

While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.

Signed-off-by: Dimitri Sivanich <***@sgi.com>
---
kernel/time/tick-sched.c | 98 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 98 insertions(+)

Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -834,6 +834,104 @@ void tick_cancel_sched_timer(int cpu)
}
#endif

+
+#ifdef CONFIG_SYSFS
+/**
+ * sysfs_show_do_timer_cpu - sysfs interface for tick_do_timer_cpu
+ * @dev: unused
+ * @buf: char buffer where value of tick_do_timer_cpu is copied
+ *
+ * Provides sysfs interface for showing the current tick_do_timer_cpu.
+ */
+static ssize_t
+sysfs_show_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ ssize_t count = 0;
+
+ count = snprintf(buf, PAGE_SIZE, "%d\n", tick_do_timer_cpu);
+
+ return count;
+}
+
+/**
+ * sysfs_override_do_timer_cpu - manually override tick_do_timer_cpu
+ * @dev: unused
+ * @buf: cpu number of desired tick_do_timer_cpu
+ * @count: length of buffer
+ *
+ * Takes input from sysfs interface for manually overriding the selected
+ * tick_do_timer_cpu. Only applicable when not running in nohz mode.
+ */
+static ssize_t
+sysfs_override_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ char b[16];
+ size_t ret = count;
+ int c;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+ /* strings from sysfs write are not 0 terminated! */
+ if (count >= sizeof(b))
+ return -EINVAL;
+
+ /* strip off \n: */
+ if (buf[count-1] == '\n')
+ count--;
+ if (count < 1)
+ return -EINVAL;
+
+ memcpy(b, buf, count);
+ b[count] = 0;
+
+ if (sscanf(b, "%d", &c) != 1)
+ return -EINVAL;
+
+ if (!cpu_online(c))
+ return -EINVAL;
+
+ tick_do_timer_cpu = c;
+
+ return ret;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static SYSDEV_ATTR(jiffies_cpu, 0644, sysfs_show_do_timer_cpu,
+ sysfs_override_do_timer_cpu);
+
+static struct sysdev_class timekeeping_sysclass = {
+ .name = "timekeeping",
+};
+
+static struct sys_device device_timekeeping = {
+ .id = 0,
+ .cls = &timekeeping_sysclass,
+};
+
+static int __init init_timekeeping_sysfs(void)
+{
+ int error = sysdev_class_register(&timekeeping_sysclass);
+
+ if (!error)
+ error = sysdev_register(&device_timekeeping);
+ if (!error)
+ error = sysdev_create_file(
+ &device_timekeeping,
+ &attr_jiffies_cpu);
+ return error;
+}
+
+device_initcall(init_timekeeping_sysfs);
+#endif /* SYSFS */
+
/**
* Async notification about clocksource changes
*/
Andrew Morton
2011-11-23 00:08:02 UTC
On Tue, 8 Nov 2011 13:11:49 -0600
Post by Dimitri Sivanich
Resending this.
Not enough times apparently :(
Post by Dimitri Sivanich
Allow manual override of the tick_do_timer_cpu.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
This doesn't really explain what the patch does. "override" with what?
What effects can the user expect to see from doing
the-action-which-you-didn't-describe?
Post by Dimitri Sivanich
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -834,6 +834,104 @@ void tick_cancel_sched_timer(int cpu)
}
#endif
+
+#ifdef CONFIG_SYSFS
+/**
+ * sysfs_show_do_timer_cpu - sysfs interface for tick_do_timer_cpu
+ *
+ * Provides sysfs interface for showing the current tick_do_timer_cpu.
+ */
+static ssize_t
+sysfs_show_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ ssize_t count = 0;
+
+ count = snprintf(buf, PAGE_SIZE, "%d\n", tick_do_timer_cpu);
+
+ return count;
Save some trees:

return snprintf(buf, PAGE_SIZE, "%d\n", tick_do_timer_cpu);
Post by Dimitri Sivanich
+}
+
+/**
+ * sysfs_override_do_timer_cpu - manually override tick_do_timer_cpu
+ *
+ * Takes input from sysfs interface for manually overriding the selected
+ * tick_do_timer_cpu. Only applicable when not running in nohz mode.
+ */
+static ssize_t
+sysfs_override_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ char b[16];
+ size_t ret = count;
+ int c;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+ /* strings from sysfs write are not 0 terminated! */
hm. Aren't they?
Post by Dimitri Sivanich
+ if (count >= sizeof(b))
+ return -EINVAL;
+
+ /* strip off \n: */
+ if (buf[count-1] == '\n')
+ count--;
+ if (count < 1)
+ return -EINVAL;
+
+ memcpy(b, buf, count);
+ b[count] = 0;
+
+ if (sscanf(b, "%d", &c) != 1)
+ return -EINVAL;
You should be able to use strim() in here, and eliminate b[]. Not that
the \n needed to be removed anyway! I think it's OK to modify the
incoming memory for sysfs write handlers.

Also, kstrto*() should be used to detect and reject input of the form
"42foo".

But surely we have a helper function somewhere to read an integer out
of a sysfs-written buffer. If not, we should!
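Untested sketch of that, just to illustrate (kstrto*() also strips a single
trailing '\n' itself, so b[] and the manual stripping can go away):

	int c;

	/* assume it's OK to modify the incoming sysfs buffer in place */
	if (kstrtoint(strim((char *)buf), 0, &c))
		return -EINVAL;

with the rest of the handler unchanged.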
Post by Dimitri Sivanich
+ if (!cpu_online(c))
+ return -EINVAL;
+
+ tick_do_timer_cpu = c;
+
+ return ret;
+}
+
+/*
+ */
+static SYSDEV_ATTR(jiffies_cpu, 0644, sysfs_show_do_timer_cpu,
+ sysfs_override_do_timer_cpu);
+
+static struct sysdev_class timekeeping_sysclass = {
+ .name = "timekeeping",
+};
The patch should add some user-facing documentation for the user-facing
feature.
Post by Dimitri Sivanich
+static struct sys_device device_timekeeping = {
+ .id = 0,
+ .cls = &timekeeping_sysclass,
+};
+
+static int __init init_timekeeping_sysfs(void)
+{
+ int error = sysdev_class_register(&timekeeping_sysclass);
+
+ if (!error)
+ error = sysdev_register(&device_timekeeping);
+ if (!error)
+ error = sysdev_create_file(
+ &device_timekeeping,
+ &attr_jiffies_cpu);
+ return error;
+}
+
+device_initcall(init_timekeeping_sysfs);
+#endif /* SYSFS */
Dimitri Sivanich
2011-11-30 15:29:59 UTC
Post by Andrew Morton
On Tue, 8 Nov 2011 13:11:49 -0600
Post by Dimitri Sivanich
Allow manual override of the tick_do_timer_cpu.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
This doesn't really explain what the patch does. "override" with what?
What effects can the user expect to see from doing
the-action-which-you-didn't-describe?
Hopefully my new description and documentation are acceptable.
Post by Andrew Morton
Post by Dimitri Sivanich
+sysfs_show_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ ssize_t count = 0;
+
+ count = snprintf(buf, PAGE_SIZE, "%d\n", tick_do_timer_cpu);
+
+ return count;
return snprintf(buf, PAGE_SIZE, "%d\n", tick_do_timer_cpu);
The new patch below should address this by using the already existing
'sysdev_show_int' function.
Post by Andrew Morton
Post by Dimitri Sivanich
+
+ if (sscanf(b, "%d", &c) != 1)
+ return -EINVAL;
You should be able to use strim() in here, and eliminate b[]. Not that
the \n needed to be removed anyway! I think it's OK to modify the
incoming memory for sysfs write handlers.
Also, kstrto*() should be used to detect and reject input of the form
"42foo".
Using kstrtouint, see patch below.
Post by Andrew Morton
But surely we have a helper function somewhere to read an integer out
of a sysfs-written buffer. If not, we should!
Here I need more strict input checking for online cpu values.
Post by Andrew Morton
Post by Dimitri Sivanich
+ if (!cpu_online(c))
+ return -EINVAL;
+
+ tick_do_timer_cpu = c;
+
+ return ret;
+}
+
+/*
+ */
+static SYSDEV_ATTR(jiffies_cpu, 0644, sysfs_show_do_timer_cpu,
+ sysfs_override_do_timer_cpu);
+
+static struct sysdev_class timekeeping_sysclass = {
+ .name = "timekeeping",
+};
The patch should add some user-facing documentation for the user-facing
feature.
Hopefully documenting this in Documentation/ABI/testing is acceptable at this
point(?). See patch below.




Show and modify the tick_do_timer_cpu via sysfs. This determines the cpu
on which global time (jiffies) updates occur. Modification can only be
done on systems with nohz mode turned off.

While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.

Signed-off-by: Dimitri Sivanich <***@sgi.com>
---
Documentation/ABI/testing/sysfs-devices-system-timekeeping | 16 ++
drivers/base/sys.c | 10 -
include/linux/sysdev.h | 2
kernel/time/tick-sched.c | 59 ++++++++++
4 files changed, 81 insertions(+), 6 deletions(-)

Index: linux/include/linux/sysdev.h
===================================================================
--- linux.orig/include/linux/sysdev.h
+++ linux/include/linux/sysdev.h
@@ -132,6 +132,8 @@ struct sysdev_ext_attribute {
void *var;
};

+#define SYSDEV_TO_EXT_ATTR(x) container_of(x, struct sysdev_ext_attribute, attr)
+
/*
* Support for simple variable sysdev attributes.
* The pointer to the variable is stored in a sysdev_ext_attribute
Index: linux/drivers/base/sys.c
===================================================================
--- linux.orig/drivers/base/sys.c
+++ linux/drivers/base/sys.c
@@ -339,13 +339,11 @@ int __init system_bus_init(void)
return 0;
}

-#define to_ext_attr(x) container_of(x, struct sysdev_ext_attribute, attr)
-
ssize_t sysdev_store_ulong(struct sys_device *sysdev,
struct sysdev_attribute *attr,
const char *buf, size_t size)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
char *end;
unsigned long new = simple_strtoul(buf, &end, 0);
if (end == buf)
@@ -360,7 +358,7 @@ ssize_t sysdev_show_ulong(struct sys_dev
struct sysdev_attribute *attr,
char *buf)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
return snprintf(buf, PAGE_SIZE, "%lx\n", *(unsigned long *)(ea->var));
}
EXPORT_SYMBOL_GPL(sysdev_show_ulong);
@@ -369,7 +367,7 @@ ssize_t sysdev_store_int(struct sys_devi
struct sysdev_attribute *attr,
const char *buf, size_t size)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
char *end;
long new = simple_strtol(buf, &end, 0);
if (end == buf || new > INT_MAX || new < INT_MIN)
@@ -384,7 +382,7 @@ ssize_t sysdev_show_int(struct sys_devic
struct sysdev_attribute *attr,
char *buf)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
return snprintf(buf, PAGE_SIZE, "%d\n", *(int *)(ea->var));
}
EXPORT_SYMBOL_GPL(sysdev_show_int);
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -834,6 +834,65 @@ void tick_cancel_sched_timer(int cpu)
}
#endif

+#ifdef CONFIG_SYSFS
+/*
+ * Allow modification of tick_do_timer_cpu when nohz mode is off.
+ */
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
+
+static struct sysdev_ext_attribute attr_jiffies_cpu = {
+ _SYSDEV_ATTR(jiffies_cpu, 0644, sysdev_show_int,
+ sysfs_store_do_timer_cpu),
+ &tick_do_timer_cpu };
+
+static struct sysdev_class timekeeping_sysclass = {
+ .name = "timekeeping",
+};
+
+static struct sys_device device_timekeeping = {
+ .id = 0,
+ .cls = &timekeeping_sysclass,
+};
+
+static int __init init_timekeeping_sysfs(void)
+{
+ int error = sysdev_class_register(&timekeeping_sysclass);
+
+ if (!error)
+ error = sysdev_register(&device_timekeeping);
+ if (!error)
+ error = sysdev_create_file(
+ &device_timekeeping,
+ &attr_jiffies_cpu.attr);
+ return error;
+}
+
+device_initcall(init_timekeeping_sysfs);
+#endif /* SYSFS */
+
/**
* Async notification about clocksource changes
*/
Index: linux/Documentation/ABI/testing/sysfs-devices-system-timekeeping
===================================================================
--- /dev/null
+++ linux/Documentation/ABI/testing/sysfs-devices-system-timekeeping
@@ -0,0 +1,16 @@
+What: /sys/devices/system/timekeeping/
+Date: November 2011
+Contact: Linux kernel mailing list <linux-***@vger.kernel.org>
+Description: Timekeeping attributes
+
+
+What: /sys/devices/system/timekeeping/timekeeping0/jiffies_cpu
+Date: November 2011
+Contact: Linux kernel mailing list <linux-***@vger.kernel.org>
+Description: Show and modify the kernel's tick_do_timer_cpu. This
+ determines the cpu on which global time (jiffies) updates
+ occur. This can only be modified on systems running with
+ the nohz mode turned off (nohz=off).
+
+ Possible values are:
+ 0 - <num online cpus>
Andrew Morton
2011-12-01 00:11:31 UTC
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
checkpatch tells us:

WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
Andrew Morton
2011-12-01 00:16:10 UTC
On Wed, 30 Nov 2011 16:11:31 -0800
Post by Andrew Morton
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
And the whole thing is racy, isn't it? The "new" CPU can go offline a
nanosecond after we performed that test, so why perform it at all?
Dimitri Sivanich
2011-12-01 02:07:27 UTC
Post by Andrew Morton
On Wed, 30 Nov 2011 16:11:31 -0800
Post by Andrew Morton
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
And the whole thing is racy, isn't it? The "new" CPU can go offline a
nanosecond after we performed that test, so why perform it at all?
See my email concerning the panic in cpu_online().
Andrew Morton
2011-12-01 02:13:18 UTC
Post by Dimitri Sivanich
Post by Andrew Morton
And the whole thing is racy, isn't it? The "new" CPU can go offline a
nanosecond after we performed that test, so why perform it at all?
See my email concerning the panic in cpu_online().
That doesn't address my question.

What's the point in checking cpu_online() when we have no locks to
prevent the online map from changing?

What happens if this cpu goes offline immediately after that check has
passed?
Dimitri Sivanich
2011-12-01 16:37:40 UTC
Post by Andrew Morton
Post by Dimitri Sivanich
Post by Andrew Morton
And the whole thing is racy, isn't it? The "new" CPU can go offline a
nanosecond after we performed that test, so why perform it at all?
See my email concerning the panic in cpu_online().
That doesn't address my question.
What's the point in checking cpu_online() when we have no locks to
prevent the online map from changing?
What happens if this cpu goes offline immediately after that check has
passed?
Yes, there is some raciness here, and I see your point.

I've changed the patch to protect the operation. The version below should
resolve this with get/put_online_cpus. It also checks input against
nr_cpu_ids before checking cpu_online(), which is what
for_each_present_cpu()->for_each_cpu() uses, and makes checkpatch.pl happy.




Show and modify the tick_do_timer_cpu via sysfs. This determines the cpu
on which global time (jiffies) updates occur. Modification can only be
done on systems with nohz mode turned off.

While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.

Signed-off-by: Dimitri Sivanich <***@sgi.com>
---
Documentation/ABI/testing/sysfs-devices-system-timekeeping | 16 ++
drivers/base/sys.c | 10 -
include/linux/sysdev.h | 2
kernel/time/tick-sched.c | 67 ++++++++++
4 files changed, 89 insertions(+), 6 deletions(-)

Index: linux/include/linux/sysdev.h
===================================================================
--- linux.orig/include/linux/sysdev.h
+++ linux/include/linux/sysdev.h
@@ -132,6 +132,8 @@ struct sysdev_ext_attribute {
void *var;
};

+#define SYSDEV_TO_EXT_ATTR(x) container_of(x, struct sysdev_ext_attribute, attr)
+
/*
* Support for simple variable sysdev attributes.
* The pointer to the variable is stored in a sysdev_ext_attribute
Index: linux/drivers/base/sys.c
===================================================================
--- linux.orig/drivers/base/sys.c
+++ linux/drivers/base/sys.c
@@ -339,13 +339,11 @@ int __init system_bus_init(void)
return 0;
}

-#define to_ext_attr(x) container_of(x, struct sysdev_ext_attribute, attr)
-
ssize_t sysdev_store_ulong(struct sys_device *sysdev,
struct sysdev_attribute *attr,
const char *buf, size_t size)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
char *end;
unsigned long new = simple_strtoul(buf, &end, 0);
if (end == buf)
@@ -360,7 +358,7 @@ ssize_t sysdev_show_ulong(struct sys_dev
struct sysdev_attribute *attr,
char *buf)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
return snprintf(buf, PAGE_SIZE, "%lx\n", *(unsigned long *)(ea->var));
}
EXPORT_SYMBOL_GPL(sysdev_show_ulong);
@@ -369,7 +367,7 @@ ssize_t sysdev_store_int(struct sys_devi
struct sysdev_attribute *attr,
const char *buf, size_t size)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
char *end;
long new = simple_strtol(buf, &end, 0);
if (end == buf || new > INT_MAX || new < INT_MIN)
@@ -384,7 +382,7 @@ ssize_t sysdev_show_int(struct sys_devic
struct sysdev_attribute *attr,
char *buf)
{
- struct sysdev_ext_attribute *ea = to_ext_attr(attr);
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
return snprintf(buf, PAGE_SIZE, "%d\n", *(int *)(ea->var));
}
EXPORT_SYMBOL_GPL(sysdev_show_int);
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -834,6 +834,73 @@ void tick_cancel_sched_timer(int cpu)
}
#endif

+#ifdef CONFIG_SYSFS
+/*
+ * Allow modification of tick_do_timer_cpu when nohz mode is off.
+ */
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ /* Protect against cpu-hotplug */
+ get_online_cpus();
+
+ if (new >= nr_cpu_ids || !cpu_online(new)) {
+ put_online_cpus();
+ return -ERANGE;
+ }
+
+ *(unsigned int *)(ea->var) = new;
+
+ put_online_cpus();
+
+ return size;
+}
+
+static struct sysdev_ext_attribute attr_jiffies_cpu = {
+ _SYSDEV_ATTR(jiffies_cpu, 0644, sysdev_show_int,
+ sysfs_store_do_timer_cpu),
+ &tick_do_timer_cpu };
+
+static struct sysdev_class timekeeping_sysclass = {
+ .name = "timekeeping",
+};
+
+static struct sys_device device_timekeeping = {
+ .id = 0,
+ .cls = &timekeeping_sysclass,
+};
+
+static int __init init_timekeeping_sysfs(void)
+{
+ int error = sysdev_class_register(&timekeeping_sysclass);
+
+ if (!error)
+ error = sysdev_register(&device_timekeeping);
+ if (!error)
+ error = sysdev_create_file(
+ &device_timekeeping,
+ &attr_jiffies_cpu.attr);
+ return error;
+}
+
+device_initcall(init_timekeeping_sysfs);
+#endif /* SYSFS */
+
/**
* Async notification about clocksource changes
*/
Index: linux/Documentation/ABI/testing/sysfs-devices-system-timekeeping
===================================================================
--- /dev/null
+++ linux/Documentation/ABI/testing/sysfs-devices-system-timekeeping
@@ -0,0 +1,16 @@
+What: /sys/devices/system/timekeeping/
+Date: November 2011
+Contact: Linux kernel mailing list <linux-***@vger.kernel.org>
+Description: Timekeeping attributes
+
+
+What: /sys/devices/system/timekeeping/timekeeping0/jiffies_cpu
+Date: November 2011
+Contact: Linux kernel mailing list <linux-***@vger.kernel.org>
+Description: Show and modify the kernel's tick_do_timer_cpu. This
+ determines the cpu on which global time (jiffies) updates
+ occur. This can only be modified on systems running with
+ the nohz mode turned off (nohz=off).
+
+ Possible values are:
+ 0 - <num online cpus>
Andrew Morton
2011-12-01 22:56:23 UTC
Permalink
On Thu, 1 Dec 2011 10:37:40 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ /* Protect against cpu-hotplug */
+ get_online_cpus();
+
+ if (new >= nr_cpu_ids || !cpu_online(new)) {
+ put_online_cpus();
+ return -ERANGE;
+ }
+
+ *(unsigned int *)(ea->var) = new;
+
+ put_online_cpus();
+
+ return size;
+}
OK, I think this fixes one race. We modify tick_do_timer_cpu inside
get_online_cpus(). If that cpu goes offline then
tick_handover_do_timer() will correctly hand the timer functions over
to a new CPU, and tick_handover_do_timer() runs in the CPU hotplug
handler which I assume is locked by get_online_cpus(). Please check
all this.

Now, the above code can alter tick_do_timer_cpu while a timer interrupt
is actually executing on another CPU. Will this disrupt anything? I
think it might cause problems. If we take an interrupt on CPU 5 and
that CPU enters tick_periodic() and another CPU alters
tick_do_timer_cpu from 5 to 4 at exactly the correct time, tick_periodic()
might fail to run do_timer(). Or it might run do_timer() on both CPUs 4 and
5 concurrently?
Dimitri Sivanich
2011-12-02 20:14:52 UTC
Permalink
Post by Andrew Morton
On Thu, 1 Dec 2011 10:37:40 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ /* Protect against cpu-hotplug */
+ get_online_cpus();
+
+ if (new >= nr_cpu_ids || !cpu_online(new)) {
+ put_online_cpus();
+ return -ERANGE;
+ }
+
+ *(unsigned int *)(ea->var) = new;
+
+ put_online_cpus();
+
+ return size;
+}
OK, I think this fixes one race. We modify tick_do_timer_cpu inside
get_online_cpus(). If that cpu goes offline then
tick_handover_do_timer() will correctly hand the timer functions over
to a new CPU, and tick_handover_do_timer() runs in the CPU hotplug
handler which I assume is locked by get_online_cpus(). Please check
all this.
Yes, _cpu_down() runs cpu_hotplug_begin(), which locks and holds the mutex
that get_online_cpus() needs in order to update the refcount
(cpu_hotplug_begin doesn't exit until refcount==0).

The notification that calls tick_handover_do_timer() is done in both the
CPU_DYING and CPU_DYING_FROZEN (CPU_TASKS_FROZEN), but I believe this always
comes from _cpu_down() in either case.
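
Roughly, from kernel/cpu.c in this kernel (trimmed sketch):

	void get_online_cpus(void)
	{
		might_sleep();
		if (cpu_hotplug.active_writer == current)
			return;
		mutex_lock(&cpu_hotplug.lock);
		cpu_hotplug.refcount++;
		mutex_unlock(&cpu_hotplug.lock);
	}

	static void cpu_hotplug_begin(void)
	{
		cpu_hotplug.active_writer = current;
		for (;;) {
			mutex_lock(&cpu_hotplug.lock);
			if (likely(!cpu_hotplug.refcount))
				break;
			__set_current_state(TASK_UNINTERRUPTIBLE);
			mutex_unlock(&cpu_hotplug.lock);
			schedule();
		}
	}

so a store holding the read side can't overlap the CPU_DYING notifier chain.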
Post by Andrew Morton
Now, the above code can alter tick_do_timer_cpu while a timer interrupt
is actually executing on another CPU. Will this disrupt anything? I
think it might cause problems. If we take an interrupt on CPU 5 and
that CPU enters tick_periodic() and another CPU alters
tick_do_timer_cpu from 5 to 4 at exactly the correct time, tick_periodic()
might fail to run do_timer(). Or it might run do_timer() on both CPUs 4 and
5 concurrently?
Well, we do have to take the write_seqlock() in tick_periodic, so there's
no danger of do_timer running exactly concurrently.

But yes, we may end up with 2 jiffies ticks occurring close together
(when 5 runs do_timer while 4 waits for the seqlock), or we might end up
missing a jiffies update for almost a full tick (when it changes from 5
to 4 immediately after 4 has done the 'tick_do_timer_cpu == cpu' check).

So at that time, we could be off +- almost a tick. The question is, how
critical is that? When you down a cpu, the same sort of thing could
happen via tick_handover_do_timer(), which itself does nothing more than
change tick_do_timer_cpu.
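
For reference, the check in question looks roughly like this in this kernel
(kernel/time/tick-common.c):

	static void tick_periodic(int cpu)
	{
		if (tick_do_timer_cpu == cpu) {
			write_seqlock(&xtime_lock);

			/* Keep track of the next tick event */
			tick_next_period = ktime_add(tick_next_period, tick_period);

			do_timer(1);
			write_sequnlock(&xtime_lock);
		}

		update_process_times(user_mode(get_irq_regs()));
		profile_tick(CPU_PROFILING);
	}

so a change of tick_do_timer_cpu can only make the jiffies update land a tick
early or late, never run do_timer() outside the seqlock.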
Dimitri Sivanich
2011-12-02 20:22:54 UTC
Post by Dimitri Sivanich
So at that time, we could be off +- almost a tick. The question is, how
critical is that? When you down a cpu, the same sort of thing could
happen via tick_handover_do_timer(), which itself does nothing more than
change tick_do_timer_cpu.
I should clarify that by 'same sort of thing' I mean that the target cpu
might've just gone through tick_periodic(), so we could miss a tick before
the next update.
Thomas Gleixner
2011-12-02 22:42:39 UTC
Post by Dimitri Sivanich
Well, we do have to take the write_seqlock() in tick_periodic, so there's
no danger of do_timer running exactly concurrently.
But yes, we may end up with 2 jiffies ticks occurring close together
(when 5 runs do_timer while 4 waits for the seqlock), or we might end up
missing a jiffies update for almost a full tick (when it changes from 5
to 4 immediately after 4 has done the 'tick_do_timer_cpu == cpu' check).
So at that time, we could be off +- almost a tick. The question is, how
critical is that? When you down a cpu, the same sort of thing could
happen via tick_handover_do_timer(), which itself does nothing more than
change tick_do_timer_cpu.
It's uncritical as long as you are not using clocksource=jiffies. With
all other clocksources you just miss a jiffies update, which does not
affect timekeeping at all. It just might expire your network timeout a
jiffie earlier or later. So there is no damage to expect.
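
For reference, the jiffies clocksource is the special case because it reads
the jiffies counter directly, roughly (kernel/time/jiffies.c):

	static cycle_t jiffies_read(struct clocksource *cs)
	{
		return (cycle_t) jiffies;
	}

so only with clocksource=jiffies does a missed update show up as a
timekeeping error; hardware clocksources are unaffected.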

Thanks,

tglx
Dimitri Sivanich
2011-12-01 02:06:31 UTC
Post by Andrew Morton
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
I think a check against num_possible_cpus() should be OK.
Post by Andrew Morton
I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
A value > NR_CPUS and < MAX_INT caused a panic in sysfs_store_do_timer_cpu,
presumably from the cpu_online() check. The check against NR_CPUS avoided
the panic.
Andrew Morton
2011-12-01 02:12:05 UTC
Post by Dimitri Sivanich
Post by Andrew Morton
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
I think a check against num_possible_cpus() should be OK.
You want cpu_possible(). Or maybe cpu_present().
Post by Dimitri Sivanich
Post by Andrew Morton
I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
A value > NR_CPUS and < MAX_INT caused a panic in sysfs_store_do_timer_cpu,
presumably from the cpu_online() check. The check against NR_CPUS avoided
the panic.
OK. Well, it's not a panic:

static inline unsigned int cpumask_check(unsigned int cpu)
{
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
WARN_ON_ONCE(cpu >= nr_cpumask_bits);
#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
return cpu;
}

so we can't do cpu_online(insane number)
Dimitri Sivanich
2011-12-01 02:34:23 UTC
Post by Andrew Morton
Post by Dimitri Sivanich
Post by Andrew Morton
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
I think a check against num_possible_cpus() should be OK.
You want cpu_possible(). Or maybe cpu_present().
May be splitting hairs, but I think I like num_present_cpus even better.
Post by Andrew Morton
Post by Dimitri Sivanich
Post by Andrew Morton
I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
A value > NR_CPUS and < MAX_INT caused a panic in sysfs_store_do_timer_cpu,
presumably from the cpu_online() check. The check against NR_CPUS avoided
the panic.
Actually, I didn't have CONFIG_DEBUG_PER_CPU_MAPS turned on.
Post by Andrew Morton
static inline unsigned int cpumask_check(unsigned int cpu)
{
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
WARN_ON_ONCE(cpu >= nr_cpumask_bits);
#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
return cpu;
}
so we can't do cpu_online(insane number)
Andrew Morton
2011-12-01 02:38:58 UTC
Post by Dimitri Sivanich
Post by Andrew Morton
Post by Dimitri Sivanich
Post by Andrew Morton
On Wed, 30 Nov 2011 09:29:59 -0600
Post by Dimitri Sivanich
+static ssize_t sysfs_store_do_timer_cpu(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t size)
+{
+ struct sysdev_ext_attribute *ea = SYSDEV_TO_EXT_ATTR(attr);
+ unsigned int new;
+ int rv;
+
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
+
+ rv = kstrtouint(buf, 0, &new);
+ if (rv)
+ return rv;
+
+ if (new >= NR_CPUS || !cpu_online(new))
+ return -ERANGE;
+
+ *(unsigned int *)(ea->var) = new;
+ return size;
+}
WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
I think a check against num_possible_cpus() should be OK.
You want cpu_possible(). Or maybe cpu_present().
May be splitting hairs, but I think I like num_present_cpus even better.
Does that work correctly if the cpu-present map has holes?
Post by Dimitri Sivanich
Post by Andrew Morton
Post by Dimitri Sivanich
Post by Andrew Morton
I think the check can just be removed? Surely cpu_online(1000000000)
will return false?
A value > NR_CPUS and < MAX_INT caused a panic in sysfs_store_do_timer_cpu,
presumably from the cpu_online() check. The check against NR_CPUS avoided
the panic.
Actually, I didn't have CONFIG_DEBUG_PER_CPU_MAPS turned on.
So why did it crash? test_bit() far out of range?
Mike Galbraith
2012-01-15 13:46:08 UTC
Post by Dimitri Sivanich
Resending this.
Allow manual override of the tick_do_timer_cpu.
Bigger button below.
Post by Dimitri Sivanich
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
Uhuh, we have something in common, your HPC folks don't like NO_HZ
because it makes loads of jitter, my RT jitter test proggy hates it to
pieces for the same reason. I can't just config it out like you though.

Aside: how come your HPC folks aren't griping about (SGI monster) boxen
ticking all at the same time? That makes my 64 core box jitter plenty.

Anyway, if you boot one of your monster boxen nohz=off, or in this case,
just don't build with NO_HZ, the electric meter spins a lot faster than
with NO_HZ, yes? So I wonder if something like the below would help.
Seems to me cpusets is the right place to do this kind of tweaking,
though this particular patch may be eligible for an "award" or two :)

Turning NO_HZ off at runtime is probably not as good as just building
without NO_HZ, but for me, running an all-pinned RT load in an isolated
set with a distro config, I can switch an isolated cpuset to "sched_hpc"
and only the part of the box that NEEDS to kill NO_HZ does so. The RT
test proggy becomes happy without booting nohz=off, and the partition
can resume green machine mode when it's done doing whatever required
turning NO_HZ off. In my case rt push/pull gets turned off as well to
further improve jitter, since it's not only a jitter source, but a
waste of cycles for a 100% pinned and isolated load.

Poke sched_hpc, CPU0 starts ticking, becomes jiffies knave (you need),
isolated partition also starts ticking (we both need). Or, poke
sched_hpc_rt, it'll do that and turn rt push/pull off as well, which I
don't desperately need, but it does cut remaining jitter in half.

There's space for other "isolate me harder" buttons, though one would
hope the space would never be used.. these two are ugly enough. I
couldn't figure out how to make it any prettier.

---
include/linux/sched.h | 29 ++++++
init/Kconfig | 11 ++
kernel/cpuset.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 95 +++++++++++++++++++++
kernel/sched/fair.c | 4
kernel/sched/rt.c | 18 +++-
kernel/sched/sched.h | 17 +++
kernel/time/tick-sched.c | 2
8 files changed, 372 insertions(+), 7 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -271,6 +271,35 @@ extern void init_idle_bootup_task(struct

extern int runqueue_is_locked(int cpu);

+/* Cpuset runqueue behavior modifier bits */
+enum
+{
+ RQ_TICK=0,
+ RQ_HPC,
+ RQ_HPCRT,
+ RQ_CLEAR=~0,
+};
+
+#ifdef CONFIG_HPC_CPUSETS
+extern int runqueue_is_flagged(int cpu, int nr);
+extern int runqueue_is_isolated(int cpu);
+extern void cpuset_flags_set(int cpu, unsigned bits);
+extern void cpuset_flags_clr(int cpu, unsigned bits);
+
+#ifdef CONFIG_NO_HZ
+static inline int sched_needs_cpu(int cpu)
+{
+ return runqueue_is_flagged(cpu, RQ_TICK);
+}
+#endif
+#else /* !CONFIG_HPC_CPUSETS */
+static inline int runqueue_is_flagged(int cpu, int nr) { return 0; }
+static inline int runqueue_is_isolated(int cpu) { return 0; }
+static inline int sched_needs_cpu(int cpu) { return 0; }
+static inline void cpuset_flags_set(int cpu, unsigned bits) { }
+static inline void cpuset_flags_clr(int cpu, unsigned bits) { }
+#endif /* CONFIG_HPC_CPUSETS */
+
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern void select_nohz_load_balancer(int stop_tick);
extern void set_cpu_sd_state_idle(void);
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -624,6 +624,17 @@ config PROC_PID_CPUSET
depends on CPUSETS
default y

+config HPC_CPUSETS
+ bool "HPC cpusets"
+ depends on CPUSETS && SMP
+ default n
+ help
+ This option provides per CPUSET scheduler behavior control switches.
+ This is primarily useful on large SMP systems where some partitions
+ may be dedicated to sensitive HPC applications, while others are not.
+
+ Say N if unsure.
+
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
help
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -145,6 +145,8 @@ typedef enum {
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_SCHED_HPC,
+ CS_SCHED_HPCRT,
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -183,6 +185,16 @@ static inline int is_spread_slab(const s
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+static inline int is_sched_hpc(const struct cpuset *cs)
+{
+ return test_bit(CS_SCHED_HPC, &cs->flags);
+}
+
+static inline int is_sched_hpc_rt(const struct cpuset *cs)
+{
+ return test_bit(CS_SCHED_HPCRT, &cs->flags);
+}
+
static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
};
@@ -382,6 +394,147 @@ static void free_trial_cpuset(struct cpu
kfree(trial);
}

+#ifdef CONFIG_HPC_CPUSETS
+static int
+validate_sched_change(const struct cpuset *cur, const struct cpuset *trial)
+{
+ int cpu;
+
+ if (!is_sched_hpc(trial))
+ return 0;
+
+ cpu = cpumask_first(trial->cpus_allowed);
+
+ if (cur == &top_cpuset || !is_cpu_exclusive(cur))
+ return -EINVAL;
+ /*
+ * HPC cpusets may not contain the boot CPU,
+ * and must be completely isolated or empty.
+ */
+ if (!cpu || is_sched_load_balance(cur))
+ return -EINVAL;
+ if (cpu < nr_cpu_ids && !runqueue_is_isolated(cpu))
+ return -EINVAL;
+
+ /* Handle CPUs entering or leaving the set */
+ if (!cpumask_equal(cur->cpus_allowed, trial->cpus_allowed)) {
+ cpumask_var_t delta;
+ int entering, cpu;
+ unsigned bits;
+
+ if (!zalloc_cpumask_var(&delta, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_xor(delta, cur->cpus_allowed, trial->cpus_allowed);
+ entering = cpumask_weight(cur->cpus_allowed) <
+ cpumask_weight(trial->cpus_allowed);
+
+ bits = (1 << RQ_TICK) | (1 << RQ_HPC);
+ if (is_sched_hpc_rt(trial))
+ bits |= 1 << RQ_HPCRT;
+
+ if (entering) {
+ for_each_cpu(cpu, delta) {
+ if (runqueue_is_isolated(cpu))
+ continue;
+ free_cpumask_var(delta);
+ return -EINVAL;
+ }
+ }
+
+ for_each_cpu(cpu, delta) {
+ if (entering)
+ cpuset_flags_set(cpu, bits);
+ else
+ cpuset_flags_clr(cpu, bits);
+ }
+ free_cpumask_var(delta);
+ }
+
+ return 0;
+}
+
+/*
+ * update_sched_flags - update scheduler modifier flags in cpusets.
+ * @bit: the bit changing state.
+ * @cs: the cpuset in which flags need to be updated:
+ * @turning_on: whether we're turning the bit on or off.
+ *
+ * Called with cgroup_mutex held. Turn scheduler modifiers on/off,
+ * updating runqueue flags for associated CPUs. Set/clear of a flag
+ * which invalidates modifiers recursively clears invalidated flags
+ * for child cpusets and their associated CPUs.
+ *
+ * No return value.
+ */
+static void
+update_sched_flags(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on)
+{
+ struct cgroup *cont;
+ struct cpuset *child;
+ unsigned cpu, bits = 0, recursive = 0;
+
+ switch (bit) {
+ case CS_CPU_EXCLUSIVE:
+ if (turning_on)
+ return;
+ bits = RQ_CLEAR;
+ recursive = 1;
+ break;
+ case CS_SCHED_LOAD_BALANCE:
+ if (!turning_on)
+ return;
+ if (is_sched_hpc(cs)) {
+ bits |= (1 << RQ_TICK) | (1 << RQ_HPC);
+ clear_bit(CS_SCHED_HPC, &cs->flags);
+ }
+ if (is_sched_hpc_rt(cs)) {
+ bits |= (1 << RQ_HPCRT);
+ clear_bit(CS_SCHED_HPCRT, &cs->flags);
+ }
+ recursive = 1;
+ break;
+ case CS_SCHED_HPC:
+ bits = (1 << RQ_TICK) | (1 << RQ_HPC);
+ break;
+ case CS_SCHED_HPCRT:
+ bits = (1 << RQ_HPCRT);
+ break;
+ default:
+ return;
+ }
+
+ if (recursive) {
+ list_for_each_entry(cont, &cs->css.cgroup->children, sibling) {
+ child = cgroup_cs(cont);
+ update_sched_flags(bit, child, turning_on);
+ }
+ turning_on = 0;
+ }
+
+ if (!bits)
+ return;
+
+ for_each_cpu(cpu, cs->cpus_allowed) {
+ if (turning_on)
+ cpuset_flags_set(cpu, bits);
+ else
+ cpuset_flags_clr(cpu, bits);
+ }
+}
+
+#else /* !CONFIG_HPC_CPUSETS */
+
+static int
+validate_sched_change(const struct cpuset *cur, const struct cpuset *trial)
+{
+ return 0;
+}
+static void
+update_sched_flags(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on) { }
+
+#endif /* CONFIG_HPC_CPUSETS */
+
/*
* validate_change() - Used to validate that any proposed cpuset change
* follows the structural rules for cpusets.
@@ -406,6 +559,7 @@ static int validate_change(const struct
{
struct cgroup *cont;
struct cpuset *c, *par;
+ int ret;

/* Each of our child cpusets must be a subset of us */
list_for_each_entry(cont, &cur->css.cgroup->children, sibling) {
@@ -413,6 +567,10 @@ static int validate_change(const struct
return -EBUSY;
}

+ ret = validate_sched_change(cur, trial);
+ if (ret)
+ return ret;
+
/* Remaining checks don't apply to root cpuset */
if (cur == &top_cpuset)
return 0;
@@ -1250,6 +1408,7 @@ static int update_flag(cpuset_flagbits_t
struct cpuset *trialcs;
int balance_flag_changed;
int spread_flag_changed;
+ int sched_flag_changed;
struct ptr_heap heap;
int err;

@@ -1273,6 +1432,11 @@ static int update_flag(cpuset_flagbits_t
balance_flag_changed = (is_sched_load_balance(cs) !=
is_sched_load_balance(trialcs));

+ sched_flag_changed = balance_flag_changed;
+ sched_flag_changed |= (is_cpu_exclusive(cs) != is_cpu_exclusive(trialcs));
+ sched_flag_changed |= (is_sched_hpc(cs) != is_sched_hpc(trialcs));
+ sched_flag_changed |= (is_sched_hpc_rt(cs) != is_sched_hpc_rt(trialcs));
+
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|| (is_spread_page(cs) != is_spread_page(trialcs)));

@@ -1283,6 +1447,9 @@ static int update_flag(cpuset_flagbits_t
if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
async_rebuild_sched_domains();

+ if (sched_flag_changed)
+ update_sched_flags(bit, cs, turning_on);
+
if (spread_flag_changed)
update_tasks_flags(cs, &heap);
heap_free(&heap);
@@ -1488,6 +1655,8 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_SCHED_HPC,
+ FILE_SCHED_HPCRT,
} cpuset_filetype_t;

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1527,6 +1696,18 @@ static int cpuset_write_u64(struct cgrou
case FILE_SPREAD_SLAB:
retval = update_flag(CS_SPREAD_SLAB, cs, val);
break;
+ case FILE_SCHED_HPC:
+ if (!val && is_sched_hpc_rt(cs))
+ retval = update_flag(CS_SCHED_HPCRT, cs, val);
+ if (!retval)
+ retval = update_flag(CS_SCHED_HPC, cs, val);
+ break;
+ case FILE_SCHED_HPCRT:
+ if (val && !is_sched_hpc(cs))
+ retval = update_flag(CS_SCHED_HPC, cs, val);
+ if (!retval)
+ retval = update_flag(CS_SCHED_HPCRT, cs, val);
+ break;
default:
retval = -EINVAL;
break;
@@ -1676,6 +1857,10 @@ static u64 cpuset_read_u64(struct cgroup
return is_mem_hardwall(cs);
case FILE_SCHED_LOAD_BALANCE:
return is_sched_load_balance(cs);
+ case FILE_SCHED_HPC:
+ return is_sched_hpc(cs);
+ case FILE_SCHED_HPCRT:
+ return is_sched_hpc_rt(cs);
case FILE_MEMORY_MIGRATE:
return is_memory_migrate(cs);
case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1765,8 +1950,22 @@ static struct cftype files[] = {
.write_s64 = cpuset_write_s64,
.private = FILE_SCHED_RELAX_DOMAIN_LEVEL,
},
+#ifdef CONFIG_HPC_CPUSETS
+ {
+ .name = "sched_hpc",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SCHED_HPC,
+ },

{
+ .name = "sched_hpc_rt",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SCHED_HPCRT,
+ },
+#endif
+ {
.name = "memory_migrate",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
@@ -1906,6 +2105,10 @@ static void cpuset_destroy(struct cgroup
{
struct cpuset *cs = cgroup_cs(cont);

+ if (is_sched_hpc_rt(cs))
+ update_flag(CS_SCHED_HPCRT, cs, 0);
+ if (is_sched_hpc(cs))
+ update_flag(CS_SCHED_HPC, cs, 0);
if (is_sched_load_balance(cs))
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1957,14 +1957,14 @@ static void finish_task_switch(struct rq
/* assumes rq->lock is held */
static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
{
- if (prev->sched_class->pre_schedule)
+ if (prev->sched_class->pre_schedule && !rq_flag(rq, RQ_HPCRT))
prev->sched_class->pre_schedule(rq, prev);
}

/* rq->lock is NOT held, but preemption is disabled */
static inline void post_schedule(struct rq *rq)
{
- if (rq->post_schedule) {
+ if (rq->post_schedule && !rq_flag(rq, RQ_HPCRT)) {
unsigned long flags;

raw_spin_lock_irqsave(&rq->lock, flags);
@@ -2986,6 +2986,92 @@ void thread_group_times(struct task_stru
}
#endif

+#ifdef CONFIG_HPC_CPUSETS
+extern int tick_do_timer_cpu __read_mostly;
+static int nr_hpc_cpus;
+
+#ifndef CONFIG_NO_HZ
+static inline void wake_up_idle_cpu(int cpu) { }
+#endif
+
+/* Called with cgroup_mutex held */
+void cpuset_flags_set(int cpu, unsigned bits)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+ int nr;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ /* Set blocker flags before taking any action */
+ rq->cpuset_flags |= bits;
+ for (nr = 0; bits; nr++) {
+ if (!(bits & (1 << nr)))
+ continue;
+ switch (nr) {
+ case RQ_TICK:
+ break;
+ case RQ_HPC:
+ /* Ensure that jiffies doesn't go stale */
+ if (!nr_hpc_cpus++) {
+ tick_do_timer_cpu = 0;
+ /* safe, CPU0 is modifier excluded */
+ cpuset_flags_set(0, 1 << RQ_TICK);
+ wake_up_idle_cpu(0);
+ }
+ break;
+ case RQ_HPCRT:
+ cpupri_set(&rq->rd->cpupri, cpu, CPUPRI_INVALID);
+ break;
+ }
+ bits &= ~(1 << nr);
+ }
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+/* Called with cgroup_mutex held */
+void cpuset_flags_clr(int cpu, unsigned bits)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+ unsigned nr, clear = bits, cleared = 0;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ bits &= rq->cpuset_flags;
+ rq->cpuset_flags &= ~bits;
+ for (nr = 0; bits; nr++) {
+ if (!(bits & (1 << nr)))
+ continue;
+ switch (nr) {
+ case RQ_TICK:
+ break;
+ case RQ_HPC:
+ /* Let CPU0 resume nohz mode */
+ if (nr_hpc_cpus && !--nr_hpc_cpus)
+ cpuset_flags_clr(0, 1 << RQ_TICK);
+ break;
+ case RQ_HPCRT:
+ cpupri_set(&rq->rd->cpupri, cpu, rq->rt.highest_prio.curr);
+ break;
+ }
+ bits &= ~(1 << nr);
+ cleared |= (1 << nr);
+ }
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+ WARN_ON_ONCE(clear != RQ_CLEAR && clear != cleared);
+}
+
+int runqueue_is_isolated(int cpu)
+{
+ return !cpu_rq(cpu)->sd;
+}
+
+int runqueue_is_flagged(int cpu, int nr)
+{
+ return rq_flag(cpu_rq(cpu), nr);
+}
+#endif /* CONFIG_HPC_CPUSETS */
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -3007,6 +3093,8 @@ void scheduler_tick(void)
perf_event_task_tick();

#ifdef CONFIG_SMP
+ if (rq_flag(rq, RQ_HPC))
+ return;
rq->idle_balance = idle_cpu(cpu);
trigger_load_balance(rq, cpu);
#endif
@@ -6940,6 +7028,9 @@ void __init sched_init(void)
#ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
#endif
+#ifdef CONFIG_HPC_CPUSETS
+ rq->cpuset_flags = 0;
+#endif
#endif
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4620,6 +4620,9 @@ void idle_balance(int this_cpu, struct r
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;

+ if (!this_rq->sd)
+ return;
+
this_rq->idle_stamp = this_rq->clock;

if (this_rq->avg_idle < sysctl_sched_migration_cost)
@@ -4914,6 +4917,7 @@ void select_nohz_load_balancer(int stop_
}
return;
}
+
#endif

static DEFINE_SPINLOCK(balancing);
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -917,8 +917,13 @@ inc_rt_prio_smp(struct rt_rq *rt_rq, int
{
struct rq *rq = rq_of_rt_rq(rt_rq);

- if (rq->online && prio < prev_prio)
- cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
+ if (!rq->online || prio >= prev_prio)
+ return;
+
+ if (rq_flag(rq, RQ_HPCRT))
+ return;
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
}

static void
@@ -926,8 +931,13 @@ dec_rt_prio_smp(struct rt_rq *rt_rq, int
{
struct rq *rq = rq_of_rt_rq(rt_rq);

- if (rq->online && rt_rq->highest_prio.curr != prev_prio)
- cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr);
+ if (!rq->online || rt_rq->highest_prio.curr == prev_prio)
+ return;
+
+ if (rq_flag(rq, RQ_HPCRT))
+ return;
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr);
}

#else /* CONFIG_SMP */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -424,6 +424,11 @@ struct rq {
int cpu;
int online;

+#ifdef CONFIG_CPUSETS
+ /* cpuset scheduler modifiers */
+ unsigned int cpuset_flags;
+#endif
+
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
@@ -539,6 +544,18 @@ DECLARE_PER_CPU(int, sd_llc_id);

#endif /* CONFIG_SMP */

+#ifdef CONFIG_HPC_CPUSETS
+static inline int rq_flag(struct rq *rq, int nr)
+{
+ return rq->cpuset_flags & (1 << nr);
+}
+#else
+static inline int rq_flag(struct rq *rq, int nr)
+{
+ return 0;
+}
+#endif
+
#include "stats.h"
#include "auto_group.h"

--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -327,7 +327,7 @@ static void tick_nohz_stop_sched_tick(st
} while (read_seqretry(&xtime_lock, seq));

if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
- arch_needs_cpu(cpu)) {
+ arch_needs_cpu(cpu) || sched_needs_cpu(cpu)) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
} else {
Mike Galbraith
2012-01-15 14:04:40 UTC
Post by Mike Galbraith
Post by Dimitri Sivanich
Resending this.
Allow manual override of the tick_do_timer_cpu.
Bigger button below.
Post by Dimitri Sivanich
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
Uhuh, we have something in common, your HPC folks don't like NO_HZ
because it makes loads of jitter, my RT jitter test proggy hates it to
pieces for the same reason. I can't just config it out like you though.
Aside: how come your HPC folks aren't griping about (SGI monster) boxen
ticking all at the same time? That makes my 64 core box jitter plenty.
P.S.

Using SGI and RT in the same message reminded me. I have the below in
my 3.0-rt tree, and 32 core UV100 box works fine. Wish I had a UV2 to
try out, the old (allegedly single digit s/n) UV1 is kinda slow.
Anyway, this patch is against mainline + preempt_rt FWIW.

rt,UV: rt conversion

Signed-off-by: Mike Galbraith <***@gmx.de>

---
arch/x86/include/asm/uv/uv_bau.h | 12 ++++++------
arch/x86/kernel/apic/x2apic_uv_x.c | 6 +++---
arch/x86/platform/uv/tlb_uv.c | 18 +++++++++---------
arch/x86/platform/uv/uv_time.c | 21 +++++++++++++--------
4 files changed, 31 insertions(+), 26 deletions(-)

--- a/arch/x86/include/asm/uv/uv_bau.h
+++ b/arch/x86/include/asm/uv/uv_bau.h
@@ -518,8 +518,8 @@ struct bau_control {
unsigned short uvhub_quiesce;
short socket_acknowledge_count[DEST_Q_SIZE];
cycles_t send_message;
- spinlock_t uvhub_lock;
- spinlock_t queue_lock;
+ raw_spinlock_t uvhub_lock;
+ raw_spinlock_t queue_lock;
/* tunables */
int max_concurr;
int max_concurr_const;
@@ -670,15 +670,15 @@ static inline int atom_asr(short i, stru
* to be lowered below the current 'v'. atomic_add_unless can only stop
* on equal.
*/
-static inline int atomic_inc_unless_ge(spinlock_t *lock, atomic_t *v, int u)
+static inline int atomic_inc_unless_ge(raw_spinlock_t *lock, atomic_t *v, int u)
{
- spin_lock(lock);
+ raw_spin_lock(lock);
if (atomic_read(v) >= u) {
- spin_unlock(lock);
+ raw_spin_unlock(lock);
return 0;
}
atomic_inc(v);
- spin_unlock(lock);
+ raw_spin_unlock(lock);
return 1;
}

--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -56,7 +56,7 @@ int uv_min_hub_revision_id;
EXPORT_SYMBOL_GPL(uv_min_hub_revision_id);
unsigned int uv_apicid_hibits;
EXPORT_SYMBOL_GPL(uv_apicid_hibits);
-static DEFINE_SPINLOCK(uv_nmi_lock);
+static DEFINE_RAW_SPINLOCK(uv_nmi_lock);

static struct apic apic_x2apic_uv_x;

@@ -707,10 +707,10 @@ int uv_handle_nmi(unsigned int reason, s
* Use a lock so only one cpu prints at a time.
* This prevents intermixed output.
*/
- spin_lock(&uv_nmi_lock);
+ raw_spin_lock(&uv_nmi_lock);
pr_info("UV NMI stack dump cpu %u:\n", smp_processor_id());
dump_stack();
- spin_unlock(&uv_nmi_lock);
+ raw_spin_unlock(&uv_nmi_lock);

return NMI_HANDLED;
}
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -39,7 +39,7 @@ static int timeout_base_ns[] = {
static int timeout_us;
static int nobau;
static int baudisabled;
-static spinlock_t disable_lock;
+static raw_spinlock_t disable_lock;
static cycles_t congested_cycles;

/* tunables: */
@@ -608,9 +608,9 @@ static void destination_plugged(struct b

quiesce_local_uvhub(hmaster);

- spin_lock(&hmaster->queue_lock);
+ raw_spin_lock(&hmaster->queue_lock);
reset_with_ipi(&bau_desc->distribution, bcp);
- spin_unlock(&hmaster->queue_lock);
+ raw_spin_unlock(&hmaster->queue_lock);

end_uvhub_quiesce(hmaster);

@@ -630,9 +630,9 @@ static void destination_timeout(struct b

quiesce_local_uvhub(hmaster);

- spin_lock(&hmaster->queue_lock);
+ raw_spin_lock(&hmaster->queue_lock);
reset_with_ipi(&bau_desc->distribution, bcp);
- spin_unlock(&hmaster->queue_lock);
+ raw_spin_unlock(&hmaster->queue_lock);

end_uvhub_quiesce(hmaster);

@@ -649,7 +649,7 @@ static void disable_for_congestion(struc
struct ptc_stats *stat)
{
/* let only one cpu do this disabling */
- spin_lock(&disable_lock);
+ raw_spin_lock(&disable_lock);

if (!baudisabled && bcp->period_requests &&
((bcp->period_time / bcp->period_requests) > congested_cycles)) {
@@ -668,7 +668,7 @@ static void disable_for_congestion(struc
}
}

- spin_unlock(&disable_lock);
+ raw_spin_unlock(&disable_lock);
}

static void count_max_concurr(int stat, struct bau_control *bcp,
@@ -717,7 +717,7 @@ static void record_send_stats(cycles_t t
*/
static void uv1_throttle(struct bau_control *hmaster, struct ptc_stats *stat)
{
- spinlock_t *lock = &hmaster->uvhub_lock;
+ raw_spinlock_t *lock = &hmaster->uvhub_lock;
atomic_t *v;

v = &hmaster->active_descriptor_count;
@@ -1835,7 +1835,7 @@ static int __init uv_bau_init(void)
}

nuvhubs = uv_num_possible_blades();
- spin_lock_init(&disable_lock);
+ raw_spin_lock_init(&disable_lock);
congested_cycles = usec_2_cycles(congested_respns_us);

uv_base_pnode = 0x7fffffff;
--- a/arch/x86/platform/uv/uv_time.c
+++ b/arch/x86/platform/uv/uv_time.c
@@ -58,7 +58,7 @@ static DEFINE_PER_CPU(struct clock_event

/* There is one of these allocated per node */
struct uv_rtc_timer_head {
- spinlock_t lock;
+ raw_spinlock_t lock;
/* next cpu waiting for timer, local node relative: */
int next_cpu;
/* number of cpus on this node: */
@@ -178,7 +178,7 @@ static __init int uv_rtc_allocate_timers
uv_rtc_deallocate_timers();
return -ENOMEM;
}
- spin_lock_init(&head->lock);
+ raw_spin_lock_init(&head->lock);
head->ncpus = uv_blade_nr_possible_cpus(bid);
head->next_cpu = -1;
blade_info[bid] = head;
@@ -232,7 +232,7 @@ static int uv_rtc_set_timer(int cpu, u64
unsigned long flags;
int next_cpu;

- spin_lock_irqsave(&head->lock, flags);
+ raw_spin_lock_irqsave(&head->lock, flags);

next_cpu = head->next_cpu;
*t = expires;
@@ -244,12 +244,12 @@ static int uv_rtc_set_timer(int cpu, u64
if (uv_setup_intr(cpu, expires)) {
*t = ULLONG_MAX;
uv_rtc_find_next_timer(head, pnode);
- spin_unlock_irqrestore(&head->lock, flags);
+ raw_spin_unlock_irqrestore(&head->lock, flags);
return -ETIME;
}
}

- spin_unlock_irqrestore(&head->lock, flags);
+ raw_spin_unlock_irqrestore(&head->lock, flags);
return 0;
}

@@ -268,7 +268,7 @@ static int uv_rtc_unset_timer(int cpu, i
unsigned long flags;
int rc = 0;

- spin_lock_irqsave(&head->lock, flags);
+ raw_spin_lock_irqsave(&head->lock, flags);

if ((head->next_cpu == bcpu && uv_read_rtc(NULL) >= *t) || force)
rc = 1;
@@ -280,7 +280,7 @@ static int uv_rtc_unset_timer(int cpu, i
uv_rtc_find_next_timer(head, pnode);
}

- spin_unlock_irqrestore(&head->lock, flags);
+ raw_spin_unlock_irqrestore(&head->lock, flags);

return rc;
}
@@ -300,13 +300,18 @@ static int uv_rtc_unset_timer(int cpu, i
static cycle_t uv_read_rtc(struct clocksource *cs)
{
unsigned long offset;
+ cycle_t cycles;

+ migrate_disable();
if (uv_get_min_hub_revision_id() == 1)
offset = 0;
else
offset = (uv_blade_processor_id() * L1_CACHE_BYTES) % PAGE_SIZE;

- return (cycle_t)uv_read_local_mmr(UVH_RTC | offset);
+ cycles = (cycle_t)uv_read_local_mmr(UVH_RTC | offset);
+ migrate_enable();
+
+ return cycles;
}

/*
Mike Galbraith
2012-01-15 14:23:18 UTC
Permalink
Post by Mike Galbraith
Post by Dimitri Sivanich
Resending this.
Allow manual override of the tick_do_timer_cpu.
Bigger button below.
+/* Called with cgroup_mutex held */
+void cpuset_flags_set(int cpu, unsigned bits)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+ int nr;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ /* Set blocker flags before taking any action */
+ rq->cpuset_flags |= bits;
+ for (nr = 0; bits; nr++) {
+ if (!(bits & (1 << nr)))
+ continue;
+ switch (nr) {
+ break;
+ /* Ensure that jiffies doesn't go stale */
+ if (!nr_hpc_cpus++) {
+ tick_do_timer_cpu = 0;
+ /* safe, CPU0 is modifier excluded */
+ cpuset_flags_set(0, 1 << RQ_TICK);
+ wake_up_idle_cpu(0);
Just in case someone was going to mention it, I'd already moved that
wake_up_idle_cpu() to the RQ_TICK case, but the darn thing crawled back.

-Mike
Mike Galbraith
2012-01-25 11:27:16 UTC
Permalink
Post by Mike Galbraith
Post by Dimitri Sivanich
Resending this.
Allow manual override of the tick_do_timer_cpu.
Bigger button below.
Post by Dimitri Sivanich
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
+#ifdef CONFIG_NO_HZ
+ /* nohz mode not supported */
+ if (tick_nohz_enabled)
+ return -EINVAL;
+#endif
Uhuh, we have something in common: your HPC folks don't like NO_HZ
because it makes loads of jitter, and my RT jitter test proggy hates it
to pieces for the same reason. I can't just config it out like you can,
though...
Not expecting any enthusiasm, but this is _one_ way to let nohz=off go
away, and gives a little more control to users who have to provide a
home for jitter intolerant applications.

It's not very pretty, but is pretty convenient.

sched, cpusets: "HPC" cpusets extension

Give the user the ability to dynamically influence scheduler behavior
through "HPC" cpusets.

When enabled, the user can dynamically inform the scheduler that a
cpuset cannot tolerate jitter induced by NO_HZ, jiffies updates, and
RT load balancing logic. A large generic machine can re-partition
to service transient jitter-sensitive loads without requiring the
entire machine to run nohz=off continuously.

Should the user invalidate the "HPC" prerequisites, the modifiers are
self-canceling for safety reasons. Prerequisites are: the set may not
contain CPU0, must be cpu exclusive (obviously), and must be fully
disconnected from scheduler domains.
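
As an illustration only, here is a userspace sketch of how such a set
might be configured, assuming the cpuset filesystem is mounted at
/dev/cpuset, the box was booted with the hpc_cpusets parameter, and the
control files are named as in the cftype table (cpus, cpu_exclusive,
sched_load_balance, sched_hpc); the paths, CPU numbers and the helper
below are illustrative assumptions, not part of the patch:

	/* Hypothetical setup: carve CPUs 8-15 out as an "HPC" cpuset. */
	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static int put(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror(path);
			return -1;
		}
		if (write(fd, val, strlen(val)) < 0)
			perror(path);
		return close(fd);
	}

	int main(void)
	{
		/* assumes: mount -t cpuset none /dev/cpuset was done already */
		if (mkdir("/dev/cpuset/hpc", 0755) && errno != EEXIST)
			return 1;
		put("/dev/cpuset/hpc/cpus", "8-15");
		put("/dev/cpuset/hpc/cpu_exclusive", "1");
		/* disconnect the set's CPUs from the scheduler domains */
		put("/dev/cpuset/sched_load_balance", "0");
		put("/dev/cpuset/hpc/sched_load_balance", "0");
		/* prerequisites now hold, flip the modifier on */
		put("/dev/cpuset/hpc/sched_hpc", "1");
		return 0;
	}

The root-level sched_load_balance write is what actually detaches these
CPUs from the scheduler domains; until that rebuild has taken effect,
the sched_hpc write should be rejected by the runqueue_is_isolated()
check below.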

Signed-off-by: Mike Galbraith <***@gmx.de>

---
include/linux/sched.h | 29 +++++
init/Kconfig | 11 ++
kernel/cpuset.c | 245 ++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 94 +++++++++++++++++-
kernel/sched/rt.c | 18 ++-
kernel/sched/sched.h | 15 ++
kernel/time/tick-sched.c | 6 -
7 files changed, 407 insertions(+), 11 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -271,6 +271,35 @@ extern void init_idle_bootup_task(struct

extern int runqueue_is_locked(int cpu);

+/* Cpuset runqueue behavior modifier flags */
+enum
+{
+ RQ_TICK = (1 << 0),
+ RQ_HPC = (1 << 1),
+ RQ_HPCRT = (1 << 2),
+ RQ_CLEAR = ~0,
+};
+
+#ifdef CONFIG_HPC_CPUSETS
+extern int runqueue_is_flagged(int cpu, unsigned flag);
+extern int runqueue_is_isolated(int cpu);
+extern void cpuset_flags_set(int cpu, unsigned bits);
+extern void cpuset_flags_clr(int cpu, unsigned bits);
+
+#ifdef CONFIG_NO_HZ
+static inline int sched_needs_cpu(int cpu)
+{
+ return runqueue_is_flagged(cpu, RQ_TICK);
+}
+#endif
+#else /* !CONFIG_HPC_CPUSETS */
+static inline int runqueue_is_flagged(int cpu, unsigned flag) { return 0; }
+static inline int runqueue_is_isolated(int cpu) { return 0; }
+static inline int sched_needs_cpu(int cpu) { return 0; }
+static inline void cpuset_flags_set(int cpu, unsigned bits) { }
+static inline void cpuset_flags_clr(int cpu, unsigned bits) { }
+#endif /* CONFIG_HPC_CPUSETS */
+
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern void select_nohz_load_balancer(int stop_tick);
extern void set_cpu_sd_state_idle(void);
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -638,6 +638,17 @@ config PROC_PID_CPUSET
depends on CPUSETS
default y

+config HPC_CPUSETS
+ bool "HPC cpusets"
+ depends on CPUSETS && SMP
+ default n
+ help
+ This option provides per CPUSET scheduler behavior control switches.
+ This is primarily useful on large SMP systems where some partitions
+ may be dedicated to sensitive HPC applications, while others are not.
+
+ Say N if unsure.
+
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
help
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -145,6 +145,8 @@ typedef enum {
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_SCHED_HPC,
+ CS_SCHED_HPCRT,
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -183,6 +185,16 @@ static inline int is_spread_slab(const s
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+static inline int is_sched_hpc(const struct cpuset *cs)
+{
+ return test_bit(CS_SCHED_HPC, &cs->flags);
+}
+
+static inline int is_sched_hpc_rt(const struct cpuset *cs)
+{
+ return test_bit(CS_SCHED_HPCRT, &cs->flags);
+}
+
static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
};
@@ -382,6 +394,168 @@ static void free_trial_cpuset(struct cpu
kfree(trial);
}

+#ifdef CONFIG_HPC_CPUSETS
+/* Without boot parameter "hpc_cpusets", HPC functionality is hidden */
+static __read_mostly int hpc_hide_files = 2;
+
+/**
+ * validate_sched_change() - validate proposed scheduler modifier changes.
+ *
+ * If we replaced the flag and mask values of the current cpuset (cur) with
+ * those values in the trial cpuset (trial), would our various subset and
+ * exclusive rules still be valid? For cpusets with scheduler modifiers,
+ * ensure that CPUs entering or leaving the set have their runqueue flags
+ * set or cleared accordingly, so that cpuset and runqueue state stay in sync.
+ *
+ * @cur: address of an actual, in-use cpuset.
+ * @trial: address of copy of cur, with proposed changes.
+ *
+ * Presumes cgroup_mutex held.
+ * Return 0 if valid, -errno if not.
+ */
+static int
+validate_sched_change(const struct cpuset *cur, const struct cpuset *trial)
+{
+ int cpu;
+
+ if (hpc_hide_files || !is_sched_hpc(trial))
+ return 0;
+
+ cpu = cpumask_first(trial->cpus_allowed);
+
+ if (cur == &top_cpuset || !is_cpu_exclusive(cur))
+ return -EINVAL;
+ /*
+ * HPC cpusets may not contain the boot CPU,
+ * and must be completely isolated or empty.
+ */
+ if (!cpu || is_sched_load_balance(cur))
+ return -EINVAL;
+ if (cpu < nr_cpu_ids && !runqueue_is_isolated(cpu))
+ return -EINVAL;
+
+ /* Handle CPUs entering or leaving the set */
+ if (!cpumask_equal(cur->cpus_allowed, trial->cpus_allowed)) {
+ cpumask_var_t delta;
+ int entering, cpu;
+ unsigned bits;
+
+ if (!zalloc_cpumask_var(&delta, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_xor(delta, cur->cpus_allowed, trial->cpus_allowed);
+ entering = cpumask_weight(cur->cpus_allowed) <
+ cpumask_weight(trial->cpus_allowed);
+
+ bits = RQ_TICK | RQ_HPC;
+ if (is_sched_hpc_rt(trial))
+ bits |= RQ_HPCRT;
+
+ if (entering) {
+ for_each_cpu(cpu, delta) {
+ if (runqueue_is_isolated(cpu))
+ continue;
+ free_cpumask_var(delta);
+ return -EINVAL;
+ }
+ }
+
+ for_each_cpu(cpu, delta) {
+ if (entering)
+ cpuset_flags_set(cpu, bits);
+ else
+ cpuset_flags_clr(cpu, bits);
+ }
+ free_cpumask_var(delta);
+ }
+
+ return 0;
+}
+
+/*
+ * update_sched_flags - update scheduler modifier flags in cpusets.
+ * @bit: the bit changing state.
+ * @cs: the cpuset in which flags need to be updated.
+ * @turning_on: whether we're turning the bit on or off.
+ *
+ * Called with cgroup_mutex held. Turn scheduler modifiers on/off,
+ * updating runqueue flags for associated CPUs. Set/clear of a flag
+ * which invalidates modifiers recursively clears invalidated flags
+ * for child cpusets and their associated CPUs.
+ *
+ * No return value.
+ */
+static void
+update_sched_flags(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on)
+{
+ struct cgroup *cont;
+ struct cpuset *child;
+ unsigned cpu, bits = 0, recursive = 0;
+
+ switch (bit) {
+ case CS_CPU_EXCLUSIVE:
+ if (turning_on)
+ return;
+ bits = RQ_CLEAR;
+ recursive = 1;
+ break;
+ case CS_SCHED_LOAD_BALANCE:
+ if (!turning_on)
+ return;
+ if (is_sched_hpc(cs)) {
+ bits |= RQ_TICK | RQ_HPC;
+ clear_bit(CS_SCHED_HPC, &cs->flags);
+ }
+ if (is_sched_hpc_rt(cs)) {
+ bits |= RQ_HPCRT;
+ clear_bit(CS_SCHED_HPCRT, &cs->flags);
+ }
+ recursive = 1;
+ break;
+ case CS_SCHED_HPC:
+ bits = RQ_TICK | RQ_HPC;
+ break;
+ case CS_SCHED_HPCRT:
+ bits = RQ_HPCRT;
+ break;
+ default:
+ return;
+ }
+
+ if (recursive) {
+ list_for_each_entry(cont, &cs->css.cgroup->children, sibling) {
+ child = cgroup_cs(cont);
+ update_sched_flags(bit, child, turning_on);
+ }
+ turning_on = 0;
+ }
+
+ if (!bits)
+ return;
+
+ for_each_cpu(cpu, cs->cpus_allowed) {
+ if (turning_on)
+ cpuset_flags_set(cpu, bits);
+ else
+ cpuset_flags_clr(cpu, bits);
+ }
+}
+
+#else /* !CONFIG_HPC_CPUSETS */
+
+/* HPC files do not exist, nothing to hide. */
+static __read_mostly int hpc_hide_files;
+
+static int
+validate_sched_change(const struct cpuset *cur, const struct cpuset *trial)
+{
+ return 0;
+}
+static void
+update_sched_flags(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on) { }
+
+#endif /* CONFIG_HPC_CPUSETS */
+
/*
* validate_change() - Used to validate that any proposed cpuset change
* follows the structural rules for cpusets.
@@ -406,6 +580,7 @@ static int validate_change(const struct
{
struct cgroup *cont;
struct cpuset *c, *par;
+ int ret;

/* Each of our child cpusets must be a subset of us */
list_for_each_entry(cont, &cur->css.cgroup->children, sibling) {
@@ -413,6 +588,10 @@ static int validate_change(const struct
return -EBUSY;
}

+ ret = validate_sched_change(cur, trial);
+ if (ret)
+ return ret;
+
/* Remaining checks don't apply to root cpuset */
if (cur == &top_cpuset)
return 0;
@@ -1250,6 +1429,7 @@ static int update_flag(cpuset_flagbits_t
struct cpuset *trialcs;
int balance_flag_changed;
int spread_flag_changed;
+ int sched_flag_changed;
struct ptr_heap heap;
int err;

@@ -1273,6 +1453,11 @@ static int update_flag(cpuset_flagbits_t
balance_flag_changed = (is_sched_load_balance(cs) !=
is_sched_load_balance(trialcs));

+ sched_flag_changed = balance_flag_changed;
+ sched_flag_changed |= (is_cpu_exclusive(cs) != is_cpu_exclusive(trialcs));
+ sched_flag_changed |= (is_sched_hpc(cs) != is_sched_hpc(trialcs));
+ sched_flag_changed |= (is_sched_hpc_rt(cs) != is_sched_hpc_rt(trialcs));
+
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|| (is_spread_page(cs) != is_spread_page(trialcs)));

@@ -1283,6 +1468,9 @@ static int update_flag(cpuset_flagbits_t
if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
async_rebuild_sched_domains();

+ if (sched_flag_changed)
+ update_sched_flags(bit, cs, turning_on);
+
if (spread_flag_changed)
update_tasks_flags(cs, &heap);
heap_free(&heap);
@@ -1488,6 +1676,8 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_SCHED_HPC,
+ FILE_SCHED_HPCRT,
} cpuset_filetype_t;

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1527,6 +1717,18 @@ static int cpuset_write_u64(struct cgrou
case FILE_SPREAD_SLAB:
retval = update_flag(CS_SPREAD_SLAB, cs, val);
break;
+ case FILE_SCHED_HPC:
+ if (!val && is_sched_hpc_rt(cs))
+ retval = update_flag(CS_SCHED_HPCRT, cs, val);
+ if (!retval)
+ retval = update_flag(CS_SCHED_HPC, cs, val);
+ break;
+ case FILE_SCHED_HPCRT:
+ if (val && !is_sched_hpc(cs))
+ retval = update_flag(CS_SCHED_HPC, cs, val);
+ if (!retval)
+ retval = update_flag(CS_SCHED_HPCRT, cs, val);
+ break;
default:
retval = -EINVAL;
break;
@@ -1676,6 +1878,10 @@ static u64 cpuset_read_u64(struct cgroup
return is_mem_hardwall(cs);
case FILE_SCHED_LOAD_BALANCE:
return is_sched_load_balance(cs);
+ case FILE_SCHED_HPC:
+ return is_sched_hpc(cs);
+ case FILE_SCHED_HPCRT:
+ return is_sched_hpc_rt(cs);
case FILE_MEMORY_MIGRATE:
return is_memory_migrate(cs);
case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1794,6 +2000,26 @@ static struct cftype files[] = {
.write_u64 = cpuset_write_u64,
.private = FILE_SPREAD_SLAB,
},
+#ifdef CONFIG_HPC_CPUSETS
+ /*
+ * IMPORTANT: HPC related files must be LAST in the array,
+ * they are enabled via a boot parameter, without which
+ * we lie about the array size to hide them.
+ */
+ {
+ .name = "sched_hpc",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SCHED_HPC,
+ },
+
+ {
+ .name = "sched_hpc_rt",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SCHED_HPCRT,
+ },
+#endif
};

static struct cftype cft_memory_pressure_enabled = {
@@ -1805,9 +2031,9 @@ static struct cftype cft_memory_pressure

static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
{
- int err;
+ int err, file_count = ARRAY_SIZE(files) - hpc_hide_files;

- err = cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+ err = cgroup_add_files(cont, ss, files, file_count);
if (err)
return err;
/* memory_pressure_enabled is in root cpuset only */
@@ -1906,6 +2132,10 @@ static void cpuset_destroy(struct cgroup
{
struct cpuset *cs = cgroup_cs(cont);

+ if (is_sched_hpc_rt(cs))
+ update_flag(CS_SCHED_HPCRT, cs, 0);
+ if (is_sched_hpc(cs))
+ update_flag(CS_SCHED_HPC, cs, 0);
if (is_sched_load_balance(cs))
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);

@@ -2634,3 +2864,14 @@ void cpuset_task_status_allowed(struct s
seq_nodemask_list(m, &task->mems_allowed);
seq_printf(m, "\n");
}
+
+#ifdef CONFIG_HPC_CPUSETS
+static int __init hpc_cpusets(char *str)
+{
+ hpc_hide_files = 0;
+
+ return 0;
+}
+early_param("hpc_cpusets", hpc_cpusets);
+#endif
+
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1957,14 +1957,14 @@ static void finish_task_switch(struct rq
/* assumes rq->lock is held */
static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
{
- if (prev->sched_class->pre_schedule)
+ if (prev->sched_class->pre_schedule && !rq_flag(rq, RQ_HPCRT))
prev->sched_class->pre_schedule(rq, prev);
}

/* rq->lock is NOT held, but preemption is disabled */
static inline void post_schedule(struct rq *rq)
{
- if (rq->post_schedule) {
+ if (rq->post_schedule && !rq_flag(rq, RQ_HPCRT)) {
unsigned long flags;

raw_spin_lock_irqsave(&rq->lock, flags);
@@ -2986,6 +2986,91 @@ void thread_group_times(struct task_stru
}
#endif

+#ifdef CONFIG_HPC_CPUSETS
+extern int tick_do_timer_cpu __read_mostly;
+static int nr_hpc_cpus;
+
+#ifndef CONFIG_NO_HZ
+static inline void wake_up_idle_cpu(int cpu) { }
+#endif
+
+/* Called with cgroup_mutex held */
+void cpuset_flags_set(int cpu, unsigned bits)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+ int nr, bit;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ /* Set blocker flags before taking any action */
+ rq->cpuset_flags |= bits;
+ for (nr = 0; bits; nr++) {
+ bit = 1 << nr;
+ if (!(bits & bit))
+ continue;
+ switch (bit) {
+ case RQ_TICK:
+ wake_up_idle_cpu(cpu);
+ break;
+ case RQ_HPC:
+ /* Ensure that jiffies doesn't go stale */
+ if (!nr_hpc_cpus++) {
+ tick_do_timer_cpu = 0;
+ /* safe, CPU0 is modifier excluded */
+ cpuset_flags_set(0, RQ_TICK);
+ }
+ break;
+ case RQ_HPCRT:
+ cpupri_set(&rq->rd->cpupri, cpu, CPUPRI_INVALID);
+ break;
+ }
+ bits &= ~bit;
+ }
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+/* Called with cgroup_mutex held */
+void cpuset_flags_clr(int cpu, unsigned bits)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+ unsigned nr, bit;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ bits &= rq->cpuset_flags;
+ rq->cpuset_flags &= ~bits;
+ for (nr = 0; bits; nr++) {
+ bit = 1 << nr;
+ if (!(bits & bit))
+ continue;
+ switch (bit) {
+ case RQ_TICK:
+ break;
+ case RQ_HPC:
+ /* Let CPU0 resume nohz mode */
+ if (nr_hpc_cpus && !--nr_hpc_cpus)
+ cpuset_flags_clr(0, RQ_TICK);
+ break;
+ case RQ_HPCRT:
+ cpupri_set(&rq->rd->cpupri, cpu, rq->rt.highest_prio.curr);
+ break;
+ }
+ bits &= ~bit;
+ }
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+int runqueue_is_isolated(int cpu)
+{
+ return !cpu_rq(cpu)->sd;
+}
+
+int runqueue_is_flagged(int cpu, unsigned flag)
+{
+ return rq_flag(cpu_rq(cpu), flag);
+}
+#endif /* CONFIG_HPC_CPUSETS */
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -3007,6 +3092,8 @@ void scheduler_tick(void)
perf_event_task_tick();

#ifdef CONFIG_SMP
+ if (rq_flag(rq, RQ_HPC))
+ return;
rq->idle_balance = idle_cpu(cpu);
trigger_load_balance(rq, cpu);
#endif
@@ -6940,6 +7027,9 @@ void __init sched_init(void)
#ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
#endif
+#ifdef CONFIG_HPC_CPUSETS
+ rq->cpuset_flags = 0;
+#endif
#endif
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -917,8 +917,13 @@ inc_rt_prio_smp(struct rt_rq *rt_rq, int
{
struct rq *rq = rq_of_rt_rq(rt_rq);

- if (rq->online && prio < prev_prio)
- cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
+ if (!rq->online || prio >= prev_prio)
+ return;
+
+ if (rq_flag(rq, RQ_HPCRT))
+ return;
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
}

static void
@@ -926,8 +931,13 @@ dec_rt_prio_smp(struct rt_rq *rt_rq, int
{
struct rq *rq = rq_of_rt_rq(rt_rq);

- if (rq->online && rt_rq->highest_prio.curr != prev_prio)
- cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr);
+ if (!rq->online || rt_rq->highest_prio.curr == prev_prio)
+ return;
+
+ if (rq_flag(rq, RQ_HPCRT))
+ return;
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr);
}

#else /* CONFIG_SMP */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -419,6 +419,9 @@ struct rq {
int post_schedule;
int active_balance;
int push_cpu;
+#ifdef CONFIG_CPUSETS
+ unsigned int cpuset_flags;
+#endif
struct cpu_stop_work active_balance_work;
/* cpu of this runqueue: */
int cpu;
@@ -539,6 +542,18 @@ DECLARE_PER_CPU(int, sd_llc_id);

#endif /* CONFIG_SMP */

+#ifdef CONFIG_HPC_CPUSETS
+static inline int rq_flag(struct rq *rq, unsigned flag)
+{
+ return rq->cpuset_flags & flag;
+}
+#else
+static inline int rq_flag(struct rq *rq, unsigned flag)
+{
+ return 0;
+}
+#endif
+
#include "stats.h"
#include "auto_group.h"

--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -303,9 +303,6 @@ static void tick_nohz_stop_sched_tick(st
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
return;

- if (need_resched())
- return;
-
if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
static int ratelimit;

@@ -317,6 +314,9 @@ static void tick_nohz_stop_sched_tick(st
return;
}

+ if (need_resched() || sched_needs_cpu(cpu))
+ return;
+
ts->idle_calls++;
/* Read jiffies and the time when jiffies were updated last */
do {
Thomas Gleixner
2012-02-15 14:16:34 UTC
Permalink
Post by Dimitri Sivanich
Allow manual override of the tick_do_timer_cpu.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
I really wonder about this changelog. The only case where jiffies
updates bounce around is the NOHZ case. In all other modes (periodic
or highres) the boot cpu gets the do_timer() duty and it's never
assigned to any other cpu.

So what's the point of this exercise? Moving it away from CPU0 for
academic reasons or what?

Thanks,

tglx
Dimitri Sivanich
2012-02-15 14:37:10 UTC
Permalink
Post by Thomas Gleixner
Post by Dimitri Sivanich
Allow manual override of the tick_do_timer_cpu.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
I really wonder about this changelog. The only case where jiffies
updates bounce around is the NOHZ case. In all other modes (periodic
or highres) the boot cpu gets the do_timer() duty and it's never
assigned to any other cpu.
So what's the point of this exercise? Moving it away from CPU0 for
academic reasons or what?
I wasn't specifically trying to move it away from CPU0 (having jiffies updates
on CPU0 was and would be just fine for the nohz=off case). The issue was
that the tick_do_timer_cpu could be any cpu even in the nohz=off case (maybe
something has changed since then?). After the point of assignment it is
static, but you never know which cpu it is.
Thomas Gleixner
2012-02-15 14:52:06 UTC
Permalink
Post by Dimitri Sivanich
Post by Thomas Gleixner
Post by Dimitri Sivanich
Allow manual override of the tick_do_timer_cpu.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
I really wonder about this changelog. The only case where jiffies
updates bounce around is the NOHZ case. In all other modes (periodic
or highres) the boot cpu gets the do_timer() duty and it's never
assigned to any other cpu.
So what's the point of this exercise? Moving it away from CPU0 for
academic reasons or what?
I wasn't specifically trying to move it away from CPU0 (having jiffies updates
on CPU0 was and would be just fine for the nohz=off case). The issue was
that the tick_do_timer_cpu could be any cpu even in the nohz=off case (maybe
something has changed since then?). After the point of assignment it is
static, but you never know which cpu it is.
It's always the boot cpu and that has been there from day one of that code.

tick_setup_device()
{
/*
* First device setup ?
*/
if (!td->evtdev) {
/*
* If no cpu took the do_timer update, assign it to
* this cpu:
*/
if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
tick_do_timer_cpu = cpu;

So the first CPU which registers a clock event device takes it. That's
the boot CPU, no matter what.

Thanks,

tglx
Dimitri Sivanich
2012-02-15 15:34:38 UTC
Permalink
Post by Thomas Gleixner
Post by Dimitri Sivanich
Post by Thomas Gleixner
Post by Dimitri Sivanich
Allow manual override of the tick_do_timer_cpu.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like reproducibility
of results, and having jiffies updates bouncing around introduces variability.
I really wonder about this changelog. The only case where jiffies
updates bounce around is the NOHZ case. In all other modes (periodic
or highres) the boot cpu gets the do_timer() duty and it's never
assigned to any other cpu.
So what's the point of this exercise? Moving it away from CPU0 for
academic reasons or what?
I wasn't specifically trying to move it away from CPU0 (having jiffies updates
on CPU0 was and would be just fine for the nohz=off case). The issue was
that the tick_do_timer_cpu could be any cpu even in the nohz=off case (maybe
something has changed since then?). After the point of assignment it is
static, but you never know which cpu it is.
It's always the boot cpu and that has been there from day one of that code.
tick_setup_device()
{
/*
* First device setup ?
*/
if (!td->evtdev) {
/*
* If no cpu took the do_timer update, assign it to
*/
if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
tick_do_timer_cpu = cpu;
So the first CPU which registers a clock event device takes it. That's
the boot CPU, no matter what.
Both kernel tracing and the original patch that I proposed for this
showed plainly (at the time) that the tick_do_timer_cpu was not always cpu 0
prior to modifying it for nohz=off. Maybe that is no longer the case?
Thomas Gleixner
2012-02-15 20:36:47 UTC
Permalink
Post by Dimitri Sivanich
Post by Thomas Gleixner
So the first CPU which registers a clock event device takes it. That's
the boot CPU, no matter what.
Both kernel tracing and the original patch that I proposed for this
showed plainly (at the time) that the tick_do_timer_cpu was not always cpu 0
prior to modifying it for nohz=off. Maybe that is no longer the case?
This logic has not been changed in years.

tick_do_timer_cpu is initialized to TICK_DO_TIMER_BOOT and the first
cpu which registers either a global or a per cpu clock event device
takes it over. This is at least on x86 always the boot cpu, i.e. cpu0.
After that point nothing touches that variable when nohz is disabled
(runtime or compile time).

So I really want to see proper proof why that would not be the
case. If it really happens then we fix the root cause instead of
adding random sysfs interfaces.

Thanks,

tglx
Dimitri Sivanich
2012-02-16 14:59:00 UTC
Permalink
Post by Thomas Gleixner
Post by Dimitri Sivanich
Post by Thomas Gleixner
So the first CPU which registers a clock event device takes it. That's
the boot CPU, no matter what.
Both kernel tracing and the original patch that I proposed for this
showed plainly (at the time) that the tick_do_timer_cpu was not always cpu 0
prior to modifying it for nohz=off. Maybe that is no longer the case?
This logic has not been changed in years.
I did some tracing of all points where tick_do_timer_cpu is set in the
3.3.0-rc3+ kernel.
Post by Thomas Gleixner
tick_do_timer_cpu is initialized to TICK_DO_TIMER_BOOT and the first
cpu which registers either a global or a per cpu clock event device
takes it over. This is at least on x86 always the boot cpu, i.e. cpu0.
After that point nothing touches that variable when nohz is disabled
(runtime or compile time).
At that point it is set to cpu 0. However, when we go into highres mode
it does change. Below are the two places it was set during boot with
nohz=off on one of our x86 based machines.

[ 0.000000] tick_setup_device: tick_do_timer_cpu 0
[ 1.924098] tick_broadcast_setup_oneshot: tick_do_timer_cpu 17

So in this example it's now cpu 17, and it stays that way from that point on.

A traceback at that point shows tick_init_highres is indeed initiating this:

[ 1.924863] [<ffffffff81087e01>] tick_broadcast_setup_oneshot+0x71/0x160
[ 1.924863] [<ffffffff81087f23>] tick_broadcast_switch_to_oneshot+0x33/0x50
[ 1.924863] [<ffffffff81088841>] tick_switch_to_oneshot+0x81/0xd0
[ 1.924863] [<ffffffff810888a0>] tick_init_highres+0x10/0x20
[ 1.924863] [<ffffffff81061e71>] hrtimer_run_pending+0x71/0xd0
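
For reference, output of the form above can be produced with a small
helper called from each place that writes tick_do_timer_cpu; this is a
sketch of the idea, not necessarily the instrumentation actually used:

	/* Hypothetical debug aid: call after every tick_do_timer_cpu write. */
	#include <linux/kernel.h>

	static void report_do_timer_cpu(const char *site, int cpu)
	{
		pr_info("%s: tick_do_timer_cpu %d\n", site, cpu);
		dump_stack();	/* shows which path initiated the change */
	}

e.g. report_do_timer_cpu(__func__, tick_do_timer_cpu) dropped into
tick_setup_device() and tick_broadcast_setup_oneshot().
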
Post by Thomas Gleixner
So I really want to see proper proof why that would not be the
case. If it really happens then we fix the root cause instead of
adding random sysfs interfaces.
If you're willing to consider it a bug that tick_do_timer_cpu is not cpu 0,
then I'm very much in agreement.
Post by Thomas Gleixner
Thanks,
tglx