Discussion: utime/stime decreasing on thread exit
Spencer Candland
2009-11-04 00:23:27 UTC
I am seeing a problem with utime/stime decreasing on thread exit in a
multi-threaded process. I have been able to track this regression down
to the "process wide cpu clocks/timers" changes introduced in
2.6.29-rc5; specifically, when I revert the following commits I no
longer see decreasing utime/stime values:

4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
3fccfd67df79c6351a156eb25a7a514e5f39c4d9
7d8e23df69820e6be42bcc41d441f4860e8c76f7
4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0

I poked around a little, but I am afraid I have to admit that I am not
familiar enough with how this works to resolve this or suggest a fix.

I have verified this is happening in kernels 2.6.29-rc5 - 2.6.32-rc6. I
have been testing on x86 vanilla kernels, but have also verified it
on several x86 2.6.29+ distro kernels (Fedora and Ubuntu).

I first noticed this in a production environment running Apache with the
worker MPM; however, while tracking this down I put together a simple
program that reliably shows utime decreasing. Hopefully it will be
helpful in demonstrating the issue:

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

#define NUM_THREADS 500

struct tms start;
int oldutime;

void *pound (void *threadid)
{
        struct tms end;
        int utime;
        int c;
        for (c = 0; c < 10000000; c++);
        times(&end);
        utime = ((int)end.tms_utime - (int)start.tms_utime);
        if (oldutime > utime) {
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
        }
        oldutime = utime;
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long i;
        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
}
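(A build line like "gcc -O0 -pthread -o decrease_utime decrease_utime.c"
should reproduce this; the -O0 matters, since an optimizing compiler may
remove the empty busy loop entirely, leaving the threads with no user time
to accumulate.)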

Output:

# ./decrease_utime
utime decreased, was 1288, now 1287!
utime decreased, was 1294, now 1293!
utime decreased, was 1296, now 1295!
utime decreased, was 1297, now 1296!
utime decreased, was 1298, now 1297!
utime decreased, was 1299, now 1298!

Please let me know if you need any additional information.

Thank you,
Spencer Candland
Hidetoshi Seto
2009-11-04 06:49:25 UTC
Post by Spencer Candland
I am seeing a problem with utime/stime decreasing on thread exit in a
multi-threaded process. I have been able to track this regression down
to the "process wide cpu clocks/timers" changes introduced in
2.6.29-rc5; specifically, when I revert the following commits I no
longer see decreasing utime/stime values:
4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
3fccfd67df79c6351a156eb25a7a514e5f39c4d9
7d8e23df69820e6be42bcc41d441f4860e8c76f7
4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0
I poked around a little, but I am afraid I have to admit that I am not
familiar enough with how this works to resolve this or suggest a fix.
I have verified this is happening in kernels 2.6.29-rc5 - 2.6.32-rc6. I
have been testing on x86 vanilla kernels, but have also verified it
on several x86 2.6.29+ distro kernels (Fedora and Ubuntu).
I first noticed this in a production environment running Apache with the
worker MPM; however, while tracking this down I put together a simple
program that reliably shows utime decreasing. Hopefully it will be
helpful in demonstrating the issue:
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>
#define NUM_THREADS 500
struct tms start;
int oldutime;
void *pound (void *threadid)
{
        struct tms end;
        int utime;
        int c;
        for (c = 0; c < 10000000; c++);
        times(&end);
        utime = ((int)end.tms_utime - (int)start.tms_utime);
        if (oldutime > utime) {
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
        }
        oldutime = utime;
        pthread_exit(NULL);
}
int main()
{
        pthread_t th[NUM_THREADS];
        long i;
        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
}
This test program doesn't show the problem correctly, because
oldutime can be modified by any of the threads running simultaneously.
Post by Spencer Candland
times(&end);
utime = ((int)end.tms_utime - (int)start.tms_utime);
if (oldutime > utime) {
        printf("utime decreased, was %d, now %d!\n", oldutime, utime);
}
oldutime = utime;
So I thought it is not a bug, but....
Post by Spencer Candland
# ./decrease_utime
utime decreased, was 1288, now 1287!
utime decreased, was 1294, now 1293!
utime decreased, was 1296, now 1295!
utime decreased, was 1297, now 1296!
utime decreased, was 1298, now 1297!
utime decreased, was 1299, now 1298!
Please let me know if you need any additional information.
Thank you,
Spencer Candland
I revised the test program, but I can still see the problem.
So I agree that something is actually wrong there.

[***@localhost work]$ cat test.c
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

#define NUM_THREADS 500

void *pound (void *threadid)
{
        struct tms t1, t2;
        int c;
        for (c = 0; c < 10000000; c++);
        times(&t1);
        times(&t2);
        if (t1.tms_utime > t2.tms_utime) {
                /* tms fields are clock_t, so print them as long */
                printf("utime decreased, was %ld, now %ld! : (%ld %ld %ld %ld) (%ld %ld %ld %ld)\n",
                        (long)t1.tms_utime, (long)t2.tms_utime,
                        (long)t1.tms_stime, (long)t1.tms_utime, (long)t1.tms_cstime, (long)t1.tms_cutime,
                        (long)t2.tms_stime, (long)t2.tms_utime, (long)t2.tms_cstime, (long)t2.tms_cutime);
        }
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long i;
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
}

[***@localhost work]$ ./a.out
utime decreased, was 1066, now 1063! : (102 1066 0 0) (102 1063 0 0)
utime decreased, was 1101, now 1099! : (114 1101 0 0) (114 1099 0 0)
utime decreased, was 1095, now 1093! : (146 1095 0 0) (146 1093 0 0)
utime decreased, was 1089, now 1086! : (166 1089 0 0) (166 1086 0 0)
utime decreased, was 1071, now 1069! : (212 1071 0 0) (212 1069 0 0)
utime decreased, was 1057, now 1054! : (274 1057 0 0) (274 1054 0 0)
utime decreased, was 1054, now 1049! : (79 1054 0 0) (303 1049 0 0)
utime decreased, was 1050, now 1048! : (313 1050 0 0) (313 1048 0 0)
utime decreased, was 1049, now 1046! : (319 1049 0 0) (319 1046 0 0)
utime decreased, was 1058, now 1038! : (83 1058 0 0) (369 1038 0 0)
utime decreased, was 1038, now 1036! : (378 1038 0 0) (378 1036 0 0)
utime decreased, was 1037, now 1035! : (384 1037 0 0) (384 1035 0 0)
utime decreased, was 1035, now 1034! : (385 1035 0 0) (386 1034 0 0)
utime decreased, was 1037, now 1035! : (385 1037 0 0) (385 1035 0 0)
utime decreased, was 1035, now 1032! : (389 1035 0 0) (390 1032 0 0)
utime decreased, was 1030, now 1028! : (402 1030 0 0) (402 1028 0 0)
utime decreased, was 1029, now 1026! : (405 1029 0 0) (405 1026 0 0)
utime decreased, was 1025, now 1024! : (409 1025 0 0) (410 1024 0 0)
utime decreased, was 1023, now 1021! : (420 1023 0 0) (420 1021 0 0)
utime decreased, was 1022, now 1020! : (423 1022 0 0) (423 1020 0 0)
utime decreased, was 1037, now 1017! : (372 1037 0 0) (432 1017 0 0)
utime decreased, was 1019, now 1017! : (431 1019 0 0) (432 1017 0 0)
utime decreased, was 1053, now 1015! : (79 1053 0 0) (434 1015 0 0)
utime decreased, was 1013, now 1011! : (446 1013 0 0) (446 1011 0 0)
utime decreased, was 1013, now 1010! : (448 1013 0 0) (448 1010 0 0)
utime decreased, was 1053, now 1009! : (79 1053 0 0) (451 1009 0 0)
[***@localhost work]$ taskset 0x1 ./a.out
utime decreased, was 1025, now 1021! : (59 1025 0 0) (417 1021 0 0)
utime decreased, was 1023, now 1021! : (59 1023 0 0) (419 1021 0 0)
utime decreased, was 1027, now 1020! : (60 1027 0 0) (420 1020 0 0)
utime decreased, was 1068, now 1000! : (173 1068 0 0) (500 1000 0 0)


I'll check the commits you pointed to/reverted.


Thanks,
H.Seto
Hidetoshi Seto
2009-11-05 05:24:06 UTC
Post by Hidetoshi Seto
Post by Spencer Candland
I am seeing a problem with utime/stime decreasing on thread exit in a
multi-threaded process. I have been able to track this regression down
to the "process wide cpu clocks/timers" changes introduces in
2.6.29-rc5, specifically when I revert the following commits I know
4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
3fccfd67df79c6351a156eb25a7a514e5f39c4d9
7d8e23df69820e6be42bcc41d441f4860e8c76f7
4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0
(snip)
Post by Hidetoshi Seto
I'll check the commits you pointed to/reverted.
commit 4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
Date: Thu Feb 5 12:24:16 2009 +0100
timers: split process wide cpu clocks/timers
This results in sys_times() using "clocks" instead of
"timers". Please refer to the description of the above commit.

I found 2 problems through my review.


Problem [1]:
thread_group_cputime() vs exit

+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+{
+ struct sighand_struct *sighand;
+ struct signal_struct *sig;
+ struct task_struct *t;
+
+ *times = INIT_CPUTIME;
+
+ rcu_read_lock();
+ sighand = rcu_dereference(tsk->sighand);
+ if (!sighand)
+ goto out;
+
+ sig = tsk->signal;
+
+ t = tsk;
+ do {
+ times->utime = cputime_add(times->utime, t->utime);
+ times->stime = cputime_add(times->stime, t->stime);
+ times->sum_exec_runtime += t->se.sum_exec_runtime;
+
+ t = next_thread(t);
+ } while (t != tsk);
+
+ times->utime = cputime_add(times->utime, sig->utime);
+ times->stime = cputime_add(times->stime, sig->stime);
+ times->sum_exec_runtime += sig->sum_sched_runtime;
+out:
+ rcu_read_unlock();
+}

If one of the (thousands of) threads exits while another thread is in the
do-while loop above, the s/utime of the exited thread can be accounted twice:
in the do-while loop (before exit) and in the final cputime_add() (after exit).

I suppose this is hard to fix: taking the signal lock would solve this
problem, but it could block all other threads for a long time and cause
serious performance issues and so on...


Problem [2]:
use of task_s/utime()

I modified the test program further, to call times() 6 times and print the
values if utime decreased between the 3rd and 4th calls.
I noticed that if problem [1] were the root cause, I cannot explain why the
results show continuously decreased values, instead of a single increased
value at one point (like (v)(v)(V)(v)(v)(v)), which is what I would expect.

:
times decreased : (104 984) (104 984) (104 984) (105 983) (105 983) (105 983)
times decreased : (115 981) (116 980) (116 978) (117 977) (117 977) (119 979)
times decreased : (116 980) (117 980) (117 980) (117 977) (118 979) (118 977)
:

And it seems that the more threads exit, the more utime decreases.

Soon I found:

[kernel/exit.c]
+ sig->utime = cputime_add(sig->utime, task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, task_stime(tsk));

While thread_group_cputime() accumulates raw s/utime in its do-while loop,
the signal struct accumulates the adjusted s/utime of exited threads.

I'm not sure how this adjustment works, but applying the following patch
makes the result a little bit better:

:
times decreased : (436 741) (436 741) (437 744) (436 742) (436 742) (436 742)
times decreased : (454 792) (454 792) (455 794) (454 792) (454 792) (454 792)
times decreased : (503 941) (503 941) (504 943) (503 941) (503 941) (503 941)
:

But the decreasing (or increasing) still continues, because problem [1]
is there at least.

I think I cannot take this problem any further... Can anybody help?


Thanks,
H.Seto

===

Subject: [PATCH] thread_group_cputime() should use task_s/utime()

The signal struct accumulates adjusted cputime of exited threads,
so thread_group_cputime() should use task_s/utime() instead of raw
task->s/utime, to accumulate adjusted cputime of live threads.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
kernel/posix-cpu-timers.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)

 	t = tsk;
 	do {
-		times->utime = cputime_add(times->utime, t->utime);
-		times->stime = cputime_add(times->stime, t->stime);
+		times->utime = cputime_add(times->utime, task_utime(t));
+		times->stime = cputime_add(times->stime, task_stime(t));
 		times->sum_exec_runtime += t->se.sum_exec_runtime;

 		t = next_thread(t);
--
1.6.5.2
Peter Zijlstra
2009-11-09 14:49:14 UTC
Post by Hidetoshi Seto
thread_group_cputime() vs exit
+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+{
+ struct sighand_struct *sighand;
+ struct signal_struct *sig;
+ struct task_struct *t;
+
+ *times = INIT_CPUTIME;
+
+ rcu_read_lock();
+ sighand = rcu_dereference(tsk->sighand);
+ if (!sighand)
+ goto out;
+
+ sig = tsk->signal;
+
+ t = tsk;
+ do {
+ times->utime = cputime_add(times->utime, t->utime);
+ times->stime = cputime_add(times->stime, t->stime);
+ times->sum_exec_runtime += t->se.sum_exec_runtime;
+
+ t = next_thread(t);
+ } while (t != tsk);
+
+ times->utime = cputime_add(times->utime, sig->utime);
+ times->stime = cputime_add(times->stime, sig->stime);
+ times->sum_exec_runtime += sig->sum_sched_runtime;
+ rcu_read_unlock();
+}
If one of the (thousands of) threads exits while another thread is in the
do-while loop above, the s/utime of the exited thread can be accounted twice:
in the do-while loop (before exit) and in the final cputime_add() (after exit).
I suppose this is hard to fix: taking the signal lock would solve this
problem, but it could block all other threads for a long time and cause
serious performance issues and so on...
I just checked .22 and there we seem to hold p->sighand->siglock over
the full task iteration. So we might as well revert back to that if
people really mind counting things twice :-)

FWIW getrusage() also takes siglock over the task iteration.

Alternatively, we could try reading the sig->[us]time before doing the
loop, but I guess that's still racy in that we can then miss someone
altogether.
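(Concretely: a thread that exits after sig->[us]time has been sampled but
before the loop reaches it folds its time into sig->[us]time too late for
the snapshot to see, and it is gone from the thread list by the time we
walk it, so its time ends up counted nowhere.)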
Post by Hidetoshi Seto
use of task_s/utime()
I modified the test program further, to call times() 6 times and print the
values if utime decreased between the 3rd and 4th calls.
I noticed that if problem [1] were the root cause, I cannot explain why the
results show continuously decreased values, instead of a single increased
value at one point (like (v)(v)(V)(v)(v)(v)), which is what I would expect.
times decreased : (104 984) (104 984) (104 984) (105 983) (105 983) (105 983)
times decreased : (115 981) (116 980) (116 978) (117 977) (117 977) (119 979)
times decreased : (116 980) (117 980) (117 980) (117 977) (118 979) (118 977)
And it seems that the more threads exit, the more utime decreases.
[kernel/exit.c]
+ sig->utime = cputime_add(sig->utime, task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, task_stime(tsk));
While thread_group_cputime() accumulates raw s/utime in its do-while loop,
the signal struct accumulates the adjusted s/utime of exited threads.
I'm not sure how this adjustment works, but applying the following patch
times decreased : (436 741) (436 741) (437 744) (436 742) (436 742) (436 742)
times decreased : (454 792) (454 792) (455 794) (454 792) (454 792) (454 792)
times decreased : (503 941) (503 941) (504 943) (503 941) (503 941) (503 941)
But the decreasing (or increasing) still continues, because problem [1]
is there at least.
I think I cannot take this problem any further... Can anybody help?
Stick in a few trace_printk()s and see what happens?
Post by Hidetoshi Seto
Subject: [PATCH] thread_group_cputime() should use task_s/utime()
The signal struct accumulates adjusted cputime of exited threads,
so thread_group_cputime() should use task_s/utime() instead of raw
task->s/utime, to accumulate adjusted cputime of live threads.
---
kernel/posix-cpu-timers.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
So what you're trying to say is that because __exit_signal() uses
task_[usg]time() to accumulate sig->[usg]time, we should use it too in
the loop over the live threads?

I'm thinking it's the task_[usg]time() usage in __exit_signal() that's
the issue.

I tried running the modified test.c on a current -tip kernel but could
not observe the problem (dual-core opteron).
Oleg Nesterov
2009-11-09 17:20:20 UTC
Post by Peter Zijlstra
Post by Hidetoshi Seto
thread_group_cputime() vs exit
+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+{
+ struct sighand_struct *sighand;
+ struct signal_struct *sig;
+ struct task_struct *t;
+
+ *times = INIT_CPUTIME;
+
+ rcu_read_lock();
+ sighand = rcu_dereference(tsk->sighand);
+ if (!sighand)
+ goto out;
+
+ sig = tsk->signal;
+
+ t = tsk;
+ do {
+ times->utime = cputime_add(times->utime, t->utime);
+ times->stime = cputime_add(times->stime, t->stime);
+ times->sum_exec_runtime += t->se.sum_exec_runtime;
+
+ t = next_thread(t);
+ } while (t != tsk);
+
+ times->utime = cputime_add(times->utime, sig->utime);
+ times->stime = cputime_add(times->stime, sig->stime);
+ times->sum_exec_runtime += sig->sum_sched_runtime;
+ rcu_read_unlock();
+}
If one of the (thousands of) threads exits while another thread is in the
do-while loop above, the s/utime of the exited thread can be accounted twice:
in the do-while loop (before exit) and in the final cputime_add() (after exit).
I suppose this is hard to fix: taking the signal lock would solve this
problem, but it could block all other threads for a long time and cause
serious performance issues and so on...
I just checked .22 and there we seem to hold p->sighand->siglock over
the full task iteration.
Yes. Then thread_group_cputime() was changed so that it didn't iterate
over sub-threads, then this callsite was moved outside of ->siglock, and now
it does while_each_thread() again.
Post by Peter Zijlstra
So we might as well revert back to that if
people really mind counting things twice :-)
Stanislaw has already sent the patch, but I don't know what happened
with this patch:

[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145

Oleg.
Oleg Nesterov
2009-11-09 17:27:37 UTC
Sorry, forgot to cc Stanislaw...
Peter Zijlstra
2009-11-09 17:31:43 UTC
Post by Oleg Nesterov
Post by Peter Zijlstra
I just checked .22 and there we seem to hold p->sighand->siglock over
the full task iteration.
Yes. Then thread_group_cputime() was changed so that it didn't iterate
over sub-threads, then this callsite was moved outside of ->siglock, and now
it does while_each_thread() again.
Post by Peter Zijlstra
So we might as well revert back to that if
people really mind counting things twice :-)
Stanislaw has already sent the patch, but I don't know what happened
[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145
That patch has the siglock in the function calling
thread_group_cputime(); the .22 code had it near the loop proper, which
to me seems the more sensible thing, since there could be more callers,
no?
Oleg Nesterov
2009-11-09 19:23:55 UTC
Post by Peter Zijlstra
Post by Oleg Nesterov
Stanislaw has already sent the patch, but I don't know what happened
[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145
That patch has the siglock in the function calling
thread_group_cputime(), the 22 code had it near the loop proper, which
to me seems a more sensible thing, since there could be more callers,
no?
Well, we can't take ->siglock in thread_group_cputime(), since sometimes it
is called under ->siglock: do_task_stat() and get_cpu_itimer(), at least.

IIRC, Stanislaw verified other callers have no problems with this helper.

Oleg.
Peter Zijlstra
2009-11-09 19:32:23 UTC
Post by Oleg Nesterov
Post by Peter Zijlstra
Post by Oleg Nesterov
Stanislaw has already sent the patch, but I don't know what happened
[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145
That patch has the siglock in the function calling
thread_group_cputime(), the 22 code had it near the loop proper, which
to me seems a more sensible thing, since there could be more callers,
no?
Well, we can't take ->siglock in thread_group_cputime(), sometimes it
is called under ->siglock. do_task_stat(), get_cpu_itimer() at least.
IIRC, Stanislaw verified other callers have no problems with this helper.
Would have made fine changelog material :-)
Stanislaw Gruszka
2009-11-10 10:44:56 UTC
Post by Oleg Nesterov
Post by Peter Zijlstra
Post by Oleg Nesterov
Stanislaw has already sent the patch, but I don't know what happened
[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145
I did not resubmit it enough times :) But I didn't think it was worth it,
since performance can be degraded.
Post by Oleg Nesterov
Post by Peter Zijlstra
That patch has the siglock in the function calling
thread_group_cputime(), the 22 code had it near the loop proper, which
to me seems a more sensible thing, since there could be more callers,
no?
Well, we can't take ->siglock in thread_group_cputime(), sometimes it
is called under ->siglock. do_task_stat(), get_cpu_itimer() at least.
IIRC, Stanislaw verified other callers have no problems with this helper.
Actually I didn't. I'll try to do that now.

Most calls are done with sighand or tasklist lock taken, except

Cscope tag: thread_group_cputime
# line filename / context / line
1 1353 fs/binfmt_elf.c <<fill_prstatus>>
thread_group_cputime(p, &cputime);
2 1403 fs/binfmt_elf_fdpic.c <<fill_prstatus>>
thread_group_cputime(p, &cputime);
10 129 mm/oom_kill.c <<badness>>
thread_group_cputime(p, &task_time);

I do not think the lock is needed in cases 1 and 2, since those seem to be
core dumps with all threads dead. Not sure about 10, the oom killer.

One other exception is:
fastpath_timer_check() -> thread_group_cputimer() -> thread_group_cputime()

We can solve it like this:

--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1375,13 +1375,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	}

 	sig = tsk->signal;
-	if (!task_cputime_zero(&sig->cputime_expires)) {
-		struct task_cputime group_sample;
-
-		thread_group_cputimer(tsk, &group_sample);
-		if (task_cputime_expired(&group_sample, &sig->cputime_expires))
-			return 1;
-	}
+	if (!task_cputime_zero(&sig->cputime_expires))
+		return 1;

 	return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
 }

Or stay with task_cputime_expired() but only if cputimer is currently running.

Cheers
Stanislaw
Oleg Nesterov
2009-11-10 17:40:08 UTC
Post by Stanislaw Gruszka
10 129 mm/oom_kill.c <<badness>>
thread_group_cputime(p, &task_time);
Not sure about 10, the oom killer.
Not sure either, but I don't think we should worry about badness().
Post by Stanislaw Gruszka
fastpath_timer_check() -> thread_group_cputimer() -> thread_group_cputime()
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1375,13 +1375,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
}
sig = tsk->signal;
- if (!task_cputime_zero(&sig->cputime_expires)) {
- struct task_cputime group_sample;
-
- thread_group_cputimer(tsk, &group_sample);
- if (task_cputime_expired(&group_sample, &sig->cputime_expires))
- return 1;
- }
+ if (!task_cputime_zero(&sig->cputime_expires))
+ return 1;
return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
}
Or stay with task_cputime_expired() but only if cputimer is currently running.
Oh. I forgot this code completely, can't comment.

Can't we ensure that fastpath_timer_check() never does while_each_thread()?
IOW, if sig->cputime_expires != 0 then ->running must be true.

At least, shouldn't stop_process_timers() clear sig->cputime_expires?

Oleg.
Stanislaw Gruszka
2009-11-10 18:24:15 UTC
Post by Oleg Nesterov
Post by Stanislaw Gruszka
fastpath_timer_check() -> thread_group_cputimer() -> thread_group_cputime()
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1375,13 +1375,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
}
sig = tsk->signal;
- if (!task_cputime_zero(&sig->cputime_expires)) {
- struct task_cputime group_sample;
-
- thread_group_cputimer(tsk, &group_sample);
- if (task_cputime_expired(&group_sample, &sig->cputime_expires))
- return 1;
- }
+ if (!task_cputime_zero(&sig->cputime_expires))
+ return 1;
return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
}
Or stay with task_cputime_expired() but only if cputimer is currently running.
Oh. I forgot this code completely, can't comment.
Can't we ensure that fastpath_timer_check() never does while_each_thread()?
Removing the possibility of calling while_each_thread() from fastpath_timer_check()
was exactly my intention here; perhaps I was not clear.
Post by Oleg Nesterov
IOW, if sig->cputime_expires != 0 then ->running must be true.
At least, shouldn't stop_process_timers() clear sig->cputime_expires?
I'm going to think about that. However, as far as I can see, checking ->running
explicitly and going to the slow path when it is not true is the safer solution.

Stanislaw
Oleg Nesterov
2009-11-10 19:23:27 UTC
Post by Stanislaw Gruszka
Post by Oleg Nesterov
Post by Stanislaw Gruszka
Or stay with task_cputime_expired() but only if cputimer is currently running.
Oh. I forgot this code completely, can't comment.
Can't we ensure that fastpath_timer_check() never does while_each_thread()?
Removing the possibility of calling while_each_thread() from fastpath_timer_check()
was exactly my intention here; perhaps I was not clear.
Yes, yes, I understand.

I meant, perhaps we can ensure this shouldn't happen "by design", instead
of checking ->running in fastpath_timer_check().
Post by Stanislaw Gruszka
Post by Oleg Nesterov
IOW, if sig->cputime_expires != 0 then ->running must be true.
At least, shouldn't stop_process_timers() clear sig->cputime_expires?
I'm going to think about that. However, as far as I can see, checking ->running
explicitly and going to the slow path when it is not true is the safer solution.
Yes, agreed.


Still. check_process_timers() updates sig->cputime_expires at the end,
but it never clears it. For example,

	if (sched_expires != 0 &&
	    (sig->cputime_expires.sched_exp == 0 ||
	     sig->cputime_expires.sched_exp > sched_expires))
		sig->cputime_expires.sched_exp = sched_expires;

Why?

Now suppose that (say) sig->cputime_expires.sched_exp != 0, there are
no cpu timers, ->running == F.

In this case fastpath_timer_check() always returns T and triggers the
slow path which does nothing, not good.

But yes, I agree, probably we should start with the simple/safe change
as you suggested.

Oleg.
Stanislaw Gruszka
2009-11-17 12:48:52 UTC
Post by Oleg Nesterov
Post by Stanislaw Gruszka
Post by Oleg Nesterov
Can't we ensure that fastpath_timer_check() never does while_each_thread()?
Removing the possibility of calling while_each_thread() from fastpath_timer_check()
was exactly my intention here; perhaps I was not clear.
Yes, yes, I understand.
I meant, perhaps we can ensure this shouldn't happen "by design", instead
of checking ->running in fastpath_timer_check().
Rule "sig->cputimer_expire != zero implies sig->cputimer.running == true" is
_almost_ assured (after fix in next mail). IMHO there is only one problem with
that functions: posix_cpu_timer_set() and posix_cpu_timer_schedule().

These functions first call thread_group_cputimer() without
tsk->sighand->siglock (only tasklist_lock is taken) and then
arm_timer(), which setups list and cputime_expires cache.

When some timer is already expiring, we can have a situation like the one below:

cpu_timer_sample_group()
check_process_timers()
stop_process_timers()
arm_timer()

In the end we are left with cputime_expires != zero and ->running == false.
A very unlikely situation indeed, but possible. To address this
we can do something like the following in arm_timer():

	if (unlikely(!sig->cputimer.running)) {
		cpu_timer_sample_group();
		bump_cpu_timer();
	}

Once we have this, we can do the optimization you proposed here:

http://lkml.org/lkml/2009/3/23/381

i.e. use cputimer->running in fastpath_timer_check(). I'm going to work on
it, as well as on some other optimizations in posix-cpu-timers.c.
Post by Oleg Nesterov
Still. check_process_timers() updates sig->cputime_expires at the end,
but it never clears it. For example,
if (sched_expires != 0 &&
(sig->cputime_expires.sched_exp == 0 ||
sig->cputime_expires.sched_exp > sched_expires))
sig->cputime_expires.sched_exp = sched_expires;
Why?
Now suppose that (say) sig->cputime_expires.sched_exp != 0, there are
no cpu timers, ->running == F.
In this case fastpath_timer_check() always returns T and triggers the
slow path which does nothing, not good.
This is a real bug. I will fix it in the next patch.

Stanislaw
Stanislaw Gruszka
2009-11-17 12:57:03 UTC
When a process deletes a cpu timer or a timer expires, we do not clear
the expiration cache sig->cputime_expires. This creates a situation where,
in fastpath_timer_check(), we turn the cputimer _on_ (together with a loop
over all threads) just to turn it off again in check_process_timers(). We
unnecessarily go to the slowpath and unnecessarily loop over all threads
in the fastpath, and this is repeated on every tick.

To fix this, we zero sig->cputime_expires in stop_process_timers().

Signed-off-by: Stanislaw Gruszka <***@redhat.com>
---
kernel/posix-cpu-timers.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..ac54986 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1061,9 +1061,9 @@ static void check_thread_timers(struct task_struct *tsk,
 	}
 }

-static void stop_process_timers(struct task_struct *tsk)
+static void stop_process_timers(struct signal_struct *sig)
 {
-	struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;
+	struct thread_group_cputimer *cputimer = &sig->cputimer;
 	unsigned long flags;

 	if (!cputimer->running)
@@ -1072,6 +1072,10 @@ static void stop_process_timers(struct task_struct *tsk)
 	spin_lock_irqsave(&cputimer->lock, flags);
 	cputimer->running = 0;
 	spin_unlock_irqrestore(&cputimer->lock, flags);
+
+	sig->cputime_expires.prof_exp = cputime_zero;
+	sig->cputime_expires.virt_exp = cputime_zero;
+	sig->cputime_expires.sched_exp = 0;
 }

 static u32 onecputick;
@@ -1132,7 +1136,7 @@ static void check_process_timers(struct task_struct *tsk,
 	    list_empty(&timers[CPUCLOCK_VIRT]) &&
 	    cputime_eq(sig->it[CPUCLOCK_VIRT].expires, cputime_zero) &&
 	    list_empty(&timers[CPUCLOCK_SCHED])) {
-		stop_process_timers(tsk);
+		stop_process_timers(sig);
 		return;
 	}
--
1.6.2.5
Hidetoshi Seto
2009-11-10 05:42:20 UTC
Post by Peter Zijlstra
Post by Hidetoshi Seto
thread_group_cputime() vs exit
(snip)
Post by Peter Zijlstra
I just checked .22 and there we seem to hold p->sighand->siglock over
the full task iteration. So we might as well revert back to that if
people really mind counting things twice :-)
FWIW getrusage() also takes siglock over the task iteration.
Alternatively, we could try reading the sig->[us]time before doing the
loop, but I guess that's still racy in that we can then miss someone
altogether.
Right. So "before or after" will not be any help.

As you pointed there are some other functions taking siglock over the
task iteration, so making do_sys_times() do same will be acceptable.
In other words using many threads have risks, which might be solved in
future.

I agree that the following patch is the right fix for this.

[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145
Post by Peter Zijlstra
Post by Hidetoshi Seto
use of task_s/utime()
(snip)
Post by Peter Zijlstra
Post by Hidetoshi Seto
[kernel/exit.c]
+ sig->utime = cputime_add(sig->utime, task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, task_stime(tsk));
While thread_group_cputime() accumulates raw s/utime in its do-while loop,
the signal struct accumulates the adjusted s/utime of exited threads.
I'm not sure how this adjustment works, but applying the following patch
times decreased : (436 741) (436 741) (437 744) (436 742) (436 742) (436 742)
times decreased : (454 792) (454 792) (455 794) (454 792) (454 792) (454 792)
times decreased : (503 941) (503 941) (504 943) (503 941) (503 941) (503 941)
But the decreasing (or increasing) still continues, because problem [1]
is there at least.
I think I cannot take this problem any further... Can anybody help?
Stick in a few trace_printk()s and see what happens?
Nice idea.
I tried it and show the result later in this mail.
Post by Peter Zijlstra
Post by Hidetoshi Seto
Subject: [PATCH] thread_group_cputime() should use task_s/utime()
The signal struct accumulates adjusted cputime of exited threads,
so thread_group_cputime() should use task_s/utime() instead of raw
task->s/utime, to accumulate adjusted cputime of live threads.
---
kernel/posix-cpu-timers.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
So what you're trying to say is that because __exit_signal() uses
task_[usg]time() to accumulate sig->[usg]time, we should use it too in
the loop over the live threads?
Right. Thank you for trying to understand.
Post by Peter Zijlstra
I'm thinking it's the task_[usg]time() usage in __exit_signal() that's
the issue.
It likely means reverting:

commit 49048622eae698e5c4ae61f7e71200f265ccc529
Author: Balbir Singh <***@linux.vnet.ibm.com>
Date: Fri Sep 5 18:12:23 2008 +0200
sched: fix process time monotonicity

I'm not sure why the task_[usg]time() functions were introduced, but
removing them would be one possible solution.
Post by Peter Zijlstra
I tried running the modified test.c on a current -tip kernel but could
not observe the problem (dual-core opteron).
It might not happen if your box is quite fast (or otherwise too slow).
Changing the loop count in test.c (i.e. the cycles spent in user-land)
might help the reproducibility...



Here is the result of an additional test:

I put a trace_printk() in __exit_signal(), to print tsk->s/utime,
task_s/utime() and tsk->se.sum_exec_runtime.
(And tsk->prev_s/utime before calling task_s/utime().)

<...>-2857 [006] 112.731732: release_task: (37 22)to(40 20), sxr 57480477, (0 0)
<...>-5077 [009] 526.272338: release_task: (0 27)to(10 20), sxr 27997019, (0 0)
<...>-4999 [009] 526.272396: release_task: (1 27)to(10 20), sxr 27967513, (0 0)
<...>-4992 [006] 526.328591: release_task: (2 34)to(10 30), sxr 35823013, (0 0)
<...>-5022 [012] 526.329183: release_task: (0 27)to(10 20), sxr 27761854, (0 0)
<...>-4996 [010] 526.329203: release_task: (3 38)to(10 30), sxr 39200207, (0 0)

... It clearly proves that there is a 3rd problem.

Problem [3]:
granularity of task_s/utime()

According to the git log, task_s/utime() were originally designed to
return clock_t, but were later changed to return cputime_t by the
following commit:

commit efe567fc8281661524ffa75477a7c4ca9b466c63
Author: Christian Borntraeger <***@de.ibm.com>
Date: Thu Aug 23 15:18:02 2007 +0200

It only changed the type of the return value, not the implementation.
As a result the granularity of task_s/utime() is still that of clock_t,
not that of cputime_t.

So using task_s/utime() in __exit_signal() makes the values accumulated in
the signal struct rounded and coarse grained.
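(A concrete illustration, assuming HZ=1000 and USER_HZ=100 so that cputime_t
counts 1ms jiffies and clock_t counts 10ms ticks: for a thread with
tsk->utime = 27, cputime_to_clock_t(27) = 2 and clock_t_to_cputime(2) = 20,
so up to 9ms is silently dropped per field; this matches the utime column of
the "(0 27)to(10 20)" line in the trace above.)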

After applying a fix (which follows this mail), the return values become
fine grained:

<...>-5438 [006] 135.212289: release_task: (0 28)to(0 26), sxr 26648558, (0 0)
<...>-5402 [015] 135.213193: release_task: (0 27)to(0 26), sxr 26725886, (0 0)
<...>-5408 [011] 135.214172: release_task: (0 28)to(0 26), sxr 26607882, (0 0)
<...>-5419 [005] 135.214410: release_task: (1 27)to(1 25), sxr 26612615, (0 0)
<...>-5350 [009] 135.214937: release_task: (0 28)to(0 27), sxr 27028388, (0 0)
<...>-5443 [008] 135.216748: release_task: (0 28)to(0 26), sxr 26372691, (0 0)

But it cannot stop the adjusted values from becoming smaller than
tsk->s/utime. So some approach to solving problem [2] is still required.


Thanks,
H.Seto
Hidetoshi Seto
2009-11-10 05:47:34 UTC
Remove the casts to clock_t in task_u/stime(), keeping the granularity of
cputime_t throughout the calculation.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
kernel/sched.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..1632e82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
 	return p->stime;
 }
 #else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
+#endif
+
 cputime_t task_utime(struct task_struct *p)
 {
-	clock_t utime = cputime_to_clock_t(p->utime),
-		total = utime + cputime_to_clock_t(p->stime);
+	cputime_t utime = p->utime, total = utime + p->stime;
 	u64 temp;

 	/*
 	 * Use CFS's precise accounting:
 	 */
-	temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+	temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);

 	if (total) {
 		temp *= utime;
 		do_div(temp, total);
 	}
-	utime = (clock_t)temp;
+	utime = (cputime_t)temp;

-	p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+	p->prev_utime = max(p->prev_utime, utime);
 	return p->prev_utime;
 }

 cputime_t task_stime(struct task_struct *p)
 {
-	clock_t stime;
+	cputime_t stime;

 	/*
 	 * Use CFS's precise accounting. (we subtract utime from
 	 * the total, to make sure the total observed by userspace
 	 * grows monotonically - apps rely on that):
 	 */
-	stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-		cputime_to_clock_t(task_utime(p));
+	stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);

 	if (stime >= 0)
-		p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+		p->prev_stime = max(p->prev_stime, stime);

 	return p->prev_stime;
 }
--
1.6.5.2
Stanislaw Gruszka
2009-11-11 12:11:51 UTC
Hi Hidetoshi
Post by Hidetoshi Seto
Remove the casts to clock_t in task_u/stime(), keeping the granularity of
cputime_t throughout the calculation.
---
kernel/sched.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..1632e82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
return p->stime;
}
#else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
+#endif
+
I would like to test it, but unfortunately with the patch the kernel does
not build on my system:

kernel/built-in.o: In function `task_utime':
/usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to `__udivdi3'
kernel/built-in.o: In function `task_stime':
/usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1

Cheers
Stanislaw
Hidetoshi Seto
2009-11-12 00:00:25 UTC
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
Remove the casts to clock_t in task_u/stime(), keeping the granularity of
cputime_t throughout the calculation.
---
kernel/sched.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..1632e82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
return p->stime;
}
#else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
+#endif
+
I would like to test it, but unfortunately with the patch the kernel does not build:
/usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to `__udivdi3'
/usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
Wow, what arch?
Could you provide your .config?


Thanks,
H.Seto
Hidetoshi Seto
2009-11-12 02:49:38 UTC
Post by Hidetoshi Seto
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
Remove the casts to clock_t in task_u/stime(), keeping the granularity of
cputime_t throughout the calculation.
---
kernel/sched.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..1632e82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
return p->stime;
}
#else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
+#endif
+
I would like to test it, but unfortunately with the patch the kernel does not build:
/usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to `__udivdi3'
/usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
Wow, what arch?
Could you provide your .config?
Oh, I found I can hit this error on x86_32 (but not on x86_64).
Maybe something about dividing a u64...

I'll try to find a fix/workaround for this, but I would appreciate it if
someone could tell me what is happening here.


Thanks,
H.Seto
Américo Wang
2009-11-12 02:55:38 UTC
On Thu, Nov 12, 2009 at 10:49 AM, Hidetoshi Seto
Post by Hidetoshi Seto
Post by Hidetoshi Seto
Remove the casts to clock_t in task_u/stime(), keeping the granularity of
cputime_t throughout the calculation.
---
kernel/sched.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..1632e82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
 	return p->stime;
 }
 #else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
+#endif
+
I would like to test it, but unfortunately with the patch the kernel does not build:
/usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to `__udivdi3'
/usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
Wow, what arch?
Could you provide your .config?
Oh, I found I can hit this error on x86_32 (but not on x86_64).
Maybe something about dividing a u64...
I'll try to find a fix/workaround for this, but I would appreciate it if
someone could tell me what is happening here.
Yeah, this is because on 32-bit gcc tries to use libgcc to do 64-bit
integer computing. You can insert inline asm to force gcc not to
optimize this code. ;)
Hidetoshi Seto
2009-11-12 04:16:33 UTC
Post by Américo Wang
Post by Hidetoshi Seto
/usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to `__udivdi3'
/usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
Oh, I found I can hit this error on x86_32 (but not on x86_64).
Maybe something about dividing a u64...
I'll try to find a fix/workaround for this, but I would appreciate it if
someone could tell me what is happening here.

Yeah, this is because on 32-bit gcc tries to use libgcc to do 64-bit
integer computing. You can insert inline asm to force gcc not to
optimize this code. ;)
Thank you for the information.

According to the following commit, which I found was made against the
"__udivdi3" error, use of div_u64() will fix this problem, right?

commit 956dba3caaf66b84fe5f6180e0e4dd03902c7980
Author: Andy Whitcroft <***@canonical.com>
Date: Wed Jul 1 15:20:59 2009 +0100

I'm now testing the modified patch...


Thanks,
H.Seto
Hidetoshi Seto
2009-11-12 04:33:45 UTC
Originally task_s/utime() were designed to return clock_t, but were later
changed to return cputime_t by the following commit:

commit efe567fc8281661524ffa75477a7c4ca9b466c63
Author: Christian Borntraeger <***@de.ibm.com>
Date: Thu Aug 23 15:18:02 2007 +0200

It only changed the type of the return value, not the implementation.
As a result the granularity of task_s/utime() is still that of
clock_t, not that of cputime_t.

So using task_s/utime() in __exit_signal() makes the values accumulated
in the signal struct rounded and coarse grained.

This patch removes the casts to clock_t in task_u/stime(), to keep the
granularity of cputime_t over the calculation.

v2:
Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
on some 32bit systems.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
kernel/sched.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3c11ae0..1f8d028 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5172,41 +5172,45 @@ cputime_t task_stime(struct task_struct *p)
 	return p->stime;
 }
 #else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) \
+	msecs_to_cputime(div_u64((__nsecs), NSEC_PER_MSEC))
+#endif
+
 cputime_t task_utime(struct task_struct *p)
 {
-	clock_t utime = cputime_to_clock_t(p->utime),
-		total = utime + cputime_to_clock_t(p->stime);
+	cputime_t utime = p->utime, total = utime + p->stime;
 	u64 temp;

 	/*
 	 * Use CFS's precise accounting:
 	 */
-	temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+	temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);

 	if (total) {
 		temp *= utime;
 		do_div(temp, total);
 	}
-	utime = (clock_t)temp;
+	utime = (cputime_t)temp;

-	p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+	p->prev_utime = max(p->prev_utime, utime);
 	return p->prev_utime;
 }

 cputime_t task_stime(struct task_struct *p)
 {
-	clock_t stime;
+	cputime_t stime;

 	/*
 	 * Use CFS's precise accounting. (we subtract utime from
 	 * the total, to make sure the total observed by userspace
 	 * grows monotonically - apps rely on that):
 	 */
-	stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-		cputime_to_clock_t(task_utime(p));
+	stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);

 	if (stime >= 0)
-		p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+		p->prev_stime = max(p->prev_stime, stime);

 	return p->prev_stime;
 }
--
1.6.5.2
Peter Zijlstra
2009-11-12 14:15:58 UTC
Post by Hidetoshi Seto
Originally task_s/utime() were designed to return clock_t, but were later
changed to return cputime_t by the following commit:
commit efe567fc8281661524ffa75477a7c4ca9b466c63
Date: Thu Aug 23 15:18:02 2007 +0200
It only changed the type of the return value, not the implementation.
As a result the granularity of task_s/utime() is still that of
clock_t, not that of cputime_t.
So using task_s/utime() in __exit_signal() makes the values accumulated
in the signal struct rounded and coarse grained.
This patch removes the casts to clock_t in task_u/stime(), to keep the
granularity of cputime_t over the calculation.
Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
on some 32bit systems.
/me hates all the cputime_t and clock_t mess.. but I guess the patch is
good.

Thanks.
Stanislaw Gruszka
2009-11-12 14:49:20 UTC
Hi
Post by Hidetoshi Seto
Originally task_s/utime() were designed to return clock_t, but were later
changed to return cputime_t by the following commit:
commit efe567fc8281661524ffa75477a7c4ca9b466c63
Date: Thu Aug 23 15:18:02 2007 +0200
It only changed the type of the return value, not the implementation.
As a result the granularity of task_s/utime() is still that of
clock_t, not that of cputime_t.
So using task_s/utime() in __exit_signal() makes the values accumulated
in the signal struct rounded and coarse grained.
This patch removes the casts to clock_t in task_u/stime(), to keep the
granularity of cputime_t over the calculation.
Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
on some 32bit systems.
---
kernel/sched.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)
The patch does not fix the issue on my system. I tested it alone, together with

posix-cpu-timers: avoid do_sys_times() races with __exit_signal()

and (further) together with

--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)

 	t = tsk;
 	do {
-		times->utime = cputime_add(times->utime, t->utime);
-		times->stime = cputime_add(times->stime, t->stime);
+		times->utime = cputime_add(times->utime, task_utime(t));
+		times->stime = cputime_add(times->stime, task_stime(t));
 		times->sum_exec_runtime += t->se.sum_exec_runtime;

 		t = next_thread(t);

The only thing that changed was the probability of hitting the issue. I
cannot reproduce the bug with the patch below:

diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..b85e384 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -110,8 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
 	 * We won't ever get here for the group leader, since it
 	 * will have been the last reference on the signal_struct.
 	 */
-	sig->utime = cputime_add(sig->utime, task_utime(tsk));
-	sig->stime = cputime_add(sig->stime, task_stime(tsk));
+	sig->utime = cputime_add(sig->utime, tsk->utime); //task_utime(tsk));
+	sig->stime = cputime_add(sig->stime, tsk->stime); //task_stime(tsk));
 	sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
 	sig->min_flt += tsk->min_flt;
 	sig->maj_flt += tsk->maj_flt;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
 	struct task_cputime cputime;
 	cputime_t cutime, cstime;

-	thread_group_cputime(current, &cputime);
 	spin_lock_irq(&current->sighand->siglock);
+	thread_group_cputime(current, &cputime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
 	spin_unlock_irq(&current->sighand->siglock);

Perhaps we can remove task_{u,s}time() in some places, or maybe entirely?

Stanislaw
Peter Zijlstra
2009-11-12 15:00:38 UTC
Post by Stanislaw Gruszka
I cannot reproduce
diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..b85e384 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -110,8 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, task_utime(tsk));
- sig->stime = cputime_add(sig->stime, task_stime(tsk));
+ sig->utime = cputime_add(sig->utime, tsk->utime); //task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, tsk->stime); //task_stime(tsk));
sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
Yes, this is the thing I suggested and makes sense.
Post by Stanislaw Gruszka
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;
- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
Perhaps we can remove task_{u,s}time() in some places, or maybe entirely?
I think task_[us]time() still has value for getrusage() since it avoids
the task hiding from top by never running during the tick crap.
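(Per the task_utime() code quoted earlier: the adjusted value is essentially
nsecs_to_cputime(p->se.sum_exec_runtime) * utime / (utime + stime), i.e. the
precise CFS runtime split in the tick-sampled utime:stime ratio, so a task
that always sleeps across the tick still shows its real CPU usage.)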
Stanislaw Gruszka
2009-11-12 15:40:50 UTC
Post by Peter Zijlstra
Post by Stanislaw Gruszka
I can not reproduce
diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..b85e384 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -110,8 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, task_utime(tsk));
- sig->stime = cputime_add(sig->stime, task_stime(tsk));
+ sig->utime = cputime_add(sig->utime, tsk->utime); //task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, tsk->stime); //task_stime(tsk));
sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
Yes, this is the thing I suggested and makes sense.
Ok, I'm going to post this.

Spencer,

it seems you have more test cases for utime decreasing issues;
could you send me links? Somehow I could not find them
on my own. In particular, the test case used in developing this
commit is of interest:

commit 49048622eae698e5c4ae61f7e71200f265ccc529
Author: Balbir Singh <***@linux.vnet.ibm.com>
Date: Fri Sep 5 18:12:23 2008 +0200
sched: fix process time monotonicity

Stanislaw
Stanislaw Gruszka
2009-11-13 12:42:37 UTC
When we have lots of exiting threads, two consecutive calls to sys_times()
can show utime/stime values decreasing. This can be shown by the program
provided in this thread:

http://lkml.org/lkml/2009/11/3/522

We have two bugs related to this problem; both need to be fixed to make
the issue go away.

Problem 1) Races between thread_group_cputime() and __exit_signal()

When a thread exits in the middle of the thread_group_cputime() loop, its
{u,s}time values will be accounted twice: once in the all-threads loop, and
a second time in __exit_signal(). This makes sys_times() return values
bigger than they really are. The next consecutive call to sys_times()
returns correct values, so we see the {u,s}time decrease.

To fix this, take sighand->siglock in do_sys_times().

Problem 2) Using adjusted stime/utime values in __exit_signal()

The adjusted task_{u,s}time() functions can return smaller values than the
corresponding tsk->{s,u}time. So when a thread exits, the {u,s}time values
accumulated in signal->{s,u}time can be smaller than the tsk->{u,s}time
previously accounted in the thread_group_cputime() loop.
Hence two consecutive sys_times() calls can show a decrease.

To fix this, we use the pure tsk->{u,s}time values in __exit_signal(). This
means reverting:

commit 49048622eae698e5c4ae61f7e71200f265ccc529
Author: Balbir Singh <***@linux.vnet.ibm.com>
Date: Fri Sep 5 18:12:23 2008 +0200

sched: fix process time monotonicity

which was also a fix for some utime/stime decreasing issues. However,
I _believe_ the issues that commit wanted to fix were caused
by Problem 1), and this patch does not make them happen again.

Patch was heavily inspired by Hidetoshi and Peter.

Reported-by: Spencer Candland <***@bluehost.com>
Signed-off-by: Stanislaw Gruszka <***@redhat.com>
---
kernel/exit.c | 6 +++---
kernel/sys.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..b0a28a5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -110,9 +110,9 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, task_utime(tsk));
- sig->stime = cputime_add(sig->stime, task_stime(tsk));
- sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
+ sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
sig->nvcsw += tsk->nvcsw;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
--
1.6.2.5
Peter Zijlstra
2009-11-13 13:16:59 UTC
Permalink
Post by Stanislaw Gruszka
When we have lots of exiting threads, two consecutive calls to sys_times()
can show utime/stime values decreasing. This can be demonstrated by the program
http://lkml.org/lkml/2009/11/3/522
We have two bugs related to this problem; both need to be fixed to make
the issue go away.
Problem 1) Races between thread_group_cputime() and __exit_signal()
When a process exits in the middle of the thread_group_cputime() loop,
its {u,s}time values will be accounted twice: once in the all-threads loop,
and a second time in __exit_signal(). This makes sys_times() return
values bigger than they really are. The next consecutive call to
sys_times() returns correct values, so we see {u,s}time decrease.
To fix this, take sighand->siglock in do_sys_times().
Problem 2) Using adjusted stime/utime values in __exit_signal()
The adjusted task_{u,s}time() functions can return smaller values than
the corresponding tsk->{s,u}time. So when a thread exits, its {u,s}time
values accumulated in signal->{s,u}time can be smaller than the
tsk->{u,s}time previously accounted in the thread_group_cputime() loop.
Hence two consecutive sys_times() calls can show a decrease.
To fix this, use raw tsk->{u,s}time values in __exit_signal(). This means
commit 49048622eae698e5c4ae61f7e71200f265ccc529
Date: Fri Sep 5 18:12:23 2008 +0200
sched: fix process time monotonicity
which was also a fix for some utime/stime decreasing issues. However
I _believe_ the issues that commit wanted to fix were caused
by Problem 1), and this patch does not make them happen again.
It would be very good to verify that belief and make it a certainty.

Otherwise we need to do the opposite and propagate task_[usg]time() to
all other places... :/

/me quickly stares at fs/proc/array.c:do_task_stat(), which is what top
uses to get the times..

That simply uses thread_group_cputime() properly under siglock and would
thus indeed require the use of task_[usg]time() in order to avoid the
stupid hiding 'exploit'..

Oh bugger,..

I think we do indeed need something like the below. I'm not sure all
task_[usg]time() calls are now under siglock; if not, they ought to be,
otherwise there's a race between them updating p->prev_[us]time.
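
To make that race concrete, here is a small userspace model (a hedged
sketch, not kernel code; the kernel analogue is the unlocked
read-modify-write of p->prev_[us]time in task_[us]time()). Two threads
doing the unlocked max() update let a stale maximum overwrite a newer
one, so a reader can watch the value go backwards; serializing the
update (e.g. under siglock) removes the regression:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong ticks;        /* a monotonically growing "utime" */
static unsigned long prev_utime;  /* models p->prev_utime, unlocked  */

/* Models task_utime(): prev_utime = max(prev_utime, utime), no lock.
 * The load/store pair is deliberately racy; that is the bug modeled. */
static void sample(void)
{
        unsigned long utime = atomic_fetch_add(&ticks, 1);
        unsigned long prev = prev_utime;        /* load  */
        if (utime > prev)
                prev = utime;                   /* max() */
        prev_utime = prev;      /* store may clobber a newer value */
}

static void *hammer(void *arg)
{
        for (long i = 0; i < 10000000; i++)
                sample();
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        unsigned long last = 0;

        pthread_create(&t1, NULL, hammer, NULL);
        pthread_create(&t2, NULL, hammer, NULL);
        for (long i = 0; i < 10000000; i++) {
                unsigned long now = prev_utime; /* what a reader sees */
                if (now < last)
                        printf("prev_utime went backwards: %lu -> %lu\n",
                               last, now);
                last = now;
        }
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}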


---

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..9b1d715 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -170,11 +170,11 @@ static void bump_cpu_timer(struct k_itimer *timer,

static inline cputime_t prof_ticks(struct task_struct *p)
{
- return cputime_add(p->utime, p->stime);
+ return cputime_add(task_utime(p), task_stime(p));
}
static inline cputime_t virt_ticks(struct task_struct *p)
{
- return p->utime;
+ return task_utime(p);
}

int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)

t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;

t = next_thread(t);
@@ -517,7 +517,8 @@ static void cleanup_timers(struct list_head *head,
void posix_cpu_timers_exit(struct task_struct *tsk)
{
cleanup_timers(tsk->cpu_timers,
- tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
+ task_utime(tsk), task_stime(tsk),
+ tsk->se.sum_exec_runtime);

}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -525,8 +526,8 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
struct signal_struct *const sig = tsk->signal;

cleanup_timers(tsk->signal->cpu_timers,
- cputime_add(tsk->utime, sig->utime),
- cputime_add(tsk->stime, sig->stime),
+ cputime_add(task_utime(tsk), sig->utime),
+ cputime_add(task_stime(tsk), sig->stime),
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}

@@ -1365,8 +1366,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)

if (!task_cputime_zero(&tsk->cputime_expires)) {
struct task_cputime task_sample = {
- .utime = tsk->utime,
- .stime = tsk->stime,
+ .utime = task_utime(tsk),
+ .stime = task_stime(tsk),
.sum_exec_runtime = tsk->se.sum_exec_runtime
};
Balbir Singh
2009-11-13 14:12:07 UTC
Permalink
Post by Peter Zijlstra
Post by Stanislaw Gruszka
When we have lots of exiting threads, two consecutive calls to sys_times()
can show utime/stime values decreasing. This can be demonstrated by the
program provided in this thread:
http://lkml.org/lkml/2009/11/3/522
We have two bugs related to this problem; both need to be fixed to make
the issue go away.
[snip]
Post by Peter Zijlstra
It would be very good to verify that belief and make it a certainty.
Otherwise we need to do the opposite and propagate task_[usg]time() to
all other places... :/
/me quickly stares at fs/proc/array.c:do_task_stat(), which is what top
uses to get the times..
That simply uses thread_group_cputime() properly under siglock and would
thus indeed require the use of task_[usg]time() in order to avoid the
stupid hiding 'exploit'..
Oh bugger,..
I think we do indeed need something like the below. I'm not sure all
task_[usg]time() calls are now under siglock; if not, they ought to be,
otherwise there's a race between them updating p->prev_[us]time.
[snip patch, quoted in full above]
The patch looks correct on first inspection. My fault for missing these
call sites, thanks for catching them Peter. I wonder if we should
change utime and stime to __utime and __stime and force everyone to
use the accessor functions.

Acked-by: Balbir Singh <***@linux.vnet.ibm.com>

Balbir
Stanislaw Gruszka
2009-11-13 15:36:54 UTC
Permalink
Post by Peter Zijlstra
Post by Stanislaw Gruszka
To fix this, use raw tsk->{u,s}time values in __exit_signal(). This means
commit 49048622eae698e5c4ae61f7e71200f265ccc529
Date: Fri Sep 5 18:12:23 2008 +0200
sched: fix process time monotonicity
which was also a fix for some utime/stime decreasing issues. However
I _believe_ the issues that commit wanted to fix were caused
by Problem 1), and this patch does not make them happen again.
It would be very good to verify that belief and make it a certainty.
Balbir, is there some chance to avoid task_[usg]time() usage here? Could
you be so kind as to point me to the reproducer program/script you used
when working on the "sched: fix process time monotonicity" commit?
Post by Peter Zijlstra
Otherwise we need to do the opposite and propagate task_[usg]time() to
all other places... :/
/me quickly stares at fs/proc/array.c:do_task_stat(), which is what top
uses to get the times..
That simply uses thread_group_cputime() properly under siglock and would
thus indeed require the use of task_[usg]time() in order to avoid the
stupid hiding 'exploit'..
Oh bugger,..
I think we do indeed need something like the below, not sure if all
task_[usg]time() calls are now under siglock, if not they ought to be,
otherwise there's a race with them updating p->prev_[us]time.
---
---diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..9b1d715 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -170,11 +170,11 @@ static void bump_cpu_timer(struct k_itimer *timer,
static inline cputime_t prof_ticks(struct task_struct *p)
{
- return cputime_add(p->utime, p->stime);
+ return cputime_add(task_utime(p), task_stime(p));
}
static inline cputime_t virt_ticks(struct task_struct *p)
{
- return p->utime;
+ return task_utime(p);
}
int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec
*tp)
Something wrong with formatting.
Post by Peter Zijlstra
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk,
struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
[snip]

Confirmed: the patch fixes the problem, using the reproducer from this thread.

But I don't like it much. It is sad we cannot make the transition in the
opposite direction and remove task_{u,s}time().

A few months ago I was thinking about removing cputime_t and using
long long instead; now I see many more reasons for doing this, but still
lack the skills/time for that - oh dear.

Stanislaw
Peter Zijlstra
2009-11-13 17:05:03 UTC
Permalink
Post by Stanislaw Gruszka
But I don't like it much. It is sad we cannot make the transition in the
opposite direction and remove task_{u,s}time().
We need task_[us]time() for things like process monitors such as top, to
avoid tasks hiding from them.

Since the regular [us]time are only incremented on the tick, if a task never
runs on the tick, it'll never accumulate any ticks and hence it'll be
reported as not using cpu time.

task_[us]time() solves this by using the ns-accurate sum_exec_runtime and
scaling it according to the [us]time tick ratio -- the only trouble is
keeping it monotonic.

CONFIG_VIRT_CPU_ACCOUNTING avoids this whole issue by tracking time on
each user/kernel transition, but at the obvious cost of having to track
time at every such transition.

Now since we clearly don't want to incur that overhead on any regular
setup (kernel entry/exit paths are extremely hot), we're (afais) stuck
using the current scheme of dividing the sum_exec_runtime in two
components system/user according to the tick ratio.

Now if we can reduce the task_[us]time muckery to the bare minimum that
would be good, but the way it looks now all this accounting seems too
interwoven to pull that off.
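
To illustrate the scheme, here is a minimal userspace sketch of the
tick-ratio split (assuming HZ=1000 and simplified types; the helper
names are illustrative, not the kernel implementation):

#include <stdint.h>
#include <stdio.h>

/* 1 tick = 1ms at HZ=1000. */
static uint64_t nsecs_to_ticks(uint64_t ns) { return ns / 1000000; }

/* Divide the ns-accurate runtime into user/system parts in
 * proportion to the sampled ticks. */
static void split_runtime(uint64_t utime_ticks, uint64_t stime_ticks,
                          uint64_t sum_exec_runtime_ns,
                          uint64_t *ut, uint64_t *st)
{
        uint64_t total = utime_ticks + stime_ticks;
        uint64_t rtime = nsecs_to_ticks(sum_exec_runtime_ns);

        *ut = total ? rtime * utime_ticks / total : rtime;
        *st = rtime - *ut;
}

int main(void)
{
        uint64_t ut, st;

        /* 50 user + 50 system ticks sampled, but 1s of real runtime:
         * the ticks alone would claim only 100ms of cpu use. */
        split_runtime(50, 50, 1000000000ULL, &ut, &st);
        printf("utime=%llums stime=%llums\n",
               (unsigned long long)ut, (unsigned long long)st);
        return 0;
}

This prints utime=500ms stime=500ms: the precise 1s of runtime is split
50/50 following the tick ratio, instead of the 100ms that the ticks
alone would suggest.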
Stanislaw Gruszka
2009-11-17 13:08:52 UTC
Permalink
Post by Stanislaw Gruszka
it seems you have more test cases for utime decreasing issues;
could you send links to me? Somehow I could not find them
on my own. Particularly the test case used in the development of this
commit is interesting:
commit 49048622eae698e5c4ae61f7e71200f265ccc529
Date: Fri Sep 5 18:12:23 2008 +0200
sched: fix process time monotonicity
I had originally noticed that in a production web server, so my test
case was designed to mirror what I was seeing there, which was just
running apache with worker mpm, and running a simple apache bench while
watching the utime/stime of the apache children. Unfortunately that
method was not terribly reliable at reproducing the issue, which is why
I felt it necessary to try to come up with a better test case this time
around.
No wonder I could not find anything on google and in mailing list
archives :)

It seems the issue reported then was exactly the same as the one you
report now. Looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529 just
made the probability of the bug smaller and you did not notice it until now.

Could you please test this patch, to see if it solves all the utime
decrease problems for you:

http://patchwork.kernel.org/patch/59795/

If you confirm it works, I think we should apply it. Otherwise
we need to propagate task_{u,s}time() everywhere, which is not
(my) preferred solution.

Stanislaw
Peter Zijlstra
2009-11-17 13:24:48 UTC
Permalink
Post by Stanislaw Gruszka
Post by Stanislaw Gruszka
seems you have more test cases for utime decreasing issues,
could you send links to me ? Somehow I could not find them
by my own. Particularly test case used in development this commit
commit 49048622eae698e5c4ae61f7e71200f265ccc529
Date: Fri Sep 5 18:12:23 2008 +0200
sched: fix process time monotonicity
I had originally noticed that in a production web server, so my test
case was designed to mirror what I was seeing there, which was just
running apache with worker mpm, and running a simple apache bench while
watching the utime/stime of the apache children. Unfortunately that
method was not terribly reliable at reproducing the issue, which is why
I felt it necessary to try to come up with a better test case this time
around.
No wonder I could not find anything on google and in mailing list
archives :)
It seems the issue reported then was exactly the same as the one you
report now. Looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529 just
made the probability of the bug smaller and you did not notice it until now.
Could you please test this patch, to see if it solves all the utime decrease
http://patchwork.kernel.org/patch/59795/
If you confirm it works, I think we should apply it. Otherwise
we need to propagate task_{u,s}time() everywhere, which is not
(my) preferred solution.
That patch will create another issue: it will allow a process to hide
from top by arranging to never run when the tick hits.
Stanislaw Gruszka
2009-11-19 18:17:45 UTC
Permalink
Post by Peter Zijlstra
Post by Stanislaw Gruszka
It seems the issue reported then was exactly the same as the one you
report now. Looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529 just
made the probability of the bug smaller and you did not notice it until now.
Could you please test this patch, to see if it solves all the utime decrease
http://patchwork.kernel.org/patch/59795/
If you confirm it works, I think we should apply it. Otherwise
we need to propagate task_{u,s}time() everywhere, which is not
(my) preferred solution.
That patch will create another issue: it will allow a process to hide
from top by arranging to never run when the tick hits.
What about that?

diff --git a/kernel/sched.c b/kernel/sched.c
index 1f8d028..9db1cbc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
}
utime = (cputime_t)temp;

- p->prev_utime = max(p->prev_utime, utime);
+ p->prev_utime = max(p->prev_utime, max(p->utime, utime));
return p->prev_utime;
}

diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);

It's on top of Hidetoshi's patch and fixes the utime decrease problem
on my system.

Are we not doing something nasty here?

cputime_t utime = p->utime, total = utime + p->stime;
u64 temp;

/*
* Use CFS's precise accounting:
*/
temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);

if (total) {
temp *= utime;
do_div(temp, total);
}
utime = (cputime_t)temp;
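
For what it's worth, one candidate for nastiness here (speculation on my
side, orthogonal to the adjustment logic itself) is overflow: the
temp *= utime step can wrap a u64 once a long-running task has
accumulated on the order of 2^32 ticks. A tiny sketch with illustrative
numbers, assuming HZ=1000 and a 64-bit cputime_t:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* ~51 days of CPU at HZ=1000: both factors are near 2^32. */
        uint64_t rtime = 4400000000ULL; /* ticks from sum_exec_runtime */
        uint64_t utime = 4300000000ULL; /* sampled user ticks          */
        uint64_t total = 4400000000ULL; /* utime + stime               */

        uint64_t temp = rtime * utime;  /* wraps: needs more than 64 bits */

        printf("scaled utime = %llu (expected about %llu)\n",
               (unsigned long long)(temp / total),
               (unsigned long long)utime);
        return 0;
}

Here the scaled utime comes out around 107 million ticks instead of the
expected ~4.3 billion, i.e. the adjusted value collapses once the
multiplication wraps.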

Stanislaw
Hidetoshi Seto
2009-11-20 02:00:21 UTC
Permalink
Post by Stanislaw Gruszka
Post by Peter Zijlstra
Post by Stanislaw Gruszka
It seems the issue reported then was exactly the same as the one you
report now. Looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529 just
made the probability of the bug smaller and you did not notice it until now.
Could you please test this patch, to see if it solves all the utime decrease
http://patchwork.kernel.org/patch/59795/
If you confirm it works, I think we should apply it. Otherwise
we need to propagate task_{u,s}time() everywhere, which is not
(my) preferred solution.
That patch will create another issue: it will allow a process to hide
from top by arranging to never run when the tick hits.
Yes, nowadays there are many threads on high-speed hardware;
such processes can exist all around, more easily than before.

E.g. assume that there are 2 tasks:

Task A: interrupted by the timer a few times
(utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
=> total runtime is 1 sec, but utime + stime is 100 ms

Task B: interrupted by the timer many times
(utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
=> total runtime is 10 ms, but utime + stime is 100 ms

You can see task_[su]time() works well for these tasks.
Post by Stanislaw Gruszka
What about that?
diff --git a/kernel/sched.c b/kernel/sched.c
index 1f8d028..9db1cbc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
}
utime = (cputime_t)temp;
- p->prev_utime = max(p->prev_utime, utime);
+ p->prev_utime = max(p->prev_utime, max(p->utime, utime));
return p->prev_utime;
}
I think this makes things worse.

without this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 5 ms (= accurate)
with this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 50 ms (= not accurate)

Note that task_stime() calculates prev_stime using this prev_utime:

without this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 5 ms (= not accurate)
with this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 0 ms (= not accurate)
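
A quick numeric check of that comparison (a sketch reusing the
tick-ratio split from earlier in the thread, with the same hypothetical
HZ=1000 Task A/B numbers and prev_utime assumed to start at zero):

#include <stdint.h>
#include <stdio.h>

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Tick-ratio adjustment; units are ms at HZ=1000. */
static uint64_t adjusted(uint64_t ut, uint64_t st, uint64_t rt)
{
        uint64_t total = ut + st;
        return total ? rt * ut / total : rt;
}

int main(void)
{
        /* Task A: (50 ut, 50 st) ticks, 1000ms runtime.
         * Task B: (50 ut, 50 st) ticks,   10ms runtime. */
        struct { uint64_t ut, st, rt; } t[] = {
                { 50, 50, 1000 },
                { 50, 50, 10 },
        };

        for (int i = 0; i < 2; i++) {
                uint64_t adj = adjusted(t[i].ut, t[i].st, t[i].rt);
                /* current rule: clamp against previous adjusted value */
                uint64_t current = max_u64(0, adj);
                /* proposed rule: additionally clamp against raw ticks */
                uint64_t proposed = max_u64(current, t[i].ut);

                printf("task %c: current=%llums, proposed=%llums\n",
                       'A' + i, (unsigned long long)current,
                       (unsigned long long)proposed);
        }
        return 0;
}

This prints 500ms under both rules for task A, but for task B the extra
clamp against raw ticks turns the accurate 5ms into 50ms.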
Post by Stanislaw Gruszka
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;
- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
It's on top of Hidetoshi's patch and fixes the utime decrease problem
on my system.
How about the stime decrease problem, which can be caused by the same
logic?

According to my labeling, there are 2 unresolved problems: [1]
"thread_group_cputime() vs exit" and [2] "use of task_s/utime()".

Still I believe the real fix for this problem is a combination of the
above fix for do_sys_times() (for problem [1]) and (I know it is
Post by Peter Zijlstra
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
Think about this diff, assuming task C is in the same group as tasks A and B:

sys_times() on C while A and B are living returns:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (50,50) + (50,50) +...) + in_signal(exited)
If A exits, the sum increases:
(utime, stime)
= task_[su]time(C) + ([su]time(B)+...) + in_signal(exited)+task_[su]time(A)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(500,500)
If instead B exits, it decreases:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+...) + in_signal(exited)+task_[su]time(B)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(5,5)

With this fix, sys_times() returns:
(utime, stime)
= task_[su]time(C) + (task_[su]time(A)+task_[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (500,500) + (5,5) +...) + in_signal(exited)
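
The decrease in the B case can be reproduced with a few lines of
arithmetic (a sketch using the same hypothetical A/B numbers in ms at
HZ=1000, ignoring task C's own time and the already-exited sum):

#include <stdint.h>
#include <stdio.h>

static uint64_t adjusted(uint64_t ut, uint64_t st, uint64_t rt)
{
        uint64_t total = ut + st;
        return total ? rt * ut / total : rt;
}

int main(void)
{
        uint64_t raw_a = 50, raw_b = 50;          /* tick counts    */
        uint64_t adj_b = adjusted(50, 50, 10);    /* 5ms for task B */
        uint64_t exited = 0;                      /* signal->utime  */

        /* Both threads alive: the loop sums raw ticks. */
        printf("before: %llu\n",
               (unsigned long long)(raw_a + raw_b + exited));

        /* B exits: __exit_signal() folds in its *adjusted* time. */
        exited += adj_b;
        printf("after B exits: %llu\n",
               (unsigned long long)(raw_a + exited));
        return 0;
}

This prints 100 before and 55 after, i.e. the group total drops by 45ms
when the tick-heavy thread B exits, which is exactly the mixed raw and
adjusted accounting described above.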
Post by Stanislaw Gruszka
Are we not doing something nasty here?
cputime_t utime = p->utime, total = utime + p->stime;
u64 temp;
/*
* Use CFS's precise accounting:
*/
temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
if (total) {
temp *= utime;
do_div(temp, total);
}
utime = (cputime_t)temp;
Not here, but doing do_div() for each thread could be called nasty.
I mean
__task_[su]time(sum(A, B, ...))
would be better than:
sum(task_[su]time(A)+task_[su]time(B)+...)

However it would bring another issue, because
__task_[su]time(sum(A, B, ...))
might not equal to
__task_[su]time(sum(B, ...)) + task_[su]time(A)
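
A quick numeric illustration of that inequality (hypothetical numbers,
ms at HZ=1000; the adjustment helper is the same sketch as before):

#include <stdint.h>
#include <stdio.h>

static uint64_t adj(uint64_t ut, uint64_t st, uint64_t rt)
{
        uint64_t total = ut + st;
        return total ? rt * ut / total : rt;
}

int main(void)
{
        /* Two tasks with different user/system ratios:
         * A = (90 ut, 10 st, 1000ms), B = (50 ut, 50 st, 10ms). */
        uint64_t whole = adj(90 + 50, 10 + 50, 1000 + 10);
        uint64_t split = adj(90, 10, 1000) + adj(50, 50, 10);

        printf("sum-then-adjust: %llu, adjust-then-sum: %llu\n",
               (unsigned long long)whole, (unsigned long long)split);
        return 0;
}

This prints 707 versus 905: once the ratios differ, adjusting the sum
and summing the adjusted values are genuinely different quantities.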


Thanks,
H.Seto
Stanislaw Gruszka
2009-11-23 10:09:26 UTC
Permalink
Post by Hidetoshi Seto
Post by Stanislaw Gruszka
Post by Peter Zijlstra
Post by Stanislaw Gruszka
Could you please test this patch, to see if it solves all the utime decrease
http://patchwork.kernel.org/patch/59795/
If you confirm it works, I think we should apply it. Otherwise
we need to propagate task_{u,s}time() everywhere, which is not
(my) preferred solution.
That patch will create another issue: it will allow a process to hide
from top by arranging to never run when the tick hits.
Yes, nowadays there are many threads on high-speed hardware;
such processes can exist all around, more easily than before.
Task A: interrupted by the timer a few times
(utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
=> total runtime is 1 sec, but utime + stime is 100 ms
Task B: interrupted by the timer many times
(utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
=> total runtime is 10 ms, but utime + stime is 100 ms
How probable is it that a task runs for a very long time but does not get
the ticks? I know this is possible, otherwise we would not see utime
decreasing after the do_sys_times() siglock fix, but how probable?
Post by Hidetoshi Seto
You can see task_[su]time() works well for these tasks.
Post by Stanislaw Gruszka
What about that?
diff --git a/kernel/sched.c b/kernel/sched.c
index 1f8d028..9db1cbc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
}
utime = (cputime_t)temp;
- p->prev_utime = max(p->prev_utime, utime);
+ p->prev_utime = max(p->prev_utime, max(p->utime, utime));
return p->prev_utime;
}
I think this makes things worse.
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 5 ms (= accurate)
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 50 ms (= not accurate)
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 5 ms (= not accurate)
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 0 ms (= not accurate)
Post by Stanislaw Gruszka
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;
- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
It's on top of Hidetoshi's patch and fixes the utime decrease problem
on my system.
How about the stime decrease problem, which can be caused by the same
logic?
Yes, the above patch screws up stime. The below should be a bit better,
but does not resolve the objections you have:

diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..17491ad 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,6 +91,8 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
+ cputime_t utime, stime;
+
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -110,8 +112,16 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, task_utime(tsk));
- sig->stime = cputime_add(sig->stime, task_stime(tsk));
+
+ utime = task_utime(tsk);
+ stime = task_stime(tsk);
+ if (tsk->utime > utime || tsk->stime > stime) {
+ utime = tsk->utime;
+ stime = tsk->stime;
+ }
+
+ sig->utime = cputime_add(sig->utime, utime);
+ sig->stime = cputime_add(sig->stime, stime);
sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
Post by Hidetoshi Seto
According to my labeling, there are 2 unresolved problems: [1]
"thread_group_cputime() vs exit" and [2] "use of task_s/utime()".
Still I believe the real fix for this problem is a combination of the
above fix for do_sys_times() (for problem [1]) and (I know it is
Post by Stanislaw Gruszka
Post by Peter Zijlstra
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
That works for me and I agree that this is the right fix. Peter had concerns
about p->prev_utime races and the additional need to propagate
task_{s,u}time() further into the posix-cpu-timers code. However I do not
understand these problems.

Stanislaw
Stanislaw Gruszka
2009-11-23 10:16:12 UTC
Permalink
When a process exits in the middle of the thread_group_cputime() loop,
its {u,s}time values will be accounted twice: once in the all-threads loop,
and a second time in __exit_signal(). This makes sys_times() return
values bigger than they really are. The next consecutive call to
sys_times() can return correct values, so we see {u,s}time decrease.

To fix this, take sighand->siglock in do_sys_times().

This is a partial fix for the problem of utime/stime values decreasing
described in this thread:

http://lkml.org/lkml/2009/11/3/522

To fully fix the problem, we need a second fix for the mismatch between
task_{u,s}time() and tsk->{u,s}time. It's not yet clear how this mismatch
should be fixed.

Signed-off-by: Stanislaw Gruszka <***@redhat.com>
---
kernel/sys.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
--
1.6.2.5
Hidetoshi Seto
2009-11-30 09:20:05 UTC
Permalink
prev_utime and prev_stime are used in task_times() only when
!VIRT_CPU_ACCOUNTING. Add ifndef-endif for them.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0395b0f..dff85e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,7 +1331,9 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..ad7cb6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,8 +1066,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+#endif

p->default_timer_slack_ns = current->timer_slack_ns;
--
1.6.5.3
Hidetoshi Seto
2009-11-30 09:21:36 UTC
Permalink
This is a real fix for the problem of utime/stime values decreasing
described in the thread:
http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
(and also c{u,s}time) in the signal struct are added to c{u,s}time
in the signal struct of the group's parent.

So {u,s}time in the task struct are "raw" tick counts, while {u,s}time
and c{u,s}time in the signal struct are "adjusted" values.

And accounted values are used by:

- task_times(), to get the cputime of a thread:
This function returns adjusted values that originate from the raw
{u,s}time, scaled by the sum_exec_runtime accounted by CFS.

- thread_group_cputime(), to get the cputime of a thread group:
This function returns the sum of all {u,s}time of living threads in
the group, plus the {u,s}time in the signal struct, which is the sum of
the adjusted cputimes of all exited threads that belonged to the group.

The problem is the return value of thread_group_cputime(), because
it is a mixed sum of "raw" values and "adjusted" values:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume there is a thread that has raw values greater than its
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only
runs 45ms); if it exits, cputime will decrease (e.g. -5ms).

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains costly divisions, so applying it to every
thread should be avoided.

This patch fixes the above problem in the following way:

- Modify the thread exit path (= __exit_signal()) not to use task_times().
It means {u,s}time in the signal struct accumulate raw values instead
of adjusted values. As a result, thread_group_cputime() returns a
pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts the "raw" values of thread_group_cputime() to "adjusted"
values, using the same calculation procedure as task_times().

- Modify the group exit path (= wait_task_zombie()) to use the introduced
thread_group_times(). It makes c{u,s}time in the signal struct
have adjusted values as before this patch.

- Replace some thread_group_cputime() calls by thread_group_times().
These replacements are only applied where the "adjusted" cputime is
conveyed to users, and where task_times() is already used nearby.
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
(e.g. each runs 0.9ms and is not interrupted by ticks), the group's
cputime could be invisible, since each thread's cputime was accumulated
after being adjusted: imagine the adjustment function as adj(ticks, runtime),
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch this will not happen, because the adjustment is
applied after accumulation.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
fs/proc/array.c | 5 +---
include/linux/sched.h | 3 +-
kernel/exit.c | 23 +++++++++++----------
kernel/fork.c | 2 +
kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sys.c | 18 +++++++---------
6 files changed, 75 insertions(+), 28 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ca61a88..2571da4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -506,7 +506,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -517,9 +516,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dff85e5..c5c68ed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,7 +1332,7 @@ struct task_struct {
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
- cputime_t prev_utime, prev_stime;
+ cputime_t prev_utime, prev_stime, prev_tgutime, prev_tgstime;
#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
@@ -1723,6 +1723,7 @@ static inline void put_task_struct(struct task_struct *t)
}

extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
- cputime_t utime, stime;
-
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- task_times(tsk, &utime, &stime);
- sig->utime = cputime_add(sig->utime, utime);
- sig->stime = cputime_add(sig->stime, stime);
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
+ cputime_t tgutime, tgstime;

/*
* The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* need to protect the access to parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
*/
+ thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(p->utime,
- cputime_add(sig->utime,
- sig->cutime)));
+ cputime_add(tgutime,
+ sig->cutime));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(p->stime,
- cputime_add(sig->stime,
- sig->cstime)));
+ cputime_add(tgstime,
+ sig->cstime));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..1753cac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1069,6 +1069,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+ p->prev_tgutime = cputime_zero;
+ p->prev_tgstime = cputime_zero;
#endif

p->default_timer_slack_ns = current->timer_slack_ns;
diff --git a/kernel/sched.c b/kernel/sched.c
index b3d4e2b..dabdeae 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5189,6 +5189,12 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
if (st)
*st = p->stime;
}
+
+static inline void __thread_group_times(struct task_struct *p,
+ struct task_cputime *cputime)
+{
+ thread_group_cputime(p, cputime);
+}
#else

#ifndef nsecs_to_cputime
@@ -5197,7 +5203,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)

void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- cputime_t rtime, utime = p->utime, total = utime + p->stime;
+ cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);

/*
* Use CFS's precise accounting:
@@ -5217,16 +5223,58 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
* Compare with previous values, to keep monotonicity:
*/
p->prev_utime = max(p->prev_utime, utime);
- p->prev_stime = max(p->prev_stime, rtime - p->prev_utime);
+ p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));

if (ut)
*ut = p->prev_utime;
if (st)
*st = p->prev_stime;
}
+
+static void __thread_group_times(struct task_struct *p,
+ struct task_cputime *cputime)
+{
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, cputime);
+
+ total = cputime_add(cputime->utime, cputime->stime);
+ rtime = nsecs_to_cputime(cputime->sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime->utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ p->prev_tgutime = max(p->prev_tgutime, utime);
+ p->prev_tgstime = max(p->prev_tgstime,
+ cputime_sub(rtime, p->prev_tgutime));
+
+ cputime->utime = p->prev_tgutime;
+ cputime->stime = p->prev_tgstime;
+}
#endif

/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ __thread_group_times(p, &cputime);
+
+ if (ut)
+ *ut = cputime.utime;
+ if (st)
+ *st = cputime.stime;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
diff --git a/kernel/sys.c b/kernel/sys.c
index d988abe..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
+ cputime_t tgutime, tgstime, cutime, cstime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime(current, &cputime);
+ thread_group_times(current, &tgutime, &tgstime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
+ tms->tms_utime = cputime_to_clock_t(tgutime);
+ tms->tms_stime = cputime_to_clock_t(tgstime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t utime, stime;
- struct task_cputime cputime;
+ cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;

memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ thread_group_times(p, &tgutime, &tgstime);
+ utime = cputime_add(utime, tgutime);
+ stime = cputime_add(stime, tgstime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
--
1.6.5.3
Stanislaw Gruszka
2009-11-30 14:54:43 UTC
Permalink
Post by Hidetoshi Seto
- Modify the thread exit path (= __exit_signal()) not to use task_times().
It means {u,s}time in the signal struct accumulate raw values instead
of adjusted values. As a result, thread_group_cputime() returns a
pure sum of "raw" values.
- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts the "raw" values of thread_group_cputime() to "adjusted"
values, using the same calculation procedure as task_times().
- Modify the group exit path (= wait_task_zombie()) to use the introduced
thread_group_times(). It makes c{u,s}time in the signal struct
have adjusted values as before this patch.
- Replace some thread_group_cputime() calls by thread_group_times().
These replacements are only applied where the "adjusted" cputime is
conveyed to users, and where task_times() is already used nearby.
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
- Before this patch, if a group contained many short-lived threads
(e.g. each runs 0.9ms and is not interrupted by ticks), the group's
cputime could be invisible, since each thread's cputime was accumulated
after being adjusted: imagine the adjustment function as adj(ticks, runtime),
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch this will not happen, because the adjustment is
applied after accumulation.
The idea is very good IMHO.
Post by Hidetoshi Seto
---
fs/proc/array.c | 5 +---
include/linux/sched.h | 3 +-
kernel/exit.c | 23 +++++++++++----------
kernel/fork.c | 2 +
kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sys.c | 18 +++++++---------
6 files changed, 75 insertions(+), 28 deletions(-)
[snip]
Post by Hidetoshi Seto
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
- cputime_t prev_utime, prev_stime;
+ cputime_t prev_utime, prev_stime, prev_tgutime, prev_tgstime;
#endif
I think the new values should be part of struct signal_struct (see below).
Post by Hidetoshi Seto
+static void __thread_group_times(struct task_struct *p,
+ struct task_cputime *cputime)
+{
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, cputime);
+
+ total = cputime_add(cputime->utime, cputime->stime);
+ rtime = nsecs_to_cputime(cputime->sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime->utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ p->prev_tgutime = max(p->prev_tgutime, utime);
+ p->prev_tgstime = max(p->prev_tgstime,
+ cputime_sub(rtime, p->prev_tgutime));
p->signal->prev_utime = max(p->signal->prev_utime, utime);
p->signal->prev_stime = max(p->signal->prev_stime,
cputime_sub(rtime, p->signal->prev_utime));
Post by Hidetoshi Seto
/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ __thread_group_times(p, &cputime);
+
+ if (ut)
+ *ut = cputime.utime;
+ if (st)
+ *st = cputime.stime;
Neither thread_group_times() nor task_times() is called with NULL arguments,
so we can get rid of the "if ({u,s}t)" checks. Perhaps thread_group_times()
should take a "struct task_cputime" argument, as it is a wrapper for
thread_group_cputime();

Thanks
Stanislaw
Hidetoshi Seto
2009-12-01 01:02:37 UTC
Permalink
Post by Stanislaw Gruszka
Idea is very good IMHO.
Thank you very much!
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
- cputime_t prev_utime, prev_stime;
+ cputime_t prev_utime, prev_stime, prev_tgutime, prev_tgstime;
#endif
I think the new values should be part of struct_signal (see below)
Good point. I'll update patch to do so.
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ __thread_group_times(p, &cputime);
+
+ if (ut)
+ *ut = cputime.utime;
+ if (st)
+ *st = cputime.stime;
Neither thread_group_times() nor task_times() is called with NULL arguments,
so we can get rid of the "if ({u,s}t)" checks. Perhaps thread_group_times()
should take a "struct task_cputime" argument, as it is a wrapper for
thread_group_cputime();
Removing "if ({u,s}t)" is OK with me.
I expect all thread_group_times() user should have no interest in members
of struct task_cputime other than {u,s}time, so I'd like to keep the
argument as is.


Thanks,
H.Seto
Hidetoshi Seto
2009-12-02 08:26:47 UTC
Permalink
- Remove the if({u,s}t)s because no one calls them with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
only when !VIRT_CPU_ACCOUNTING.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
kernel/sched.c | 16 ++++++----------
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d47f394..12dbb5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,7 +1331,9 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..ad7cb6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,8 +1066,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+#endif

p->default_timer_slack_ns = current->timer_slack_ns;

diff --git a/kernel/sched.c b/kernel/sched.c
index db1f6b2..3674f77 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5164,10 +5164,8 @@ void account_idle_ticks(unsigned long ticks)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING
void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- if (ut)
- *ut = p->utime;
- if (st)
- *st = p->stime;
+ *ut = p->utime;
+ *st = p->stime;
}
#else

@@ -5177,7 +5175,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)

void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- cputime_t rtime, utime = p->utime, total = utime + p->stime;
+ cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);

/*
* Use CFS's precise accounting:
@@ -5197,12 +5195,10 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
* Compare with previous values, to keep monotonicity:
*/
p->prev_utime = max(p->prev_utime, utime);
- p->prev_stime = max(p->prev_stime, rtime - p->prev_utime);
+ p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));

- if (ut)
- *ut = p->prev_utime;
- if (st)
- *st = p->prev_stime;
+ *ut = p->prev_utime;
+ *st = p->prev_stime;
}
#endif
--
1.6.5.3
Peter Zijlstra
2009-12-02 15:17:35 UTC
Permalink
Could you, in the future, please start a new thread for new patches?
Balbir Singh
2009-12-02 15:29:30 UTC
Permalink
Post by Peter Zijlstra
Could you, in the future, please start a new thread for new patches?
Yes, I almost missed it, cause I marked the thread as read. Let me
look
--
Balbir
Hidetoshi Seto
2009-12-03 00:21:25 UTC
Permalink
Post by Balbir Singh
Post by Peter Zijlstra
Could you, in the future, please start a new thread for new patches?
Yes, I almost missed it, cause I marked the thread as read. Let me
look
OK, I will.

Maybe I could post a link to the new thread here instead, to make tracking
the problem from the original thread to the patch easy.


Thanks,
H.Seto

Peter Zijlstra
2009-12-02 15:57:29 UTC
Permalink
Post by Hidetoshi Seto
- Remove the if({u,s}t)s because no one calls them with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
only when !VIRT_CPU_ACCOUNTING.
Acked-by: Peter Zijlstra <***@chello.nl>
tip-bot for Hidetoshi Seto
2009-12-02 17:33:34 UTC
Permalink
Commit-ID: d99ca3b977fc5a93141304f571475c2af9e6c1c5
Gitweb: http://git.kernel.org/tip/d99ca3b977fc5a93141304f571475c2af9e6c1c5
Author: Hidetoshi Seto <***@jp.fujitsu.com>
AuthorDate: Wed, 2 Dec 2009 17:26:47 +0900
Committer: Ingo Molnar <***@elte.hu>
CommitDate: Wed, 2 Dec 2009 17:32:39 +0100

sched, cputime: Cleanups related to task_times()

- Remove the if({u,s}t)s because no one calls them with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
only when !VIRT_CPU_ACCOUNTING.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
Cc: Peter Zijlstra <***@infradead.org>
Cc: Spencer Candland <***@bluehost.com>
Cc: Americo Wang <***@gmail.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: Balbir Singh <***@in.ibm.com>
Cc: Stanislaw Gruszka <***@redhat.com>
LKML-Reference: <***@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <***@elte.hu>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
kernel/sched.c | 16 ++++++----------
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0395b0f..dff85e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,7 +1331,9 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..ad7cb6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,8 +1066,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+#endif

p->default_timer_slack_ns = current->timer_slack_ns;

diff --git a/kernel/sched.c b/kernel/sched.c
index 4883fee..17e2c1d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5184,10 +5184,8 @@ void account_idle_ticks(unsigned long ticks)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING
void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- if (ut)
- *ut = p->utime;
- if (st)
- *st = p->stime;
+ *ut = p->utime;
+ *st = p->stime;
}
#else

@@ -5197,7 +5195,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)

void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- cputime_t rtime, utime = p->utime, total = utime + p->stime;
+ cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);

/*
* Use CFS's precise accounting:
@@ -5217,12 +5215,10 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
* Compare with previous values, to keep monotonicity:
*/
p->prev_utime = max(p->prev_utime, utime);
- p->prev_stime = max(p->prev_stime, rtime - p->prev_utime);
+ p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));

- if (ut)
- *ut = p->prev_utime;
- if (st)
- *st = p->prev_stime;
+ *ut = p->prev_utime;
+ *st = p->prev_stime;
}
#endif
Hidetoshi Seto
2009-12-02 08:28:07 UTC
Permalink
This is a real fix for the problem of utime/stime values decreasing
described in the thread:
http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
(and also c{u,s}time) in the signal struct are added to c{u,s}time
in the signal struct of the group's parent.

So {u,s}time in the task struct are "raw" tick counts, while {u,s}time
and c{u,s}time in the signal struct are "adjusted" values.

And accounted values are used by:

- task_times(), to get cputime of a thread:
This function returns adjusted values that originates from raw
{u,s}time and scaled by sum_exec_runtime that accounted by CFS.

- thread_group_cputime(), to get the cputime of a thread group:
  This function returns the sum of the {u,s}time of all living threads
  in the group, plus the {u,s}time in the signal struct, which is the
  sum of the adjusted cputimes of all exited threads that belonged to
  the group.

The problem is the return value of thread_group_cputime(), because
it is a mixed sum of "raw" and "adjusted" values:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume there is a thread whose raw values are greater than its
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but
running for only 45ms); when it exits, the group cputime will
decrease (e.g. by 5ms).
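
To put illustrative numbers on that: while such a thread is alive,
thread_group_cputime() counts its raw ticks; once it exits,
__exit_signal() (before this patch) folds in the adjusted value
instead:

    alive:  group utime includes raw t->utime  = 50 ticks = 50ms
    exited: group utime includes task_times(t) =            45ms

so an observer polling times() sees the group total drop by 5ms at
the moment of exit.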

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it to every
thread should be avoided.

This patch fixes the above problem in the following way:

- Modify the thread exit path (= __exit_signal()) not to use
  task_times(). This means the {u,s}time in the signal struct
  accumulate raw values instead of adjusted values. As a result,
  thread_group_cputime() returns a pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
  that converts the "raw" values of thread_group_cputime() to
  "adjusted" values, using the same calculation procedure as
  task_times().

- Modify the group exit path (= wait_task_zombie()) to use the newly
  introduced thread_group_times(). This makes the c{u,s}time in the
  signal struct hold adjusted values, as before this patch.

- Replace some thread_group_cputime() calls with thread_group_times().
  These replacements are only applied where the "adjusted" cputime is
  conveyed to users, and where task_times() is already used nearby
  (i.e. sys_times(), getrusage(), and /proc/<PID>/stat); a sketch of
  the call pattern follows.
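
A minimal sketch of that call pattern (hypothetical caller; per the
comment added in the patch, siglock must be held):

    cputime_t ut, st;

    spin_lock_irq(&tsk->sighand->siglock);
    thread_group_times(tsk, &ut, &st);  /* adjusted group totals */
    spin_unlock_irq(&tsk->sighand->siglock);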

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
  (e.g. each runs for 0.9ms and is never interrupted by a tick), the
  group's cputime could be invisible, since each thread's cputime was
  accumulated after being adjusted: imagine the adjustment function as
  adj(ticks, runtime), then {adj(0, 0.9) + adj(0, 0.9) + ....} =
  {0 + 0 + ....} = 0.
  After this patch this no longer happens, because the adjustment is
  applied after accumulation.

v2:
- remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
fs/proc/array.c | 5 +----
include/linux/sched.h | 4 ++++
kernel/exit.c | 23 ++++++++++++-----------
kernel/fork.c | 3 +++
kernel/sched.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 18 ++++++++----------
6 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index b37121a..9e5d173 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -495,7 +495,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -506,9 +505,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12dbb5c..7ae1274 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -628,6 +628,9 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1723,6 +1726,7 @@ static inline void put_task_struct(struct task_struct *t)
}

extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
- cputime_t utime, stime;
-
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- task_times(tsk, &utime, &stime);
- sig->utime = cputime_add(sig->utime, utime);
- sig->stime = cputime_add(sig->stime, stime);
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
+ cputime_t tgutime, tgstime;

/*
* The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* need to protect the access to parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
*/
+ thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(p->utime,
- cputime_add(sig->utime,
- sig->cutime)));
+ cputime_add(tgutime,
+ sig->cutime));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(p->stime,
- cputime_add(sig->stime,
- sig->cstime)));
+ cputime_add(tgstime,
+ sig->cstime));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..3d6f121 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ sig->prev_utime = sig->prev_stime = cputime_zero;
+#endif
sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index 3674f77..aa733d5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5167,6 +5167,16 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->utime;
*st = p->stime;
}
+
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ thread_group_cputime(p, &cputime);
+
+ *ut = cputime.utime;
+ *st = cputime.stime;
+}
#else

#ifndef nsecs_to_cputime
@@ -5200,6 +5210,37 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->prev_utime;
*st = p->prev_stime;
}
+
+/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct signal_struct *sig = p->signal;
+ struct task_cputime cputime;
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, &cputime);
+
+ total = cputime_add(cputime.utime, cputime.stime);
+ rtime = nsecs_to_cputime(cputime.sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime.utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ sig->prev_utime = max(sig->prev_utime, utime);
+ sig->prev_stime = max(sig->prev_stime,
+ cputime_sub(rtime, sig->prev_utime));
+
+ *ut = sig->prev_utime;
+ *st = sig->prev_stime;
+}
#endif

/*
diff --git a/kernel/sys.c b/kernel/sys.c
index d988abe..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
+ cputime_t tgutime, tgstime, cutime, cstime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime(current, &cputime);
+ thread_group_times(current, &tgutime, &tgstime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
+ tms->tms_utime = cputime_to_clock_t(tgutime);
+ tms->tms_stime = cputime_to_clock_t(tgstime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t utime, stime;
- struct task_cputime cputime;
+ cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;

memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ thread_group_times(p, &tgutime, &tgstime);
+ utime = cputime_add(utime, tgutime);
+ stime = cputime_add(stime, tgstime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
--
1.6.5.3
Peter Zijlstra
2009-12-02 15:58:08 UTC
Permalink
Post by Hidetoshi Seto
This is a real fix for problem of utime/stime values decreasing
http://lkml.org/lkml/2009/11/3/522
- {u,s}time in task_struct are increased every time when the thread
is interrupted by a tick (timer interrupt).
- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after adjusted by task_times().
- When all threads in a thread_group exits, accumulated {u,s}time
(and also c{u,s}time) in signal struct are added to c{u,s}time
in signal struct of the group's parent.
So {u,s}time in task struct are "raw" tick count, while {u,s}time
and c{u,s}time in signal struct are "adjusted" values.
This function returns adjusted values that originates from raw
{u,s}time and scaled by sum_exec_runtime that accounted by CFS.
This function returns sum of all {u,s}time of living threads in
the group, plus {u,s}time in the signal struct that is sum of
adjusted cputimes of all exited threads belonged to the group.
The problem is the return value of thread_group_cputime(), because
group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater than
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only
runs 45ms) and if it exits, cputime will decrease (e.g. -5ms).
group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
But task_times() contains hard divisions, so applying it for every
thread should be avoided.
- Modify thread's exit (= __exit_signal()) not to use task_times().
It means {u,s}time in signal struct accumulates raw values instead
of adjusted values. As the result it makes thread_group_cputime()
to return pure sum of "raw" values.
- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts "raw" values of thread_group_cputime() to "adjusted"
values, in same calculation procedure as task_times().
- Modify group's exit (= wait_task_zombie()) to use this introduced
thread_group_times(). It make c{u,s}time in signal struct to
have adjusted values like before this patch.
- Replace some thread_group_cputime() by thread_group_times().
This replacements are only applied where conveys the "adjusted"
cputime to users, and where already uses task_times() near by it.
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
- Before this patch, if a group contains many short-life threads
(e.g. runs 0.9ms and not interrupted by ticks), the group's
cputime could be invisible since thread's cputime was accumulated
after adjusted: imagine adjustment function as adj(ticks, runtime),
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch it will not happen because the adjustment is
applied after accumulated.
- remove if()s, put new variables into signal_struct.
Acked-by: Peter Zijlstra <***@chello.nl>


Thanks for taking care of this!
tip-bot for Hidetoshi Seto
2009-12-02 17:33:49 UTC
Permalink
Commit-ID: 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
Gitweb: http://git.kernel.org/tip/0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
Author: Hidetoshi Seto <***@jp.fujitsu.com>
AuthorDate: Wed, 2 Dec 2009 17:28:07 +0900
Committer: Ingo Molnar <***@elte.hu>
CommitDate: Wed, 2 Dec 2009 17:32:40 +0100

sched, cputime: Introduce thread_group_times()

This is a real fix for the problem of utime/stime values decreasing
described in the thread:

http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time the thread
  is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
  after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
  (and also c{u,s}time) in the signal struct are added to the
  c{u,s}time in the signal struct of the group's parent.

So the {u,s}time in the task struct are "raw" tick counts, while the
{u,s}time and c{u,s}time in the signal struct are "adjusted" values.

And accounted values are used by:

- task_times(), to get the cputime of a thread:
  This function returns adjusted values that originate from the raw
  {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

- thread_group_cputime(), to get the cputime of a thread group:
  This function returns the sum of the {u,s}time of all living threads
  in the group, plus the {u,s}time in the signal struct, which is the
  sum of the adjusted cputimes of all exited threads that belonged to
  the group.

The problem is the return value of thread_group_cputime(),
because it is a mixed sum of "raw" and "adjusted" values:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume there is a thread whose raw values are greater than its
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but
running for only 45ms); when it exits, the group cputime will
decrease (e.g. by 5ms).

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it to
every thread should be avoided.

This patch fixes the above problem in the following way:

- Modify the thread exit path (= __exit_signal()) not to use
  task_times(). This means the {u,s}time in the signal struct
  accumulate raw values instead of adjusted values. As a result,
  thread_group_cputime() returns a pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
  that converts the "raw" values of thread_group_cputime() to
  "adjusted" values, using the same calculation procedure as
  task_times().

- Modify the group exit path (= wait_task_zombie()) to use the newly
  introduced thread_group_times(). This makes the c{u,s}time in the
  signal struct hold adjusted values, as before this patch.

- Replace some thread_group_cputime() calls with thread_group_times().
  These replacements are only applied where the "adjusted" cputime is
  conveyed to users, and where task_times() is already used nearby
  (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
  (e.g. each runs for 0.9ms and is never interrupted by a tick), the
  group's cputime could be invisible, since each thread's cputime was
  accumulated after being adjusted: imagine the adjustment function as
  adj(ticks, runtime), then {adj(0, 0.9) + adj(0, 0.9) + ....} =
  {0 + 0 + ....} = 0.
  After this patch this no longer happens, because the adjustment is
  applied after accumulation.

v2:
- remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
Acked-by: Peter Zijlstra <***@infradead.org>
Cc: Spencer Candland <***@bluehost.com>
Cc: Americo Wang <***@gmail.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: Balbir Singh <***@in.ibm.com>
Cc: Stanislaw Gruszka <***@redhat.com>
LKML-Reference: <***@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <***@elte.hu>
---
fs/proc/array.c | 5 +----
include/linux/sched.h | 4 ++++
kernel/exit.c | 23 ++++++++++++-----------
kernel/fork.c | 3 +++
kernel/sched.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 18 ++++++++----------
6 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ca61a88..2571da4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -506,7 +506,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -517,9 +516,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dff85e5..34238bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,6 +624,9 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1723,6 +1726,7 @@ static inline void put_task_struct(struct task_struct *t)
}

extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
- cputime_t utime, stime;
-
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- task_times(tsk, &utime, &stime);
- sig->utime = cputime_add(sig->utime, utime);
- sig->stime = cputime_add(sig->stime, stime);
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
+ cputime_t tgutime, tgstime;

/*
* The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* need to protect the access to parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
*/
+ thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(p->utime,
- cputime_add(sig->utime,
- sig->cutime)));
+ cputime_add(tgutime,
+ sig->cutime));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(p->stime,
- cputime_add(sig->stime,
- sig->cstime)));
+ cputime_add(tgstime,
+ sig->cstime));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..3d6f121 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ sig->prev_utime = sig->prev_stime = cputime_zero;
+#endif
sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index 17e2c1d..e6ba726 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5187,6 +5187,16 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->utime;
*st = p->stime;
}
+
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ thread_group_cputime(p, &cputime);
+
+ *ut = cputime.utime;
+ *st = cputime.stime;
+}
#else

#ifndef nsecs_to_cputime
@@ -5220,6 +5230,37 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->prev_utime;
*st = p->prev_stime;
}
+
+/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct signal_struct *sig = p->signal;
+ struct task_cputime cputime;
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, &cputime);
+
+ total = cputime_add(cputime.utime, cputime.stime);
+ rtime = nsecs_to_cputime(cputime.sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime.utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ sig->prev_utime = max(sig->prev_utime, utime);
+ sig->prev_stime = max(sig->prev_stime,
+ cputime_sub(rtime, sig->prev_utime));
+
+ *ut = sig->prev_utime;
+ *st = sig->prev_stime;
+}
#endif

/*
diff --git a/kernel/sys.c b/kernel/sys.c
index bbdfce0..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
+ cputime_t tgutime, tgstime, cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_times(current, &tgutime, &tgstime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
+ tms->tms_utime = cputime_to_clock_t(tgutime);
+ tms->tms_stime = cputime_to_clock_t(tgstime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t utime, stime;
- struct task_cputime cputime;
+ cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;

memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ thread_group_times(p, &tgutime, &tgstime);
+ utime = cputime_add(utime, tgutime);
+ stime = cputime_add(stime, tgstime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
Hidetoshi Seto
2009-12-02 08:29:05 UTC
Permalink
The following is a sample program that can reproduce the problem
in several ways.

I tested my patch using this reproducer, and confirmed that
applying both Stanislaw's do_sys_times() patch and my
thread_group_times() patch is required to fix the problem.
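
For what it's worth, one way to build and run it (the file name is
arbitrary; -O0 matters, since at higher optimization levels the
compiler would remove the busy-wait loops entirely):

    $ gcc -O0 -o time_decrease time_decrease.c -lpthread
    $ ./time_decrease 500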


Thanks,
H.Seto

===
/*
* Sample program to demonstrate time decreasing on thread exit
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/times.h>

#define DEFAULT_THREADS 500

unsigned long lpt;
int looptype;
int samplesleep;
int threads;

void *child (void *thread_id)
{
    struct tms t[12];
    int i, j;
    unsigned long ret = 0, id = (unsigned long)thread_id;

    if (looptype) {
        /*
         * discont:
         *
         * Loop tricky, to make a significant gap between
         * task->{u,s}time and task_times().
         *
         * runtime of a thread = 0.5 tick * 1000 = 500 ms
         * task->{u,s}time = (??? , 0)
         * task_times() = ideally (500, 0)
         */
        for (j = 0; j < 1000; j++) {
            for (i = 0; i < lpt/2; i++)
                ;
            usleep(0);
        }
    } else {
        /*
         * cont:
         *
         * Loop without tricks.
         *
         * runtime of a thread = 500 ms
         * task->{u,s}time = (500, 0)
         * task_times() = (500, 0)
         */
        for (i = 0; i < lpt * 500; i++)
            ;
    }

    if (!(id % 4))
        /* bother other threads */
        pthread_exit((void *)ret);

    for (i = 0; i < 12; i++) {
        times(&t[i]);
        if (samplesleep)
            usleep(0);
    }

    for (i = 0; i + 5 < 12; i++) {
        /*
         * +----+----+----+----+----+
         * i+0  i+1  i+2  i+3  i+4  i+5
         *           ^^^^^^
         *           check here
         */
        if (t[i+2].tms_utime > t[i+3].tms_utime
            || t[i+2].tms_stime > t[i+3].tms_stime) {

            /* tms_{u,s}time is clock_t, hence %ld */
            printf("[%4lu] %s decreased %3ld: "
                   "(%ld %ld) (%ld %ld) [%ld %ld]<->[%ld %ld] "
                   "(%ld %ld) (%ld %ld)\n",
                   id,
                   t[i+2].tms_utime > t[i+3].tms_utime ?
                       "utime" : "stime",
                   t[i+2].tms_utime > t[i+3].tms_utime ?
                       t[i+3].tms_utime - t[i+2].tms_utime :
                       t[i+3].tms_stime - t[i+2].tms_stime,
                   t[i+0].tms_utime, t[i+0].tms_stime,
                   t[i+1].tms_utime, t[i+1].tms_stime,
                   t[i+2].tms_utime, t[i+2].tms_stime,
                   t[i+3].tms_utime, t[i+3].tms_stime,
                   t[i+4].tms_utime, t[i+4].tms_stime,
                   t[i+5].tms_utime, t[i+5].tms_stime);
            ret = 1;
        }
    }

    pthread_exit((void *)ret);
}

void get_loops_per_tick(void)
{
    struct tms t1, t2;
    unsigned long i, mloop = 1000 * 1000 * 1000;

    times(&t1);
    for (i = 0; i < mloop; i++)
        ;
    times(&t2);

    lpt = mloop / ((t2.tms_utime - t1.tms_utime) * 10);
}

void do_test(int c)
{
    struct tms t1, t2;
    clock_t j1, j2;
    pthread_t *th;
    unsigned long i, ret = 0;

    th = calloc(threads, sizeof(pthread_t));
    if (!th)
        return;

    looptype = !!(c & 0x1);
    samplesleep = !!(c & 0x2);
    printf("looptype : %s\n", looptype ? "discont" : "cont");
    printf("samplesleep : %s\n", samplesleep ? "yes" : "no");

    printf(" ## start ##\n");
    j1 = times(&t1);
    for (i = 0; i < threads; i++)
        pthread_create(&th[i], NULL, child, (void *)i);
    for (i = 0; i < threads; i++) {
        void *r;    /* join with a void *, not an int *: 64bit-safe */

        pthread_join(th[i], &r);
        ret += (unsigned long)r;
    }
    j2 = times(&t2);
    printf(" ## done. ##\n");
    printf(" loop total:\n");
    printf(" user : %7ld ms\n", (t2.tms_utime - t1.tms_utime) * 10);
    printf(" system : %7ld ms\n", (t2.tms_stime - t1.tms_stime) * 10);
    printf(" elapse : %7ld ms\n", (j2 - j1) * 10);
    printf(" error : %lu\n\n", ret);

    printf("result: %s\n\n", ret ? "BAD" : "GOOD");
    free(th);
}

int main(int argc, char **argv)
{
    int i;

    threads = argc > 1 ? atoi(argv[1]) : DEFAULT_THREADS;

    printf("### Prep:\n");
    get_loops_per_tick();
    printf("loops_per_tick: %lu\n", lpt);
    printf("threads : %d\n\n", threads);

    printf("### Test:\n");
    for (i = 0; i < 4; i++)
        do_test(i);

    return 0;
}
Hidetoshi Seto
2009-12-02 08:32:42 UTC
Permalink
This program can reproduce another problem, one that
task_{u,s}time() was originally intended to solve, I think.

The adjustment by task_{u,s}time() was only applied in getrusage()
and /proc/<pid>/stat. So even in .32-rc8, we can reproduce this
problem via sys_times(), which has no adjustment.

I confirmed that my patches fix it via thread_group_times().


Thanks,
H.Seto

===
/*
* Sample program to demonstrate invisible utime
*/

#include <stdio.h>
#include <stdlib.h>	/* atoi() */
#include <unistd.h>
#include <sys/times.h>

/* 300 million, arbitrary */
#define LOOP (300 * 1000 * 1000)

unsigned long lpt, l, c;

int do_loop(void)
{
    struct tms t1, t2;
    clock_t j1, j2;
    unsigned long i;

    printf("Loop %lu times, sleeping %lu times every %lu.\n", l, l/c, c);
    printf(" start ...\n");
    j1 = times(&t1);
    for (i = 1; i <= l; i++)
        if (!(i % c))
            usleep(0);
    j2 = times(&t2);
    printf(" ... done.\n");

    /* tms_{u,s}time is clock_t, hence %ld */
    printf(" user : %5ld ms\n", (t2.tms_utime - t1.tms_utime) * 10);
    printf(" system : %5ld ms\n", (t2.tms_stime - t1.tms_stime) * 10);
    printf(" elapse : %5ld ms\n\n", (j2 - j1) * 10);

    return (t2.tms_utime - t1.tms_utime) * 10;
}

int main(int argc, char **argv)
{
    int u0, u1, u2, u3;
    int ticks;

    l = argc > 1 ? atoi(argv[1]) : LOOP;

    printf("### Prep:\n");
    c = l;
    ticks = do_loop();
    lpt = l / ticks;
    printf("loops/tick: %lu\n", lpt);
    l = lpt * 1000;
    printf("change loop to %lu to shorten the test.\n\n", l);

    printf("### Test:\n");
    c = l;
    u0 = do_loop();
    c = lpt;
    u1 = do_loop();
    c = lpt / 2;
    u2 = do_loop();
    c = lpt / 8;
    u3 = do_loop();
    printf("result: %s\n\n",
           (u0 <= u1) && (u1 <= u2) && (u2 <= u3) ? "GOOD" : "BAD");

    return 0;
}
Balbir Singh
2009-11-23 10:25:28 UTC
Permalink
* Stanislaw Gruszka <***@redhat.com> [2009-11-23 11:09:26]:
[snip]
Post by Stanislaw Gruszka
+
+ utime = task_utime(tsk);
+ stime = task_stime(tsk);
+ if (tsk->utime > utime || tsk->stime > stime) {
+ utime = tsk->utime;
+ stime = tsk->stime;
+ }
Won't this condition be problematic, since it can reset stime
when tsk->utime > utime for example?
--
Balbir
Stanislaw Gruszka
2009-11-23 10:46:40 UTC
Permalink
On Mon, 23 Nov 2009 15:55:28 +0530
Post by Stanislaw Gruszka
[snip]
Post by Stanislaw Gruszka
+
+ utime = task_utime(tsk);
+ stime = task_stime(tsk);
+ if (tsk->utime > utime || tsk->stime > stime) {
+ utime = tsk->utime;
+ stime = tsk->stime;
+ }
Won't this condition be problematic, since it can reset stime
when tsk->utime > utime for example?
We use either both adjusted {u,s}time values or both unadjusted
values. This seems right to me. Adjusting does not help anyway if
we have a stime/utime tick disproportion, e.g. a task gets only
stime ticks and no utime ticks, but runs in user and kernel space
for the same amount of time.
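
An illustrative example (made-up numbers): a task spends 50ms in
user mode and 50ms in kernel mode, but every tick happens to land
while it is in kernel mode:

    raw:      utime = 0ms, stime = 100ms
    adjusted: utime = 0ms, stime = 100ms

The adjustment rescales the total against sum_exec_runtime but
preserves the 0:100 tick ratio, so it cannot recover the true
user/kernel split either; using the raw pair is no worse here.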

If this solution is accepted, the patch will contain an additional
comment.

Stanislaw
Hidetoshi Seto
2009-11-24 05:33:42 UTC
Permalink
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
Task A: interrupted by timer few times
(utime, stime, se.sum_sched_runtime) = (50, 50, 1000000000)
=> total of runtime is 1 sec, but utime + stime is 100 ms
Task B: interrupted by timer many times
(utime, stime, se.sum_sched_runtime) = (50, 50, 10000000)
=> total of runtime is 10 ms, but utime + stime is 100 ms
How tis is probable, that task is running very long, but not getting
the ticks ? I know this is possible, otherwise we will not see utime
decreasing after do_sys_times() siglock fix, but how probable?
For example, assume a task like a watchdog that sleeps soon after
doing its work. Such a task will be woken up by a timer interrupt
taken on another task and queued on the run queue. Once it gets a
cpu it can finish its work before the next tick, so it can run for
a long time without any ticks landing on it. I suppose you can find
such tasks in monitoring tools that contain sampling threads which
behave like watchdogs.

As a side effect, since such tasks tend to use the cpu between tick
periods, they make other tasks more likely to be interrupted by
ticks.
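
A minimal userspace sketch of such a tick-dodging worker
(illustrative only, not taken from any real monitoring tool; the
inner loop just burns a sub-tick amount of cpu):

    #include <unistd.h>

    int main(void)
    {
        volatile unsigned long i;

        for (;;) {
            /* finish well within one tick ... */
            for (i = 0; i < 100 * 1000; i++)
                ;
            /* ... then block, so ticks land on someone else */
            usleep(10 * 1000);
        }
        return 0;
    }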
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
Post by Hidetoshi Seto
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
Post by Hidetoshi Seto
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
That works for me and I agree that this is right fix. Peter had concerns
about p->prev_utime races and additional need for further propagation of
task_{s,u}time() to posix-cpu-timers code. However I do not understand
these problems.
I think one of our concerns is the cost of task_{s,u}time(), which
might cause other problems if they are propagated. But I found we
can reduce the cost (by about half, or more), which is why I posted
the task_times() patch in another thread on LKML.


Thanks,
H.Seto
Spencer Candland
2009-11-18 22:38:18 UTC
Permalink
Post by Stanislaw Gruszka
Could you please test this patch, if it solve all utime decrease
http://patchwork.kernel.org/patch/59795/
Yes, this seems to solve the problem.

- Spencer
Stanislaw Gruszka
2009-11-23 09:52:46 UTC
Permalink
Post by Stanislaw Gruszka
Post by Hidetoshi Seto
Originally task_s/utime() were designed to return clock_t but later
commit efe567fc8281661524ffa75477a7c4ca9b466c63
Date: Thu Aug 23 15:18:02 2007 +0200
It only changed the type of return value, but not the implementation.
As the result the granularity of task_s/utime() is still that of
clock_t, not that of cputime_t.
So using task_s/utime() in __exit_signal() makes values accumulated
to the signal struct to be rounded and coarse grained.
This patch removes casts to clock_t in task_u/stime(), to keep
granularity of cputime_t over the calculation.
Use div_u64() to avoid error "undefined reference to `__udivdi3`"
on some 32bit systems.
---
kernel/sched.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)
Patch not fix the issue on my system. I test it alone, together with
posix-cpu-timers: avoid do_sys_times() races with __exit_signal(
and (further) together with
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk,
struct task_cputime *times)
t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
What only changed was probability to enter the issue.
I was wrong here; that combination does fix the problem on my
system. I don't know how I was testing it before, perhaps I booted
the wrong kernel.

Stanislaw
tip-bot for Hidetoshi Seto
2009-11-12 18:12:41 UTC
Permalink
Commit-ID: 761b1d26df542fd5eb348837351e4d2f3bc7bffe
Gitweb: http://git.kernel.org/tip/761b1d26df542fd5eb348837351e4d2f3bc7bffe
Author: Hidetoshi Seto <***@jp.fujitsu.com>
AuthorDate: Thu, 12 Nov 2009 13:33:45 +0900
Committer: Ingo Molnar <***@elte.hu>
CommitDate: Thu, 12 Nov 2009 15:23:47 +0100

sched: Fix granularity of task_u/stime()

Originally task_s/utime() were designed to return clock_t, but
were later changed to return cputime_t by the following commit:

commit efe567fc8281661524ffa75477a7c4ca9b466c63
Author: Christian Borntraeger <***@de.ibm.com>
Date: Thu Aug 23 15:18:02 2007 +0200

It only changed the type of the return value, but not the
implementation. As a result the granularity of task_s/utime()
is still that of clock_t, not that of cputime_t.

So using task_s/utime() in __exit_signal() causes the values
accumulated in the signal struct to be rounded and coarse
grained.

This patch removes casts to clock_t in task_u/stime(), to keep
the granularity of cputime_t throughout the calculation.
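
As an illustration (assuming USER_HZ == 100 and HZ == 1000): for a
thread whose sum_exec_runtime is 1004ms, nsec_to_clock_t() yields
100 clock_t units, i.e. 1000ms, losing 4ms to the 10ms clock_t
granularity, while nsecs_to_cputime() keeps all 1004 one-millisecond
jiffies.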

v2:
Use div_u64() to avoid error "undefined reference to `__udivdi3`"
on some 32bit systems.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
Acked-by: Peter Zijlstra <***@infradead.org>
Cc: ***@gmail.com
Cc: Spencer Candland <***@bluehost.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: Stanislaw Gruszka <***@redhat.com>
LKML-Reference: <***@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <***@elte.hu>
---
kernel/sched.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 43e61fa..ab9a034 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5156,41 +5156,45 @@ cputime_t task_stime(struct task_struct *p)
return p->stime;
}
#else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) \
+ msecs_to_cputime(div_u64((__nsecs), NSEC_PER_MSEC))
+#endif
+
cputime_t task_utime(struct task_struct *p)
{
- clock_t utime = cputime_to_clock_t(p->utime),
- total = utime + cputime_to_clock_t(p->stime);
+ cputime_t utime = p->utime, total = utime + p->stime;
u64 temp;

/*
* Use CFS's precise accounting:
*/
- temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+ temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);

if (total) {
temp *= utime;
do_div(temp, total);
}
- utime = (clock_t)temp;
+ utime = (cputime_t)temp;

- p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+ p->prev_utime = max(p->prev_utime, utime);
return p->prev_utime;
}

cputime_t task_stime(struct task_struct *p)
{
- clock_t stime;
+ cputime_t stime;

/*
* Use CFS's precise accounting. (we subtract utime from
* the total, to make sure the total observed by userspace
* grows monotonically - apps rely on that):
*/
- stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
- cputime_to_clock_t(task_utime(p));
+ stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);

if (stime >= 0)
- p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+ p->prev_stime = max(p->prev_stime, stime);

return p->prev_stime;
}
Stanislaw Gruszka
2009-11-13 09:40:29 UTC
Permalink
Post by tip-bot for Hidetoshi Seto
sched: Fix granularity of task_u/stime()
Originally task_s/utime() were designed to return clock_t but
commit efe567fc8281661524ffa75477a7c4ca9b466c63
Date: Thu Aug 23 15:18:02 2007 +0200
It only changed the type of return value, but not the
implementation. As the result the granularity of task_s/utime()
is still that of clock_t, not that of cputime_t.
So using task_s/utime() in __exit_signal() makes values
accumulated to the signal struct to be rounded and coarse
grained.
This patch removes casts to clock_t in task_u/stime(), to keep
granularity of cputime_t over the calculation.
The patch only makes a difference on 64-bit arches and does not fix
the original utime decreasing problem on 32 bits. Anyway, except for
the nit below it looks good to me (though, to be honest, I don't
understand these {u,s}time adjusting functions at all).
Post by tip-bot for Hidetoshi Seto
- p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+ p->prev_utime = max(p->prev_utime, utime);
return p->prev_utime;
}
cputime_t task_stime(struct task_struct *p)
{
- clock_t stime;
+ cputime_t stime;
/*
* Use CFS's precise accounting. (we subtract utime from
* the total, to make sure the total observed by userspace
*/
- stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
- cputime_to_clock_t(task_utime(p));
+ stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);
if (stime >= 0)
Since cputime_t is not a signed type, we should just remove this
check or write something like this:

runtime = nsecs_to_cputime(p->se.sum_exec_runtime);
utime = task_utime(p);
if (cputime_ge(runtime, utime))
    p->prev_stime = max(p->prev_stime, runtime - utime);

Stanislaw
Ingo Molnar
2009-11-13 23:09:10 UTC
Permalink
Post by tip-bot for Hidetoshi Seto
Commit-ID: 761b1d26df542fd5eb348837351e4d2f3bc7bffe
Gitweb: http://git.kernel.org/tip/761b1d26df542fd5eb348837351e4d2f3bc7bffe
AuthorDate: Thu, 12 Nov 2009 13:33:45 +0900
CommitDate: Thu, 12 Nov 2009 15:23:47 +0100
sched: Fix granularity of task_u/stime()
Originally task_s/utime() were designed to return clock_t but
i think this one causes this new warning in -tip:

include/trace/events/timer.h:279: warning: format '%lu'
expects type 'long unsigned int', but argument 4 has type 'cputime_t'

Ingo
Hidetoshi Seto
2009-11-16 02:44:17 UTC
Permalink
Post by Ingo Molnar
Post by tip-bot for Hidetoshi Seto
Commit-ID: 761b1d26df542fd5eb348837351e4d2f3bc7bffe
Gitweb: http://git.kernel.org/tip/761b1d26df542fd5eb348837351e4d2f3bc7bffe
AuthorDate: Thu, 12 Nov 2009 13:33:45 +0900
CommitDate: Thu, 12 Nov 2009 15:23:47 +0100
sched: Fix granularity of task_u/stime()
Originally task_s/utime() were designed to return clock_t but
include/trace/events/timer.h:279: warning: format '%lu'
expects type 'long unsigned int', but argument 4 has type 'cputime_t'
No, I believe the warning was already there.
The following patch, based on tip:sched/core, will remove the warning.


Thanks,
H.Seto

===

Subject: [PATCH] trace: cputime_t is not always unsigned long

The type of cputime_t can be u64, unsigned long long, and so on,
e.g. if the kernel is configured with VIRT_CPU_ACCOUNTING=y.
So it can cause the following warning:

include/trace/events/timer.h:279: warning: format '%lu'
expects type 'long unsigned int', but argument 4 has type 'cputime_t'

For printk(), it is better to use cputime_to_cputime64() and %llu,
since cputime64_t is always u64.

Signed-off-by: Hidetoshi Seto <***@jp.fujitsu.com>
---
include/trace/events/timer.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 1844c48..8f4ec13 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -302,8 +302,8 @@ TRACE_EVENT(itimer_state,
__entry->interval_usec = value->it_interval.tv_usec;
),

- TP_printk("which %d, expires %lu, it_value %lu.%lu, it_interval %lu.%lu",
- __entry->which, __entry->expires,
+ TP_printk("which %d, expires %llu, it_value %lu.%lu, it_interval %lu.%lu",
+ __entry->which, cputime_to_cputime64(__entry->expires),
__entry->value_sec, __entry->value_usec,
__entry->interval_sec, __entry->interval_usec)
);
@@ -332,8 +332,8 @@ TRACE_EVENT(itimer_expire,
__entry->pid = pid_nr(pid);
),

- TP_printk("which %d, pid %d, now %lu", __entry->which,
- (int) __entry->pid, __entry->now)
+ TP_printk("which %d, pid %d, now %llu", __entry->which,
+ (int) __entry->pid, cputime_to_cputime64(__entry->now))
);

#endif /* _TRACE_TIMER_H */
--
1.6.5.2