Discussion:
RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0
David Holmes
2018-11-20 00:01:09 UTC
Permalink
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/

There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().

There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering a NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight-away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.

Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.

A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJT's are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assume they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.

Testing: tiers 1 -3

I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.

Thanks,
David
David Holmes
2018-11-20 08:50:46 UTC
Permalink
After discussions with Kim I've decided to split out the NJT list update
into a separate RFE:

https://bugs.openjdk.java.net/browse/JDK-8214097

So only the change in management.cpp needs reviewing and testing.

Updated webrev:

http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/

Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering a NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight-away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJT's are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assume they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.
Thanks,
David
Thomas Stüfe
2018-11-20 10:20:39 UTC
Permalink
Hi David,

out of interest, would pthread_getcpuclockid also crash when fed an
pthread_t of an existing thread which has been suspended and not yet
started (on Solaris, AIX)? Not that it would have much to report :)

--

1643 // exclude externally visible JavaThreads
1644 if (thread->is_Java_thread() &&
!thread->is_hidden_from_external_view()) {
1645 return;
1646 }
1647
1648 // NonJavaThread instances may not be fully initialized yet, so
we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (!thread->is_Java_thread() && thread->stack_size() == 0) {
1651 return;
1652 }

So do we really want to skip the check for externally hidden java
threads? If not, would not this b better:

1648 // Thread instances may not be fully initialized yet, so we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (thread->stack_size() == 0) {
1651 return;
1652 }

--

I agree an "is_initialized" flag would be better. We do have the
ThreadState in the associated OsThread, could that be used?

Thanks, Thomas
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list update
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering a NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight-away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJT's are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assume they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-20 10:38:06 UTC
Permalink
Hi Thomas,

Thanks for taking a look at this.
Post by Thomas Stüfe
Hi David,
out of interest, would pthread_getcpuclockid also crash when fed an
pthread_t of an existing thread which has been suspended and not yet
started (on Solaris, AIX)? Not that it would have much to report :)
I don't know for certain but I would certainly hope not! Once the
pthread_t has been returned to the caller it should be valid. Note
however on Solaris we don't normally use pthreads.
Post by Thomas Stüfe
--
1643 // exclude externally visible JavaThreads
1644 if (thread->is_Java_thread() &&
!thread->is_hidden_from_external_view()) {
1645 return;
1646 }
1647
1648 // NonJavaThread instances may not be fully initialized yet, so
we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (!thread->is_Java_thread() && thread->stack_size() == 0) {
1651 return;
1652 }
So do we really want to skip the check for externally hidden java
1648 // Thread instances may not be fully initialized yet, so we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (thread->stack_size() == 0) {
1651 return;
1652 }
JavaThreads can't be "not fully initialized" as they don't appear in the
ThreadsList until they are fully initialized. So I prefer to be clear
this only affects NJTs.
Post by Thomas Stüfe
--
I agree an "is_initialized" flag would be better. We do have the
ThreadState in the associated OsThread, could that be used?
I suppose it could use:

thread->osThread()->get_state() >= RUNNABLE

but OSThread ThreadState is a bit of an anachronism - as it states in
the header:

// Note: the ThreadState is legacy code and is not correctly implemented.
// Uses of ThreadState need to be replaced by the state in the JavaThread.

so I'd rather not use that. Checking the stack_size has been set ensures
the thread has reached as "reasonable" point in its execution. It's
later than strictly necessary but we're racing regardless.

Thanks,
David
-----
Post by Thomas Stüfe
Thanks, Thomas
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list update
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering a NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight-away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJT's are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assume they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.
Thanks,
David
Thomas Stüfe
2018-11-20 10:50:30 UTC
Permalink
Post by David Holmes
Hi Thomas,
Thanks for taking a look at this.
Post by Thomas Stüfe
Hi David,
out of interest, would pthread_getcpuclockid also crash when fed an
pthread_t of an existing thread which has been suspended and not yet
started (on Solaris, AIX)? Not that it would have much to report :)
I don't know for certain but I would certainly hope not! Once the
pthread_t has been returned to the caller it should be valid. Note
however on Solaris we don't normally use pthreads.
Post by Thomas Stüfe
--
1643 // exclude externally visible JavaThreads
1644 if (thread->is_Java_thread() &&
!thread->is_hidden_from_external_view()) {
1645 return;
1646 }
1647
1648 // NonJavaThread instances may not be fully initialized yet, so
we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (!thread->is_Java_thread() && thread->stack_size() == 0) {
1651 return;
1652 }
So do we really want to skip the check for externally hidden java
1648 // Thread instances may not be fully initialized yet, so we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (thread->stack_size() == 0) {
1651 return;
1652 }
JavaThreads can't be "not fully initialized" as they don't appear in the
ThreadsList until they are fully initialized.
Are you sure?

See e.g. CompileBroker::make_thread() :

Threads::add(thread);
Thread::start(thread);

since stack_size is still 0 before Thread::start() on platforms where
the threads are created suspended, you could encounter threads with
stack_size==0. I do not know whether this translates to an invalid
pthread_t handle though (hence my question about suspended threads).
Post by David Holmes
So I prefer to be clear
this only affects NJTs.
Post by Thomas Stüfe
--
I agree an "is_initialized" flag would be better. We do have the
ThreadState in the associated OsThread, could that be used?
thread->osThread()->get_state() >= RUNNABLE
but OSThread ThreadState is a bit of an anachronism - as it states in
// Note: the ThreadState is legacy code and is not correctly implemented.
// Uses of ThreadState need to be replaced by the state in the JavaThread.
so I'd rather not use that. Checking the stack_size has been set ensures
the thread has reached as "reasonable" point in its execution. It's
later than strictly necessary but we're racing regardless.
Querying for thread_stack==0 is certainly fine at this point.

Thanks, Thomas
Post by David Holmes
Thanks,
David
-----
Post by Thomas Stüfe
Thanks, Thomas
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list update
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering a NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight-away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJT's are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assume they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-20 12:49:59 UTC
Permalink
Hi Thomas,
Post by Thomas Stüfe
Post by David Holmes
Hi Thomas,
Thanks for taking a look at this.
Post by Thomas Stüfe
Hi David,
out of interest, would pthread_getcpuclockid also crash when fed an
pthread_t of an existing thread which has been suspended and not yet
started (on Solaris, AIX)? Not that it would have much to report :)
I don't know for certain but I would certainly hope not! Once the
pthread_t has been returned to the caller it should be valid. Note
however on Solaris we don't normally use pthreads.
Post by Thomas Stüfe
--
1643 // exclude externally visible JavaThreads
1644 if (thread->is_Java_thread() &&
!thread->is_hidden_from_external_view()) {
1645 return;
1646 }
1647
1648 // NonJavaThread instances may not be fully initialized yet, so
we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (!thread->is_Java_thread() && thread->stack_size() == 0) {
1651 return;
1652 }
So do we really want to skip the check for externally hidden java
1648 // Thread instances may not be fully initialized yet, so we need to
1649 // skip any that aren't - check for zero stack_size()
1650 if (thread->stack_size() == 0) {
1651 return;
1652 }
JavaThreads can't be "not fully initialized" as they don't appear in the
ThreadsList until they are fully initialized.
Are you sure?
Threads::add(thread);
Thread::start(thread);
since stack_size is still 0 before Thread::start() on platforms where
the threads are created suspended, you could encounter threads with
stack_size==0. I do not know whether this translates to an invalid
pthread_t handle though (hence my question about suspended threads).
Ah yes you are right - sorry. I didn't mean to imply that JavaThreads
were guaranteed to have their stack set - that's certainly not true. For
non-suspended-at-start platforms they are guaranteed to have executed
some code. But as you note if started suspended then they may not be
actually started before being seen in the ThreadsList.

For this particular case Solaris always uses the LWP id to read the
lwpusage from /proc. I'm not aware of any reports that this has ever
failed and I would not expect it to fail, but ...

AIX similarly uses a kernel-thread ID and a kernel-thread specific API.

Platform idiosyncrasies aside you may well be right that its better to
ensure all the examined threads have at least executed to the point
where they have set their own stack information.

Opinions from others?
Post by Thomas Stüfe
Post by David Holmes
So I prefer to be clear
this only affects NJTs.
Post by Thomas Stüfe
--
I agree an "is_initialized" flag would be better. We do have the
ThreadState in the associated OsThread, could that be used?
thread->osThread()->get_state() >= RUNNABLE
but OSThread ThreadState is a bit of an anachronism - as it states in
// Note: the ThreadState is legacy code and is not correctly implemented.
// Uses of ThreadState need to be replaced by the state in the JavaThread.
so I'd rather not use that. Checking the stack_size has been set ensures
the thread has reached as "reasonable" point in its execution. It's
later than strictly necessary but we're racing regardless.
Querying for thread_stack==0 is certainly fine at this point.
Okay - thanks.

David
-----
Post by Thomas Stüfe
Thanks, Thomas
Post by David Holmes
Thanks,
David
-----
Post by Thomas Stüfe
Thanks, Thomas
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list update
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering a NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight-away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJT's are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assume they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.
Thanks,
David
Kim Barrett
2018-11-21 05:01:34 UTC
Permalink
Post by David Holmes
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for NonJavaThreads (NJTs). That functionality requires a valid/live target thread so that we can use its pthread_t identity to obtain its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered during its constructor and de-registered during its destructor. A thread that has only been constructed has not yet executed and so is not a valid target for this management API. This seems to be the cause of failures reported in this bug (and JDK-8213434). Registering a NJT only when it starts executing is an appealing fix for this, but that impacts all current users of the NJT list and straight-away causes a problem with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread to skip threads that have not yet executed - which we can recognize by seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for NJT lifecycle management if desired, tackles the problem of encountering a terminated thread during iteration - which can also lead to SEGVs. This can arise because NJT's are not actually "destructed", even if they terminate, and so they never get removed from the NJT list. Calling destructors is problematic because the code using these NJTs assume they are always valid. So the fix in this case is to move the de-registering from the NJT list out of the destructor and into the Thread::call_run() method so it is done before a thread actually terminates. This can be considered a first step in cleaning up the NJT lifecycle, where the remaining steps touch on a lot of areas and so need to be handled separately e.g. see JDK-8087340 for shutting down WorkGang GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure locally, even after thousands of runs. I'm hoping Zhengyu can test this in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-28 02:58:43 UTC
Permalink
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.

I've decided to stick with this simple fix for NJTs only.

David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for NonJavaThreads (NJTs). That functionality requires a valid/live target thread so that we can use its pthread_t identity to obtain its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered during its constructor and de-registered during its destructor. A thread that has only been constructed has not yet executed and so is not a valid target for this management API. This seems to be the cause of failures reported in this bug (and JDK-8213434). Registering a NJT only when it starts executing is an appealing fix for this, but that impacts all current users of the NJT list and straight-away causes a problem with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread to skip threads that have not yet executed - which we can recognize by seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for NJT lifecycle management if desired, tackles the problem of encountering a terminated thread during iteration - which can also lead to SEGVs. This can arise because NJT's are not actually "destructed", even if they terminate, and so they never get removed from the NJT list. Calling destructors is problematic because the code using these NJTs assume they are always valid. So the fix in this case is to move the de-registering from the NJT list out of the destructor and into the Thread::call_run() method so it is done before a thread actually terminates. This can be considered a first step in cleaning up the NJT lifecycle, where the remaining steps touch on a lot of areas and so need to be handled separately e.g. see JDK-8087340 for shutting down WorkGang GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure locally, even after thousands of runs. I'm hoping Zhengyu can test this in the conditions reported in JDK-8213434.
Thanks,
David
Thomas Stüfe
2018-11-28 06:30:19 UTC
Permalink
Hi David,

I admit I still do not really get it..

E.g. for GC threads:

We have

static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);

which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.

ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.

So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.

In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);

Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?

If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
the Thread object:

thread->set_osthread(osthread);

and then later, after pthread_create is thru, set its thread id:

// Store pthread info into the OSThread
osthread->set_pthread_id(tid);

When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.

So for my understanding, we should have two situations:

(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.

(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?

Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.

--
P.s.

ConcurrentGCThread::ConcurrentGCThread() :
_should_terminate(false), _has_terminated(false) {
};

I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
--

Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for NonJavaThreads (NJTs). That functionality requires a valid/live target thread so that we can use its pthread_t identity to obtain its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered during its constructor and de-registered during its destructor. A thread that has only been constructed has not yet executed and so is not a valid target for this management API. This seems to be the cause of failures reported in this bug (and JDK-8213434). Registering a NJT only when it starts executing is an appealing fix for this, but that impacts all current users of the NJT list and straight-away causes a problem with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread to skip threads that have not yet executed - which we can recognize by seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for NJT lifecycle management if desired, tackles the problem of encountering a terminated thread during iteration - which can also lead to SEGVs. This can arise because NJT's are not actually "destructed", even if they terminate, and so they never get removed from the NJT list. Calling destructors is problematic because the code using these NJTs assume they are always valid. So the fix in this case is to move the de-registering from the NJT list out of the destructor and into the Thread::call_run() method so it is done before a thread actually terminates. This can be considered a first step in cleaning up the NJT lifecycle, where the remaining steps touch on a lot of areas and so need to be handled separately e.g. see JDK-8087340 for shutting down WorkGang GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure locally, even after thousands of runs. I'm hoping Zhengyu can test this in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-28 07:26:39 UTC
Permalink
Hi Thomas,
Post by Thomas Stüfe
Hi David,
I admit I still do not really get it..
We have
static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);
which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.
ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.
So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.
In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);
Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?
If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
thread->set_osthread(osthread);
// Store pthread info into the OSThread
osthread->set_pthread_id(tid);
When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.
(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.
Post by Thomas Stüfe
(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?
Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.
I find it awkward that the list is populated with partially constructed
threads in the first place. This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".
Post by Thomas Stüfe
--
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.

Cheers,
David
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for NonJavaThreads (NJTs). That functionality requires a valid/live target thread so that we can use its pthread_t identity to obtain its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered during its constructor and de-registered during its destructor. A thread that has only been constructed has not yet executed and so is not a valid target for this management API. This seems to be the cause of failures reported in this bug (and JDK-8213434). Registering a NJT only when it starts executing is an appealing fix for this, but that impacts all current users of the NJT list and straight-away causes a problem with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread to skip threads that have not yet executed - which we can recognize by seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for NJT lifecycle management if desired, tackles the problem of encountering a terminated thread during iteration - which can also lead to SEGVs. This can arise because NJT's are not actually "destructed", even if they terminate, and so they never get removed from the NJT list. Calling destructors is problematic because the code using these NJTs assume they are always valid. So the fix in this case is to move the de-registering from the NJT list out of the destructor and into the Thread::call_run() method so it is done before a thread actually terminates. This can be considered a first step in cleaning up the NJT lifecycle, where the remaining steps touch on a lot of areas and so need to be handled separately e.g. see JDK-8087340 for shutting down WorkGang GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure locally, even after thousands of runs. I'm hoping Zhengyu can test this in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-28 07:35:17 UTC
Permalink
#$%$#@! I missed the fact that the thread id is set _after_ we do
record_stack_base_and-size(). I thought that was in call_run() not
native_thread_entry().

<sigh> New bug being filed.

Thanks,
David
Post by David Holmes
Hi Thomas,
Post by Thomas Stüfe
Hi David,
I admit I still do not really get it..
We have
static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);
which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.
ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.
So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.
In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);
Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?
If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
   thread->set_osthread(osthread);
     // Store pthread info into the OSThread
     osthread->set_pthread_id(tid);
When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.
(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.
Post by Thomas Stüfe
(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?
Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.
I find it awkward that the list is populated with partially constructed
threads in the first place. This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".
Post by Thomas Stüfe
--
P.s.
   _should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
Cheers,
David
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live
target thread so that we can use its pthread_t identity to obtain
its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is
registered during its constructor and de-registered during its
destructor. A thread that has only been constructed has not yet
executed and so is not a valid target for this management API.
This seems to be the cause of failures reported in this bug (and
JDK-8213434). Registering a NJT only when it starts executing is
an appealing fix for this, but that impacts all current users of
the NJT list and straight-away causes a problem with the
BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for
ThreadTimesClosure::do_thread to skip threads that have not yet
executed - which we can recognize by seeing an uninitialized (i.e.
zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE
for NJT lifecycle management if desired, tackles the problem of
encountering a terminated thread during iteration - which can also
lead to SEGVs. This can arise because NJT's are not actually
"destructed", even if they terminate, and so they never get
removed from the NJT list. Calling destructors is problematic
because the code using these NJTs assume they are always valid. So
the fix in this case is to move the de-registering from the NJT
list out of the destructor and into the Thread::call_run() method
so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where
the remaining steps touch on a lot of areas and so need to be
handled separately e.g. see JDK-8087340 for shutting down WorkGang
GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test
this in the conditions reported in JDK-8213434.
Thanks,
David
Thomas Stüfe
2018-11-28 07:46:11 UTC
Permalink
Are you sure?

On Linux, pthread id is set in the parent thread, after pthread_create
returns. Only the kernel thread id is set by the child (I find this
duality confusing). thread_cpu_time uses pthread_id. So, on Linux, it
may or may not be set before the thread stack boundaries are
initialized, depending on whether the native entry function ran before
the parent thread continued.

..Thomas
Post by David Holmes
record_stack_base_and-size(). I thought that was in call_run() not
native_thread_entry().
<sigh> New bug being filed.
Thanks,
David
Post by David Holmes
Hi Thomas,
Post by Thomas Stüfe
Hi David,
I admit I still do not really get it..
We have
static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);
which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.
ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.
So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.
In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);
Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?
If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
thread->set_osthread(osthread);
// Store pthread info into the OSThread
osthread->set_pthread_id(tid);
When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.
(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.
Post by Thomas Stüfe
(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?
Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.
I find it awkward that the list is populated with partially constructed
threads in the first place. This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".
Post by Thomas Stüfe
--
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
Cheers,
David
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live
target thread so that we can use its pthread_t identity to obtain
its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is
registered during its constructor and de-registered during its
destructor. A thread that has only been constructed has not yet
executed and so is not a valid target for this management API.
This seems to be the cause of failures reported in this bug (and
JDK-8213434). Registering a NJT only when it starts executing is
an appealing fix for this, but that impacts all current users of
the NJT list and straight-away causes a problem with the
BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for
ThreadTimesClosure::do_thread to skip threads that have not yet
executed - which we can recognize by seeing an uninitialized (i.e.
zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE
for NJT lifecycle management if desired, tackles the problem of
encountering a terminated thread during iteration - which can also
lead to SEGVs. This can arise because NJT's are not actually
"destructed", even if they terminate, and so they never get
removed from the NJT list. Calling destructors is problematic
because the code using these NJTs assume they are always valid. So
the fix in this case is to move the de-registering from the NJT
list out of the destructor and into the Thread::call_run() method
so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where
the remaining steps touch on a lot of areas and so need to be
handled separately e.g. see JDK-8087340 for shutting down WorkGang
GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test
this in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-28 07:59:33 UTC
Permalink
Post by Thomas Stüfe
Are you sure?
No. I'm completely confusing myself - too many different "thread id's" :(
Post by Thomas Stüfe
On Linux, pthread id is set in the parent thread, after pthread_create
returns. Only the kernel thread id is set by the child (I find this
duality confusing). thread_cpu_time uses pthread_id. So, on Linux, it
may or may not be set before the thread stack boundaries are
initialized, depending on whether the native entry function ran before
the parent thread continued.
Yes you are right. The code is still broken but for a different reason.

I may just fix this all in JDK-8214097 - Rework thread initialization
and teardown logic - by only adding to the list after the thread has run
its own initialization logic. Then the guard I just added to the
management code can be deleted again.

Thanks,
David
Post by Thomas Stüfe
..Thomas
Post by David Holmes
record_stack_base_and-size(). I thought that was in call_run() not
native_thread_entry().
<sigh> New bug being filed.
Thanks,
David
Post by David Holmes
Hi Thomas,
Post by Thomas Stüfe
Hi David,
I admit I still do not really get it..
We have
static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);
which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.
ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.
So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.
In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);
Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?
If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
thread->set_osthread(osthread);
// Store pthread info into the OSThread
osthread->set_pthread_id(tid);
When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.
(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.
Post by Thomas Stüfe
(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?
Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.
I find it awkward that the list is populated with partially constructed
threads in the first place. This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".
Post by Thomas Stüfe
--
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
Cheers,
David
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live
target thread so that we can use its pthread_t identity to obtain
its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is
registered during its constructor and de-registered during its
destructor. A thread that has only been constructed has not yet
executed and so is not a valid target for this management API.
This seems to be the cause of failures reported in this bug (and
JDK-8213434). Registering a NJT only when it starts executing is
an appealing fix for this, but that impacts all current users of
the NJT list and straight-away causes a problem with the
BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for
ThreadTimesClosure::do_thread to skip threads that have not yet
executed - which we can recognize by seeing an uninitialized (i.e.
zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE
for NJT lifecycle management if desired, tackles the problem of
encountering a terminated thread during iteration - which can also
lead to SEGVs. This can arise because NJT's are not actually
"destructed", even if they terminate, and so they never get
removed from the NJT list. Calling destructors is problematic
because the code using these NJTs assume they are always valid. So
the fix in this case is to move the de-registering from the NJT
list out of the destructor and into the Thread::call_run() method
so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where
the remaining steps touch on a lot of areas and so need to be
handled separately e.g. see JDK-8087340 for shutting down WorkGang
GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test
this in the conditions reported in JDK-8213434.
Thanks,
David
Thomas Stüfe
2018-11-28 14:07:06 UTC
Permalink
Post by David Holmes
Post by Thomas Stüfe
Are you sure?
No. I'm completely confusing myself - too many different "thread id's" :(
A cheerful multitude :)

What throws me off usually is not that we have two of them, but that,
on Linux, the generalized thread id (OSThread::thread_id()) is the
kernel tid. That just feels weird, since pthread_t is what we use 99%
of all cases, and the kernel tid only in those rare cases where we
access the proc file system or similar. Well I guess this is all
historical, probably pre-NPTL?

In the AIX port we did it the other way around: made pthread_t the
default "thread_id" and use special accessors for the kernel tid
(OSThread::kernel_thread_id()) for those rare cases where we really
need it.
Post by David Holmes
Post by Thomas Stüfe
On Linux, pthread id is set in the parent thread, after pthread_create
returns. Only the kernel thread id is set by the child (I find this
duality confusing). thread_cpu_time uses pthread_id. So, on Linux, it
may or may not be set before the thread stack boundaries are
initialized, depending on whether the native entry function ran before
the parent thread continued.
Yes you are right. The code is still broken but for a different reason.
I may just fix this all in JDK-8214097 - Rework thread initialization
and teardown logic - by only adding to the list after the thread has run
its own initialization logic. Then the guard I just added to the
management code can be deleted again.
I agree, the way to go long term is to make sure that no uninitialized
or cleaned up threads are in the threadlist.

I also think your small is fine for now. I was just worried that we
start using Thread::stack_size() > 0 as a synonym for something like
"Thread::fully_initialized_and_still_alive()" and that once we change
it, we need to hunt down all cases of "misused Thread::stack_size()".
Therefore I would have prepared something like:

Thread::is_alive() { return _osthread != NULL && _osthread->thread_id
!= 0 && _osthread->pthread_id != 0 && _stack_base != NULL; }

and given that this is platform-specific, probably something like:

Thread::is_alive() { return _osthread != NULL &&
_osthread->is->alive() && _stack_base != NULL; }
OSThread::is_alive() (unix) { return _pthread_id != 0 && _thread_id != 0; }

and at the end of call_run(), resetting Thread::_osthread to NULL or similar.

But I am fine with your change, if you want to check it in as it is.

..Thomas
Post by David Holmes
Thanks,
David
Post by Thomas Stüfe
..Thomas
Post by David Holmes
record_stack_base_and-size(). I thought that was in call_run() not
native_thread_entry().
<sigh> New bug being filed.
Thanks,
David
Post by David Holmes
Hi Thomas,
Post by Thomas Stüfe
Hi David,
I admit I still do not really get it..
We have
static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);
which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.
ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.
So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.
In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);
Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?
If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
thread->set_osthread(osthread);
// Store pthread info into the OSThread
osthread->set_pthread_id(tid);
When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.
(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.
Post by Thomas Stüfe
(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?
Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.
I find it awkward that the list is populated with partially constructed
threads in the first place. This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".
Post by Thomas Stüfe
--
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
Cheers,
David
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
After discussions with Kim I've decided to split out the NJT list
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live
target thread so that we can use its pthread_t identity to obtain
its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is
registered during its constructor and de-registered during its
destructor. A thread that has only been constructed has not yet
executed and so is not a valid target for this management API.
This seems to be the cause of failures reported in this bug (and
JDK-8213434). Registering a NJT only when it starts executing is
an appealing fix for this, but that impacts all current users of
the NJT list and straight-away causes a problem with the
BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for
ThreadTimesClosure::do_thread to skip threads that have not yet
executed - which we can recognize by seeing an uninitialized (i.e.
zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE
for NJT lifecycle management if desired, tackles the problem of
encountering a terminated thread during iteration - which can also
lead to SEGVs. This can arise because NJT's are not actually
"destructed", even if they terminate, and so they never get
removed from the NJT list. Calling destructors is problematic
because the code using these NJTs assume they are always valid. So
the fix in this case is to move the de-registering from the NJT
list out of the destructor and into the Thread::call_run() method
so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where
the remaining steps touch on a lot of areas and so need to be
handled separately e.g. see JDK-8087340 for shutting down WorkGang
GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test
this in the conditions reported in JDK-8213434.
Thanks,
David
Thomas Stüfe
2018-11-28 07:35:48 UTC
Permalink
Hi David,
Post by David Holmes
Hi Thomas,
Post by Thomas Stüfe
Hi David,
I admit I still do not really get it..
We have
static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);
which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.
ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.
So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.
In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);
Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?
If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
thread->set_osthread(osthread);
// Store pthread info into the OSThread
osthread->set_pthread_id(tid);
When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.
(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.
Oh, was it 0? That was not clear from the bug report. I wondered
whether it could have been a random value.
Post by David Holmes
Post by Thomas Stüfe
(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?
Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.
I find it awkward that the list is populated with partially constructed
threads in the first place.
Sure. That is awkward too.
Post by David Holmes
This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".
I understand. But would a check in os::thread_cpu_time() for
osthread==NULL || osthread->pthread_id == 0 not be safer and help in
more situations?

Hence my former questions. If pthread_id were not 0 but either some
random value or a pthread_id of a long-terminated thread, that would
not help.
Post by David Holmes
Post by Thomas Stüfe
--
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
Cheers,
David
..Thomas
Post by David Holmes
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for NonJavaThreads (NJTs). That functionality requires a valid/live target thread so that we can use its pthread_t identity to obtain its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered during its constructor and de-registered during its destructor. A thread that has only been constructed has not yet executed and so is not a valid target for this management API. This seems to be the cause of failures reported in this bug (and JDK-8213434). Registering a NJT only when it starts executing is an appealing fix for this, but that impacts all current users of the NJT list and straight-away causes a problem with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread to skip threads that have not yet executed - which we can recognize by seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for NJT lifecycle management if desired, tackles the problem of encountering a terminated thread during iteration - which can also lead to SEGVs. This can arise because NJT's are not actually "destructed", even if they terminate, and so they never get removed from the NJT list. Calling destructors is problematic because the code using these NJTs assume they are always valid. So the fix in this case is to move the de-registering from the NJT list out of the destructor and into the Thread::call_run() method so it is done before a thread actually terminates. This can be considered a first step in cleaning up the NJT lifecycle, where the remaining steps touch on a lot of areas and so need to be handled separately e.g. see JDK-8087340 for shutting down WorkGang GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure locally, even after thousands of runs. I'm hoping Zhengyu can test this in the conditions reported in JDK-8213434.
Thanks,
David
David Holmes
2018-11-28 07:49:11 UTC
Permalink
Hi Thomas,

Not withstanding the fact my fix was wrong ...

<trimming>
Post by Thomas Stüfe
I understand. But would a check in os::thread_cpu_time() for
osthread==NULL || osthread->pthread_id == 0 not be safer and help in
more situations?
My view is that such functions should only be called on threads that are
valid/live so that the function itself does not have to do such checks -
which can be racy anyway when it comes to thread termination. So
higher-level code is responsible for ensuring only valid threads are the
target of such calls.
Post by Thomas Stüfe
Hence my former questions. If pthread_id were not 0 but either some
random value or a pthread_id of a long-terminated thread, that would
not help.
Yes but the aim is to ensure it is either zero or else a valid value.

Cheers,
David
Post by Thomas Stüfe
Post by David Holmes
Post by Thomas Stüfe
--
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
Cheers,
David
..Thomas
Post by David Holmes
Post by Thomas Stüfe
--
Cheers, Thomas
Post by David Holmes
Sorry for the delayed response.
Post by Kim Barrett
Post by David Holmes
https://bugs.openjdk.java.net/browse/JDK-8214097
So only the change in management.cpp needs reviewing and testing.
http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
Looks good.
Thanks Kim.
I've decided to stick with this simple fix for NJTs only.
David
Post by Kim Barrett
Post by David Holmes
Thanks,
David
Post by David Holmes
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for NonJavaThreads (NJTs). That functionality requires a valid/live target thread so that we can use its pthread_t identity to obtain its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is registered during its constructor and de-registered during its destructor. A thread that has only been constructed has not yet executed and so is not a valid target for this management API. This seems to be the cause of failures reported in this bug (and JDK-8213434). Registering a NJT only when it starts executing is an appealing fix for this, but that impacts all current users of the NJT list and straight-away causes a problem with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread to skip threads that have not yet executed - which we can recognize by seeing an uninitialized (i.e. zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE for NJT lifecycle management if desired, tackles the problem of encountering a terminated thread during iteration - which can also lead to SEGVs. This can arise because NJT's are not actually "destructed", even if they terminate, and so they never get removed from the NJT list. Calling destructors is problematic because the code using these NJTs assume they are always valid. So the fix in this case is to move the de-registering from the NJT list out of the destructor and into the Thread::call_run() method so it is done before a thread actually terminates. This can be considered a first step in cleaning up the NJT lifecycle, where the remaining steps touch on a lot of areas and so need to be handled separately e.g. see JDK-8087340 for shutting down WorkGang GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure locally, even after thousands of runs. I'm hoping Zhengyu can test this in the conditions reported in JDK-8213434.
Thanks,
David
Kim Barrett
2018-11-28 19:53:17 UTC
Permalink
Post by David Holmes
Post by Thomas Stüfe
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
It is implicit. In any constructor, if the initializer-list doesn’t contain a call to a
(non-virtual) base class's constructor, an implicit call to the default constructor
for that base class will be inserted at the appropriate place. (Virtual base
classes are more complicated, but since we avoid using them…)

It’s pretty common usage in our code to not explicitly mention default base class
constructor calls.
Thomas Stüfe
2018-11-28 20:09:38 UTC
Permalink
Post by Kim Barrett
Post by David Holmes
Post by Thomas Stüfe
P.s.
_should_terminate(false), _has_terminated(false) {
};
I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
I assume it is implicit. But yes it should be explicit.
It is implicit. In any constructor, if the initializer-list doesn’t contain a call to a
(non-virtual) base class's constructor, an implicit call to the default constructor
for that base class will be inserted at the appropriate place. (Virtual base
classes are more complicated, but since we avoid using them…)
It’s pretty common usage in our code to not explicitly mention default base class
constructor calls.
Thanks for explaining :)

I assumed as much. But I actually prefer the explicit notation since
it saves me some indirection when walking up the base constructors in
an IDE. Otherwise I have to check which class this class derived from.

..Thomas

Loading...