Jan-Erik Soderholm
2014-06-24 11:51:30 UTC
Hi.
Are there any traces left from when a process goes into
a MWAIT/MUTEX state? I have been searching ACCOUNTING
and the OPERATOR.LOG so far, but I cannot find anything
there. We guess that some process went into this state
sometime before last midnight.
We have an issue (that I reported earlier, sometime last
year) where something happens in the telnet part of
TCPIP/Services if one runs two batch (or by hand) jobs
at the same time that do DELETE and CREATE on
the same TNA device number. One of the jobs/processes
goes into a MUTEX state and hangs there until we run
another job doing DEL/CRE on *another* TNA device...
The MUTEX "jumps" from one process to the next.
We have tried to work around the issue by using batch
queues with /job_limit=1 and similar things, but we
might have missed some submit somewhere.
We also have a dummy batch job running that makes sure that
we hang a dummy TNA device instead of a production device.
It's only a reboot that clears this error. Well, probably
also a shutdown and startup of TCPIP/Services, but then
we could just as well restart the rest also...
Anyway, we'd like to find out when this MUTEX *first*
appeared this time to track down the root cause.
Jan-Erik.