Discussion:
[Info-vax] Is an MUTEX/MWAIT happening logged in any way?
Jan-Erik Soderholm
2014-06-24 11:51:30 UTC
Permalink
Hi.

Is there any traces left from when a process goes into
a MWAIT/MUTEX state? I have been searching ACCUNTING
and the OPERATOR.LOG so far, but I can not find anything
there. We guess that some process went into this state
sometimes before last midnight.

We have an issue (that I reported about earlier, last
year sometimes) that somerting happens in the telnet
part of TCPIP/Services if one two batch (or by hand)
jobs at the same time that does DELETE and CREATE on
the same TNA device number. One of the jobs/processes
goes into a MUTEX state and hangs there until we run
another job doing DEL/CRE on *another* TNA device...

The MUTEX "jumps" from one process to the next.

We have tried to walk around the issue by using batch
queues with /job_limit=1 and similar things, but we
might have missed some submit someware.

We also have a dummy batch job running that make sure that
we hang a dummy TNA device instead of a production device.

It's only a reboot that clears this error. Well, probably
also a shutdown and startup of TCPIP/Services, but then
we could just as well restart the rest also...

Anyway, we'd like to find out when this MUTEX *first*
apparead this time to trace down the root couse.

Jan-Erik.
abrsvc
2014-06-24 12:02:31 UTC
Permalink
IT might help to know what Mutex is being held here. I find it interesting that the mutex "moves" from device to device. This is a bit odd. You'd be better to determine what resource is being restrained here. Normally when a process gets into an MWAIT state, unless the resource becomes available, the proces will remain in that state. Since this state "moves" between processes, there is clearly a resource that is short. Lets look for that resource first.

Dan
Jan-Erik Soderholm
2014-06-24 12:22:43 UTC
Permalink
Post by abrsvc
IT might help to know what Mutex is being held here. I find it
interesting that the mutex "moves" from device to device. This is a bit
odd. You'd be better to determine what resource is being restrained
here. Normally when a process gets into an MWAIT state, unless the
resource becomes available, the proces will remain in that state. Since
this state "moves" between processes, there is clearly a resource that
is short. Lets look for that resource first.
Dan
Yes, I have been trying to trace down this before. I think I ended
up in some internal TCPIP stack stuff.

Right now we have :

$ pipe show sys | sea sys$pipe mutex
0002CA3A BATCH_394 MUTEX 3 589 0 00:00:00.15 ...

That is our dummy job that grabs the MUTEX letting the prod
jobs run to an end. The thing is that a job hanging with the
MUTEX runs just fine as soon as another job "grabs" it.

$ sho proc/id=0002CA3A/cont
%SYSTEM-F-SUSPENDED, process is suspended
$

I can post an "SDA> show proc" of it, it doesn't tell me
much anyway... :-)

Jan-Erik.
Volker Halle
2014-06-24 12:30:25 UTC
Permalink
Jan-Eric,

if you have a process in MUTEX wait, try this:

$ ANALYZE/SYS
SDA> SHOW PROC/IND=2CA3A
SDA> SHOW PROC
...
Waiting EF cluster 1 Event flag wait mask xxxxxxxx
SDA> EXA xxxxxxxx
SDA> Exit

Volker.
Jan-Erik Soderholm
2014-06-24 12:32:39 UTC
Permalink
Post by Volker Halle
Jan-Eric,
$ ANALYZE/SYS
SDA> SHOW PROC/IND=2CA3A
SDA> SHOW PROC
...
Waiting EF cluster 1 Event flag wait mask xxxxxxxx
SDA> EXA xxxxxxxx
SDA> Exit
Volker.
KTB address 82171F40 HWPCB address FFFFFFFF.83D22080
PKTA address 7FFEFF98 Callback vector address 00000000
Internal PID 00B2023A Callback error 00000000
Extended PID 0002CA3A Current CPU id 00000000
State MUTEX Flags 00000000
Base priority 1 Current priority 3
Waiting EF cluster 4 Event flag wait mask 81FBDB40

Press RETURN for more.
SDA> exa 81FBDB40
FFFFFFFF.81FBDB40: FFB3C330.FFB3C330 "0ó.0ó."
SDA>


Hm, doesn't tell *me* anything... :-)

Jan-Erik
Volker Halle
2014-06-24 12:46:34 UTC
Permalink
Jan-Eric,

this does not look the the Contents of a typical mutex. As an example, look at SDA> EXA QMAN$GL_MUTEX which would show 00000000.0000FFFF

Try to look for Memory corruption:

SDA> EXA 81FBDB40;100 ! do you see the same data pattern repeating itself ?

Volker.

PS: and look at your old entries regarding this Problem in comp.os.vms ...
V***@SendSpamHere.ORG
2014-06-24 12:52:19 UTC
Permalink
Post by Jan-Erik Soderholm
Post by Volker Halle
Jan-Eric,
$ ANALYZE/SYS
SDA> SHOW PROC/IND=2CA3A
SDA> SHOW PROC
...
Waiting EF cluster 1 Event flag wait mask xxxxxxxx
SDA> EXA xxxxxxxx
SDA> Exit
Volker.
KTB address 82171F40 HWPCB address FFFFFFFF.83D22080
PKTA address 7FFEFF98 Callback vector address 00000000
Internal PID 00B2023A Callback error 00000000
Extended PID 0002CA3A Current CPU id 00000000
State MUTEX Flags 00000000
Base priority 1 Current priority 3
Waiting EF cluster 4 Event flag wait mask 81FBDB40
Press RETURN for more.
SDA> exa 81FBDB40
FFFFFFFF.81FBDB40: FFB3C330.FFB3C330 "0ó.0ó."
SDA>
Hm, doesn't tell *me* anything... :-)
Actually, it tells volumes. That address does not appear to hold a MUTEX
but, instead, it looks like a self-relative queue's header with a single
queued element.

Try SDA> VALIDATE QUEUE/SELF_RELATIVE/LIST 81FBDB40
Post by Jan-Erik Soderholm
Jan-Erik
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
Jan-Erik Soderholm
2014-06-24 13:11:18 UTC
Permalink
Post by V***@SendSpamHere.ORG
Post by Jan-Erik Soderholm
Post by Volker Halle
Jan-Eric,
$ ANALYZE/SYS
SDA> SHOW PROC/IND=2CA3A
SDA> SHOW PROC
...
Waiting EF cluster 1 Event flag wait mask xxxxxxxx
SDA> EXA xxxxxxxx
SDA> Exit
Volker.
KTB address 82171F40 HWPCB address FFFFFFFF.83D22080
PKTA address 7FFEFF98 Callback vector address 00000000
Internal PID 00B2023A Callback error 00000000
Extended PID 0002CA3A Current CPU id 00000000
State MUTEX Flags 00000000
Base priority 1 Current priority 3
Waiting EF cluster 4 Event flag wait mask 81FBDB40
Press RETURN for more.
SDA> exa 81FBDB40
FFFFFFFF.81FBDB40: FFB3C330.FFB3C330 "0ó.0ó."
SDA>
Hm, doesn't tell *me* anything... :-)
Actually, it tells volumes. That address does not appear to hold a MUTEX
but, instead, it looks like a self-relative queue's header with a single
queued element.
Try SDA> VALIDATE QUEUE/SELF_RELATIVE/LIST 81FBDB40
Post by Jan-Erik Soderholm
Jan-Erik
SDA> VALIDATE QUEUE/SELF_RELATIVE/LIST 81FBDB40

Entry Address Flink Blink
----- ------- ----- -----
Header 81FBDB40 FFB3C330 FFB3C330

Error comparing backward link to previous structure address (FFFFFFFF.81FBDB40)
Error occurred in queue element at address FFFFFFFF.81AF9E70, after tracing
0 elements
SDA>

And from Volkers sudgestion:

SDA> EXA 81FBDB40;100
00000000 81B0BD40 FFB3C330 FFB3C330 0ó.0ó.@½°..... FFFFFFFF.81FBDB40
00000000 00000001 821115C0 81FBDA40 @Úû.À........... FFFFFFFF.81FBDB50
00000000 00000000 00000000 00000000 ................ FFFFFFFF.81FBDB60
00000000 00000000 00000000 00000000 ................ FFFFFFFF.81FBDB70
81A13550 08300000 81C3E038 81C3E038 8àÃ.8àÃ...0.P5¡. FFFFFFFF.81FBDB80
FFFFFFFF 81A13550 FFFFFFFF 81FBDB80 .Ûû.....P5¡..... FFFFFFFF.81FBDB90
00000000 00000000 00000000 00000000 ................ FFFFFFFF.81FBDBA0
00000000 00000000 00000000 00000000 ................ FFFFFFFF.81FBDBB0
00000000 004A0080 00000000 00000000 ..........J..... FFFFFFFF.81FBDBC0
81102FD8 00000020 00000000 00000000 ........ ...Ø/.. FFFFFFFF.81FBDBD0
00000000 81103098 00000120 00000000 .... ....0...... FFFFFFFF.81FBDBE0
00000000 004A0030 813BB850 00000200 ....P¸;.0.J..... FFFFFFFF.81FBDBF0

Zeros suppressed from FFFFFFFF.81FBDC00 through FFFFFFFF.81FBDC3F

SDA>
Volker Halle
2014-06-24 13:27:52 UTC
Permalink
Jan-Erik,

for finding the current owner of the MUTEX, use PWAIT$SDA V1.0 (or higher). Available from:

http://eisner.encompasserve.org/~miller/

It has code to scan the PCB vector and find processes, which hold MUTEXes.

Last time this has been discussed, you've used the MTX$SDA extension and found, that the MUTEX in question had been Accessed by TCPIP$TNDRIVER...

The SDA Output does not immediately look like Memory corruption, also try: SDA> EXA 81FBDB40-100;100 to see Memory preceeding that virtual address.

It may also be interesting to find out, which data structure this 'MUTEX' is in.

Volker.
Jan-Erik Soderholm
2014-06-24 14:04:43 UTC
Permalink
Post by Volker Halle
Jan-Erik,
http://eisner.encompasserve.org/~miller/
OK, I got this output:

SDA> pwait/id=0003193C
PWAIT V1.0 (c) 2004, Ian Miller (***@encompasserve.org) built on VMS V8.2
Process PID 0003193C name BATCH_524 No. Threads 1
Thread 0: state MWAIT AST pending KS active (none) blocked (none)
Process has been waiting for 00:00:00.56
Process thread resource wait is ENABLED
Analyzing process locks
Process owns no locks
waiting for mutex 81FBDB40
SDA>

If I run it repeately the only change is the
"Process has been waiting for" that shows different times.
Seems as it counts up to 1 sec and then startas over "waiting".

B.t.w, this is on:

$ tcpip sh ver

HP TCP/IP Services for OpenVMS Alpha Version V5.5 - ECO 1
on a COMPAQ AlphaServer DS20E 666 MHz running OpenVMS V8.2

$

We are also trying to reproduce this issue on a system running:

$ tcpip sho ver

HP TCP/IP Services for OpenVMS Alpha Version V5.7 - ECO 3
on a COMPAQ AlphaServer DS20E 666 MHz running OpenVMS V8.4

$

Jan-Erik
Post by Volker Halle
It has code to scan the PCB vector and find processes, which hold MUTEXes.
Last time this has been discussed, you've used the MTX$SDA extension and
found, that the MUTEX in question had been Accessed by
TCPIP$TNDRIVER...
The SDA Output does not immediately look like Memory corruption, also
try: SDA> EXA 81FBDB40-100;100 to see Memory preceeding that virtual
address.
It may also be interesting to find out, which data structure this 'MUTEX' is in.
Volker.
abrsvc
2014-06-24 14:10:47 UTC
Permalink
Something is using this mutex, we need to find out what.

Try a READ/EXEC followed by a MAP xxxxxxxx where the X's are the address listed. Perhaps that will give us an idea of what is using it?
Volker Halle
2014-06-24 14:16:24 UTC
Permalink
Jan-Erik,

there is code in PWAIT$SDA.C to find processes owning a MUTEX (pcb$l_mtxcnt > 0) - look at Routine analyze_mutex_wait. If this does not work, there is either no process owning this MUTEX (shouldn't happen) or the MUTEX ownership is changing so fast, that the tool never finds the owner in the running system ! If you can produce a forced Crash and make that available, I'll try to debug the code.

Consider to use the MTX§SDA Extension (SDA> MTX for help) to trace operations on this MUTEX.

Volker.
Jan-Erik Soderholm
2014-06-24 14:47:24 UTC
Permalink
Post by Volker Halle
Jan-Erik,
there is code in PWAIT$SDA.C to find processes owning a MUTEX
(pcb$l_mtxcnt > 0) - look at Routine analyze_mutex_wait. If this does
not work, there is either no process owning this MUTEX (shouldn't
happen) or the MUTEX ownership is changing so fast, that the tool never
finds the owner in the running system ! If you can produce a forced
Crash and make that available, I'll try to debug the code.
Consider to use the MTX§SDA Extension (SDA> MTX for help) to trace
operations on this MUTEX.
Volker.
I did:

SDA> mtx load
SDA> mtx start trace
[wait 4-5 sec...]
SDA> mtx stop trace
SDA> mtx show trace

That shows a lot of activity on several pages.
I could not find "81FBDB40" in the listing.
After 8-10 pages the screen hanged and I had to
stop/id my session from another terminal...

I could not find out how to use the /MUTEX=mutex
switch, I tried with /MUTEX=81FBDB40 :

SDA> mtx start trace /MUTEX=81FBDB40
%CLI-W-IVKEYW, unrecognized keyword -
check validity and spelling \81FBDB40\
SDA>

I'll try a few times more.

If we can *not* force this issue on the system with
later VMS and TCPIP, we will probably plan an
upgrade during the summer hollidays.

Hm, our dev box has the same version of VMS and TCPIP
as the prod box, we could try to force the issue there
also. The dev box is easy to reboot, if needed...


Jan-Erik.
David Froble
2014-06-25 22:50:20 UTC
Permalink
Post by Jan-Erik Soderholm
Base priority 1 Current priority 3
Without getting into the details, I'm going to ask why you're playing
around with base priorities? Doing so has the capability of self
inflicted wounds to the feet.

As one simple example, suppose some low priority process gains some
resource, and for some time thereafter there are enough higher priority
jobs runable such that the low priority job rarely gets to run and at
some point release the resource.

Not a problem on a rather quiet system. But suppose a system with a
rather heavy job load at certain times. Just one thing that could
happen sporadically, and not reproducable.

My experience has been to normally not play with base priorities. If
there is soemthing that absolutely must have a higher base priority,
then the designer must be very careful.

I don't know what your system load(s) are like, but, I can easily set up
a scenario where a low priority runable job rarely gets to run.


Another item, and very possibly a language issue, not a
misunderstanding. At least the way I understand the terminology, you
don't "hold a MUTEX", a MUTEX is a state you end up in when you cannot
obtain a resource.
Jan-Erik Soderholm
2014-06-25 22:52:37 UTC
Permalink
Post by David Froble
Post by Jan-Erik Soderholm
Base priority 1 Current priority 3
Without getting into the details, I'm going to ask why you're playing
around with base priorities?
I'm not.
Post by David Froble
Another item, and very possibly a language issue, not a misunderstanding.
At least the way I understand the terminology, you don't "hold a MUTEX", a
MUTEX is a state you end up in when you cannot obtain a resource.
Noted.

Jan-Erik.
JF Mezei
2014-06-26 00:21:53 UTC
Permalink
Post by David Froble
At least the way I understand the terminology, you
don't "hold a MUTEX",
Yes you do, a MUTEX is a cute little cuddly animal that everyone loves
and it loves to eat wheat carried on a spaceship. It reproduces like
rabbits. (another name for it is a "Tribble").
Post by David Froble
a MUTEX is a state you end up in when you cannot
obtain a resource.
Is MUTEX explicitely and entirely about waiting for a resource you
don't have ? Or does it include situations where you must wait for
something such as IO to complete before you release something you hold ?
(such as RWAST, RWMBX etc).
David Froble
2014-06-26 04:24:05 UTC
Permalink
Post by JF Mezei
Post by David Froble
At least the way I understand the terminology, you
don't "hold a MUTEX",
Yes you do, a MUTEX is a cute little cuddly animal that everyone loves
and it loves to eat wheat carried on a spaceship. It reproduces like
rabbits. (another name for it is a "Tribble").
Post by David Froble
a MUTEX is a state you end up in when you cannot
obtain a resource.
Is MUTEX explicitely and entirely about waiting for a resource you
don't have ? Or does it include situations where you must wait for
something such as IO to complete before you release something you hold ?
(such as RWAST, RWMBX etc).
I guess to be precise, MWAIT is the process state. I think the MUTEX
has to do with the resource, whatever it is. I'm not correct in some of
the above. Regardless, the process in the MWAIT (or other wait state)
wants the resource, it doesn't hold it.
Jan-Erik Soderholm
2014-06-26 09:45:13 UTC
Permalink
Post by David Froble
I guess to be precise, MWAIT is the process state.
$ sh sys/batch
OpenVMS V8.2 on node HV06 26-JUN-2014 11:19:06.43 Uptime 73 22:06:07
Pid Process Name State Pri I/O CPU Page flts
0003B0BF BATCH_760 MUTEX 3 586 0 00:00:00.10 465
0002BD1D BATCH_58 LEF 3 5296685 0 00:01:14.80 2257
00039D27 BATCH_958 LEF 3 607 0 00:00:00.27 385
00000997 queue_check LEF 3 8620394 0 01:11:46.63 16269951
0002F609 BATCH_852 LEF 3 672 0 00:00:00.19 581
$

Entry 760 is a simple dummy job whos only job is to
aquire the MUTEX and release any real telnet job.

$ sh que/ba/al tnaque
Batch queue TNAQUE, available, on HV06::

Entry Jobname Username Status
----- ------- -------- ------
760 TELNET_DUMMY MK Executing
$

$ type DISK1:[MK.COM]TELNET_DUMMY.COM
$!-------------------------------------------------
$! Dummy connect...
$!-------------------------------------------------
$!
$ set ver
$ set output_rate=00:00:01
$!
$!
$ @ zx_hv06:telnet_delete_port 9966
$ wait 00:00:01
$!
$ Telnet/create 10.32.137.39 3007 9966 /prot=none
$ WAIT 00:00:01
$!
$


This will hang (for hours) with the MUTEX until some other job
tries to delete/create a telnet TNA device. Then the dummy job
will end (successfully) and the other job will hang in MUTEX.

A supervising job:

$ sh que/ba/al sys$batch
Batch queue SYS$BATCH, available, on HV06::

Entry Jobname Username Status
----- ------- -------- ------
58 CHECK_MUTEX xxxxxx Executing

will notice that the TELNET_DUMMY is gone and resubmit it.
That will relase the real job and again hang TELNET_DUMMY in
MUTEX again, and we are back... :-)

This works fine as a work-around and hides the problem from
the users.

We are still trying to reproduce this on our 8.4 system.

Jan-Erik.
Post by David Froble
I think the MUTEX has
to do with the resource, whatever it is. I'm not correct in some of the
above. Regardless, the process in the MWAIT (or other wait state) wants
the resource, it doesn't hold it.
Stephen Hoffman
2014-06-26 00:46:23 UTC
Permalink
Post by David Froble
Another item, and very possibly a language issue, not a
misunderstanding. At least the way I understand the terminology, you
don't "hold a MUTEX", a MUTEX is a state you end up in when you cannot
obtain a resource.
In VMS vernacular, "mutex" can refer to a mutual-exclusion semaphore,
to the mutex data structure and data structure definitions, to a wait
state (MWAIT) that processes enter while waiting to acquire a mutex,
and to a programming construct available within the POSIX threading
support.

Somewhat confusingly, the processes that are waiting for the mutex can
be in a mutex wait state within the scheduling-related displays, while
the process that is holding the mutex (also temporarily running at
priority 16) would usually (hopefully) show as being computable.
--
Pure Personal Opinion | HoffmanLabs LLC
Volker Halle
2014-06-24 13:48:22 UTC
Permalink
Jan-Erik,

from one of your previous Posts from 28-NOV-2008:

SDA> EXA @(PCB+PCB$L_EFWM)
FFFFFFFF.81F7C6C0: FFB9FFB0.FFB9FFB0 "°.š.°.š."

I'd say it's still the SAME Problem !

Volker.
Jan-Erik Soderholm
2014-06-24 14:15:24 UTC
Permalink
Post by Volker Halle
Jan-Erik,
FFFFFFFF.81F7C6C0: FFB9FFB0.FFB9FFB0 "°.š.°.š."
I'd say it's still the SAME Problem !
Volker.
Absolutely... :-)

What I asked about this time, was if there whre any logging
of the time the MUTEX happend. Then it took the route in some
ways like the last time... :-)

What I did this time was to copy a new copy of PWAIT$SDA, I
think I never got it running the last time. It hang with a
"Reading symbol table" message. That is not shown at all
this time. Thanks for the link!

The problem last time was also that we did an urgent reboot
so the MUTEX was gone to soon. Now we understand better how
to run with thu MUTEX active, so we have more time to
investigare the running system now.

But yes, it is the same root problem. We are currently
trying to reproduce on later VMS and TCPIP versions...

Jan-Erik.
Volker Halle
2014-06-24 14:23:56 UTC
Permalink
Jan-Erik,

when you last provided MTX trace Output, this MUTEX was being manipulated by TCPIP$TNDRIVER.

Use SDA> SHOW EXEC TCPIP$TNDRIVER and check whether the current MUTEX address 81FBDB40 is in the nonpaged read/write section of TCPIP$TNDRIVER.

Otherwise try:

SDA> CLUE MEM/LAY

and see if the MUTEX address is in nonpaged pool.

If so, try SDA> SHOW POOL/HEAD 81FBDB40-2000;2200
and find which packet this address is in.

Volker.
Jan-Erik Soderholm
2014-06-24 14:56:02 UTC
Permalink
Post by Volker Halle
Jan-Erik,
when you last provided MTX trace Output, this MUTEX was being
manipulated by TCPIP$TNDRIVER.
Use SDA> SHOW EXEC TCPIP$TNDRIVER and check whether the current MUTEX
address 81FBDB40 is in the nonpaged read/write section of
TCPIP$TNDRIVER.
Nop.
Post by Volker Halle
SDA> CLUE MEM/LAY
and see if the MUTEX address is in nonpaged pool.
Seems so, right? :

Nonpaged Pool (initial size) FFFFFFFF.81C3E000 FFFFFFFF.8217A000
Post by Volker Halle
If so, try SDA> SHOW POOL/HEAD 81FBDB40-2000;2200 and find which packet
this address is in.
81FBDB40 seems to be in the first "Unknown" packet below:

Packet type/subtype Start Length
------------------------- -------- --------
Unknown 81FBDA00 000001C0
FKB 81FBDBC0 00000080
Unknown 81FBDC40 00000100


Fun!

Jan-Erik.
Post by Volker Halle
Volker.
Volker Halle
2014-06-24 16:26:18 UTC
Permalink
On Tuesday, June 24, 2014 4:56:02 PM UTC+2, Jan-Erik Soderholm wrote:

Try to widen the search range:

SDA> 81FBDB40-4000;4200

If the list of packets still shows that Unknown packet at 81FBDA00, length 1C0, then dump the Contents of that packet with:

SDA> EXA 81FBDA00;1C0

Volker.
Jan-Erik Soderholm
2014-06-24 17:04:42 UTC
Permalink
Post by Volker Halle
SDA> 81FBDB40-4000;4200
That only gave me more data before the packat with the
target address range. Am I missing something here? Isn't
it the "Start" and "Length" fields that I should look for
a match against the value "81FBDB40"?
Post by Volker Halle
SDA> EXA 81FBDA00;1C0
That gives (this might wrap...) :

SDA> exa 81fbda00;1c0
00000000 00000000 FFFFFFFF 806B6AB8 000001C0 CD8D0000 81FC2100 81FBE900
.éû..!ü....ÍÀ...¸jk............. FFFFFFFF.81FBDA00
00000000 00000000 000000CD 00000000 00000000 00000000 00AE3AA1 C1D6EF7E
~ïÖÁ¡:®.............Í........... FFFFFFFF.81FBDA20
00000000 00000000 00000000 00000000 00010000 00000180 00000000 00000000
................................ FFFFFFFF.81FBDA40
00000000 81B17C50 00000000 00000000 00000000 00000000 81B0BE10 81B0BD78
x½°..¾°.................P|±..... FFFFFFFF.81FBDA60
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................ FFFFFFFF.81FBDA80
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................ FFFFFFFF.81FBDAA0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................ FFFFFFFF.81FBDAC0
0000007F 81B0BBD0 81B0B620 81B0BE90 81B0BEC0 81B0B6E0 81B0BA80 81B16AC8
Èj±..º°.ධ.À¾°..¾°. ¶°.л°..... FFFFFFFF.81FBDAE0
00000000 3A000000 00013516 00000000 81F5AF80 81F32C00 00000000 00000000
.........,ó..¯õ......5.....:.... FFFFFFFF.81FBDB00
00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF 00000000
................................ FFFFFFFF.81FBDB20
00000000 00000001 823C7040 81FBDA40 00000000 81B0BD40 FFB3C330 FFB3C330
0ó.0ó.@½°.....@Úû***@p<......... FFFFFFFF.81FBDB40
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................ FFFFFFFF.81FBDB60
FFFFFFFF 81A13550 FFFFFFFF 81FBDB80 81A13550 08300000 81C3E038 81C3E038
8àÃ.8àÃ...0.P5¡..Ûû.....P5¡..... FFFFFFFF.81FBDB80
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................ FFFFFFFF.81FBDBA0
SDA>

Again, I have no idea what that is telling me... :-)

Jan-Erik.
Post by Volker Halle
Volker.
Volker Halle
2014-06-25 05:37:19 UTC
Permalink
Jan-Erik,

SDA> exa 81fbda00;1c0
00000000 00000000 FFFFFFFF 806B6AB8 000001C0 CD8D0000 81FC2100 81FBE900 ...
^^^^^^^^

I strongly believe, this is a TCPIP datastructure allocated from nonpaged pool. The length of the packet (000001C0) is in the 4th longword, a standard OpenVMS data structure would have the length in the low word of the 3rd longword !

If you would have a froced dumps available, you could use SDA> TCPIP TAG MEMORY to try to get symbolization for the addresses of the packets. You can try it on the running System as well, as this packet is not going to change in the running System.

Volker.
Jan-Erik Soderholm
2014-06-25 08:52:57 UTC
Permalink
Post by Volker Halle
Jan-Erik,
SDA> exa 81fbda00;1c0
00000000 00000000 FFFFFFFF 806B6AB8 000001C0 CD8D0000 81FC2100 81FBE900
... ^^^^^^^^
I strongly believe, this is a TCPIP datastructure allocated from
nonpaged pool. The length of the packet (000001C0) is in the 4th
longword, a standard OpenVMS data structure would have the length in the
low word of the 3rd longword !
Seems very likely, in particular since the timestamp VAXman found
corresponded with the timestamp of the startup of the telnet server!
Post by Volker Halle
If you would have a froced dumps available, you could use SDA> TCPIP TAG
MEMORY to try to get symbolization for the addresses of the packets. You
can try it on the running System as well, as this packet is not going to
change in the running System.
Volker.
I'll try the TAG command.

I'm also waiting for the result from the other guys in the group
of the test to reproduce this on our test box (VMS 8.4) and on
our dev box (also 8.2 just as prod).

*If* we can *not* reproduce this on 8.4, an upgrade seems to be
the natural path here.

Jan-Erik.
Volker Halle
2014-06-26 09:29:09 UTC
Permalink
Post by Jan-Erik Soderholm
I'll try the TAG command.
Try (even in the running System) SDA> TAG MEMORY/LIST

Then SDA> EXA 81FBDB40 ! or whatever the current 'MUTX' address from EFWM is

If you're lucky, that address will now be symbolized and the Symbol Name would tell something about the kind of TCPIP data structure.


Volker.
Jan-Erik Soderholm
2014-06-26 09:55:14 UTC
Permalink
Post by Volker Halle
Post by Jan-Erik Soderholm
I'll try the TAG command.
Try (even in the running System) SDA> TAG MEMORY/LIST
SDA> tag memory/list
%CLI-W-SYNTAX, error parsing 'TAG'
SDA>

SDA> tcpip tag memory/list
%TCPIP-I-LINKDATE, TCPIP$TNDRIVER linked on 27-JUL-2005 19:03:53.08
%TCPIP-I-LINKDATE, TCPIP$BGDRIVER linked on 27-JUL-2005 19:03:47.16
%TCPIP-I-LINKDATE, TCPIP$PROXY_SERVICES linked on 27-JUL-2005 19:04:36.62
%TCPIP-I-LINKDATE, TCPIP$INTERNET_SERVICES linked on 27-JUL-2005 19:03:48.44
%TCPIP-I-VERSION, TCP/IP Services for OpenVMS Version 5.5
%TCPIP-I-LASTSTART, Driver last started 13-APR-2014 13:07:10.84
-TCPIP-I-STARTCOUNT, Driver was started 1 time
-TCPIP-I-PREVSTART, Previously started at 13-APR-2014 13:07:10.84
%TCPIP-I-TRACEINFO, Trace information is not available
%TCPIP-E-NOPWIP, Pathworks-over-IP (PWIP) driver is not present
%TCPIP-E-NONFSSVCS, NFS Services execlet is not present
Defining the following symbols:

[Aprox 45 pages of symbols...]

%TCPIP-S-YOUREIT, Total of 935 items were tagged
-TCPIP-I-TAGGED, 935 memory blocks were tagged
Post by Volker Halle
Then SDA> EXA 81FBDB40 ! or whatever the current 'MUTX' address from EFWM is
SDA> exa 81FBDB40
_M_OVMS_VCIB_TCPIP_KERNEL$81FBD+00100: FFB3C330.FFB3C330 "0ó.0ó."
SDA>
Post by Volker Halle
If you're lucky, that address will now be symbolized and the Symbol Name
would tell something about the kind of TCPIP data structure.
Well, did it? :-)

Jan-Erik.
Post by Volker Halle
Volker.
Volker Halle
2014-06-26 14:36:13 UTC
Permalink
Post by Jan-Erik Soderholm
SDA> exa 81FBDB40
_M_OVMS_VCIB_TCPIP_KERNEL$81FBD+00100: FFB3C330.FFB3C330 "0ó.0ó."
So the data structure, in which this 'MUTEX' is located (at offset 0x100), is a VCIB (maybe a VMS Communication Interface Block ?).

Try SDA> TCPIP SHOW VCI/DEBUG/FULL and look at the data for the Kernel VCI port Client TCPIP$TNDRIVER ...

Volker.
Jan-Erik Soderholm
2014-06-26 15:15:08 UTC
Permalink
Post by Volker Halle
Post by Jan-Erik Soderholm
SDA> exa 81FBDB40
_M_OVMS_VCIB_TCPIP_KERNEL$81FBD+00100: FFB3C330.FFB3C330 "0ó.0ó."
So the data structure, in which this 'MUTEX' is located (at offset 0x100), is a VCIB (maybe a VMS Communication Interface Block ?).
Try SDA> TCPIP SHOW VCI/DEBUG/FULL and look at the data for the Kernel VCI port Client TCPIP$TNDRIVER ...
Volker.
SDA> tcpip show vci

There are currently 0 waifs, 382 have been adopted.

Kernel VCI port client VCI ID Details ...
------------------------------- ----------- -----------------
TCPIP$TNDRIVER 1.0 Upper ID 00.00, lower ID
00.00, 14
5 associations, last ID 1.16709
1.26765 Device TNA5604:, state open
1.26766 Device TNA5603:, state open
1.29807 Device TNA7605:, state open
...
... [7-8 pages of TNA devices...]


/FULL gets *alot* of data...

SDA> tcpip show vci/full

There are currently 0 waifs, 382 have been adopted.

Kernel VCI port "TCPIP$TNDRIVER" has 144 associations
Upper version 00.00, lower version 00.00
VCI ID is 1.0, last assigned ID is 1.16709
Scheduling specifics:

And then one and a half page of information for each
"Association". I've no idea what to look for.

Ah well, I think we'd better schedule an VMS upgrade... :-)

I'm very thankfull for all input here!

Jan-Erik.
Volker Halle
2014-06-26 15:23:10 UTC
Permalink
Jan-Erik,

you're now deep down in TCPIP kernel data structures, a place where only very few people in the world would feel comfortable.

Look for something like this, which includes your VCIB address:

INETKVCI at 8302BF40 for association ID 3.199:
CREATOR_DATA2: 82FF1340 (TNA201:)
FLAGS: 00000000 (DWA, Idle, ...)
SOCKET: 82FB3E80
VCIB: 82B9AC00 <<< maybe 81FBDA40 on your sys
...

Maybe one could learn something from the data or the names listed for this data structure.

Volker.
Jan-Erik Soderholm
2014-06-26 20:10:06 UTC
Permalink
Post by Volker Halle
Jan-Erik,
you're now deep down in TCPIP kernel data structures, a place where only
very few people in the world would feel comfortable.
82FF1340 (TNA201:) FLAGS: 00000000 (DWA, Idle,
82B9AC00 <<< maybe 81FBDA40 on your sys ...
Maybe one could learn something from the data or the names listed for this data structure.
Volker.
This have been fun anyway. :-)

One weird thing is that we have *not* been able to reproduce this
MUTEX on another system, also DS20e, 4 GB, VMS 8.2 with the same
UPDATE level and TCPIP vith the same version and patch level.

Now, that box (development system) does not have the same amount
of active TNS devices, aprox 150 daytime, which might be involved.

Now, *another* thing to investigate is if we realy need to
re-create the TNA devices everytime we restart the application.

At the beginning (early 80's) this system only had LTA (LAT)
devices to terminal servers. They was (still are, the few that
are remaining) only created once in LAT$SYSTARTUP and never
deleted after that.

When they (before me) begun using telnet against the term.servers,
they for some reason decided to add telnet delete/create in each
specific applications startup process.

I think we should try to change the logic for one or two stations
to only create the TNA once at boot, and only stop/start the
application process when the operators request a restart. Then
evaluate how that works...

Jan-Erik.
JF Mezei
2014-06-27 06:10:26 UTC
Permalink
Post by Jan-Erik Soderholm
One weird thing is that we have *not* been able to reproduce this
MUTEX on another system, also DS20e, 4 GB, VMS 8.2 with the same
UPDATE level and TCPIP vith the same version and patch level.
System speed ( or perhaps even link speed) can affect this if certain
events do not arrive in the same order.

As an old example: on a Microvax II, a Decnet service would receive
connect, data and disconnect requests always in order. On a faster
machine, it would sometimes receive the disconnect request before the
data packet which would cause mucho problems when the data packet
arrived with all structures already de-allocated.

The TCP/IP drivers on TCPIP services are more primitive than DECNET and
terminal drivers (no timeout for instance) so it is easier for something
to get "stuck" waiting for an io to complete that never completes.

Does it go into MUTEX when it tries to delete the TN device, or when it
creates it ?

You could try SPAWN/NOWAIT to delete the TN device, and main process
waits 2 seconds, and if the spawned subprocess still exists, it takes
proactive action.

-kill the spamwed subprocess (although not sure what happens when you
stop/id a process in MUTEX).

-open the TN device to write something to it, and close it. This might
cause it to be unjammed. (alternatively, try to read from it. Because
DCL READ/WRITE are synchronous, you want to also spawn/nowait these
things in case they also get stuck.
Jan-Erik Soderholm
2014-06-27 07:57:52 UTC
Permalink
Post by JF Mezei
Post by Jan-Erik Soderholm
One weird thing is that we have *not* been able to reproduce this
MUTEX on another system, also DS20e, 4 GB, VMS 8.2 with the same
UPDATE level and TCPIP vith the same version and patch level.
Does it go into MUTEX when it tries to delete the TN device, or when it
creates it ?
If we look at one single COM file they all have two
(telnet related) steps:

1.
Do a TELNET DELETE on the port *and wait* for it to
go away using a f$GETDVI with "EXIST" parameter.
It takes aprox 15 sec for the TNAnnnn device to go
away after the delete command.

2.
Do a TELNET CREATE to re-create the same TNAnnnn device.

It is no problem to have this running on multiple jobs in any
sequence if all jobs are using *different* TNAnnnn devices.
That never causes any trouble.

The problem is when (by mistake) the *same* COM file is run
multiple times at (more or less) the same time. And the critical
time is during the 15 sec when the TNAnnnn is in a special state
waiting to be removed.

If one job is in the wait phase and a second job does any of
the TELNET commands, the *second job* will end up in MUTEX.
And again, it has to be the same TNAnnnn device.

That second job will then hang until som other job tries the same
TELNET commands again. This time (after the MUTEX issue already
has appeared) it can be *any* job using *any* TNAnnnn device.

The new job will "grab" the MUTEX and the former hanging
job will complete *successfully*.

And there we are with this MUTEX "jumping ship" until a reboot.

The workaround is a dummy job that delete/create a TNAnnnn
device against a spare terminal server. We make sure that
*that* dummy job is having the MUTEX at all times.


Jan-Erik.
Post by JF Mezei
You could try SPAWN/NOWAIT to delete the TN device, and main process
waits 2 seconds, and if the spawned subprocess still exists, it takes
proactive action.
-kill the spamwed subprocess (although not sure what happens when you
stop/id a process in MUTEX).
-open the TN device to write something to it, and close it. This might
cause it to be unjammed. (alternatively, try to read from it. Because
DCL READ/WRITE are synchronous, you want to also spawn/nowait these
things in case they also get stuck.
Volker Halle
2014-06-27 08:45:23 UTC
Permalink
Jan-Erik,

wouldn't the following be a very simple an effective workaround to prevent this problem from happening at all:

assuming that all those Batch Jobs run under the same username or at least in the same group:

- issue a SET PROC/NAME="xxx_TNAnnn"
- if that succeeds, your process is the only one with that Name and can delete/create the TELNET device TNAnnn
- if that command Fails, there is already a Batch Job for this TNAnnn device, so the procudure should Exit and retry the SET PROC/NAME command from time to time

Volker.
Jan-Erik Soderholm
2014-06-27 08:53:50 UTC
Permalink
Post by Volker Halle
Jan-Erik,
wouldn't the following be a very simple an effective workaround to
assuming that all those Batch Jobs run under the same username or at
- issue a SET PROC/NAME="xxx_TNAnnn" - if that succeeds, your process is
the only one with that Name and can delete/create the TELNET device
TNAnnn - if that command Fails, there is already a Batch Job for this
TNAnnn device, so the procudure should Exit and retry the SET PROC/NAME
command from time to time
Volker.
Yes, we have thought about that. :-)

Today all (as we think) runs on a batch queue with /job_limit=1.

Apart from the nighly stop/start of the applications where that
would take to long time. I could have been someone in one of
the night shifts that was to "hot" and resuested a manual
restart at the same time as the nightly startup run.

We has been trying to find *which* job it was that run twice
at the same time, but we have not been able to trace that down.

I think that the next step will be to reconfigure a couple
of workstations to *not* delete/create the TNA device each time
the process is restarted. Just as we do with the LTA devices.

And then see how that works.

Jan-Erik.
V***@SendSpamHere.ORG
2014-06-27 11:12:01 UTC
Permalink
Post by Jan-Erik Soderholm
Post by JF Mezei
Post by Jan-Erik Soderholm
One weird thing is that we have *not* been able to reproduce this
MUTEX on another system, also DS20e, 4 GB, VMS 8.2 with the same
UPDATE level and TCPIP vith the same version and patch level.
Does it go into MUTEX when it tries to delete the TN device, or when it
creates it ?
If we look at one single COM file they all have two
1.
Do a TELNET DELETE on the port *and wait* for it to
go away using a f$GETDVI with "EXIST" parameter.
It takes aprox 15 sec for the TNAnnnn device to go
away after the delete command.
2.
Do a TELNET CREATE to re-create the same TNAnnnn device.
It is no problem to have this running on multiple jobs in any
sequence if all jobs are using *different* TNAnnnn devices.
That never causes any trouble.
The problem is when (by mistake) the *same* COM file is run
multiple times at (more or less) the same time. And the critical
time is during the 15 sec when the TNAnnnn is in a special state
waiting to be removed.
If one job is in the wait phase and a second job does any of
the TELNET commands, the *second job* will end up in MUTEX.
And again, it has to be the same TNAnnnn device.
That second job will then hang until som other job tries the same
TELNET commands again. This time (after the MUTEX issue already
has appeared) it can be *any* job using *any* TNAnnnn device.
The new job will "grab" the MUTEX and the former hanging
job will complete *successfully*.
And there we are with this MUTEX "jumping ship" until a reboot.
The workaround is a dummy job that delete/create a TNAnnnn
device against a spare terminal server. We make sure that
*that* dummy job is having the MUTEX at all times.
Why are you so hell-bent on reusing the unit number? Just allow the TELNET
/CREATE to use the next available unit.
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
Jan-Erik Soderholm
2014-06-27 12:57:42 UTC
Permalink
Post by V***@SendSpamHere.ORG
Post by Jan-Erik Soderholm
Post by JF Mezei
Post by Jan-Erik Soderholm
One weird thing is that we have *not* been able to reproduce this
MUTEX on another system, also DS20e, 4 GB, VMS 8.2 with the same
UPDATE level and TCPIP vith the same version and patch level.
Does it go into MUTEX when it tries to delete the TN device, or when it
creates it ?
If we look at one single COM file they all have two
1.
Do a TELNET DELETE on the port *and wait* for it to
go away using a f$GETDVI with "EXIST" parameter.
It takes aprox 15 sec for the TNAnnnn device to go
away after the delete command.
2.
Do a TELNET CREATE to re-create the same TNAnnnn device.
It is no problem to have this running on multiple jobs in any
sequence if all jobs are using *different* TNAnnnn devices.
That never causes any trouble.
The problem is when (by mistake) the *same* COM file is run
multiple times at (more or less) the same time. And the critical
time is during the 15 sec when the TNAnnnn is in a special state
waiting to be removed.
If one job is in the wait phase and a second job does any of
the TELNET commands, the *second job* will end up in MUTEX.
And again, it has to be the same TNAnnnn device.
That second job will then hang until som other job tries the same
TELNET commands again. This time (after the MUTEX issue already
has appeared) it can be *any* job using *any* TNAnnnn device.
The new job will "grab" the MUTEX and the former hanging
job will complete *successfully*.
And there we are with this MUTEX "jumping ship" until a reboot.
The workaround is a dummy job that delete/create a TNAnnnn
device against a spare terminal server. We make sure that
*that* dummy job is having the MUTEX at all times.
Why are you so hell-bent on reusing the unit number? Just allow
the TELNET /CREATE to use the next available unit.
Yes, that is another route I have thought of. It will need some
additional work since the TNA number is also registred and used
in different places in the database and the applications.

I have only been managing this the last 6-7 years, it was
built in the 90's... :-)

A more strict use of logical names would make the use of
dynamic TNA numbers easier. Now, the database fields are
resticted to 7 positions ("TNAnnnn") so without a rebuild
of the data storage (some in indexed RMS files), the
logical name names would be restricted to 7 characters.

If there had been an easy way around this, it would
probably have been used already. :-)

Jan-Erik.
V***@SendSpamHere.ORG
2014-06-28 00:59:49 UTC
Permalink
Post by Jan-Erik Soderholm
Post by V***@SendSpamHere.ORG
Post by Jan-Erik Soderholm
Post by JF Mezei
Post by Jan-Erik Soderholm
One weird thing is that we have *not* been able to reproduce this
MUTEX on another system, also DS20e, 4 GB, VMS 8.2 with the same
UPDATE level and TCPIP vith the same version and patch level.
Does it go into MUTEX when it tries to delete the TN device, or when it
creates it ?
If we look at one single COM file they all have two
1.
Do a TELNET DELETE on the port *and wait* for it to
go away using a f$GETDVI with "EXIST" parameter.
It takes aprox 15 sec for the TNAnnnn device to go
away after the delete command.
2.
Do a TELNET CREATE to re-create the same TNAnnnn device.
It is no problem to have this running on multiple jobs in any
sequence if all jobs are using *different* TNAnnnn devices.
That never causes any trouble.
The problem is when (by mistake) the *same* COM file is run
multiple times at (more or less) the same time. And the critical
time is during the 15 sec when the TNAnnnn is in a special state
waiting to be removed.
If one job is in the wait phase and a second job does any of
the TELNET commands, the *second job* will end up in MUTEX.
And again, it has to be the same TNAnnnn device.
That second job will then hang until som other job tries the same
TELNET commands again. This time (after the MUTEX issue already
has appeared) it can be *any* job using *any* TNAnnnn device.
The new job will "grab" the MUTEX and the former hanging
job will complete *successfully*.
And there we are with this MUTEX "jumping ship" until a reboot.
The workaround is a dummy job that delete/create a TNAnnnn
device against a spare terminal server. We make sure that
*that* dummy job is having the MUTEX at all times.
Why are you so hell-bent on reusing the unit number? Just allow
the TELNET /CREATE to use the next available unit.
Yes, that is another route I have thought of. It will need some
additional work since the TNA number is also registred and used
in different places in the database and the applications.
I have only been managing this the last 6-7 years, it was
built in the 90's... :-)
A more strict use of logical names would make the use of
dynamic TNA numbers easier. Now, the database fields are
resticted to 7 positions ("TNAnnnn") so without a rebuild
of the data storage (some in indexed RMS files), the
logical name names would be restricted to 7 characters.
That's a deficit in the TELNET CREATE command WRT the LAT CREATE PORT.
There should be a /LOGICAL=(NAME=xxx,TABLE=SYSTEM,MODE=EXECUTIVE) for
the TELNET CREATE IMHO.
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
Volker Halle
2014-06-26 15:56:49 UTC
Permalink
Post by Jan-Erik Soderholm
SDA> exa 81fbda00;1c0
00000000 00000000 FFFFFFFF 806B6AB8 000001C0 CD8D0000 81FC2100 81FBE900
.��..!�....��...�jk............. FFFFFFFF.81FBDA00
00000000 00000000 000000CD 00000000 00000000 00000000 00AE3AA1 C1D6EF7E
~����:�.............�........... FFFFFFFF.81FBDA20
00000000 00000000 00000000 00000000 00010000 00000180 00000000 00000000
................................ FFFFFFFF.81FBDA40
...

Jan-Erik,

the data structure starting at 81FBDA40 is a VCIB

SDA> FORMAT/TYPE=VCIB 81FBDA40

There is a KRP within this VCRB starting at Offset VCIB$R_TCPIP_KRP, which is where the EFWM address Points to and which is being handled as a 'MUTEX' - which it is not.

On one of the TNA device on my System, this KRP Looks like:

SDA> form 82B9AD00/type=krp
FFFFFFFF.82B9AD00 FF7DF5F0.FF7DF5F0 <<< similar to your 'MUTEX'
FFFFFFFF.82B9AD08 8238BD80 KVCI$$CONTROL_SYNCH_CALLBACK
FFFFFFFF.82B9AD0C 00000000
FFFFFFFF.82B9AD10 82FF17C0.82B9AC00
FFFFFFFF.82B9AD18 00000000.00000001
FFFFFFFF.82B9AD20 00000000.00000000
... ...
FFFFFFFF.82B9AD40 82832038.82832038
FFFFFFFF.82B9AD48 82293940.08300000
FFFFFFFF.82B9AD50 FFFFFFFF.82B9AD40
FFFFFFFF.82B9AD58 FFFFFFFF.82293940 INET$DISPATCH_TCPIP_KRP
FFFFFFFF.82B9AD60 00000000.00000000
... ...

You will certainly see some similarities, but that's as far as we can get...

Volker.
V***@SendSpamHere.ORG
2014-06-24 18:13:36 UTC
Permalink
Post by Jan-Erik Soderholm
Post by Volker Halle
SDA> 81FBDB40-4000;4200
That only gave me more data before the packat with the
target address range. Am I missing something here? Isn't
it the "Start" and "Length" fields that I should look for
a match against the value "81FBDB40"?
Post by Volker Halle
SDA> EXA 81FBDA00;1C0
SDA> exa 81fbda00;1c0
00000000 00000000 FFFFFFFF 806B6AB8 000001C0 CD8D0000 81FC2100 81FBE900
..éû..!ü....ÍÀ...žjk............. FFFFFFFF.81FBDA00
00000000 00000000 000000CD 00000000 00000000 00000000 00AE3AA1 C1D6EF7E
Strangely, this looks like a time quadword ------------^^^^^^^^^^^^^^^^^

SDA> EVALUATE/TIME 00AE3AA1.C1D6EF7E
13-APR-2014 13:07:16.38

Maybe your system is slow! :)
Post by Jan-Erik Soderholm
~ïÖÁ¡:®.............Í........... FFFFFFFF.81FBDA20
00000000 00000000 00000000 00000000 00010000 00000180 00000000 00000000
................................. FFFFFFFF.81FBDA40
00000000 81B17C50 00000000 00000000 00000000 00000000 81B0BE10 81B0BD78
xœ°..Ÿ°.................P|±..... FFFFFFFF.81FBDA60
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDA80
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDAA0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDAC0
0000007F 81B0BBD0 81B0B620 81B0BE90 81B0BEC0 81B0B6E0 81B0BA80 81B16AC8
Èj±..º°.ධ.ÀŸ°..Ÿ°. ¶°.л°..... FFFFFFFF.81FBDAE0
00000000 3A000000 00013516 00000000 81F5AF80 81F32C00 00000000 00000000
..........,ó..¯õ......5.....:.... FFFFFFFF.81FBDB00
00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF 00000000
................................. FFFFFFFF.81FBDB20
00000000 00000001 823C7040 81FBDA40 00000000 81B0BD40 FFB3C330 FFB3C330
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDB60
FFFFFFFF 81A13550 FFFFFFFF 81FBDB80 81A13550 08300000 81C3E038 81C3E038
8àÃ.8àÃ...0.P5¡..Ûû.....P5¡..... FFFFFFFF.81FBDB80
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDBA0
SDA>
Again, I have no idea what that is telling me... :-)
Telling very little here.
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
Jan-Erik Soderholm
2014-06-24 19:05:30 UTC
Permalink
Post by V***@SendSpamHere.ORG
Post by Jan-Erik Soderholm
Post by Volker Halle
SDA> 81FBDB40-4000;4200
That only gave me more data before the packat with the
target address range. Am I missing something here? Isn't
it the "Start" and "Length" fields that I should look for
a match against the value "81FBDB40"?
Post by Volker Halle
SDA> EXA 81FBDA00;1C0
SDA> exa 81fbda00;1c0
00000000 00000000 FFFFFFFF 806B6AB8 000001C0 CD8D0000 81FC2100 81FBE900
..éû..!ü....ÍÀ...¸jk............. FFFFFFFF.81FBDA00
00000000 00000000 000000CD 00000000 00000000 00000000 00AE3AA1 C1D6EF7E
Strangely, this looks like a time quadword ------------^^^^^^^^^^^^^^^^^
SDA> EVALUATE/TIME 00AE3AA1.C1D6EF7E
13-APR-2014 13:07:16.38
Maybe your system is slow! :)
He he, what about *this* :

$ sea sys$manager:operator.log "13-APR-2014 13:07:16.38"/win=(0,2)

%%%%%%%%%%% OPCOM 13-APR-2014 13:07:16.38 %%%%%%%%%%%
Message from user INTERnet on HV06
INTERnet ACP Activate TELNET Server
$

This is a telnet issue after all... :-)

Very well spotted, I must say!

$ write sys$output f$getsyi("boottime")
13-APR-2014 13:05:22.00
$

So it was during the TCPIP startup at last boot...

Jan-Erik.
Post by V***@SendSpamHere.ORG
Post by Jan-Erik Soderholm
~ïÖÁ¡:®.............Í........... FFFFFFFF.81FBDA20
00000000 00000000 00000000 00000000 00010000 00000180 00000000 00000000
................................. FFFFFFFF.81FBDA40
00000000 81B17C50 00000000 00000000 00000000 00000000 81B0BE10 81B0BD78
x½°..¾°.................P|±..... FFFFFFFF.81FBDA60
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDA80
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDAA0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDAC0
0000007F 81B0BBD0 81B0B620 81B0BE90 81B0BEC0 81B0B6E0 81B0BA80 81B16AC8
Èj±..º°.ධ.À¾°..¾°. ¶°.л°..... FFFFFFFF.81FBDAE0
00000000 3A000000 00013516 00000000 81F5AF80 81F32C00 00000000 00000000
..........,ó..¯õ......5.....:.... FFFFFFFF.81FBDB00
00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF 00000000
................................. FFFFFFFF.81FBDB20
00000000 00000001 823C7040 81FBDA40 00000000 81B0BD40 FFB3C330 FFB3C330
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDB60
FFFFFFFF 81A13550 FFFFFFFF 81FBDB80 81A13550 08300000 81C3E038 81C3E038
8àÃ.8àÃ...0.P5¡..Ûû.....P5¡..... FFFFFFFF.81FBDB80
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
................................. FFFFFFFF.81FBDBA0
SDA>
Again, I have no idea what that is telling me... :-)
Telling very little here.
Volker Halle
2014-06-26 06:11:08 UTC
Permalink
Post by Jan-Erik Soderholm
SDA> exa 81FBDB40
FFFFFFFF.81FBDB40: FFB3C330.FFB3C330 "0ó.0ó."
Jan-Erik,

as the contents of this data structure does NOT look like a valid MUTEX datastructure (see my example of QMAN$GL_MUTEX) and you also can't find a process in PRIO 16 with MTXCNT>0, I would conclude, that there is some bug in the TELNET code, which passes the wrong 'MUTEX address' to the MUTEX wait routine. This would also explain, why MTX TRACE does not show any operations on this MUTEX, because the trace does only show MUTEX lock and unlock operations, but no MUTEX read attempts. And noone is locking that 'MUTEX' as it is not a MUTEX !

Volker.
V***@SendSpamHere.ORG
2014-06-26 13:47:44 UTC
Permalink
Post by Volker Halle
Post by Jan-Erik Soderholm
SDA> exa 81FBDB40
FFFFFFFF.81FBDB40: FFB3C330.FFB3C330 "0=F3.0=F3."
Jan-Erik,
as the contents of this data structure does NOT look like a valid MUTEX dat=
astructure (see my example of QMAN$GL_MUTEX) and you also can't find a proc=
ess in PRIO 16 with MTXCNT>0, I would conclude, that there is some bug in t=
he TELNET code, which passes the wrong 'MUTEX address' to the MUTEX wait ro=
utine. This would also explain, why MTX TRACE does not show any operations =
on this MUTEX, because the trace does only show MUTEX lock and unlock opera=
tions, but no MUTEX read attempts. And noone is locking that 'MUTEX' as it =
is not a MUTEX !
I concur. The purported mutex address appears to be pointing to a self-
relative queue listhead or to an element that might have been queued in
a self-relative queue. As such, the value, seen by the mutex wait code,
will cause the process to wait until that value is zero indicating that
the resource is available. Of course, it will never be becaue it is NOT
a mutex or, if it was, something else stepped on it.

With no process at priority 16 nor a positive mutex count, this appears
to be a code bug. Whether it's in TCPIP (telnet) or something else on
your system, I'd suggest opening a support call with HP.
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
V***@SendSpamHere.ORG
2014-06-24 12:59:42 UTC
Permalink
Post by Jan-Erik Soderholm
Post by abrsvc
IT might help to know what Mutex is being held here. I find it
interesting that the mutex "moves" from device to device. This is a bit
odd. You'd be better to determine what resource is being restrained
here. Normally when a process gets into an MWAIT state, unless the
resource becomes available, the proces will remain in that state. Since
this state "moves" between processes, there is clearly a resource that
is short. Lets look for that resource first.
Dan
Yes, I have been trying to trace down this before. I think I ended
up in some internal TCPIP stack stuff.
$ pipe show sys | sea sys$pipe mutex
0002CA3A BATCH_394 MUTEX 3 589 0 00:00:00.15 ...
That is our dummy job that grabs the MUTEX letting the prod
jobs run to an end. The thing is that a job hanging with the
MUTEX runs just fine as soon as another job "grabs" it.
$ sho proc/id=0002CA3A/cont
%SYSTEM-F-SUSPENDED, process is suspended
$
I can post an "SDA> show proc" of it, it doesn't tell me
much anyway... :-)
Jan-Erik.
You're looking at the process WAITING for some MUTEX. That's what MUTEX
means in the state field. What you want to look for is the process that
has hold of the MUTEX that the waiting process wants. Look for a process
with priority 16 (priority is raised to 16 when a MUTEX is acquired). It
is most likely THAT process that is your issue, not the waiter.
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.
Volker Halle
2014-06-24 12:27:17 UTC
Permalink
Jan-Erik,

if a process is in MWAIT/MUTEX state, the EFWM (Event Flag Wait Mask) contains the address of that MUTEX. Use SDA to find the EFWM (SDA> SHOW PROC/IND=xxx) and examine the address of the MUTEX.

See http://labs.hoffmanlabs.com/node/231

There is also Ian Millers PWAIT$SDA tool to help find the MUTEX.

Volker.
Loading...