Discussion:
Intermittent failures retrieving process exit codes
Tom Honermann
2012-12-07 19:54:52 UTC
I've witnessed intermittent failures in multiple build systems while
working at multiple companies using Cygwin bash and make as part of the
build system but using non-Cygwin compilers and other tools. The
intermittent failures occur when a process appears to complete
successfully, but the process retrieving its exit code receives an
unexpected value. This has been seen on many different Cygwin versions
across several years.

Several reports of similar sounding issues can be found online:
-
http://cygwin.1069669.n5.nabble.com/Cygwin-1-7-x-on-Windows-7-Exit-statuses-of-Win32-executables-are-sometimes-wrong-td20186.html
-
http://stackoverflow.com/questions/9769256/intermittent-failures-under-cygwin-possibly-related-to-candle-and-or-make

I recently was able to produce a very small test case that reproduces
this issue reliably on some machines:

$ cat test.sh
#!/bin/sh

while [ 1 ]; do
    echo "test..."
    if cmd /c "false"; then
        echo "exiting..."
        exit 1
    fi
done

An invocation of test.sh should run indefinitely, but fails very quickly
on one of my machines:

$ ./test.sh
test...
test...
exiting...

$ ./test.sh
test...
test...
test...
test...
exiting...

$ ./test.sh
test...
exiting...
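
To compare failure rates across machines, it may help to record how many iterations pass before the first unexpected status. The following is only a sketch of that idea: the function name count_until_mismatch and its arguments are made up for illustration and are not part of the original test case.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original test.sh.
# Usage: count_until_mismatch EXPECTED MAX_ITERATIONS COMMAND [ARGS...]
# Runs COMMAND repeatedly. On the first run whose exit status differs
# from EXPECTED, prints the iteration number and the status and returns 1.
# Returns 0 if all MAX_ITERATIONS runs produced EXPECTED.
count_until_mismatch() {
    expected=$1
    max=$2
    shift 2
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        "$@"
        status=$?
        if [ "$status" -ne "$expected" ]; then
            echo "iteration $i: expected $expected, got $status"
            return 1
        fi
    done
    echo "no mismatch in $max iterations"
    return 0
}
```

With this in place, the loop from test.sh could be approximated as: count_until_mismatch 1 100000 cmd /c "false"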

There are several high-level possibilities for what is going wrong:

1) cmd.exe is failing to retrieve the correct exit code for the
invocation of false.exe (A Cygwin process).

2) cmd.exe is failing to return the (correct) exit code it received for
the invocation of false.exe.

3) bash.exe (A Cygwin process) is failing to retrieve the correct exit
code for the invocation of cmd.exe.

It is possible that other software installed on the machines I've
witnessed this on is contributing to the problem (ala
http://cygwin.com/faq/faq.using.html#faq.using.bloda). If so, such
software would be a contributing factor to one of the explanations
above, but does not necessarily mean that there is not a defect in
Cygwin (or CreateProcess, WaitForSingleObject, or GetExitCodeProcess).
I have not yet seen a similar case that does not involve Cygwin, so at
present I suspect a defect in Cygwin, but possibly one that produces no
negative symptoms in isolation.

I've reproduced this issue with both the 32-bit and 64-bit versions of
cmd.exe. I've also reproduced it by replacing cmd.exe with a C file
that calls CreateProcess for Cygwin's false.exe on its own. The issue
reproduces whether that C file is compiled with Cygwin gcc, MinGW gcc
(32-bit and 64-bit), and with MSVC (32-bit and 64-bit). So, substitute
what you like for 'cmd.exe' in the above.

Likewise, I've reproduced this issue by replacing false.exe in the test
above with a custom false.exe (A C program that just returns 1). The
issue reproduces whether myfalse.exe is compiled with Cygwin gcc, MinGW
gcc (32-bit and 64-bit), and with MSVC (32-bit and 64-bit). So,
substitute what you like for 'false.exe' in the above.

I am not able to reproduce the problem if I elide the invocation of
false.exe (i.e., if the cmd.exe invocation is 'cmd /c "exit /B 1"' or if
my replacement for cmd.exe just returns 1).

The problem feels like a race condition in retrieving process exit
codes. Further, it seems that it may only occur when two related
processes exit in quick succession.

I've been granted several weeks in the near future to work exclusively
on this issue. Before I start working on it though, I'd like to hear
from other community members who have experienced this and tried to
debug it. What is and is not known about the issue? What workarounds
have been tried (especially any that were found to be successful)? Are
there specific parts of the Cygwin (or bash) code that you recommend
starting with?

The machine that I've been running the above script on is 64-bit Windows
7 Professional SP1 running under VMware Workstation 8 which is running
on Kubuntu 12.04.

Relevant parts of 'cygcheck -s' are:

Windows 7 Professional N Ver 6.1 Build 7601 Service Pack 1

Running under WOW64 on AMD64

Cygwin DLL version info:
DLL version: 1.7.16
DLL epoch: 19
DLL old termios: 5
DLL malloc env: 28
Cygwin conv: 181
API major: 0
API minor: 262
Shared data: 5
DLL identifier: cygwin1
Mount registry: 3
Cygwin registry name: Cygwin
Program options name: Program Options
Installations name: Installations
Cygdrive default prefix:
Build date:
Shared id: cygwin1S5


Potential app conflicts:

ByteMobile laptop optimization client.

No Cygwin services found.

Cygwin Package Information
Package Version Status
bash 4.1.10-4 OK
cygwin 1.7.16-1 OK


Tom.
Tom Honermann
2012-12-07 21:54:01 UTC
Post by Tom Honermann
Likewise, I've reproduced this issue by replacing false.exe in the test
above with a custom false.exe (A C program that just returns 1). The
issue reproduces whether myfalse.exe is compiled with Cygwin gcc, MinGW
gcc (32-bit and 64-bit), and with MSVC (32-bit and 64-bit). So,
substitute what you like for 'false.exe' in the above.
The above is not correct, I erred in my testing.

I am able to reproduce the issue when replacing false.exe in the test
case with a custom false.exe compiled with Cygwin gcc.

I am *not* able to reproduce the issue when replacing it with one
compiled with MinGW gcc (32-bit or 64-bit) or with MSVC (32-bit or 64-bit).

Tom.
bartels
2012-12-07 23:07:24 UTC
Your suspicion about a race condition may very well be correct: I can easily confirm the problem on both iron and virtual smp, but not on a
single core virtual.

I have two instances of your test case running for half an hour on the same core, without any problem: 30k cycles without a hiccup.

Apart from the immediate effect exposed by your script, I have reason to believe that the root cause also affects other running (smp) processes.

bartels
Tom Honermann
2012-12-21 06:30:07 UTC
I spent most of the week debugging this issue. This appears to be a
defect in Windows. I can reproduce the issue without Cygwin. I can't
rule out other third party kernel mode software possibly contributing to
the issue. A simple change to Cygwin works around the problem for me.

I don't know which Windows releases are affected by this. I've only
reproduced the problem (outside of Cygwin) with Wow64 processes running
on 64-bit Windows 7. I haven't yet tried elsewhere.

The problem appears to be a race condition involving concurrent calls to
TerminateProcess() and ExitThread(). The example code below minimally
mimics the threads created and exit process/thread calls that are
performed when running Cygwin's false.exe. The primary thread exits the
process via TerminateProcess() ala pinfo::exit() in
winsup/cygwin/pinfo.cc. The secondary thread exits itself via
ExitThread() ala Cygwin's signal processing thread function, wait_sig(),
in winsup/cygwin/sigproc.cc.

When the race condition results in the undesirable outcome, the exit
code for the process is set to the exit code for the secondary thread's
call to ExitThread(). I can only speculate at this point, but my guess
is that the TerminateProcess() code disassociates the calling thread
from the process before other threads are stopped such that
ExitThread(), concurrently running in another thread, may determine that
the calling thread is the last thread of the process and overwrite the
process exit code.

The issue also reproduces if ExitProcess() is called in place of
TerminateProcess(). The test case below only uses TerminateProcess()
because that is what Cygwin does.

Source code to reproduce the issue follows. Again, Cygwin is not
required to reproduce the problem. For my own testing, I compiled the
code using Microsoft's Visual Studio 2010 x86 compiler with the command
'cl /Fetest-exit-code.exe test-exit-code.cpp'

test-exit-code.cpp:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

DWORD WINAPI SecondaryThread(
    LPVOID lpParameter)
{
    Sleep(1);
    ExitThread(2);
}

int main() {
    HANDLE hSecondaryThread = CreateThread(
        NULL,             // lpThreadAttributes
        0,                // dwStackSize
        SecondaryThread,  // lpStartAddress
        (LPVOID)0,        // lpParameter
        0,                // dwCreationFlags
        NULL);            // lpThreadId
    if (!hSecondaryThread) {
        fprintf(stderr, "CreateThread failed. GLE=%lu\n",
            (unsigned long)GetLastError());
        exit(127);
    }

    Sleep(1);

    if (!TerminateProcess(GetCurrentProcess(), 1)) {
        fprintf(stderr, "TerminateProcess failed. GLE=%lu\n",
            (unsigned long)GetLastError());
        exit(127);
    }

    return 0;
}


To run the test, a simple .bat file is used:

test.bat:

@echo off
setlocal

:loop
echo test...
test-exit-code.exe
if %ERRORLEVEL% NEQ 1 (
    echo test-exit-code.exe returned %ERRORLEVEL%
    exit /B 1
)
goto loop


test.bat should run indefinitely. The amount of time it takes to fail
on my machine (64-bit Windows 7 running in a VMware Workstation 8 VM
under Kubuntu 12.04 on a Lenovo T420 Intel i7-2640M 2 processor laptop)
varies considerably. I had one run fail in less than 10 iterations, but
most of the time it has taken upwards of 5 minutes to get a failure.

The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to ExitThread()
in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
doesn't exit until the process is exiting anyway, the call to Sleep()
does not adversely affect shutdown. The thread just gets terminated
while in the call to Sleep() instead of exiting before the process is
terminated or getting terminated while still in the call to
ExitThread(). A better solution might be to avoid the thread exiting at
all (so long as it can't get terminated while holding critical
resources), or to have the process exiting thread wait on it. Neither
of these is ideal. Orderly shutdown of multi-threaded processes is
really hard to do correctly on Windows.

Since the exit code for the signal processing thread is not used, having
the wait_sig() thread (and any other threads that could potentially
concurrently exit with another thread) exit with a special status value
such as STATUS_THREAD_IS_TERMINATING (0xC000004BL) would enable
diagnosis of this issue as any process exit code matching this would be
a likely indicator that this issue was encountered.
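
If threads were changed to exit with such a sentinel, a test harness could flag it explicitly. The sketch below assumes that change has been made (the thread only proposes it) and that the caller already has the full 32-bit process exit code as a decimal string, for example from %ERRORLEVEL% in cmd or from GetExitCodeProcess(); a POSIX shell's $? only carries 8 bits. The function name classify_exit_code is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread.
# Usage: classify_exit_code CODE EXPECTED
# CODE is the full 32-bit process exit code in decimal, either unsigned
# (3221225547) or signed as cmd prints it (-1073741749). Both spell
# 0xC000004B, STATUS_THREAD_IS_TERMINATING.
classify_exit_code() {
    code=$1
    expected=$2
    case $code in
        3221225547|-1073741749)
            echo "sentinel: process exit code likely overwritten by a thread exit" ;;
        "$expected")
            echo "ok" ;;
        *)
            echo "unexpected: $code" ;;
    esac
}
```

For example, classify_exit_code -1073741749 1 reports the sentinel case, while classify_exit_code 0 1 reports an unexpected code without attributing it to the race.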

As is, when this race condition results in the undesirable outcome,
since the signal processing thread exits with a status of 0, the exit
status of the process is 0. This explains why false.exe works so well
to reproduce the issue. It would be impossible to produce a negative
test using true.exe.

Tom.
Corinna Vinschen
2012-12-21 10:32:41 UTC
Post by Tom Honermann
I spent most of the week debugging this issue. This appears to be a
defect in Windows. I can reproduce the issue without Cygwin. I
can't rule out other third party kernel mode software possibly
contributing to the issue. A simple change to Cygwin works around
the problem for me.
I don't know which Windows releases are affected by this. I've only
reproduced the problem (outside of Cygwin) with Wow64 processes
running on 64-bit Windows 7. I haven't yet tried elsewhere.
The problem appears to be a race condition involving concurrent
calls to TerminateProcess() and ExitThread(). The example code
below minimally mimics the threads created and exit process/thread
calls that are performed when running Cygwin's false.exe. The
primary thread exits the process via TerminateProcess() ala
pinfo::exit() in winsup/cygwin/pinfo.cc. The secondary thread exits
itself via ExitThread() ala Cygwin's signal processing thread
function, wait_sig(), in winsup/cygwin/sigproc.cc.
When the race condition results in the undesirable outcome, the exit
code for the process is set to the exit code for the secondary
thread's call to ExitThread(). I can only speculate at this point,
but my guess is that the TerminateProcess() code disassociates the
calling thread from the process before other threads are stopped
such that ExitThread(), concurrently running in another thread, may
determine that the calling thread is the last thread of the process
and overwrite the process exit code.
The issue also reproduces if ExitProcess() is called in place of
TerminateProcess(). The test case below only uses
TerminateProcess() because that is what Cygwin does.
Source code to reproduce the issue follows. Again, Cygwin is not
required to reproduce the problem. For my own testing, I compiled
the code using Microsoft's Visual Studio 2010 x86 compiler with the
command 'cl /Fetest-exit-code.exe test-exit-code.cpp'
Wow. Thanks for this testcase. I tried to reproduce the issue and
I was not able to reproduce it on a single-CPU, single-core setup,
but I could reproduce it almost immediately on a dual-core system,
twice in a row in under 5 secs.
Post by Tom Honermann
The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to
ExitThread() in wait_sig() in winsup/cygwin/sigproc.cc. Since this
thread (probably) doesn't exit until the process is exiting anyway,
the call to Sleep() does not adversely affect shutdown. The thread
just gets terminated while in the call to Sleep() instead of exiting
before the process is terminated or getting terminated while still
in the call to ExitThread(). A better solution might be to avoid
the thread exiting at all (so long as it can't get terminated while
holding critical resources), or to have the process exiting thread
wait on it. Neither of these is ideal. Orderly shutdown of
multi-threaded processes is really hard to do correctly on Windows.
Since the exit code for the signal processing thread is not used,
having the wait_sig() thread (and any other threads that could
potentially concurrently exit with another thread) exit with a
special status value such as STATUS_THREAD_IS_TERMINATING
(0xC000004BL) would enable diagnosis of this issue as any process
exit code matching this would be a likely indicator that this issue
was encountered.
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?


Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat
Nick Lowe
2012-12-21 12:15:09 UTC
Briefly casting my eye at the test case, as a general point, remember
that these termination APIs all complete asynchronously and I do not
believe it has ever been safe or correct to call another while one is
still pending - you are in undefined, edge case behaviour territory
here.

Win32's TerminateThread/ExitThread, that in turn calls the native
NtTerminateThread, only requests cancellation of a thread and returns
immediately.
One has to wait on a handle to the thread to know that termination has
completed, for which the synchronise standard access right is
required.
The same is true of Win32's TerminateProcess/ExitProcess, in turn
NtTerminateProcess, where one waits instead on a handle to the
process.

Regards,

Nick
Tom Honermann
2012-12-21 19:44:44 UTC
Post by Nick Lowe
Briefly casting my eye at the test case, as a general point, remember
that these termination APIs all complete asynchronously and I do not
believe it has ever been safe or correct to call another while one is
still pending - you are in undefined, edge case behaviour territory
here.
These comments do not match my understanding of these APIs. MSDN
documentation contradicts some of this as well.
Post by Nick Lowe
Win32's TerminateThread/ExitThread, that in turn calls the native
NtTerminateThread, only requests cancellation of a thread and returns
immediately.
One has to wait on a handle to the thread to know that termination has
completed, for which the synchronise standard access right is
required.
The same is true of Win32's TerminateProcess/ExitProcess, in turn
NtTerminateProcess, where one waits instead on a handle to the
process.
TerminateProcess() is documented to perform error checking and then to
schedule asynchronous termination of the specified process. I would not
be surprised if the asynchronous termination applies even when
GetCurrentProcess() is used to specify the process to terminate, but I
would likewise not be surprised if TerminateProcess() has special
handling for this. I agree that calls to TerminateProcess() might
return before the calling thread/process is terminated. I have not
tried to verify this behavior though.

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686714%28v=vs.85%29.aspx

The MSDN documentation for TerminateThread() does not state that the
termination is carried out asynchronously, but I would not be surprised
if that is the case.

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686717%28v=vs.85%29.aspx

I would be *very* surprised if it is possible for ExitProcess() and
ExitThread() to return (unless the thread is being suspended and its
context manipulated by another process/thread). The MSDN docs for these
do not mention any possibility of return. In addition, the ExitThread()
documentation explicitly states that Windows manages serialization of
calls to ExitProcess() and ExitThread().

<quote>
The ExitProcess, ExitThread, CreateThread, CreateRemoteThread functions,
and a process that is starting (as the result of a CreateProcess call)
are serialized between each other within a process. Only one of these
events can happen in an address space at a time.
</quote>

http://msdn.microsoft.com/en-us/library/windows/desktop/ms682659%28v=vs.85%29.aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/ms682658%28v=vs.85%29.aspx

I read that quote as supporting my assertion that the observed behavior
is a defect in Windows. It appears Windows is failing to serialize the
calls appropriately.

Tom.
Nick Lowe
2012-12-22 03:09:29 UTC
The documentation in MSDN is incorrect/incomplete with regard to
TerminateThread/TerminateProcess; both are definitely asynchronous.

I am not clear/confident on the behaviour of ExitProcess and
ExitThread, but will investigate with IDA and a test case later. I
suspect any locking/serialisation will pertain to these functions
only.
Christopher Faylor
2012-12-21 16:10:24 UTC
Post by Corinna Vinschen
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?
If the analysis is correct, that just fixes one symptom doesn't it?
There are potentially many threads running in any Cygwin program
and it sounds like any one of them could trigger this.

cgf
Corinna Vinschen
2012-12-21 17:02:19 UTC
Post by Christopher Faylor
Post by Corinna Vinschen
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?
If the analysis is correct, that just fixes one symptom doesn't it?
There are potentially many threads running in any Cygwin program
and it sounds like any one of them could trigger this.
Right. I guess the question is how to synchronize things so that the
thread calling TerminateProcess is actually the last one, making sure
its return value is used.

Maybe the NtQueryInformationThread(ThreadAmILastThread) call is of some
help. Or we have to keep all thread IDs of the self-started threads
available to terminate them explicitly at process exit.


Corinna
Christopher Faylor
2012-12-21 19:36:20 UTC
Post by Corinna Vinschen
Post by Christopher Faylor
Post by Corinna Vinschen
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?
If the analysis is correct, that just fixes one symptom doesn't it?
There are potentially many threads running in any Cygwin program
and it sounds like any one of them could trigger this.
Right. I guess the question is how to synchronize things so that the
thread calling TerminateProcess is actually the last one, making sure
its return value is used.
Maybe the NtQueryInformationThread(ThreadAmILastThread) call is of some
help. Or we have to keep all thread IDs of the self-started threads
available to terminate them explicitly at process exit.
I checked in a complicated fix for this problem which only affected
Cygwin-created threads. But, then, I thought about another riskier but
simpler fix. That version is now in CVS and I'm generating a new
snapshot with it.

I tested this lightly on Windows 7 and 32-bit XP but it would be nice to
hear if multi-threaded things like X work on other platforms too.

If you test a snapshot, note that I'm still tracking down Ken Brown's
reported Emacs regression in recent snapshots so that will still be
broken.

cgf
Daniel Colascione
2012-12-21 20:37:13 UTC
Post by Christopher Faylor
Post by Corinna Vinschen
Post by Christopher Faylor
Post by Corinna Vinschen
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?
If the analysis is correct, that just fixes one symptom doesn't it?
There are potentially many threads running in any Cygwin program
and it sounds like any one of them could trigger this.
Right. I guess the question is how to synchronize things so that the
thread calling TerminateProcess is actually the last one, making sure
its return value is used.
Maybe the NtQueryInformationThread(ThreadAmILastThread) call is of some
help. Or we have to keep all thread IDs of the self-started threads
available to terminate them explicitly at process exit.
I checked in a complicated fix for this problem which only affected
Cygwin-created threads. But, then, I thought about another riskier but
simpler fix.
Your second approach scares me. There's no global order imposed on the loader
lock and the Cygwin process lock, and Windows can take the loader lock at
virtually any time, since LoadLibrary can be used internally to implement any API.
marco atzeri
2012-12-21 22:23:00 UTC
Post by Christopher Faylor
Post by Corinna Vinschen
Post by Christopher Faylor
Post by Corinna Vinschen
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?
If the analysis is correct, that just fixes one symptom doesn't it?
There are potentially many threads running in any Cygwin program
and it sounds like any one of them could trigger this.
Right. I guess the question is how to synchronize things so that the
thread calling TerminateProcess is actually the last one, making sure
its return value is used.
Maybe the NtQueryInformationThread(ThreadAmILastThread) call is of some
help. Or we have to keep all thread IDs of the self-started threads
available to terminate them explicitly at process exit.
I checked in a complicated fix for this problem which only affected
Cygwin-created threads. But, then, I thought about another riskier but
simpler fix. That version is now in CVS and I'm generating a new
snapshot with it.
I tested this lightly on Windows 7 and 32-bit XP but it would be nice to
hear if multi-threaded things like X work on other platforms too.
If you test a snapshot, note that I'm still tracking down Ken Brown's
reported Emacs regression in recent snapshots so that will still be
broken.
cgf
I think the Xserver doesn't like it.
on 20121221 it freezes on start on W7/64
no issue on 20121218

Regards
Marco
Tom Honermann
2012-12-21 23:08:46 UTC
Post by marco atzeri
Post by Christopher Faylor
I tested this lightly on Windows 7 and 32-bit XP but it would be nice to
hear if multi-threaded things like X work on other platforms too.
If you test a snapshot, note that I'm still tracking down Ken Brown's
reported Emacs regression in recent snapshots so that will still be
broken.
cgf
I think the Xserver doesn't like it.
on 20121221 it freezes on start on W7/64
no issue on 20121218
I was worried about this possibility after looking at the code changes.
But, I haven't had a chance to test adequately yet. I would expect
indefinite blocking in dll_entry() may prevent unloading DLLs. For
example, calls to dll_entry() for DLL_PROCESS_DETACH may get blocked.

It looks to me like the changes made are insufficient to prevent the
race. For example, this won't address the case where an exiting thread
releases the process lock acquired in dll_entry() before a thread
exiting the process acquires it in pinfo::exit(). Both threads could
still end up in an ExitThread() vs ExitProcess()/TerminateProcess()
race. However, this is only true for threads whose exits are not
predicated upon an action taken by the process exiting thread after it
has acquired the process lock in pinfo::exit(). And since the exiting
thread must be the last thread of the process in order to hit the issue,
this may not be a concern.

I'm not sure that a general workaround for this issue is feasible for
all possible threads. At least, not without hooking the Terminate* and
Exit* Win32 APIs. My gut tells me that a general solution requires
waiting for thread handles to be signaled, but I haven't thought it
completely through yet.

It looks like Chris reverted the change and checked in a new update. I
haven't looked at those changes yet.

Tom.
Christopher Faylor
2012-12-22 02:52:59 UTC
Post by Tom Honermann
Post by marco atzeri
Post by Christopher Faylor
I tested this lightly on Windows 7 and 32-bit XP but it would be nice to
hear if multi-threaded things like X work on other platforms too.
If you test a snapshot, note that I'm still tracking down Ken Brown's
reported Emacs regression in recent snapshots so that will still be
broken.
cgf
I think the Xserver doesn't like it.
on 20121221 it freezes on start on W7/64
no issue on 20121218
I was worried about this possibility after looking at the code changes.
But, I haven't had a chance to test adequately yet. I would expect
indefinite blocking in dll_entry() may prevent unloading DLLs. For
example, calls to dll_entry() for DLL_PROCESS_DETACH may get blocked.
You're looking at the wrong changes.

cgf
Tom Honermann
2012-12-22 02:57:15 UTC
Post by Christopher Faylor
You're looking at the wrong changes.
I wasn't at the time that I wrote that :)

I noticed that you had reverted those changes. I haven't looked at the
new changes yet.

Tom.
Christopher Faylor
2012-12-22 02:49:43 UTC
Post by marco atzeri
Post by Christopher Faylor
Post by Corinna Vinschen
Post by Christopher Faylor
Post by Corinna Vinschen
Maybe the signal thread should really not exit by itself, but just
wait until the TerminateThread is called. Chris?
If the analysis is correct, that just fixes one symptom doesn't it?
There are potentially many threads running in any Cygwin program
and it sounds like any one of them could trigger this.
Right. I guess the question is how to synchronize things so that the
thread calling TerminateProcess is actually the last one, making sure
its return value is used.
Maybe the NtQueryInformationThread(ThreadAmILastThread) call is of some
help. Or we have to keep all thread IDs of the self-started threads
available to terminate them explicitly at process exit.
I checked in a complicated fix for this problem which only affected
Cygwin-created threads. But, then, I thought about another riskier but
simpler fix. That version is now in CVS and I'm generating a new
snapshot with it.
I tested this lightly on Windows 7 and 32-bit XP but it would be nice to
hear if multi-threaded things like X work on other platforms too.
If you test a snapshot, note that I'm still tracking down Ken Brown's
reported Emacs regression in recent snapshots so that will still be
broken.
cgf
I think the Xserver doesn't like it.
on 20121221 it freezes on start on W7/64
no issue on 20121218
I actually tried Xserver before submitting my change so it certainly isn't
a consistent problem.

cgf
Christopher Faylor
2012-12-22 03:14:30 UTC
I actually tried Xserver before submitting my change so it certainly isn't
a consistent problem.
Sorry, I take that back. I tried Xserver before backing out parts of the
other change and never retried it. Marco is right. It's definitely broken.
I've checked in a new change and am regenerating a snapshot.

cgf
marco atzeri
2012-12-22 09:06:32 UTC
Post by Christopher Faylor
I actually tried Xserver before submitting my change so it certainly isn't
a consistent problem.
Sorry, I take that back. I tried Xserver before backing out parts of the
other change and never retried it. Marco is right. It's definitely broken.
I've checked in a new change and am regenerating a snapshot.
cgf
glad to be useful

20121222 : Xserver works fine and the false loop does not stop.

However lftp is still broken

$ lftp
lftp :~> open -u xxxxxxx matzeri.altervista.org
1 [main] lftp 1092 select_stuff::wait: WaitForMultipleObjects
failed, Win32 error 6


(I have the impression it worked after your last select changes, but I
am unable to replicate)

Regards
Marco
Christopher Faylor
2012-12-22 17:50:41 UTC
Post by marco atzeri
Post by Christopher Faylor
I actually tried Xserver before submitting my change so it certainly isn't
a consistent problem.
Sorry, I take that back. I tried Xserver before backing out parts of the
other change and never retried it. Marco is right. It's definitely broken.
I've checked in a new change and am regenerating a snapshot.
cgf
glad to be useful
20121222 : Xserver works fine and the false loop does not stop.
However lftp is still broken
$ lftp
lftp :~> open -u xxxxxxx matzeri.altervista.org
1 [main] lftp 1092 select_stuff::wait: WaitForMultipleObjects
failed, Win32 error 6
(I have the impression it worked after your last select changes, but I
am unable to replicate)
The snapshot is intended to work around the race between ExitThread and
ExitProcess. Nothing else.

cgf
Christopher Faylor
2012-12-23 16:56:21 UTC
Post by Christopher Faylor
Post by marco atzeri
Post by Christopher Faylor
I actually tried Xserver before submitting my change so it certainly isn't
a consistent problem.
Sorry, I take that back. I tried Xserver before backing out parts of the
other change and never retried it. Marco is right. It's definitely broken.
I've checked in a new change and am regenerating a snapshot.
glad to be useful
20121222 : Xserver works fine and the false loop does not stop.
However lftp is still broken
$ lftp
lftp :~> open -u xxxxxxx matzeri.altervista.org
1 [main] lftp 1092 select_stuff::wait: WaitForMultipleObjects
failed, Win32 error 6
(I have the impression it worked after your last select changes, but I
am unable to replicate)
The snapshot is intended to work around the race between ExitThread and
ExitProcess. Nothing else.
The latest snapshot seems to fix this problem.

FYI
cgf
marco atzeri
2012-12-23 18:53:59 UTC
Post by Christopher Faylor
Post by Christopher Faylor
Post by marco atzeri
However lftp is still broken
$ lftp
lftp :~> open -u xxxxxxx matzeri.altervista.org
1 [main] lftp 1092 select_stuff::wait: WaitForMultipleObjects
failed, Win32 error 6
(I have the impression it worked after your last select changes, but I
am unable to replicate)
The snapshot is intended to work around the race between ExitThread and
ExitProcess. Nothing else.
The latest snapshot seems to fix this problem.
FYI
cgf
confirmed.
20121222 19:36:23 is fine

Thanks
Marco
Tom Honermann
2012-12-27 20:49:24 UTC
I've been doing some testing with the latest source (pulled updates
about 30 minutes ago). I'm no longer able to reproduce any problems
with incorrect exit codes (Yay! Thanks for the quick turn around on
that!), but I am seeing some new errors when terminating the infinite
loop via ctrl-c using the test case below. This is a test case I was
using previously to help isolate the original problem - I had added
special_printf() calls in a few places and was using strace -m special
to trigger them. All of my changes have been reverted and I'm back to
using vanilla source code. This test is run with a newly built
strace.exe and cygwin1.dll (the false.exe is an old one)

c:\>type test-strace.bat
@echo off
setlocal

set PATH=%CD%;%PATH%

:loop
echo test...
strace -m special false
if not errorlevel 1 (
echo exiting...
exit /B 1
)
goto loop


When interrupting the test run, I'll often (but not always) get the
following error:

c:\>test-strace.bat
test...
test...
test...
test...
--- Process 8092, exception 40010005 at 75E26D67
Terminate batch job (Y/N)? y


Additionally, some of the Cygwin gcc built utilities that I've built for
testing now occasionally hang upon interruption by ctrl-c. Basic
diagnostics courtesy of gdb follow.

This utility was one used in place of strace in the test case above. It
does a fork() and execlp() of its first parameter and then calls
waitpid() on the child and asserts that the exit code received is 1.

If anyone knows of a way to get accurate stack traces when both gcc and
Microsoft compiled modules are present, I'll be happy to regenerate the
stack traces below.

$ gdb --pid=6908
GNU gdb (GDB) 7.5.50.20120815-cvs (cygwin-special)
...
Reading symbols from
/home/thonermann/cygwin/test-install/bin/expect-false-execve-cygwin32.exe...done.
...
(gdb) info shared
From        To          Syms Read   Shared Object Library
0x77461000  0x775c5d1c  Yes (*)     /cygdrive/c/Windows/SysWOW64/ntdll.dll
0x75d71000  0x75e6bd58  Yes (*)     /cygdrive/c/Windows/syswow64/kernel32.dll
0x74ba1000  0x74be5a08  Yes (*)     /cygdrive/c/Windows/syswow64/KERNELBASE.dll
0x61001000  0x61490000  Yes         /home/thonermann/cygwin/test-install/bin/cygwin1.dll
0x76271000  0x76354198  Yes (*)     /cygdrive/c/Windows/system32/user32.dll
0x74f11000  0x74f8292c  Yes (*)     /cygdrive/c/Windows/syswow64/GDI32.dll
0x76181000  0x761892f8  Yes (*)     /cygdrive/c/Windows/syswow64/LPK.dll
0x74d71000  0x74e0c9fc  Yes (*)     /cygdrive/c/Windows/syswow64/USP10.dll
0x75bf1000  0x75c9b2c4  Yes (*)     /cygdrive/c/Windows/syswow64/msvcrt.dll
0x75eb1000  0x75f4f04c  Yes (*)     /cygdrive/c/Windows/syswow64/ADVAPI32.dll
0x74ed1000  0x74ee8ed8  Yes (*)     /cygdrive/c/Windows/SysWOW64/sechost.dll
0x76371000  0x76445e04  Yes (*)     /cygdrive/c/Windows/syswow64/RPCRT4.dll
0x74b41000  0x74b821f0  Yes (*)     /cygdrive/c/Windows/syswow64/SspiCli.dll
0x74b31000  0x74b3b474  Yes (*)     /cygdrive/c/Windows/syswow64/CRYPTBASE.dll
0x76581000  0x765c1ce0  Yes (*)     /cygdrive/c/Windows/system32/IMM32.DLL
0x75ca1000  0x75d6bebc  Yes (*)     /cygdrive/c/Windows/syswow64/MSCTF.dll
0x70e41000  0x70e8b464  Yes (*)     /cygdrive/c/Windows/system32/apphelp.dll
(*): Shared library is missing debugging information.

(gdb) info thread
Id Target Id Frame
* 4 Thread 6908.0x1950 0x7747000d in ntdll!LdrFindResource_U ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
3 Thread 6908.0x1d8c 0x7747f8e5 in ntdll!RtlUpdateClonedSRWLock ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
2 Thread 6908.0x1d34 0x7747f8b1 in ntdll!RtlUpdateClonedSRWLock ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
1 Thread 6908.0x1344 0x7748013d in
ntdll!RtlEnableEarlyCriticalSectionEventCreation ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll

(gdb) thread 1
[Switching to thread 1 (Thread 6908.0x1344)]
#0 0x7748013d in ntdll!RtlEnableEarlyCriticalSectionEventCreation ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7748013d in ntdll!RtlEnableEarlyCriticalSectionEventCreation ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7748013d in ntdll!RtlEnableEarlyCriticalSectionEventCreation ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x74bb0bdd in WaitForMultipleObjectsEx () from
/cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x00000002 in ?? ()
#4 0x00000001 in ?? ()
#5 0x00000000 in ?? ()

(gdb) thread 2
[Switching to thread 2 (Thread 6908.0x1d34)]
#0 0x7747f8b1 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7747f8b1 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7747f8b1 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x74bb0a91 in WaitForSingleObjectEx () from
/cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x00000034 in ?? ()
#4 0x00000000 in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 6908.0x1d8c)]
#0 0x7747f8e5 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7747f8e5 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7747f8e5 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x74bad348 in ReadFile () from
/cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x00000118 in ?? ()
#4 0x00000000 in ?? ()

(gdb) thread 4
[Switching to thread 4 (Thread 6908.0x1950)]
#0 0x7747000d in ntdll!LdrFindResource_U () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7747000d in ntdll!LdrFindResource_U () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x774ff896 in ntdll!RtlQueryTimeZoneInformation ()
from /cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x5dfded78 in ?? ()
#3 0x00000000 in ?? ()

Tom.
Christopher Faylor
2012-12-29 21:57:25 UTC
Post by Tom Honermann
When interrupting the test run, I'll often (but not always) get the
c:\>test-strace.bat
test...
test...
test...
test...
--- Process 8092, exception 40010005 at 75E26D67
That is coming from strace and it's:

/usr/include/w32api/ntstatus.h:#define DBG_CONTROL_C ((NTSTATUS)0x40010005)

i.e., it's expected.
Post by Tom Honermann
Additionally, some of the Cygwin gcc built utilities that I've built for
testing now occasionally hang upon interruption by ctrl-c. Basic
diagnostics courtesy of gdb follow.
The hang should be fixed in the latest snapshot.

cgf
Tom Honermann
2013-01-01 01:44:56 UTC
Post by Christopher Faylor
Post by Tom Honermann
When interrupting the test run, I'll often (but not always) get the
c:\>test-strace.bat
test...
test...
test...
test...
--- Process 8092, exception 40010005 at 75E26D67
/usr/include/w32api/ntstatus.h:#define DBG_CONTROL_C ((NTSTATUS)0x40010005)
i.e., it's expected.
Ah, sorry, I should have researched that further before reporting it.
Thanks for the explanation.
Post by Christopher Faylor
Post by Tom Honermann
Additionally, some of the Cygwin gcc built utilities that I've built for
testing now occasionally hang upon interruption by ctrl-c. Basic
diagnostics courtesy of gdb follow.
The hang should be fixed in the latest snapshot.
I'm still seeing hangs in the latest code from CVS. The stack traces
below are from WinDbg. I manually resolved the symbol references within
the cygwin1 module using the linker generated .map file. Since the .map
file does not include static functions, some of these may be incorrect -
I didn't try and verify or correct for this.

# ChildEBP RetAddr
00 00288bd0 758d0a91 ntdll!ZwWaitForSingleObject+0x15
01 00288c3c 76c11194 KERNELBASE!WaitForSingleObjectEx+0x98
02 00288c54 76c11148 kernel32!WaitForSingleObjectExImplementation+0x75
03 00288c68 610f1553 kernel32!WaitForSingleObject+0x12
04 00288cb8 6118e54d cygwin1!strtosigno+0x357
__ZN4muto7acquireEm
muto::acquire(unsigned long)
05 00288cc8 610f17b2 cygwin1!alloca+0xbbc9
__ZN6dtable4lockEv
dtable::lock()
06 00288d28 610eb717 cygwin1!strtosigno+0x5b6
***@4
close_all_files(bool)
07 00289a48 610eb92b cygwin1!sigfillset+0x7f3e
__ZN16child_info_spawn6workerEPKcPKS1_S3_iii
child_info_spawn::worker(char const*, char const* const*, char const* const*, int, int, int)
08 00289a88 6103af97 cygwin1!sigfillset+0x8152
_spawnve
09 0028ac28 61007b38 cygwin1!getenv+0x5293
_execlp
0a 0028ac48 61007ad5 cygwin1!setprogname+0x597d
0b 00000000 00000000 cygwin1!setprogname+0x591a


# ChildEBP RetAddr
00 0071aafc 758cd348 ntdll!ZwReadFile+0x15
01 0071ab60 76c13ef7 KERNELBASE!ReadFile+0x118
02 0071aba8 610e7910 kernel32!ReadFileImplementation+0xf0
03 0071aca8 61003ec2 cygwin1!sigfillset+0x4137
__ZN15pending_signals4nextEv
pending_signals::next()
04 0071ace8 61004057 cygwin1!setprogname+0x1d07
__ZN9cygthread8callfuncEb
cygthread::callfunc(bool)
05 0071ad28 61004f61 cygwin1!setprogname+0x1e9c
***@4
cygthread::stub(void*)
06 0071cd98 61004dbc cygwin1!setprogname+0x2da6
__ZN7_cygtls5call2EPFmPvS0_ES0_S0_A
_cygtls::call2(unsigned long (*)(void*, void*), void*, void*)
07 0071ff68 61087074 cygwin1!setprogname+0x2c01
__ZN7_cygtls4callEPFmPvS0_ES0_A
_cygtls::call(unsigned long (*)(void*, void*), void*)
08 0071ff88 76c1339a cygwin1!setgrent+0x283c
09 0071ff94 779a9ef2 kernel32!BaseThreadInitThunk+0xe
0a 0071ffd4 779a9ec5 ntdll!__RtlUserThreadStart+0x70
0b 0071ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

Tom.
Christopher Faylor
2013-01-01 05:36:06 UTC
Post by Tom Honermann
Post by Christopher Faylor
Post by Tom Honermann
When interrupting the test run, I'll often (but not always) get the
c:\>test-strace.bat
test...
test...
test...
test...
--- Process 8092, exception 40010005 at 75E26D67
/usr/include/w32api/ntstatus.h:#define DBG_CONTROL_C ((NTSTATUS)0x40010005)
i.e., it's expected.
Ah, sorry, I should have researched that further before reporting it.
Thanks for the explanation.
Post by Christopher Faylor
Post by Tom Honermann
Additionally, some of the Cygwin gcc built utilities that I've built for
testing now occasionally hang upon interruption by ctrl-c. Basic
diagnostics courtesy of gdb follow.
The hang should be fixed in the latest snapshot.
I'm still seeing hangs in the latest code from CVS. The stack traces
below are from WinDbg.
I'm not asking you to build this yourself. I have no way to know how
you are building this. Please just use the snapshots at

http://cygwin.com/snapshots/
Post by Tom Honermann
I manually resolved the symbol references within
the cygwin1 module using the linker generated .map file. Since the .map
file does not include static functions, some of these may be incorrect -
I didn't try and verify or correct for this.
Thanks for trying, but the output below is garbled and not really
useful. If you are not going to dive in and attempt to fix code
yourself then all we normally need is a simple test case. WinDbg
is not really appropriate for debugging Cygwin applications.

cgf
Post by Tom Honermann
# ChildEBP RetAddr
00 00288bd0 758d0a91 ntdll!ZwWaitForSingleObject+0x15
01 00288c3c 76c11194 KERNELBASE!WaitForSingleObjectEx+0x98
02 00288c54 76c11148 kernel32!WaitForSingleObjectExImplementation+0x75
03 00288c68 610f1553 kernel32!WaitForSingleObject+0x12
04 00288cb8 6118e54d cygwin1!strtosigno+0x357
__ZN4muto7acquireEm
muto::acquire(unsigned long)
[snip]
Tom Honermann
2013-01-02 19:15:15 UTC
Post by Christopher Faylor
Post by Tom Honermann
I'm still seeing hangs in the latest code from CVS. The stack traces
below are from WinDbg.
I'm not asking you to build this yourself. I have no way to know how
you are building this. Please just use the snapshots at
http://cygwin.com/snapshots/
I was building it myself so that I could debug it without having to
specify debug source paths and such. I believe my builds are not
unconventional. I used options that disabled frame pointer omission so
that the resulting binaries could be debugged with non-gcc debuggers.

$ mkdir build
$ cd build
$ ../src/configure \
CFLAGS="-g" \
CXXFLAGS="-g" \
CFLAGS_FOR_TARGET="-g" \
CXXFLAGS_FOR_TARGET="-g" \
--enable-debugging \
--prefix=$HOME/src/cygwin-latest/install -v
$ make
$ make install
Post by Christopher Faylor
Post by Tom Honermann
I manually resolved the symbol references within
the cygwin1 module using the linker generated .map file. Since the .map
file does not include static functions, some of these may be incorrect -
I didn't try and verify or correct for this.
Thanks for trying, but the output below is garbled and not really
useful. If you are not going to dive in and attempt to fix code
yourself then all we normally need is a simple test case. WinDbg
is not really appropriate for debugging Cygwin applications.
The output below is not garbled, but I didn't explain it clearly enough.
Lines with frame numbers come directly from WinDbg. Since WinDbg is
unable to resolve symbols to gcc generated debug info, the symbol
references within the cygwin1 module are incorrect. In those cases, I
manually resolved the instruction pointer address using the RetAddr
value from the prior frame and searching the linker generated
cygwin1.map file. I then pasted the mangled name on a line following
the WinDbg line (with the incorrect symbol name) and, if the symbol is a
C++ one, the unmangled name on an additional line.

For the stack fragment below, address 610f1553 == strtosigno+0x357 ==
__ZN4muto7acquireEm == muto::acquire(unsigned long). I did not
translate offsets for the functions as I resolved them, nor did I try
to verify they are correct (i.e., that the return address is not for a
static function that is not represented in the .map file).
Post by Christopher Faylor
Post by Tom Honermann
# ChildEBP RetAddr
00 00288bd0 758d0a91 ntdll!ZwWaitForSingleObject+0x15
01 00288c3c 76c11194 KERNELBASE!WaitForSingleObjectEx+0x98
02 00288c54 76c11148 kernel32!WaitForSingleObjectExImplementation+0x75
03 00288c68 610f1553 kernel32!WaitForSingleObject+0x12
04 00288cb8 6118e54d cygwin1!strtosigno+0x357
__ZN4muto7acquireEm
muto::acquire(unsigned long)
[snip]
The reason for using WinDbg is that, from what I understand, gdb is
unable to produce accurate stack traces when the call stack includes
frames for functions that omit the frame pointer and do not have debug
info that gdb can process. I believe many Microsoft provided functions
in ntdll, kernel32, kernelbase, etc... do omit the frame pointer and
only provide debug info in the PDB format - which gdb is unable to use.
Compiling Cygwin without frame pointer omission, and using WinDbg
therefore provides the most accurate stack trace. If I am incorrect
about any of this, I would very much appreciate a correction and/or
explanation.

I downloaded the latest snapshot (2012-12-31 18:44:57 UTC) and was able
to reproduce several issues which are described below.

All of these issues occur when using ctrl-c to interrupt the infinite
loop in the test case(s) I've been using to debug inconsistent exit
codes. When ctrl-c is pressed, I've observed the following:

1) Programs are (generally) terminated as expected. cmd.exe prompts to
"Terminate batch job" as expected.

2) An access violation occurs and a processor context is dumped to the
console. I do not yet have stack traces for these cases.

3) One of the processes hangs.

Access violations occur in ~20% of test runs. Hangs occur in ~5% of
test runs.

I did not provide a test case previously because I don't have an
automated reproducer at present. All sources needed to reproduce the
issues are below. The test case uses a .bat file to avoid dependencies
on bash so as to minimally isolate the problem.

To reproduce the issues, copy test.bat, false-cygwin32.exe, and
expect-false-execve-cygwin32.exe to a Cygwin bin directory and run
test.bat from a cmd.exe console. Press ctrl-c to interrupt the test.
Repeat until problems are observed. I have not been able to reproduce
these symptoms when running the test via a MinTTY console.

I have been unable to get useful stack traces from hung processes using
gdb. gdb reports that the debug information in cygwin1-20130102.dbg.bz2
does not match (CRC mismatch) the cygwin1.dll module in
cygwin-inst-20130102.tar.bz2.


$ cat expect-false-execve.c
#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    pid_t child_pid, wait_pid;
    int result, child_status;

    if (argc != 2) {
        fprintf(stderr, "expect-false: Missing or too many arguments\n");
        return 127;
    }

    child_pid = fork();
    if (child_pid == -1) {
        fprintf(stderr, "expect-false: fork failed. errno=%d\n", errno);
        return 127;
    } else if (child_pid == 0) {
        /* Child: replace this process image with the requested program. */
        result = execlp(argv[1], argv[1], (char *)NULL);
        if (result == -1) {
            fprintf(stderr, "expect-false: execlp failed. errno=%d\n",
                    errno);
        }
        _exit(127);
    }

    /* Parent: retry on EINTR and ignore stop/continue notifications. */
    do {
        wait_pid = waitpid(child_pid, &child_status, 0);
    } while (
        (wait_pid == -1 && errno == EINTR) ||
        (wait_pid == child_pid && !(WIFEXITED(child_status) ||
                                    WIFSIGNALED(child_status)))
    );
    if (wait_pid == -1) {
        fprintf(stderr, "expect-false: waitpid failed. errno=%d\n",
                errno);
        return 127;
    }
    if (!WIFEXITED(child_status)) {
        fprintf(stderr,
                "expect-false: child process did not exit normally\n");
        return 127;
    }
    if (WEXITSTATUS(child_status) != 1) {
        /* Report the decoded exit code, not the raw wait status word. */
        fprintf(stderr, "expect-false: unexpected exit code: %d\n",
                WEXITSTATUS(child_status));
    }

    return WEXITSTATUS(child_status);
}


$ cat false.c
#include <stdio.h>

int main() {
printf("myfalse\n");
return 1;
}


$ cat test.bat
@echo off
setlocal

set PATH=%CD%;%PATH%

:loop
echo test...
expect-false-execve-cygwin32.exe false-cygwin32
if not errorlevel 1 (
echo exiting...
exit /B 1
)
goto loop


$ gcc -o expect-false-execve-cygwin32.exe expect-false-execve.c
$ gcc -o false-cygwin32.exe false.c

From a cmd.exe console: (press ctrl-c once the test is running)
C:\...\cygwin\bin>test
test...
myfalse
test...
myfalse
...


Tom.
Christopher Faylor
2013-01-02 20:48:21 UTC
I managed to duplicate a hang by really stressing ctrl-c in a loop. It
uncovers some rather amazing Windows behavior which I have to think
about. Apparently ExitThread can be called recursively within the
thread that Windows creates to handle CTRL-C.

cgf
Daniel Colascione
2013-01-02 20:53:11 UTC
Post by Christopher Faylor
I managed to duplicate a hang by really stressing ctrl-c in a loop. It
uncovers some rather amazing Windows behavior which I have to think
about. Apparently ExitThread can be called recursively within the
thread that Windows creates to handle CTRL-C.
What do you mean? ExitThread should never return, and I can't
imagine anything on the thread termination path calling ExitThread
again, especially not once the thread jumps to kernel mode.
Christopher Faylor
2013-01-02 21:41:16 UTC
Post by Daniel Colascione
Post by Christopher Faylor
I managed to duplicate a hang by really stressing ctrl-c in a loop. It
uncovers some rather amazing Windows behavior which I have to think
about. Apparently ExitThread can be called recursively within the
thread that Windows creates to handle CTRL-C.
What do you mean? ExitThread should never return, and I can't
imagine anything on the thread termination path calling ExitThread
again, especially not once the thread jumps to kernel mode.
Sorry, I was just speculating about what it looked like. I'm
still debugging the problem.

cgf
Tom Honermann
2013-01-02 21:24:39 UTC
Post by Christopher Faylor
I managed to duplicate a hang by really stressing ctrl-c in a loop. It
uncovers some rather amazing Windows behavior which I have to think
about. Apparently ExitThread can be called recursively within the
thread that Windows creates to handle CTRL-C.
I'm glad you could reproduce. Based on your description, this sounds
like a separate issue and not a regression introduced by the workarounds
you put in place for the ExitProcess / ExitThread race. Correct?

I wonder if this is the same issue I'm experiencing though. I'm only
pressing ctrl-c once and it sounds like you might be delivering a ctrl-c
to the same process multiple times. That may not be relevant to the
root cause however.

Tom.
Tom Honermann
2013-01-15 22:16:57 UTC
Post by Tom Honermann
Post by Christopher Faylor
I managed to duplicate a hang by really stressing ctrl-c in a loop. It
uncovers some rather amazing Windows behavior which I have to think
about. Apparently ExitThread can be called recursively within the
thread that Windows creates to handle CTRL-C.
I'm glad you could reproduce. Based on your description, this sounds
like a separate issue and not a regression introduced by the workarounds
you put in place for the ExitProcess / ExitThread race. Correct?
I wonder if this is the same issue I'm experiencing though. I'm only
pressing ctrl-c once and it sounds like you might be delivering a ctrl-c
to the same process multiple times. That may not be relevant to the
root cause however.
I noticed that some changes were checked in related to signal handling
and process termination recently, so I downloaded the most recent
snapshot (20130114) and tested again. I was still able to produce
hanging processes (including hangs of strace.exe) by hitting ctrl-c in a
mintty window while Cygwin processes ran in an infinite loop inside of a
.bat file. I was able to produce a hang ~1 out of 20 times.

If you are still working on this, then I apologize for the noise.
Otherwise, if I can provide something further that would be helpful,
please let me know.

Tom.
Christopher Faylor
2013-01-16 02:04:20 UTC
Post by Tom Honermann
I noticed that some changes were checked in related to signal handling
and process termination recently, so I downloaded the most recent
snapshot (20130114) and tested again. I was still able to produce
hanging processes (including hangs of strace.exe) by hitting ctrl-c in a
mintty window while Cygwin processes ran in an infinite loop inside of a
.bat file. I was able to produce a hang ~1 out of 20 times.
How does one run a .bat file inside mintty which handles CTRL-C? AFAIK,
a CTRL-C will just cause the .bat file to exit when run under bash.

cgf
Tom Honermann
2013-01-16 16:37:43 UTC
Post by Christopher Faylor
Post by Tom Honermann
I noticed that some changes were checked in related to signal handling
and process termination recently, so I downloaded the most recent
snapshot (20130114) and tested again. I was still able to produce
hanging processes (including hangs of strace.exe) by hitting ctrl-c in a
mintty window while Cygwin processes ran in an infinite loop inside of a
.bat file. I was able to produce a hang ~1 out of 20 times.
How does one run a .bat file inside mintty which handles CTRL-C? AFAIK,
a CTRL-C will just cause the .bat file to exit when run under bash.
Here is the test case:

1) Install the latest snapshot

2) Copy bash.exe, false.exe, and their dependent DLLs from a Cygwin
install into the usr/bin directory of the snapshot. For me this
consisted of:
bash.exe
cygintl-8.dll
cygiconv-2.dll
cygreadline7.dll
cygncurses-10.dll
cygncursesw-10.dll
cyggcc_s-1.dll
false.exe

3) Create 'test.bat' in the usr/bin directory of the snapshot with the
following contents:

@echo off
setlocal

set PATH=%CD%;%PATH%

:loop
echo test...
bash -c false
if not errorlevel 1 (
echo exiting...
exit /B 1
)
goto loop

4) Launch mintty using an existing Cygwin installation. Naturally, this
will run a shell from the existing Cygwin install.

5) Change directories to the usr/bin directory of the snapshot.

6) Start task manager or some other process monitoring tool and keep it
running. Run ./test.bat from the Cygwin shell running within mintty and
interrupt it with ctrl-c. Repeat until you see a new bash.exe or
false.exe process persisting following the interrupt. You'll likely
have multiple bash processes running. If you are able to reproduce, you
should see one with a command line of 'bash -c false'. Alternatively,
if your process monitoring tool shows the path to the executable, you'll
be able to identify it as the one from the usr/bin directory of the
snapshot.

I rather doubt that the use of a .bat file is necessary to reproduce
this hang, but I haven't tried producing a test case that doesn't use a
.bat file. This is a test case I was using when debugging the
intermittent incorrect exit code issue.

Tom.
marco atzeri
2013-01-16 16:53:35 UTC
Post by Tom Honermann
4) Launch mintty using an existing Cygwin installation. Naturally, this
will run a shell from the existing Cygwin install.
5) Change directories to the usr/bin directory of the snapshot.
This will cause a cygwin1.dll collision between the two versions.
Nothing is guaranteed to work fine.
Post by Tom Honermann
Tom.
Marco
Tom Honermann
2013-01-16 17:42:45 UTC
Post by marco atzeri
Post by Tom Honermann
4) Launch mintty using an existing Cygwin installation. Naturally, this
will run a shell from the existing Cygwin install.
5) Change directories to the usr/bin directory of the snapshot.
This will cause a cygwin1.dll collision between the two versions.
Nothing is guaranteed to work fine.
Can you elaborate? Cygwin supports multiple installations just fine
these days. Use of a .bat file (an intervening cmd.exe process) should
isolate the environments for this test.

Regardless, I was also able to produce a hang in bash running the same
.bat file from a cmd.exe prompt using only the snapshot install and the
copied bash.exe, false.exe, and dependent binaries - no mintty. The
hung bash.exe process eventually timed out with an error message:

5 [unknown (0x176C)] bash 2000 sig_send: wait for sig_complete event
failed, signal 6, rc 258, Win32 error 0

Tom.
Earnie Boyd
2013-01-16 18:05:41 UTC
Post by marco atzeri
Post by Tom Honermann
4) Launch mintty using an existing Cygwin installation. Naturally, this
will run a shell from the existing Cygwin install.
5) Change directories to the usr/bin directory of the snapshot.
This will cause a cygwin1.dll collision between the two versions.
Nothing is guaranteed to work fine.
Can you elaborate? Cygwin supports multiple installations just fine these
days. Use of a .bat file (an intervening cmd.exe process) should isolate
the environments for this test.
While you can have multiple installations you cannot mix the environments.
You did not copy mintty so you started it in one instance and then
went to another instance which will cause a clash of resources.
Regardless, I was also able to produce a hang in bash running the same .bat
file from a cmd.exe prompt using only the snapshot install and the copied
bash.exe, false.exe, and dependent binaries - no mintty. The hung bash.exe
5 [unknown (0x176C)] bash 2000 sig_send: wait for sig_complete event failed,
signal 6, rc 258, Win32 error 0
Looking at the list of DLLs you copied you may still be seeing a
conflict with which DLL is in use. Do you see a hang if you remain in
usr/bin and not changing directories to your copied files?
--
Earnie
-- https://sites.google.com/site/earnieboyd
Tom Honermann
2013-01-16 18:51:11 UTC
Post by Earnie Boyd
Post by marco atzeri
Post by Tom Honermann
4) Launch mintty using an existing Cygwin installation. Naturally, this
will run a shell from the existing Cygwin install.
5) Change directories to the usr/bin directory of the snapshot.
This will cause a cygwin1.dll collision between the two versions.
Nothing is guaranteed to work fine.
Can you elaborate? Cygwin supports multiple installations just fine these
days. Use of a .bat file (an intervening cmd.exe process) should isolate
the environments for this test.
While you can have multiple installations you cannot mix the environments.
You did not copy mintty so you started it in one instance and then
went to another instance which will cause a clash of resources.
Can you elaborate on what resources you are referring to? I fail to see
how the Cygwin binaries run via the .bat file could conflict with mintty
(or the top level bash process) since the intervening cmd.exe execution
would have blocked inheritance of Cygwin related resources, primarily
since fork() isn't used to create these child processes.

My understanding is that shared Cygwin resources are keyed off of the
location of the cygwin1.dll loaded into the Cygwin process. If two
Cygwin processes run with different cygwin1.dll instances, they should
not share resources. I can see a case for there being a problem if a
Cygwin process creates another Cygwin process via fork() and that child
process is run with a different cygwin1.dll instance, but that isn't the
case here. The only other case I can think of would require Cygwin
looking at the process tree (stepping up through non-Cygwin processes)
to get at resources. That would be quite expensive on Windows.
Post by Earnie Boyd
Regardless, I was also able to produce a hang in bash running the same .bat
file from a cmd.exe prompt using only the snapshot install and the copied
bash.exe, false.exe, and dependent binaries - no mintty. The hung bash.exe
5 [unknown (0x176C)] bash 2000 sig_send: wait for sig_complete event failed,
signal 6, rc 258, Win32 error 0
Looking at the list of DLLs you copied you may still be seeing a
conflict with which DLL is in use.
I don't see how that would be the case. If it were, then it would not
be possible (in general) to have multiple Cygwin installations with
unrelated processes running concurrently from each installation.
Post by Earnie Boyd
Do you see a hang if you remain in
usr/bin and not changing directories to your copied files?
I believe that would be equivalent to testing in my (non-snapshot)
Cygwin installation. The goal is to test the snapshot.

Tom.
Christopher Faylor
2013-01-16 18:59:08 UTC
Post by Tom Honermann
Can you elaborate on what resources you are referring to? I fail to
see how the Cygwin binaries run via the .bat file could conflict with
mintty (or the top level bash process) since the intervening cmd.exe
execution would have blocked inheritance of Cygwin related resources,
primarily since fork() isn't used to create these child processes.
Here is a very basic issue: If you are going to be submitting a bug
report you should be making things as simple and as clear as possible.
The fact that there are two cygwin DLLs in play here adds additional
confusion and complication. If we now have to enter into a theoretical
discussion about what should be allowed, we have needlessly strayed from
the initial problem.

Given the number of historical problems we have had with mixing two
versions of Cygwin and given that our consistent guidance is to
only have one on your computer, there is no reason to get into a
discussion about what is allowed. Just use one version. You
can easily switch back and forth using windows tools.

cgf
Tom Honermann
2013-01-16 20:18:47 UTC
Post by Christopher Faylor
Post by Tom Honermann
Can you elaborate on what resources you are referring to? I fail to
see how the Cygwin binaries run via the .bat file could conflict with
mintty (or the top level bash process) since the intervening cmd.exe
execution would have blocked inheritance of Cygwin related resources,
primarily since fork() isn't used to create these child processes.
Here is a very basic issue: If you are going to be submitting a bug
report you should be making things as simple and as clear as possible.
I'm trying. What you are suggesting implies that all testing of
snapshots either be done with a cmd.exe prompt (and copying enough of
another Cygwin installation into the snapshot), or updating the host
Cygwin installation. My host installation is used for production
purposes and I don't have spare machines available for other testing.
I'm not messing with it.

I am aware of the snapshot guidance:
http://cygwin.com/faq-nochunks.html#faq.setup.snapshots
Post by Christopher Faylor
The fact that there are two cygwin DLLs in play here adds additional
confusion and complication. If we now have to enter into a theoretical
discussion about what should be allowed, we have needlessly strayed from
the initial problem.
Given the number of historical problems we have had with mixing two
versions of Cygwin and given that our consistent guidance is to
only have one on your computer, there is no reason to get into a
discussion about what is allowed. Just use one version. You
can easily switch back and forth using windows tools.
I previously mentioned that problems can be duplicated without mintty.
Here are detailed steps for how to reproduce without mintty.

1) Install the latest snapshot

2) Copy bash.exe, false.exe, and their dependent DLLs from a Cygwin
install into the usr/bin directory of the snapshot. For me this
consisted of:
bash.exe
cygintl-8.dll
cygiconv-2.dll
cygreadline7.dll
cygncurses-10.dll
cygncursesw-10.dll
cyggcc_s-1.dll
false.exe

3) Shut down all other Cygwin processes.

4) Create 'test.bat' in the usr/bin directory of the snapshot with the
following contents:

@echo off
setlocal

set PATH=%CD%;%PATH%

:loop
echo test...
bash -c false
if not errorlevel 1 (
    echo exiting...
    exit /B 1
)
goto loop

5) Start a cmd.exe prompt.

6) Change directories to the usr/bin directory of the snapshot.

7) Start task manager or some other process monitoring tool and keep it
running. Run ./test.bat from the cmd.exe prompt and interrupt it with
ctrl-c. Repeat until you see a new bash.exe or false.exe process
persisting following the interrupt.

It took me 20 or so tries re-running test.bat and interrupting it before
I was able to produce a hanging/abandoned process.

I don't know how to make things any simpler or clearer than this.

Tom.
Christopher Faylor
2013-01-16 22:23:05 UTC
Post by Tom Honermann
I previously mentioned that problems can be duplicated without mintty.
Here are detailed steps for how to reproduce without mintty.
I was responding to your latest bug report which mentioned mintty.

I managed to duplicate a hang by changing your .bat file to use "sleep
2" rather than false. I'm investigating now.

cgf
Tom Honermann
2013-01-18 20:11:03 UTC
Post by Christopher Faylor
I managed to duplicate a hang by changing your .bat file to use "sleep
2" rather than false. I'm investigating now.
I noticed that you checked in some additional changes on the 16th that
look related to this, so I tested again with today's snapshot (20130118).

I was still able to produce hangs using the same test case. The
symptoms are slightly different than I had seen previously. bash hung 2
out of the ~60 times I interrupted the test. No error messages were
displayed this time. Upon pressing ctrl-c, bash hung for 60 seconds. I
was then greeted with the "Terminate batch job" prompt and responding
'Y' terminated the process tree as expected. Pressing ctrl-c while bash
was hung for that 60 seconds appeared to have no effect.

My apologies for this distraction if you don't yet expect this to be fixed.

Tom.
Christopher Faylor
2013-01-19 05:58:33 UTC
Post by Tom Honermann
Post by Christopher Faylor
I managed to duplicate a hang by changing your .bat file to use "sleep
2" rather than false. I'm investigating now.
I noticed that you checked in some additional changes on the 16th that
look related to this, so I tested again with today's snapshot (20130118).
I thought I sent a "try a snapshot" but I must have been hallucinating
again.
Post by Tom Honermann
I was still able to produce hangs using the same test case. The
symptoms are slightly different than I had seen previously. bash hung 2
out of the ~60 times I interrupted the test. No error messages were
displayed this time. Upon pressing ctrl-c, bash hung for 60 seconds. I
was then greeted with the "Terminate batch job" prompt and responding
'Y' terminated the process tree as expected. Pressing ctrl-c while bash
was hung for that 60 seconds appeared to have no effect.
The hang should be fixed in the upcoming snapshot.

cgf
Tom Honermann
2013-01-20 22:08:50 UTC
Post by Christopher Faylor
Post by Tom Honermann
Post by Christopher Faylor
I managed to duplicate a hang by changing your .bat file to use "sleep
2" rather than false. I'm investigating now.
I noticed that you checked in some additional changes on the 16th that
look related to this, so I tested again with today's snapshot (20130118).
I thought I sent a "try a snapshot" but I must have been hallucinating
again.
Post by Tom Honermann
I was still able to produce hangs using the same test case. The
symptoms are slightly different than I had seen previously. bash hung 2
out of the ~60 times I interrupted the test. No error messages were
displayed this time. Upon pressing ctrl-c, bash hung for 60 seconds. I
was then greeted with the "Terminate batch job" prompt and responding
'Y' terminated the process tree as expected. Pressing ctrl-c while bash
was hung for that 60 seconds appeared to have no effect.
The hang should be fixed in the upcoming snapshot.
Snapshot 20130119 appears to have addressed most of the cases I've
witnessed.

However, I was still able to reproduce another case. As before, one of
the processes is being left running when the rest are terminated. The
"abandoned" process appears to be in a live-lock state with two threads
(threads 1 and 2) running at 100%. Of particular interest is that each
time I press ctrl-c in the cmd.exe console this process was spawned
from, a new thread appears in the process even though this program is no
longer a foreground process and all other Cygwin processes have
terminated. The new threads never exit.

Same test case as before. However, since reproducing this may be
challenging, I dug in to try and get some details that might help with
reproducing it.

It looks like thread 1 was interrupted while in a call to free(). Both
thread 1 and 2 appear to be stuck looping on calls to yield(). Thread 3
appears to be stuck in a call to WriteFile. I suspect thread 3 was
created by the initial ctrl-c event, but I'm not able to get an accurate
stack trace for this thread to prove that. Threads 4 and up correspond
to new threads created for new ctrl-c events.

The following stack traces correspond to the above mentioned snapshot
with cygwin1.dbg (from cygwin1-20130119.dbg.bz2) in place.

(gdb) thread 1
[Switching to thread 1 (Thread 5344.0x1878)]
#0 0x7767fbfa in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7767fbfa in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7767fbfa in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x76792ed6 in KERNELBASE!GetThreadUILanguage ()
from /cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x61087581 in yield ()
at
/netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/miscfuncs.cc:243
#4 0x610d6d9c in _sigfe () from
/home/thonermann/cygwin/snapshot/usr/bin/cygwin1.dll
#5 0x61083180 in free ()
at
/netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/malloc_wrapper.cc:43
#6 0x00000010 in ?? ()
#7 0x00000000 in ?? ()

(gdb) thread 2
[Switching to thread 2 (Thread 5344.0x1ac8)]
#0 0x7767f99e in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7767f99e in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7767f99e in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x76793a5e in SetThreadPriority () from
/cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x6108759b in yield ()
at
/netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/miscfuncs.cc:244
#4 0x610d6eb4 in _cygtls::lock() () from
/home/thonermann/cygwin/snapshot/usr/bin/cygwin1.dll
#5 0x610302ee in sigpacket::setup_handler (this=0x95ac04,
handler=0x6102fdc0 <signal_exit(int, siginfo_t*)>, siga=...,
tls=0x28ce64)
at
/netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/exceptions.cc:796
#6 0x610319d8 in sigpacket::process (this=0x95ac04)
at
/netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/exceptions.cc:1266
#7 0x610dd2ac in wait_sig ()
at /netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/sigproc.cc:1389
#8 0x61003ea5 in cygthread::callfunc (this=0x6118b400,
issimplestub=<optimized out>)
at /netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/cygthread.cc:51
#9 0x6100442f in cygthread::stub (arg=0x6118b400)
at /netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/cygthread.cc:93
#10 0x6100538d in _cygtls::call2 (this=<optimized out>,
func=0x610043e0 <cygthread::stub(void*)>, arg=0x6118b400,
buf=0x6100551b <_cygtls::call(unsigned long (*)(void*, void*),
void*)+91>)
at /netrel/src/cygwin-snapshot-20130119-1/winsup/cygwin/cygtls.cc:99
#11 0x0095ff88 in ?? ()
#12 0x76a8339a in KERNEL32!BaseCleanupAppcompatCacheSupport ()
from /cygdrive/c/Windows/syswow64/kernel32.dll
#13 0x6118b400 in cygthread::exiting ()
from /home/thonermann/cygwin/snapshot/usr/bin/cygwin1.dll
#14 0x0095ffd4 in ?? ()
#15 0x77699ef2 in ntdll!RtlpNtSetValueKey () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#16 0x6118b400 in cygthread::exiting ()
from /home/thonermann/cygwin/snapshot/usr/bin/cygwin1.dll
#17 0x4449ca2d in ?? ()
#18 0x00000000 in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 5344.0x1c2c)]
#0 0x7767f91d in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7767f91d in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7767f91d in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x7678d4b5 in WriteFile () from
/cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x0000009c in ?? ()
#4 0x00000000 in ?? ()

(gdb) thread 4
[Switching to thread 4 (Thread 5344.0x718)]
#0 0x7767f8b1 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
(gdb) bt
#0 0x7767f8b1 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#1 0x7767f8b1 in ntdll!RtlUpdateClonedSRWLock () from
/cygdrive/c/Windows/SysWOW64/ntdll.dll
#2 0x76790a91 in WaitForSingleObjectEx () from
/cygdrive/c/Windows/syswow64/KERNELBASE.dll
#3 0x00000034 in ?? ()
#4 0x00000000 in ?? ()

Tom.
Tom Honermann
2013-01-23 03:20:20 UTC
Post by Tom Honermann
However, I was still able to reproduce another case. As before, one of
the processes is being left running when the rest are terminated. The
"abandoned" process appears to be in a live-lock state with two threads
(threads 1 and 2) running at 100%. Of particular interest is that each
time I press ctrl-c in the cmd.exe console this process was spawned
from, a new thread appears in the process even though this program is no
longer a foreground process and all other Cygwin processes have
terminated. The new threads never exit.
I noticed that more changes were checked in that looked like they might
address this, so I tested again with the latest snapshot (20130123).

I wasn't able to reproduce any of the symptoms I previously reported. Yay!

However, just as I was about to give up testing, I hit one more new
issue. One of the ctrl-c events sent bash into what appeared to be an
infinite loop emitting error messages like these:

11408974 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
11411584 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)

While this was going on, hitting ctrl-c had no discernible effect. I
resorted to killing the process via task manager.

This only occurred once, I wasn't able to get it to happen again.

Tom.
Christopher Faylor
2013-01-23 05:26:50 UTC
Post by Tom Honermann
Post by Tom Honermann
However, I was still able to reproduce another case. As before, one of
the processes is being left running when the rest are terminated. The
"abandoned" process appears to be in a live-lock state with two threads
(threads 1 and 2) running at 100%. Of particular interest is that each
time I press ctrl-c in the cmd.exe console this process was spawned
from, a new thread appears in the process even though this program is no
longer a foreground process and all other Cygwin processes have
terminated. The new threads never exit.
I noticed that more changes were checked in that looked like they might
address this, so I tested again with the latest snapshot (20130123).
I wasn't able to reproduce any of the symptoms I previously reported. Yay!
However, just as I was about to give up testing, I hit one more new
issue. One of the ctrl-c events sent bash into what appeared to be an
infinite loop emitting error messages like these:
11408974 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
11411584 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
While this was going on, hitting ctrl-c had no discernible effect. I
resorted to killing the process via task manager.
This only occurred once, I wasn't able to get it to happen again.
Was there a stackdump?

cgf
Tom Honermann
2013-01-23 18:17:45 UTC
Post by Christopher Faylor
Post by Tom Honermann
However, just as I was about to give up testing, I hit one more new
issue. One of the ctrl-c events sent bash into what appeared to be an
infinite loop emitting error messages like these:
11408974 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
11411584 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
While this was going on, hitting ctrl-c had no discernible effect. I
resorted to killing the process via task manager.
This only occurred once, I wasn't able to get it to happen again.
Was there a stackdump?
Unfortunately no. And I should have grabbed a stack trace, but I didn't.

I tried to reproduce again today using the same snapshot (20130123), but
didn't have any luck.

I see you checked in a change to detect the infinite recursion. I'd
call that good enough.

I didn't encounter any further anomalies that I can positively attribute
to Cygwin. I did encounter a few that I suspect are cmd.exe issues that
I'll report below. I'm only reporting these for the curious, I am not
requesting any action be taken with regard to these.

1) Sometimes a ctrl-C was ignored. I would see ^C echoed to the
console, but the test case would keep running without prompting to
"Terminate batch job".

2) Sometimes cmd.exe would issue an error message about a syntax error
in the .bat file following pressing ctrl-C and all processes would exit
without prompting to "Terminate batch job".

Thank you for your prompt attention to all of these issues, Chris! I
find it very impressive how responsive the Cygwin maintainers are to
reports like these!

Tom.
Christopher Faylor
2013-01-23 18:35:39 UTC
Post by Tom Honermann
Post by Christopher Faylor
Post by Tom Honermann
However, just as I was about to give up testing, I hit one more new
issue. One of the ctrl-c events sent bash into what appeared to be an
infinite loop emitting error messages like these:
11408974 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
11411584 [unknown (0x144C)] bash 1752 exception::handle: Error while
dumping state (probably corrupted stack)
While this was going on, hitting ctrl-c had no discernible effect. I
resorted to killing the process via task manager.
This only occurred once, I wasn't able to get it to happen again.
Was there a stackdump?
Unfortunately no. And I should have grabbed a stack trace, but I didn't.
I tried to reproduce again today using the same snapshot (20130123), but
didn't have any luck.
I see you checked in a change to detect the infinite recursion. I'd
call that good enough.
That probably is relatively ok given that you're trying to terminate the
process anyway but it would be nice to know why the stackdump was
happening.
Post by Tom Honermann
Thank you for your prompt attention to all of these issues Chris! I
find it very impressive how responsive the Cygwin maintainers are to
reports like these!
You're very welcome. Thanks for hanging in there throughout this
process.

FYI, as it turns out, working around the thread exit problem uncovered a
whole host of issues with locking/signals/exit that have been lurking in
the code for a while. So, this exercise should have made a better
Cygwin in the long run. It may even have made Cygwin a little faster.

cgf
Tom Honermann
2013-01-24 04:11:47 UTC
Post by Christopher Faylor
Post by Tom Honermann
I see you checked in a change to detect the infinite recursion. I'd
call that good enough.
That probably is relatively ok given that you're trying to terminate the
process anyway but it would be nice to know why the stackdump was
happening.
Agreed. I'll investigate and report any future cases I encounter.

Tom.
Christopher Faylor
2013-01-16 19:14:33 UTC
Post by Tom Honermann
Post by Christopher Faylor
Post by Tom Honermann
I noticed that some changes were checked in related to signal handling
and process termination recently, so I downloaded the most recent
snapshot (20130114) and tested again. I was still able to produce
hanging processes (including hangs of strace.exe) by hitting ctrl-c in a
mintty window while Cygwin processes ran in an infinite loop inside of a
.bat file. I was able to produce a hang ~1 out of 20 times.
How does one run a .bat file inside mintty which handles CTRL-C? AFAIK,
a CTRL-C will just cause the .bat file to exit when run under bash.
1) Install the latest snapshot
2) Copy bash.exe, false.exe, and their dependent DLLs from a Cygwin
install into the usr/bin directory of the snapshot. For me this
consisted of:
bash.exe
cygintl-8.dll
cygiconv-2.dll
cygreadline7.dll
cygncurses-10.dll
cygncursesw-10.dll
cyggcc_s-1.dll
false.exe
3) Create 'test.bat' in the usr/bin directory of the snapshot with the
following contents:
@echo off
setlocal
set PATH=%CD%;%PATH%
:loop
echo test...
bash -c false
if not errorlevel 1 (
echo exiting...
exit /B 1
)
goto loop
4) Launch mintty using an existing Cygwin installation. Naturally, this
will run a shell from the existing Cygwin install.
5) Change directories to the usr/bin directory of the snapshot.
6) Start task manager or some other process monitoring tool and keep it
running. Run ./test.bat from the Cygwin shell running within mintty and
interrupt it with ctrl-c. Repeat until you see a new bash.exe or
false.exe process persisting following the interrupt. You'll likely
have multiple bash processes running. If you are able to reproduce, you
should see one with a command line of 'bash -c false'. Alternatively,
if your process monitoring tool shows the path to the executable, you'll
be able to identify it as the one from the usr/bin directory of the
snapshot.
Again, if I hit CTRL-C while running ./test.bat in mintty then test.bat
exits immediately, as expected. Hitting ctrl-c repeatedly after that
point gives me a new bash prompt.

Non-exiting behavior was a symptom of a previous snapshot which was
mentioned here:

http://cygwin.com/ml/cygwin/2013-01/msg00164.html
Post by Tom Honermann
I rather doubt that the use of a .bat file is necessary to reproduce
this hang, but I haven't tried producing a test case that doesn't use a
.bat file. This is a test case I was using when debugging the
intermittent incorrect exit code issue.
Btw, an incorrect exit code is still a possibility if you're running
from a cmd shell since it is possible to interrupt a cygwin process
before cygwin is entirely set up. That will cause a normal windows
CTRL-C exit.

cgf
Tom Honermann
2013-01-16 20:24:15 UTC
Post by Christopher Faylor
Again, if I hit CTRL-C while running ./test.bat in mintty then test.bat
exits immediately, as expected. Hitting ctrl-c repeatedly after that
point gives me a new bash prompt.
Yes, that is what is expected to happen. What I am reporting is that
interrupting test.bat sometimes leaves hung processes still running
after control is returned to the shell.
Post by Christopher Faylor
Non-exiting behavior was a symptom of a previous snapshot which was
mentioned here:
http://cygwin.com/ml/cygwin/2013-01/msg00164.html
I'm testing a newer snapshot than that one. I've been testing with
20130114 which Thomas reported as no longer having that problem here:

http://cygwin.com/ml/cygwin/2013-01/msg00196.html
Post by Christopher Faylor
Post by Tom Honermann
I rather doubt that the use of a .bat file is necessary to reproduce
this hang, but I haven't tried producing a test case that doesn't use a
.bat file. This is a test case I was using when debugging the
intermittent incorrect exit code issue.
Btw, an incorrect exit code is still a possibility if you're running
from a cmd shell since it is possible to interrupt a cygwin process
before cygwin is entirely set up. That will cause a normal windows
CTRL-C exit.
Yup, that is understood and expected.

Tom.
Tom Honermann
2012-12-21 20:01:10 UTC
Post by Tom Honermann
I don't know which Windows releases are affected by this. I've only
reproduced the problem (outside of Cygwin) with Wow64 processes running
on 64-bit Windows 7. I haven't yet tried elsewhere.
I was able to reproduce the issue with a 64-bit executable compiled with
the test case in the parent email using Microsoft's Visual Studio 2010
x64 compiler. This issue does not appear to be specific to support for
running 32-bit processes on 64-bit Windows via Wow64.

I have not yet tried to reproduce this on any release of Windows other
than 64-bit Windows 7 SP1. I am curious about what other Windows
releases are affected. Please reply if you try the test case and are
able to reproduce the problem on other Windows releases. So far, I'm
only aware of the issue being reproduced on multi-processor systems. I
suspect the problem can occur on single-processor systems as well, but
is much less likely to.

Tom.
Tom Honermann
2013-11-14 04:01:50 UTC
Post by Tom Honermann
I spent most of the week debugging this issue. This appears to be a
defect in Windows. I can reproduce the issue without Cygwin. I can't
rule out other third party kernel mode software possibly contributing to
the issue. A simple change to Cygwin works around the problem for me.
I don't know which Windows releases are affected by this. I've only
reproduced the problem (outside of Cygwin) with Wow64 processes running
on 64-bit Windows 7. I haven't yet tried elsewhere.
The problem appears to be a race condition involving concurrent calls to
TerminateProcess() and ExitThread(). The example code below minimally
mimics the threads created and exit process/thread calls that are
performed when running Cygwin's false.exe. The primary thread exits the
process via TerminateProcess() ala pinfo::exit() in
winsup/cygwin/pinfo.cc. The secondary thread exits itself via
ExitThread() ala Cygwin's signal processing thread function, wait_sig(),
in winsup/cygwin/sigproc.cc.
When the race condition results in the undesirable outcome, the exit
code for the process is set to the exit code for the secondary thread's
call to ExitThread(). I can only speculate at this point, but my guess
is that the TerminateProcess() code disassociates the calling thread
from the process before other threads are stopped such that
ExitThread(), concurrently running in another thread, may determine that
the calling thread is the last thread of the process and overwrite the
process exit code.
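The guess above can be written down as a toy model. This is only a minimal sketch in POSIX shell, assuming the ordering described in the paragraph; it does not reflect known Windows internals, and terminate_begin, terminate_finish, exit_thread, live_threads and process_exit_code are names invented here for illustration.

```shell
#!/bin/sh
# Toy model of the suspected race. live_threads and process_exit_code are
# stand-ins for kernel bookkeeping; none of this is real Windows behaviour.

terminate_begin() {    # TerminateProcess(): record the code, detach the caller
    process_exit_code=$1
    live_threads=$((live_threads - 1))
}

terminate_finish() {   # TerminateProcess(): stop whatever threads remain
    live_threads=0
}

exit_thread() {        # ExitThread(): the last thread out sets the process code
    if [ "$live_threads" -gt 0 ]; then
        live_threads=$((live_threads - 1))
        if [ "$live_threads" -eq 0 ]; then
            process_exit_code=$1
        fi
    fi
}

simulate() {           # $1 = ordering: clean | early | race
    live_threads=2
    process_exit_code=unset
    case $1 in
        clean) terminate_begin 1; terminate_finish ;;
        early) exit_thread 2; terminate_begin 1; terminate_finish ;;
        race)  terminate_begin 1; exit_thread 2; terminate_finish ;;
        *)     echo "unknown ordering: $1" >&2; return 2 ;;
    esac
    echo "$process_exit_code"
}

for ordering in clean early race; do
    echo "$ordering: $(simulate "$ordering")"
done
```

Only the "race" ordering reports the secondary thread's code (2); the other two report the code passed to TerminateProcess() (1).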
The issue also reproduces if ExitProcess() is called in place of
TerminateProcess(). The test case below only uses TerminateProcess()
because that is what Cygwin does.
Source code to reproduce the issue follows. Again, Cygwin is not
required to reproduce the problem. For my own testing, I compiled the
code using Microsoft's Visual Studio 2010 x86 compiler with the command
'cl /Fetest-exit-code.exe test-exit-code.cpp'
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

DWORD WINAPI SecondaryThread(
    LPVOID lpParameter)
{
    Sleep(1);
    ExitThread(2);
}

int main() {
    HANDLE hSecondaryThread = CreateThread(
        NULL,               // lpThreadAttributes
        0,                  // dwStackSize
        SecondaryThread,    // lpStartAddress
        (LPVOID)0,          // lpParameter
        0,                  // dwCreationFlags
        NULL);              // lpThreadId
    if (!hSecondaryThread) {
        fprintf(stderr, "CreateThread failed. GLE=%lu\n",
                (unsigned long)GetLastError());
        exit(127);
    }
    Sleep(1);
    if (!TerminateProcess(GetCurrentProcess(), 1)) {
        fprintf(stderr, "TerminateProcess failed. GLE=%lu\n",
                (unsigned long)GetLastError());
        exit(127);
    }
    return 0;
}

@echo off
setlocal
:loop
echo test...
test-exit-code.exe
if %ERRORLEVEL% NEQ 1 (
    echo test-exit-code.exe returned %ERRORLEVEL%
    exit /B 1
)
goto loop

test.bat should run indefinitely. The amount of time it takes to fail
on my machine (64-bit Windows 7 running in a VMware Workstation 8 VM
under Kubuntu 12.04 on a Lenovo T420 Intel i7-2640M 2 processor laptop)
varies considerably. I had one run fail in less than 10 iterations, but
most of the time it has taken upwards of 5 minutes to get a failure.
The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to ExitThread()
in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
doesn't exit until the process is exiting anyway, the call to Sleep()
does not adversely affect shutdown. The thread just gets terminated
while in the call to Sleep() instead of exiting before the process is
terminated or getting terminated while still in the call to
ExitThread(). A better solution might be to avoid the thread exiting at
all (so long as it can't get terminated while holding critical
resources), or to have the process exiting thread wait on it. Neither
of these is ideal. Orderly shutdown of multi-threaded processes is
really hard to do correctly on Windows.
Since the exit code for the signal processing thread is not used, having
the wait_sig() thread (and any other threads that could potentially
concurrently exit with another thread) exit with a special status value
such as STATUS_THREAD_IS_TERMINATING (0xC000004BL) would enable
diagnosis of this issue as any process exit code matching this would be
a likely indicator that this issue was encountered.
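If the thread exit status were changed as suggested, a build script could recognise it. The following is a minimal sketch in POSIX shell, assuming the exit code is observed as the signed decimal that cmd.exe shows in %ERRORLEVEL%. describe_errorlevel is a hypothetical helper; STATUS_CONTROL_C_EXIT (0xC000013A) is included only because it is the other status a caller is likely to meet when a process is interrupted.

```shell
#!/bin/sh
# describe_errorlevel SIGNED_DECIMAL
# Converts a signed 32-bit decimal (the form cmd.exe prints for %ERRORLEVEL%)
# to unsigned hex and names the two status codes relevant to this thread.
# Assumes the shell's arithmetic is wider than 32 bits.
describe_errorlevel() {
    hex=$(printf '0x%08X' $(( $1 & 0xFFFFFFFF )))
    case $hex in
        0xC000004B) echo "$hex STATUS_THREAD_IS_TERMINATING" ;;
        0xC000013A) echo "$hex STATUS_CONTROL_C_EXIT" ;;
        *)          echo "$hex" ;;
    esac
}

describe_errorlevel -1073741749
```

An ordinary exit status such as 1 is passed through as plain hex, so the helper can sit in front of any existing check.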
As is, when this race condition results in the undesirable outcome,
since the signal processing thread exits with a status of 0, the exit
status of the process is 0. This explains why false.exe works so well
to reproduce the issue. It would be impossible to produce a negative
test using true.exe.
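The same reasoning applies to any command used as a probe: its expected status has to differ from the status the stray thread exit would produce. Here is a minimal sketch in POSIX shell; count_unexpected is a hypothetical helper, not part of the original test case.

```shell
#!/bin/sh
# count_unexpected EXPECTED ITERATIONS COMMAND [ARGS...]
# Runs COMMAND ITERATIONS times and prints how many runs returned an exit
# status other than EXPECTED.
count_unexpected() {
    expected=$1
    iterations=$2
    shift 2
    unexpected=0
    i=0
    while [ "$i" -lt "$iterations" ]; do
        # Capture the status inside an if so the helper also works under set -e.
        if "$@"; then status=0; else status=$?; fi
        if [ "$status" -ne "$expected" ]; then
            unexpected=$((unexpected + 1))
        fi
        i=$((i + 1))
    done
    echo "$unexpected"
}

# false should always return 1, so a run that comes back as 0 is counted.
count_unexpected 1 100 false
```

With true as the probe the expected status would be 0, which is exactly what the failure produces, so the count could never rise.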
Tom.
Time passes...

I worked with some former colleagues to report this issue to Microsoft.
Windows 8.1 and Windows Server 2012 R2 contain a fix that addresses
the test case above. A hotfix has been made available for Windows 7 SP1
and Windows Server 2008 R2. Should anyone desire a hotfix for other
versions of Windows, it will be necessary to open a case with Microsoft
to request it.

http://support.microsoft.com/kb/2875501

Tom.
Corinna Vinschen
2013-11-14 09:19:50 UTC
Hi Tom,
Post by Tom Honermann
Post by Tom Honermann
[...]
When the race condition results in the undesirable outcome, the exit
code for the process is set to the exit code for the secondary thread's
call to ExitThread(). I can only speculate at this point, but my guess
is that the TerminateProcess() code disassociates the calling thread
from the process before other threads are stopped such that
ExitThread(), concurrently running in another thread, may determine that
the calling thread is the last thread of the process and overwrite the
process exit code.
[...]
Time passes...
I worked with some former colleagues to report this issue to
Microsoft. Windows 8.1 and Windows Server 2012 R2 contain a fix
that addresses the test case above. A hotfix has been made
available for Windows 7 SP1 and Windows Server 2008 R2. Should
anyone desire a hotfix for other versions of Windows, it will be
necessary to open a case with Microsoft to request it.
http://support.microsoft.com/kb/2875501
Tom.
thanks for letting us know!

I'm very glad to read that this is an OS bug and a fix is available.

At least partially. I'm a bit confused. As far as I understand it this
is the situation now:

Vista/2008 and earlier: no fix available.
W7/2008R2: only hotfix for manual installation
W8/2012: no fix available.
W8.1/2012R2: fixed.

Did I get that right? That sounds a bit weird...


Thanks again,
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Maintainer cygwin AT cygwin DOT com
Red Hat
Tom Honermann
2013-11-14 15:20:30 UTC
Post by Corinna Vinschen
thanks for letting us know!
You're welcome :)
Post by Corinna Vinschen
I'm very glad to read that this is an OS bug and a fix is available.
At least partially. I'm a bit confused. As far as I understand it this
is the situation now:
Vista/2008 and earlier: no fix available.
W7/2008R2: only hotfix for manual installation
W8/2012: no fix available.
W8.1/2012R2: fixed.
Did I get that right? That sounds a bit weird...
That is how I understand it. Microsoft requires a Premier Support
agreement in order to request hotfixes and I am not a party to any such
agreement. So, I worked with former colleagues at another company that
does have a Premier Support agreement and that I knew were also
experiencing the issue. They only requested a hotfix for Windows 7 SP1
and Windows 2008 R2 as those are the only Windows releases they were
concerned about having a fix for. The result: it is fixed in currently
shipping versions and a hotfix is available for those specific releases,
but other releases remain vulnerable. Addressing those releases will
presumably require someone with access to a Premier Support agreement to
request additional hotfix releases.

Tom.
Denis Excoffier
2013-11-15 18:53:26 UTC
Post by Tom Honermann
The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to ExitThread()
in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
doesn't exit until the process is exiting anyway, the call to Sleep()
does not adversely affect shutdown. The thread just gets terminated
while in the call to Sleep() instead of exiting before the process is
terminated or getting terminated while still in the call to
ExitThread(). A better solution might be to avoid the thread exiting at
all (so long as it can't get terminated while holding critical
resources), or to have the process exiting thread wait on it. Neither
of these is ideal. Orderly shutdown of multi-threaded processes is
really hard to do correctly on Windows.
I experience some problems on Windows 7 (not on XP) that may be related.
I would like to test your workaround, but sigproc.cc has changed a lot since
then; there is now an exit_thread function with the comment "Exit the current
thread very carefully.". I tried to insert Sleep(1000) at the end of
exit_thread, immediately before "ExitThread (0)", but this yielded no
change at all.

Could someone be kind enough to update the workaround for modern sigproc.cc?

Very briefly, my problem is that when i "tar xf --use-compress-program=xz", i
get:
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
and the last file of the archive is truncated at some 512-byte block boundary. This
occurs on Windows 7 (not on XP); with xz-5.1.3alpha (not with xz-5.1.2alpha or
xz-5.0.5); never on most tar.xz files; almost always on some (rare) tar.xz files
(one notable example is bc-1.06.95.tar.bz2 bunzip2’ed and then xz’ed); depends
on the .tar file itself, not on the option (like -9e, -0) used to create the
.tar.xz; never with "tar tf"; and with all tar’s i have tested. The return code
of all the involved xz -d commands is always zero though. Perhaps after all, this
is unrelated?

Thank you.

Regards,

Denis Excoffier.
Christopher Faylor
2013-11-15 19:21:38 UTC
Post by Denis Excoffier
Post by Tom Honermann
The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to ExitThread()
in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
doesn't exit until the process is exiting anyway, the call to Sleep()
does not adversely affect shutdown. The thread just gets terminated
while in the call to Sleep() instead of exiting before the process is
terminated or getting terminated while still in the call to
ExitThread(). A better solution might be to avoid the thread exiting at
all (so long as it can't get terminated while holding critical
resources), or to have the process exiting thread wait on it. Neither
of these is ideal. Orderly shutdown of multi-threaded processes is
really hard to do correctly on Windows.
I experience some problems on Windows 7 (not on XP) that may be related.
I would like to test your workaround, but sigproc.cc has changed a lot since
then; there is now an exit_thread function with the comment "Exit the current
thread very carefully.". I tried to insert Sleep(1000) at the end of
exit_thread, immediately before "ExitThread (0)", but this yielded no
change at all.
Could someone be kind enough to update the workaround for modern sigproc.cc?
You apparently are misunderstanding the whole point of the changes to
sigproc.cc. They were to work around this very problem.

cgf
Denis Excoffier
2013-11-17 13:29:45 UTC
Post by Christopher Faylor
Post by Denis Excoffier
Post by Tom Honermann
The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to ExitThread()
in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
doesn't exit until the process is exiting anyway, the call to Sleep()
does not adversely affect shutdown. The thread just gets terminated
while in the call to Sleep() instead of exiting before the process is
terminated or getting terminated while still in the call to
ExitThread(). A better solution might be to avoid the thread exiting at
all (so long as it can't get terminated while holding critical
resources), or to have the process exiting thread wait on it. Neither
of these is ideal. Orderly shutdown of multi-threaded processes is
really hard to do correctly on Windows.
I experience some problems on Windows 7 (not on XP) that may be related.
I would like to test your workaround, but sigproc.cc has changed a lot since
then; there is now an exit_thread function with the comment "Exit the current
thread very carefully.". I tried to insert Sleep(1000) at the end of
exit_thread, immediately before "ExitThread (0)", but this yielded no
change at all.
Could someone be kind enough to update the workaround for modern sigproc.cc?
You apparently are misunderstanding the whole point of the changes to
sigproc.cc. They were to work around this very problem.
Oh, i didn’t remember that. Then this must be the antivirus or something else
i have to cope with.

Regards,

Denis Excoffier.
Tom Honermann
2013-11-15 22:15:29 UTC
Post by Denis Excoffier
Post by Tom Honermann
The workaround I implemented within Cygwin was simple and sloppy. I
added a call to Sleep(1000) immediately before the call to ExitThread()
in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
doesn't exit until the process is exiting anyway, the call to Sleep()
does not adversely affect shutdown. The thread just gets terminated
while in the call to Sleep() instead of exiting before the process is
terminated or getting terminated while still in the call to
ExitThread(). A better solution might be to avoid the thread exiting at
all (so long as it can't get terminated while holding critical
resources), or to have the process exiting thread wait on it. Neither
of these is ideal. Orderly shutdown of multi-threaded processes is
really hard to do correctly on Windows.
I experience some problems on Windows 7 (not on XP) that may be related.
I would like to test your workaround, but sigproc.cc has changed a lot since
then; there is now an exit_thread function with the comment "Exit the current
thread very carefully.". I tried to insert Sleep(1000) at the end of
exit_thread, immediately before "ExitThread (0)", but this yielded no
change at all.
Could someone be kind enough to update the workaround for modern sigproc.cc?
Hi Denis. Cygwin versions 1.7.18 and later contain a workaround for
this issue. If you are running something older than that, I highly
encourage you to upgrade. Many stability-related fixes have been made
in more recent versions.
Post by Denis Excoffier
Very briefly, my problem is that when i "tar xf --use-compress-program=xz", i
get:
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
and the last file of the archive is truncated at some 512-byte block boundary. This
occurs on Windows 7 (not on XP); with xz-5.1.3alpha (not with xz-5.1.2alpha or
xz-5.0.5); never on most tar.xz files; almost always on some (rare) tar.xz files
(one notable example is bc-1.06.95.tar.bz2 bunzip2’ed and then xz’ed); depends
on the .tar file itself, not on the option (like -9e, -0) used to create the
.tar.xz; never with "tar tf"; and with all tar’s i have tested. The return code
of all the involved xz -d commands is always zero though. Perhaps after all, this
is unrelated?
This doesn't sound related to the intermittent incorrect exit code
defect to me. I'm afraid I don't have other explanations for what you
are experiencing though.

Tom.
Lasse Collin
2013-11-25 19:58:42 UTC
Post by Denis Excoffier
Very briefly, my problem is that when i "tar xf
--use-compress-program=xz", i get:
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
and the last file of the archive is truncated at some 512-byte block boundary.
This occurs on Windows 7 (not on XP); with xz-5.1.3alpha (not with
xz-5.1.2alpha or xz-5.0.5); never on most tar.xz files; almost always
on some (rare) tar.xz files (one notable example is
bc-1.06.95.tar.bz2 bunzip2’ed and then xz’ed); depends on the .tar
file itself, not on the option (like -9e, -0) used to create
the .tar.xz; never with "tar tf"; and with all tar’s i have tested.
The return code of all the involved xz -d commands is always zero
though. Perhaps after all, this is unrelated?
xz 5.1.3alpha has some new file I/O code that uses non-blocking file
descriptors, the self-pipe trick, and poll(). It's there to fix a race
condition in signal handling. Since you say it works with 5.1.2alpha, I
wonder whether there is a bug in the new I/O code in xz or whether the code
in xz triggers a bug in Cygwin or Windows.

If you haven't already tried, please compile both 5.1.2alpha and
5.1.3alpha from source while keeping everything else unchanged, and see
if the bug really only occurs with 5.1.3alpha.
--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Denis Excoffier
2013-11-25 23:11:20 UTC
Post by Lasse Collin
Post by Denis Excoffier
Very briefly, my problem is that when i "tar xf
--use-compress-program=xz", i get:
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
and the last file of the archive is truncated at some 512-byte block boundary.
This occurs on Windows 7 (not on XP); with xz-5.1.3alpha (not with
xz-5.1.2alpha or xz-5.0.5); never on most tar.xz files; almost always
on some (rare) tar.xz files (one notable example is
bc-1.06.95.tar.bz2 bunzip2’ed and then xz’ed); depends on the .tar
file itself, not on the option (like -9e, -0) used to create
the .tar.xz; never with "tar tf"; and with all tar’s i have tested.
The return code of all the involved xz -d commands is always zero
though. Perhaps after all, this is unrelated?
xz 5.1.3alpha has some new file I/O code that uses non-blocking file
descriptors, the self-pipe trick, and poll(). It's there to fix a race
condition in signal handling. Since you say it works with 5.1.2alpha, I
wonder whether there is a bug in the new I/O code in xz or whether the code
in xz triggers a bug in Cygwin or Windows.
If you haven't already tried, please compile both 5.1.2alpha and
5.1.3alpha from source while keeping everything else unchanged, and see
if the bug really only occurs with 5.1.3alpha.
Already done. I did some strace-ing, and since i’m not so fluent with the
result, i’ll send it there in a while (when i’m back on cygwin) if someone is
interested. But the bug (contrary to what i said before) also _sometimes_
occurs with 5.1.2alpha or 5.0.5 and this makes me think now that:

a) my antivirus-anti-intrusion-whatever-software (that i can’t remove of
course) creates some kind of "background noise" where a certain percentage
of such ‘tar xf --use-compress-program’ commands will always fail

b) nevertheless, xz-5.1.3alpha (with its new file I/O code etc.) triggers some
atypical configuration inside the antivirus that drastically increases the
percentage, making the failure almost certain for some files.

It is not surprising that i cannot observe the failure on XP since
i do not have this particular antivirus on XP.

You might however want some more detail. The test plan is: perform
'tar xf file.xz --use-compress-program=xz -bx', where x varies from 1 to 200.
There are two kinds of results:

1) The usual situation is where you observe at most 1 or 2 failures (out of 200).
If you run the same plan again, you still get at most 1 or 2 failures, usually not
at the same values of x. Very often there is no failure at all. Very often
-b20 (the default) does not fail.
-> this situation occurs with 5.1.2alpha or 5.0.5 with all input files, or with
5.1.3alpha with most input files.

2) The pathological situation is where you observe, say, 30 failures (out of 200).
If you run the same plan again, you get nearly the same failures, i.e. mostly the same
ones, with some minor variability analogous to the variability observed in the usual
situation above.
-> this situation occurs with 5.1.3alpha only, with some selected input files,
e.g. bc-1.06.95.tar.xz (see above how to create bc-1.06.95.tar.xz)
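
The test plan above could be scripted roughly as follows. This is only a sketch under assumptions: the function name run_plan and the EXTRACT override are made up for illustration, a failure is taken to mean a non-zero exit status from tar, and the -C option with a fresh temporary directory is an addition so that successive runs do not overwrite each other.

```sh
#!/bin/sh
# run_plan ARCHIVE [MAX]
# Extract ARCHIVE once for every blocking factor from 1 to MAX (default 200)
# and print the blocking factors for which the extraction command failed.
# EXTRACT may be set to substitute another command for tar (hypothetical hook).
run_plan() {
    archive=$1
    max=${2:-200}
    b=1
    while [ "$b" -le "$max" ]; do
        # Fresh scratch directory per run (an addition to the original plan).
        dir=$(mktemp -d) || return 1
        if ! ${EXTRACT:-tar} xf "$archive" --use-compress-program=xz \
                -b"$b" -C "$dir" 2>/dev/null; then
            echo "$b"
        fi
        rm -rf "$dir"
        b=$((b + 1))
    done
}
```

Running "run_plan bc-1.06.95.tar.xz > run1.txt" twice and comparing the two lists would show whether the same blocking factors fail each time (the pathological case) or different ones (the usual case).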

When it fails (usually or pathologically), the last file of the archive gets
truncated (see above), and _this_ is strange behaviour for an antivirus. After
all, perhaps some flush() or similar is missing inside 5.1.3alpha.

Denis Excoffier.
Denis Excoffier
2013-11-26 21:09:19 UTC
Already done. I did some strace-ing, and since i’m not so fluent with the
result, i’ll send it there in a while (when i’m back on cygwin) if someone is
interested. But the bug (contrary to what i said before) also _sometimes_
Here is the result of strace (with minor editing). I kept the whole strace (12000 lines),
because xz ends rather early (around line 10000).

2bc-1.06.95.tar.xz is a file built using bunzip2 | xz -c

Note the presence of Win32 error 109 (broken pipe).

Regards,

Denis Excoffier.

This is part1. part2 follows in a few minutes.
Christopher Faylor
2013-11-26 23:36:03 UTC
Post by Denis Excoffier
Already done. I did some strace-ing, and since i’m not so fluent with the
result, i’ll send it there in a while (when i’m back on cygwin) if someone is
interested. But the bug (contrary to what i said before) also _sometimes_
Here is the result of strace (with minor editing). I kept the whole strace (12000 lines),
because xz ends rather early (around line 10000).
2bc-1.06.95.tar.xz is a file built using bunzip2 | xz -c
Note the presence of Win32 error 109 (broken pipe).
Please don't post unsolicited straces to this list. No one is going to
be looking at them and they just clog up the mailing list.

cgf
Denis Excoffier
2013-11-26 21:09:22 UTC
Already done. I did some strace-ing, and since i’m not so fluent with the
result, i’ll send it there in a while (when i’m back on cygwin) if someone is
interested. But the bug (contrary to what i said before) also _sometimes_
This is part2. Just cat typescript-part1 typescript-part2.
Lasse Collin
2013-12-01 13:24:15 UTC
Post by Denis Excoffier
Post by Lasse Collin
If you haven't already tried, please compile both 5.1.2alpha and
5.1.3alpha from source while keeping everything else unchanged, and
see if the bug really only occurs with 5.1.3alpha.
Already done. I did some strace-ing, and since i’m not so fluent with
the result, i’ll send it there in a while (when i’m back on cygwin)
if someone is interested. But the bug (contrary to what i said
before) also _sometimes_ occurs with 5.1.2alpha or 5.0.5 and this
makes me think now that:
a) my antivirus-anti-intrusion-whatever-software (that i can’t remove
of course) creates some kind of "background noise" where a certain
percentage of such ‘tar xf --use-compress-program’ commands will
always fail
b) nevertheless, xz-5.1.3alpha (with its new file I/O code etc.)
triggers some atypical configuration inside the antivirus that
drastically increases the percentage, making the failure almost
certain for some files.
It is not surprising that i cannot observe the failure on XP since
i do not have this particular antivirus on XP.
OK, so the new I/O code in xz probably isn't the problem even if it may
affect how easily the actual problem gets triggered.

[...]
Post by Denis Excoffier
When it fails (usually or pathologically), the last file of the
archive gets truncated (see above), and _this_ is strange behaviour
for an antivirus. After all, perhaps some flush() or similar is
missing inside 5.1.3alpha.
xz uses write() which uses a file descriptor argument, so there is
nothing to flush separately. xz just has to write() everything.

When used with tar, xz writes to standard output (STDOUT_FILENO) which
with tar is a pipe. When xz finishes, it closes its end (the writer end)
of the pipe.

With xz 5.1.3alpha, the O_NONBLOCK flag is set for STDIN_FILENO and
STDOUT_FILENO if the flag wasn't already set. If xz set the flag, it
will unset it before closing the file descriptor. The setting and
unsetting can be seen in the trace you sent and it seems to work
correctly. I don't have a guess if these fcntl() calls might cause the
difference between 5.1.3alpha and other versions, but it doesn't sound
too important since the bug occurs in some form with all versions.

From the trace file it seems that the last write() from xz gets lost.
xz first makes 173 writes of 8192 bytes and then one 6144-byte write,
totalling 1,423,360 bytes. tar gets 1,417,216 from xz, that is, 6144
bytes too little.
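
The byte counts above can be checked with shell arithmetic. This is only a sketch using the figures quoted from the trace; the variable names are illustrative.

```sh
#!/bin/sh
# Figures taken from the trace discussed above.
full_writes=173      # number of full-size write() calls made by xz
full_size=8192       # bytes per full-size write()
last_write=6144      # bytes in the final write()
received=1417216     # bytes tar received from xz

written=$((full_writes * full_size + last_write))
missing=$((written - received))

echo "written=$written received=$received missing=$missing"
# Total of the full-size writes alone, for comparison with what tar received.
echo "full_writes_total=$((full_writes * full_size))"
```

The shortfall equals the size of the final write, and the 173 full-size writes alone add up to exactly what tar received, which is consistent with only the last write() being lost.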

Since things go wrong with old xz versions that don't use non-blocking
I/O, I would expect you to see similar issues with other compressors
too. Maybe it would be worth testing with gzip and bzip2 in the same
way you did with xz 5.0.5.
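
One possible way to narrow this down without tar, sketched here under assumptions: for each decompressor, compare how many bytes a reader receives through a pipe with how many bytes the same command writes to a regular file. The helper names are made up for illustration, wc stands in for tar as the pipe reader (so this is not the same test as the tar runs above), and the archive names in the usage note are placeholders.

```sh
#!/bin/sh
# count_via_pipe CMD [ARGS...]
# Print the number of bytes a reader receives from CMD through a pipe.
count_via_pipe() {
    "$@" | wc -c | tr -d ' '
}

# count_via_file CMD [ARGS...]
# Print the number of bytes CMD produces when stdout is a regular file.
count_via_file() {
    tmp=$(mktemp) || return 1
    "$@" > "$tmp"
    wc -c < "$tmp" | tr -d ' '
    rm -f "$tmp"
}

# compare_counts CMD [ARGS...]
# Print both counts; return non-zero if they differ.
compare_counts() {
    p=$(count_via_pipe "$@")
    f=$(count_via_file "$@")
    if [ "$p" = "$f" ]; then
        echo "ok pipe=$p file=$f"
    else
        echo "MISMATCH pipe=$p file=$f"
        return 1
    fi
}
```

For example, "compare_counts gzip -dc file.tar.gz", "compare_counts bzip2 -dc file.tar.bz2" and "compare_counts xz -dc file.tar.xz" could each be repeated many times. A mismatch with gzip or bzip2 would suggest the loss is not specific to xz.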
--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode