Intermittent failures retrieving process exit codes

Discussion:

Tom Honermann

2012-12-07 19:54:52 UTC

I've witnessed intermittent failures in multiple build systems while
working at multiple companies using Cygwin bash and make as part of the
build system but using non-Cygwin compilers and other tools. The
intermittent failures occur when a process appears to complete
successfully, but the process retrieving its exit code receives an
unexpected value. This has been seen on many different Cygwin versions
across several years.

Several reports of similar sounding issues can be found online:
- http://cygwin.1069669.n5.nabble.com/Cygwin-1-7-x-on-Windows-7-Exit-statuses-of-Win32-executables-are-sometimes-wrong-td20186.html
- http://stackoverflow.com/questions/9769256/intermittent-failures-under-cygwin-possibly-related-to-candle-and-or-make

I recently was able to produce a very small test case that reproduces
this issue reliably on some machines:

$ cat test.sh
#!/bin/sh

while [ 1 ]; do
    echo "test..."
    if cmd /c "false"; then
        echo "exiting..."
        exit 1
    fi
done

An invocation of test.sh should run indefinitely, but fails very quickly
on one of my machines:

$ ./test.sh
test...
test...
exiting...

$ ./test.sh
test...
test...
test...
test...
exiting...

$ ./test.sh
test...
exiting...
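
The loop above only shows that cmd returned success at least once. A variant that records the actual status of every iteration may make it easier to see what value comes back and how often. The following is a minimal sketch, not part of the original test case; the function name count_unexpected and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original test.sh): run a command
# repeatedly and report every iteration whose exit status differs from
# the expected one.
# usage: count_unexpected EXPECTED ITERATIONS COMMAND [ARGS...]
count_unexpected() {
    expected=$1
    iterations=$2
    shift 2
    i=0
    bad=0
    while [ "$i" -lt "$iterations" ]; do
        i=$((i + 1))
        status=0
        # Send the command's own output to stderr so that stdout
        # carries only the final count.
        "$@" 1>&2 || status=$?
        if [ "$status" -ne "$expected" ]; then
            bad=$((bad + 1))
            echo "iteration $i: expected $expected, got $status" >&2
        fi
    done
    # Print the number of iterations with an unexpected status.
    echo "$bad"
    # Succeed only if every iteration returned the expected status.
    [ "$bad" -eq 0 ]
}
```

On an affected machine, something like `count_unexpected 1 1000 cmd /c "false"` would be expected to print a nonzero count and log the unexpected status values to stderr.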

There are several high-level possibilities for what is going wrong:

1) cmd.exe is failing to retrieve the correct exit code for the
invocation of false.exe (a Cygwin process).

2) cmd.exe is failing to return the (correct) exit code it received for
the invocation of false.exe.

3) bash.exe (a Cygwin process) is failing to retrieve the correct exit
code for the invocation of cmd.exe.

It is possible that other software installed on the machines I've
witnessed this on is contributing to the problem (a la
http://cygwin.com/faq/faq.using.html#faq.using.bloda). If so, such
software would be a contributing factor to one of the explanations
above, but that would not necessarily rule out a defect in Cygwin (or
in CreateProcess, WaitForSingleObject, or GetExitCodeProcess).
I have not yet seen a similar case that does not involve Cygwin, so at
present I suspect a defect in Cygwin, but possibly one that produces no
negative symptoms in isolation.

I've reproduced this issue with both the 32-bit and 64-bit versions of
cmd.exe. I've also reproduced it by replacing cmd.exe with a C program
that calls CreateProcess for Cygwin's false.exe on its own. The issue
reproduces whether that C program is compiled with Cygwin gcc, MinGW gcc
(32-bit and 64-bit), or MSVC (32-bit and 64-bit). So, substitute
what you like for 'cmd.exe' in the above.

Likewise, I've reproduced this issue by replacing false.exe in the test
above with a custom myfalse.exe (a C program that just returns 1). The
issue reproduces whether myfalse.exe is compiled with Cygwin gcc, MinGW
gcc (32-bit and 64-bit), or MSVC (32-bit and 64-bit). So,
substitute what you like for 'false.exe' in the above.

I am not able to reproduce the problem if I elide the invocation of
false.exe (i.e., if the cmd.exe invocation is 'cmd /c "exit /B 1"' or if
my replacement for cmd.exe just returns 1).

The problem feels like a race condition in retrieving process exit
codes. Further, it seems that it may only occur when two related
processes exit in quick succession.
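
One way to probe this hypothesis might be to put a delay between the exit of the inner process and the exit of the process that launched it. The sketch below is an assumption-laden illustration, not something from the tests described above: delayed_exit is a hypothetical helper, and it uses a subshell in place of cmd.exe, so it changes more than just the timing.

```shell
#!/bin/sh
# Hypothetical probe: run COMMAND in a subshell, wait DELAY seconds
# after it exits, then exit the subshell with COMMAND's status.
# usage: delayed_exit DELAY COMMAND [ARGS...]
delayed_exit() {
    delay=$1
    shift
    (
        status=0
        "$@" || status=$?
        # Separate the two process exits in time.
        sleep "$delay"
        exit "$status"
    )
}
```

For example, replacing `cmd /c "false"` in test.sh with `delayed_exit 1 false` and then with `delayed_exit 0 false` would compare a widely spaced pair of exits against a closely spaced one. A difference in failure rate would only be suggestive, not proof of a race.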

I've been granted several weeks in the near future to work exclusively
on this issue. Before I start working on it though, I'd like to hear
from other community members who have experienced this and tried to
debug it. What is and is not known about the issue? What workarounds
have been tried (especially any that were found to be successful)? Are
there specific parts of the Cygwin (or bash) code that you recommend
starting with?

The machine that I've been running the above script on is 64-bit Windows
7 Professional SP1 running under VMware Workstation 8 which is running
on Kubuntu 12.04.

Relevant parts of 'cygcheck -s' are:

Windows 7 Professional N Ver 6.1 Build 7601 Service Pack 1

Running under WOW64 on AMD64

Cygwin DLL version info:
DLL version: 1.7.16
DLL epoch: 19
DLL old termios: 5
DLL malloc env: 28
Cygwin conv: 181
API major: 0
API minor: 262
Shared data: 5
DLL identifier: cygwin1
Mount registry: 3
Cygwin registry name: Cygwin
Program options name: Program Options
Installations name: Installations
Cygdrive default prefix:
Build date:
Shared id: cygwin1S5

Potential app conflicts:

ByteMobile laptop optimization client.

No Cygwin services found.

Cygwin Package Information
Package              Version            Status
bash                 4.1.10-4           OK
cygwin               1.7.16-1           OK

Tom.

Tom Honermann

2012-12-07 21:54:01 UTC