Discussion:
[Bug 1640518] [NEW] MongoDB Memory corruption
bugproxy
2016-11-09 15:49:52 UTC
Permalink
Public bug reported:

== Comment: #0 - Calvin L. Sze <***@us.ibm.com> - 2016-11-01 23:09:10 ==
Team has changed to the Bare-metal Ubuntu 16.4. The problem still exists, so it is not related to the virtualization.

Since the bug is complicated to reproduce, Could we use sets of tools
to collect the data when this happens?


---Problem Description---
MongoDB has memory corruption issues which only occurred on Ubuntu 16.04, it doesn't occur on Ubuntu 15.

Contact Information =Calvin Sze/Austin/IBM

---uname output---
Linux master 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:00:57 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Model: 2.1 (pvr 004b 0201) Model name: POWER8E (raw), altivec supported

---System Hang---
the system is still alive

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Unfortunately, not very easily. I had a test case that I was running on ubuntu1604-ppc-dev.pic.build.10gen.cc and xxxx-ppc-dev.pic.build.10gen.cc. I understand these to be two VMs running on the same physical host.

About 3.5% of the test runs on ubuntu1604-ppc-dev.pic.build.10gen.cc
would fail, but all of the runs on the other machine passed. Originally,
this failure manifested as the GCC stack protector (from -fstack-
protector-strong) claiming stack corruption.

Hoping to be able to see the data that was being written and corrupting
the stack, I manually injected a guard region into the stack of the
failing functions as follows:


+namespace {
+
+class Canary {
+public:
+
+ static constexpr size_t kSize = 1024;
+
+ explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+ ::memset(const_cast<unsigned char*>(_t), kBits, kSize);
+ }
+
+ ~Canary() {
+ _verify();
+ }
+
+private:
+ static constexpr uint8_t kBits = 0xCD;
+ static constexpr size_t kChecksum = kSize * size_t(kBits);
+
+ void _verify() const noexcept {
+ invariant(std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
+ }
+
+ const volatile unsigned char* const _t;
+};
+
+} // namespace
+

Status bsonExtractField(const BSONObj& object, StringData fieldName, BSONElement* outElement) {
+
+ volatile unsigned char* const cookie = static_cast<unsigned char *>(alloca(Canary::kSize));
+ const Canary c(cookie);
+

When running with this, the invariant would sometimes fire. Examining
the stack cookie under the debugger would show two consecutive bytes,
always at an offset ending 0x...e, written as either 0 0, or 0 1,
somewhere at random within the middle of the cookie.

This indicated that it was not a conventional stack smash, where we were
writing past the end of a contiguous buffer. Instead it appeared that
either the currently running thread had reached up some arbitrary and
random amount on the stack and done either two one-byte writes, or an
unaligned 2-byte write. Another possibility was that a local variable
had been transferred to another thread, which had written to it.

However, while looking at the code to find such a thing, I realized that
there was another possibility, which was that the bytes had never been
written correctly in the first place. I changed the stack canary
constructor to be:


+ explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+ ::memset(const_cast<unsigned char*>(_t), kBits, kSize);
+ _verify();
+ }

So that immediately after writing the byte pattern to the stack buffer,
we verified the contents we wrote. Amazingly, this *failed*, with the
same corruption as seen before. This means that either between the time
we called memset to write the bytes and when we read them back,
something either overwrote the stack cookie region, or that the bytes
were never written correctly by memset, or that memset wrote the bytes,
but the underlying physical memory never took the write.



Stack trace output:
no

Oops output:
no

Userspace tool common name: MongoDB

Userspace rpm: mongod

The userspace tool has the following bit modes: 64bit

System Dump Info:
The system is not configured to capture a system dump.

Userspace tool obtained from project website: na

*Additional Instructions for Lilian Romero/Austin/IBM:
-Post a private note with access information to the machine that the bug is occuring on.
-Attach sysctl -a output output to the bug.
-Attach ltrace and strace of userspace application.

== Comment: #1 - Luciano Chavez <***@us.ibm.com> - 2016-11-02 08:41:47 ==
Normally for userspace memory corruption type problems I would recommend Valgrind's memcheck tool though if this works on other versions of linux, one would want to compare the differences such as whether or not you are using the same version of mongodb, gcc, glibc and the kernel.

Has a standalone testcase been produced that shows the issue without
mongodb?

== Comment: #2 - Steven J. Munroe <***@us.ibm.com> - 2016-11-02 10:27:40 ==
We really need that standalone test case.

Need to look at WHAT c++ is doing with memset. I suspect the compiler is
short circuiting the function and inlining. That is what you would want
for optimization, but we need to know so we can steer this to the
correct team.

== Comment: #3 - Calvin L. Sze <***@us.ibm.com> - 2016-11-02 13:17:30 ==
Hi Luciano and Steve, Thanks for the advise,

They don't have a standalone test case without Mongodb, I could image
it take a while and probably not that easy to produce. I am seeking
your advise how to approach this. The failure takes at least 24 - 48
hours running to reproduce. Steve, do you have what you needed for C++
test, or there is something I need to ask Mongo development team?

Thanks

== Comment: #4 - William J. Schmidt <***@us.ibm.com> - 2016-11-02 16:29:26 ==
(In reply to comment #3)
Hi Luciano and Steve, Thanks for the advise,
They don't have a standalone test case without Mongodb, I could image it
take a while and probably not that easy to produce. I am seeking your
advise how to approach this. The failure takes at least 24 - 48 hours
running to reproduce. Steve, do you have what you needed for C++ test, or
there is something I need to ask Mongo development team?
Thanks
It's unclear to me yet that we have evidence of this being a problem in
the toolchain. Does the last experiment (revised Canary constructor)
ALWAYS fail, or does it also fail only ever 24 - 48 hours? If the
latter, then all we know is that stack corruption happens. There's no
indication of where the wild pointer is coming from (application
problem, compiler problem, etc.). If it does always fail, however, then
I question the assertion that they can't provide a standalone test case.

We need something more concrete to work with.

Bill

== Comment: #5 - Calvin L. Sze <***@us.ibm.com> - 2016-11-03 18:08:33 ==
Could this ticket be viewed by external customer/ISV?
I am thinking how to establish the direct communications between Mongodb development team and experts/owner of the ticket to pass the middle man, me :-)

Here are the MongoDB deelopment director, Andrew's answers to my 3
questions. And in addition he added comments.

Basically, there are 3 questions,
1. Is the mongoDB binary built with gcc came with Linux distributions
or with IBM Advance toolchain gcc?


We build our own GCC, but we have reproduced the issue with both our custom GCC, and the builtin linux distribution GCC. We have also reproduced with clang 3.9 built from source on the Ubuntu 16.04 POWER machine, so we do not think that this is a compiler issue (could still be a std library issue).
2. Does the last experiment (revised Canary constructor) ALWAYS fail, or does it also fail only ever 24 - 48 hours?
No, we have never been able to construct a deterministic repro. We are
only able to get it to fail after running the test a very large number
of times.
3. Is there any way we can have a standalone test case without
MongoDB?

We do not have such a repro at this time.

I do understand the position they are taking - it isn't a lot of
information to go on, and most of the time the correct response to a
mysterious software crash is to blame the software itself, not the
surrounding ecosystem. However, we have a lot of *indirect* evidence
that has made us skeptical that this is our bug. We would love to be
proved wrong!


- The stack corruption has not reproduced on any other systems. We are running these same tests on every commit across dozens of Linux variants, and across four cpu architectures (x86_64, POWER, zSeries, ARMv8).
- We don't see crashes on other POWER, but we do on Ubuntu POWER.
- We don't see crashes on Windows, Solaris, OS X
- We have run the under the clang address sanitizer, with no reports.
- We have enabled the clang address sanitizer use-after-return detector, and found no results.


If this were a wild pointer in the MongoDB server process that was writing to the stack of other threads, we would expect to see corruption show up elsewhere, but we simply do not.

However, lets assume that this is a bug in our code, that for whatever
reason only reveals itself on POWER, and only on Ubuntu. We would still
be interesting in learning from the kernel team if there are additional
power specific debugging techniques that we might be able to apply. In
particular, the ability to programmatically set/unset hardware
watchpoints over the stack canary. Another possibility would be to
mprotect the stack canary, but it is not clear to us whether it is valid
to mprotect part of the stack, either in general, or on POWER.

We would be happy to hear any suggestions on how to proceed.


Thanks,
Andrew

== Comment: #6 - Steven J. Munroe <***@us.ibm.com> - 2016-11-03 18:34:30 ==
you could tell what specific GCC version you are based on and configure options.

You could provide the disassemble of the canary code.

== Comment: #7 - William J. Schmidt <***@us.ibm.com> - 2016-11-03 23:01:55 ==
It would be useful to see what the Canary is compiled into, as Steve suggested. Let's make sure it's doing what we think it is.

Given we have multiple compilers producing the same results, we may want
to think more about the runtime environment -- are you using the same
glibc and libstdc++ in all cases? Clang at least would pick up the
distro versions, as it doesn't provide its own.

One reason you see this on Ubuntu 16.04 and not on another linux distro
is likely because of glibc level. The other linux's glibc is quite old
by comparison. glibc 2.23, which appears on Ubuntu 16.04, is the first
version to be compiled with -fstack-protector-strong by default. So
this doesn't necessarily mean that the bug doesn't exist elsewhere; it
just means that the stack protector code isn't enabled to spot the
problem. If the stack corruption is benign, then it wouldn't be noticed
otherwise.

I assume that glibc 2.23 was compiled with Ubuntu's version of gcc 5
that ships with the system, in case that becomes relevant.

I don't personally have a lot of experience with trying to debug
something of this nature, in case we don't see something obvious from
the disassembly of the canary. CCing Ulrich Weigand in case he has some
ideas of other approaches to try.

== Comment: #9 - Ulrich Weigand <***@de.ibm.com> - 2016-11-04 12:21:48 ==
I don't really have any other great ideas either. Just two comments:

- Even though the original reported mentioned they already tried clang's
address sanitizer, I'd definitely still also try reproducing the problem
under valgrind -- the two are different in what exactly they detect, and
using both tools in a complex problem can only help.

- The Canary code sample above has strictly speaking undefined behavior,
I think: it is calling memset on a const *. (The const_cast makes the
warning go away, but doesn't actually cure the undefined behavior.) I
don't *think* this will cause codegen changes in this example, but it
cannot hurt to try to fix this and see if anything changes.

== Comment: #12 - Calvin L. Sze <***@us.ibm.com> - 2016-11-06 10:32:25 ==
Hi Bill, Thanks

I have asked Andrew, waiting for his confirmation.

== Comment: #14 - Calvin L. Sze <***@us.ibm.com> - 2016-11-06 10:56:49 ==
Hi Calvin -


I can provide the assembly of the function that contains the canary (the canary itself gets inlined), but I think it might just be easier if I uploaded a binary and an associated corefile? That way your engineers could disassemble the crashing function themselves in the debugger and see exactly what the state was at the time of the crash.


What is the best way for me to get that information to you?


Thanks,
Andrew

== Comment: #15 - Calvin L. Sze <***@us.ibm.com> - 2016-11-06 10:58:54 ==
Provided the binary and core information.

Note from Mongo;

I've uploaded a sample core file and the associated binary to your ftp
server as detailed above. The binary is named `mongod.power` and the core is
named `mongod.power.core`.

You should expect to see a backtrace on the faulting thread which looks
like this (for the first few frames):

(gdb) bt
#0 0x00003fff997be5d0 in __libc_signal_restore_set (set=0x3fff5814c1f0)
at ../sysdeps/unix/sysv/linux/nptl-signals.h:79
#1 __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:55
#2 0x00003fff997c0c00 in __GI_abort () at abort.c:89
#3 0x00000000223c33e8 in mongo::invariantFailed (expr=<optimized out>,
file=0x24131b38 "src/mongo/bson/util/bson_extract.cpp",
line=<optimized out>) at src/mongo/util/assert_util.cpp:154
#4 0x00000000224bbc48 in mongo::(anonymous namespace)::Canary::_verify (
this=<optimized out>) at src/mongo/bson/util/bson_extract.cpp:58


The "Canary::_verify" frame (number 4) has a local variable "_t" which is an
on-the-stack array and filled with "0xcd" for a span of 1024 bytes. Near the
end of this block we see two bytes of poisoned memory which were altered:

0x3fff5814c858: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c860: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c868: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0x01 0x00
0x3fff5814c870: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c878: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd


Note the two bytes set to values "0x01" and "0x00".

At the time of core-dump all the other threads seemed to be paused on system
calls such as "recv" or "__pthread_cond_wait". The verify function is called
when setting up our software canary, and checks the memory immediately after
its setup. We do not run any other functions on this thread between the
memory poisoning and the verification of the poisoning. All other threads
appear to be paused at this time.

== Comment: #16 - Calvin L. Sze <***@us.ibm.com> - 2016-11-06 10:59:40 ==
A follow up message from Mongo

The function calling the canary code, which you'll want to possibly
disassemble is in frame 6:

#6 mongo::bsonExtractStringField (object=..., fieldName=...,
out=0x3fff5814caa8) at src/mongo/bson/util/bson_extract.cpp:138

The lower numbered frames deal with the canary code
itself.
From Andrew,
Given we have multiple compilers producing the same results, we may want to
think more about the runtime environment -- are you using the same glibc and
libstdc++ in all cases? Clang at least would pick up the distro versions, as
it doesn't provide its own.
We have repro'd with three compilers:

- The system GCC, using system libstdc++ and system glibc
- Our hand-rolled GCC, using its own libstdc++, and system glibc
- One off clang-3.9 build, using system libstdc++, and system glibc.


Coincidentally, both system and hand-rolled GCC are 5.4.0, so there may not be as much variation there as hoped. We could try building with clang and libc++ to at least rule out libstdc++ as a factor.
One reason you see this on Ubuntu 16.04 and not on the other linux distro is likely because of
glibc level. The other linux distro's glibc is quite old by comparison. glibc 2.23, which
appears on Ubuntu 16.04, is the first version to be compiled with
-fstack-protector-strong by default.
I'm not sure I follow. Our software has been built with -fstack-protector-strong on both platforms, whether or not glibc has been, and the invocation of the __stack_chk_fail function is always from our code, not from glibc, or libstdc++. So, I'd expect that if there were stack corruption taking place as a result of our code, we would see the stack protector trip on both platforms. Or are you saying that on platforms where glibc itself wasn't built with -fstack-protector-whatever that user code built with that same flag won't report errors?
So this doesn't necessarily mean that the
bug doesn't exist elsewhere; it just means that the stack protector code isn't
enabled to spot the problem. If the stack corruption is benign, then it
wouldn't be noticed otherwise.
Yeah, still confused. I can definitely make the other linux distro box
report a stack corruption:


[***@xxxx-ppc-dev.pic.build ~]$ cat > boom.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


struct no_chars {
unsigned int len;
unsigned int data;
};


int main(int argc, char * argv[])
{
struct no_chars info = { };


if (argc < 3) {
fprintf(stderr, "Usage: %s LENGTH DATA...\n", argv[0]);
return 1;
}


info.len = atoi(argv[1]);
memcpy(&info.data, argv[2], info.len);


return 0;
}
[***@rhel71-ppc-dev.pic.build ~]$ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector-strong boom.c -o boom


[***@rhel71-ppc-dev.pic.build ~]$ ./boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*** stack smashing detected ***: ./boom terminated
Segmentation fault


I assume that glibc 2.23 was compiled with Ubuntu's version of gcc 5 that ships
with the system, in case that becomes relevant.


Correct, we have not made any changes to glibc - we are using the stock version that ships on the system.
From Andrew
Also, I want to re-iterate that while we have definitely observed cases
where the stack protector detects the stack corruption, we have also
observed stack corruption within our own hand-rolled stack buffer, per
the code posted earlier. The core dump that Adam provided is of this
latter sort So to some extent, this is independent of -fstack-protector-
strong.


One thing that I have not yet ruled out is whether -fstack-protect-strong could itself be at fault, somehow, though I find that unlikely given that we have reproduced with clang as well.


Still, it sounds like a worthwhile experiment, so I will see if I can still detect corruption in our hand-rolled stack canary when building without any form of -fstack-protector enabled.
From Andrew,
I've performed this experiment, replacing our use of -fstack-protector-strong with -fno-stack-protector when building MongoDB, and I can confirm that we still observe stack corruption in our hand-rolled canary, per the code posted earlier.


I have a core file and executable. Let me know if you would be interested in my providing those in addition to the files provided yesterday by Adam.

== Comment: #21 - William J. Schmidt <***@us.ibm.com> - 2016-11-07 11:10:54 ==
Andrew, thanks for all the details, and for the binary and core file! I'll start poking through them this morning. I've just been absorbing all the notes that Calvin dumped into our bug tracking system yesterday.

You can ignore what I was saying about -fstack-protector-strong. My
thought at the time was that *if* the flow of control entered glibc,
that whether or not the code *there* was compiled with -fstack-
protector-strong might prove to make a difference. Reading back through
today, I see that was off base, so sorry for the distraction.

While I'm looking at the binary, there are a couple of other things you might want to try:
- Replace ::memset with __builtin_memset with GCC to see whether that makes any difference;
- Try Ulrich Weigand's suggestions from comment #9;
- As you suggested, try clang + libc++ to try to rule libstdc++ in or out.

A couple of questions that may or may not prove relevant:
- You've mentioned you don't get the crashes on the other linux distro. Have you tried your modified canary on the other linux distro anyway? If we're certain the two systems behave differently with the canary that may help us in narrowing things down.
- Which version of the C++ standard are you compiling against? Is it just the default on all systems, or are you forcing a specific -std=...?

== Comment: #22 - William J. Schmidt <***@us.ibm.com> - 2016-11-07 12:18:41 ==
I'm having some difficulties with core file compatibility. I put your files on an Ubuntu 16.04.1 system, but I don't see quite the same results as you report under gdb, with libc and libgcc shared libs not at the correct address and a problem with the stack. There's a transcript below. I'm particularly concerned about the warning that the core file and executable may not match. Note also the report of stack corruption above frame #4, so I can't get to frame #6 to look at the register state. The library frames at #0-#3 are reporting the wrong information, which I assume to be because the libraries are at the wrong address.

For debug purposes it would probably be best to use the system compiler,
just in case that wasn't the case here.

$ ls -l
total 1950688
-rw-r--r-- 1 wschmidt wschmidt 700141992 Nov 7 14:37 mongod.power
-rw-r--r-- 1 wschmidt wschmidt 1297350656 Nov 7 14:39 mongod.power.core
$ gdb mongod.power mongod.power.core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mongod.power...done.

warning: core file may not match specified executable file.
[New LWP 101461]
[New LWP 100045]
[New LWP 100062]
[New LWP 100056]
[New LWP 99983]
[New LWP 100052]
[New LWP 100054]
[New LWP 99892]
[New LWP 100051]
[New LWP 100048]
[New LWP 100007]
[New LWP 99868]
[New LWP 100059]
[New LWP 101459]
[New LWP 100001]
[New LWP 99986]
[New LWP 101403]
[New LWP 99980]
[New LWP 99882]
[New LWP 99893]
[New LWP 99877]
[New LWP 99872]
[New LWP 101462]
[New LWP 99874]
[New LWP 100058]
[New LWP 100231]
[New LWP 99994]
[New LWP 99873]
[New LWP 100003]
[New LWP 99993]
[New LWP 99879]
[New LWP 101398]
[New LWP 99891]
[New LWP 99880]
[New LWP 99910]
[New LWP 99895]
[New LWP 99901]
[New LWP 100011]
[New LWP 99974]
[New LWP 100049]
[New LWP 99898]
[New LWP 99875]
[New LWP 101460]
[New LWP 99878]
[New LWP 99871]
[New LWP 99896]
[New LWP 101954]
[New LWP 101406]
[New LWP 100015]
[New LWP 100068]
[New LWP 99984]
[New LWP 101519]
[New LWP 100053]
[New LWP 99996]
[New LWP 100050]
[New LWP 100055]
[New LWP 100057]
[New LWP 101807]
[New LWP 99890]
[New LWP 100004]
[New LWP 99884]
[New LWP 101437]
[New LWP 101455]
[New LWP 100013]
[New LWP 99894]
[New LWP 101411]
[New LWP 101457]
[New LWP 101431]
[New LWP 101458]
[New LWP 100443]
[New LWP 101438]
[New LWP 101414]
[New LWP 101433]
[New LWP 101784]
[New LWP 99979]
[New LWP 101397]
[New LWP 101402]
[New LWP 101401]
[New LWP 101435]
[New LWP 101405]
[New LWP 101423]
[New LWP 101425]
[New LWP 99897]
[New LWP 101419]
[New LWP 99989]
[New LWP 101409]
[New LWP 100008]
[New LWP 101410]
[New LWP 99998]
[New LWP 101413]
[New LWP 101469]
[New LWP 101418]
[New LWP 101427]
[New LWP 101399]
[New LWP 101235]
[New LWP 101396]
[New LWP 101421]
[New LWP 99990]
[New LWP 101407]
[New LWP 101480]
[New LWP 100060]
[New LWP 101499]
[New LWP 101506]
[New LWP 101395]
[New LWP 101415]
[New LWP 101400]
[New LWP 101412]
[New LWP 101408]
[New LWP 101420]
[New LWP 101416]
[New LWP 101492]
[New LWP 101513]
[New LWP 101782]
[New LWP 101404]
[New LWP 101481]
[New LWP 101417]
[New LWP 100067]
[New LWP 101429]
[New LWP 99883]
[New LWP 101430]
[New LWP 101436]
[New LWP 101454]
[New LWP 101428]
[New LWP 101422]
[New LWP 100108]
[New LWP 101434]
[New LWP 100064]
[New LWP 101453]
[New LWP 100061]
[New LWP 101426]
[New LWP 100066]
[New LWP 101452]
[New LWP 101439]
[New LWP 101456]
[New LWP 101451]
[New LWP 101450]
[New LWP 101432]
[New LWP 101449]
[New LWP 101424]
[New LWP 100065]
[New LWP 100063]
[New LWP 101448]
[New LWP 101447]
[New LWP 101446]
[New LWP 101445]
[New LWP 101444]
[New LWP 101443]
[New LWP 101442]
[New LWP 101441]
[New LWP 101440]

warning: .dynamic section for "/lib/powerpc64le-linux-gnu/libgcc_s.so.1"
is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib/powerpc64le-linux-gnu/libc.so.6" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `/home/pic1user/proj/mongo-repro/mongod --oplogSize 1024 --port 30012 --nopreall'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fff997be5d0 in __copysign (y=<optimized out>, x=<optimized out>)
at ../sysdeps/generic/math_private.h:233
233 ../sysdeps/generic/math_private.h: No such file or directory.
[Current thread is 1 (Thread 0x3fff5814ec20 (LWP 101461))]
(gdb) bt
#0 0x00003fff997be5d0 in __copysign (y=<optimized out>, x=<optimized out>)
at ../sysdeps/generic/math_private.h:233
#1 __modf_power5plus (x=-6.2774385622041925e+66, iptr=0x3fff5814c1f0)
at ../sysdeps/powerpc/power5+/fpu/s_modf.c:44
#2 0x00003fff997be4f0 in ?? () from /lib/powerpc64le-linux-gnu/libc.so.6
#3 0x00003fff997c0c00 in ?? () at ../signal/allocrtsig.c:45
from /lib/powerpc64le-linux-gnu/libc.so.6
#4 0x00000000223c33e8 in mongo::invariantFailed (expr=<optimized out>,
file=0x24131b38 "src/mongo/bson/util/bson_extract.cpp",
line=<optimized out>) at src/mongo/util/assert_util.cpp:154
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) quit
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/5/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-ppc64el/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-ppc64el --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-ppc64el --with-arch-directory=ppc64le --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$


I'll disassemble the binary and see if I can spot anything without the state information.

Oh, still waiting on permission to mirror the bug.

== Comment: #23 - William J. Schmidt <***@us.ibm.com> - 2016-11-07 13:39:45 ==
A little more information:

I've been looking at bsonExtractStringField's disassembly. It appears
that this binary inlines the call to the Canary constructor as well as
the call to _verify. As evidence, I see the PLT call to glibc's memset:

8ebb3c: 71 c9 06 48 bl 9584ac
<00000d72.plt_call.memset@@GLIBC_2.17>

And later I see the call to invariantFailed:

8ebc44: e9 75 f0 4b bl 7f322c
<_ZN5mongo15invariantFailedEPKcS1_j+0x8>

So we've answered Steve's initial question about which memset we're
using. This isn't being inlined by the compiler, but does an out-of-
line dynamic call to the GLIBC_2.17 version.

I'm not sure whether GCC would inline a 1024-byte memset using
__builtin_memset, or just end up calling out the same way, but it might
be worth trying out that replacement, and disassembling
bsonExtractStringField again to see if the PLT call has gone away.

== Comment: #24 - William J. Schmidt <***@us.ibm.com> - 2016-11-07 13:50:04 ==
I forgot to mention that the ensuing code generation to accumulate the checksum and test it is completely straightforward and looks correct. So this looks like pretty strong evidence that the problem is in the GLIBC memset implementation.

8ebb3c: 71 c9 06 48 bl 9584ac <00000d72.plt_call.memset@@GLIBC_2.17>
8ebb40: 18 00 41 e8 ld r2,24(r1)
8ebb44: 00 04 40 39 li r10,1024
8ebb48: 00 00 20 39 li r9,0
8ebb4c: a6 03 49 7d mtctr r10
8ebb50: 00 00 43 89 lbz r10,0(r3)
8ebb54: 01 00 63 38 addi r3,r3,1
8ebb58: 14 52 29 7d add r9,r9,r10
8ebb5c: f4 ff 00 42 bdnz 8ebb50 <_ZN5mongo22bsonExtractStringFieldERKNS_7BSONObjENS_10StringDataEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x80>
8ebb60: 03 00 40 3d lis r10,3
8ebb64: 00 34 4a 61 ori r10,r10,13312
8ebb68: 00 50 a9 7f cmpd cr7,r9,r10
8ebb6c: c4 00 9e 40 bne cr7,8ebc30 <_ZN5mongo22bsonExtractStringFieldERKNS_7BSONObjENS_10StringDataEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x160>

...

8ebc30: 44 ff 82 3c addis r4,r2,-188
8ebc34: 44 ff 62 3c addis r3,r2,-188
8ebc38: 3a 00 a0 38 li r5,58
8ebc3c: 38 aa 84 38 addi r4,r4,-21960
8ebc40: 60 aa 63 38 addi r3,r3,-21920
8ebc44: e9 75 f0 4b bl 7f322c <_ZN5mongo15invariantFailedEPKcS1_j+0x8>

== Comment: #28 - William J. Schmidt <***@us.ibm.com> - 2016-11-08 11:02:18 ==
Recording some information from email discussions.

(1) The customer is planning to attempt to use valgrind memcheck.
(2) The const cast problem with the canary has been fixed without changing the results.
(3) Prior to that fix, the canary was used on the RHEL system with no corruption detected, so this does seem to be Ubuntu-specific.
(4) -std=c++11 is used everywhere.
(5) The core and binary compatibility issues appear to be that they were generated on 16.10, not 16.04. New ones coming.
(6) The canary code now looks like:

+namespace {
+
+class Canary {
+public:
+
+ static constexpr size_t kSize = 2048;
+
+ explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+ __builtin_memset(const_cast<unsigned char*>(t), kBits, kSize);
+ _verify();
+ }
+
+ ~Canary() {
+ _verify();
+ }
+
+private:
+ static constexpr uint8_t kBits = 0xCD;
+ static constexpr size_t kChecksum = kSize * size_t(kBits);
+
+ void _verify() const noexcept {
+ invariant(std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
+ }
+
+ const volatile unsigned char* const _t;
+};
+
+} // namespace
+

And its application in bsonExtractTypedField looks like:

@@ -47,6 +82,10 @@ Status bsonExtractTypedField(const BSONObj& object,
StringData fieldName,
BSONType type,
BSONElement* outElement) {
+
+ volatile unsigned char* const cookie = static_cast<unsigned char *>(alloca(Canary::kSize));
+ const Canary c(cookie);
+
Status status = bsonExtractField(object, fieldName, outElement);

(7) Steve Munroe investigated memset and he and Andrew are in agreement
that we can rule it out:

I looked at the memset_power8 code (memset is just a IFUNC resolve
stub). and I don't see how this problem is caused by memset_power8.

First some observations:

The canary is allocated with alloca for a large power of 2 (1024 bytes).
Alloca returns quadword aligned memory as required to maintain quadword stack alignment.
For this case memset_power8 will quickly jump to the vector store loop (quadword x 8) all from the same register (a vector splat of the fill char).

With this code the failure modes could only be:
Overwrite by N*quadwords,
Underwrite by N*quadwords,
A repeated pattern every quadword.

But we are not see this. Also think we are back to a clobber by some
other code.
I did try the valgrind repro. However, I'm not able to make valgrind
work:

The first try resulted in lots of "mismatched free/delete" reports,
which is sort of odd, because they all seem to be from within the
valgrind --soname-synonyms=somalloc=NONE --track-origins=yes --leak-check=no ./mongos
==17387== Memcheck, a memory error detector
==17387== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17387== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17387== Command: ./mongos
==17387==
==17387== Mismatched free() / delete / delete []
==17387== at 0x4895888: free (in /usr/lib/valgrind/vgpreload_memcheck-ppc64le-linux.so)
==17387== by 0x59514F: deallocate (new_allocator.h:110)
==17387== by 0x59514F: deallocate (alloc_traits.h:517)
==17387== by 0x59514F: _M_deallocate_buckets (hashtable_policy.h:2010)
==17387== by 0x59514F: _M_deallocate_buckets (hashtable.h:356)
==17387== by 0x59514F: _M_deallocate_buckets (hashtable.h:361)
==17387== by 0x59514F: _M_rehash_aux (hashtable.h:1999)
==17387== by 0x59514F: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&) (hashtable.h:1953)
==17387== by 0x595253: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, true>*) (hashtable.h:1600)
==17387== by 0x5954D3: std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (hashtable_policy.h:600)
==17387== by 0x593693: operator[] (unordered_map.h:668)
==17387== by 0x593693: mongo::InitializerDependencyGraph::addInitializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (initializer_dependency_graph.cpp:58)
==17387== by 0x591057: mongo::GlobalInitializerRegisterer::GlobalInitializerRegisterer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (global_initializer_registerer.cpp:44)
==17387== by 0x52D46F: __static_initialization_and_destruction_0(int, int) [clone .constprop.34] (mongos_options_init.cpp:39)
==17387== by 0x137FED3: __libc_csu_init (in /home/acm/opt/src/mongo/mongos)
==17387== by 0x4F830A7: generic_start_main.isra.0 (libc-start.c:247)
==17387== by 0x4F83337: (below main) (libc-start.c:116)
==17387== Address 0x5151fb0 is 0 bytes inside a block of size 16 alloc'd
==17387== at 0x48951D4: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-ppc64le-linux.so)
==17387== by 0x59328F: allocate (new_allocator.h:104)
==17387== by 0x59328F: allocate (alloc_traits.h:491)
==17387== by 0x59328F: std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, true> > >::_M_allocate_buckets(unsigned long) [clone .isra.108] (hashtable_policy.h:1996)
==17387== by 0x595093: _M_allocate_buckets (hashtable.h:347)
==17387== by 0x595093: _M_rehash_aux (hashtable.h:1974)
==17387== by 0x595093: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&) (hashtable.h:1953)
==17387== by 0x595253: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, true>*) (hashtable.h:1600)
==17387== by 0x5954D3: std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (hashtable_policy.h:600)
==17387== by 0x59356B: operator[] (unordered_map.h:668)
==17387== by 0x59356B: mongo::InitializerDependencyGraph::addInitializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (initializer_dependency_graph.cpp:46)
==17387== by 0x591057: mongo::GlobalInitializerRegisterer::GlobalInitializerRegisterer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (global_initializer_registerer.cpp:44)
==17387== by 0x52D46F: __static_initialization_and_destruction_0(int, int) [clone .constprop.34] (mongos_options_init.cpp:39)
==17387== by 0x137FED3: __libc_csu_init (in /home/acm/opt/src/mongo/mongos)
==17387== by 0x4F830A7: generic_start_main.isra.0 (libc-start.c:247)
==17387== by 0x4F83337: (below main) (libc-start.c:116)


So, that is a puzzle. However, I can instruct valgrind to ignore that. But it still fails to start, now with something more odd:

$ valgrind --show-mismatched-frees=no --soname-synonyms=somalloc=NONE --track-origins=yes --leak-check=no ./mongos
==19834== Memcheck, a memory error detector
==19834== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==19834== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==19834== Command: ./mongos
==19834==
MC_(get_otrack_shadow_offset)(ppc64)(off=1688,sz=8)

Memcheck: mc_machine.c:329 (get_otrack_shadow_offset_wrk): the
'impossible' happened.

host stacktrace:
==19834== at 0x3808D9B8: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3808DB5F: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3808DCDB: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x38078CE3: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x38076FAB: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x380BAA2B: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x381B9BB7: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x380BE19F: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3810D04F: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3810FFEF: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3812BB97: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)

sched status:
running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 19834)
==19834== at 0x4F3AC14: __lll_lock_elision (elision-lock.c:60)
==19834== by 0x4F2BBC7: pthread_mutex_lock (pthread_mutex_lock.c:92)
==19834== by 0x602753: mongo::DBConnectionPool::DBConnectionPool() (connpool.cpp:196)
==19834== by 0x5319EB: __static_initialization_and_destruction_0 (global_conn_pool.cpp:35)
==19834== by 0x5319EB: _GLOBAL__sub_I__ZN5mongo14globalConnPoolE (global_conn_pool.cpp:39)
==19834== by 0x137FED3: __libc_csu_init (in /home/acm/opt/src/mongo/mongos)
==19834== by 0x4F830A7: generic_start_main.isra.0 (libc-start.c:247)
==19834== by 0x4F83337: (below main) (libc-start.c:116)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using. Thanks.


I'm not really sure what to make of that, except that I did see some thing die in the same place, once or twice (__lll_lock_elision), when running with clang ASAN with the stack-use-after-return checking enabled. I wasn't really sure what to make of that, but it is interesting that this has turned up twice. I presume this is related to hardware lock elision?

Anyway, it doesn't seem like I can get this running with valgrind. Happy
to try again if anyone is aware of a workaround.

== Comment: #30 - William J. Schmidt <***@us.ibm.com> - 2016-11-08 11:06:00 ==
CCing Carl Love. Carl, have you seen this sort of interaction between valgrind and lock elision before? (Comment #29, you can ignore the rest of this bugzilla for now.)

** Affects: gcc-4.8 (Ubuntu)
Importance: Undecided
Assignee: Taco Screen team (taco-screen-team)
Status: New


** Tags: architecture-ppc64le bugnameltc-148069 severity-critical targetmilestone-inin16045

** Tags added: architecture-ppc64le bugnameltc-148069 severity-critical
targetmilestone-inin16045

** Changed in: ubuntu
Assignee: (unassigned) => Taco Screen team (taco-screen-team)

** Package changed: ubuntu => gcc-4.8 (Ubuntu)
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-09 16:01:28 UTC
Permalink
------- Comment From ***@us.ibm.com 2016-11-09 10:52 EDT-------
Hello Canonical,
Sending this bug to you for awareness and advice.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Matthias Klose
2016-11-09 19:59:51 UTC
Permalink
** Package changed: gcc-4.8 (Ubuntu) => gcc-5 (Ubuntu)
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-09 20:05:07 UTC
Permalink
For what it's worth, GCC is just a placeholder package for now. The
compiler and libraries aren't directly implicated, at least as of now.
The origin of the stack corruption remains unknown.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-10 00:16:15 UTC
Permalink
Andrew, according to the valgrind community, the error

Memcheck: mc_machine.c:329 (get_otrack_shadow_offset_wrk): the
'impossible' happened.

is probably due to the lock elision code accessing a hardware register
that valgrind doesn't know about, so there isn't a shadow register to
consult. Our valgrind guys will look into that aspect of things. In
the meantime, they tell us that you can circumvent this problem by not
using --track-origins. Could you please give this a try?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-10 00:30:19 UTC
Permalink
I had another thought this evening. In case this is a threading
problem, have you tried building with Clang and using ThreadSanitizer?
Support for this was added to ppc64el in 2015.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 00:44:44 UTC
Permalink
The following are reproduction instructions for the behavior that we are
observing on Ubuntu 16.04 ppc64le. Note that we have run this same test
on RHEL 7.1 ppc64le, and we do not observe any stack corruption. Note
also that building and running this repro may depend on certain system
libraries (SSL, etc) or python libraries being available on the system.
Please install as needed. The particular commit here is fairly recent,
just one that I happen to know demonstrates the issue.

- git clone https://github.com/mongodb/mongo.git
- cd mongo
- git checkout 3220495083b0d678578a76591f54ee1d7a5ec5df
- git apply acm.nov9.patch
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache --build-fast-and-loose -j$(echo "$(grep -c processor /proc/cpuinfo)/2" | bc) ./mongo ./mongod ./mongos
- ulimit -c unlimited && python buildscripts/resmoke.py --suites=concurrency_sharded --storageEngine=wiredTiger --excludeWithAnyTags=requires_mmapv1 --dbpathPrefix=... --repeat=500 --continueOnFailure

Note that you should provide an actual argument for the --dbpathPrefix
argument in the last step, as this is where the running database
instances will store data.

You will need to leave this running for several hours, perhaps
overnight. In our runs, we find that about 1% of the repeated runs of
the test fail, dropping a core.

The core files are typically (but not always!) associated with crashes
of the mongos binary inside one of the several mongo::bsonExtractXXX
functions, where we find our hand-rolled stack canary to be corrupted. A
typical stack trace of a crashing thread looks like:

$ gdb ./mongos core.2016-11-09T23:11:56+00:00
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mongos...done.
[New LWP 3821]
...
[New LWP 3736]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `/home/acm/opt/src/mongo/mongos --configdb test-configRS/ubuntu1604-ppc-dev.pic.'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x3fff5d98f140 (LWP 3821))]
(gdb) bt
#0 0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00003fff77a01894 in __GI_abort () at abort.c:89
#2 0x000000005d899504 in mongo::fassertFailedWithLocation (msgid=<optimized out>, file=0x5e6f9570 "src/mongo/bson/util/bson_extract.cpp", line=<optimized out>) at src/mongo/util/assert_util.cpp:172
#3 0x000000005d95cb64 in mongo::fassertWithLocation (line=46, file=0x5e6f9570 "src/mongo/bson/util/bson_extract.cpp", testOK=<optimized out>, msgid=100001) at src/mongo/util/assert_util.h:273
#4 mongo::(anonymous namespace)::Canary::Canary (t=0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>..., this=<synthetic pointer>) at src/mongo/bson/util/bson_extract.cpp:46
#5 mongo::bsonExtractTypedField (object=owned BSONObj 34 bytes @ 0x10008d85a37, fieldName=..., type=mongo::Bool, outElement=0x3fff5d98c730) at src/mongo/bson/util/bson_extract.cpp:83
#6 0x000000005d95cc5c in mongo::bsonExtractBooleanField (object=..., fieldName=..., out=0x3fff5d98c798) at src/mongo/bson/util/bson_extract.cpp:101
#7 0x000000005df5e7ec in mongo::AutoSplitSettingsType::fromBSON (obj=...) at src/mongo/s/balancer_configuration.cpp:400

The hand-rolled canary is implemented as follows:

class Canary {
public:

static constexpr size_t kSize = 2048;

__attribute__((always_inline)) explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
__builtin_memset(const_cast<unsigned char*>(t), kBits, kSize);
fassert(100001, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
}

__attribute__((always_inline)) ~Canary() {
fassert(100002, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
}

private:
static constexpr uint8_t kBits = 0xCD;
static constexpr size_t kChecksum = kSize * size_t(kBits);

const volatile unsigned char* const _t;
};

} // namespace

The setup of the Canary in mongo::bsonExtractField is here:

Status bsonExtractField(const BSONObj& object, StringData fieldName,
BSONElement* outElement) {

volatile unsigned char* const cookie = static_cast<unsigned char *>(alloca(Canary::kSize));
const Canary c(cookie);

...

}

In the crash above, it can be seen that the hand-rolled canary detected
stack corruption in the Canary constructor. We memset the bytes, and
then we read them back to checksum them, and they aren't the same. We
also sometimes see the checksum fail in the Canary destructor. But the
constructor case is more interesting. What could have happened to the
memory between when we memset it, and when we read it back? One
hypothesis would be that we had leaked a pointer to a local to another
thread which wrote to it, but if that were the case we would expect to
see crashes all the time, and on other systems, and we don't. More
details on that below.

Looking at the corrupted memory, we see:

(gdb) frame 4
#4 mongo::(anonymous namespace)::Canary::Canary (t=0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>..., this=<synthetic pointer>) at src/mongo/bson/util/bson_extract.cpp:46
46 fassert(100001, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
(gdb) print t
$1 = (volatile unsigned char * const) 0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>...
(gdb) x /2048ub t
0x3fff5d98be90: 205 205 205 205 205 205 205 205
...
0x3fff5d98c010: 205 205 205 205 205 205 205 205
0x3fff5d98c018: 205 205 205 205 205 205 205 205
0x3fff5d98c020: 205 205 205 205 205 205 1 0
0x3fff5d98c028: 205 205 205 205 205 205 205 205
...
0x3fff5d98c680: 205 205 205 205 205 205 205 205
0x3fff5d98c688: 205 205 205 205 205 205 205 205

Interestingly, the corrupted bytes are always two bytes, always either
0x00 or 0x01, and always starting at an offset aligned 0xe.

We have tried several things to narrow down the range of possible
causes.

- We have reproduced with our home built GCC 5.4.
- We have reproduced with the system GCC 5.4.
- We have reproduced with clang-3.9.
- We have run the testcase under the clang address sanitizer, with ASAN_OPTIONS=detect_stack_use_after_return=1.
- We have run on different hardware to ensure that this is not bad memory.
- We have run on bare metal to ensure that this is not related to the virtualization layer on which most of our ppc64le Ubuntu 16.04 instances run.
- The same test case is part of our continuous integration loop and runs nightly across dozens of operating systems and compiler variations, including Windows, OS X, and Linux on x86_64 (including Ubuntu 16.04).
- We have run the same test case on RHEL 7.1 ppc64le.

In none of these cases have we been able to reproduce the issue. It
appears only on Ubuntu 16.04, and only when running that OS on POWER8.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 00:45:25 UTC
Permalink
Here is the patch for the above comment

** Patch added: "Apply to 3220495083b0d678578a76591f54ee1d7a5ec5df"
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+attachment/4775111/+files/acm.nov9.patch
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 00:48:04 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 00:49:32 UTC
Permalink
Bill -

I will try again with valgrind without --track-origins=yes and post any
interesting findings. Re ThreadSanitizer, we have tried before without
success. The last time we tried, it didn't work because clang TSAN
didn't support exceptions. Perhaps that has changed? We really like the
sanitizers (we run ASAN and UBSAN in our CI loop), but our experience
has been that there is a significant period of scrubbing out false
positives and adding suppressions before meaningful signal can be
extracted. I'm happy to add it to the list of things to try though.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 00:58:07 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-10 01:00:41 UTC
Permalink
Hi Andrew -- not sure about Clang TSAN supporting exceptions at this
point; it is probable that we don't have a solution there as I would
expect that to require target support, and I've not heard of that
happening for POWER. That said, I've been less connected to the Clang
community for the last year or so, so anything's possible.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 01:08:13 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Ubuntu Foundations Team Bug Bot
2016-11-10 04:28:07 UTC
Permalink
The attachment "Apply to 3220495083b0d678578a76591f54ee1d7a5ec5df" seems
to be a patch. If it isn't, please remove the "patch" flag from the
attachment, remove the "patch" tag, and if you are a member of the
~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by
~brian-murray, for any issues please contact him.]

** Tags added: patch
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 04:39:19 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 14:09:54 UTC
Permalink
Overnight, I ran this test case on both an Ubuntu 16.04 ppc64le system
and a RHEL 7.1 ppc64le system.

The test ran 219 times on Ubuntu, with 15 cores, for a failure rate of
around 5%. Most of the time corruption was detected in the Canary ctor
(before doing other work), but a few times in the dtor:

$ grep "10000[12]" resmoke.log
[js_test:fsm_all_sharded_replication] 2016-11-09T21:06:42.140+0000 s40019| 2016-11-09T21:06:42.140+0000 I - [conn59] Fatal Assertion 100002 at src/mongo/bson/util/bson_extract.cpp 50
[js_test:fsm_all_sharded_replication] 2016-11-09T23:11:56.413+0000 s40019| 2016-11-09T23:11:56.412+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T03:49:43.577+0000 s40019| 2016-11-10T03:49:43.576+0000 I - [conn47] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T04:54:20.748+0000 s40019| 2016-11-10T04:54:20.745+0000 I - [conn28] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T04:56:43.929+0000 s40020| 2016-11-10T04:56:43.929+0000 I - [conn30] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T07:26:14.013+0000 s40019| 2016-11-10T07:26:14.013+0000 I - [conn3] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T07:52:45.732+0000 s40019| 2016-11-10T07:52:45.698+0000 I - [conn1] Fatal Assertion 100002 at src/mongo/bson/util/bson_extract.cpp 50
[js_test:fsm_all_sharded_replication] 2016-11-10T07:54:45.606+0000 s40020| 2016-11-10T07:54:45.606+0000 I - [conn30] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T08:28:04.278+0000 s40020| 2016-11-10T08:28:04.277+0000 I - [conn13] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T09:28:24.221+0000 s40020| 2016-11-10T09:28:24.221+0000 I - [conn32] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T09:41:14.880+0000 s40019| 2016-11-10T09:41:14.800+0000 I - [conn48] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T10:35:58.143+0000 s40019| 2016-11-10T10:35:58.118+0000 I - [conn2] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T11:06:01.027+0000 s40019| 2016-11-10T11:06:01.027+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T12:37:43.100+0000 s40019| 2016-11-10T12:37:43.100+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T13:31:31.912+0000 s40020| 2016-11-10T13:31:31.912+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46

The test ran 227 times on RHEL without issue. I will try valgrind
without --track-origins=yes today on the Ubuntu 16.04 system.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 14:18:50 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 14:42:22 UTC
Permalink
I tried valgrind as suggested above.

By adding --show-mismatched-frees=no and removing --track-origins=yes I
was able to get the process to start up without a lot of false
positives. However, the server process fails to open its listening
socket, because valgrind reports an unsupported syscall:

[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.496+0000 s40019| --17827-- WARNING: unhandled ppc64le-linux syscall: 326
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.496+0000 s40019| --17827-- You may be able to write your own handler.
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.496+0000 s40019| --17827-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.497+0000 s40019| --17827-- Nevertheless we consider this a bug. Please report
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.497+0000 s40019| --17827-- it at http://valgrind.org/support/bug_reports.html.
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.509+0000 s40019| 2016-11-10T14:24:04.509+0000 I - [mongosMain] Assertion: 15863:listen(): invalid socket? Function not implemented src/mongo/util/net/listen.cpp 184

The code around listen.cpp:184 looks like:

182 SOCKET sock = ::socket(me.getType(), SOCK_STREAM, 0);
183 ScopeGuard socketGuard = MakeGuard(&closesocket, sock);
184 massert(15863,
185 str::stream() << "listen(): invalid socket? " << errnoWithDescription(),
186 sock >= 0);

It looks as if valgrind on ppc64le doesn't support the socket syscall?
Note that I happened to be subscribed to the valgrind-developers list so
I saw the other post there, and I've followed up with this same
information already.

That discussion is here:
https://sourceforge.net/p/valgrind/mailman/message/35483420/

In any event, it doesn't look like valgrind is going to be able to help
here.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 14:48:34 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 18:10:31 UTC
Permalink
OK, I upgraded valgrind to 3.12 on the power machine and I can now get
it to run meaningfully. We are seeing many error reports of the
following form:

[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== Thread 50:
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== Invalid read of size 2
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== at 0x4F2AD20: __lll_unlock_elision (elision-unlock.c:36)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== by 0x4F1DB07: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:64)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== by 0x4F1DB07: pthread_mutex_unlock (pthread_mutex_unlock.c:314)

Or

[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== Invalid write of size 2
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== at 0x4F2AD30: __lll_unlock_elision (elision-unlock.c:37)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== by 0x4F1DB07: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:64)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== by 0x4F1DB07: pthread_mutex_unlock (pthread_mutex_unlock.c:314)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.999+0000 s40019| ==34604== by 0xD803C7: operator()<const mongo::executor::TaskExecutor::RemoteCommandCallbackArgs&, long unsigned int&, void> (functional:600)


In all cases, the invalid write appears to be a write into a freed block. Frequently, the address appears to be aligned 'Address 0x...e'. So, this is very interesting.

Another engineer and I took a close look at one of these instances, and
we do not believe there is any way that the mutex could be accessed
after it was deleted.

Is there a way we can disable the libc lock elision code? An environment
variable or other similar setting? We would like to see if we still see
these sorts of reports after disabling lock elision. If so, then it
would almost certainly be a logic error in our code that we are just
missing. On the other hand, if the valgrind reports go away when we
disable lock elision, then it would be evidence that lock elision might
be at fault for the stack corruption we are observing, at which point I
would re-try our original repro.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-10 19:25:49 UTC
Permalink
Hi Andrew,

That indeed looks suspicious. I've been talking with our libc team. It
appears that the existing patch that provides for disabling lock elision
dynamically isn't present in the libc on Ubuntu 16.04, which is very
unfortunate. They are thinking about other possible solutions.

This seems a very possible culprit as lock elision is not present on the
other configurations where you've seen success, to my knowledge.

We may need to get a special test build of glibc with --disable-lock-
elision for you to test with. I will ask the Ubuntu toolchain team
whether this is a possibility.

Thanks for persevering through the valgrind concerns to get to this
clue!

Bill
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 19:28:21 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos

- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Peter Bergner
2016-11-10 22:16:56 UTC
Permalink
The following might override the HTM lock elision. Can someone try it
to see if it works?

***@ampere:~$ cat pthread_mutex_lock.c
#include <pthread.h>

#define PTHREAD_MUTEX_NO_ELISION_NP 512
extern int __pthread_mutex_lock (pthread_mutex_t *);

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
return __pthread_mutex_lock (mutex);
}
***@ampere:~$ gcc -c -fPIC pthread_mutex_lock.c
***@ampere:~$ gcc -shared -Wl,-soname,libfoo.so.1 -o libfoo.so.1 pthread_mutex_lock.o
***@ampere:~$ LD_PRELOAD=./libfoo.so.1 ./a.out

...replacing ./a.out with the binary you want to run without lock
elision.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 22:19:02 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 23:24:31 UTC
Permalink
I have the libfoo.so.1 interposer running, I will let it run overnight
and report back tomorrow with any interesting findings.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 23:28:33 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-10 23:40:51 UTC
Permalink
I don't think the interposition is working, or I'm doing something
wrong.


I changed pthread_mutex_lock.c to the following:

$ cat pthread_mutex_lock.c
#include <pthread.h>
#include <stdlib.h>

#define PTHREAD_MUTEX_NO_ELISION_NP 512
extern int __pthread_mutex_lock (pthread_mutex_t *);

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
abort();
mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
return __pthread_mutex_lock (mutex);
}

$ gcc -c -fPIC pthread_mutex_lock.c
$ gcc -shared -Wl,-soname,libfoo.so.1 -o libfoo.so.1


I then wrote a small C++ program:

$ cat use_foo.cpp
#include <mutex>

int main(int argc, char* argv[]) {
std::mutex m;
std::lock_guard<std::mutex> guard(m);
return EXIT_SUCCESS;
}


Compiled it to a.out:

$ g++ -std=c++11 -pthread ./use_foo.cpp


When run, it does *not* terminate in abort:

LD_PRELOAD=$(realpath ./libfoo.so.1 ) ./a.out ; echo $?


Under gdb, I can see that the libfoo.so.1 library is loaded:

$ LD_PRELOAD=$(realpath ./libfoo.so.1 ) gdb ./a.out
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...(no debugging symbols found)...done.
(gdb) sta
Temporary breakpoint 1 at 0x10000a24
Starting program: /home/acm/opt/src/1640518/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".

Temporary breakpoint 1, 0x0000000010000a24 in main ()
(gdb) info inferior
Num Description Executable
* 1 process 14046 /home/acm/opt/src/1640518/a.out
(gdb) !lsof -p 14046 | grep lib
a.out 14046 amorrow mem REG 8,2 74328 2621621 /lib/powerpc64le-linux-gnu/libgcc_s.so.1
a.out 14046 amorrow mem REG 8,2 856616 2622143 /lib/powerpc64le-linux-gnu/libm-2.23.so
a.out 14046 amorrow mem REG 8,2 1851512 2622139 /lib/powerpc64le-linux-gnu/libc-2.23.so
a.out 14046 amorrow mem REG 8,2 171632 2622098 /lib/powerpc64le-linux-gnu/libpthread-2.23.so
a.out 14046 amorrow mem REG 8,2 2042040 2752992 /usr/lib/powerpc64le-linux-gnu/libstdc++.so.6.0.21
a.out 14046 amorrow mem REG 8,2 69136 6182484 /home/acm/opt/src/1640518/libfoo.so.1
a.out 14046 amorrow mem REG 8,2 268976 2622140 /lib/powerpc64le-linux-gnu/ld-2.23.so


Setting a breakpoint in pthread_mutex_lock lands me in __GI___pthread_mutex_lock. Perhaps the interposition symbol needs to be changed?

$ LD_PRELOAD=$(realpath ./libfoo.so.1 ) gdb ./a.out
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...(no debugging symbols found)...done.
(gdb) break pthread_mutex_lock
Breakpoint 1 at 0x10000da0
(gdb) r
Starting program: /home/acm/opt/src/1640518/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".

Breakpoint 1, __GI___pthread_mutex_lock (mutex=0x3fffb7ff0908 <_rtld_global+2312>) at ../nptl/pthread_mutex_lock.c:67
warning: Source file is more recent than executable.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-10 23:48:32 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 00:09:05 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 00:59:44 UTC
Permalink
------- Comment From ***@us.ibm.com 2016-11-10 19:54 EDT-------
(In reply to comment #49)
$ LD_PRELOAD= gdb ./a.out
That should be:
$ LD_PRELOAD=/path-to.../libfoo.so.1 gdb ./a.out

Also, if you use LD_PRELOAD (without export) for the gdb command, I
wouldn't expect it to carry forward to the inferior. Although, as you
say, gdb shows the library loaded.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 01:09:04 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Peter Bergner
2016-11-11 01:28:41 UTC
Permalink
When I add the abort and use your C++ test case, I see the abort:

***@ampere:~$ cat pthread_mutex_lock.c
#include <stdlib.h>
#include <pthread.h>

#define PTHREAD_MUTEX_NO_ELISION_NP 512
extern int __pthread_mutex_lock (pthread_mutex_t *);

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
abort();
mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
return __pthread_mutex_lock (mutex);
}
***@ampere:~$ gcc -fPIC -c pthread_mutex_lock.c
***@ampere:~$ gcc -shared -Wl,-soname,libbar.so.1 -o libbar.so.1 pthread_mutex_lock.o
***@ampere:~$ cat foo.cpp
#include <mutex>

int main(int argc, char* argv[]) {
std::mutex m;
std::lock_guard<std::mutex> guard(m);
return EXIT_SUCCESS;
}
***@ampere:~$ g++ -std=c++11 -pthread foo.cpp
***@ampere:~$ ./a.out
***@ampere:~$ LD_PRELOAD=./libbar.so.1 ./a.out
Aborted
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Peter Bergner
2016-11-11 01:38:32 UTC
Permalink
gdb shows the abort is from the shim library too:

***@ampere:~$ gdb -q ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
(gdb) set environment LD_PRELOAD=./libbar.so.1
(gdb) run
Starting program: /home/bergner/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".

Program received signal SIGABRT, Aborted.
0x00003fffb7b3f27c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00003fffb7b3f27c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00003fffb7b418f4 in __GI_abort () at abort.c:89
#2 0x00003fffb7f507e4 in pthread_mutex_lock () from ./libbar.so.1
#3 0x0000000010000954 in __gthread_mutex_lock(pthread_mutex_t*) ()
#4 0x0000000010000b10 in std::mutex::lock() ()
#5 0x0000000010000be8 in std::lock_guard<std::mutex>::lock_guard(std::mutex&) ()
#6 0x0000000010000a80 in main ()
(gdb)
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 01:38:47 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 01:49:48 UTC
Permalink
------- Comment From ***@us.ibm.com 2016-11-10 20:44 EDT-------
(In reply to comment #51)
Post by bugproxy
(In reply to comment #49)
Something doesn't look right in your commands. They all have "LD_PRELOAD="
$ LD_PRELOAD= gdb ./a.out
It's a mirroring problem to our bugzilla. The command looks correct in
the Launchpad comment. I recommend reading/adding comments in Launchpad
rather than from our bugzilla.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 07:54:12 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos

- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Adam Conrad
2016-11-11 16:11:10 UTC
Permalink
A test build of glibc with lock elision disabled is in progress here:

https://launchpad.net/~adconrad/+archive/ubuntu/nole/+packages

That said, the above trace looks suspiciously like a double-unlock.
That breaks pthread rules, but the software implementation has
historically let you do it anyway (or, rather, it fails to abort there
because it would be expensive to do so, but your code is still buggy,
prone to hanging, etc).

There's a thread on debian-devel right now about this same issue[1] as
it related to ghostscript, where people noticed that with hardware lock
elision, the world exploded, and without, it seemed to work, maybe, ish
(though, as I said, the ghostscript bug was likely the cause of some
mysterious hangs).

[1] https://lists.debian.org/debian-devel/2016/11/msg00210.html
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 16:19:08 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Paul Clarke
2016-11-11 16:20:47 UTC
Permalink
from "man pthread_mutex_unlock":
If the mutex type is PTHREAD_MUTEX_ERRORCHECK, then error checking
shall be provided. If a thread attempts to relock a mutex that it has
already locked, an error shall be returned. If a thread attempts to
unlock a mutex that it has not locked or a mutex which is unlocked, an
error shall be returned.

That might be something to try.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-11 16:25:43 UTC
Permalink
Per the blog post mentioned from the thread in #35, this sort of problem
should also manifest on a Broadwell or Skylake processor. Andrew, have
you tried running on such machines?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 16:29:04 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-11 16:35:05 UTC
Permalink
"Per logs from message #15 on bug #842796:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15

SIGSEGV on __lll_unlock_elision is a signature (IME with very high
confidence) of an attempt to unlock an already unlocked lock while
running under hardware lock elision.


Well, unlocking an already unlocked lock is a pthreads API rule
violation, and it is going to crash the process on something that
implements hardware lock elision."

So I think we have some pretty good evidence of an application problem.
I think that using Paul Clarke's suggestion may be necessary for you to
figure out where the double-unlock is occurring. I'm not confident that
valgrind will spot this.

We're going to continue trying to reproduce on our side and disable TLE
to confirm that this segv goes away. Hard to know if this is related to
the original reported problem, of course, but perhaps losing TLE will
allow valgrind to find that if it's a separate issue.

** Bug watch added: Debian Bug tracker #842796
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 16:39:00 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-11 18:14:17 UTC
Permalink
First, I'm not sure what I was doing wrong yesterday, but I now have the
LD_PRELOAD lock-elision-disablement running. And, when running under
valgrind, we no longer see the reports from valgrind. I'm now running
without valgrind to see whether we still observe stack corruption.

A few comments on the most recent comments.

Re #40: Bill, we have never actually seen a natural SEGV in
__lll_unlock_elision in our tests - instead we see random stack
corruption, per my description in #5. We saw some complaints about
__lll_lock_elision when running under valgrind, but some comments in
this thread have made me question whether valgrind actually understands
what is happening there.

Re #38: We were thinking the same thing. I will ask around to see if we
have any appropriate machines.

Re #37: I don't think it is an option for us as almost all of our locks
are wrapped pthread locks via std::mutex.

Re #35: Almost all of our lock management is via C++ RAII types
(std::unique_lock, etc), and we run these tests across many many
systems. I'm almost certain that a double unlock would have been caught
by now, or would cause crashes elsewhere. In particular I would expect
that the windows debug runtime would alert us. However, I will see if
there is some easy way we can check for this.

I'm running the original repro right now with HLE disabled. Usually it
takes a few hours before we see a stack corruption event, so I will
follow up later today with more results.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 18:19:13 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 19:20:18 UTC
Permalink
------- Comment From ***@us.ibm.com 2016-11-11 14:14 EDT-------
I've been able to set up and reproduce this on a p8 16.04 system here. Seems to be the identical signature, 2 bytes at address 0xXXXXXe in the middle of the canary block set to 0x00.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-11 19:57:37 UTC
Permalink
That is good news that you have been able to reproduce the issue. I'm
currently running the reproducer with the LD_PRELOAD disable-lock-
elision hack in place, without valgrind, and I'm currently at 55 runs
with no crashes. I will let it run overnight.

Also, per the earlier comment about double unlocking: I confirmed with
one of the other engineers here that the Windows Debug CRT does check
for double unlocks, and this test suite runs in our CI loop in that
configuration without issue.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 20:19:22 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Peter Bergner
2016-11-11 20:30:38 UTC
Permalink
I'll note that the LD_PRELOAD interposer library is only needed for
binaries that are already compiled and you want to override the
pthread_mutex_lock() routine. If you can recompile your source, then
you can place the interposer directly into your source and there is no
need for LD_PRELOADing anything. Ie, you could just add the following
to any of your source files:

#define PTHREAD_MUTEX_NO_ELISION_NP (512)

extern int __pthread_mutex_lock (pthread_mutex_t *);
int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
return __pthread_mutex_lock (mutex);
}
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Peter Bergner
2016-11-11 20:38:16 UTC
Permalink
...and possibly wrap the above in:

#ifdef __powerpc__
...
#endif

so it's only used on POWER?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 20:49:14 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos

- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-11 20:53:03 UTC
Permalink
Question: Is there any magic I can do to this test case:

python buildscripts/resmoke.py --suites=concurrency_sharded
--storageEngine=wiredTiger --excludeWithAnyTags=requires_mmapv1
--dbpathPrefix=... --repeat=500 --continueOnFailure

that would allow me to run multiple copies on the same machine?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 20:59:05 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-11 22:08:39 UTC
Permalink
Found it, looks like the --basePort option to resmoke is what I want.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-11 22:14:30 UTC
Permalink
Arron, re #50, yes, you can run as may copies as you want
simultaneously, as long as:

1) The --dbpathPrefix argument points to distinct paths. So resmoke.py
... --dbpathPrefix=/var/tmp/run1 and resmoke.py ...
--dbpathPrefix=/var/tmp/run2, etc.

2) You specify disjoint "port ranges" with the --basePort argument to
resmoke.py. I don't happen to know how deep in to each port range one
instance of the test needs to go, so I usually separate them by 10k but
that is probably way too generous. I'll bet 2k is sufficient.

Does that answer your question?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-11 22:16:12 UTC
Permalink
Peter, re #47, yes, that is certainly true. However, I'm actually
finding it advantageous to load it via LD_PRELOAD exactly because I
don't need to recompile. So I can toggle back and forth between lock
elision on/off without needing to recompile.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 22:29:19 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Adam Conrad
2016-11-11 22:44:16 UTC
Permalink
Ahh, if you're never actually seeing a SEGV in __lll_unlock_elision,
this may indeed be something far more subtle than a double unlock and,
in fact, I'd be even MORE interested in having you run the same batch of
testing on an otherwise identical (ie: Ubuntu 16.04, blah blah blah)
Skylake system.

It may well be a subtle glibc bug on powerpc, or a whackadoo silicon
bug, or a more generic glibc bug or still a mongo bug, but the more data
we can scrape up (I mean, unless you accidentally stumble on the bug in
the process and fix it, then yay), the better.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 22:49:21 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-11 22:58:54 UTC
Permalink
Andrew,
Yes, that is working nicely with separate DB dirs and basePort I'm running multiple copies on one machine. Thanks!
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-11 23:00:23 UTC
Permalink
Adam I agree on all points.

So far, my repro running with the LD_PRELOAD hack is at 118 iterations
with no crashes and going strong. Given that we had an ~5% repro rate
without the LD_PRELOAD hack, this is looking very encouraging, but I'm
going to let it run all weekend just to be sure.

As for skylake, we are definitely going to work on tracking one down
next week.

And of course, our default stance has always been that it is more likely
our bug than anything else, but you never know! We will continue to run
experiments and gather data.

Thanks again for your help.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-11 23:02:52 UTC
Permalink
This is the other thing I am trying. I've modified the Canary object to
use a 128k stack zone and then use mprotect to mark the aligned 64k page
that's in the middle of it read-only. When the destructor is called, it
changes it back to read-write. This should cause any write to this
region to get a segv, and give us an idea of what is writing on the
stack in the resulting coredump.

** Patch added: "Canary that uses mprotect() to mark a page read-only."
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+attachment/4776177/+files/mprotect.patch
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-11 23:05:51 UTC
Permalink
One other thing, if you use the mprotect thing, it may be necessary to
bump up the value of /proc/sys/vm/max_map_count, depending on how many
of these Canary objects get constructed.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-11 23:09:40 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-12 01:49:27 UTC
Permalink
An engineer on our side did some Canary+mprotect experiments as well,
but I don't happen to have details on what the approach/results were
right now. I'll ask them to update this ticket with any interesting
findings they may have.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-12 02:11:16 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-12 15:51:00 UTC
Permalink
Here's another interesting data point. The original bug description
specifies that the memory corruption is not seen on Ubuntu 15. Per
https://bugzilla.linux.ibm.com/show_bug.cgi?id=117535, however,
transactional lock elision has been enabled by default since Ubuntu
15.04 (glibc 2.21). Yet on 16.04, the use of TLE is sufficient to cause
the stack corruption.

This seems to deepen the mystery more than it illuminates it. Have
there been changes to TLE between the releases that could be at fault?
Is another unknown component involved?

I'm told this morning that so far no failures are observed using the
mprotect canary, the working theory being that the syscall disturbs the
timing too much. Otherwise our results are consistent with yours on
16.04: failures with TLE enabled, no failures with the LD_PRELOAD
workaround.

** Bug watch added: bugzilla.linux.ibm.com/ #117535
https://bugzilla.linux.ibm.com/show_bug.cgi?id=117535
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-12 16:01:08 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-13 01:32:28 UTC
Permalink
Regarding Ubuntu 15, I think that was a miscommunication somewhere along
the line.

The only versions of Ubuntu that we build for are the LTS releases
(12.04, 14.04, and 16.04), and the only one of those we have ever built
on POWER is 16.04. Other than Ubuntu 16.04, the only other POWER distro
we target is RHEL 7.1. So what we are seeing is I think still consistent
with either a latent locking bug in MongoDB exposed by the new-to-us
lock elision support, or one of the other sorts of issues as mentioned
by Adam above.

Also, my test run using the disabled lock elision just reached 500
successful runs. I'm going to switch lock elision back on and make sure
I start seeing corruption again, but I'm nearly certain that I will.

Thanks for the update on the mprotect experiment. It would have been
great if that had worked. I think we will be able to share some details
on our similar experiments on Monday.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-13 01:41:02 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-13 03:47:48 UTC
Permalink
An update on my experiments:
* 500 runs no failures with TLE disabled
* 500 runs no failures TLE enabled but mprotect() syscall in Canary constructor/destructor
* 500 runs 11 failed with TLE enabled so about 2% fail rate
* Tried switching SMT off and interestingly got 200 runs no fails with TLE enabled.

This suggests to me that the timing of this race condition is rather
tight. A lot of the fails are in the checksum right after the memset in
the Canary constructor, which means the other guy comes back and writes
the stack after memset wrote it but before the checksum read it. Also if
strace's syscall timing is to be believed, 14% of the time mprotect() is
under 8 microseconds, and even that seems to be sufficient to prevent
the problem. Finally, by forcing other pthreads to always be on a
different processor core (by disabling SMT) we also apparently eliminate
this.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-13 04:00:29 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Dimitri John Ledkov
2016-11-13 10:58:44 UTC
Permalink
Hello, could bugproxy please be silenced, and/or prevented from
reposting the same python command line over and over again? It seems
there are sometimes attachments/comments that get amplified and re-
posted with every other comment.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-13 11:10:51 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-13 14:21:18 UTC
Permalink
We'll try to get to the bottom of the bugproxy oddity. It seems to be
echoing a chunk out of comment #5 for no reason that I can see.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
bugproxy
2016-11-13 14:30:35 UTC
Permalink
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++
CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-
cache --build-fast-and-loose -j ./mongo ./mongod ./mongos
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-14 13:32:13 UTC
Permalink
Given the results of Aaron's experiments over the weekend, here's a summary of what we think we're seeing:
- There is an unsafe interaction between two threads.
- This interaction can only be observed in a very small time window, and then relatively rarely.
- The interaction is observable only when TLE is enabled. We posit this is because TLE improves lock/unlock speed, so that the observable time window applies.
- The time window is sufficiently small that a relatively fast syscall to mprotect is sufficient to disrupt the timing.
- If all threads are forced to run on separate processors (SMT=1), this is sufficient to disrupt the timing.

We believe this is an application problem. Further "evidence" against a
problem in TLE itself is that TLE has been enabled for ppc64el on Ubuntu
for over 18 months without any similar reports.

Unfortunately, I don't think we have any way to directly debug the
problem, due to the narrow time window. We have considered hardware
watchpoints. There is a DAWR (data address watch register) for each
thread, but using it in this case seems impractical. GDB's
implementation of hardware watchpoints in a multithreaded environment is
such that, when a hardware watchpoint is set, it is set to the same
address for all threads. So even if we were to script the setting of a
hardware watchpoint under GDB, the time required for GDB to set up the
watchpoint address on the fly would surely exceed the critical time
window. You could try it, but I wouldn't expect much.

Further debugging seems to require application knowledge or a code crawl
of some kind. The setting of two flag bytes is the only clue we have.
To my knowledge we have never seen only a single byte clobbered in the
canary, and the two bytes appear to be aligned on a 16-bit boundary. So
the code doing the clobbering very likely contains a store-halfword (sth
or, less likely, sthx or sthu) instruction. You could examine the
disassembly of the application for occurrences of sth to narrow the
field of search.

It could be two individual stores, in which case you'd be looking for
two instructions very close together of the form:

stb r<x>,<n>(r<b>)
stb r<y>,<n+1>(r<b>)

Example:

stb r5,0(r9)
stb r6,1(r9)

Obviously this is a huge application so this doesn't help much in and of
itself, but perhaps if you've already narrowed the problem down
somewhat, this could be helpful.

I am running low on ideas for how we can help you debug the problem.
We'll continue discussing it; if any better thoughts arise, we'll be
sure to let you know.

Bill
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-14 14:49:09 UTC
Permalink
Well, let me back that off a little. We're going to look into the TLE
code a little more. The various __lll_*_elision routines are handed a
pointer to short that they update, which certainly looks suspicious. So
it's certainly possible that something in the pthreads implementation
that calls this code is providing a bad pointer to TLE.

However, we would expect to see the same problem on x86 or s390 if that
is the case, unless there is some POWER-specific code in the pthreads
implementation. So again a Skylake experiment would be helpful.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-14 15:57:36 UTC
Permalink
Ulrich Weigand made an interesting comment on the glibc code in our
internal bug (no longer mirrored), so I'm mirroring it by hand here.

--- Comment #77 from Ulrich Weigand <***@de.ibm.com> ---
According to comment #45, we see an invalid read at __lll_unlock_elision
(elision-unlock.c:36) and and invalid write at the next line. Looking at the
code, those are:

int
__lll_unlock_elision (int *lock, short *adapt_count, int pshared)
{
/* When the lock was free we're in a transaction. */
if (*lock == 0)
__libc_tend (0);
else
{
lll_unlock ((*lock), pshared);

/* Update the adapt count AFTER completing the critical section.
Doing this here prevents unneeded stalling when entering
a critical section. Saving about 8% runtime on P8. */
if (*adapt_count > 0)
(*adapt_count)--;
}
return 0;
}

exactly the two lines accessing *adapt_count (which is a short).

What seems suspicious here is the fact that these lines happen (deliberately)
*outside* of the critical section protected by the mutex. Now this should
indeed be normally OK. However, I understand it is allowed to use that
critical section to also protect the life time of the memory object holding the
mutex itself. If this is done, I'm wondering whether those two lines outside
the critical section may now access that *memory* outside of its protected life
time ...

Just as a theoretical example, assume you have thread 1 executing a function
that allocates a mutex on its stack, locks the mutex, and passes its address to
a thread 2. Thread 2 now does now work and then unlocks the mutex. Thread 1,
in the meantime, attempts to lock the mutex again, which will block until it
was unlocked by thread 2. After it got the lock, thread 1 now immediately
unlocks the mutex again and then the function returns.

Now in this scenario, the cross-thread lock-unlock pair cannot use TM, so it
will fall back to regular locks, which means the unlock in thread 2 will
perform the *adapt_count update. On the other hand, the second lock-unlock
pair in thread 1 might just happen to go through with TM, and thus not touch
*adapt_count at all. This means there is no synchronization between the update
of *adapt_count in thread 2, and the re-use of the stack memory underlying the
mutex in thread 1, which could lead to the *adapt_count update being seen
*after* that reuse in thread 1, clobbering memory ...

Am I missing something here? Of course, I'm not sure if MongoDB shows that
exact pattern. But I guess different patterns might show similar races too.

One simple test might be to move the *adapt_count update to before the
lll_unlock and check whether the issue still reproduces. Not sure about
performance impacts, however.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-14 18:24:50 UTC
Permalink
I think it is very likely that we are doing the sort of stack-based
mutex pattern described above, or something similar. In particular, I'd
expect that we certainly have states where we wait on a stack mutex, and
then immediately unwind and destroy the mutex after we unblock.

I'm working on getting access to a Skylake machine - we will definitely
do that test if we can get access.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Aaron Sawdey
2016-11-14 23:52:56 UTC
Permalink
So, I rebuilt the glibc 2.23 from the 16.04 sources and modified the
values written to the adapt_count parm in the lock elision code. It's a
short and the original code may store values 0, 1, 2, 3. We were seeing
either 1 (canary hit in constructor) or 0 (canary hit in destructor). I
changed it to use the values 0x3333, 0x2222, 0x1111, and 0. And I just
saw the constructor canary hit with the expected values 0x11, 0x11 in
the changed bytes. So, this is a race condition in the lock elision code
with mutex located on the stack and being reused quickly by another
hardware thread on the same processor core.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-15 00:04:23 UTC
Permalink
Aaron thank you very much for running that experiment and confirming
that this is an issue in libc. I think the component should probably be
updated?

Also, would you like us to try to continue to repro on a Skylake
machine, or is this all architecture neutral code and therefore the
POWER repro is conclusive?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-15 00:53:33 UTC
Permalink
Hi Andrew,

This is a POWER-specific "optimization" that dates to last December (so
it in fact wouldn't show up in Ubuntu 15.04 or 15.10, it appears). The
decrement used to be attached to the lock rather than the unlock and it
was apparently moved to the present location because it showed a good
performance improvement. Aaron is running some experiments with moving
it ahead of the unlock rather than following it, which we believe will
remove the problem. That may or may not be the final version of the
fix, but it looks like the simplest solution.

The only reason for you to test on a Skylake machine would be for your
own peace of mind, in case any TLE-specific issues are hiding in the x86
implementation. This specific problem, at least, is a POWER issue.

** Package changed: gcc-5 (Ubuntu) => glibc (Ubuntu)
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-15 02:03:22 UTC
Permalink
Hi Bill -

Thanks for the update, and for clarifying that this is POWER 16.04 only.
We are very happy to be at a root cause for this issue - it had us
pretty worried! We really appreciate all the help from everyone involved
here.

Will there be an upstream glibc bug associated with this ticket that we
can follow?
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-15 02:30:03 UTC
Permalink
Hi Andrew,

Aaron has just opened
https://sourceware.org/bugzilla/show_bug.cgi?id=20822. An odd
confluence of vacations, world holidays, and family leave has conspired
to take all of our glibc experts out of the office tomorrow, so we may
be a little delayed with testing and submitting the final fix, thought
the rest of us will do what we can with it. We are likewise very
relieved to have proof of a cause! And we're certainly sorry for all
the trouble this has caused you.

Our team will work with the Ubuntu toolchain guys to get a solution in
the field as soon as possible. Thanks again for your patience and
professionalism!

Bill

** Bug watch added: Sourceware.org Bugzilla #20822
https://sourceware.org/bugzilla/show_bug.cgi?id=20822
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Andrew Morrow
2016-11-15 15:43:09 UTC
Permalink
Hi Bill -

Thanks for the glibc bug link.

Totally understand about people being out, not a problem. However, I'm
not very familiar with the development process for upstream glibc fixes
to make their way into an LTS release.

Do you have a rough estimate of the timeline for that landing in
something that we can pull in via apt-get update? Are we talking days,
weeks, months? My expectation right now is that we are going to continue
forward with issuing the MongoDB 3.4 GA on POWER Ubuntu 16.04, and
relnote this issue, probably stating that MongoDB 3.4 should not be run
in production on POWER Ubuntu 16.04 unless glibc has been upgraded to
version X or newer.

We might even add a server startup warning, if it will be possible for
us to detect at runtime that we are running with an affected glibc. I'd
imagine that might prove difficult if this comes in as an Ubuntu patch
to the existing libc version though. Thoughts on that?

Andrew
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-15 20:19:38 UTC
Permalink
Hi Andrew -

I don't work directly in the glibc community, so I'm not completely
familiar with their policies. However, the first step is to get a bug
approved upstream so that it can be backported to the 2.23 release (in
this case). Adam Conrad at Canonical has volunteered to help us
shepherd the patch along, so we hope to be able to expedite it.

We have two possible patches to fix the problem; we are trying to
determine which will have the least performance impact. I am hopeful
that we can get a decision on that tomorrow so the patch can be put on
the libc-alpha list and, we hope, approved without too much difficulty.

I will have to defer to Canonical folks (Adam, Matthias) on the process
for getting the fix available, but I'm sure this isn't their first rodeo
and this shouldn't take too long. My best guess is that the whole
process may get done within a week, if all goes well with the community,
and a little longer if not. We are definitely not talking months.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Adam Conrad
2016-11-15 20:57:35 UTC
Permalink
** Also affects: glibc via
https://sourceware.org/bugzilla/show_bug.cgi?id=20822
Importance: Unknown
Status: Unknown
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-16 16:50:18 UTC
Permalink
Patch submission is here: https://sourceware.org/ml/libc-
alpha/2016-11/msg00568.html
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Adam Conrad
2016-11-16 20:58:53 UTC
Permalink
** Also affects: glibc (Ubuntu Xenial)
Importance: Undecided
Status: New

** Also affects: glibc (Ubuntu Yakkety)
Importance: Undecided
Status: New

** Changed in: glibc (Ubuntu)
Assignee: Taco Screen team (taco-screen-team) => Adam Conrad (adconrad)

** Changed in: glibc (Ubuntu Xenial)
Assignee: (unassigned) => Adam Conrad (adconrad)

** Changed in: glibc (Ubuntu Yakkety)
Assignee: (unassigned) => Adam Conrad (adconrad)
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
William J. Schmidt
2016-11-16 20:59:42 UTC
Permalink
Hi Andrew,

Canonical's plans for handling this in the short term are described
here: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1642390. We
will continue to work the POWER patch, but the SRU described there is
all you should need to track for your relnote.

Bill
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Brian W Hart
2016-11-17 20:08:24 UTC
Permalink
Howdy, I'm the originator of:

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1641241

(which is dup'd to this bug)

I tested the new ("ubuntu5") libc6 packages from xenial-proposed. They
prevent the crash with TensorFlow, and I have not noticed any other
problems.

Will update:

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1642390
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Launchpad Bug Tracker
2016-12-29 02:38:40 UTC
Permalink
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: glibc (Ubuntu)
Status: New => Confirmed
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Launchpad Bug Tracker
2016-12-29 02:38:40 UTC
Permalink
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: glibc (Ubuntu Xenial)
Status: New => Confirmed
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Launchpad Bug Tracker
2016-12-29 02:38:40 UTC
Permalink
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: glibc (Ubuntu Yakkety)
Status: New => Confirmed
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Bug Watch Updater
2017-01-24 09:43:01 UTC
Permalink
** Changed in: glibc
Status: Unknown => Fix Released

** Changed in: glibc
Importance: Unknown => Medium
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Launchpad Bug Tracker
2017-04-08 01:48:16 UTC
Permalink
This bug was fixed in the package glibc - 2.24-9ubuntu2

---------------
glibc (2.24-9ubuntu2) zesty; urgency=medium

* debian/patches/any/cvs-resolv-internal-qtype.diff: Revert to avoid
failure in name resolution on upgrades from yakkety (LP: #1674532)

-- Adam Conrad <***@ubuntu.com> Tue, 21 Mar 2017 15:27:15 -0600

** Changed in: glibc (Ubuntu)
Status: Confirmed => Fix Released
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1640518/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-***@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Loading...