Discussion:
[gem5-users] dacapo (java) benchmark suite encounters "SIGSEGV" and "null exception" during timing mode (fs mode) after restarting from a checkpoint
Da Zhang
2018-07-16 15:39:11 UTC
Hey guys,

I am testing a Java benchmark suite, DaCapo, on gem5 in fs mode.
Unfortunately, I encounter a lot of SIGSEGVs and null exceptions in
timing mode after restoring from checkpoints.
I am using Linux kernel v4.8.13 and ubuntu-server-16.04.1 with
Oracle JDK v8.0_171-b11. To rule out the influence of my own modifications to
gem5 src/ and configs/, I re-downloaded gem5 and checked out commit
"ee2ffdc0fdb489767768e5273a4ccd7b51735c7c", which is the gem5 version I am
working on. The checkpoint was taken using the KVM CPU with 1 CPU and 16GB
of memory. For the simulation, I use build/X86/gem5.opt (in order to enable
assertions) in fs mode (configs/example/fs.py). Other options include
"--cpu-type=DerivO3CPU -n 1 --mem-size=16GB --caches --l2cache
--l2_size=${L2SIZE}" (I tried L2SIZE values from 256KB to 8MB). When I test
with 100ms of warmup and 1ps of real simulation time, no errors appear. But
with a longer real simulation time, the benchmark suite crashes with a
segfault.
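
For anyone trying to reproduce this setup, here is a minimal sketch that assembles the gem5 command line for each L2 size in the sweep. Only the flags quoted in this post are used; the checkpoint-restore options are not listed in the post, so they are left out. The intermediate sweep points (powers of two) and the "kB" spelling of the size suffix are assumptions, not something stated in the thread.

```python
import shlex

GEM5_BINARY = "build/X86/gem5.opt"
FS_SCRIPT = "configs/example/fs.py"

# Assumed sweep points: the post only says "from 256KB to 8MB".
# The "kB" suffix spelling is also an assumption about gem5's size parser.
L2_SIZES = ["256kB", "512kB", "1MB", "2MB", "4MB", "8MB"]


def build_command(l2_size):
    """Return the argument list for one run with the given L2 size."""
    return [
        GEM5_BINARY,
        FS_SCRIPT,
        "--cpu-type=DerivO3CPU",
        "-n", "1",
        "--mem-size=16GB",
        "--caches",
        "--l2cache",
        "--l2_size=" + l2_size,
    ]


def sweep_commands():
    """Return one argument list per L2 size in the sweep."""
    return [build_command(size) for size in L2_SIZES]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in sweep_commands():
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing rather than executing keeps the sketch harmless; the checkpoint options actually used would have to be appended before running anything.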
I am able to run the DaCapo benchmark suite in fs mode with the KVM CPU
without any segfaults or exceptions. I have also tested some simple Java
benchmarks; neither segfaults nor exceptions occur.
Does anyone have suggestions for, or experience with, these issues?

best,
Da Zhang
Da Zhang
2018-07-16 17:10:14 UTC
To clarify, the SIGSEGVs and null exceptions happen in the benchmark suite,
not in gem5. gem5 itself runs without errors. But in the system.pc.com_1.device
files, I observe that most of the benchmarks crash due to SIGSEGV or null
exceptions.
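
Since the crashes only show up in the simulated serial console output, a small script can scan those files for JVM fatal-error reports. The following is a minimal sketch, assuming the HotSpot signal line has the shape "# SIGSEGV (0xb) at pc=..., pid=..." as seen in this thread; the function name and the way the file path is passed are illustrative choices, not part of gem5.

```python
import re
import sys

# Matches the HotSpot fatal-error signal line, for example:
#   # SIGSEGV (0xb) at pc=0x00007f81d17742b7, pid=1474, tid=0x...
SIGNAL_RE = re.compile(
    r"#\s+(SIG[A-Z]+)\s+\((0x[0-9a-fA-F]+)\)\s+at\s+pc=(0x[0-9a-fA-F]+),\s+pid=(\d+)"
)


def find_jvm_crashes(text):
    """Return one dict per JVM fatal-error signal line found in text."""
    crashes = []
    for match in SIGNAL_RE.finditer(text):
        crashes.append({
            "signal": match.group(1),
            "pc": int(match.group(3), 16),
            "pid": int(match.group(4)),
        })
    return crashes


if __name__ == "__main__":
    # Usage (path is whatever your run directory contains):
    #   python scan_console.py <path to system.pc.com_1.device>
    with open(sys.argv[1], errors="replace") as console:
        for crash in find_jvm_crashes(console.read()):
            print("%s at pc=%#x (pid %d)"
                  % (crash["signal"], crash["pc"], crash["pid"]))
```

Running this over the console file of every benchmark gives a quick list of which runs crashed and at which PC, without opening each file by hand.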
Example crash report (from x/system.pc.com_1.device):
"
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f81d17742b7, pid=1474, tid=0x00007f81cf46d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 1815 C2 org.apache.xml.serializer.ToHTMLStream.endElement(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)V (389 bytes) @ 0x00007f81d17742b7 [0x00007f81d1774280+0x37]
#
#
"
Jason Lowe-Power
2018-07-16 17:31:15 UTC
Hello,

Are you seeing any warnings like "warn: Instruction XXX not implemented"?

There are many X86 SIMD instructions that are currently unimplemented. I
would bet that your application is using some of those instructions and
getting 0's as the output instead of the correct value.

The "right" way to solve this problem is to implement these instructions
(and we would really appreciate it if you contributed your fixes back at
https://gem5-review.googlesource.com). The other option is to recompile your
applications without SIMD extensions (e.g., -march=athlon64 or whatever is
the original x86-64 name in GCC). However, this likely requires recompiling
the entire Java runtime in your case.

Cheers,
Jason
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Da Zhang
2018-07-16 20:50:15 UTC
Hey Jason,

There are a bunch of "warn: instruction 'prefetch_nta' unimplemented"
messages in atomic mode, during which the Java benchmarks don't crash.
However, there are no such warnings in timing mode. Does this imply that
unimplemented instructions are not causing the problem? Any clues or
suggestions for debugging these problems?

best,
Da Zhang
Jason Lowe-Power
2018-07-16 21:15:23 UTC
Hi,

I would think that it's OK to ignore the "prefetch_nta is unimplemented"
warning, since that instruction should just give hints to the cache. If you
don't see *any* other unimplemented-instruction warnings, then I would guess
that unimplemented instructions are not the problem.
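
To check whether *any* other instruction is reported, the simulator log can be tallied by instruction name. This is a minimal sketch, assuming the warnings have the form quoted in this thread ("warn: instruction 'prefetch_nta' unimplemented") and that gem5's stderr was saved to a file. Treating every name that starts with "prefetch" as harmless is an assumption based on the reasoning above, and the helper names are illustrative.

```python
import re
import sys
from collections import Counter

# Warning format as quoted in this thread; other gem5 versions may differ.
WARN_RE = re.compile(
    r"warn:\s+instruction\s+'([^']+)'\s+unimplemented", re.IGNORECASE
)

# Assumption: prefetch hints do not change functional results.
HARMLESS_PREFIXES = ("prefetch",)


def tally_unimplemented(lines):
    """Count unimplemented-instruction warnings per instruction name."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def suspicious(counts):
    """Return only the instructions that are not known-harmless hints."""
    return {name: n for name, n in counts.items()
            if not name.startswith(HARMLESS_PREFIXES)}


if __name__ == "__main__":
    # Usage: python tally_warnings.py <saved gem5 stderr log>
    with open(sys.argv[1], errors="replace") as log:
        counts = tally_unimplemented(log)
    for name, n in sorted(suspicious(counts).items()):
        print("%s: %d warning(s)" % (name, n))
```

An empty result from `suspicious` supports the guess that unimplemented instructions are not the cause; it does not rule out an instruction that is implemented incorrectly.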

Another possibility is that something is going wrong with taking or
restoring the checkpoint. Have you tried fast-forwarding with KVM and then
switching CPUs without taking a checkpoint? If that works and the checkpoint
doesn't, then you'll know that it's a problem with checkpointing.

Jason
Gabe Black
2018-07-16 21:19:53 UTC
The older SIMD extensions (the various SSEs, MMX/3DNow!) are generally
implemented, but newer extensions like vpx generally aren't. The prefetch
hint instructions you see warnings about are fine, since prefetches aren't
functionally necessary. There might be an instruction which isn't
implemented properly and which causes a NULL pointer to be dereferenced, and
unfortunately there's no great way to find out which one without trying to
debug your Java program and the interpreter when it crashes.

If you haven't yet, try comparing the atomic and timing simple CPUs
(instead of the o3, aka detailed, CPU) to help narrow things down. The only
difference between the simple CPUs is how they talk to memory and how the
instructions ask them to, whereas the o3 has register renaming, etc., which
can expose additional problems.

Gabe
Gutierrez, Anthony
2018-07-16 21:46:56 UTC
Da,

Do you encounter the segfault only when restoring from a checkpoint? That is, if you do not use checkpoints can any DaCapo benchmark successfully complete under one of the simple CPU models (and not just KVM CPU)?

If so, you may want to get a syscall trace (e.g., using strace) to see what sorts of files the JVM is trying to read etc. It’s possible that the VM generates some files that it will read back later. If you use checkpoints, due to the disk image COW layer, I do not believe any disk updates are checkpointed, thus these files will not persist, which could lead to some weird segfault issues. Not sure if this is happening in your case, but it may be worth investigating.

I created some of the original Android disk images, and the original DaCapo image, and at that time I would typically run the benchmarks through FS mode with the Atomic CPU once, with the COW layer disabled, in order to generate the needed files on the disk image and have them persist. This was entirely for performance, however, to prevent the VMs from regenerating the same files for each run, but I can envision it causing issues during runtime as well. In particular, it seems your code is faulting while doing some XML serializing/deserializing; perhaps the XML file it is looking for is gone?

Beyond that, assuming it is a real bug in gem5, I would recommend an ExecAll trace to figure out why the instruction at that PC is faulting.
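
Before wading through an ExecAll trace, it helps to know which addresses to look for. The sketch below parses the "Problematic frame" line from the JVM crash report and checks that the bracketed base plus offset equals the faulting PC (here 0x00007f81d1774280 + 0x37 = 0x00007f81d17742b7). Reading the bracketed value as the start of the compiled method's code is an assumption about the HotSpot report format, and the helper names are illustrative.

```python
import re

# Matches the tail of the "Problematic frame" line, for example:
#   @ 0x00007f81d17742b7 [0x00007f81d1774280+0x37]
FRAME_RE = re.compile(
    r"@\s*(0x[0-9a-fA-F]+)\s*\[(0x[0-9a-fA-F]+)\+(0x[0-9a-fA-F]+)\]"
)


def parse_frame(line):
    """Extract the faulting PC, code base, and offset from a frame line."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    pc, base, offset = (int(group, 16) for group in match.groups())
    return {
        "pc": pc,
        "base": base,
        "offset": offset,
        # Sanity check: the report should satisfy base + offset == pc.
        "consistent": base + offset == pc,
    }


def in_method_prefix(addr, frame):
    """True if addr lies between the code base and the faulting PC."""
    return frame["base"] <= addr <= frame["pc"]


if __name__ == "__main__":
    frame = parse_frame(
        "(389 bytes) @ 0x00007f81d17742b7 [0x00007f81d1774280+0x37]"
    )
    print("base=%#x offset=%#x pc=%#x consistent=%s"
          % (frame["base"], frame["offset"], frame["pc"], frame["consistent"]))
```

With the offset being only 0x37 bytes, the instructions executed between the base and the faulting PC form a short window, so `in_method_prefix` can be used to pick the relevant lines out of a trace once the trace's address column has been parsed.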

-Tony

Da Zhang
2018-07-19 18:15:06 UTC
Thanks for the suggestions.
I have been trying a couple of solutions (I only tested a small subset
of the DaCapo benchmark suite that encounters segfaults with the O3CPU):

1. using TimingSimpleCPU: no segfaults
2. disabling the COW layer and writing to the disk image when taking the
checkpoint: there are still segfaults
3. taking checkpoints with the JIT compiler disabled (20x slowdown): no
segfaults
4. taking checkpoints in atomic mode (without warming up the JIT): no
segfaults
5. taking checkpoints with Java compressed oops disabled: there are still
segfaults

One thing that I can't tell is whether the benchmark hangs, since nothing is
printed during execution. Is there a statistic I can use to tell whether
the benchmark hangs?

So far, all my experiments have run with 1 CPU (even though some benchmarks
are multithreaded). I attempted to take some checkpoints with more CPUs using
the KVM CPU, but unfortunately I got some "rcu_sched self-detected stall on
CPU" messages. Any ideas?

Gutierrez, Anthony
2018-07-19 18:36:27 UTC
JIT was precisely the issue I was thinking was causing this. One thing that may be necessary is to make sure you sync the disk image before taking your checkpoint.

gem5’s debug flags should help you identify something like a hang, for example an ExecAll trace. A SyscallAll trace would most likely help you understand better what the JIT is doing.

From: gem5-users <gem5-users-***@gem5.org> On Behalf Of Da Zhang
Sent: Thursday, July 19, 2018 11:15 AM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] dacapo (java) benchmark suite encounters "SIGSEGV" and "null exception" during timing mode (fs mode) after restarting from a checkpoint

Thanks for the suggestions.
I have been trying a couple of solutions (I only test for a small subset of decapo benchmark suite, which encounters segfault with O3CPU):

1. using TimingSimpleCPU: no segfaults
2. disable COW layer and write on the disk image when taking checkpoint: there are still segfaults
3. take checkpoints with JIT compiler disabled (20x slowdown): no segfaults
4. take checkpoints during atomic mode (without warming up JIT): no segfaults
5. take checkpoints with Java OOPs compress disabled: there are still segfaults

One thing that I can't tell is if the benchmark hangs since there is no printing during the execution. Is there a statistic I can use to tell if the benchmark hangs?

So far, all my experiments are running using 1CPU (even some benchmarks are multithreading). I attempted to take some checkpoints with more CPUs with KVM CPU. But unfortunately, I got some "rcu_sched self-detected stall on CPU" issues. Any idea?

On Mon, Jul 16, 2018 at 5:47 PM Gutierrez, Anthony <***@amd.com<mailto:***@amd.com>> wrote:
Da,

Do you encounter the segfault only when restoring from a checkpoint? That is, if you do not use checkpoints can any DaCapo benchmark successfully complete under one of the simple CPU models (and not just KVM CPU)?

If so, you may want to get a syscall trace (e.g., using strace) to see what sorts of files the JVM is trying to read etc. It’s possible that the VM generates some files that it will read back later. If you use checkpoints, due to the disk image COW layer, I do not believe any disk updates are checkpointed, thus these files will not persist, which could lead to some weird segfault issues. Not sure if this is happening in your case, but it may be worth investigating.

I created some of the original Android disk images, and the original DaCapo image, and at that time I would typically run the benchmarks through FS mode with the Atomic CPU once, with the COW layer disabled, in order to generate the needed files on the disk image and have them persist. This was entirely for performance, to prevent the VMs from regenerating the same files for each run, but I can envision it causing issues during runtime as well. In particular, it seems your code is faulting while doing some XML serializing/deserializing; perhaps the XML file it is looking for is gone?

Beyond that, assuming it is a real bug in gem5, I would recommend an ExecAll trace to figure out why the instruction at that PC is faulting.
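
A minimal sketch of how such a trace run could be assembled, assuming the fs.py options quoted in the original post plus gem5's `--debug-flags`, `--debug-start`, and `--debug-file` options. The checkpoint index, start tick, L2 size, and trace file name are hypothetical values; kernel, disk-image, and checkpoint-directory options are omitted.

```python
def gem5_trace_command(checkpoint_index, debug_start_tick, l2_size="2MB"):
    """Build an argv list that restores a checkpoint and records an ExecAll trace."""
    # Options for the gem5 binary itself must come before the config script.
    gem5_opts = [
        "--debug-flags=ExecAll",
        "--debug-start=%d" % debug_start_tick,  # start tracing near the fault
        "--debug-file=exec_trace.out",          # hypothetical file name
    ]
    # Options for configs/example/fs.py follow the script path.
    script_opts = [
        "--checkpoint-restore=%d" % checkpoint_index,
        "--cpu-type=DerivO3CPU",
        "-n", "1",
        "--mem-size=16GB",
        "--caches",
        "--l2cache",
        "--l2_size=%s" % l2_size,
    ]
    return ["build/X86/gem5.opt"] + gem5_opts + ["configs/example/fs.py"] + script_opts


if __name__ == "__main__":
    print(" ".join(gem5_trace_command(checkpoint_index=1, debug_start_tick=0)))
```

ExecAll output grows very quickly under O3CPU, so delaying the start tick until shortly before the crash is usually the only practical way to keep the trace manageable.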

-Tony

From: gem5-users [mailto:gem5-users-***@gem5.org] On Behalf Of Da Zhang
Sent: Monday, July 16, 2018 1:50 PM
To: gem5 users mailing list <gem5-***@gem5.org>
Subject: Re: [gem5-users] dacapo (java) benchmark suite encounters "SIGSEGV" and "null exception" during timing mode (fs mode) after restarting from a checkpoint

Hey Jason,

There are a bunch of "warn: instruction 'prefetch_nta' unimplemented" warnings in atomic mode, during which the Java benchmarks don't crash. However, there are no such warnings in timing mode. Does that imply that unimplemented instructions aren't causing the problem? Any clues or suggestions for debugging these problems?

best,
Da Zhang



On Mon, Jul 16, 2018 at 1:32 PM Jason Lowe-Power <***@lowepower.com> wrote:
Hello,

Are you seeing any warnings like "warn: Instruction XXX not implemented"?

There are many X86 SIMD instructions that are currently unimplemented. I would bet that your application is using some of those instructions and getting 0's as the output instead of the correct value.

The "right" way to solve this problem is to implement these instructions (and we would really appreciate it if you contribute your fixes back on https://gem5-review.googlesource.com). The other option is to recompile your applications without SIMD extensions (e.g., -march=athlon64 or whatever is the original x86-64 name in GCC). However, this likely requires compiling all of the Java runtime in your case.

Cheers,
Jason

_______________________________________________
gem5-users mailing list
gem5-***@gem5.org<mailto:gem5-***@gem5.org>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
Da Zhang
2018-07-19 18:59:48 UTC
Permalink
Hey Gutierrez,

By "*sync* the disk image", do you mean making sure all disk modifications
are actually written to the disk (up to date) before taking the checkpoint?
How do I do that?
I haven't previously tried taking a checkpoint with the COW layer disabled
and then restarting from it. All I have done is press "ctrl+c" to stop gem5
and take the checkpoint (--checkpoint-at-end); I rely on gem5 to take care
of everything that needs to be checked when taking checkpoints.

Best,
Da Zhang

On Thu, Jul 19, 2018 at 2:36 PM Gutierrez, Anthony wrote:
JIT was precisely the issue I was thinking was causing this. One thing that
may be necessary is to ensure you *sync* the disk image before taking
your checkpoint.
gem5’s debug flags should help you identify something like a hang, for
example an ExecAll trace. A SyscallAll trace would most likely help you
understand better what the JIT is doing.
Gutierrez, Anthony
2018-07-19 19:11:48 UTC
Permalink
Yes, make sure all buffers are flushed, etc. Before taking your checkpoint, you can call the “sync” command, which should already be installed on the image. You’ll need to call sync before your commands to halt and take a checkpoint.

This page explains how I did the same for an Android disk image: http://gem5.org/BBench-gem5#Tips_for_Making_Your_Disk_Image_gem5_Friendly
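
As a rough illustration of the ordering described above, the sketch below generates a guest-side script that starts the workload, waits, flushes filesystem buffers with `sync`, and only then takes the checkpoint with the `m5` utility. The benchmark command and warm-up delay are hypothetical, and it assumes the `m5` binary is installed on the disk image.

```python
def checkpoint_script(benchmark_cmd, warmup_seconds):
    """Return the text of a guest shell script that syncs before checkpointing."""
    lines = [
        "#!/bin/sh",
        benchmark_cmd + " &",          # run the workload in the background
        "sleep %d" % warmup_seconds,   # hypothetical warm-up period
        "sync",                        # flush dirty buffers to the disk image
        "m5 checkpoint",               # then ask gem5 to take the checkpoint
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The java command line here is only a placeholder.
    print(checkpoint_script("java -jar dacapo.jar xalan", 60))
```

Generating the script rather than typing the commands at the console keeps the `sync` and checkpoint steps adjacent, so nothing else can dirty the buffers in between.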

-Tony

Da Zhang
2018-07-19 20:22:09 UTC
Permalink
Thanks a lot for the tips. I will give it a try.

best,
Da

Da Zhang
2018-07-19 21:58:47 UTC
Permalink
I just did a quick test of one benchmark using sync and no COW layer, but
it still hit a SIGSEGV. I took the checkpoint by running the benchmark in
the background as root (with the KVM CPU); I warmed up the JIT for one
round and took the checkpoint in the second round by using "sync && m5
exit" with --checkpoint-at-end.
Here are two SIGSEGVs (same checkpoint, different fast-forward times):
1.
# SIGSEGV (0xb) at pc=0x00007f8b592c4b80, pid=1482, tid=0x00007f8b50408700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 2671 C2 java.math.MutableBigInteger.divideMagnitude(Ljava/math/MutableBigInteger;Ljava/math/MutableBigInteger;Z)Ljava/math/MutableBigInteger; (1307 bytes) @ 0x00007f8b592c4b80 [0x00007f8b592c4700+0x480]

2.
# SIGSEGV (0xb) at pc=0x00007f8b6e6b02e4, pid=1482, tid=0x00007f8b50408700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5952e4] frame::sender(RegisterMap*) const+0x114
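
The two reports above fault in different kinds of frames: in HotSpot's frame legend, `J` marks compiled Java code and `V` marks VM code. The sketch below pulls the signal, pc, pid, and frame type out of such a header so that many runs can be compared quickly. It is only an illustration; the regular expressions assume the header layout shown in this thread, and the function name is made up.

```python
import re

# Frame-type letters as listed in the legend of HotSpot crash reports.
FRAME_KINDS = {
    "J": "compiled Java code",
    "j": "interpreted Java code",
    "V": "VM code",
    "v": "VM code",
    "C": "native code",
}

SIGNAL_RE = re.compile(
    r"#\s+(SIG\w+)\s+\(0x[0-9a-f]+\)\s+at\s+pc=(0x[0-9a-f]+),\s+pid=(\d+)"
)
FRAME_RE = re.compile(r"#[ \t]+([JjVvC])[ \t]+(.*)")


def summarize_crash(report):
    """Extract signal, pc, pid and problematic-frame type from a crash header."""
    signal = SIGNAL_RE.search(report)
    if signal is None:
        return None
    summary = {
        "signal": signal.group(1),
        "pc": int(signal.group(2), 16),
        "pid": int(signal.group(3)),
        "frame_kind": None,
        "frame": None,
    }
    # The frame line is the first typed line after "Problematic frame:".
    _, marker, tail = report.partition("Problematic frame:")
    frame = FRAME_RE.search(tail) if marker else None
    if frame is not None:
        summary["frame_kind"] = FRAME_KINDS[frame.group(1)]
        summary["frame"] = frame.group(2).strip()
    return summary


if __name__ == "__main__":
    sample = (
        "# SIGSEGV (0xb) at pc=0x00007f8b6e6b02e4, pid=1482, tid=0x00007f8b50408700\n"
        "# Problematic frame:\n"
        "# V [libjvm.so+0x5952e4] frame::sender(RegisterMap*) const+0x114\n"
    )
    print(summarize_crash(sample))
```

A crash inside C2-compiled code in one run and inside libjvm itself in another, from the same checkpoint, suggests the faulting location is not tied to a single Java method.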



On Thu, Jul 19, 2018 at 3:12 PM Gutierrez, Anthony <
Post by Gutierrez, Anthony
Yes, make sure all buffers are flushed, etc., before taking your
checkpoint you can call the “sync” command, which should be already
installed on the image. You’ll need to call sync before your commands to
halt and take a checkpoint.
http://gem5.org/BBench-gem5#Tips_for_Making_Your_Disk_Image_gem5_Friendly
-Tony
*Sent:* Thursday, July 19, 2018 12:00 PM
*Subject:* Re: [gem5-users] dacapo (java) benchmark suite encounters
"SIGSEGV" and "null exception" during timing mode (fs mode) after
restarting from a checkpoint
Hey Gutierrez,
"*sync* the disk image", do you mean making sure all disk modifications
are actually made on the disk (update to date) before taking the
checkpoint? How to do that?
I haven't tried to take a checkpoint with COW layer disabled and then
restart from that checkpoint before. All I have done is "ctrl+c" to stop
gem5 to take the checkpoint (--checkpoint-at-end); I rely on gem5 to take
care of all things that need to be checked when taking checkpoints.
Best,
Da Zhang
On Thu, Jul 19, 2018 at 2:36 PM Gutierrez, Anthony <
JIT was precisely the issue I was thinking was causing this. One thing may
be necessary, that is to ensure you *sync* the disk image before taking
your checkpoint.
gem5’s debug flags should help you identify something like a hang, for
example an ExecAll trace. A SyscallAll trace would most likely help you
understand better what the JIT is doing.
*Sent:* Thursday, July 19, 2018 11:15 AM
*Subject:* Re: [gem5-users] dacapo (java) benchmark suite encounters
"SIGSEGV" and "null exception" during timing mode (fs mode) after
restarting from a checkpoint
Thanks for the suggestions.
I have been trying a couple of solutions (I only test for a small subset
1. using TimingSimpleCPU: no segfaults
there are still segfaults
3. take checkpoints with JIT compiler disabled (20x slowdown): no segfaults
4. take checkpoints during atomic mode (without warming up JIT): no segfaults
5. take checkpoints with Java compressed OOPs disabled: there are still segfaults
One thing that I can't tell is if the benchmark hangs since there is no
printing during the execution. Is there a statistic I can use to tell if
the benchmark hangs?
So far, all my experiments run on 1 CPU (even though some benchmarks
are multithreaded). I attempted to take some checkpoints with more CPUs
using the KVM CPU, but unfortunately I got some "rcu_sched self-detected
stall on CPU" issues. Any ideas?
On Mon, Jul 16, 2018 at 5:47 PM Gutierrez, Anthony <
Da,
Do you encounter the segfault only when restoring from a checkpoint? That
is, if you do not use checkpoints can any DaCapo benchmark successfully
complete under one of the simple CPU models (and not just KVM CPU)?
If so, you may want to get a syscall trace (e.g., using strace) to see
what sorts of files the JVM is trying to read etc. It’s possible that the
VM generates some files that it will read back later. If you use
checkpoints, due to the disk image COW layer, I do not believe any disk
updates are checkpointed, thus these files will not persist, which could
lead to some weird segfault issues. Not sure if this is happening in your
case, but it may be worth investigating.
I created some of the original Android disk images, and the original
DaCapo image, and at that time I would typically run the benchmarks through
FS mode with the Atomic CPU once, with the COW layer disabled, in order to
generate the needed files on the disk image and have them persist. This was
entirely for performance, to prevent the VMs from regenerating the
same files for each run, but I can envision it causing issues during
runtime as well. In particular, it seems your code is faulting while
doing some XML serializing/deserializing. Perhaps the XML file it is
looking for is gone?
Beyond that, assuming it is a real bug in gem5, I would recommend an
ExecAll trace to figure out why the instruction at that PC is faulting.
-Tony
*Sent:* Monday, July 16, 2018 1:50 PM
*Subject:* Re: [gem5-users] dacapo (java) benchmark suite encounters
"SIGSEGV" and "null exception" during timing mode (fs mode) after
restarting from a checkpoint
Hey Jason,
There are a bunch of "warn: instruction 'prefetch_nta' unimplemented"
messages in atomic mode, during which the java benchmarks don't crash.
However, there are no such warnings in timing mode. Does that imply that
unimplemented instructions aren't causing the problem? Any clues or
suggestions for debugging these problems?
best,
Da Zhang
Hello,
Are you seeing any warnings like "warn: Instruction XXX not implemented"?
There are many X86 SIMD instructions that are currently unimplemented. I
would bet that your application is using some of those instructions and
getting 0's as the output instead of the correct value.
The "right" way to solve this problem is to implement these instructions
(and we would really appreciate it if you contribute your fixes back on
https://gem5-review.googlesource.com). The other option is to recompile
your applications without SIMD extensions (e.g., -march=athlon64 or
whatever is the original x86-64 name in GCC). However, this likely requires
compiling all of the java runtime in your case.
Cheers,
Jason
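To see which unimplemented instructions a run actually hits, one option is to tally the warnings in gem5's captured output. The sketch below assumes the message format quoted elsewhere in this thread ("warn: instruction 'prefetch_nta' unimplemented"); the function name and the log file path are hypothetical, and other gem5 versions may word the warning differently.

```python
import re
from collections import Counter

# Matches the warning format quoted in this thread; case-insensitive in
# case the capitalization differs between gem5 versions.
WARN_RE = re.compile(r"warn: instruction '([^']+)' unimplemented",
                     re.IGNORECASE)


def count_unimplemented(log_text):
    """Count unimplemented-instruction warnings per instruction name."""
    return Counter(match.group(1) for match in WARN_RE.finditer(log_text))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_warns.py gem5_stderr.log
    with open(sys.argv[1], errors="replace") as handle:
        counts = count_unimplemented(handle.read())
    for name, hits in counts.most_common():
        print(f"{hits:8d}  {name}")
```

A prefetch hint that is ignored is usually harmless, whereas an unimplemented instruction that is supposed to produce a value would be a much stronger suspect.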
_______________________________________________
gem5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users