Running multi-threaded workloads in O3 CPU

Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I understand
that for running multi-programmed benchmarks one can pass a vector of
benchmarks as the parameters and set the cpu.numThreads to the size of
vector. But is it possible for the O3 CPU to extract the TLP in a (single)
given multithreaded benchmark. So what I am trying to say is lets say I just
want to run a single multi-threaded benchmark and just by specifying the
number of SMT threads (say 4 SMT threads) can the O3 CPU extract the threads
from the benchmark and run in a SMT fashion? I would appreciate any
directions in this regard.

As far as I know, O3CPU will only run the threads that you give it, so your
actual benchmark needs to specify how many threads are going to be run in
parallel in order for you to see any benefit in SE_SMT. The problem of
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.

Post by abhishek rawat
Something tells me that it is not possible. And if that is the case then
how do I compare different SMT configurations for a given benchmark ?

That's a bit of a abstract question and on the M5-side, I'm not sure there
is a definite *right* answer here just user-dependent ones. For some people,
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.

Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each other,
and then decide what to do next with their results.

--
- Korey

abhishek rawat

2010-04-16 04:43:18 UTC

Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will have
to provide as input for instance 2 FFT benchmarks. And for 4-threaded SMT
similarly I will have to provide as input 4 FFT benchmark. In this case how
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and then
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.

Thanks,
--Abhishek

Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I understand

Post by abhishek rawat
that for running multi-programmed benchmarks one can pass a vector of
benchmarks as the parameters and set the cpu.numThreads to the size of
vector. But is it possible for the O3 CPU to extract the TLP in a (single)
given multithreaded benchmark. So what I am trying to say is lets say I just
want to run a single multi-threaded benchmark and just by specifying the
number of SMT threads (say 4 SMT threads) can the O3 CPU extract the threads
from the benchmark and run in a SMT fashion? I would appreciate any
directions in this regard.

Post by abhishek rawat
Something tells me that it is not possible. And if that is the case then
how do I compare different SMT configurations for a given benchmark ?

That's a bit of a abstract question and on the M5-side, I'm not sure there
is a definite *right* answer here just user-dependent ones. For some people,
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.
Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each other,
and then decide what to do next with their results.
--
- Korey
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users

--
-Abhi

Korey Sewell

2010-04-16 04:50:00 UTC

Post by abhishek rawat
Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will have
to provide as input for instance 2 FFT benchmarks. And for 4-threaded SMT
similarly I will have to provide as input 4 FFT benchmark. In this case how
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and then
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.

Unless your FFT has pthreads or some other multithreaded software that will
spawn threads for you, then no, I dont see it as possible. But, I'll let
someone else respond if they have additional thoughts.

--
- Korey

Steve Reinhardt

2010-04-16 12:50:50 UTC

If the benchmark is explicitly multithreaded and compiled against the
m5threads library, then you can run it on an SMT O3 CPU and it should
work.

Steve

Post by abhishek rawat
I have this doubt related to O3 SMT model in ALPHA_SE mode. So I
understand that for running multi-programmed benchmarks one can pass a
vector of benchmarks as the parameters and set the cpu.numThreads to the
size of vector. But is it possible for the O3 CPU to extract the TLP in a
(single) given multithreaded benchmark. So what I am trying to say is lets
say I just want to run a single multi-threaded benchmark and just by
specifying the number of SMT threads (say 4 SMT threads) can the O3 CPU
extract the threads from the benchmark and run in a SMT fashion? I would
appreciate any directions in this regard.

As far as I know, O3CPU will only run the threads that you give it, so
your actual benchmark needs to specify how many threads are going to be run
in parallel in order for you to see any benefit in SE_SMT. The problem of
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.

Post by abhishek rawat
Something tells me that it is not possible. And if that is the case then
how do I compare different SMT configurations for a given benchmark ?

That's a bit of a abstract question and on the M5-side, I'm not sure there
is a definite *right* answer here just user-dependent ones. For some people,
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.
Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each other,
and then decide what to do next with their results.
--
- Korey
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users

--
-Abhi
_______________________________________________
m5-users mailing list
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users

abhishek rawat

2010-04-17 14:19:50 UTC

Thanks Steve! I tried using the run.py in configs/splash2 directory but I
keep getting segmentation fault. I am running FFT benchmark (I downloaded
the binaries from the link provided at M5 wiki's splash benchmark's page). I
made following changes in my python script (run.py):

................................
class FFT(LiveProcess):
cwd = options.rootdir + '/kernels/fft'
executable = options.rootdir + '/kernels/fft/FFT'
cmd = ['FFT', '-p', str(*options.numcpus*options.smt*), '-m18']

..............................
..............................
cpus = [DerivO3CPU(cpu_id=i,
clock=options.frequency,
*numThreads=options.smt,
smtNumFetchingThreads=options.smt,
smtFetchPolicy='RoundRobin'*)
for i in xrange(options.numcpus)]

So, if I understood what Steve said in last mail then I think I am running
it correctly. I tried debugging but it didn't help. I am using a latest
version of M5 which I downloaded from the repositories about a week ago.
Also when I used smtNumFetchingThreads=1 and numThreads=1 then it works!!

This is what I got after running gdb:

: /af20/ar8eb/Desktop/CS8501/m5-ad784e759a74 ; gdb build/ALPHA_SE/m5.debug
GNU gdb 6.8-debian
................................................................
This GDB was configured as "x86_64-linux-gnu"...
(gdb) run configs/splash2/run_test.py -n 1 --smt=2
--rootdir=../m5-stable-94c016415053/v1-splash-alpha/splash2/codes/ -d -b FFT
....................................................................
command line:
/net/af20/ar8eb/Desktop/CS8501/m5-ad784e759a74/build/ALPHA_SE/m5.debug
configs/splash2/run_test.py -n 1 --smt=2
--rootdir=../m5-stable-94c016415053/v1-splash-alpha/splash2/codes/ -d -b FFT
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
0: system.remote_gdb.listener: listening for remote gdb #1 on port 7001

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f6d50dc76f0 (LWP 18930)]
0x00000000004522a0 in std::vector<int, std::allocator<int> >::push_back
(this=0x40, __x=@0x7fff58dea6b4) at
/usr/include/c++/4.2/bits/stl_vector.h:599
599 if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)

Thanks,
Abhishek

Post by Steve Reinhardt
If the benchmark is explicitly multithreaded and compiled against the
m5threads library, then you can run it on an SMT O3 CPU and it should
work.
Steve

Post by abhishek rawat
Thanks for the reply. I have a follow up question. So, lets say I am only
using one O3 core and If I want to run 2-threaded SMT that means I will

have

Post by abhishek rawat
to provide as input for instance 2 FFT benchmarks. And for 4-threaded

SMT

Post by abhishek rawat
similarly I will have to provide as input 4 FFT benchmark. In this case

how

Post by abhishek rawat
can I do the comparison as the benchmarks have changed, in size. So my
question was can I just provide lets say 1 FFT benchmark as input and

then

Post by abhishek rawat
try to run it with 2 threaded-SMT and then with 4 threaded-SMT
configuration. And I think you are saying it is not possible.
Thanks,
--Abhishek

the

Post by abhishek rawat
size of vector. But is it possible for the O3 CPU to extract the TLP in

Post by abhishek rawat
(single) given multithreaded benchmark. So what I am trying to say is

lets

Post by abhishek rawat
say I just want to run a single multi-threaded benchmark and just by
specifying the number of SMT threads (say 4 SMT threads) can the O3 CPU
extract the threads from the benchmark and run in a SMT fashion? I

would

Post by abhishek rawat
appreciate any directions in this regard.

As far as I know, O3CPU will only run the threads that you give it, so
your actual benchmark needs to specify how many threads are going to be

run

Post by Korey Sewell
in parallel in order for you to see any benefit in SE_SMT. The problem

Post by Korey Sewell
automatically extracting TLP is in fact it's own subset of the research
community, so if you wish, I'm sure you can find plenty of literature on
that topic.

Post by abhishek rawat
Something tells me that it is not possible. And if that is the case

then

Post by abhishek rawat
how do I compare different SMT configurations for a given benchmark ?

That's a bit of a abstract question and on the M5-side, I'm not sure

there

Post by Korey Sewell
is a definite *right* answer here just user-dependent ones. For some

people,

Post by Korey Sewell
comparing a 2-threaded SMT core v. a 4-threaded SMT core is fine. For
others, comparing a 2T-SMT core vs. 2 CMP cores is their objective.
Considering a configuration to be the benchmark running and the hardware
that it is to be run on, It's up to the user to specify different M5
configurations that are interesting to them, run M5 for different
configurations, figure out what configurations are comparable to each

other,