Discussion:
[gem5-dev] Locked RMW in Ruby
Nilay Vaish
2011-10-25 19:58:08 UTC
Hi

I am trying to make the O3 CPU work with Ruby, but I am running into a
problem with the implementation of locked RMW in Ruby. Currently, when the
read part of an RMW is issued, Ruby puts the block on a special list. The
block is taken off that list when the write part of the RMW is issued. If any
other processor issues a read / write request for that block in between
the RMW's read and write operations, the request is delayed until the block
is unlocked. This means that the RMW can never fail and the write request
must always be issued.
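
The locking behavior described above can be sketched roughly as follows. This is a minimal model under stated assumptions, not Ruby's actual Sequencer code: `LockedLineTable`, `MemRequest`, `ReqType`, and all member names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical model of the behavior described above; this is not Ruby's
// actual code, and every name here is made up for illustration.
using Addr = std::uint64_t;

enum class ReqType { LockedRMWRead, LockedRMWWrite, Read, Write };

struct MemRequest {
    int cpu;
    ReqType type;
    Addr line;
};

class LockedLineTable {
  public:
    // Returns true if the request may proceed now, false if it is deferred
    // because another CPU holds the line between its RMW read and write.
    bool issue(const MemRequest &req) {
        auto it = owner_.find(req.line);
        if (it != owner_.end() && it->second != req.cpu) {
            deferred_.push_back(req);
            return false;
        }
        if (req.type == ReqType::LockedRMWRead) {
            owner_[req.line] = req.cpu;  // read half: put the line on the list
        } else if (req.type == ReqType::LockedRMWWrite) {
            // The write half must follow a read half from the same CPU.
            assert(it != owner_.end());
            owner_.erase(it);            // write half: take the line off the list
        }
        return true;
    }

    // Hands back the waiting requests so the caller can replay them.
    std::deque<MemRequest> takeDeferred() {
        std::deque<MemRequest> out;
        out.swap(deferred_);
        return out;
    }

    bool isLocked(Addr line) const { return owner_.count(line) != 0; }

  private:
    std::unordered_map<Addr, int> owner_;  // locked line -> owning CPU
    std::deque<MemRequest> deferred_;
};
```

Note that nothing in this model ever releases a line except the write half, which is why a read half that is never followed by its write would leave the line locked.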

Reading the code from the classic memory system, it seems that it allows
the block to be given away if some other processor requests it.
This means that the classic memory system allows an RMW to fail.

My question is which of these behaviors is actually implemented in x86? As
I understand it, LL/SC is allowed to fail in the MIPS and Alpha
architectures. I would assume that the same holds true for x86 as well. Is
that the case or not?

Thanks
Nilay
Steve Reinhardt
2011-10-25 20:20:21 UTC
Hi Nilay,

No, x86 locked RMW accesses are different from Alpha/MIPS LL/SC, and they
are not allowed to be interrupted (once the cache begins the sequence).

In the old days, LL/SC was all that the gem5 CPU models implemented, and the
LOCKED flag meant LLSC. A while ago we renamed the old LOCKED flag to LLSC
and added a new LOCKED flag that means x86 atomic RMW.

The situation you're observing is that the classic memory system only
implements LLSC and not LOCKED. In contrast, I believe Ruby implements both
LOCKED and LLSC.
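
For contrast with the x86 locked RMW, here is a minimal sketch of why an LL/SC pair is allowed to fail. It assumes a simple per-CPU link monitor; `LinkMonitor` and its methods are hypothetical names, not gem5's implementation.

```cpp
#include <cstdint>

using Addr = std::uint64_t;

// Hypothetical per-CPU link monitor; names are illustrative, not gem5's.
class LinkMonitor {
  public:
    // Load-linked: remember which line this CPU is watching.
    void loadLinked(Addr line) {
        valid_ = true;
        line_ = line;
    }

    // Another CPU wrote the line: the reservation is lost, nobody is delayed.
    void observeRemoteWrite(Addr line) {
        if (valid_ && line_ == line)
            valid_ = false;
    }

    // Store-conditional: succeeds only if the reservation survived.
    bool storeConditional(Addr line) {
        bool ok = valid_ && line_ == line;
        valid_ = false;  // the reservation is consumed either way
        return ok;
    }

  private:
    bool valid_ = false;
    Addr line_ = 0;
};
```

Here the other processor's write goes through and the store-conditional reports failure so software can retry, whereas a LOCKED access must make the other processor wait instead.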

Actually one thing I have been meaning to do is add something like:

    if (req->isLocked())
        warn_once("Classic cache does not implement locked accesses. "
                  "MP execution could be wrong!\n");

to the classic cache code so people know that this is the case.

Steve
Nilay Vaish
2011-10-25 21:15:47 UTC
Does this mean that an x86 O3 CPU will never squash an RMW instruction? I
am posting an instruction + protocol trace obtained from O3 and Ruby.
In the first portion, you can see that the O3 CPU issues a locked RMW with
the read part having sn = 3051 and the write part having sn = 3052. In the
second portion, you can see that 3051 and 3052 are squashed, and in
the third portion of the trace, they are committed. There are several
things that I am not able to understand. First, why is the RMW squashed,
since the x86 architecture has to commit the instruction? Secondly, if the
RMW was being executed speculatively, then what mechanism exists for
informing the cache controller that the instruction was squashed? Thirdly,
why was the instruction committed later on, when it was originally squashed?

FullO3CPU: Ticking main, FullO3CPU.
21254500 1 Seq Begin > [0x840,
line 0x840] IFETCH
21254500: system.cpu1.iew.lsq.thread0: Executing load PC
(0x4002bd=>0x4002bf).(0=>1), [sn:3051]
21254500: system.cpu1.iew.lsq.thread0: Read called, load idx: 0, store
idx: -1, storeHead: 23 addr: 0x95b84
21254500: system.cpu1.iew.lsq.thread0: Doing memory access for inst
[sn:3051] PC (0x4002bd=>0x4002bf).(0=>1)
21254500 1 Seq Begin > [0x95b84,
line 0x95b80] Locked_RMW_Read
21254500: system.cpu1.iew.lsq.thread0: Executing store PC
(0x4002bd=>0x4002bf).(1=>2) [sn:3052]
21254500: system.cpu1.iew.lsq.thread0: Doing write to store idx 23, addr
0x95b84 data ^A | storeHead:23 [sn:3052]

:
:
:

21256500: system.cpu1.iew.lsq.thread0: Squashing until [sn:2993]!(Loads:2
Stores:2)
21256500: system.cpu1.iew.lsq.thread0: Load Instruction PC
(0x4002c7=>0x4002c8).(0=>1) squashed, [sn:3060]
21256500: system.cpu1.iew.lsq.thread0: Load Instruction PC
(0x4002bd=>0x4002bf).(0=>1) squashed, [sn:3051]
21256500: system.cpu1.iew.lsq.thread0: Store Instruction PC
(0x4002c4=>0x4002c7).(0=>1) squashed, idx:24 [sn:3059]
21256500: system.cpu1.iew.lsq.thread0: Store Instruction PC
(0x4002bd=>0x4002bf).(1=>2) squashed, idx:23 [sn:3052]

:
:
:

21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(0=>1) [sn:3013]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(1=>2) [sn:3014]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(2=>3) [sn:3015]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(0=>1) [sn:3032]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(1=>2) [sn:3033]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(2=>3) [sn:3034]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(0=>1) [sn:3051]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(1=>2) [sn:3052]
21258000: system.cpu1: Removing committed instruction [tid:0] PC
(0x4002bd=>0x4002bf).(2=>3) [sn:3053]


Thanks
Nilay
Steve Reinhardt
2011-10-25 21:23:10 UTC
Good questions. Clearly if we ever let the R part of an RMW instruction out
to the cache, either we have to commit the instruction or add some mechanism
to unlock the block. One solution would be to mark all RMW instructions as
serializing, which would prevent them from executing speculatively. That
(or something like it) might be necessary to get the consistency model right
anyway, since I believe locked accesses act as fences (?? is that right,
Brad?).
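
One way to picture the "never execute speculatively" option is the sketch below. It is a toy reorder buffer under stated assumptions, not the gem5 O3 code: `ReorderBuffer`, `RobEntry`, and the rule that a non-speculative instruction may reach the cache only from the head of the buffer are all hypothetical simplifications.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical reorder-buffer model; not the actual gem5 O3 code.
struct RobEntry {
    std::uint64_t sn;     // sequence number, as in the trace above
    bool nonSpeculative;  // e.g. the halves of a locked RMW
    bool sentToCache;
};

class ReorderBuffer {
  public:
    void insert(std::uint64_t sn, bool nonSpeculative) {
        entries_.push_back(RobEntry{sn, nonSpeculative, false});
    }

    // A non-speculative instruction may access the cache only once it is
    // the oldest instruction in flight, so it can no longer be squashed.
    bool tryExecute(std::uint64_t sn) {
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            RobEntry &e = entries_[i];
            if (e.sn != sn)
                continue;
            if (e.nonSpeculative && i != 0)
                return false;
            e.sentToCache = true;
            return true;
        }
        return false;
    }

    // Squash everything younger than sn; returns how many squashed
    // instructions had already reached the cache.
    int squashYoungerThan(std::uint64_t sn) {
        int leaked = 0;
        while (!entries_.empty() && entries_.back().sn > sn) {
            if (entries_.back().sentToCache)
                ++leaked;
            entries_.pop_back();
        }
        return leaked;
    }

    void commitHead() {
        if (!entries_.empty())
            entries_.pop_front();
    }

  private:
    std::deque<RobEntry> entries_;
};
```

With this gating, a squash can never remove a locked read that the cache has already seen, so no separate unlock mechanism is needed; the cost is that the locked RMW waits for everything older to commit.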

Gabe, did you have an alternate solution in mind?

Steve
Nilay Vaish
2011-10-26 16:56:07 UTC
When I mark ldstl and stul as non-speculative, the O3 CPU and Ruby work on
an example program in which two threads increment a counter. Since a
locked RMW is a fence instruction (Steve suggested this above and AMD's
manual agrees), it seems that all of the loads and stores that appear
before it in program order should commit before the read portion executes.
This means that ldstl should be marked as a memory barrier, and similarly
stul should also be marked as a memory barrier. But looking at
src/arch/x86/isa/microops/ldstop.isa, it does not seem that these
flags are currently supported. If others (especially Steve and Gabe)
concur with my understanding, I can modify the file to add the memory
barrier flag.
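
A test of the kind mentioned above might look like the following sketch. It is not the program actually used in this thread; `runCounter` is a hypothetical helper. On x86, compilers typically lower `fetch_add` on a `std::atomic` integer to a lock-prefixed instruction, so each increment exercises the locked RMW path.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread performs `iters` atomic increments on a shared counter.
// Hypothetical helper for illustration; not the test used in the thread.
inline long runCounter(int numThreads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_seq_cst);
        });
    }
    for (auto &th : threads)
        th.join();
    return counter.load();
}
```

If the locked RMW is atomic, `runCounter(2, n)` returns exactly `2 * n`; a smaller result means increments were lost between the read and write halves.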

--
Nilay
Gabe Black
2011-10-27 06:59:47 UTC
I think you guys are on the right track. There's a non-speculative flag,
a serialize before, and a serialize after. I'm not sure which one is
exactly right, but some combination should be. We should be careful not
to overdo it since that might artificially hurt performance, but I
don't *think* the lock prefix is used all that much these days so it
shouldn't have *too* bad an impact if it isn't perfectly correct.

Gabe
Nilay Vaish
2011-10-27 15:32:18 UTC
I am thinking of marking all the locked instructions with IsMemBarrier.
Where do you think this flag should appear: in locked_opcodes.isa, or in
semaphores.py? I tried adding IsMemBarrier to the instructions in
locked_opcodes.isa, but that does not work. I also tried changing the
instruction format to BasicOperate, but that does not work either.

--
Nilay
Steve Reinhardt
2011-10-27 17:09:01 UTC
Hi Nilay,

I think a memory barrier may not be sufficient... we need to make sure it's
non-speculative as well as ordered (unless we do something more complicated
to deal with a speculative locked read that isn't followed by a write
because it got squashed).

Gabe is a better reference (the only reference?) for the details of the x86
decoder.

Steve
Beckmann, Brad
2011-10-27 17:53:09 UTC
Hi Nilay,

I apologize it has taken me a few days to respond. I need to read my gem5-dev email more often.

First off, I just want to be clear that we are only discussing locked prefixed RMW instructions, correct? Non-locked RMW are not an issue.

In my opinion, the absolute best source to understand the x86 memory model is Sewell et al. (http://doi.acm.org/10.1145/1785414.1785443). In the paper, they explain when processors can logically execute locked prefixed instructions in a very clear and intuitive way. As Steve said, locked prefixed instructions act as fences, but they also immediately retire to the memory system to maintain global ordering. Thus the locked prefixed instruction cannot logically complete until all prior loads and stores from that processor have been retired to the memory system. In other words, the load and store buffers must be empty. Furthermore, the result of the locked prefixed instruction must become visible immediately when the instruction retires. In other words, the store buffer cannot hold on to the store value after the core retires the instruction.
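
The store-buffer rule described above can be sketched as a toy model. This is a simplification under stated assumptions, not the formal model from the paper and not gem5 code: `TsoCore`, `Memory`, and `lockedAdd` are hypothetical names, and the global lock that keeps other cores out of memory during a locked instruction is omitted.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

using Addr = std::uint64_t;
using Memory = std::unordered_map<Addr, int>;  // absent address reads as 0

// Hypothetical single-core view of a TSO machine; names are illustrative.
class TsoCore {
  public:
    explicit TsoCore(Memory &mem) : mem_(mem) {}

    // Ordinary stores sit in a FIFO store buffer until drained.
    void store(Addr a, int v) { buffer_.emplace_back(a, v); }

    // Loads see the newest buffered store to the address, else memory.
    int load(Addr a) const {
        for (auto it = buffer_.rbegin(); it != buffer_.rend(); ++it)
            if (it->first == a)
                return it->second;
        auto m = mem_.find(a);
        return m == mem_.end() ? 0 : m->second;
    }

    // Retire all buffered stores to memory in program order.
    void drain() {
        for (const auto &s : buffer_)
            mem_[s.first] = s.second;
        buffer_.clear();
    }

    // Locked RMW: drain first, then read and write memory directly so the
    // result is globally visible when the instruction retires.
    int lockedAdd(Addr a, int delta) {
        drain();
        int old = mem_[a];
        mem_[a] = old + delta;
        return old;
    }

  private:
    Memory &mem_;
    std::deque<std::pair<Addr, int>> buffer_;
};
```

In this model an ordinary store by one core stays invisible to another core until the buffer drains, while a locked RMW both forces that drain and bypasses the buffer for its own write.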

I think the main question here is how does the O3 ld/st queue respond to the serialize before, serialize after, and fence flags? Essentially, we need to use the combination of flags that flushes the ld and st buffers before logically executing the load portion of the locked RMW, as well as bypasses the store buffer when executing the store portion of the locked RMW. There are certainly optimizations that can be implemented to maintain that logical behavior, while allowing the hardware to do more parallel execution. However, I would suggest not trying to implement those before getting the core functionality to work using existing mechanisms.

On a related note, have you thought about how you're going to propagate Ruby probes back to the O3 load buffer? Assuming a snooping load queue, that is one core mechanism that we need to implement to support X86+O3+Ruby. It might be useful for us to discuss different possible interface implementations before you spend too much time writing code.

Brad
Nilay Vaish
2011-10-27 23:34:21 UTC
Post by Beckmann, Brad
Hi Nilay,
I apologize it has taken me a few days to respond. I need to read my
gem5-dev email more often.
First off, I just want to be clear that we are only discussing locked
prefixed RMW instructions, correct? Non-locked RMW are not an issue.
Right, Ruby does not lock the address in case of non-locked RMW.
Post by Steve Reinhardt
In my opinion, the absolute best source to understand the x86 memory
model is Sewell et al. http://doi.acm.org/10.1145/1785414.1785443 In
the paper, they explain when processors can logically execute locked
prefixed instructions in a very clear and intuitive way. As Steve said,
locked prefixed instructions act as fences, but they also immediately
retire to the memory system to maintain global ordering. Thus the
locked prefixed instruction cannot logically complete until all prior
lds and sts from that processor have been retired to the memory system.
In other words, the load and store buffers must be empty. Furthermore,
the locked prefixed instruction must immediately become visible when the
locked prefixed instruction retires. In other words, the store buffer
cannot hold on to the store value after the core retires the
instruction.
I am assuming that if an instruction is marked as a memory barrier, the O3
CPU will drain the load and store buffers before and after the
instruction.
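
As a rough illustration of the behaviour described above, the following is a minimal sketch of a per-core store buffer, loosely in the spirit of the abstract machine in the Sewell et al. paper. It is not gem5 code; the `Core` class and its method names are assumptions made up for this example. The point it shows is that a locked RMW first drains all older buffered stores and then writes directly to memory, so its result is visible to other cores as soon as it completes.

```python
from collections import deque


class Core:
    """Toy core with a FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory         # shared dict: address -> value
        self.store_buffer = deque()  # pending (address, value) stores

    def store(self, addr, value):
        # Ordinary stores sit in the buffer and are not yet globally visible.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A load sees this core's newest buffered store to the address, if any.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        # Retire buffered stores to memory in program order.
        while self.store_buffer:
            a, v = self.store_buffer.popleft()
            self.memory[a] = v

    def locked_rmw(self, addr, update):
        # Fence behaviour: all older stores become visible first.
        self.drain()
        old = self.memory.get(addr, 0)
        # The store half bypasses the store buffer entirely.
        self.memory[addr] = update(old)
        return old


memory = {}
c0, c1 = Core(memory), Core(memory)
c0.store("x", 5)
seen_before = c1.load("x")                       # c0's store is still buffered
old = c0.locked_rmw("counter", lambda v: v + 1)  # drains, then updates memory
seen_after = c1.load("x")
```

This only models the logical ordering. It says nothing about how the O3 load/store queue would implement the drain, which is the open question in this thread.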
Post by Steve Reinhardt
I think the main question here is how does the O3 ld/st queue respond to
the serialize before, serialize after, and fence flags? Essentially, we
need to use the combination of flags that flushes the ld and st buffers
before logically executing the load portion of the locked RMW, as well
as bypasses the store buffer when executing the store portion of the
locked RMW. There are certainly optimizations that can be implemented
to maintain that logical behavior, while allowing the hardware to do
more parallel execution. However, I would suggest not trying to
implement those before getting the core functionality to work using
existing mechanisms.
I am in agreement with you.
Post by Steve Reinhardt
On a related note, have you thought about how you're going to propagate
Ruby probes back to the O3 load buffer? Assuming a snooping load queue,
that is one core mechanism that we need to implement to support
X86+O3+Ruby. It might be useful for us to discuss different possible
interface implementations before you spend too much time writing code.
Brad
I have a patch for this available on review board. This is the link --
http://reviews.gem5.org/r/894/

--
Nilay
Nilay Vaish
2011-10-27 23:28:57 UTC
Post by Steve Reinhardt
Hi Nilay,
I think a memory barrier may not be sufficient... we need to make sure it's
non-speculative as well as ordered (unless we do something more complicated
to deal with a speculative locked read that isn't followed by a write
because it got squashed).
I could not find anything in AMD's manual about locked instructions being
executed non-speculatively. One of the Intel manuals states that the
read portion is never issued unless it is ensured that the write
portion will also be issued. So that means that we also need to mark the
instruction as non-speculative, apart from marking it as a memory barrier.
Post by Steve Reinhardt
Gabe is a better reference (the only reference?) for the details of the x86
decoder.
That does not sound good.
Post by Steve Reinhardt
Steve
Post by Nilay Vaish
I am thinking of marking all the locked instructions with IsMemBarrier.
Where do you think this flag should appear - in locked_opcodes.isa, or in
semaphores.py? I tried adding IsMemBarrier to the instructions in
locked_opcodes.isa, but that does not work. I changed the instruction format
to BasicOperate, that also does not work.
--
Nilay
Post by Gabe Black
I think you guys are on the right track. There's a non-speculative flag,
a serialize before, and a serialize after. I'm not sure which one is
exactly right, but some combination should be. We should be careful not
to overdo it since that might artificially hurt performance, but I
don't *think* the lock prefix is used all that much these days so it
shouldn't have *too* bad an impact if it isn't perfectly correct.
Gabe
Nilay Vaish
2011-10-28 00:48:21 UTC
Now I have been able to make some progress on the ISA parser. As per my
understanding there are two ways to mark ldstl and stul as non-speculative
and memory barriers.

1. Change the definitions of LdSt and BigLdSt in
src/arch/x86/isa/microops/ldstop.isa so that they take memBar as a flag as
well. Similarly, change the definitions of StoreOp and LoadOp (they
appear in the same file) so that memBar is passed on to the superclass's
constructor. Then, in each of the Python files that make use of ldstl and
stul, add the flags nonSpec and memBar with the value True.

2. Change LdSt and BigLdSt as stated in 1. Change the
defineMicro{Store/Load}Op definitions so that they take memBar as an
argument and pass it on to the superclass's constructor. Change the
definitions of all the microops accordingly. In this case, I think the
Python files do not change. But it may mean that we cannot change a
microop's flags in different macroops.
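
The difference between the two options can be sketched in plain Python. This is not the actual ldstop.isa code: `MicroopInst`, `make_microop`, `MICROOP_FLAGS`, and `define_microop` are hypothetical names used only to show where the flags would be decided. The keywords `memBar` and `nonSpec` follow the names used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroopInst:
    """Hypothetical stand-in for an instantiated microop."""
    mnemonic: str
    mem_bar: bool = False
    non_spec: bool = False


# Option 1: every use site passes the flags when it instantiates the microop.
def make_microop(mnemonic, memBar=False, nonSpec=False):
    return MicroopInst(mnemonic, mem_bar=memBar, non_spec=nonSpec)


# Option 2: the flags are fixed once, where the microop is defined.
MICROOP_FLAGS = {
    "ldstl": {"mem_bar": True, "non_spec": True},
    "stul": {"mem_bar": True, "non_spec": True},
}


def define_microop(mnemonic):
    flags = MICROOP_FLAGS.get(mnemonic, {})

    def build():
        # Use sites call build() with no flag arguments at all.
        return MicroopInst(mnemonic, **flags)

    return build


option1_ldstl = make_microop("ldstl", memBar=True, nonSpec=True)
option2_ldstl = define_microop("ldstl")()
plain_ld = define_microop("ld")()
```

Under option 1 a use site that forgets the keywords silently gets an unflagged ldstl. Under option 2 that cannot happen, but a macroop also cannot override the flags, which is the trade-off noted in point 2.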

If this sounds Greek, I am ready to post patches for these two.

Gabe, what do you say?
One more thing, why did you choose to treat memory flags and instruction
flags differently in ldstop.isa?

--
Nilay
Gabe Black
2011-10-28 06:30:56 UTC
It's been a while so I don't remember what the code is doing offhand.
Give me a chance to look at it this weekend. If you're asking why the
instruction flags and the memory flags are independent, one goes into
the instruction object itself and one goes into the request object when
the access is generated. If you're asking why the mechanism that handles
them is different, I'd have to look at the specifics to say. I'd guess
either it's always been that way, or they were the same once and one (I
think the instruction flags) was changed.

If there's a locked flag and for x86 that also means the instruction has
to be a memory barrier, there probably doesn't need to be a second
option. The first should just imply that the locked flag should be
applied to the request and the memory barrier flag should be applied to
the instruction.
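
A minimal sketch of that idea, assuming a single boolean "locked" option on the microop: the helper `expand_locked` and the string flag names are hypothetical and are not the real ISA description code. One option fans out into a request flag and an instruction flag, so callers never specify the barrier separately.

```python
def expand_locked(locked):
    """Derive both flag sets from one hypothetical 'locked' option."""
    request_flags = set()      # would go into the memory request
    instruction_flags = set()  # would go into the instruction object
    if locked:
        request_flags.add("LOCKED")
        instruction_flags.add("IsMemBarrier")
    return request_flags, instruction_flags


locked_req, locked_inst = expand_locked(True)
plain_req, plain_inst = expand_locked(False)
```

Whether a non-speculative flag should be implied in the same way is left open here, since the message above only mentions the memory barrier.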

Gabe
Gabe Black
2011-10-30 06:36:55 UTC
I looked at it, and I think #2 is the way to go, i.e. change the microops
and not the way they're instantiated.

Gabe
Gabe Black
2011-10-28 06:32:04 UTC
Post by Nilay Vaish
Post by Steve Reinhardt
Gabe is a better reference (the only reference?) for the details of the x86
decoder.
That does not sound good.
It would be best if that knowledge were spread around more, but it's not
secret. If somebody wants to learn how it works I can answer questions.

Gabe