Discussion:
[gem5-dev] Review Request 2787: ruby: Fixed pipeline squashes caused by aliased requests
Anthony Gutierrez
2015-05-11 22:22:23 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/
-----------------------------------------------------------

Review request for Default.


Repository: gem5


Description
-------

Changeset 10844:0848038fe1d8
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests

This patch was created by Binh Pham during his internship at AMD.

This patch fixes a very significant performance bug when using the O3 CPU model
and Ruby. The issue was that Ruby returned false when it received a request to an
address that already had an outstanding request, or when the memory was blocked.
As a result, O3 unnecessarily squashed the pipeline and re-executed instructions.
This fix merges the readRequestTable and writeRequestTable in the Sequencer into
a single request table that keeps track of all requests and allows multiple
outstanding requests to the same address. This prevents O3 from squashing the
pipeline.
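
A minimal sketch of the merged-table idea (hypothetical names and types, not the
actual patch code): reads and writes share one table keyed by line address, and
only the first request for a line is issued to the cache controller.

    // Sketch only: one table holding all outstanding requests.
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct SequencerRequest;  // read or write, as declared in Sequencer.hh

    class RequestTable {
      public:
        // Returns true if req is the only outstanding request for the
        // line, i.e., it should be issued to the cache controller now;
        // otherwise it queues behind the earlier request(s).
        bool insert(uint64_t lineAddr, SequencerRequest *req) {
            auto &queue = m_table[lineAddr];
            queue.push_back(req);
            return queue.size() == 1;
        }

      private:
        std::unordered_map<uint64_t, std::deque<SequencerRequest *>> m_table;
    };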


Diffs
-----

src/mem/ruby/system/Sequencer.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/ruby/system/Sequencer.cc fbdaa08aaa426b9f4660c366f934ccb670d954ec

Diff: http://reviews.gem5.org/r/2787/diff/


Testing
-------


Thanks,

Anthony Gutierrez
Joel Hestness
2015-05-14 19:33:17 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6262
-----------------------------------------------------------


From experience with the O3 CPU, this is a VERY important change for simulated CPU performance. I appreciate the effort to finally fix this.

It would be nice for the Ruby and gem5-classic memory hierarchies to provide the same access interface, but I think the consistency implications of this patch need to be discussed.

I'm worried that this patch seems likely to upset consistency models for cores that may have relied on Ruby to block aliased memory accesses. Specifically, if a core was blocking multiple outstanding accesses to a single line as a way to enforce consistency for data in that line (e.g. TSO), but the accesses could now be concurrently issued to Ruby, it seems it would become the responsibility of the sequencer, and maybe even the coherence protocol, to ensure that those accesses remain ordered as required.

Given the behavior of the O3 CPU, perhaps the classic memory hierarchy allows multiple outstanding accesses to a single line. However, it handles transient coherence states with atomic coherence updates, which makes it much easier to guarantee access ordering to a single line, so I'm not sure it exposes the same interface as this patch provides.

Are you sure that all Ruby-working CPU cores and existing protocols still enforce correct consistency?

- Joel Hestness
Brad Beckmann
2015-05-18 23:49:45 UTC
I'm glad you appreciate the importance of this patch. I'm not deeply familiar with the classic model, but I'm pretty sure it does not allow a CPU to issue multiple outstanding accesses to a single cache line.

What is the complete set of Ruby-working CPU cores and protocols? What tests does one use to ensure correct consistency? We can certainly assert the correctness of x86. I believe that all existing Ruby protocols implement coherence and the evictionCallbacks the same way, so I'm not expecting any differences. What makes you think there may be an issue with a certain combination of cores and protocols?


- Brad


Joel Hestness
2015-05-21 19:53:19 UTC
If, in fact, the classic memory model blocks concurrent accesses to the same line, does the O3 CPU also suffer pipeline squashes there? According to the status matrix (http://gem5.org/Status_Matrix), this could be tested, since you don't need to use atomic memory accesses for this. If the classic memory model suffers from this issue, it doesn't sound like O3 pipeline flushing should be fixed in Ruby, but rather by fixing the O3 CPU. Why aren't we doing that instead?

On the other hand, if the classic hierarchy doesn't suffer from pipeline squashes, I'd prefer some exposition about why it differs and why we can't implement Ruby's blocking the same way (read: this patch's description and code comments leave a ton to be desired). A common cache interface between the two memory models is very desirable to allow some certainty that cores should work with both. We need to know if this patch makes classic and Ruby cache interfaces differ.

If the classic memory hierarchy does allow multiple outstanding requests to the same line, then there are possible consistency implications to be explored. Here is a simple example (specifically with MOESI_CMP_directory) that might suffer from broken TSO with this patch:
1) A core issues store 1 to cache line A.
2) The same core tries to issue store 2 to cache line A. Without this patch, Ruby would block, which can help the LSQ preserve TSO. With this patch, Ruby now accepts store 2, and both store 1 and store 2 end up in the L1 controller's mandatory queue.
3) Suppose the cache line is in a transient state, being grabbed from the cache, so that store 1 gets recycled in the mandatory queue (which is unordered).
Now store 2 is effectively ahead of store 1 in the cache controller, and their order of visibility to the rest of the system could be upset. This is a simple example, which witnesses the issue pretty easily due to recycling in the mandatory queue. However, other protocols must currently assume a single outstanding access per line at a time (since the RubyPort does not currently allow more), so they do not enforce ordering where it is required.
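
A sketch of the kind of store-ordering litmus test that would witness this,
assuming the two stores land in different words of one line (a real litmus
harness would also pin compiler ordering, so treat this as illustrative):

    // Under TSO, the reader must never observe store 2 (w1 == 1)
    // without also observing store 1 (w0 == 1).
    #include <atomic>
    #include <cassert>
    #include <thread>

    struct alignas(64) Line {        // both words on one cache line
        std::atomic<int> w0{0};
        std::atomic<int> w1{0};
    };
    Line A;

    void writer() {
        A.w0.store(1, std::memory_order_relaxed);  // store 1
        A.w1.store(1, std::memory_order_relaxed);  // store 2
    }

    void reader() {
        int r1 = A.w1.load(std::memory_order_relaxed);
        int r2 = A.w0.load(std::memory_order_relaxed);
        if (r1 == 1)
            assert(r2 == 1);  // fails only if the stores reordered
    }

    int main() {
        std::thread t0(writer), t1(reader);
        t0.join();
        t1.join();
        return 0;
    }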

For tests: consistency litmus tests. For working systems, we should start with those working in the current status matrix. Unfortunately, the matrix appears to be out of date, because I know that the x86 O3 CPU works with at least MOESI_hammer in both SE and FS mode. We currently depend on the O3 CPU + Ruby to enforce correct x86 consistency for our heterogeneous processor tests, so I'd prefer to be assured this change won't break that. It seems likely that other configurations should also be tested (e.g. Marco just posted a review suggesting he is testing consistency of MESI_Two_Level and O3: http://reviews.gem5.org/r/2840/).


- Joel


Joel Hestness
2015-05-21 20:02:13 UTC
UPDATE: Apologies, I see now that the Sequencer does not actually issue aliased requests, but instead buffers them (the subject of your thread with Nilay). Access ordering in the cache controllers is not a concern, then.

I've also inspected the O3 CPU's LSQ, and it appears to block itself from issuing multiple stores when enforcing TSO. The x86 O3 CPU should be fine with this change.


I'd still like us to discuss the cache interface differences between classic and Ruby.


- Joel


Nilay Vaish
2015-05-18 22:06:31 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6314
-----------------------------------------------------------


I have asked this question before, when Steve posted this patch several months ago,
and I am going to ask it again: Is it all right to buffer requests in the Sequencer?
Do we know of CPU designs that do so? What problems do we face when we push
requests for the same address through to the cache controllers?

- Nilay Vaish
Brad Beckmann
2015-05-18 23:55:50 UTC
Yes, it is OK to buffer requests in the Sequencer. We are pretty satisfied with the correlation results against our hardware using Ruby. The Sequencer is a simplification of much more complicated buffering in real hardware. That simplification is a very good thing.

I believe it is pretty universal across most designs that they don't allow a single CPU to issue multiple requests for the same cache line that propagate out into the memory system. That has huge power and complexity implications.


- Brad


Nilay Vaish
2015-05-19 16:23:05 UTC
Post by Brad Beckmann
Post by Nilay Vaish
I have asked this question before when Steve posted this patch several months ago.
I am going to ask it again: Is it all right to buffer requests in the Sequencer?
Do we know of CPU designs that do so? What problems do we face when we push through
requests for same address to the cache controllers?
Yes, it is ok to buffer requests in the Sequencer. We are pretty
satisfied with the correlation results against our hardware using Ruby.
The Sequencer is a simplification of much more complicated buffering in
real hardware. That simplification is a very good thing.
I need some reference on this. I talked to Prof. Wood about it and he
said that he is not aware of any CPUs that do this.
Post by Brad Beckmann
I believe it is pretty universal across most designs that they don't
allow a single CPU to issue multiple requests for the same cache line
that go out throughout the memory system. That has huge power and
complexity implications.
Note that the L0/L1 controller would serve those requests from the same
cache block, so requests would not be sent out beyond the L0/L1
controller. And as far as I understand the protocols currently in gem5,
if we completely remove the aliasing support from the sequencer, the L0/L1
controllers would either merge or block aliased requests.

--
Nilay
Joel Hestness
2015-05-21 19:52:06 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6324
-----------------------------------------------------------



src/mem/ruby/system/Sequencer.cc (line 588)
<http://reviews.gem5.org/r/2787/#comment5461>

Whitespace on this line looks too wide.


- Joel Hestness
Tony Gutierrez
2015-05-26 19:42:13 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/
-----------------------------------------------------------

(Updated May 26, 2015, 12:42 p.m.)


Review request for Default.


Repository: gem5


Description (updated)
-------

Changeset 10817:d010e6a8e783
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests

This patch was created by Binh Pham during his internship at AMD.

This patch fixes a very significant performance bug when using the O3 CPU model
and Ruby. The issue was that Ruby returned false when it received a request to an
address that already had an outstanding request, or when the memory was blocked.
As a result, O3 unnecessarily squashed the pipeline and re-executed instructions.
This fix merges the readRequestTable and writeRequestTable in the Sequencer into
a single request table that keeps track of all requests and allows multiple
outstanding requests to the same address. This prevents O3 from squashing the
pipeline.


Diffs (updated)
-----

src/mem/ruby/system/Sequencer.hh df2aa91dba5b0f0baa351039f0802baad9ed8f1d
src/mem/ruby/system/Sequencer.cc df2aa91dba5b0f0baa351039f0802baad9ed8f1d

Diff: http://reviews.gem5.org/r/2787/diff/


Testing
-------


Thanks,

Tony Gutierrez
Brad Beckmann
2015-05-29 00:27:13 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6416
-----------------------------------------------------------


On 5/19 Nilay wrote: "I need some reference on this. I talked to Prof. Wood about it and he said that he is not aware of any CPUs that do this."

Is the challenge that CPUs do not use complicated buffering to properly implement memory models? If so, I don't see how someone could disagree with that statement. For instance, see the Butler et al. IEEE Micro paper from 2011: there are three different load/store units that can generate requests to the L1 I and D caches. Do you not think there is complicated buffering to support this?

On 5/19 Nilay also wrote: "Note that the L0/L1 controller would serve those requests from the same cache block, so requests would not be sent out beyond the L0/L1 controller. And as far as I understand the protocols currently in gem5, if we completely remove the aliasing support from the sequencer, the L0/L1 controllers would either merge or block aliased requests."

I would be very careful with trying such a solution. Most SLICC protocols rely on stalling or recycling to deal with conflicting requests. A lot of stalls and recycles can significantly impact performance.

In general, we need to get past the expectation that contributors are going to completely re-implement their entire approach when posting a patch. What you are suggesting goes well beyond this patch.

- Brad Beckmann
Joel Hestness
2015-05-29 15:01:03 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6383
-----------------------------------------------------------


I was buffering these comments, because this change has large implications and we still haven't discussed the differences between Ruby and classic memory interfaces. The patch submitter should be able to address questions like the ones I've raised so far (I don't feel any clarification has been provided yet).

Once again, I did more digging: It appears that classic caches completely model MSHR queuing and write buffering, and they block requests from the core when these are full. This indicates that they accept multiple concurrent accesses to a single line (it would be nice to get confirmation on that from a classic user). Thus, it makes sense to allow the Ruby sequencer to accept multiple outstanding accesses to a single line concurrently to be consistent with classic.

Overall, I have significant complaints about this patch, because it introduces buffering into the sequencer, which has the effect of implementing MSHR queuing that should probably be in the caches to be consistent with classic. This breaks the sequencer's thin shim abstraction between cores and cache controllers, appears to break no-RFO L1 caches, and side-steps cache resource constraints. Detailed comments are below.


src/mem/ruby/system/Sequencer.hh (line 187)
<http://reviews.gem5.org/r/2787/#comment5490>

Minor: This declaration introduces yet another member variable naming convention (m_UpperCamel). Can you please change the name of this table so it at least matches some other variables? m_lowerCamel looks most common, and gem5's style guideline suggests it should be lowerCamel (i.e. without 'm_').

(I understand a lot of Ruby code is bad about this, but we can keep from making it worse)



src/mem/ruby/system/Sequencer.cc (line 89)
<http://reviews.gem5.org/r/2787/#comment5491>

Minor: Please follow the local_variable naming convention that is already used in this function. These should be aliased_it and aliased_it_end



src/mem/ruby/system/Sequencer.cc (line 154)
<http://reviews.gem5.org/r/2787/#comment5492>

Minor: Ibid



src/mem/ruby/system/Sequencer.cc (line 221)
<http://reviews.gem5.org/r/2787/#comment5519>

Can you please clarify, since it hasn't been stated explicitly yet: Current L0/L1 cache controllers cannot handle multiple outstanding accesses to a single line? If so, can you elaborate on why? What breaks?



src/mem/ruby/system/Sequencer.cc (line 315)
<http://reviews.gem5.org/r/2787/#comment5520>

This is a symptom of never sending the aliased request to the controller (at which point its initialRequestTime would have been set appropriately). I feel that all requests to the Sequencer need to be sent through to the controller, which would fix this.



src/mem/ruby/system/Sequencer.cc (line 428)
<http://reviews.gem5.org/r/2787/#comment5522>

The Sequencer should not be handling stores in this manner. This effectively coalesces and completes concurrent stores to a single line (in zero time) without their data ever being sent to the cache. In other words, it effectively hoists MSHRs up from the top-level cache controller (or pulls them down from the LSQ?). By doing this, repeated stores are never subject to resource constraints like cache porting or available MSHRs in the controller, and cache/MSHR access stats are incorrect.

You need to send requests through to the cache controller.



src/mem/ruby/system/Sequencer.cc (line 436)
<http://reviews.gem5.org/r/2787/#comment5523>

This breaks the no-RFO property of some L1s (and is particularly relevant for GPUs). For instance, this would fail with gem5-gpu's VI_hammer protocol, because the cache line is not returned on writes. You can't assume that the current data will be available when you've queued loads behind a store to a line. This is a major signal that MSHR queuing should not happen in the sequencer (which must be protocol-independent).

You need to send requests through to the cache controller.



src/mem/ruby/system/Sequencer.cc (line 495)
<http://reviews.gem5.org/r/2787/#comment5521>

There is a similar issue for decoalescing loads here in the Sequencer (also in zero time!). Specifically, by queuing accesses to a line in the Sequencer, the cache is never made aware of the data in the line that is needed for each of the queued accesses. This means that it does not know what portion of the line should be returned to the core, and this again dodges potential resource constraints, like the amount of data that can be pulled from the line per cycle to respond to each outstanding access. Can the data from a cache line really be fanned out to multiple separate requests in the same cycle when queuing MSHRs? (Note: with strided cache access from the O3 CPU, this could be 16, 32, or more concurrent accesses in a single cycle.)

You really need to send requests through to the cache controller.
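
To make the concern concrete, roughly what the zero-time fan-out looks like
(hypothetical code mirroring the pattern being criticized, not the actual patch):

    // When the line's data arrives, every load buffered behind it is
    // completed from the single data block in the same cycle; no port,
    // bandwidth, or MSHR constraint is ever modeled.
    void readCallback(uint64_t lineAddr, const DataBlock &data)
    {
        for (SequencerRequest *req : m_requestTable[lineAddr])
            hitCallback(req, data);   // zero-time decoalesce
        m_requestTable.erase(lineAddr);
    }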


- Joel Hestness
Steve Reinhardt
2015-06-02 02:04:06 UTC
To follow up on Joel's high-level question: the classic cache model does indeed combine requests to the same cache block using a finite set of MSHRs, and blocks when the MSHRs are full. Note that this is useful at all levels of the hierarchy; e.g., in a shared L2 you can end up combining requests from different L1s that happen to target the same cache block. The general rule in the classic hierarchy is that a single cache will have at most one outstanding request for a given cache block at any point in time. When a response arrives, the cache will satisfy as many of the buffered requests for the block as it can. Note that the timing is sketchy here on the classic side too; the latency of handling these (potentially multiple) requests is not modeled. It's not clear how large an effect that is.

Note that requests to a given cache block are processed in order, both in the classic cache and (I believe) in this patch as well. (Binh wrote this code originally, but I spent a fair amount of time cleaning it up before it was posted the first time, though that was long enough ago I don't remember all the details off the top of my head.) If stronger ordering guarantees are required by the consistency model, it's the responsibility of the CPU to hold off on issuing requests to the cache appropriately (as Joel saw with the O3 CPU).
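
A rough sketch of that rule (illustrative names, not gem5's actual MSHR classes):

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct Packet;                        // a core- or L1-side request

    struct MSHR {
        std::deque<Packet *> targets;     // requests combined on this line
    };

    class MSHRQueue {
      public:
        // Returns false when the MSHRs are full and the cache must
        // block the requester.
        bool allocateOrCombine(uint64_t blockAddr, Packet *pkt) {
            auto it = m_mshrs.find(blockAddr);
            if (it != m_mshrs.end()) {
                it->second.targets.push_back(pkt);  // combine, in order
                return true;
            }
            if (m_mshrs.size() >= m_numEntries)
                return false;                       // block: MSHRs full
            m_mshrs[blockAddr].targets.push_back(pkt);  // new miss goes down
            return true;
        }

      private:
        std::unordered_map<uint64_t, MSHR> m_mshrs;
        const std::size_t m_numEntries = 16;        // finite set of MSHRs
    };

When the response for a block returns, the cache walks that block's target list
in order, satisfying as many buffered requests as it can, which is the in-order
servicing described above.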


- Steve


Joel Hestness
2015-06-02 20:26:01 UTC
Thanks for the confirmation on that, Steve.

I've thought more about this patch, and while I take issue with adding buffering in the Sequencer, I think this patch is headed in an appropriate direction if we figure out how to make it more flexible for protocol authors. Specifically, I feel that we should move the aliased-access buffering out to a separate Ruby component, and slim the sequencer down to pass all requests directly through to the top-level cache controller. This would allow a protocol author the following options:
A) Use the new buffering component to block accesses like the current mainline sequencer
B) Use the new buffering component to buffer them like this patch (but fix incorrect assumptions about line ownership)
C) Remove the new component and send all accesses straight through to the cache controller, where the protocol author can do MSHR handling/queuing/coalescing as desired

A major difference between the classic memory hierarchy and Ruby is that Ruby supports writing new coherence protocols rather than having a static protocol. To ease writing new coherence protocols, Ruby (and SLICC) makes significant and useful abstractions that decrease the effort a protocol author needs to get an apparently-working coherence protocol. Examples of this include disallowing multiple in-flight requests to one cache line in the L1 caches (which eliminates the need to reason about MSHR queuing in L1s immediately), and providing an (optional) functionally coherent backing store (which eliminates the need to make protocols 'data-correct' immediately). We need the mainline codebase to uphold and improve these sorts of simplifying Ruby abstractions.

By adding a new component that parameterizes how to buffer aliased accesses, we would - again - be aiding the protocol author by introducing a path toward a working, high-performance protocol. By blocking aliased requests, the author can choose to set up such buffering in the requesting core and keep the L1 cache controller simple as a first cut when writing a protocol. By using buffering like this patch, the author might be able to get the performance of relaxed consistency without a complicated L1 cache (i.e. by allowing fake, unconstrained buffering). However, an author should be allowed to 'take the training wheels off' and handle multiple outstanding requests per line in the L1 controller. This aggressive protocol development is currently not even allowed by Ruby.

As an aside: I have realized that my gem5-gpu patches are, in effect, allowing multiple outstanding requests per line to the VI_hammer GPU L1s, but doing so in hacky ways that circumvent the sequencer callbacks. If I had the ability to send requests to the same line through the sequencer, I would fix the VI_hammer GPU L1 to do real MSHR queuing, since we know that is how NVIDIA parts work. Mainline gem5 needs to offer this to protocol authors.

Finally, I think it makes most sense that this fake buffering be in a separate component: Aside from being 'training wheels' for a protocol author, providing buffering in a separate component can provide flexible comparisons between hierarchies. Currently, there is no good way to compare performance of a system using Ruby against the classic memory hierarchy, because they lack a common interface. For example, it would be interesting to know if classic (MESI) performs similarly to MESI_Two_Level, but MESI_Two_Level assumes a single outstanding L1 access per line. However, MESI_Two_Level would probably perform terribly for single-element-strided heap access from the inorder or O3 CPUs. By having a separate buffering component that can be used with any core and any cache hierarchy, we can enforce the same interfaces on both hierarchies to reasonably compare them.
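
The interface for such a component could be as small as a single policy hook.
A sketch under that assumption (names are entirely hypothetical):

    // Pluggable handling for a request that aliases a line with an
    // outstanding request, covering options A-C above.
    struct AliasBufferPolicy {
        enum class Action {
            Block,   // A: reject back to the core (mainline behavior)
            Buffer,  // B: queue behind the outstanding request (this patch)
            Issue    // C: pass straight through to the L1 controller
        };
        virtual ~AliasBufferPolicy() = default;
        virtual Action onAliasedRequest() = 0;
    };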

Would this new buffering component (or at least sequencer buffer parameterization) be a reasonable way to proceed with this patch? If so, I'll offer to help make it happen.


- Joel


Andreas Hansson
2015-09-04 16:26:24 UTC
Post by Anthony Gutierrez
src/mem/ruby/system/Sequencer.cc, line 532
<http://reviews.gem5.org/r/2787/diff/1/?file=44990#file44990line532>
The Sequencer should not be handling stores in this manner. This effectively coalesces and completes concurrent stores to a single line (in zero time) without their data ever being sent to the cache. In other words this effectively hoists MSHRs either up from the top-level cache controller (or pulls them down from the LSQ?). By doing this, repeated stores are never subject to resource constraints like cache porting or available MSHRs in the controller, and cache/MSHR access stats are incorrect.
You need to send requests through to the cache controller.
Following up on this, I am of the opinion that we should probably do: 1) read/write combining in the LSQ before sending out a packet, and 2) combining of MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?


- Andreas


Tony Gutierrez
2015-07-22 18:15:15 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/
-----------------------------------------------------------

(Updated July 22, 2015, 11:15 a.m.)


Review request for Default.


Repository: gem5


Description (updated)
-------

Changeset 10887:1e05089bc991
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests

This patch was created by Binh Pham during his internship at AMD.

This patch fixes a very significant performance bug when using the O3 CPU model
and Ruby. The issue was that Ruby returned false when it received a request to an
address that already had an outstanding request, or when the memory was blocked.
As a result, O3 unnecessarily squashed the pipeline and re-executed instructions.
This fix merges the readRequestTable and writeRequestTable in the Sequencer into
a single request table that keeps track of all requests and allows multiple
outstanding requests to the same address. This prevents O3 from squashing the
pipeline.


Diffs (updated)
-----

src/mem/ruby/system/Sequencer.hh ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
src/mem/ruby/system/Sequencer.cc ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
src/mem/ruby/system/Sequencer.py ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18

Diff: http://reviews.gem5.org/r/2787/diff/


Testing
-------


Thanks,

Tony Gutierrez
Joel Hestness
2015-07-31 23:13:44 UTC
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6875
-----------------------------------------------------------


Thanks for updating this and apologies for the delayed review.

Besides the comments below, this looks a lot more agreeable.


src/mem/ruby/system/Sequencer.cc (line 205)
<http://reviews.gem5.org/r/2787/#comment5942>

Can you adjust this comment to be consistent with the check (i.e. 'check for any outstanding requests to the line')?



src/mem/ruby/system/Sequencer.cc (line 450)
<http://reviews.gem5.org/r/2787/#comment5943>

I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is that the read request be issued to the caches instead (mirroring the readCallback). If the caches are RFO, then issuing reads here should complete in the next cycle, so this shouldn't affect performance much.


- Joel Hestness
Brad Beckmann
2015-08-01 00:03:34 UTC
Permalink
Post by Anthony Gutierrez
Post by Joel Hestness
Thanks for updating this and apologies for the delayed review.
Besides the comments below, this looks a lot more agreeable.
We are just hours away from checking our code in, so we won't be able to address your comments. We did put a lot of effort into adding the coalesce_reqs flag as you requested in your last review.
Post by Anthony Gutierrez
Post by Joel Hestness
src/mem/ruby/system/Sequencer.cc, line 258
<http://reviews.gem5.org/r/2787/diff/3/?file=48362#file48362line258>
Can you adjust this comment to be consistent with the check (i.e. 'check for any outstanding requests to the line')?
We'll have to get it next time we check in patches. Or you're welcome to update it yourself.
Post by Anthony Gutierrez
Post by Joel Hestness
src/mem/ruby/system/Sequencer.cc, line 553
<http://reviews.gem5.org/r/2787/diff/3/?file=48362#file48362line553>
I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is that the read request be issued to the caches instead (mirroring the readCallback). If the caches are RFO, then issuing reads here should complete in the next cycle, so this shouldn't affect performance much.
There is a lot of code in the Sequencer that assumes RfO (for example, the blocking calls that support RMW).


- Brad


Joel Hestness
2015-08-01 13:51:20 UTC
Permalink
Post by Brad Beckmann
Post by Joel Hestness
src/mem/ruby/system/Sequencer.cc, line 553
<http://reviews.gem5.org/r/2787/diff/3/?file=48362#file48362line553>
I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is that the read request be issued to the caches instead (mirroring the readCallback). If the caches are RFO, then issuing reads here should complete in the next cycle, so this shouldn't affect performance much.
There is a lot of code in the Sequencer that assumes RfO (for example, the blocking calls that support RMW).
This is not cool. First, I added this request in my original review here, so it should have been addressed previously.

Second, your reasoning here for not addressing my request doesn't hold water: RMW operations are atomics that require ownership prior to performing the operation. Otherwise, the results would be nonsense. Even if there is other code in the Sequencer that assumes RFO, we don't need to perpetuate these assumptions.
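
Sketching the invariant Joel is pointing at: an atomic RMW must hold write
permission on the line for the whole read-modify-write window, so read
permission alone is never sufficient (illustrative only, not gem5 code):

    #include <cassert>

    enum CoherencePermission { Invalid, ReadOnly, ReadWrite };

    // An RMW is only atomic if no other core can write the line between
    // the load and the store, which requires ownership (ReadWrite) for
    // the entire operation.
    int
    atomicFetchAdd(CoherencePermission perm, int &line_word, int delta)
    {
        assert(perm == ReadWrite);  // Shared/ReadOnly would allow an
                                    // intervening write -> nonsense result
        int old = line_word;
        line_word = old + delta;
        return old;
    }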


- Joel


Joel Hestness
2015-08-30 18:01:32 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review7077
-----------------------------------------------------------


Hi guys,

I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.

Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU has been modified to fix the memory access squashing caused by Ruby sequencer blocking (not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.

Further, I found this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: By coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in number of cache accesses might be.

After coming to an ok conclusion on that, I then found that the sequencer max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if the max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines. Setting the max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that the coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
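
The interaction described above boils down to an admission check of roughly
this shape at the sequencer's entry point (a sketch of the general structure;
the real gem5 names may differ slightly):

    enum RequestStatus { RequestStatus_Ready, RequestStatus_BufferFull };

    int m_outstanding_count = 0;
    int m_max_outstanding_requests = 16;  // if this is below the LSQ depth,
                                          // the LSQ stalls here before it can
                                          // expose accesses to independent lines

    RequestStatus
    tryMakeRequest()
    {
        // With coalescing, aliased requests no longer bounce, but every
        // request (primary or secondary) still occupies a table slot, so a
        // small cap throttles MLP instead of the per-line blocking doing so.
        if (m_outstanding_count >= m_max_outstanding_requests)
            return RequestStatus_BufferFull;  // CPU must retry later
        ++m_outstanding_count;
        return RequestStatus_Ready;
    }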

Overall, I feel that this patch needs to be dropped: It does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.

- Joel Hestness
Brad Beckmann
2015-09-02 22:59:41 UTC
Permalink
Post by Joel Hestness
Overall, I feel that this patch needs to be dropped: It does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.
We still hope to check it in. The last time we ran our tests, this patch proved to be very important for O3 performance. Even if the squashing behavior is fixed, I think we saw benefits from avoiding unnecessary mandatory queue latency for aliased requests. We will rerun our tests to confirm.

I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.

The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?


- Brad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review7077
-----------------------------------------------------------
Post by Anthony Gutierrez
-----------------------------------------------------------
http://reviews.gem5.org/r/2787/
-----------------------------------------------------------
(Updated July 22, 2015, 6:15 p.m.)
Review request for Default.
Repository: gem5
Description
-------
Changeset 10887:1e05089bc991
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests
This patch was created by Bihn Pham during his internship at AMD.
This patch fixes a very significant performance bug when using the O3 CPU model
and Ruby. The issue was Ruby returned false when it received a request to the
same address that already has an outstanding request or when the memory is
blocked. As a result, O3 unnecessary squashed the pipeline and re-executed
instructions. This fix merges readRequestTable and writeRequestTable in
Sequencer into a single request table that keeps track of all requests and
allows multiple outstanding requests to the same address. This prevents O3
from squashing the pipeline.
Diffs
-----
src/mem/ruby/system/Sequencer.hh ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
src/mem/ruby/system/Sequencer.cc ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
src/mem/ruby/system/Sequencer.py ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
Diff: http://reviews.gem5.org/r/2787/diff/
Testing
-------
Thanks,
Tony Gutierrez
Nilay Vaish
2015-09-03 01:58:17 UTC
Permalink
Post by Brad Beckmann
The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?
I am actually in favor of the idea of coalescing requests. This is something that classic memory does as well,
and I have read some papers on MSHRs maintaining primary and secondary requests. I would like us to ultimately
merge this patch. I think we should split this patch into two parts:

a. the first patch will combine the read and the write request tables maintained by the sequencer. I think Joel
has no objections to that part, so it should be possible to merge that patch at any time.

b. the second part will provide each entry in the combined request table with a list of packets instead of a
single packet. This part of the patch is what we need to discuss.

-- I think we should drop the max outstanding request variable altogether. As both of you agree that
the LSQ size should decide how many requests should be pending, there is no need for the sequencer to
maintain its own limit.

-- We also need to decide on how the statistical variables are updated. With this patch, the requests buffered
in the sequencer will not update the stats maintained by the cache controller. This becomes a problem
when one tries to compare the loads and stores issued by the cpu and the loads and stores seen by the cache
controller.


- Nilay


Brad Beckmann
2015-09-04 20:51:30 UTC
Permalink
Post by Nilay Vaish
-- We also need to decide on how the statistical variables are updated. With this patch, the requests buffered
in the sequencer will not update the stats maintained by the cache controller. This becomes a problem
when one tries to compare the loads and stores issued by the cpu and the loads and stores seen by the cache
controller.
I appreciate the concerns with statistics. At one point, this patch actually included statistics for merged requests so that one could do the math to verify that no requests were missing. Those stats were removed when we dropped ruby.stats and they were never added back. Would you be ok with adding counts in the Sequencer/RubyPort rather than trying to update the controller stats?

I will try to add those stats, but can we keep this patch as one rather than splitting it up? It has been a real pain to maintain and there are a lot of interdependent parts to it. Splitting it up would essentially require re-implementing it. We'll be lucky if we can find someone who has the time to add the stats as you requested.
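
If the counts live in the Sequencer/RubyPort as suggested, a minimal sketch
against gem5's stats framework could look like this (hypothetical stat names;
the point is only that CPU-issued accesses can be reconciled with controller
accesses plus the merged count):

    // Hypothetical Sequencer members (within gem5's Sequencer.hh/.cc;
    // requires gem5's stats headers), so that after a run one can check:
    //   CPU loads+stores == controller demand accesses + coalesced_*
    Stats::Scalar coalescedReads;
    Stats::Scalar coalescedWrites;

    void
    Sequencer::regStats()
    {
        RubyPort::regStats();
        coalescedReads
            .name(name() + ".coalesced_reads")
            .desc("loads merged behind an outstanding request to the line");
        coalescedWrites
            .name(name() + ".coalesced_writes")
            .desc("stores merged behind an outstanding request to the line");
    }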


- Brad


Joel Hestness
2015-09-04 22:43:10 UTC
Permalink
From Brad: I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.
Most problems with this patch as-is are merging issues with the current head, where a handful of sequencer things have changed (e.g. LLSC and L1/sequencer latency). Other problems include list iteration, dead variables that should be stripped out, and the coalesce disable feature. I could open these as issues for revision, but there are already a ton of open issues on this patch.
From Brad: The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?
The configuration is non-trivial in that there is nothing connecting these parameters (LSQ depth with sequencer max_outstanding_requests), and performance debugging this issue requires substantial understanding of both the O3 CPU and Ruby. It took me roughly 2 hours of digging to understand it, and I know both of these pieces of code.
From Nilay: I think we should drop the max outstanding request variable all together. As both of you agree the LSQ size should decide how many requests should be pending, there is no need for the sequencer to maintain its own limit.
I disagree with removing max_outstanding_requests: Currently there are a number of places in Ruby without flow control, so removing max_outstanding_requests could bloat various unbounded buffers and cause plenty of other problems.


While Nilay accurately captures my feelings about the specific changes, my overall opinion is that this patch introduces too many hard questions and too much uncertainty about performance correctness due to the CPU<->sequencer interface change. I apologize for being blunt, but given that AMD may be too busy to add stats (i.e. just a few lines of code), I don't trust that this patch will be adequately performance validated before commit. Further, I don't feel it should be the community's responsibility to fix any issues that it may introduce.

If we're going to decide to push these changes, can we at least figure out a way for the patch to be shepherded to commit (e.g. someone volunteer to pick up the ball and carry it through adequate revision and testing)?


- Joel


Nilay Vaish
2015-09-05 00:28:07 UTC
Permalink
Post by Joel Hestness
If we're going to decide to push these changes, can we at least figure out a way for the patch to be shepherded to commit (e.g. someone volunteer to pick up the ball and carry it through adequate revision and testing)?
I have posted a set of five patches that break up this patch. The last patch is different
from what AMD is doing in this patch. Instead of fulfilling all the secondary requests in one
go, the first secondary request (and hence now the primary request) is issued to the
cache controller after the original primary request has completed.
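
In code terms, the difference is roughly the following (same illustrative
names as the earlier sketches): on completion, promote the head secondary
and re-issue it, rather than draining the whole per-line queue in one
callback:

    void issueRequest(SequencerRequest *req);  // assumed: sends to L1 controller

    void
    readCallback(LineAddr addr, DataBlock &data)
    {
        std::list<SequencerRequest *> &queue = m_RequestTable[addr];
        hitCallback(queue.front(), data);      // complete the primary only
        queue.pop_front();
        if (!queue.empty())
            issueRequest(queue.front());       // first secondary becomes primary;
                                               // on an L1 hit it completes shortly
                                               // after, via the mandatory queue
        else
            m_RequestTable.erase(addr);
    }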

On max_outstanding_requests: I think we should not do flow control in the Sequencer. Instead,
we should abort() when the number of requests in the sequencer exceeds max_outstanding_requests.
This would force protocols / networks / testers / cores to do flow control.
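
As a sketch, that check would turn the current back-pressure into a hard
failure (plain abort() shown; gem5's panic machinery would work equally well):

    #include <cstdio>
    #include <cstdlib>

    void
    checkOutstanding(int outstanding, int max_outstanding)
    {
        // No graceful BufferFull path: overflow is treated as a modeling
        // bug, forcing cores/testers/protocols to implement flow control.
        if (outstanding > max_outstanding) {
            std::fprintf(stderr, "Sequencer overflow: %d requests > limit %d\n",
                         outstanding, max_outstanding);
            std::abort();
        }
    }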


- Nilay


Brad Beckmann
2015-09-11 19:45:07 UTC
Permalink
Post by Nilay Vaish
I have posted a set of five patches that break up this patch. The last patch is different
from what AMD is doing in this patch. Instead of fulfilling all the secondary requests in one
go, the first secondary request (and hence now the primary request) is issued to the
cache controller after the original primary request has completed.
On max_outstanding_requests: I think we should not do flow control in the Sequencer. Instead,
we should abort() when the number of requests in the sequencer exceeds max_outstanding_requests.
This would force protocols / networks / testers / cores to do flow control.
Joel, I thought we took care of the prior open issues with this patch. I understand your concern with merging a patch that touches a lot of performance-critical code. We've had too many of those types of merges as well. I feel like I've personally put a lot of effort into trying to address your concerns. So far, once I address your concerns, you seem to just have more concerns. Please list out your current "hard questions" and "open issues". I will try again to address them as much as I can. I just ask you not to add more items to the list, and please keep in mind that we have very specific, cycle-by-cycle tests that compare the Ruby+O3 interface to real hardware that strongly justify this patch.

Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?

I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.
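
The timing claim can be sketched as follows (illustrative names;
scheduleHitCallback is an assumed helper standing in for the Sequencer's
event scheduling): all merged requests complete on consecutive cycles from
one callback, instead of each paying mandatory-queue plus L1 lookup latency:

    void scheduleHitCallback(SequencerRequest *req, DataBlock &data,
                             int cycles_from_now);  // assumed helper

    void
    serviceCoalesced(LineAddr addr, DataBlock &data)
    {
        int offset = 0;
        for (SequencerRequest *req : m_RequestTable[addr])
            scheduleHitCallback(req, data, offset++);  // back-to-back cycles
        m_RequestTable.erase(addr);
    }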

We absolutely should not force protocols to implement flow control. In particular, the tester is designed to stress the system in ways that are not practical with realistic flow control. Flow control is and should be an option for those who want more accuracy.


- Brad


Nilay Vaish
2015-09-11 21:13:46 UTC
Permalink
Post by Brad Beckmann
Post by Joel Hestness
Hi guys,
I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.
Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU has been modified to fix the memory access squashing caused by Ruby sequencer blocking (not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.
Further, I found this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: By coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in number of cache accesses might be. After coming to an ok conclusion on that, I then found that the sequencer max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if the max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines. Setting the max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that the coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
Overall, I feel that this patch needs to be dropped: It does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.
We still hope to check it in. The last time we ran our tests, this patch proved to be very important for O3 performance. Even if the squashing behavior is fixed, I think we saw benefits from avoiding unnecessary mandatory queue latency for aliased requests. We will rerun our tests to confirm.
I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.
The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?
I am actually in favor of the idea of coalescing requests. This is something that the classic memory system does as well, and I have read papers on MSHRs maintaining primary and secondary requests. Ultimately, I would like us to proceed in two parts (see the sketch below):
a. The first patch combines the read and write request tables maintained by the sequencer. I think Joel has no objections to that part, so it should be possible to merge that patch at any time.
b. The second patch provides each entry in the combined request table with a list of packets instead of a single packet. This part of the patch is what we need to discuss.
-- I think we should drop the max outstanding requests variable altogether. Since you both agree that the LSQ size should decide how many requests can be pending, there is no need for the sequencer to maintain its own limit.
-- We also need to decide how the statistics are updated. With this patch, the requests buffered in the sequencer will not update the stats maintained by the cache controller. This becomes a problem when one tries to compare the loads and stores issued by the CPU against the loads and stores seen by the cache controller.
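For concreteness, here is a minimal sketch of the combined request table in part b (illustrative C++ only; the type and function names are mine, not the actual Sequencer code):

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    // Illustrative stand-ins for gem5 types.
    using Addr = uint64_t;
    struct Packet;
    using PacketPtr = Packet *;

    // One entry per cache line: the front packet is the "primary"
    // request issued to the L1 controller; the rest are "secondary"
    // requests coalesced behind it.
    class CombinedRequestTable {
      public:
        // Returns true if this packet became the primary request,
        // i.e., it must be issued to the cache controller.
        bool insert(Addr lineAddr, PacketPtr pkt) {
            auto &pkts = table[lineAddr];
            pkts.push_back(pkt);
            return pkts.size() == 1;
        }

        // On completion of the primary, hand back all coalesced
        // packets so the sequencer can respond to each of them.
        std::deque<PacketPtr> complete(Addr lineAddr) {
            std::deque<PacketPtr> done;
            auto it = table.find(lineAddr);
            if (it != table.end()) {
                done.swap(it->second);
                table.erase(it);
            }
            return done;
        }

      private:
        std::unordered_map<Addr, std::deque<PacketPtr>> table;
    };

Whether complete() should satisfy all queued packets at once is exactly the point of contention below.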
I appreciate the concerns with statistics. At one point, this patch actually included statistics for merged requests so that one could do the math to verify that no requests were missing. Those stats were removed when we dropped ruby.stats and they were never added back. Would you be OK with adding counts in the Sequencer/RubyPort rather than trying to update the controller stats?
I will try to add those stats, but can we keep this patch as one rather than splitting it up? It has been a real pain to maintain and there are a lot of interdependent parts to it. Splitting it up would essentially require re-implementing it. We'll be lucky if we can find someone who has the time to add the stats as you requested.
Post by Joel Hestness
Post by Joel Hestness
From Brad: I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.
Most problems with this patch as-is are merging issues with the current head, where a handful of sequencer things have changed (e.g. LLSC and L1/sequencer latency). Other problems include list iteration, dead variables that should be stripped out, and the coalesce disable feature. I could open these as issues for revision, but there are already a ton of open issues on this patch.
Post by Joel Hestness
Post by Joel Hestness
From Brad: The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?
The configuration is non-trivial in that there is nothing connecting these parameters (LSQ depth and sequencer max_outstanding_requests), and performance debugging this issue requires substantial understanding of both the O3 CPU and Ruby. It took me roughly 2 hours of digging to understand it, and I know both of these pieces of code.
Post by Joel Hestness
From Nilay: I think we should drop the max outstanding request variable all together. As both of you agree the LSQ size should decide how many requests should be pending, there is no need for the sequencer to maintain its own limit.
I disagree with removing max_outstanding_requests: Currently there are a number of places in Ruby without flow control, so removing max_outstanding_requests could bloat various unbounded buffers and cause plenty of other problems.
While Nilay accurately captures my feelings about the specific changes, my overall opinion is that this patch introduces too many hard questions and too much uncertainty about performance correctness due to the CPU<->sequencer interface change. I apologize for being blunt, but given that AMD may be too busy to add stats (i.e. just a few lines of code), I don't trust that this patch will be adequately performance validated before commit. Further, I don't feel it should be the community's responsibility to fix any issues that it may introduce.
If we're going to decide to push these changes, can we at least figure out a way for the patch to be shepherded to commit (e.g. someone volunteer to pick up the ball and carry it through adequate revision and testing)?
I have posted a set of five patches that break up this patch. The last patch is different from what AMD is doing in this patch. Instead of fulfilling all the secondary requests in one go, the first secondary request (and hence now the primary request) is issued to the cache controller after the original primary request has completed.
On max_outstanding_requests: I think we should not do flow control in the Sequencer. Instead, we should abort() when the number of requests in the sequencer exceeds max_outstanding_requests. This would force protocols / networks / testers / cores to do flow control. The sketch below contrasts the two policies.
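Roughly, the choice is between returning a retry status and treating overflow as a modeling error (hypothetical names only, not the actual makeRequest() code or gem5's RequestStatus enum):

    #include <cstddef>
    #include <cstdlib>

    // Illustrative admission check at the top of the sequencer's
    // request path.
    enum class Admission { Ready, BufferFull };

    Admission
    checkAdmission(std::size_t outstanding, std::size_t maxOutstanding,
                   bool sequencerDoesFlowControl)
    {
        if (outstanding < maxOutstanding)
            return Admission::Ready;
        if (sequencerDoesFlowControl)
            return Admission::BufferFull;  // caller must retry later
        // Alternative: overflow means the core/tester failed to do
        // its own flow control, so stop the simulation.
        std::abort();
    }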
Joel, I thought we took care of the prior open issues with this patch. I understand your concern with merging a patch that touches a lot of performance-critical code. We've had too many of those types of merges as well. I feel like I've personally put a lot of effort into trying to address your concerns. So far, once I address your concerns, you seem to just have more concerns. Please list out your current "hard questions" and "open issues". I will try again to address them as much as I can. I just ask that you not add more items to the list, and please keep in mind that we have very specific, cycle-by-cycle tests that compare the Ruby+O3 interface to real hardware and that strongly justify this patch.
Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?
I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.
We absolutely should not force protocols to implement flow control. In particular, the tester is designed to stress the system in ways that are not practical with realistic flow control. Flow control is and should be an option for those who want more accuracy.
Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?
394 and 3095. 3099 implements the disable feature. But now I think we should not have the disable option at all, since the controller is going to receive all the requests.
Post by Brad Beckmann
I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.
The main problem, as I understand it, is that aliased requests were nacked at the sequencer. That would no longer happen, so you should not see pipeline squashes. I do not think we can assume that the L1 controller would be OK with not seeing some of the requests, so I am not in favor of fulfilling multiple requests in one go from the sequencer. I would be fine with fulfilling multiple requests together in the L1 controller; a sketch of what I mean follows.
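Concretely, something like this (illustrative names only, not the actual Sequencer code): when the controller responds to the primary request for a line, the sequencer answers that one packet and promotes the oldest secondary, re-issuing it so that the L1 sees every request:

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    using Addr = uint64_t;
    struct Packet;
    using PacketPtr = Packet *;

    // Stand-in for enqueueing on the mandatory queue.
    void issueToController(Addr lineAddr, PacketPtr pkt) { /* ... */ }

    // Per-line queues: front() is the primary request currently
    // outstanding at the L1 controller.
    std::unordered_map<Addr, std::deque<PacketPtr>> pendingByLine;

    // Called when the controller responds to the primary request.
    // Returns the packet to respond to, and sends the next
    // secondary (if any) down as the new primary.
    PacketPtr
    onPrimaryComplete(Addr lineAddr)
    {
        auto it = pendingByLine.find(lineAddr);
        if (it == pendingByLine.end() || it->second.empty())
            return nullptr;

        PacketPtr done = it->second.front();
        it->second.pop_front();

        if (it->second.empty())
            pendingByLine.erase(it);
        else
            issueToController(lineAddr, it->second.front());

        return done;
    }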


- Nilay


Joel Hestness
2015-09-12 15:27:09 UTC
Permalink
Hi guys. Can we organize a bit here? There are a lot of loose ends in this review, so it's really difficult to know how to help move this forward.

@Nilay: I appreciate your splitting this patch into orthogonal parts, since I feel it would move things along. Unfortunately, though, it's not clear whether we should continue reviewing this patch or your set. One tricky thing about reviewing your patches is that we may end up needing you to do performance validation, unless Brad/AMD is willing to take your patches and merge them into their queue. Can you state your preferences on your involvement in these changes? Which patches do you prefer we review?

@Brad: I'm sorry I do not yet feel comfortable with this patch. I was under the impression that this patch was going to change O3 CPU pipeline squashing behavior, but my tests show that current mainline O3 squashing behavior is far better than older code (e.g. chgset 10238). My microbenchmarking shows that the performance differences between this patch and current mainline appear to be trivial (when the sim'd system is appropriately configured). I'd prefer to know how this change is being validated.

Beyond performance validation, Nilay's new patches make it difficult to know what we should be reviewing. I would assume that your preference would be to finish fixing up this patch and have you/AMD validate it, rather than trying to mix in Nilay's patches, which may be hard to merge. Can you let us know your preference for how to proceed, so we know where to spend effort reviewing? If you'd prefer continuing with this review over Nilay's, can you please close/drop the issues that you feel have been resolved?


- Joel


Andreas Hansson
2015-09-12 13:54:48 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review7166
-----------------------------------------------------------


My comment seems to have gotten lost in all the different threads going on...bad S/N. Anyways, here it is:

I am of the opinion that we should probably 1) do read/write combining in the core LSQ before sending out a packet, and 2) do combining of MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?

This solution would also translate very well between both Ruby and the classic memory system.
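To make (2) concrete, here is a loose sketch of MSHR target combining (illustrative types only, not gem5's actual MSHR/MSHRQueue classes):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Addr = uint64_t;
    struct Packet;
    using PacketPtr = Packet *;

    // All misses waiting on one block hang off a single MSHR.
    struct MSHR {
        Addr blkAddr;
        std::vector<PacketPtr> targets;
    };

    class MSHRQueue {
      public:
        // Returns true if a request must be sent downstream (first
        // miss to this block); later misses just append a target.
        bool handleMiss(Addr blkAddr, PacketPtr pkt) {
            auto [it, inserted] = mshrs.try_emplace(blkAddr,
                                                    MSHR{blkAddr, {}});
            it->second.targets.push_back(pkt);
            return inserted;
        }

      private:
        std::unordered_map<Addr, MSHR> mshrs;
    };

When the fill returns, the cache services every target against the freshly filled block, so only one miss per block goes downstream.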

- Andreas Hansson
Joel Hestness
2015-09-12 14:27:16 UTC
Permalink
Post by Anthony Gutierrez
Post by Andreas Hansson
I am of the opinion that we should probably 1) do read/write combining in the core LSQ before sending out a packet, and 2) combining of MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?
This solution would also translate very well between both Ruby and the classic memory system.
Hi Andreas. I think we all agree with you about where coalescing should happen. It appears that (1) is available from particular cores (e.g. the O3 CPU). The problem currently is that getting (2) in Ruby would require very non-trivial modification to the L1 controllers (to do combining) in each of the existing protocols (7 in gem5 + at least 2 not yet in gem5). To avoid all this protocol modification, this patch is AMD's proposal to do L1 MSHR-like combining within the sequencer. This proposed solution should be viewed as a stopgap on the road to MSHR combining in the L1 controllers.


- Joel


Andreas Hansson
2015-09-12 14:31:42 UTC
Permalink
Post by Joel Hestness
Post by Andreas Hansson
I am of the opinion that we should probably 1) do read/write combining in the core LSQ before sending out a packet, and 2) combining of MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?
This solution would also translate very well between both Ruby and the classic memory system.
Hi Andreas. I think we all agree with you about where coalescing should happen. It appears that (1) is available from particular cores (e.g. the O3 CPU). The problem currently is that getting (2) in Ruby would require very non-trivial modification to the L1 controllers (to do combining) in each of the existing protocols (7 in gem5 + at least 2 not yet in gem5). To avoid all this protocol modification, this patch is AMD's proposal to do L1 MSHR-like combining within the sequencer. This proposed solution should be viewed as a stopgap on the road to MSHR combining in the L1 controllers.
I see. Thanks for the clarification.

I am fairly convinced the O3 CPU is not doing any coalescing at the moment. Are you sure? If not, I'd say that's the place to start. Coalescing in the LSQ is probably as important as coalescing of MSHR targets, if not more so.


- Andreas


Steve Reinhardt
2015-09-14 00:50:50 UTC
Permalink
Forking this off into a separate thread since it's not even one of the
issues that Joel brought up when he tried to organize things...

Andreas, can you clarify more what you mean by adding coalescing in the O3
LSQ? I'm not sure exactly what you mean, and I have some concerns about how
that would work and also what it would save us.

First, are you talking about having the LSQ track memory accesses at cache-block granularity, and only forwarding a single access per cache block to the L1 cache? What happens then when I have multiple read accesses to a cache block, but the block arrives with a pending invalidation? In the current cache we would invalidate the block after completing the one outstanding access the L1 is aware of, which means that the subsequent accesses from the LSQ (forwarded after the first hit completes) would be unnecessary misses. You could fix this by hanging on to the last block in the cache even after acking the invalidation, as long as you had additional accesses to the same block (actually discarding the block only on the first access to a different block), but that would not simplify the cache access code path :).

Second, I'm confused about how this would help, given that we need to do coalescing at any shared downstream (e.g., L2) cache anyway, to combine multiple accesses to the same block from different upstream (e.g., L1) caches. If pushing coalescing up into the LSQ doesn't let us remove it from the cache model, it doesn't seem like a big win to me. I understand that it would be a win on the Ruby side, given the current situation there, but that raises the question in my mind of how Ruby deals with multiple coalescable accesses to shared downstream caches itself, and why that solution can't be applied to the Ruby L1 caches.

Thanks,

Steve
Beckmann, Brad
2015-09-15 00:32:40 UTC
Permalink
Hi All,

At the bottom of Steve's message, he wrote " If pushing coalescing up into the LSQ doesn't let us remove it from the cache model, it doesn't seem like a big win to me. I understand that it would be a win on the Ruby side, given the current situation there, but that raises the question in my mind of how Ruby deals with multiple coalescable accesses to shared downstream caches itself, and why can't that solution be applied to the Ruby L1 caches?"

I just want to clarify that I do not want to push the coalescing into the LSQ. The packets generated from the CPU have always been per-scalar-instruction, per-request and per-address. I personally want to keep them this way. The Ruby Sequencer contains protocol-independent logic that is part of the L1 cache, such as coalescing from a single thread.

To answer Steve's question, Ruby allows coalescing in downstream shared caches, but it is protocol dependent. The protocols can dictate what coalescing from multiple threads is permissible (for instance, multiple writes can merge under weaker memory models, but not stronger ones). Thus it is not something we can do in a protocol-independent fashion; the toy check below illustrates the dependence.
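A toy illustration (not gem5 code) of why cross-thread coalescing is a protocol decision: read-read combining is generally safe, while merging writes from different threads is only safe under a sufficiently weak model:

    // Hypothetical policy hook a protocol might expose.
    enum class Model { TSO, Weak };

    bool
    mayCoalesceAcrossThreads(bool isWrite, Model model)
    {
        if (!isWrite)
            return true;             // read-read combining is safe
        return model == Model::Weak; // write merging needs weak ordering
    }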

Brad



Joel Hestness
2015-09-12 14:49:01 UTC
Permalink
Sorry. Inspecting the O3 LSQ code, I believe you are correct that it does not do any coalescing.


- Joel

