FIRST BIG QUESTION: Address Range vs per-Cache-Line CMO instructions #9
I totally agree with the basic principle of specifying the full address range you want to work on, and the hardware telling you how much of it it actually did. Trapping if there is a problem is of course one option, but I'd like to see a way to allow the software loop to learn there is a problem and handle it. I've made a suggestion for this in #8 and on the mailing list.
I have a proposal (coming later this week) that runs pretty much counter to most of what is said and suggested here (it's 89 slides long and goes into great detail). And the answer to the mainline question is that the baseline CMO operations (I call them Cache Block Operations, CBOs -- with emphasis on Block) should be one cache line at a time. Where it's safe to do so and well defined, we can create something I call an MRO, which has arbitrary byte ranges on it. I'd implement these as an instruction that always traps and runs a sequence of CBOs or stores. More details to come. Derek
I wonder whether specifying address ranges and returning the number of bytes affected is too CISC-y. Some implementations may want it, but single-cache-block operations will get most of the implementations most of the way there. Requiring rs1=rd seems like a waste of opcode space. I look forward to @strikerdw's write-up!
On Tue, Sep 15, 2020 at 6:28 PM John Ingalls ***@***.***> wrote:
I wonder whether specifying address ranges and returning the number of
bytes affected is too CISC-y.
I can't see how it's CISCy if it takes 1 clock cycle and affects 1 cache
line. Arbitrarily complex 2R1W register-to-register logic is fine for a
RISC instruction, as long as it's combinatorial, not sequential.
Requiring rs1=rd seems like a waste of opcode space.
Maybe. It would mean that 97% of the room in that opcode space would be
available for other instructions that never want rs1=rd.
Regarding "Arbitrarily complex 2R1W register-to-register logic is fine for a RISC instruction": that's fine for arithmetic and in-order processors which share the same register file for all instructions, but it comes with added costs or complexity for memory instructions in larger designs. To illustrate, let's consider the BOOM core [1] (details are different in other pipeline arrangements / proprietary microarchitectures, but the themes still apply). BOOM has one physical register file read port [2] per load/store address calculation (x2 issue) [3]. A straightforward implementation of this new register+register memory addressing mode in BOOM would add another read port to the physical register file and another fanout to the data forwarding bypass network. Granted, adding this functionality is straightforward, and the larger cost is only imposed on larger designs, i.e. the cost is proportional to the overall size, and there are trade-off techniques to bring the cost down, but those add more complexity. Again, this isn't free, and it doesn't "Just Work (TM)" with existing plumbing like fixed-block-size CMOs would. Do CMOs really need to drive us to introduce a new memory addressing mode (register+register)?
[1] https://github.com/riscv-boom/riscv-boom
[2] https://docs.boom-core.org/en/latest/sections/reg-file-bypass-network.html
[3] https://docs.boom-core.org/en/latest/sections/load-store-unit.html
"Do CMOs really need to drive us to introduce a new memory addressing mode
(register+register)?"
I don't think so.
Furthermore, I don't see the connection you're making here.
Specifying a desired address range in two registers isn't register+register
addressing. The address used for the cache block operation is the address
in the first register.
You're correct, Bruce, and I apologize for my imprecision. I'll use the notation "register<=register" instead. The connection is that the address used for the cache block operation in the first register is compared against the end address in the second register. This may either be done at added cost in the same pipeline as the CMO, or in different pipelines at added complexity cost.
I agree with that, and that a comparison of two 64 bit values (or even two
58 or 59 bit values after you mask off the low bits) is expensive.
That's one reason I prefer base + desired length as the arguments, and returning the achieved length (which is equal to the cache block size in the common case that the base is aligned, the desired size is >= the cache block size, and the machine operates on one cache block per execution of the instruction).
I haven't worked through the exact details, but I think it's not difficult
to compute the achieved length in the cases where it is not the cache block
size, and for sure it involves only 5 or 6 bit adders, not full address
width.
The achieved length is then added to the start address to get the next
start address, and subtracted from the desired length, by instructions the
user writes (or, more likely, that their library author writes).
Returning the achieved length has the additional advantage that results
such as 0 or negative values are available to tell the user's driver code
that the hardware didn't do anything.
I won't preempt Derek's proposal, which would work also, at slightly lower
hardware cost but with quite a few more instructions in the user program
loop.
It's a continuum, and hard to say what is best.
The fast interrupts group have chosen to make the hardware just a little
more complex in order to reduce the number of instructions and clock cycles
to get to the user's handler. The Vector extension uses a "tell the machine
the desired length and it tells you the achieved length" mechanism with the
user code adding the (scaled) achieved length to all the data pointers and
subtracting it from the desired length.
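As a rough sketch of what that user/library loop might look like under this base + desired-length convention (the mnemonic, operand order, and register choices below are illustrative assumptions, not a proposed encoding):

    # a0 = base address, a1 = desired length in bytes (register choices are illustrative)
loop:
    cmo.clean  a2, a0, a1     # hypothetical encoding: clean up to a1 bytes starting at a0,
                              # returning the achieved length in a2
    blez  a2, nothing_done    # a zero (or negative) result tells the driver code that the
                              # hardware did nothing
    add   a0, a0, a2          # achieved length added to the start address ...
    sub   a1, a1, a2          # ... and subtracted from the desired length
    bgtz  a1, loop
nothing_done:                 # fall back to whatever the platform requires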
Still working my way through strikerdw's proposal & following discussion, but I wanted to mention this for the record: There is real demand for the performance gain of using a hardware FSM to flush caches rather than a software loop. I've been involved first-hand in the following implementations in CPU IP products (which were all directly motivated by customer requests):
- FSM to flush L1 cache, controlled by writing to the equivalent of CSRs
- FSM to flush L2 cache, controlled by writing to memory-mapped registers
- FSM to flush L1 cache, controlled by CMO instruction
None of these implementations were standardized per the ISA, and each was specific to a particular CPU implementation. Portable software cannot count on any of the above being available unless it knows exactly which CPU microarchitecture/pipeline it is running on.
I would much prefer to have this standardized so that portable software can use it and get whatever performance benefit a particular CPU implementation has to offer. I agree that CMO instructions should be defined in a way that cleanly degenerates to a simple operation on one cache line in simple CPU pipelines. I really like the way that a range-based instruction can take just one trap to emulate a range-based operation (via instruction, CSR, or memory-mapped register) rather than trapping on each line. Some other misc comments so far on strikerdw's proposal (sorry if these have been mentioned previously; still catching up on that discussion):
- I don't agree that DCBZ is necessarily the preferred way to delete ECC errors from caches; I'm not sure it's even desirable to try to define a standard/portable way to do this (what if you have a strange cache microarchitecture, or if the ECC error is from the TLB or branch prediction table, or what if ...)
Cheers,
On Tue, Sep 22, 2020 at 11:08 AM PhilNMcCoy ***@***.***> wrote:
There is real demand for the performance gain of using a hardware FSM to
flush caches rather than a software loop. I've been involved first-hand in
the following implementations in CPU IP products (which were all directly
motivated by customer requests):
- FSM to flush L1 cache, controlled by writing to the equivalent of
CSRs
- FSM to flush L2 cache, controlled by writing to memory-mapped
registers
- FSM to flush L1 cache, controlled by CMO instruction
None of these implementations were standardized per the ISA, and each
was specific to a particular CPU implementation. Portable software cannot
count on any of the above being available unless it knows exactly which CPU
microarchitecture/pipeline it is running on.
I would much prefer to have this standardized so that portable software
can use it and get whatever performance benefit a particular CPU
implementation has to offer.
It seems like the preceding specifically do not want to be address or
address-range based operations, but set/way-based operations that flush
whatever addresses are found in the cache entries. Which then falls
outside of what Derek's slides were trying to focus on.
A common use case for the above is power management (e.g. entering deeper
sleep states in which caches need to be completely flushed) - for which a
"flush all sets/ways" operation is desired. But this could be done
efficiently in a careful software loop (using block set/way operations) by
the hart that owns those caches (i.e. the total flush time should be limited
more by all the cache flushing activity than by the loop code itself).
Only if some other entity needs to perform this operation would the
argument for a hardware FSM be stronger.
But I'm not really trying to argue against hardware FSM approaches.
Instead I would note that (I think) the use cases for "flush all sets/ways"
operations are in platform-specific code and hence have a weaker need for
ISA standardization. But this really gets into the broader topic of
standardizing set/way CMOs - which can be separated from the topic of
address-based CMOs.
- I don't agree that DCBZ is necessarily the preferred way to delete
ECC errors from caches; I'm not sure it's even desirable to try to define a
standard/portable way to do this (what if you have a strange cache
microarchitecture, or if the ECC error is from the TLB or branch prediction
table or what if ... )
I would have said that this IS the preferred way to remove a poisoned line
from the coherence domain. (Note that this isn't simply about cleaning out
an ECC error in a cache entry - a set/way operation on that cache would
handle that.) The problem is that the only way to "un-poison" a poisoned
line floating around the cache/memory hierarchy is to coherently overwrite
the entire cache line in some reliable way. Simply doing ISA stores to the
entire line is uarch-dependent and even then not necessarily reliable.
Greg
It seems the real performance gains to flushing an entire cache (or possibly even a range?) would be obtained from making the operation asynchronous to the rest of instruction execution. For full cache operations, you don't have to worry about address translation (and the possible translation traps that result), so that seems achievable. In general, though, it's still not clear to me how a range instruction provides a performance benefit, especially if you're tying up cache resources and execution resources to execute it. (Granted, the trap & emulate case goes faster because you're passing the range to the handler once instead of trapping per op, but aren't there cheaper ways of obtaining the same result, e.g. an SBI call? And I'm curious about the kinds of designs where this method provides a real performance advantage, since this style of operation provides no benefit to the types of designs I work on.) Anyway, I'm looking forward to a robust conversation on this topic in the near future.... :)
I agree with Phil McCoy as I also have experience with performance requirements that are not well met by block-at-a-time instructions. The ability to process blocks rapidly in complex cache structures helps performance (when there's an on-chip backing store). I agree with dkruckemyer-ventana that there is more performance to be gained from flushing in the background, but then some construct needs to be added to know when the flushing is done, which tends to require more than just an instruction. Concepts also need to be developed such as looking up but not allocating cache lines. And if it's for a region only, there is a question of whether accesses to the region are stalled while the flush is ongoing.... Maybe we should approach the concept, but it is a bigger thing.

Another possibility is to do only non-required operations in the background. These might be done using rd=x0 as a marker that the instruction is a hint. If the operation is not required for correctness but as an optimization, it can be done in the background. The x0 result register says that, architecturally, the hart is not required to wait for the result or even to do the operation at all. Cache management for I/O purposes would not usually meet this requirement, but cache management for better performance often would. Simpler implementations do a single cache line - or nothing at all.
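For illustration only (the mnemonic and operands are hypothetical, reusing the range form discussed earlier in the thread), such a hint could be written as:

    cmo.clean  x0, a0, a1     # rd = x0: a pure hint - the hart may perform the clean in the
                              # background, later, or not at all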
What if the error is in the cache tag RAM - how do you know which address to target with your DCBZ, and how does the hardware know which way (if any) has that address? I don't want to digress into a big debate about ECC - I don't think it's part of the charter for this workgroup anyway and I don't want to delay ratification of useful CMOs. Suffice it to say that we're not in universal agreement that DCBZ is the One True Way to handle ECC.
Running the FSM in the background while the CPU gets on with other work is part of it (especially with multithreaded CPUs, L2 caches, etc.). The other part is that you can pipeline the cache RAM accesses more tightly when the control logic knows it is working on a contiguous range of addresses. Even an unrolled loop of CMO Clean operations (example assuming 64B lines):
CMO Clean 0
CMO Clean 64
CMO Clean 128
CMO Clean 192
...
would have pipeline stalls, whereas an FSM can pipeline the tag reads and line cleans back-to-back. If software is trying to flush say 8KB from a 64KB cache, doing an address-range operation with an FSM can be much more efficient than either doing a software loop line by line or doing an FSM-based flush of the entire cache (which will create lots of extra cache misses later).
Cheers,
On Tue, Sep 22, 2020 at 12:45 PM PhilNMcCoy ***@***.***> wrote:
Even an unrolled loop of CMO Clean operations would have pipeline stalls
(example assuming 64B lines)
CMO Clean 0
CMO Clean 64
CMO Clean 128
CMO Clean 192
...
If software is trying to flush say 8KB from a 64KB cache, doing an
address-range operation with an FSM can be much more efficient than either
doing a software loop line by line
Yes, but only if all the resultant writeback traffic isn't the limiter.
Whether software or hardware scans through an 8KB address range and causes
8KB's worth of cache line "cleans" (i.e. writebacks of dirty data to
memory), unless the CPU and the system can perform a cache-line-sized block
transfer every couple of CPU clocks on a sustained basis - including
resolving coherency (e.g. doing any needed snoops) for each line
being "cleaned" - software versus hardware won't really matter.
or doing an FSM-based flush of the entire cache (which will create lots of
extra cache misses later).
I agree that if you only want to operate across a range of addresses, then
you don't want to be using a "whole cache" operation.
As a side note, taking this 8KB / 64KB example and assuming 8-way
set-associativity, I would observe that cleaning 8 KB out of this cache
involves scanning through all sets of the cache in all these approaches
(with 64B lines, a 64KB 8-way cache has 128 sets, and an aligned 8KB range
covers 128 lines - one per set), e.g. the FSM-based clean of the whole cache
would do the same number of cache lookups as an address range-based FSM.
Only with greater or lesser associativity would there be a difference (but
not a factor of 8x).
Greg
Taking the same 8KB/64KB example Greg comments on, if the cache is a 4-way sectored cache, the FSM takes 1/4 the cycles to go through.
On Tue, Sep 22, 2020 at 12:45 PM PhilNMcCoy ***@***.***> wrote:
I would have said that this IS the preferred way to remove a poisoned line
from the coherence domain.
What if the error is in the cache tag RAM - how do you know which address
to target with your DCBZ, and how does the hardware know which way (if any)
has that address? I don't want to digress into a big debate about ECC - I
don't think it's part of the charter for this workgroup anyway and I don't
want to delay ratification of useful CMOs. Suffice it to say that we're not
in universal agreement that DCBZ is the One True Way to handle ECC.
I wasn't trying to imply that DCBZ is the appropriate hammer for all ECC
nails. I agree that it isn't.
If you have a poisoned line in the cache/memory system (whether it's in
your cache or someone else's cache or out in DRAM), that generally means a
line with a valid address but corrupted data. For coherently removing the
poison on this address, a DCBZ is a very nice hammer (with not much in the
way of other good ISA options).
If one has a cache line with a corrupted tag, then one has a much bigger
problem since there now is an unknown address in the system that is
potentially corrupted (if this line held dirty data). To clean up this
cache entry, a DCBZ is useless. What you want is a set/way line invalidate
operation. And in any case there is a bigger system-level issue to be
dealt with.
I agree that we don't want to digress into a RAS discussion (especially
since a new RAS TG will be forming soon). But removing poison from an
address in the system is a notable use case for DCBZ (besides the popular
block-zeroing of memory use cases), especially since there aren't any
other good options in the current ISA or in any currently contemplated arch
extensions.
Greg
BTW, I am going to break my rule about trying to have discussions on email and not on the list, and I just want to make two points that are directly relevant: The performance advantage in using a hardware FSM is NOT the biggest justification for address range.

IMHO one of the biggest reasons is dealing with "idiosyncratic and inconsistent systems", when you have a system that is assembled out of IP blocks from different vendors. Some of the cache IP blocks may not respond to the CMO bus transactions emitted by your CPU. Sometimes the bus bridge between your CPU's preferred bus and the busses used elsewhere in the system bridges ordinary loads and stores, but doesn't bridge CMO bus transactions. Etc. Each cache IP block probably has a mechanism to do CMOs - e.g. MMIOs - but they may be different for different vendors. Heterogeneous multiprocessor systems can be worse - a mixture of CPUs, GPUs, DSPs, multiple of each from different vendors.

We would prefer not to have user or even system code know about such idiosyncrasies. It is straightforward to trap any CMO instructions to M-mode, and then have M-mode do whatever is needed to deal with your idiosyncratic system. That's the RISC way. But trapping on every 64B cache line can be really expensive. Whereas if you have address ranges, you can trap once for the entire range.

Same issue, different point of view: if you are a RISC-V vendor who has already shipped hardware using whatever cache flush mechanism you have - definitely not the RISC-V CMO standard, because that does not exist yet - would it not be nice to be able to ship a software patch so that systems that are already in the field can run new code using the CMO instructions that we will define soon? The general solution to such compatibility, running new code on old systems, is trap and emulate. But trap and emulate is slow. Unless you can handle multiple such operations in a single trap.

Put another way, compatibility and performance of the CMOs are among the motivations for address range. Compatibility: interfacing to hardware that does not provide the bus support or the full system support needed to transparently implement the CMO instructions. Performance: the performance of the trap and emulate that provides that compatibility - i.e. not the performance of a hardware FSM, but the performance of software emulation.

Companies that build the entire system have the benefit of having ensured that the entire system works together. However, there are markets that do not have this luxury. I am tempted to say the "embedded" market, but that's not 100% true - there are some embedded product lines that don't have to live with such heterogeneous and idiosyncratic systems. However, there are some that do. Moreover, even if the vendors of the different IP blocks - CPUs and GPUs and DSPs and caches and bus bridges - are willing to work together to make sure that CPU CMO instructions get properly put onto the bus and bridged to other buses and interpreted by other cache IP blocks, sometimes it takes an extra six months to do so. In some markets, that makes a big difference.

There are other reasons, many of which are on the wiki and/or in the original proposal. But this is one of the biggest reasons.
Leaving aside my own bias, one question is whether "idiosyncratic and
inconsistent systems" are the tail wagging the dog. Is this the atypical
design and 95%+ designs don't have this issue, or is this something that a
significant fraction of systems have to deal with?
Picking on the point about one CPU needing to do a global CMO that covers
other CPUs' non-coherent caches (aka "cache IP blocks that may not respond
to the CMO bus transactions emitted by your CPU"), that sounds like an
"interesting" system. This would be in contrast to a system where a CPU
with a non-coherent cache software-manages its own cache?
Should a big goal of base CMOs be to support encapsulating all the
system-specific vagaries of partially-coherent and non-coherent systems in
trappable global CMO instructions?
Greg
My own experience (not that it is representative of lower-end embedded
designs) is that an IP
In my opinion, this is the first big question that the CMO group needs to answer. Top priority because (a) it is a big source of disagreement, (b) the J-extension I/D proposal by Derek Williams wants to follow the CMO group, and (c) the decision has big implications for code portability, for legacy compatibility with RISC-V systems already built, and for building systems where the CPU IP is developed independently of the bus IP and external cache and memory IP - i.e. for system "mash-ups".
Should we provide traditional RISC cache-line-at-a-time instructions, like POWER's DCBF, DCBI, DCBZ, ...? (Not just RISCs: CISCs like x86 have them too, e.g. CLFLUSH.) Basically, instructions of the form CMO <memory address>. However, probably not of the form CMO rs1, Mem[rs2+imm12], because such 2reg+imm formats are quite expensive; if we were to do per-cache-line operations, they would probably be of the form CMO rs1:cacheline_address.
Or should we provide "address range" CMO operations?
The draft proposal (by me, Andy Glew - TBD link here) contains a proposal for address range CMOs. Actually, it is a proposal for an instruction that can be implemented in several different ways, as described below. This CMO.AR.* instruction (AR = address range) is intended to be used in a loop like the sketch below.
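A minimal sketch of such a loop, assuming an illustrative CMO.AR.FLUSH mnemonic that takes the current address in rs1 and the end address in rs2, and writes the next not-yet-processed address to rd (with rd = rs1):

    # a0 = current start address, a1 = end address (exclusive); registers are illustrative
loop:
    cmo.ar.flush  a0, a0, a1  # operate on as much of [a0, a1) as the implementation
                              # chooses; return the next unprocessed address in a0
    bltu  a0, a1, loop        # repeat until the whole range has been covered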
(This is just an example, although IMHO the best. Other issues will discuss details like [start,end] vs [start,end) vs [start,start+length) vs ... But many if not most of the address range proposals have a loop like the above, varying in minor details like BNE vs BLT vs ...)
It can be implemented in different ways:
(1) per-cache-line implementations, i.e. the traditional RISC way:
rs1 contains an address in the cache line to be invalidated; an address in the next cache line is returned in rd. (My proposal requires rs1=rd, in order to be restartable after exceptions like page faults without requiring OS modifications, but that can be tweaked.)
(2) trap to M-mode, so that it can be emulated on systems where idiosyncratic MMIOs and CSRs invalidate caches that the CPU IP is not aware of.
KEY: the M-mode software can perform the operation over the entire address range, and thus has much lower overhead than if it had to trap on every cache line or DCBF block.
(3) using state machines and block invalidations, i.e. using microarchitecture techniques that may be more efficient than a cache line at a time.
These can apply the CMO to the entire address region; but if they encounter something like a page fault, they stop so the OS can handle it, i.e. they are restartable.
It is not the purpose of this issue to discuss all of the details about which register operand encodes which values, or whether the loop-closing test should be a BNE or a BLT, or whether the end address should be inclusive or exclusive. Those undoubtedly will be subsequent issues.
This issue is mainly for the overall question: should RISC-V CMOs be traditional per-cache-line operations, or should they be address ranges using the approach above that allows per-cache-line implementations?