
Friday, 21 February 2014

Dynamic Scheduling: Tomasulo's Approach

Overcoming Data Hazards with Dynamic Scheduling

A simple statically scheduled pipeline fetches an instruction and issues it, unless there is a data dependence between an instruction already in the pipeline and the fetched instruction that cannot be hidden with bypassing or forwarding. (Forwarding logic reduces the effective pipeline latency so that certain dependences do not result in hazards.) If there is a data dependence that cannot be hidden, then the hazard detection hardware stalls the pipeline starting with the instruction that uses the result. No new instructions are fetched or issued until the dependence is cleared.
In this section, we explore dynamic scheduling, in which the hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior. Dynamic scheduling offers several advantages. First, it allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline, eliminating the need to have multiple binaries and recompile for a different microarchitecture. In today’s computing environment, where much of the software is from third parties and distributed in binary form, this advantage is significant. Second, it enables handling some cases when dependences are unknown at compile time; for example, they may involve a memory reference or a data-dependent branch, or they may result from a modern programming environment that uses dynamic linking or dispatching. Third, and perhaps most importantly, it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve. In Section 3.6, we explore hardware speculation, a technique with additional performance advantages, which builds on dynamic scheduling. As we will see, the advantages of dynamic scheduling are gained at the cost of a significant increase in hardware complexity.
Although a dynamically scheduled processor cannot change the data flow, it tries to avoid stalling when dependences are present. In contrast, static pipeline scheduling by the compiler (covered in Section 3.2) tries to minimize stalls by separating dependent instructions so that they will not lead to hazards. Of course, compiler pipeline scheduling can also be used on code destined to run on a processor with a dynamically scheduled pipeline.

Dynamic Scheduling: The Idea

A major limitation of simple pipelining techniques is that they use in-order instruction issue and execution: Instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instructions can proceed. Thus, if there is a dependence between two closely spaced instructions in the pipeline, this will lead to a hazard and a stall will result. If there are multiple functional units, these units could lie idle. If instruction j depends on a long-running instruction i, currently in execution in the pipeline, then all instructions after j must be stalled until i is finished and j can execute. For example, consider this code:
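(A representative sequence with the dependence pattern described next; the register choices are illustrative:)

    DIV.D   F0,F2,F4
    ADD.D   F10,F0,F8
    SUB.D   F12,F8,F14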
The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet, SUB.D is not data dependent on anything in the pipeline. This hazard creates a performance limitation that can be eliminated by not requiring instructions to execute in program order.
In the classic five-stage pipeline, both structural and data hazards could be checked during instruction decode (ID): When an instruction could execute without hazards, it was issued from ID knowing that all data hazards had been resolved.
To allow us to begin executing the SUB.D in the above example, we must separate the issue process into two parts: checking for any structural hazards and waiting for the absence of a data hazard. Thus, we still use in-order instruction issue (i.e., instructions issued in program order), but we want an instruction to begin execution as soon as its data operands are available. Such a pipeline does out-of-order execution, which implies out-of-order completion.
Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not exist in the five-stage integer pipeline and its logical extension to an in-order floating-point pipeline. Consider the following MIPS floating-point code sequence:
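(A representative sequence, with illustrative registers: ADD.D waits on DIV.D, SUB.D later writes a register that ADD.D reads, and MUL.D writes the same destination as ADD.D:)

    DIV.D   F0,F2,F4
    ADD.D   F6,F0,F8
    SUB.D   F8,F10,F14
    MUL.D   F6,F10,F8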
There is an antidependence between the ADD.D and the SUB.D, and if the pipeline executes the SUB.D before the ADD.D (which is waiting for the DIV.D), it will violate the antidependence, yielding a WAR hazard. Likewise, to avoid violating output dependences, such as the write of F6 by MUL.D, WAW hazards must be handled. As we will see, both these hazards are avoided by the use of register renaming.
Out-of-order completion also creates major complications in handling exceptions. Dynamic scheduling with out-of-order completion must preserve exception behavior in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually do arise. Dynamically scheduled processors preserve exception behavior by delaying the notification of an associated exception until the processor knows that the instruction should be the next one completed.
Although exception behavior must be preserved, dynamically scheduled processors could generate imprecise exceptions. An exception is imprecise if the processor state when an exception is raised does not look exactly as if the instructions were executed sequentially in strict program order. Imprecise exceptions can occur because of two possibilities:
  1. The pipeline may have already completed instructions that are later in program order than the instruction causing the exception.
  2. The pipeline may have not yet completed some instructions that are earlier in program order than the instruction causing the exception.
Imprecise exceptions make it difficult to restart execution after an exception. Rather than address these problems in this section, we will discuss a solution that provides precise exceptions in the context of a processor with speculation in Section 3.6. For floating-point exceptions, other solutions have been used, as discussed in Appendix J.
To allow out-of-order execution, we essentially split the ID pipe stage of our simple five-stage pipeline into two stages:
  1. Issue—Decode instructions, check for structural hazards.
  2. Read operands—Wait until no data hazards, then read operands.
An instruction fetch stage precedes the issue stage and may fetch either into an instruction register or into a queue of pending instructions; instructions are then issued from the register or queue. The execution stage follows the read operands stage, just as in the five-stage pipeline. Execution may take multiple cycles, depending on the operation.
We distinguish when an instruction begins execution and when it completes execution; between the two times, the instruction is in execution. Our pipeline allows multiple instructions to be in execution at the same time; without this capability, a major advantage of dynamic scheduling is lost. Having multiple instructions in execution at once requires multiple functional units, pipelined functional units, or both. Since these two capabilities—pipelined functional units and multiple functional units—are essentially equivalent for the purposes of pipeline control, we will assume the processor has multiple functional units.
In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences; it is named after the CDC 6600 scoreboard, which developed this capability. Here, we focus on a more sophisticated technique, called Tomasulo’s algorithm. The primary difference is that Tomasulo’s algorithm handles antidependences and output dependences by effectively renaming the registers dynamically. Additionally, Tomasulo’s algorithm can be extended to handle speculation, a technique to reduce the effect of control dependences by predicting the outcome of a branch, executing instructions at the predicted destination address, and taking corrective actions when the prediction was wrong. While the use of scoreboarding is probably sufficient to support a simple two-issue superscalar like the ARM A8, a more aggressive processor, like the four-issue Intel i7, benefits from the use of out-of-order execution.

Dynamic Scheduling Using Tomasulo’s Approach

The IBM 360/91 floating-point unit used a sophisticated scheme to allow out-of-order execution. This scheme, invented by Robert Tomasulo, tracks when operands for instructions are available to minimize RAW hazards and introduces register renaming in hardware to minimize WAW and WAR hazards. There are many variations on this scheme in modern processors, although the key concepts of tracking instruction dependences to allow execution as soon as operands are available and renaming registers to avoid WAR and WAW hazards are common characteristics.
IBM’s goal was to achieve high floating-point performance from an instruction set and from compilers designed for the entire 360 computer family, rather than from specialized compilers for the high-end processors. The 360 architecture had only four double-precision floating-point registers, which limits the effectiveness of compiler scheduling; this fact was another motivation for the Tomasulo approach. In addition, the IBM 360/91 had long memory accesses and long floating-point delays, which Tomasulo’s algorithm was designed to overcome. At the end of the section, we will see that Tomasulo’s algorithm can also support the overlapped execution of multiple iterations of a loop.
We explain the algorithm, which focuses on the floating-point unit and load-store unit, in the context of the MIPS instruction set. The primary difference between MIPS and the 360 is the presence of register-memory instructions in the latter architecture. Because Tomasulo’s algorithm uses a load functional unit, no significant changes are needed to add register-memory addressing modes. The IBM 360/91 also had pipelined functional units, rather than multiple functional units, but we describe the algorithm as if there were multiple functional units. It is a simple conceptual extension to also pipeline those functional units.
As we will see, RAW hazards are avoided by executing an instruction only when its operands are available, which is exactly what the simpler scoreboarding approach provides. WAR and WAW hazards, which arise from name dependences, are eliminated by register renaming. Register renaming eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that the out-of-order write does not affect any instructions that depend on an earlier value of an operand.
To better understand how register renaming eliminates WAR and WAW hazards, consider the following example code sequence that includes potential WAR and WAW hazards:
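(A representative sequence, with illustrative registers, containing the dependences discussed below:)

    DIV.D   F0,F2,F4
    ADD.D   F6,F0,F8
    S.D     F6,0(R1)
    SUB.D   F8,F10,F14
    MUL.D   F6,F10,F8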
There are two antidependences: between the ADD.D and the SUB.D and between the S.D and the MUL.D. There is also an output dependence between the ADD.D and the MUL.D, leading to three possible hazards: WAR hazards on the use of F8 by ADD.D and the use of F6 by the S.D, as well as a WAW hazard since the ADD.D may finish later than the MUL.D. There are also three true data dependences: between the DIV.D and the ADD.D, between the SUB.D and the MUL.D, and between the ADD.D and the S.D.
These three name dependences can all be eliminated by register renaming. For simplicity, assume the existence of two temporary registers, S and T. Using S and T, the sequence can be rewritten without any dependences as:
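(Assuming the sequence above, one possible renaming, with S taking the place of F6 in the ADD.D and the S.D, and T taking the place of F8 in the SUB.D and the MUL.D:)

    DIV.D   F0,F2,F4
    ADD.D   S,F0,F8
    S.D     S,0(R1)
    SUB.D   T,F10,F14
    MUL.D   F6,F10,T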
In addition, any subsequent uses of F8 must be replaced by the register T. In this code segment, the renaming process can be done statically by the compiler. Finding any uses of F8 that are later in the code requires either sophisticated compiler analysis or hardware support, since there may be intervening branches between the above code segment and a later use of F8. As we will see, Tomasulo’s algorithm can handle renaming across branches.
In Tomasulo’s scheme, register renaming is provided by reservation stations, which buffer the operands of instructions waiting to issue. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming.
Since there can be more reservation stations than real registers, the technique can even eliminate hazards arising from name dependences that could not be eliminated by a compiler. As we explore the components of Tomasulo’s scheme, we will return to the topic of register renaming and see exactly how the renaming occurs and how it eliminates WAR and WAW hazards.
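To make the renaming idea concrete, here is a minimal Python sketch of a renamer with an effectively unlimited supply of names; the tuple encoding and tag names are only illustrative. After renaming, only the true (RAW) dependences remain; the WAR and WAW hazards in the example above disappear because each write gets its own name.

    # Rename every destination register to a fresh tag; sources read the newest
    # tag recorded for their architectural register. Only RAW dependences survive.
    def rename(instructions):
        """instructions: list of (op, dest, src1, src2) in architectural names."""
        newest = {}                               # architectural register -> current tag
        fresh = (f"T{i}" for i in range(10_000))  # assume "enough" rename tags
        renamed = []
        for op, dest, s1, s2 in instructions:
            s1 = newest.get(s1, s1)               # read the most recent producer
            s2 = newest.get(s2, s2)
            tag = next(fresh)                     # new name for this write
            newest[dest] = tag                    # later readers will see the tag
            renamed.append((op, tag, s1, s2))
        return renamed

    code = [("DIV.D", "F0", "F2", "F4"),
            ("ADD.D", "F6", "F0", "F8"),
            ("SUB.D", "F8", "F10", "F14"),        # WAR on F8 with ADD.D is gone
            ("MUL.D", "F6", "F10", "F8")]         # WAW on F6 with ADD.D is gone
    for instr in rename(code):
        print(instr)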
The use of reservation stations, rather than a centralized register file, leads to two other important properties. First, hazard detection and execution control are distributed: The information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This bypassing is done with a common result bus that allows all units waiting for an operand to be loaded simultaneously (on the 360/91 this is called the common data bus, or CDB). In pipelines with multiple execution units and issuing multiple instructions per clock, more than one result bus will be needed.
Figure 3.6 shows the basic structure of a Tomasulo-based processor, including both the floating-point unit and the load/store unit; none of the execution control tables is shown. Each reservation station holds an instruction that has been issued and is awaiting execution at a functional unit and either the operand values for that instruction, if they have already been computed, or else the names of the reservation stations that will provide the operand values.
 

Figure 3.6 The basic structure of a MIPS floating-point unit using Tomasulo’s algorithm. Instructions are sent from the instruction unit into the instruction queue from which they are issued in first-in, first-out (FIFO) order. The reservation stations include the operation and the actual operands, as well as information used for detecting and resolving hazards. Load buffers have three functions: (1) hold the components of the effective address until it is computed, (2) track outstanding loads that are waiting on the memory, and (3) hold the results of completed loads that are waiting for the CDB. Similarly, store buffers have three functions: (1) hold the components of the effective address until it is computed, (2) hold the destination memory addresses of outstanding stores that are waiting for the data value to store, and (3) hold the address and value to store until the memory unit is available. All results from either the FP units or the load unit are put on the CDB, which goes to the FP register file as well as to the reservation stations and store buffers. The FP adders implement addition and subtraction, and the FP multipliers do multiplication and division.
The load buffers and store buffers hold data or addresses coming from and going to memory and behave almost exactly like reservation stations, so we distinguish them only when necessary. The floating-point registers are connected by a pair of buses to the functional units and by a single bus to the store buffers. All results from the functional units and from memory are sent on the common data bus, which goes everywhere except to the load buffer. All reservation stations have tag fields, employed by the pipeline control.
Before we describe the details of the reservation stations and the algorithm, let’s look at the steps an instruction goes through. There are only three steps, although each one can now take an arbitrary number of clock cycles:
  1. Issue—Get the next instruction from the head of the instruction queue, which is maintained in FIFO order to ensure the maintenance of correct data flow. If there is a matching reservation station that is empty, issue the instruction to the station with the operand values, if they are currently in the registers. If there is not an empty reservation station, then there is a structural hazard and the instruction stalls until a station or buffer is freed. If the operands are not in the registers, keep track of the functional units that will produce the operands. This step renames registers, eliminating WAR and WAW hazards. (This stage is sometimes called dispatch in a dynamically scheduled processor.)
  2. Execute—If one or more of the operands is not yet available, monitor the common data bus while waiting for it to be computed. When an operand becomes available, it is placed into any reservation station awaiting it. When all the operands are available, the operation can be executed at the corresponding functional unit. By delaying instruction execution until the operands are available, RAW hazards are avoided. (Some dynamically scheduled processors call this step “issue,” but we use the name “execute,” which was used in the first dynamically scheduled processor, the CDC 6600.)
    Notice that several instructions could become ready in the same clock cycle for the same functional unit. Although independent functional units could begin execution in the same clock cycle for different instructions, if more than one instruction is ready for a single functional unit, the unit will have to choose among them. For the floating-point reservation stations, this choice may be made arbitrarily; loads and stores, however, present an additional complication.
  3. Write result—When the result is available, write it on the CDB and from there into the registers and into any reservation stations (including store buffers) waiting for this result. Stores are buffered in the store buffer until both the value to be stored and the store address are available, then the result is written as soon as the memory unit is free.
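As a rough illustration of these three steps, the following is a minimal Python sketch (not cycle-accurate); the station names and the Vj/Vk/Qj/Qk and register-status (Qi) fields follow the usual textbook description, while the instruction encoding, station counts, and register contents are illustrative assumptions.

    class RS:                                   # one reservation station
        def __init__(self, name):
            self.name, self.busy, self.op = name, False, None
            self.Vj = self.Vk = None            # operand values, once known
            self.Qj = self.Qk = None            # producing stations; None = value ready

    stations   = {n: RS(n) for n in ("Add1", "Add2", "Mult1", "Mult2")}
    reg_value  = {f"F{i}": float(i) for i in range(0, 32, 2)}   # illustrative contents
    reg_status = {}                             # Qi: register -> station that will write it

    def issue(op, dest, src1, src2, pool):
        """Step 1: grab a free station (else stall), rename sources and destination."""
        rs = next((stations[n] for n in pool if not stations[n].busy), None)
        if rs is None:
            return None                         # structural hazard: caller must stall
        for src, vf, qf in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
            if src in reg_status:
                setattr(rs, qf, reg_status[src])   # operand still pending: record its tag
            else:
                setattr(rs, vf, reg_value[src])    # operand ready: copy the value now
        rs.busy, rs.op = True, op
        reg_status[dest] = rs.name              # rename: dest is now owned by this station
        return rs

    def try_execute(rs):
        """Step 2: run only when both operands have arrived (no RAW hazard)."""
        if rs.busy and rs.Qj is None and rs.Qk is None:
            fn = {"ADD.D": lambda a, b: a + b, "SUB.D": lambda a, b: a - b,
                  "MUL.D": lambda a, b: a * b, "DIV.D": lambda a, b: a / b}[rs.op]
            return fn(rs.Vj, rs.Vk)
        return None                             # still waiting on the CDB

    def write_result(rs, result):
        """Step 3: broadcast on the CDB to waiting stations and the register file."""
        for other in stations.values():
            if other.Qj == rs.name: other.Vj, other.Qj = result, None
            if other.Qk == rs.name: other.Vk, other.Qk = result, None
        for reg, tag in list(reg_status.items()):
            if tag == rs.name:                  # only the most recent writer updates the register
                reg_value[reg] = result
                del reg_status[reg]
        rs.busy = False

    div = issue("DIV.D", "F0", "F2", "F4", ("Mult1", "Mult2"))
    add = issue("ADD.D", "F6", "F0", "F8", ("Add1", "Add2"))  # RAW on F0: Qj holds "Mult1"
    write_result(div, try_execute(div))         # CDB broadcast wakes the ADD.D
    write_result(add, try_execute(add))
    print(reg_value["F6"])                      # 8.5 with these illustrative values

Issue records either a value (Vj/Vk) or the tag of the producing station (Qj/Qk), which is the renaming step; try_execute refuses to run until both tags are resolved, avoiding RAW hazards; and write_result plays the role of the CDB broadcast, updating waiting stations and only the register whose status entry still names this station, so an out-of-order earlier write cannot clobber a later value.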

CS 2354—ADVANCED COMPUTER ARCHITECTURE previous year questions

Anna University
B.E./B.Tech. DEGREE EXAMINATION
Sixth Semester
Computer Science and Engineering
CS 2354—ADVANCED COMPUTER ARCHITECTURE
(Regulation 2008)
APRIL/MAY 2011
PART A
1. What is loop unrolling and what are its advantages?
2. Differentiate between static and dynamic branch prediction approaches.
3. What is fine-grained multithreading and what are the advantages and disadvantages of fine-grained multithreading?
4. What is a VLIW processor?
5. What is sequential consistency?
6. State the advantages of threading.
7. Differentiate between write-through cache and snoopy cache.
8. Compare SDRAM with DRAM.
9. What is a multi-core processor and what are the application areas of multi-core processors?
10. What is a Cell Processor?
PART B
11. (a) Briefly describe any techniques to reduce the control hazard stalls. (16)
Or
(b) (i) Discuss about any two compiler techniques for exposing ILP in detail. (8)
(ii) Explain how ILP is achieved using dynamic scheduling. (8)
12. (a) (i) Describe the architectural features of IA 64 Processors in detail. (10)
(ii) Explain the architecture of a typical VLIW processor in detail. (6)
Or
(b) (i) Describe the architectural features of Itanium Processor. (10)
(ii) Explain how instruction-level parallelism is achieved in EPIC processors. (6)
13. (a) (i) Describe the basic structure of a centralized shared-memory multiprocessor in detail. (6)
(ii) Describe the implementation of directory-based cache coherence protocol. (10)
Or
(b) (i) What are the advantages and disadvantages of distributed-memory multiprocessors? Describe the basic structure of a distributed-memory multiprocessor in detail. (8)
(ii) Describe sequential and relaxed consistency model. (8)
14. (a) (i) With a suitable diagram, explain how a virtual address is mapped to an L2 cache address. (10)
(ii) Discuss about the steps to be followed in designing an I/O system. (6)
Or
(b) Describe the optimization techniques used in compilers to reduce the cache miss rate. (16)
15. (a) (i) Describe the features of SUN CMP architecture in detail. (6)
(ii) What are multi-core processors? Explain how a multi-core processor works. (10)
Or
(b) (i) Discuss about the SMT kernel structure in detail. (8)
(ii) Describe the architecture of the IBM Cell Processor in detail. (8)

NOVEMBER/DECEMBER 2011.
PART A
1. What is instruction level parallelism?
2. What are the advantages of loop unrolling?
3. What are the limitations of VLIW?
4. What is the use of branch-target buffer?
5. Distinguish between shared memory multiprocessor and message-passing multiprocessor.
6. Differentiate multithreading computers from multiprocessor systems.
7. Define the terms cache miss and cache hit.
8. What is RAID?
9. What is a multi-core processor?
10. What is a cell processor?
PART B
11. (a) (i) Explain the data and name dependencies with suitable example. (10)
(ii) Discuss about the benefits and limitations of static branch prediction and dynamic branch prediction. (6)
Or
(b) Briefly explain how to overcome data hazards with dynamic scheduling using Tomasulo's approach. (16)
12. (a) (i) Describe the architecture of Itanium processor with the help of a block diagram. (8)
(ii) Explain how ILP is achieved in EPIC processors. (8)
Or
(b) (i) Describe the architectural features of IA64 processor in detail. (8)
(ii) What are the advantages and disadvantages of software-based and hardware-based speculation mechanisms? (8)
13. (a) (i) Briefly compare instruction-level parallelism with thread-level parallelism. (8)
(ii) Explain the basic architecture of a distributed memory multiprocessor system. (8)
Or
(b) (i) Explain various memory consistency models in detail. (10)
(ii) What is multithreading and what are the advantages of multithreading? (6)
14. (a) What is meant by the cache coherence problem? Describe various protocols for cache coherence. (16)
Or
(b) Briefly explain various I/O performance measures. (16)
15. (a) (i) Describe the architecture of typical CMT processor. (8)
(ii) Discuss the design issues for simultaneous multithreading. (8)
Or
(b) (i) Explain the architectural features of IBM cell processor in detail. (10)
(ii) Briefly compare SMT and CMP architectures. (6)

MAY/JUNE 2013

PART A
1. Define dynamic scheduling.
2. List the five levels of branch prediction.
3. List loop-carried dependences.
4. What are the major disadvantages of supporting speculation in hardware?
5. What are the major disadvantages of using symmetric shared memory?
6. What is consistency?
7. Define the terms cache miss and cache hit.
8. What is a bus master?
9. What are the categories of multiprocessors?
10. What is fine-grained multithreading?
PART B
11. What is instruction-level parallelism? Explain in detail the various dependences caused in ILP.
Or
(b) Explain how to reduce branch cost with dynamic hardware prediction.
12. Explain hardware support for exposing more parallelism at compile time.
Or
Explain how hardware-based speculation is used to overcome control dependence.
13. Discuss about the different models for memory consistency.
Or
Define synchronization and explain the different mechanisms employed for synchronization among processors.
14. Explain the various levels of RAID.
Or
Explain the various ways to measure I/O performance.
15. How is multithreading used to exploit thread-level parallelism within a processor? Explain with an example.
Or
Discuss SMT and CMP architectures in detail.

NOVEMBER/DECEMBER 2012.
PART A
1. Define spatial and temporal locality.
2. What are the advantages of dynamic scheduling using Tomasulo's approach?
3. What is ILP and what are the two approaches for ILP?
4. Define the causal consistency model.
5. What is loop unrolling and what are the major limitations of loop unrolling?
6. What is the multiprocessor cache coherence problem?
7. What are the differences and similarities between SCSI and IDE?
8. Compare software and hardware RAID.
9. What is meant by simultaneous multithreading?
10. What are the advantages of CMP architecture?
PART B
11. (a) (i) Discuss about the guidelines and principles that are useful in the design and analysis of computers. (8)
(ii) Explain dynamic branch prediction. (8)
Or
(b) (i) Describe the major factors that influence the cost of a computer and how these factors are changing over time. (8)
(ii) Explain hardware-based speculation to overcome control dependence. (8)

12. (a) (i) Compare CISC, RISC and VLIW. (6)
(ii) Explain a typical VLIW architecture with a block diagram. (10)
Or
(b) (i) Explain basic compiler techniques to exploit ILP. (8)
(ii) Briefly compare hardware and software speculation. (8)
13. (a) (i) What is multithreading and what are the advantages of multithreading? (6)
(ii) Define synchronization and explain the different mechanisms employed for synchronization among processors. (10)
Or
(b) (i) Explain the basic architecture of symmetric multiprocessor system with the help of block diagram. (10)
(ii) Describe coarse-grained and fine-grained multithreading. (6)
14. (a) Describe the various techniques of optimization of cache in detail. (16)
Or
(b) Briefly describe the levels of RAID. (10)
Describe the issues in designing an I/O system. (6)
15. (a) (i) Describe the various techniques for designing hardware multithreading in detail. (8)
(ii) Explain the single-chip multiprocessor architecture with the help of an architecture diagram. (8)
Or
(b) Explain the major challenges and issues in the design of multi-core architecture. (16)


May/June 2012
PART A
  1. Define ILP.
  2. Explain the advantages of dynamic scheduling.
  3. What are the advantages and disadvantages of trace scheduling?
  4. What are the limits on ILP?
  5. What is multiprocessor cache coherence?
  6. Differentiate between coarse-grained and fine-grained multithreading.
  7. Why do DRAMs generally have much larger capacities than SRAMs constructed in the same fabrication technology?
  8. What is the average time to read or write a 512-byte sector for a disk? The advertised average seek time is 5 ms, the transfer rate is 40 MB/sec, it rotates at 10,000 RPM, and the controller overhead is 0.1 ms. Assume the disk is idle so that there is no queuing delay.
  9. What is a multi-core architecture?
  10. What are the advantages of CMP architecture?
PART B
  11. (a) (i) Explain various dependences caused in ILP. (8)
(ii) Explain static and dynamic branch prediction. (8)
Or
(b) (i) Explain Tomasulo's approach used in dynamic scheduling for overcoming data hazards. (8)
(ii) Explain how compiler technology is used to improve the performance of ILP. (8)

12. (a) (i) Explain software pipelining methods to uncover parallelism. (8)
(ii) Briefly compare hardware and software speculation. (8)
Or

(b) (i) Discuss the essential features of IA 64 and Itanium processors. (16)
13. (a) (i) Explain the basic architecture of a symmetric multiprocessor system with the help of a block diagram. (8)
(ii) Define synchronization and explain the different mechanisms employed for synchronization among processors. (8)
Or
(b) (i) Explain the basic architecture of a symmetric multiprocessor system with the help of a block diagram showing both user and OS activity. (8)
(ii) Discuss various memory consistency models. (6)
14. (a) Discuss the various techniques available for reducing cache miss penalty. (8)
Briefly describe the levels of RAID. (8)
Or
(b) Write notes on compiler optimization to reduce miss rate. (8)
Describe the issues in designing an I/O system. (8)
15. (a) (i) Explain SMT architecture and its challenges. (8)
(ii) Explain heterogeneous multi-core processors. (8)
Or
(b) (i) Explain the CMT architecture. (8)
(ii) Explain the IBM Cell processor concept in detail. (8)



Thursday, 20 February 2014

Advanced Computer Architecture

Distributed Shared-Memory Architectures

  • Distributed shared-memory machines need cache coherence for the same reasons that centralized shared-memory machines need it.
 
  • However, due to the interconnection network and scalability requirements, centralized protocols have drawbacks in these architectures.
 
  • There are several options:
  • No coherence
    • Instead, focus on scalability (Cray T3D).
    • In this scheme, only data that actually resides in the private memory may be cached (shared data is marked uncacheable).
 
    • Coherence is maintained by software, which has several disadvantages.
 
  • Compiler mechanisms are very limited.
      • They have to be very conservative, e.g. treat a block on another processor as dirty even though it may not be.
      • This results in excessive coherence overhead (extra fetching).

Distributed Shared-Memory Architectures

  • Disadvantages of software implemented coherence:
  • Multiple words in a block provide no advantage.
      • Software coherence must be run each time a word is needed.
      • The advantage of spatial locality (the "prefetch" of other words in the block) is lost in single word fetching.
 
  • Latencies to remote memory are relatively high.
      • Remote references can take 50 - 1000 CPU cycles, making coherency "misses" a very costly proposition.
 
  • Snooping
    • The lack of scalability of snooping coherence schemes is a problem in DSMs.
      • The distributed nature of the snooping protocol's data structure (which maintains the state of the cache blocks) does NOT scale well.
 
    • Snooping requires broadcast (communication with all caches on every miss) which is very expensive with an interconnection network.

Directory-based Cache Coherence Protocols

  • Directory-based coherence
    • Information kept in the directory:
  • The state of every block in memory, e.g. shared, uncached or exclusive.
      • For exclusive, the block has been written, is in one cache, and memory is out-of-date.
      • This information is also kept in the cache for efficiency reasons.
 
  • Which caches have copies.
      • Can be implemented using a bit vector for each block with the processors identified by the bit's position.
 
  • Whether or not the block is dirty.
 
    • The amount of information in the directory is proportional to the number of blocks in memory times the number of processors (one entry per block, one presence bit per processor).

      • This works OK for fewer than 100 processors; other solutions are needed for more than 100 processors.
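As a small illustration (a sketch only; the class and field names are made up for this example), a directory entry can be modeled as a state plus a presence-bit vector with one bit per processor:

    class DirectoryEntry:
        """One directory entry: block state plus a bit vector of sharing caches."""
        def __init__(self, num_procs):
            self.state = "uncached"             # "uncached", "shared", or "exclusive"
            self.sharers = 0                    # bit p set => cache p holds a copy
            self.num_procs = num_procs

        def add_sharer(self, p):
            self.sharers |= (1 << p)

        def remove_sharer(self, p):
            self.sharers &= ~(1 << p)

        def sharer_list(self):
            return [p for p in range(self.num_procs) if self.sharers & (1 << p)]

    # One entry per memory block and one presence bit per processor, so directory
    # storage grows as (number of blocks) x (number of processors).
    entry = DirectoryEntry(num_procs=64)
    entry.state = "shared"
    entry.add_sharer(3); entry.add_sharer(17)
    print(entry.state, entry.sharer_list())     # shared [3, 17]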

Directory-based Cache Coherence Protocols

  • The directory entries can also be distributed along with the memory.
    • The high order bits of an address can be used to identify the location of the memory and directory entries for that portion of the memory.
  • This structure avoids broadcast.

Directory-based Cache Coherence Protocols

  • Two basic primitives that must be handled:
  • Handling a read miss.
  • Handling a write to a shared, clean cache block.
 
  • Handling a write miss is a combination of these two.
 
  • Our simplifying assumptions still hold here.
    • Writes to non-exclusive data generate write misses.
    • Write misses are atomic (processors block until the access completes).
 
  • This introduces two complications:
    • Since there is no longer a bus, there is no single point of arbitration.
 
    • Since broadcast is to be avoided, the directory and cache must issue explicit response messages, e.g., invalidate and write-back request messages.

Directory-based Cache Coherence Protocols

  • The states and transitions at each cache are identical to the snooping protocol.
    • The actions are somewhat different however.
 
  • First let's look at the message types:
    • Local node: Where the request originates.
    • Home node: Where memory and directory live.
    • Remote node: Node that has a copy of the block (exclusive or shared).
    • Message types, with their source, destination, contents, and function:
      • Read miss: local cache → home directory; contents: P, A. P has a read miss at addr A; request data and make P a read sharer.
      • Write miss: local cache → home directory; contents: P, A. P has a write miss at addr A; request data and make P exclusive owner.
      • Invalidate: home directory → remote cache; contents: A. Invalidate a shared copy of data at addr A.

Directory-based Cache Coherence Protocols

  • More message types:
    • Fetch: home directory → remote cache; contents: A. Fetch block at addr A and send to home directory; change the state of A in the remote cache to shared.
    • Fetch/invalidate: home directory → remote cache; contents: A. Fetch block at addr A and send to home directory; invalidate the block in the cache.
    • Data value reply: home directory → local cache; contents: data. Return a data value from the home memory.
    • Data write back: remote cache → home directory; contents: A, data. Write back a data value for addr A.

Directory-based Cache Coherence Protocols

  • The protocol actions to which an individual cache responds.

Directory-based Cache Coherence Protocols

  • Actions taken by the directory in response to messages received.

Directory-based Cache Coherence Protocols

  • Directory operation
    • The directory can receive three kinds of messages:
  • Read miss
  • Write miss
  • Write-back
 
  • Uncached state
    • When the block is uncached, the directory can only receive two kinds of messages: read miss and write miss.
 
    • A read miss moves the block into the shared state.
    • A write miss moves it into exclusive.
 
    • In either case, the directory updates its list of sharing nodes to include only the node that requested the data.

Directory-based Cache Coherence Protocols

  • Shared state
    • Again, only read or write misses are possible, since all caches have the same value as memory.
 
    • If it is a read miss, the node requesting the data is added to the list of sharing nodes.
 
    • If it's a write miss:
  • The block is moved to the exclusive state.
  • Invalidate messages are sent to all current sharing nodes.
  • The sharing list is updated to only the requesting processor.
 
  • Exclusive state
    • On a read miss, the owner node is sent a fetch message.
      • This tells the node to write its data back to memory.
      • The requesting node is added to the sharing list, and the block is marked as shared.

Directory-based Cache Coherence Protocols

  • Exclusive state
    • If it's a write miss, the block must be written back by the current owner, so the directory sends out a fetch message.
 
    • When the data is written, the directory forwards it to the new owner and replaces the old owner with the new owner in the sharing list.
 
    • On a write-back, the data is updated in memory and the block goes into the uncached state.
      • The sharing list is cleared.
 
    • One obvious optimization is to have the old owner send the data directly to the new owner on a write miss.
      • This can be done either instead of or in addition to writing the data to the home.
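Pulling the three states together, the following is a condensed Python sketch of how the directory might react to read misses, write misses, and data write-backs; it assumes the hypothetical DirectoryEntry class from the earlier sketch, and send() is just a stand-in for putting a message on the interconnection network.

    def send(msg, dest_node, **fields):
        print(f"{msg} -> node {dest_node} {fields}")   # stand-in for a network message

    def directory_receive(entry, msg, requester, addr):
        if entry.state == "uncached":
            # Only read or write misses arrive; memory holds the up-to-date value.
            send("data value reply", requester, addr=addr)
            entry.sharers = 1 << requester             # sharing list = requester only
            entry.state = "shared" if msg == "read miss" else "exclusive"

        elif entry.state == "shared":
            if msg == "read miss":
                send("data value reply", requester, addr=addr)
                entry.add_sharer(requester)
            else:                                      # write miss
                for p in entry.sharer_list():          # invalidate every current sharer
                    send("invalidate", p, addr=addr)
                send("data value reply", requester, addr=addr)
                entry.sharers = 1 << requester
                entry.state = "exclusive"

        elif entry.state == "exclusive":
            owner = entry.sharer_list()[0]
            if msg == "read miss":
                send("fetch", owner, addr=addr)        # owner writes back; block becomes shared
                send("data value reply", requester, addr=addr)
                entry.add_sharer(requester)
                entry.state = "shared"
            elif msg == "write miss":
                send("fetch/invalidate", owner, addr=addr)
                send("data value reply", requester, addr=addr)
                entry.sharers = 1 << requester         # ownership moves to the requester
            else:                                      # data write-back from the owner
                entry.sharers = 0                      # sharing list is cleared
                entry.state = "uncached"

The optimization mentioned above, having the old owner send the data directly to the new owner, would replace the directory's data value reply in the exclusive-state write-miss case.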

Directory-based Cache Coherence Protocols

  • Issues:
  • When read-only data is replaced
    • Note that this scheme does not explicitly notify the directory when a clean block is replaced in the cache.
 
    • This is fine, since the cache will simply ignore invalidate messages for blocks that are not currently cached.
 
    • The only problem is that it will cause the directory to send out a few unnecessary messages
      • But that is probably not as bad as having the remote caches send a message each time they replace a block.

Directory-based Cache Coherence Protocols

  • Issues:
  • Synchronization
    • Deciding the order of accesses in a distributed memory system is much harder.
 
    • Without a shared bus, it is impossible to tell which writes come first.
 
    • It is not feasible to stall all accesses until a write completes.
 
    • Often, this can be handled by requiring all writes to be atomic.
      • But doing so slows down the system greatly.