Overcoming Data Hazards with Dynamic Scheduling
A
simple statically scheduled pipeline fetches an instruction and issues
it, unless there is a data dependence between an instruction already in
the pipeline and the fetched instruction that cannot be hidden with
bypassing or forwarding. (Forwarding logic reduces the effective
pipeline latency so that certain dependences do not result in
hazards.) If there is a data dependence that cannot be hidden, then the
hazard detection hardware stalls the pipeline starting with the
instruction that uses the result. No new instructions are fetched or
issued until the dependence is cleared.
In this section, we explore dynamic scheduling,
in which the hardware rearranges the instruction execution to reduce
the stalls while maintaining data flow and exception behavior. Dynamic
scheduling offers several
advantages. First, it allows code that was compiled with one pipeline
in mind to run efficiently on a different pipeline, eliminating the need
to have multiple binaries and recompile for a different
microarchitecture. In today’s computing environment, where much of the
software is from third parties and distributed in binary form, this
advantage is significant. Second, it enables handling some cases when
dependences are unknown at compile time; for example, they may involve a
memory reference or a data-dependent branch, or they may result from a
modern programming environment that uses dynamic linking or dispatching.
Third, and perhaps most importantly, it allows the processor to
tolerate unpredictable delays, such as cache misses, by executing other
code while waiting for the miss to resolve. In Section 3.6,
we explore hardware speculation, a technique with additional
performance advantages, which builds on dynamic scheduling. As we will
see, the advantages of dynamic scheduling are gained at the cost of a
significant increase in hardware complexity.
Although a
dynamically scheduled processor cannot change the data flow, it tries to
avoid stalling when dependences are present. In contrast, static
pipeline scheduling by the compiler (covered in Section 3.2)
tries to minimize stalls by separating dependent instructions so that
they will not lead to hazards. Of course, compiler pipeline scheduling
can also be used on code destined to run on a processor with a
dynamically scheduled pipeline.
Dynamic Scheduling: The Idea
A major
limitation of simple pipelining techniques is that they use in-order
instruction issue and execution: Instructions are issued in program
order, and if an instruction is stalled in the pipeline, no later
instructions can proceed. Thus, if there is a dependence between two
closely spaced instructions in the pipeline, this will lead to a hazard
and a stall will result. If there are multiple functional units, these
units could lie idle. If instruction j depends on a long-running instruction i, currently in execution in the pipeline, then all instructions after j must be stalled until i is finished and j can execute. For example, consider this code:
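The code block has been elided here; the sequence below is reconstructed to match the dependences described in the following paragraph (ADD.D depends on DIV.D through its first source operand, while SUB.D depends on nothing in the pipeline). The particular register numbers are the conventional ones for this example:

```
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
```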

The SUB.D instruction cannot execute because the dependence of ADD.D
on DIV.D causes the pipeline to stall; yet, SUB.D is not data
dependent on anything in the pipeline. This hazard creates a
performance limitation that can be eliminated by not requiring
instructions to execute in program order.
In
the classic five-stage pipeline, both structural and data hazards could
be checked during instruction decode (ID): When an instruction could
execute without hazards, it was issued from ID knowing that all data
hazards had been resolved.
To allow us to begin executing the SUB.D in the above example, we
must separate the issue process into two
parts: checking for any structural hazards and waiting for the absence
of a data hazard. Thus, we still use in-order instruction issue (i.e.,
instructions issued in program order), but we want an instruction to
begin execution as soon as its data operands are available. Such a
pipeline does out-of-order execution, which implies out-of-order completion.
Out-of-order
execution introduces the possibility of WAR and WAW hazards, which do
not exist in the five-stage integer pipeline and its logical extension
to an in-order floating-point pipeline. Consider the following MIPS
floating-point code sequence:
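The code block has been elided here; the sequence below is reconstructed to be consistent with the hazards enumerated in the next paragraph (ADD.D reads F8, SUB.D later writes F8, and MUL.D rewrites F6). The exact register choices for the SUB.D sources are the conventional ones for this example:

```
DIV.D F0,F2,F4
ADD.D F6,F0,F8
SUB.D F8,F10,F14
MUL.D F6,F10,F8
```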

There is an antidependence between the ADD.D and the SUB.D, and if
the pipeline executes the SUB.D before the ADD.D (which is waiting
for the DIV.D), it will violate the antidependence, yielding a WAR
hazard. Likewise, to avoid violating output dependences, such as the
write of F6 by the MUL.D, WAW hazards must be handled. As we will
see, both these hazards are avoided by the use of register renaming.
Out-of-order
completion also creates major complications in handling exceptions.
Dynamic scheduling with out-of-order completion must preserve exception
behavior in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually
do arise. Dynamically scheduled processors preserve exception behavior
by delaying the notification of an associated exception until the
processor knows that the instruction should be the next one completed.
Although exception behavior must be preserved, dynamically scheduled processors could generate imprecise exceptions. An exception is imprecise
if the processor state when an exception is raised does not look
exactly as if the instructions were executed sequentially in strict
program order. Imprecise exceptions can occur because of two
possibilities:
- The pipeline may have already completed instructions that are later in program order than the instruction causing the exception.
- The pipeline may have not yet completed some instructions that are earlier in program order than the instruction causing the exception.
Imprecise
exceptions make it difficult to restart execution after an exception.
Rather than address these problems in this section, we will discuss a
solution that provides precise exceptions in the context of a processor
with speculation in Section 3.6. For floating-point exceptions, other solutions have been used, as discussed in Appendix J.
To allow out-of-order execution, we essentially split the ID pipe stage of our simple five-stage pipeline into two stages:
- Issue—Decode instructions, check for structural hazards.
- Read operands—Wait until no data hazards, then read operands.
An
instruction fetch stage precedes the issue stage and may fetch either
into an instruction register or into a queue of pending instructions;
instructions are then issued from the register or queue. The execution
stage follows the read operands stage, just as in the five-stage
pipeline. Execution may take multiple cycles, depending on the
operation.
We distinguish between when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution.
Our pipeline allows multiple instructions to be in execution at the
same time; without this capability, a major advantage of dynamic
scheduling is lost. Having multiple instructions in execution at once
requires multiple functional units, pipelined functional units, or both.
Since these two capabilities—pipelined functional units and multiple
functional units—are essentially equivalent for the purposes of pipeline
control, we will assume the processor has multiple functional units.
In a
dynamically scheduled pipeline, all instructions pass through the issue
stage in order (in-order issue); however, they can be stalled or bypass
each other in the second stage (read operands) and thus enter execution
out of order. Scoreboarding is a technique for allowing
instructions to execute out of order when there are sufficient resources
and no data dependences; it is named after the CDC 6600 scoreboard,
which developed this capability. Here, we focus on a more sophisticated
technique, called Tomasulo’s algorithm. The primary difference is
that Tomasulo’s algorithm handles antidependences and output
dependences by effectively renaming the registers dynamically.
Additionally, Tomasulo’s algorithm can be extended to handle speculation,
a technique to reduce the effect of control dependences by predicting
the outcome of a branch, executing instructions at the predicted
destination address, and taking corrective actions when the prediction
was wrong. While the use of scoreboarding is probably sufficient to
support a simple two-issue superscalar like the ARM A8, a more
aggressive processor, like the four-issue Intel i7, benefits from the
use of out-of-order execution.
Dynamic Scheduling Using Tomasulo’s Approach
The IBM 360/91
floating-point unit used a sophisticated scheme to allow out-of-order
execution. This scheme, invented by Robert Tomasulo, tracks when
operands for instructions are available to minimize RAW hazards and
introduces register renaming in hardware to minimize WAW and WAR
hazards. There are many variations on this scheme in modern processors,
although the key concepts of tracking instruction dependences to allow
execution as soon as operands are available and renaming registers to
avoid WAR and WAW hazards are common characteristics.
IBM’s
goal was to achieve high floating-point performance from an instruction
set and from compilers designed for the entire 360 computer family,
rather than from specialized compilers for the high-end processors. The
360 architecture had only four double-precision floating-point
registers, which limits the effectiveness of compiler scheduling; this
fact was another motivation for the Tomasulo approach. In addition, the
IBM 360/91 had long memory accesses and long floating-point delays,
which Tomasulo’s algorithm was designed to overcome. At the end of the
section, we will see that Tomasulo’s algorithm can also support the
overlapped execution of multiple iterations of a loop.
We explain the
algorithm, which focuses on the floating-point unit and load-store
unit, in the context of the MIPS instruction set. The primary difference
between MIPS and the 360 is the presence of register-memory
instructions in the latter architecture. Because Tomasulo’s algorithm
uses a load functional unit, no significant changes are needed to add
register-memory addressing modes. The IBM 360/91 also had pipelined
functional units, rather than multiple functional units, but we describe
the algorithm as if there were multiple functional units. It is a
simple conceptual extension to also pipeline those functional units.
As we will
see, RAW hazards are avoided by executing an instruction only when its
operands are available, which is exactly what the simpler scoreboarding
approach provides. WAR and WAW hazards, which arise from name
dependences, are eliminated by register renaming. Register renaming
eliminates these hazards by renaming all destination registers,
including those with a pending read or write for an earlier instruction,
so that the out-of-order write does not affect any instructions that
depend on an earlier value of an operand.
To better
understand how register renaming eliminates WAR and WAW hazards,
consider the following example code sequence that includes potential WAR
and WAW hazards:
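The code block has been elided here; the sequence below is reconstructed to be consistent with all of the dependences enumerated in the next paragraph (true dependences through F0, F8, and F6; antidependences on F8 and F6; an output dependence on F6):

```
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
```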

There are two antidependences: between the ADD.D and the SUB.D, and
between the S.D and the MUL.D. There is also an output dependence
between the ADD.D and the MUL.D, leading to three possible hazards:
WAR hazards on the use of F8 by the ADD.D and the use of F6 by the
S.D, as well as a WAW hazard because the ADD.D may finish later than
the MUL.D. There are also three true data dependences: between the
DIV.D and the ADD.D, between the SUB.D and the MUL.D, and between
the ADD.D and the S.D.
These three name dependences can all be eliminated by register
renaming. For simplicity, assume the existence of two temporary
registers, S and T. Using S and T, the sequence can be rewritten
without any dependences as:
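The renamed code block has been elided here; the sequence below is reconstructed by letting S rename the F6 written by the ADD.D and T rename the F8 written by the SUB.D, consistent with the remark that subsequent uses of F8 must become uses of T:

```
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
```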
In addition, any subsequent uses of F8 must be replaced by the
register T. In this code segment, the renaming process can be done
statically by the compiler. Finding any uses of F8 that are later in
the code requires either sophisticated compiler analysis or hardware
support, since there may be intervening branches between the above
code segment and a later use of F8. As we will see, Tomasulo’s
algorithm can handle renaming across branches.
In Tomasulo’s scheme, register renaming is provided by reservation stations,
which buffer the operands of instructions waiting to issue. The basic
idea is that a reservation station fetches and buffers an operand as
soon as it is available, eliminating the need to get the operand from a
register. In addition, pending instructions designate the reservation
station that will provide their input. Finally, when successive writes
to a register overlap in execution, only the last one is actually used
to update the register. As instructions are issued, the register
specifiers for pending operands are renamed to the names of the
reservation station, which provides register renaming.
Since there
can be more reservation stations than real registers, the technique can
even eliminate hazards arising from name dependences that could not be
eliminated by a compiler. As we explore the components of Tomasulo’s
scheme, we will return to the topic of register renaming and see exactly
how the renaming occurs and how it eliminates WAR and WAW hazards.
The use of
reservation stations, rather than a centralized register file, leads to
two other important properties. First, hazard detection and execution
control are distributed: The information held in the reservation
stations at each functional unit determines when an instruction can
begin execution at that unit. Second, results are passed directly to
functional units from the reservation stations where they are buffered,
rather than going through the registers. This bypassing is done with a
common result bus that allows all units waiting for an operand to be
loaded simultaneously (on the 360/91 this is called the common data bus,
or CDB). In pipelines with multiple execution units and issuing
multiple instructions per clock, more than one result bus will be
needed.
Figure 3.6
shows the basic structure of a Tomasulo-based processor, including both
the floating-point unit and the load/store unit; none of the execution
control tables is shown. Each reservation station holds an instruction
that has been issued and is awaiting execution at a functional unit and
either the operand values for that instruction, if they have already
been computed, or else the names of the reservation stations that will
provide the operand values.

The
load buffers and store buffers hold data or addresses coming from and
going to memory and behave almost exactly like reservation stations, so
we distinguish them only when necessary. The floating-point registers
are connected by a pair of buses to the functional units and by a single
bus to the store buffers. All results from the functional units and
from memory are sent on the common data bus, which goes everywhere
except to the load buffer. All reservation stations have tag fields,
employed by the pipeline control.
Before we
describe the details of the reservation stations and the algorithm,
let’s look at the steps an instruction goes through. There are only
three steps, although each one can now take an arbitrary number of clock
cycles:
- Issue—Get the next instruction from the head of the instruction queue, which is maintained in FIFO order to ensure the maintenance of correct data flow. If there is a matching reservation station that is empty, issue the instruction to the station with the operand values, if they are currently in the registers. If there is not an empty reservation station, then there is a structural hazard and the instruction stalls until a station or buffer is freed. If the operands are not in the registers, keep track of the functional units that will produce the operands. This step renames registers, eliminating WAR and WAW hazards. (This stage is sometimes called dispatch in a dynamically scheduled processor.)
- Execute—If
one or more of the operands is not yet available, monitor the common
data bus while waiting for it to be computed. When an operand becomes
available, it is placed into any reservation station awaiting it. When
all the operands are available, the operation can be executed at the
corresponding functional unit. By delaying instruction execution until
the operands are available, RAW hazards are avoided. (Some dynamically
scheduled processors call this step “issue,” but we use the name
“execute,” which was used in the first dynamically scheduled processor,
the CDC 6600.)
Notice that several instructions could become ready in the same
clock cycle for the same functional unit. Although independent
functional units could begin execution in the same clock cycle for
different instructions, if more than one instruction is ready for a
single functional unit, the unit will have to choose among them. For
the floating-point reservation stations, this choice may be made
arbitrarily; loads and stores, however, present an additional
complication.
- Write result—When the result is available, write it on the CDB and
from there into the registers and into any reservation stations
(including store buffers) waiting for this result. Stores are
buffered in the store buffer until both the value to be stored and
the store address are available; then the result is written as soon
as the memory unit is free.
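The issue and write-result steps above can be sketched in Python. This is an illustrative model only, not the 360/91 hardware; all class, function, and variable names are invented for this sketch. The register status table maps each register to the reservation station that will write it, which is exactly the renaming mechanism:

```python
class RS:
    """A reservation station: holds either operand values (Vj/Vk) or the
    names of the stations that will produce them (Qj/Qk)."""
    def __init__(self, name):
        self.name, self.busy, self.op = name, False, None
        self.Vj = self.Vk = None   # operand values, once available
        self.Qj = self.Qk = None   # producing stations, while values are pending

def issue(rs, op, src1, src2, dest, regs, reg_status):
    """Issue: buffer operands that are already in the registers, record the
    producers of pending ones, and point dest at this station -- this
    renaming is what removes WAR and WAW hazards."""
    rs.busy, rs.op = True, op
    rs.Qj = reg_status.get(src1)
    rs.Vj = None if rs.Qj else regs[src1]
    rs.Qk = reg_status.get(src2)
    rs.Vk = None if rs.Qk else regs[src2]
    reg_status[dest] = rs.name     # later readers of dest wait on this station

def write_result(rs, value, regs, reg_status, stations):
    """Write result: broadcast on the CDB; every waiting station grabs the
    value, and the register file updates only if this station is still the
    newest writer of the register (a newer issue overwrites the tag)."""
    for other in stations:
        if other.Qj == rs.name: other.Vj, other.Qj = value, None
        if other.Qk == rs.name: other.Vk, other.Qk = value, None
    for reg, tag in list(reg_status.items()):
        if tag == rs.name:
            regs[reg] = value
            del reg_status[reg]
    rs.busy = False

# The WAR case from the text: ADD.D reads F8, SUB.D later writes F8.
regs = {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
        "F8": 3.0, "F10": 10.0, "F14": 4.0}
div1, add1, add2 = RS("Div1"), RS("Add1"), RS("Add2")
stations, reg_status = [div1, add1, add2], {}

issue(div1, "DIV", "F2", "F4", "F0", regs, reg_status)    # DIV.D F0,F2,F4
issue(add1, "ADD", "F0", "F8", "F6", regs, reg_status)    # ADD.D F6,F0,F8 (waits on Div1)
issue(add2, "SUB", "F10", "F14", "F8", regs, reg_status)  # SUB.D F8,F10,F14

# ADD.D already buffered the old F8 (Vk == 3.0) at issue, so SUB.D can
# safely write F8 first -- the WAR hazard is gone.
write_result(add2, add2.Vj - add2.Vk, regs, reg_status, stations)
write_result(div1, div1.Vj / div1.Vk, regs, reg_status, stations)  # wakes Add1
write_result(add1, add1.Vj + add1.Vk, regs, reg_status, stations)

assert regs["F8"] == 6.0        # SUB.D's result
assert regs["F6"] == 4.0 + 3.0  # ADD.D used the old F8, as program order requires
```

Note how out-of-order completion (SUB.D finishing before DIV.D and ADD.D) still leaves the registers exactly as in-order execution would.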