Much of the material in this chapter was leveraged from L. Spracklen and
S. G. Abraham, “Chip Multithreading: Opportunities and Challenges,” in
11th International Symposium on High-Performance Computer
Architecture, 2005.
Over the last few decades microprocessor performance has increased
exponentially, with processor architects successfully achieving significant
gains in single-thread performance from one processor generation to the
next. Semiconductor technology has been the main driver for this
increase, with faster transistors allowing rapid increases in clock speed to
today’s multi-GHz frequencies. In addition to these frequency increases,
each new technology generation has essentially doubled the number of
available transistors. As a result, architects have been able to aggressively
chase increased single-threaded performance by using a range of
expensive microarchitectural techniques such as superscalar issue, out-
of-order issue, on-chip caching, and deep pipelines supported by
sophisticated branch predictors.
However, process technology challenges, including power constraints, the
memory wall, and ever-increasing difficulties in extracting further
instruction-level parallelism (ILP), are all conspiring to limit the
performance of individual processors in the future. While recent attempts
at improving single-thread performance through even deeper pipelines
have led to impressive clock frequencies, these clock frequencies have
not translated into significantly better performance in comparison with
less aggressive designs. As a result, microprocessor frequency, which
used to increase exponentially, has now leveled off, with most processors
operating in the 2–4 GHz range.
This combination of the limited realizable ILP, practical limits to pipelining,
and a “power ceiling” imposed by cost-effective cooling considerations has
conspired to limit future performance increases within conventional processor
cores. Accordingly, processor designers are searching for new ways to
effectively utilize their ever-increasing transistor budgets.
The techniques being embraced across the microprocessor industry are chip
multiprocessors (CMPs) and chip multithreaded (CMT) processors. CMP, as
the name implies, is simply a group of processors integrated onto the same
chip. The individual processors typically have comparable performance to
their single-core brethren, but for workloads with sufficient thread-level
parallelism (TLP), the aggregate performance delivered by the processor can
be many times that delivered by a single-core processor. Most current
processors adopt this approach, simply replicating existing uniprocessor
cores on a single die.
Moving beyond these simple CMP processors, chip multithreaded (CMT)
processors go one step further and support many simultaneous hardware
strands (or threads) of execution per core by simultaneous multithreading
(SMT) techniques. SMT effectively combats increasing latencies by enabling
multiple strands to share many of the resources within the core, including the
execution resources. With each strand spending a significant portion of time
stalled waiting for off-chip misses to complete, each strand’s utilization of the
core’s execution resources is extremely low. SMT improves the utilization of
key resources and reduces the sensitivity of an application to off-chip misses.
Similarly, as with CMP, multiple cores can share chip resources such as the
memory controller, off-chip bandwidth, and the level-2/level-3 cache,
improving the utilization of these resources.
The benefits of CMT processors are apparent in a wide variety of application
spaces. For instance, in the commercial space, server workloads are broadly
characterized by high levels of TLP, low ILP, and large working sets. The
potential for further improvements in overall single-thread performance is
limited; on-chip cycles per instruction (CPI) cannot be improved significantly
because of low ILP, and off-chip CPI is large and growing because of relative
increases in memory latency. However, typical server applications
concurrently serve a large number of users or clients; for instance, a database
server may have hundreds of active processes, each associated with a different
client. Furthermore, these processes are currently multithreaded to hide disk
access latencies. This structure leads to high levels of TLP. Thus, it is
extremely attractive to couple the high TLP in the application domain with
support for multiple threads of execution on a processor chip.
Though the arguments for CMT processors are often made in the context of
overlapping memory latencies, memory bandwidth considerations also play a
significant role. New memory technologies, such as fully buffered DIMMs
(FBDs), have higher bandwidths (for example, 60 GB/s/chip), as well as
higher latencies (for example, 130 ns), pushing up their bandwidth-delay
product to 60 GB/s × 130 ns = 7800 bytes. The processor chip’s pins represent
an expensive resource, and to keep these pins fully utilized (assuming a cache
line size of 64 bytes), the processor chip must sustain 7800/64 or over 100
parallel requests. To put this in perspective, a single strand on an aggressive
out-of-order processor core generates less than two parallel requests on typical
server workloads; therefore, a large number of strands are required to sustain
a high utilization of the memory ports.
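As a back-of-the-envelope check of this arithmetic, the short C sketch below recomputes the number of outstanding cache-line requests needed to cover the bandwidth-delay product; the bandwidth, latency, and line-size figures are simply the example values quoted above, not measurements.

    #include <stdio.h>

    /* Bandwidth-delay arithmetic from the text: a memory system sustaining
     * B bytes/s at a latency of L seconds has B*L bytes "in flight", so
     * keeping the pins busy needs roughly B*L / line_size concurrent
     * cache-line misses. */
    int main(void)
    {
        double bandwidth_bytes_per_s = 60e9;   /* ~60 GB/s per chip (example) */
        double latency_s             = 130e-9; /* ~130 ns memory latency      */
        double line_bytes            = 64.0;   /* cache line size             */

        double in_flight_bytes = bandwidth_bytes_per_s * latency_s; /* 7800 bytes */
        double parallel_reqs   = in_flight_bytes / line_bytes;      /* ~122       */

        printf("bytes in flight: %.0f\n", in_flight_bytes);
        printf("parallel cache-line requests needed: %.0f\n", parallel_reqs);
        return 0;
    }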
Finally, power considerations also favor CMT processors. Given the almost
cubic dependence between core frequency and power consumption, the latter
drops dramatically with reductions in frequency. As a result, for workloads
with adequate TLP, doubling the number of cores and halving the frequency
delivers roughly equivalent performance while reducing power consumption
by a factor of four.
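The factor of four follows directly from that near-cubic relationship; a toy model (per-core power assumed proportional to frequency cubed, everything else held constant) makes the arithmetic explicit.

    #include <stdio.h>

    /* Toy model only: per-core power taken as proportional to frequency
     * cubed, per the "almost cubic" dependence cited in the text. */
    static double relative_power(int cores, double freq_ratio)
    {
        return cores * freq_ratio * freq_ratio * freq_ratio;
    }

    int main(void)
    {
        double baseline = relative_power(1, 1.0); /* one core at full frequency  */
        double cmt      = relative_power(2, 0.5); /* two cores at half frequency */
        printf("power relative to baseline: %.2f\n", cmt / baseline); /* 0.25 */
        return 0;
    }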
Evolution of CMTs
Given the exponential growth in transistors per chip over time, a rule of
thumb is that a board design becomes a chip design in ten years or less. Thus,
most industry observers expected that chip-level multiprocessing would
eventually become a dominant design trend. The case for a single-chip
multiprocessor was presented as early as 1996 by Kunle Olukotun’s team at
Stanford University. Their Stanford Hydra CMP processor design called for
the integration of four MIPS-based processors on a single chip. A DEC/
Compaq research team proposed the incorporation of eight simple Alpha cores
and a two-level cache hierarchy on a single chip (code-named Piranha) and
estimated a simulated performance of three times that of a single-core, next-
generation Alpha processor for on-line transaction processing workloads.
As early as the mid-1990s, Sun recognized the problems that would soon face
processor designers as a result of the rapidly increasing clock frequencies
required to improve single-thread performance. In response, Sun defined the
MAJC architecture to target thread-level parallelism. Providing well-defined
support for both CMP and SMT processors, the MAJC architecture was the industry’s
first step toward general-purpose CMT processors. Shortly after publishing
the MAJC architecture, Sun announced its first MAJC-compliant processor
(MAJC-5200), a dual-core CMT processor with cores sharing an L1 data
cache.
Subsequently, Sun moved its SPARC processor family toward the CMP design
point. In 2003, Sun announced two CMP SPARC processors: Gemini, a dual-
core UltraSPARC II derivative; and UltraSPARC IV. These first-generation
CMP processors were derived from earlier uniprocessor designs, and the two
cores did not share any resources other than off-chip datapaths. In most CMP
designs, it is preferable to share the outermost caches, because doing so
localizes coherency traffic between the strands and optimizes inter-strand
communication in the chip—allowing very fine-grained thread interaction
(microparallelism). In 2003, Sun also announced its second-generation CMP
processor, UltraSPARC IV+, a follow-on to the UltraSPARC IV processor, in
which the on-chip L2 and off-chip L3 caches are shared between the two
cores.
In 2006, Sun introduced a 32-way CMT SPARC processor, called
UltraSPARC T1, for which the entire design, including the cores, is optimized
for a CMT design point. UltraSPARC T1 has eight cores; each core is a four-
way SMT with its own private L1 caches. All eight cores share a 3-Mbyte, 12-
way level-2 cache. Since UltraSPARC T1 is targeted at commercial server
workloads with high TLP, low ILP, and large working sets, the ability to
support many strands and therefore many concurrent off-chip misses is key to
overall performance. Thus, to accommodate eight cores, each core supports
single issue and has a fairly short pipeline.
Sun’s most recent CMT processor is the UltraSPARC T2 processor. The
UltraSPARC T2 processor provides double the threads of the UltraSPARC T1
processor (eight threads per core), as well as improved single-thread
performance, additional level-2 cache resources (increased size and
associativity), and improved support for floating-point operations.
Sun’s move toward the CMT design has been mirrored throughout industry. In
2001, IBM introduced the dual-core POWER-4 processor and recently
released second-generation CMT processors, the POWER-5 and POWER-6
processors, in which each core supports 2-way SMT. While this fundamental
shift in processor design was initially confined to the high-end server
processors, where the target workloads are the most thread-rich, this change
has recently begun to spread to desktop processors. AMD and Intel have also
subsequently released multicore CMP processors, starting with dual-core
CMPs and more recently quad-core CMP processors. Further, Intel has
announced that its next-generation quad-core processors will support 2-way
SMT, providing a total of eight threads per chip.
CMT is emerging as the dominant trend in general-purpose processor design,
with manufacturers discussing their multicore plans beyond their initial quad-
core offerings. Similar to the CISC-to-RISC shift that enabled an entire
processor to fit on a single chip and internalized all communication between
pipeline stages to within a chip, the move to CMT represents a fundamental
shift in processor design that internalizes much of the communication between
processors to within a chip.
Future CMT Designs
An attractive proposition for future CMT design is to just double the number
of cores per chip every generation since a new process technology essentially
doubles the transistor budget. Little design effort is expended on the cores,
and performance is almost doubled every process generation on workloads
with sufficient TLP. Though reusing existing core designs is an attractive
option, this approach may not scale well beyond a couple of process
generations. Processor designs are already pushing the limits of power
dissipation. For the total power consumption to be restrained, the power
dissipation of each core must be halved in each generation. In the past, supply
voltage scaling delivered most of the required power reduction, but
indications are that voltage scaling will not be sufficient by itself. Though
well-known techniques, such as clock gating and frequency scaling, may be
quite effective in the short term, more research is needed to develop low-
power, high-performance cores for future CMT designs.
Further, given the significant area cost associated with high-performance
cores, for a fixed area and power budget, the CMP design choice is between a
small number of high-performance (high frequency, aggressive out-of-order,
large issue width) cores or multiple simple (low frequency, in-order, limited
issue width) cores. For workloads with sufficient TLP, the simpler core
solution may deliver superior chipwide performance at a fraction of the power.
However, for applications with limited TLP, unless speculative parallelism can
be exploited, CMT performance will be poor. One possible solution is to
support heterogeneous cores, potentially providing multiple simple cores for
thread-rich workloads and a single more complex core to provide robust
performance for single-threaded applications.
Another interesting opportunity for CMT processors is support for on-chip
hardware accelerators. Hardware accelerators improve performance on certain
specialized tasks and off-load work from the general-purpose processor.
Additionally, on-chip hardware accelerators may be an order of magnitude
more power efficient than the general-purpose processor and may be
significantly more efficient than off-chip accelerators (for example,
eliminating the off-chip traffic required to communicate to an off-chip
accelerator). Although high cost and low utilization typically make on-chip
hardware accelerators unattractive for traditional processors, the cost of an
accelerator can be amortized over many strands, thanks to the high degree of
resource sharing associated with CMTs. While a wide variety of hardware
accelerators can be envisaged, emerging trends make an extremely compelling
case for supporting on-chip network off-load engines and cryptographic
accelerators. Future processors will afford opportunities for accelerating
other functionality. For instance, with the increasing usage of XML-formatted
data, it may become attractive to provide hardware support for XML parsing and
processing.
Finally, for the same amount of off-chip bandwidth to be maintained per core,
the total off-chip bandwidth for the processor chip must also double every
process generation. Processor designers can meet the bandwidth need by
adding more pins or increasing the bandwidth per pin. However, the
maximum number of pins per package is only increasing at a rate of 10
percent per generation. Further, packaging costs per pin are barely going down
with each new generation and increase significantly with pin count. As a
result, efforts have recently focused on increasing the per-pin bandwidth by
innovations in the processor chip to DRAM memory interconnect through
technologies such as double data rate and fully buffered DIMMs. Additional
benefits can be obtained by doing more with the available bandwidth; for
instance, by compressing off-chip traffic or exploiting silent stores (writes
that do not change the value already stored) to minimize
the bandwidth required to perform write-back operations. Compression of the
on-chip caches themselves can also improve performance, but the (significant)
additional latency that is introduced as a result of the decompression overhead
must be carefully balanced against the benefits of the reduced miss rate,
favoring adaptive compression strategies.
As a result, going forward we are likely to see an ever-increasing proportion
of CMT processors designed from the ground up to deliver increasing
performance while satisfying these power and bandwidth constraints.
CHAPTER 2
OpenSPARC Designs
Sun Microsystems began shipping the UltraSPARC T1 chip
multithreaded (CMT) processor in December 2005. Sun surprised the
industry by announcing that it would not only ship the processor but also
open-source that processor—a first in the industry. By March 2006,
UltraSPARC T1 had been open-sourced in a distribution called
OpenSPARC T1, available on http://OpenSPARC.net.
In 2007, Sun began shipping its newer, more advanced UltraSPARC T2
processor, and open-sourced the bulk of that design as OpenSPARC T2.
The “source code” for both designs offered on OpenSPARC.net is
comprehensive, including not just millions of lines of the hardware
description language (Verilog, a form of “register transfer logic”—RTL)
for these microprocessors, but also scripts to compile (“synthesize”) that
source code into hardware implementations, source code of processor and
full-system simulators, prepackaged operating system images to boot on
the simulators, source code to the Hypervisor software layer, a large suite
of verification software, and thousands of pages of architecture and
implementation specification documents.
This book is intended as a “getting started” companion to both
OpenSPARC T1 and OpenSPARC T2. In this chapter, we begin that
association by addressing this question: Now that Sun has open-sourced
OpenSPARC T1 and T2, what can they be used for?
One thing is certain: the real-world uses to which OpenSPARC will be
put will be infinitely more diverse and interesting than anything that
could be suggested in this book! Nonetheless, this short chapter offers a
few ideas, in the hope that they will stimulate even more creative
thinking …
2.1 Academic Uses for OpenSPARC
The utility of OpenSPARC in academia is limited only by students’
imaginations.
The most common academic use of OpenSPARC to date is as a complete
example processor architecture and/or implementation. It can be used in
coursework areas such as computer architecture, VLSI design, compiler code
generation/optimization, and general computer engineering.
In university lab courses, OpenSPARC provides a design that can be used as a
known-good starting point for assigned projects.
OpenSPARC can be used as a basis for compiler research, such as for code
generation/optimization for highly threaded target processors or for
experimenting with instruction set changes and additions.
OpenSPARC is already in use in multiple FPGA-based projects at universities.
For more information, visit:
http://www.opensparc.net/fpga/index.html
For more information on programs supporting academic use of OpenSPARC,
including availability of the Xilinx OpenSPARC FPGA Board, visit web page:
http://www.OpenSPARC.net/edu/university-program.html
Specific questions about university programs can be posted on the
OpenSPARC general forum at:
http://forums.sun.com/forum.jspa?forumID=837
or emailed to OpenSPARC-UniversityProgram@sun.com.
Many of the commercial applications of OpenSPARC, mentioned in the
following section, suggest corresponding academic uses.
2.2 Commercial Uses for OpenSPARC
OpenSPARC provides a springboard for design of commercial processors. By
starting from a complete, known-good design—including a full verification
suite—the time-to-market for a new custom processor can be drastically
slashed.
Derivative processors ranging from a simple single-core, single-thread design
all the way up through an 8-core, 64-thread design can rapidly be synthesized
from OpenSPARC T1 or T2.
2.2.1 FPGA Implementation
An OpenSPARC design can be synthesized and loaded into a field-
programmable gate array (FPGA) device. This can be used in several ways:
* An FPGA version of the processor can be used for product prototyping,
allowing rapid design iteration
* An FPGA can be used to provide a high-speed simulation engine for a
processor under development
* For extreme time-to-market needs where production cost per processor
isn’t critical, a processor could even be shipped in FPGA form. This could
also be useful if the processor itself needs to be field-upgradable via a
software download.
2.2.2 Design Minimization
Portions of a standard OpenSPARC design that are not needed for the target
application can be stripped out, to make the resulting processor smaller,
cheaper, faster, and/or with higher yield rates. For example, for a network
routing application, perhaps hardware floating-point operations are
superfluous—in which case, the FPU(s) can be removed, saving die area and
reducing verification effort.
2.2.3 Coprocessors
Specialized coprocessors can be incorporated into a processor based on
OpenSPARC. OpenSPARC T2, for example, comes with a coprocessor
containing two 10 Gbit/second Ethernet transceivers (the network interface
unit or “NIU”). Coprocessors can be added for any conceivable purpose,
including (but hardly limited to) the following:
* Network routing
* Floating-point acceleration
* Cryptographic processing
* I/O compression/decompression engines
* Audio compression/decompression (codecs)
* Video codecs
* I/O interface units for embedded devices such as displays or input sensors
2.2.4 OpenSPARC as Test Input to CAD/
EDA Tools
The OpenSPARC source code (Verilog RTL) provides a large, real-world
input dataset for CAD/EDA tools. It can be used to test the robustness of
CAD tools and simulators. Many major commercial CAD/EDA tool vendors
are already using OpenSPARC this way!
CHAPTER 3
Architecture Overview
OpenSPARC processors are based on a processor architecture named the
UltraSPARC Architecture. The OpenSPARC T1 design is based on the
UltraSPARC Architecture 2005, and OpenSPARC T2 is based on the
UltraSPARC Architecture 2007. This chapter is intended as an overview
of the architecture; more details can be found in the UltraSPARC
Architecture 2005 Specification and the UltraSPARC Architecture 2007
Specification.
The UltraSPARC Architecture is descended from the SPARC V9
architecture and complies fully with the “Level 1” (nonprivileged)
SPARC V9 specification.
The UltraSPARC Architecture supports 32-bit and 64-bit integer and 32-
bit, 64-bit, and 128-bit floating-point as its principal data types. The 32-
bit and 64-bit floating-point types conform to IEEE Std 754-1985. The
128-bit floating-point type conforms to IEEE Std 1596.5-1992. The
architecture defines general-purpose integer, floating-point, and special
state/status register instructions, all encoded in 32-bit-wide instruction
formats. The load/store instructions address a linear, 2^64-byte virtual
address space.
As used here, the word architecture refers to the processor features that
are visible to an assembly language programmer or to a compiler code
generator. It does not include details of the implementation that are not
visible or easily observable by software, nor those that only affect timing
(performance).
The chapter contains these sections:
* The UltraSPARC Architecture on page 12
* Processor Architecture on page 15
* Instructions on page 17
* Traps on page 23
* Chip-Level Multithreading (CMT) on page 23
3.1 The UltraSPARC Architecture
This section briefly describes features, attributes, and components of the
UltraSPARC Architecture and, further, describes correct implementation of
the architecture specification and SPARC V9-compliance levels.
3.1.1 Features
The UltraSPARC Architecture, like its ancestor SPARC V9, includes the
following principal features:
* A linear 64-bit address space with 64-bit addressing.
* 32-bit wide instructions — These are aligned on 32-bit boundaries in
memory. Only load and store instructions access memory and perform I/O.
* Few addressing modes — A memory address is given as either “register +
register” or “register + immediate”.
* Triadic register addresses — Most computational instructions operate on
two register operands or one register and a constant and place the result in
a third register.
* A large windowed register file — At any one instant, a program sees 8
global integer registers plus a 24-register window of a larger register file.
The windowed registers can be used as a cache of procedure arguments,
local values, and return addresses.
* Floating point — The architecture provides an IEEE 754-compatible
floating-point instruction set, operating on a separate register file that
provides 32 single-precision (32-bit), 32 double-precision (64-bit), and 16
quad-precision (128-bit) overlayed registers.
* Fast trap handlers — Traps are vectored through a table.
* Multiprocessor synchronization instructions — Multiple variations of
atomic load-store memory operations are supported.
* Predicted branches — The branch-with-prediction instructions allow the
compiler or assembly language programmer to give the hardware a hint
about whether a branch will be taken.
* Branch elimination instructions — Several instructions can be used to
eliminate branches altogether (for example, Move on Condition).
Eliminating branches increases performance in superscalar and
superpipelined implementations.
* Hardware trap stack — A hardware trap stack is provided to allow nested
traps. It contains all of the machine state necessary to return to the previous
trap level. The trap stack makes the handling of faults and error conditions
simpler, faster, and safer.
In addition, UltraSPARC Architecture includes the following features that
were not present in the SPARC V9 specification:
* Hyperprivileged mode — This mode simplifies porting of operating
systems, supports far greater portability of operating system (privileged)
software, supports the ability to run multiple simultaneous guest operating
systems, and provides more robust handling of error conditions.
Hyperprivileged mode is described in detail in the Hyperprivileged version
of the UltraSPARC Architecture 2005 Specification or the UltraSPARC
Architecture 2007 Specification.
* Multiple levels of global registers — Instead of the two 8-register sets of
global registers specified in the SPARC V9 architecture, the UltraSPARC
Architecture provides multiple sets; typically, one set is used at each trap
level.
* Extended instruction set — The UltraSPARC Architecture provides many
instruction set extensions, including the VIS instruction set for “vector”
(SIMD) data operations.
* More detailed, specific instruction descriptions — UltraSPARC
Architecture specifications provide many more details regarding what
exceptions can be generated by each instruction, and the specific conditions
under which those exceptions can occur, than did SPARC V9. Also,
detailed lists of valid ASIs are provided for each load/store instruction
from/to alternate space.
* Detailed MMU architecture — Although some details of the UltraSPARC
MMU architecture are necessarily implementation-specific, UltraSPARC
Architecture specifications provide a blueprint for the UltraSPARC MMU,
including software view (TTEs and TSBs) and MMU hardware control
registers.
* Chip-level multithreading (CMT) — The UltraSPARC Architecture
provides a control architecture for highly threaded processor
implementations.
3.1.2 Attributes
The UltraSPARC Architecture is a processor instruction set architecture (ISA)
derived from SPARC V8 and SPARC V9, which in turn come from a reduced
instruction set computer (RISC) lineage. As an architecture, the UltraSPARC
Architecture allows for a spectrum of processor and system implementations
at a variety of price/performance points for a range of applications, including
scientific or engineering, programming, real-time, and commercial
applications. OpenSPARC further extends the possible breadth of design
possibilities by opening up key implementations to be studied, enhanced, or
redesigned by anyone in the community.
3.1.2.1 Design Goals
The UltraSPARC Architecture is designed to be a target for optimizing
compilers and high-performance hardware implementations. The UltraSPARC
Architecture 2005 and UltraSPARC Architecture 2007 Specification
documents provide design specs against which an implementation can be
verified, using appropriate verification software.
3.1.2.2 Register Windows
The UltraSPARC Architecture is derived from the SPARC
architecture, which was formulated at Sun Microsystems from 1984 through
1987. The SPARC architecture is, in turn, based on the RISC I and II designs
engineered at the University of California at Berkeley from 1980 through
1982. The SPARC “register window” architecture, pioneered in the UC
Berkeley designs, allows for straightforward, high-performance compilers and
a reduction in memory load/store instructions.
Note that privileged software, not user programs, manages the register
windows. Privileged software can save a minimum number of registers
(approximately 24) during a context switch, thereby optimizing context-switch
latency.
3.1.3 System Components
The UltraSPARC Architecture allows for a spectrum of subarchitectures, such
as cache system, I/O, and memory management unit (MMU).
3.1.3.1 Binary Compatibility
An important mandate for the UltraSPARC Architecture is compatibility
across implementations of the architecture for application (nonprivileged)
software, down to the binary level. Binaries executed in nonprivileged mode
should behave identically on all UltraSPARC Architecture systems when those
systems are running an operating system known to provide a standard
execution environment. One example of such a standard environment is the
SPARC V9 Application Binary Interface (ABI).
Although different UltraSPARC Architecture systems can execute
nonprivileged programs at different rates, they will generate the same results
as long as they are run under the same memory model. See Chapter 9,
Memory, in an UltraSPARC Architecture specification for more information.
Additionally, UltraSPARC Architecture 2005 and UltraSPARC Architecture
2007 are upward-compatible from SPARC V9 for applications running in
nonprivileged mode that conform to the SPARC V9 ABI and upward-
compatible from SPARC V8 for applications running in nonprivileged mode
that conform to the SPARC V8 ABI.
An OpenSPARC implementation may or may not maintain the same binary
compatibility, depending on how the implementation has been modified and
what software execution environment is run on it.
3.1.3.2 UltraSPARC Architecture MMU
UltraSPARC Architecture defines a common MMU architecture (see Chapter
14, Memory Management, in any UltraSPARC Architecture specification for
details). Some specifics are left implementation-dependent.
3.1.3.3 Privileged Software
UltraSPARC Architecture does not assume that all implementations must
execute identical privileged software (operating systems) or hyperprivileged
software (hypervisors). Thus, certain traits that are visible to privileged
software may be tailored to the requirements of the system.
3.2 Processor Architecture
An UltraSPARC Architecture processor—therefore an OpenSPARC
processor—logically consists of an integer unit (IU) and a floating-point unit
(FPU), each with its own registers. This organization allows for
implementations with concurrent integer and floating-point instruction
execution. Integer registers are 64 bits wide; floating-point registers are 32,
64, or 128 bits wide. Instruction operands are single registers, register pairs,
register quadruples, or immediate constants.
A virtual processor (synonym: strand) is the hardware containing the state for
execution of a software thread. A physical core is the hardware required to
execute instructions from one or more software threads, including resources
shared among strands. A complete processor comprises one or more physical
cores and is the physical module that plugs into a system.
An OpenSPARC virtual processor can run in nonprivileged mode, privileged
mode, or hyperprivileged mode. In hyperprivileged mode, the processor can
execute any instruction, including privileged instructions. In privileged mode,
the processor can execute nonprivileged and privileged instructions. In
nonprivileged mode, the processor can only execute nonprivileged
instructions. In nonprivileged or privileged mode, an attempt to execute an
instruction requiring greater privilege than the current mode causes a trap to
hyperprivileged software.
3.2.1 Integer Unit (IU)
An OpenSPARC implementation’s integer unit contains the general-purpose
registers and controls the overall operation of the virtual processor. The IU
executes the integer arithmetic instructions and computes memory addresses
for loads and stores. It also maintains the program counters and controls
instruction execution for the FPU.
An UltraSPARC Architecture implementation may contain from 72 to 640
general-purpose 64-bit R registers. This corresponds to a grouping of the
registers into a number of sets of global R registers plus a circular stack of
N_REG_WINDOWS sets of 16 registers each, known as register windows. The
number of register windows present (N_REG_WINDOWS) is implementation
dependent, within the range of 3 to 32 (inclusive). In an unmodified
OpenSPARC T1 or T2 implementation, N_REG_WINDOWS = 8.
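As a sanity check on those bounds, the sketch below applies the usual accounting of 8 registers per global-register set plus 16 per window. The extreme global-set counts used here (3 and 16) are assumptions chosen to reproduce the quoted endpoints, since the number of global sets is itself implementation dependent.

    #include <stdio.h>

    /* Total R registers = 8 per global-register set + 16 per register window. */
    static int total_r_registers(int global_sets, int n_reg_windows)
    {
        return 8 * global_sets + 16 * n_reg_windows;
    }

    int main(void)
    {
        printf("minimum: %d\n", total_r_registers(3, 3));    /* 72  */
        printf("maximum: %d\n", total_r_registers(16, 32));  /* 640 */
        printf("8 windows, 4 global sets: %d\n", total_r_registers(4, 8)); /* 160 */
        return 0;
    }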
3.2.2 Floating-Point Unit (FPU)
An OpenSPARC FPU has thirty-two 32-bit (single-precision) floating-point
registers, thirty-two 64-bit (double-precision) floating-point registers, and
sixteen 128-bit (quad-precision) floating-point registers, some of which
overlap (as described in detail in UltraSPARC Architecture specifications).
If no FPU is present, then it appears to software as if the FPU is permanently
disabled.
If the FPU is not enabled, then an attempt to execute a floating-point
instruction generates an fp_disabled trap and the fp_disabled trap handler
software must either
* Enable the FPU (if present) and reexecute the trapping instruction, or
* Emulate the trapping instruction in software.
3.3 Instructions
Instructions fall into the following basic categories:
* Memory access
* Integer arithmetic / logical / shift
* Control transfer
* State register access
* Floating-point operate
* Conditional move
* Register window management
* SIMD (single instruction, multiple data) instructions
These classes are discussed in the following subsections.
3.3.1 Memory Access
Load, store, load-store, and PREFETCH instructions are the only instructions
that access memory. They use two R registers or an R register and a signed
13-bit immediate value to calculate a 64-bit, byte-aligned memory address.
The integer unit appends an ASI to this address.
The destination field of the load/store instruction specifies either one or two R
registers or one, two, or four F registers that supply the data for a store or that
receive the data from a load.
Integer load and store instructions support byte, halfword (16-bit), word (32-
bit), and extended-word (64-bit) accesses. There are versions of integer load
instructions that perform either sign-extension or zero-extension on 8-bit, 16-
bit, and 32-bit values as they are loaded into a 64-bit destination register.
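The difference between the signed and unsigned load variants can be modeled as follows; this is a C model of the extension behavior only (the LDSH/LDUH mnemonic names are given for orientation), not processor RTL.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Loading a 16-bit halfword into a 64-bit register: an LDSH-style load
     * sign-extends, an LDUH-style load zero-extends. */
    static int64_t load_signed_halfword(const void *p)
    {
        int16_t v;
        memcpy(&v, p, sizeof v);
        return (int64_t)v;                /* sign-extended to 64 bits */
    }

    static uint64_t load_unsigned_halfword(const void *p)
    {
        uint16_t v;
        memcpy(&v, p, sizeof v);
        return (uint64_t)v;               /* zero-extended to 64 bits */
    }

    int main(void)
    {
        uint16_t halfword = 0x8000;
        printf("signed:   %lld\n", (long long)load_signed_halfword(&halfword));
        printf("unsigned: %llu\n", (unsigned long long)load_unsigned_halfword(&halfword));
        return 0;
    }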
Floating-point load and store instructions support word, doubleword, and
quadword1 memory accesses.
1. OpenSPARC T1 and T2 processors do not implement the LDQF instruction in hardware; it
generates an exception and is emulated in hyperprivileged software.
CASA, CASXA, and LDSTUB are special atomic memory access instructions
that concurrent processes use for synchronization and memory updates.
Note The SWAP instruction is also specified, but it is
deprecated and should not be used in newly developed
software.
The (nonportable) LDTXA instruction supplies an atomic 128-bit (16-byte)
load that is important in certain system software applications.
3.3.1.1 Memory Alignment Restrictions
A memory access on an OpenSPARC virtual processor must typically be
aligned on an address boundary greater than or equal to the size of the datum
being accessed. An improperly aligned address in a load, store, or load-store
instruction may trigger an exception and cause a subsequent trap. For details,
see the Memory Alignment Restrictions section in an UltraSPARC
Architecture specification.
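A minimal sketch of the natural-alignment rule (the address must be a multiple of the access size; the exact trapping behavior is defined in the specification):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* An access of 'size' bytes (assumed to be a power of two) at virtual
     * address 'va' is naturally aligned iff va is a multiple of size. */
    static bool naturally_aligned(uint64_t va, uint64_t size)
    {
        return (va & (size - 1)) == 0;
    }

    int main(void)
    {
        printf("%d\n", naturally_aligned(0x1000, 8)); /* 1: 8-byte access, aligned    */
        printf("%d\n", naturally_aligned(0x1002, 4)); /* 0: 4-byte access, misaligned */
        return 0;
    }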
3.3.1.2 Addressing Conventions
An unmodified OpenSPARC processor uses big-endian byte order by default:
the address of a quadword, doubleword, word, or halfword is the address of its
most significant byte. Increasing the address means decreasing the
significance of the unit being accessed. All instruction accesses are performed
using big-endian byte order.
An unmodified OpenSPARC processor also supports little-endian byte order
for data accesses only: the address of a quadword, doubleword, word, or
halfword is the address of its least significant byte. Increasing the address
means increasing the significance of the data unit being accessed.
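The two conventions can be illustrated by assembling a 32-bit word from four consecutive bytes; in the big-endian case (the OpenSPARC default) the byte at the lowest address is the most significant. This is an illustrative C sketch, not a description of the hardware datapath.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t load_word_big_endian(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
    }

    static uint32_t load_word_little_endian(const uint8_t *p)
    {
        return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
               ((uint32_t)p[1] <<  8) |  (uint32_t)p[0];
    }

    int main(void)
    {
        uint8_t bytes[4] = { 0x12, 0x34, 0x56, 0x78 };
        printf("big-endian:    0x%08x\n", load_word_big_endian(bytes));    /* 0x12345678 */
        printf("little-endian: 0x%08x\n", load_word_little_endian(bytes)); /* 0x78563412 */
        return 0;
    }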
3.3.1.3 Addressing Range
An OpenSPARC implementation supports a 64-bit virtual address space. The
supported range of virtual addresses is restricted to two equal-sized ranges at
the extreme upper and lower ends of 64-bit addresses; that is, for n-bit virtual
addresses, the valid address ranges are 0 to 2^(n-1) - 1 and 2^64 - 2^(n-1) to 2^64 - 1.
See the OpenSPARC T1 Implementation Supplement or OpenSPARC T2
Implementation Supplement for details.
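Equivalently, a virtual address is in the supported range when its upper bits are a sign extension of an n-bit value. A small sketch of that check (the width n = 48 used in the example is only illustrative; the actual width is implementation dependent):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Valid virtual addresses occupy two equal blocks at the bottom and top
     * of the 64-bit space: 0 .. 2^(n-1)-1 and 2^64 - 2^(n-1) .. 2^64 - 1. */
    static bool va_in_supported_range(uint64_t va, unsigned n)
    {
        uint64_t half = 1ULL << (n - 1);               /* 2^(n-1) */
        return (va < half) || (va >= (uint64_t)0 - half);
    }

    int main(void)
    {
        unsigned n = 48;  /* example width only */
        printf("%d\n", va_in_supported_range(0x00007fffffffffffULL, n)); /* 1 */
        printf("%d\n", va_in_supported_range(0x0000800000000000ULL, n)); /* 0 */
        printf("%d\n", va_in_supported_range(0xffff800000000000ULL, n)); /* 1 */
        return 0;
    }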
3.3.1.4 Load/Store Alternate
Versions of load/store instructions, the load/store alternate instructions, can
specify an arbitrary 8-bit address space identifier for the load/store data
access.
Access to alternate spaces 0x00–0x2F is restricted to privileged and
hyperprivileged software, access to alternate spaces 0x30–0x7F is restricted to
hyperprivileged software, and access to alternate spaces 0x80–0xFF is
unrestricted. Some of the ASIs are available for implementation-dependent
uses. Privileged and hyperprivileged software can use the implementation-
dependent ASIs to access special protected registers, such as MMU control
registers, cache control registers, virtual processor state registers, and other
processor-dependent or system-dependent values. See the Address Space
Identifiers (ASIs) chapter in an UltraSPARC Architecture specification for
more information.
Alternate space addressing is also provided for the atomic memory access
instructions LDSTUBA, CASA, and CASXA.
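The privilege check implied by those ASI ranges can be sketched as follows; the enumeration labels are ours, purely for illustration, and are not an architectural encoding.

    #include <stdint.h>
    #include <stdio.h>

    enum asi_mode { NONPRIVILEGED, PRIVILEGED, HYPERPRIVILEGED };

    /* Minimum mode needed to use an 8-bit ASI, per the ranges above. */
    static enum asi_mode asi_min_mode(uint8_t asi)
    {
        if (asi >= 0x80) return NONPRIVILEGED;   /* 0x80-0xFF: unrestricted         */
        if (asi >= 0x30) return HYPERPRIVILEGED; /* 0x30-0x7F: hyperprivileged only */
        return PRIVILEGED;                       /* 0x00-0x2F: privileged or above  */
    }

    int main(void)
    {
        printf("%d %d %d\n", asi_min_mode(0x10), asi_min_mode(0x40), asi_min_mode(0xF0));
        return 0;
    }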
3.3.1.5 Separate Instruction and Data Memories
The interpretation of addresses in an unmodified OpenSPARC processor is
“split”; instruction references use one caching and translation mechanism and
data references use another, although the same underlying main memory is
shared.
In such split-memory systems, the coherency mechanism may be split, so a
write1 into data memory is not immediately reflected in instruction memory.
For this reason, programs that modify their own instruction stream (self-
modifying code2) and that wish to be portable across all UltraSPARC
Architecture (and SPARC V9) processors must issue FLUSH instructions, or a
system call with a similar effect, to bring the instruction and data caches into
a consistent state.
An UltraSPARC Architecture virtual processor may or may not have coherent
instruction and data caches. Even if an implementation does have coherent
instruction and data caches, a FLUSH instruction is required for self-
modifying code—not for cache coherency, but to flush pipeline instruction
buffers that may hold stale copies of instructions that have since been
modified in memory.
1. This includes use of store instructions (executed on the same or another virtual processor) that
write to instruction memory, or any other means of writing into instruction memory (for example,
DMA).
2. This is practiced, for example, by software such as debuggers and dynamic linkers.
3.3.1.6 Input/Output (I/O)
The UltraSPARC Architecture assumes that input/output registers are accessed
through load/store alternate instructions, normal load/store instructions, or
read/write Ancillary State register instructions (RDasr, WRasr).
3.3.1.7 Memory Synchronization
Two instructions are used for synchronization of memory operations: FLUSH
and MEMBAR. Their operation is explained in Flush Instruction Memory and
Memory Barrier sections, respectively, of UltraSPARC Architecture
specifications.
3.3.2 Integer Arithmetic / Logical / Shift
Instructions
The arithmetic/logical/shift instructions perform arithmetic, tagged arithmetic,
logical, and shift operations. With one exception, these instructions compute a
result that is a function of two source operands; the result is either written into
a destination register or discarded. The exception, SETHI, can be used in
combination with other arithmetic and/or logical instructions to create a
constant in an R register.
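For instance, the common SETHI-plus-OR idiom for building an arbitrary 32-bit constant can be modeled in C as shown below; this models only the bit manipulation (SETHI supplies the upper 22 bits shifted into bit positions 31:10, OR supplies the low 10 bits), not actual generated code.

    #include <stdint.h>
    #include <stdio.h>

    /* SETHI writes imm22 into bits 31..10 of the destination and clears the
     * rest; a following OR with a 10-bit immediate completes the constant
     * (the classic "sethi %hi(x), rd; or rd, %lo(x), rd" sequence). */
    static uint64_t sethi(uint32_t imm22)               { return (uint64_t)(imm22 & 0x3fffff) << 10; }
    static uint64_t or_imm(uint64_t rs, uint32_t imm10) { return rs | (imm10 & 0x3ff); }

    int main(void)
    {
        uint32_t value = 0xdeadbeef;
        uint64_t rd = sethi(value >> 10);           /* %hi(value): upper 22 bits */
        rd = or_imm(rd, value & 0x3ff);             /* %lo(value): lower 10 bits */
        printf("0x%llx\n", (unsigned long long)rd); /* 0xdeadbeef */
        return 0;
    }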
Shift instructions shift the contents of an R register left or right by a given
number of bits (“shift count”). The shift distance is specified by a constant in
the instruction or by the contents of an R register.
3.3.3 Control Transfer
Control-transfer instructions (CTIs) include PC-relative branches and calls,
register-indirect jumps, and conditional traps. Most of the control-transfer
instructions are delayed; that is, the instruction immediately following a
control-transfer instruction in logical sequence is dispatched before the control
transfer to the target address is completed. Note that the next instruction in
logical sequence may not be the instruction following the control-transfer
instruction in memory.
The instruction following a delayed control-transfer instruction is called a
delay instruction. Setting the annul bit in a conditional delayed control-
transfer instruction causes the delay instruction to be annulled (that is, to have
no effect) if and only if the branch is not taken. Setting the annul bit in an
unconditional delayed control-transfer instruction (“branch always”) causes
the delay instruction to be always annulled.
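The annulling rule reduces to a few lines of logic; the sketch below is a behavioral summary of the two cases just described (it ignores corner cases such as the "branch never" opcode).

    #include <stdbool.h>
    #include <stdio.h>

    /* Does the delay-slot instruction execute?
     *  - annul bit clear: the delay instruction always executes.
     *  - annul bit set, conditional branch: executes only if the branch is taken.
     *  - annul bit set, "branch always": always annulled. */
    static bool delay_slot_executes(bool annul, bool conditional, bool taken)
    {
        if (!annul)
            return true;
        return conditional && taken;
    }

    int main(void)
    {
        printf("%d\n", delay_slot_executes(false, true, false)); /* 1: annul bit clear       */
        printf("%d\n", delay_slot_executes(true, true, false));  /* 0: annulled, not taken   */
        printf("%d\n", delay_slot_executes(true, false, true));  /* 0: branch always + annul */
        return 0;
    }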
Branch and CALL instructions use PC-relative displacements. The jump and
link (JMPL) and return (RETURN) instructions use a register-indirect target
address. They compute their target addresses either as the sum of two R
registers or as the sum of an R register and a 13-bit signed immediate value.
The “branch on condition codes without prediction” instruction provides a
displacement of ±8 Mbytes; the “branch on condition codes with prediction”
instruction provides a displacement of ±1 Mbyte; the “branch on register
contents” instruction provides a displacement of ±128 Kbytes; and the CALL
instruction’s 30-bit word displacement allows a control transfer to any address
within ± 2 gigabytes (± 2^31 bytes).
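Those ranges follow directly from the widths of the displacement fields (word displacements, so each count is multiplied by the 4-byte instruction size); the field widths used below are the usual SPARC V9 encodings.

    #include <stdio.h>

    /* A signed d-bit word displacement reaches +/- 2^(d-1) instructions,
     * i.e. +/- 2^(d-1) * 4 bytes. */
    static long long reach_bytes(int disp_bits)
    {
        return (1LL << (disp_bits - 1)) * 4;
    }

    int main(void)
    {
        printf("Bicc (22-bit): +/- %lld bytes (8 MB)\n",   reach_bytes(22));
        printf("BPcc (19-bit): +/- %lld bytes (1 MB)\n",   reach_bytes(19));
        printf("BPr  (16-bit): +/- %lld bytes (128 KB)\n", reach_bytes(16));
        printf("CALL (30-bit): +/- %lld bytes (2 GB)\n",   reach_bytes(30));
        return 0;
    }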
Note The return from privileged trap instructions (DONE and
RETRY) get their target address from the appropriate
TPC or TNPC register.
3.3.4 State Register Access
This section describes the following state registers:
* Ancillary state registers
* Read and write privileged state registers
* Read and write hyperprivileged state registers
3.3.4.1 Ancillary State Registers
The read and write ancillary state register instructions read and write the
contents of ancillary state registers visible to nonprivileged software (Y, CCR,
ASI, PC, TICK, and FPRS) and some registers visible only to privileged and
hyperprivileged software (PCR, SOFTINT, TICK_CMPR, and
STICK_CMPR).
3.3.4.2 PR State Registers
The read and write privileged register instructions (RDPR and WRPR) read
and write the contents of state registers visible only to privileged and
hyperprivileged software (TPC, TNPC, TSTATE, TT, TICK, TBA, PSTATE,
TL, PIL, CWP, CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, and
WSTATE).
3.3.4.3 HPR State Registers
The read and write hyperprivileged register instructions (RDHPR and
WRHPR) read and write the contents of state registers visible only to
hyperprivileged software (HPSTATE, HTSTATE, HINTP, HVER, and
HSTICK_CMPR).
3.3.5 Floating-Point Operate
Floating-point operate (FPop) instructions perform all floating-point
calculations; they are register-to-register instructions that operate on the
floating-point registers. FPops compute a result that is a function of one, two,
or three source operands. The groups of instructions that are considered FPops
are listed in the Floating-Point Operate (FPop) Instructions section of
UltraSPARC Architecture specifications.
3.3.6 Conditional Move
Conditional move instructions conditionally copy a value from a source
register to a destination register, depending on an integer or floating-point
condition code or on the contents of an integer register. These instructions can
be used to reduce the number of branches in software.
3.3.7 Register Window Management
Register window instructions manage the register windows. SAVE and
RESTORE are nonprivileged and cause a register window to be pushed or
popped. FLUSHW is nonprivileged and causes all of the windows except the
current one to be flushed to memory. SAVED and RESTORED are used by
privileged software to end a window spill or fill trap handler.
3.3.8 SIMD
An unmodified OpenSPARC processor includes SIMD (single instruction,
multiple data) instructions, also known as “vector” instructions, which allow a
single instruction to perform the same operation on multiple data items,
totaling 64 bits, such as eight 8-bit, four 16-bit, or two 32-bit data items.
These operations are part of the “VIS” instruction set extensions.
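As an illustration of the idea (not a bit-exact rendering of any particular VIS instruction), a partitioned add over four 16-bit lanes packed into a 64-bit word can be modeled as follows.

    #include <stdint.h>
    #include <stdio.h>

    /* Partitioned add: treat the two 64-bit operands as four independent
     * 16-bit lanes and add lane by lane, with no carries between lanes. */
    static uint64_t padd16(uint64_t a, uint64_t b)
    {
        uint64_t result = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t x = (uint16_t)(a >> (lane * 16));
            uint16_t y = (uint16_t)(b >> (lane * 16));
            result |= (uint64_t)(uint16_t)(x + y) << (lane * 16);
        }
        return result;
    }

    int main(void)
    {
        uint64_t a = 0x0001000200030004ULL;
        uint64_t b = 0x0010002000300040ULL;
        printf("0x%016llx\n", (unsigned long long)padd16(a, b)); /* 0x0011002200330044 */
        return 0;
    }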
3.4 Traps
A trap is a vectored transfer of control to privileged or hyperprivileged
software through a trap table that may contain the first 8 instructions (32 for
some frequently used traps) of each trap handler. The base address of the table
is established by software in a state register (the Trap Base Address register,
TBA, or the Hyperprivileged Trap Base register, HTBA). The displacement
within the table is encoded in the type number of each trap and the level of the
trap. Part of the trap table is reserved for hardware traps, and part of it is
reserved for software traps generated by trap (Tcc) instructions.
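A hedged sketch of how that displacement is typically formed, following the SPARC V9 layout in which each trap-table entry holds eight 32-bit instructions (32 bytes) and a separate half of the table serves traps taken at trap level greater than zero; consult an UltraSPARC Architecture specification for the authoritative bit-level definition (and note that the hyperprivileged table is based on HTBA rather than TBA).

    #include <stdint.h>
    #include <stdio.h>

    /* Trap-vector formation, SPARC V9 style: TBA supplies bits 63..15, one
     * bit selects the "TL > 0" half of the table, and the 9-bit trap type
     * selects a 32-byte entry. */
    static uint64_t trap_vector(uint64_t tba, unsigned tl_at_trap, unsigned tt)
    {
        uint64_t base   = tba & ~0x7fffULL;                 /* TBA<63:15>          */
        uint64_t tl_bit = (uint64_t)(tl_at_trap > 0) << 14; /* second half if TL>0 */
        uint64_t offset = (uint64_t)(tt & 0x1ff) << 5;      /* 32 bytes per entry  */
        return base | tl_bit | offset;
    }

    int main(void)
    {
        /* Example only: trap type 0x24 taken from trap level 0. */
        printf("0x%llx\n", (unsigned long long)trap_vector(0x100000ULL, 0, 0x24));
        return 0;
    }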
A trap causes the current PC and NPC to be saved in the TPC and TNPC
registers. It also causes the CCR, ASI, PSTATE, and CWP registers to be
saved in TSTATE. TPC, TNPC, and TSTATE are entries in a hardware trap
stack, where the number of entries in the trap stack is equal to the number of
supported trap levels. A trap causes hyperprivileged state to be saved in the
HTSTATE trap stack. A trap also sets bits in the PSTATE (and, in some cases,
HPSTATE) register and typically increments the GL register. Normally, the
CWP is not changed by a trap; on a window spill or fill trap, however, the
CWP is changed to point to the register window to be saved or restored.
A trap can be caused by a Tcc instruction, an asynchronous exception, an
instruction-induced exception, or an interrupt request not directly related to a
particular instruction. Before executing each instruction, a virtual processor
determines if there are any pending exceptions or interrupt requests. If any are
pending, the virtual processor selects the highest-priority exception or
interrupt request and causes a trap.
See the Traps chapter in an UltraSPARC Architecture specification for a
complete description of traps.
3.5 Chip-Level Multithreading
(CMT)
An OpenSPARC implementation may include multiple virtual processor cores
within the processor (“chip”) to provide a dense, high-throughput system. This
may be achieved by having a combination of multiple physical processor
cores and/or multiple strands (threads) per physical processor core, referred to
as chip-level multithreaded (CMT) processors. CMT-specific hyperprivileged
registers are used for identification and configuration of CMT processors.
The CMT programming model describes a common interface between
hardware (CMT registers) and software.
The common CMT registers and the CMT programming model are described
in the Chip-Level Multithreading (CMT) chapter in UltraSPARC Architecture
specifications.
CHAPTER 4
OpenSPARC T1 and T2 Processor
Implementations
This chapter introduces the OpenSPARC T1 and OpenSPARC T2 chip-
level multithreaded (CMT) processors in the following sections:
* General Background on page 25
* OpenSPARC T1 Overview on page 27
* OpenSPARC T1 Components on page 29
* OpenSPARC T2 Overview on page 33
* OpenSPARC T2 Components on page 34
* Summary of Differences Between OpenSPARC T1 and OpenSPARC T2
on page 36
4.1 General Background
OpenSPARC T1 is the first chip multiprocessor that fully implements
Sun’s Throughput Computing initiative. OpenSPARC T2 is the follow-on
chip multithreaded (CMT) processor to the OpenSPARC T1 processor.
Throughput Computing is a technique that takes advantage of the thread-
level parallelism that is present in most commercial workloads. Unlike
desktop workloads, which often have a small number of threads
concurrently running, most commercial workloads achieve their
scalability by employing large pools of concurrent threads.
Historically, microprocessors have been designed to target desktop
workloads, and as a result have focused on running a single thread as
quickly as possible. Single-thread performance is achieved in these
microprocessors by a combination of extremely deep pipelines (over 20
stages in Pentium 4) and by execution of multiple instructions in parallel
(referred to as instruction-level parallelism, or ILP). The basic tenet
behind Throughput Computing is that exploiting ILP and deep pipelining has
reached the point of diminishing returns and as a result, current
microprocessors do not utilize their underlying hardware very efficiently.
For many commercial workloads, the physical processor core will be idle
most of the time waiting on memory, and even when it is executing, it will
often be able to utilize only a small fraction of its wide execution width. So
rather than building a large and complex ILP processor that sits idle most of
the time, build a number of small, single-issue physical processor cores that
employ multithreading built in the same chip area. Combining multiple
physical processors cores on a single chip with multiple hardware-supported
threads (strands) per physical processor core allows very high performance for
highly threaded commercial applications. This approach is called thread-level
parallelism (TLP). The difference between TLP and ILP is shown in
FIGURE 4-1.
[Figure: four TLP strands interleave execution so that one strand runs while the others are stalled on memory, whereas a single ILP strand alternates between executing and stalling on memory.]
FIGURE 4-1 Differences Between TLP and ILP
The memory stall time of one strand can often be overlapped with execution
of other strands on the same physical processor core, and multiple physical
processor cores run their strands in parallel. In the ideal case, shown in
FIGURE 4-1, memory latency can be completely overlapped with execution of
other strands. In contrast, instruction-level parallelism simply shortens the
time to execute instructions, and does not help much in overlapping execution
with memory latency.1
1. Processors that employ out-of-order ILP can overlap some memory latency with execution.
However, this overlap is typically limited to shorter memory latency events such as L1 cache
misses that hit in the L2 cache. Longer memory latency events such as main memory accesses are
rarely overlapped to a significant degree with execution by an out-of-order processor.
Given this ability to overlap execution with memory latency, why don’t more
processors utilize TLP? The answer is that designing processors is a mostly
evolutionary process, and the ubiquitous deeply pipelined, wide ILP physical
processor cores of today are the evolutionary outgrowth from a time when the
CPU was the bottleneck in delivering good performance.
With physical processor cores capable of multiple-GHz clocking, the
performance bottleneck has shifted to the memory and I/O subsystems and
TLP has an obvious advantage over ILP for tolerating the large I/O and
memory latency prevalent in commercial applications. Of course, every
architectural technique has its advantages and disadvantages. The one
disadvantage of employing TLP over ILP is that execution of a single strand
may be slower on a TLP processor than on an ILP processor. With physical
processor cores running at frequencies well over 1 GHz, a strand capable of
executing only a single instruction per cycle is fully capable of completing
tasks in the time required by the application, making this disadvantage a
nonissue for nearly all commercial applications.
4.2 OpenSPARC T1 Overview
OpenSPARC T1 is a single-chip multiprocessor. OpenSPARC T1 contains
eight SPARC physical processor cores. Each SPARC physical processor core
has full hardware support for four virtual processors (or “strands”). These four
strands run simultaneously, with the instructions from each of the four strands
executed round-robin by the single-issue pipeline. When a strand encounters a
long-latency event, such as a cache miss, it is marked unavailable and
instructions are not issued from that strand until the long-latency event is
resolved. Round-robin execution of the remaining available strands continues
while the long-latency event of the first strand is resolved.
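The thread-select behavior just described can be sketched as a round-robin pick that skips unavailable strands; this is a behavioral model written for illustration, not the actual pipeline RTL.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_STRANDS 4

    /* Each cycle, pick the next available strand after the one issued last,
     * in round-robin order; strands stalled on long-latency events are
     * marked unavailable and skipped.  Returns -1 if nothing can issue. */
    static int pick_next_strand(const bool available[NUM_STRANDS], int last_issued)
    {
        for (int i = 1; i <= NUM_STRANDS; i++) {
            int candidate = (last_issued + i) % NUM_STRANDS;
            if (available[candidate])
                return candidate;
        }
        return -1;
    }

    int main(void)
    {
        bool available[NUM_STRANDS] = { true, false, true, true }; /* strand 1 waiting on a miss */
        int last = 0;
        for (int cycle = 0; cycle < 4; cycle++) {
            last = pick_next_strand(available, last);
            printf("cycle %d: issue from strand %d\n", cycle, last);
        }
        return 0;
    }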
Each OpenSPARC T1 physical core has a 16-Kbyte, 4-way associative
instruction cache (32-byte lines), 8-Kbyte, 4-way associative data cache (16-
byte lines), 64-entry fully associative instruction Translation Lookaside Buffer
(TLB), and 64-entry fully associative data TLB that are shared by the four
strands. The eight SPARC physical cores are connected through a crossbar to
an on-chip unified 3-Mbyte, 12-way associative L2 cache (with 64-byte lines).
The L2 cache is banked four ways to provide sufficient bandwidth for the
eight OpenSPARC T1 physical cores. The L2 cache connects to four on-chip
DRAM controllers, which directly interface to DDR2-SDRAM. In addition,
an on-chip J-Bus controller and several on-chip I/O-mapped control registers
are accessible to the SPARC physical cores. Traffic from the J-Bus coherently
interacts with the L2 cache.
A block diagram of the OpenSPARC T1 chip is shown in FIGURE 4-2.
[Figure: eight SPARC cores and a shared FPU connect through the cache crossbar (CCX) to four L2 cache banks; each L2 bank is paired with a DRAM controller driving a DDR-II channel. The chip also contains the clock and test unit (CTU), eFuse, JTAG, the I/O bridge (IOB) with CSRs, the J-Bus system interface (200 MHz), and the SSI ROM interface (50 MHz).]
FIGURE 4-2 OpenSPARC T1 Chip Block Diagram
4.3 OpenSPARC T1 Components
This section describes each component in OpenSPARC T1 in these
subsections:
* SPARC Physical Core on this page
* Floating-Point Unit (FPU) on page 30
* L2 Cache on page 31
* DRAM Controller on page 31
* I/O Bridge (IOB) Unit on page 31
* J-Bus Interface (JBI) on page 32
* SSI ROM Interface on page 32
* Clock and Test Unit (CTU) on page 32
* EFuse on page 33
4.3.1 OpenSPARC T1 Physical Core
Each OpenSPARC T1 physical core has hardware support for four strands.
This support consists of a full register file (with eight register windows) per
strand, with most of the ASI, ASR, and privileged registers replicated per
strand. The four strands share the instruction and data caches and TLBs. An
autodemap1 feature is included with the TLBs to allow the multiple strands to
update the TLB without locking.
The core pipeline consists of six stages: Fetch, Switch, Decode, Execute,
Memory, and Writeback. As shown in FIGURE 4-3, the Switch stage contains a
strand instruction register for each strand. One of the strands is picked by the
strand scheduler and the current instruction for that strand is issued to the
pipe. While this is done, the hardware fetches the next instruction for that
strand and updates the strand instruction register.
The scheduled instruction proceeds down the rest of the stages of the pipe,
similar to instruction execution in a single-strand RISC machine. It is decoded
in the Decode stage. The register file access also happens at this time. In the
Execute stage, all arithmetic and logical operations take place. The memory
address is calculated in this stage. The data cache is accessed in the Memory
stage and the instruction is committed in the Writeback stage. All traps are
signaled in this stage.
1. Autodemap causes an existing TLB entry to be automatically removed when a new entry is
installed with the same virtual page number (VPN) and same page size.
Instructions are classified as either short- or long-latency instructions. Upon
encountering a long-latency instruction or other stall condition in a given
strand, the strand scheduler stops scheduling that strand for further execution.
Scheduling resumes when the long-latency instruction completes or the stall
condition clears.
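The issue policy described above can be pictured with a short C sketch (an
illustrative model only, not the hardware implementation): round-robin selection
among the four strands, skipping any strand marked unavailable.

#include <stdbool.h>

#define NUM_STRANDS 4

typedef struct {
    bool available;     /* cleared on a long-latency instruction or stall */
} strand_t;

/* Hypothetical round-robin strand scheduler: starting after the last
 * strand issued, pick the next strand that is not stalled. Returns -1
 * if every strand is waiting on a long-latency event. */
static int pick_next_strand(const strand_t strands[NUM_STRANDS], int last_issued)
{
    for (int i = 1; i <= NUM_STRANDS; i++) {
        int s = (last_issued + i) % NUM_STRANDS;
        if (strands[s].available)
            return s;
    }
    return -1;          /* all strands stalled; no instruction issues */
}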
FIGURE 4-3 illustrates the OpenSPARC T1 physical core.
[Figure: the instruction cache feeds per-strand instruction registers; the strand
scheduler selects one strand per cycle for decode, register file access, ALU
execution, and data cache access, with store buffers and an external interface.]
FIGURE 4-3 OpenSPARC T1 Core Block Diagram
4.3.2 Floating-Point Unit (FPU)
A single floating-point unit is shared by all eight OpenSPARC T1 physical
cores. The shared floating-point unit is sufficient for most commercial
applications, in which fewer than 1% of instructions typically involve
floating-point operations.
4.3.3 L2 Cache
The L2 cache is banked four ways, with the bank selection based on physical
address bits 7:6. The cache is 3-Mbyte, 12-way set associative with pseudo-
LRU replacement (replacement is based on a used-bit scheme), and has a line
size of 64 bytes. Unloaded access time is 23 cycles for an L1 data cache miss
and 22 cycles for an L1 instruction cache miss.
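As an illustration, bank and set selection can be sketched in C as follows; the
use of physical address bits 7:6 for the bank comes from the text, while the
set-index bits are inferred from the stated geometry and should be treated as an
assumption:

#include <stdint.h>

/* OpenSPARC T1 L2: 3 Mbytes, 12-way, 64-byte lines, 4 banks.
 * The bank is chosen by physical address bits 7:6, so consecutive
 * 64-byte lines rotate across the four banks. */
static inline unsigned t1_l2_bank(uint64_t pa)
{
    return (unsigned)((pa >> 6) & 0x3);     /* PA[7:6] */
}

/* Within a bank: 3 MB / 4 banks = 768 KB; 768 KB / (12 ways * 64 B)
 * = 1024 sets, so the set index would be PA[17:8] (an assumption
 * consistent with the stated geometry, not taken from the text). */
static inline unsigned t1_l2_set(uint64_t pa)
{
    return (unsigned)((pa >> 8) & 0x3FF);   /* PA[17:8] */
}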
4.3.4 DRAM Controller
OpenSPARC T1’s DRAM Controller is banked four ways¹, with each L2 bank
interacting with exactly one DRAM Controller bank. The DRAM Controller is
interleaved based on physical address bits 7:6, so each DRAM Controller
bank must have the same amount of memory installed and enabled.
OpenSPARC T1 uses DDR2 DIMMs and can support one or two ranks of
stacked or unstacked DIMMs. Each DRAM bank/port is two DIMMs wide
(128-bit + 16-bit ECC). All installed DIMMs on an individual bank/port must
be identical, and the same total amount of memory (number of bytes) must be
installed on each DRAM Controller port. The DRAM controller frequency is
an exact ratio of the CMP core frequency, where the CMP core frequency
must be at least 4× the DRAM controller frequency. The DDR (double data
rate) data buses, of course, transfer data at twice the DRAM Controller
frequency.
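A short worked example of these ratios (the figures are assumed for
illustration, not taken from the text) is shown below:

#include <stdio.h>

/* Sketch of the T1 clock-ratio constraint: the CMP core clock must be
 * at least 4x the DRAM controller clock, and the DDR2 data bus moves
 * data at twice the controller frequency. Values are in MHz and are
 * purely illustrative. */
int main(void)
{
    unsigned core_mhz = 1200;                 /* assumed core clock      */
    unsigned ratio    = 4;                    /* core : DRAM clock ratio */
    unsigned dram_mhz = core_mhz / ratio;     /* 300 MHz controller      */
    unsigned ddr_mts  = dram_mhz * 2;         /* 600 MT/s data bus       */

    if (core_mhz < 4 * dram_mhz)
        printf("invalid: core clock must be >= 4x DRAM controller clock\n");
    else
        printf("core %u MHz, DRAM controller %u MHz, DDR data %u MT/s\n",
               core_mhz, dram_mhz, ddr_mts);
    return 0;
}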
The DRAM Controller also supports a small memory configuration mode,
using only two DRAM ports. In this mode, L2 banks 0 and 2 are serviced by
DRAM port 0, and L2 banks 1 and 3 are serviced by DRAM port 1. The
installed memory on each of these ports is still two DIMMs wide.
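The bank-to-port mapping in the two-port mode can be expressed compactly, as in
the following illustrative helper (hypothetical code, not part of the design):

/* Hypothetical L2-bank to DRAM-port mapping. In the normal four-port
 * configuration each bank has its own port; in the two-port (small
 * memory) mode the port is the bank number modulo 2, so banks 0 and 2
 * share port 0 and banks 1 and 3 share port 1. */
static inline unsigned dram_port_for_bank(unsigned l2_bank, int small_config)
{
    return small_config ? (l2_bank & 0x1) : l2_bank;
}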
4.3.5 I/O Bridge (IOB) Unit
The IOB performs an address decode on I/O-addressable transactions and
directs them to the appropriate internal block or to the appropriate external
interface (J-Bus or SSI). In addition, the IOB maintains the register status for
external interrupts.
1. A two-bank option is available for cost-constrained minimal memory configurations.
4.3.6 J-Bus Interface (JBI)
J-Bus is the interconnect between OpenSPARC T1 and the I/O subsystem. It is
a 200 MHz, 128-bit-wide, multiplexed address/data bus, used predominantly
for DMA traffic, plus the PIO traffic to control it.
The JBI is the block that interfaces to J-Bus, receiving and responding to
DMA requests, routing them to the appropriate L2 banks, and also issuing PIO
transactions on behalf of the strands and forwarding responses back.
4.3.7 SSI ROM Interface
OpenSPARC T1 has a 50 Mbit/s serial interface (SSI) that connects to an
external FPGA, which in turn interfaces to the boot ROM. In addition, the SSI
interface supports PIO accesses across the SSI, thus supporting optional CSRs
or other interfaces within the FPGA.
4.3.8 Clock and Test Unit (CTU)
The CTU contains the clock generation, reset, and JTAG circuitry.
OpenSPARC T1 has a single PLL, which takes the J-Bus clock as its input
reference, where the PLL output is divided down to generate the CMP core
clocks (for OpenSPARC T1 and caches), the DRAM clock (for the DRAM
controller and external DIMMs), and internal J-Bus clock (for IOB and JBI).
Thus, all OpenSPARC T1 clocks are ratioed. Sync pulses are generated to
control transmission of signals and data across clock domain boundaries.
The CTU has the state machines for internal reset sequencing, which includes
logic to reset the PLL and signal when the PLL is locked, updating clock
ratios on warm resets (if so programmed), enabling clocks to each block in
turn, and distributing reset so that its assertion is seen simultaneously in all
clock domains.
The CTU also contains the JTAG block, which allows access to the shadow
scan chains and has a CREG interface that allows JTAG to issue reads of
any I/O-addressable register, some ASI locations, and any memory location
while OpenSPARC T1 is in operation.
4.3.9 EFuse
The eFuse (electronic fuse) block contains configuration information that is
electronically burned in as part of manufacturing, including part serial number
and strand-available information.
4.4 OpenSPARC T2 Overview
OpenSPARC T2 is a single chip multithreaded (CMT) processor.
OpenSPARC T2 contains eight SPARC physical processor cores. Each SPARC
physical processor core has full hardware support for eight virtual processors
(strands), two integer execution pipelines, one floating-point execution pipeline, and one
memory pipeline. The floating-point and memory pipelines are shared by all
eight strands. The eight strands are hard-partitioned into two groups of four,
and the four strands within a group share a single integer pipeline.
While all eight strands run simultaneously, at any given time at most two
strands will be active in the physical core, and those two strands will be
issuing either a pair of integer pipeline operations, an integer operation and a
floating-point operation, an integer operation and a memory operation, or a
floating-point operation and a memory operation. Strands are switched on a
cycle-by-cycle basis between the available strands within the hard-partitioned
group of four, using a least recently issued priority scheme.
When a strand encounters a long-latency event, such as a cache miss, it is
marked unavailable and instructions will not be issued from that strand until
the long-latency event is resolved. Execution of the remaining available
strands will continue while the long-latency event of the first strand is
resolved.
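A simplified C sketch of this selection policy is shown below (an illustrative
model; the hardware keeps its own issue history, and only the
least-recently-issued rule and the grouping of four strands come from the text):

#include <stdbool.h>
#include <stdint.h>

#define STRANDS_PER_GROUP 4

typedef struct {
    bool     available;      /* false while waiting on a long-latency event */
    uint64_t last_issue;     /* cycle of most recent issue from this strand */
} t2_strand_t;

/* Hypothetical least-recently-issued pick within one group of four
 * strands; each of the two groups feeds its own integer pipeline.
 * Returns -1 when every strand in the group is unavailable. */
static int pick_lri_strand(const t2_strand_t grp[STRANDS_PER_GROUP])
{
    int best = -1;
    for (int s = 0; s < STRANDS_PER_GROUP; s++) {
        if (!grp[s].available)
            continue;
        if (best < 0 || grp[s].last_issue < grp[best].last_issue)
            best = s;
    }
    return best;
}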
Each OpenSPARC T2 physical core has a 16-Kbyte, 8-way associative
instruction cache (32-byte lines), 8-Kbyte, 4-way associative data cache (16-
byte lines), 64-entry fully associative instruction TLB, and 128-entry fully
associative data TLB that are shared by the eight strands. The eight
OpenSPARC T2 physical cores are connected through a crossbar to an on-chip
unified 4-Mbyte, 16-way associative L2 cache (64-byte lines).
The L2 cache is banked eight ways to provide sufficient bandwidth for the
eight OpenSPARC T2 physical cores. The L2 cache connects to four on-chip
DRAM Controllers, which directly interface to a pair of fully buffered DIMM
(FBD) channels. In addition, two 1-Gbit/10-Gbit Ethernet MACs and several
on-chip I/O-mapped control registers are accessible to the SPARC physical
cores.
A block diagram of the OpenSPARC T2 chip is shown in FIGURE 4-4.
[Figure: the eight SPARC cores connect through the Cache Crossbar (CCX) to eight
L2 banks; each pair of L2 banks shares one of four MCUs, and each MCU drives a
pair of fully buffered DIMM (FBD) channels (1 or 2 ranks per DIMM, with an
optional dual-channel mode). Also shown are the TCU, CCU, eFuse, the NIU with two
10-Gb Ethernet MACs and an FCRAM interface, the SIU, the PCI-EX interface, and
the SSI ROM interface.]
FIGURE 4-4 OpenSPARC T2 Chip Block Diagram
4.5 OpenSPARC T2 Components
This section describes the major components in OpenSPARC T2.
4.5.1 OpenSPARC T2 Physical Core
Each OpenSPARC T2 physical core has hardware support for eight strands.
This support consists of a full register file (with eight register windows) per
strand, with most of the ASI, ASR, and privileged registers replicated per
strand. The eight strands share the instruction and data caches and TLBs. An
autodemap feature is included with the TLBs to allow the multiple strands to
update the TLB without locking.
Each OpenSPARC T2 physical core contains a floating-point unit, shared by
all eight strands. The floating-point unit performs single- and double-precision
floating-point operations, graphics operations, and integer multiply and divide
operations.
4.5.2 L2 Cache
The L2 cache is banked eight ways. To provide for better partial-die recovery,
OpenSPARC T2 can also be configured in 4-bank and 2-bank modes (with 1/2
and 1/4 of the total cache size, respectively). Bank selection is based on
physical address bits 8:6 for 8 banks, 7:6 for 4 banks, and 6 for 2 banks. The
cache is 4 Mbytes, 16-way set associative, and uses index hashing. The line
size is 64 bytes.
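The bank-select decode for the three configurations can be sketched as follows
(illustrative C only; index hashing within a bank is not modeled):

#include <stdint.h>

/* Bank selection for the OpenSPARC T2 L2 cache in its three
 * configurations: 8 banks use PA[8:6], 4 banks use PA[7:6], and
 * 2 banks use PA[6]. */
static inline unsigned t2_l2_bank(uint64_t pa, unsigned num_banks)
{
    switch (num_banks) {
    case 8:  return (unsigned)((pa >> 6) & 0x7);  /* PA[8:6] */
    case 4:  return (unsigned)((pa >> 6) & 0x3);  /* PA[7:6] */
    case 2:  return (unsigned)((pa >> 6) & 0x1);  /* PA[6]   */
    default: return 0;                            /* unsupported mode */
    }
}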
4.5.3 Memory Controller Unit (MCU)
OpenSPARC T2 has four MCUs, one for each memory branch, with each pair of
L2 banks interacting with exactly one DRAM branch. The branches are
interleaved based on physical address bits 7:6, and support 1–16 DDR2
DIMMs. Each memory branch is two FBD channels wide. A branch may use
only one of the FBD channels in a reduced power configuration.
Each DRAM branch operates independently and can have a different memory
size and a different kind of DIMM (for example, a different number of ranks
or different CAS latency). Software should not use address space larger than
four times the lowest memory capacity in a branch because the cache lines are
interleaved across branches. The DRAM Controller frequency is the same as
that of the DDR (double data rate) data buses, which is twice the DRAM clock
frequency. The FBDIMM links run at six times the frequency of the DDR data
buses.
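The interleaving and the resulting capacity rule can be illustrated with a small
C sketch (hypothetical helpers; only the branch-select bits and the
four-times-smallest-branch rule come from the text):

#include <stdint.h>

#define NUM_BRANCHES 4

/* A cache line's memory branch is selected by PA[7:6]. */
static inline unsigned t2_mem_branch(uint64_t pa)
{
    return (unsigned)((pa >> 6) & 0x3);
}

/* Because cache lines interleave across all four branches, software
 * should treat only 4x the smallest branch's capacity as usable, even
 * when the branches are populated unevenly. Sizes are in bytes. */
static inline uint64_t usable_memory(const uint64_t branch_bytes[NUM_BRANCHES])
{
    uint64_t min = branch_bytes[0];
    for (int b = 1; b < NUM_BRANCHES; b++)
        if (branch_bytes[b] < min)
            min = branch_bytes[b];
    return (uint64_t)NUM_BRANCHES * min;
}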
The OpenSPARC T2 MCU implements a DDR2 FBD design model that is
based on various JEDEC-approved DDR2 SDRAM and FBDIMM standards.
JEDEC has received information that certain patents or patent applications
may be relevant to the FBDIMM Advanced Memory Buffer standard (JESD82-
20) as well as other standards related to FBDIMM technology (JESD206). For
more information, see
http://www.jedec.org/download/search/FBDIMM/Patents.xls.
Sun Microsystems does not provide any legal opinions as to the validity or
relevancy of such patents or patent applications. Sun Microsystems
encourages prospective users of the OpenSPARC T2 MCU design to review
all information assembled by JEDEC and develop their own independent
conclusion.
4.5.4 Noncacheable Unit (NCU)
The NCU performs an address decode on I/O-addressable transactions and
directs them to the appropriate block (for example, DMU, CCU). In addition,
the NCU maintains the register status for external interrupts.
4.5.5 System Interface Unit (SIU)
The SIU connects the DMU and the L2 cache. The SIU is the L2 cache access
point for the network subsystem.
4.5.6 SSI ROM Interface (SSI)
OpenSPARC T2 has a 50 Mb/s serial interface (SSI), which connects to an
external boot ROM. In addition, the SSI supports PIO accesses across the SSI,
thus supporting optional Control and Status registers (CSRs) or other
interfaces attached to the SSI.
4.6 Summary of Differences Between OpenSPARC T1 and OpenSPARC T2
OpenSPARC T2 follows the CMT philosophy of OpenSPARC T1, but adds
more execution capability to each physical core, as well as significant system-
on-a-chip components and an enhanced L2 cache.
4.6.1 Microarchitectural Differences
The following list summarizes the microarchitectural differences:
* Each OpenSPARC T2 physical core consists of two integer execution pipelines
and a single floating-point pipeline. In OpenSPARC T1, each physical core has
a single integer execution pipeline, and all cores share a single
floating-point pipeline.
* Each physical core in OpenSPARC T2 supports eight strands, which all
share the floating-point pipeline. The eight strands are partitioned into two
groups of four strands, each of which shares an integer pipeline.
OpenSPARC T1 shares the single integer pipeline among four strands.
* The OpenSPARC T2 pipeline is eight stages, two stages longer than that of
OpenSPARC T1.
* The instruction cache is 8-way associative, compared to 4-way in
OpenSPARC T1.
* The L2 cache is 4-Mbyte, 8-banked and 16-way associative, compared to
3-Mbyte, 4-banked and 12-way associative in OpenSPARC T1.
* Data TLB is 128 entries, compared to 64 entries in OpenSPARC T1.
* The memory interface in OpenSPARC T2 supports fully buffered DIMMs
(FBDs), providing higher capacity and memory clock rates.
* The OpenSPARC T2 memory channels support a single-DIMM option for
low-cost configurations.
* OpenSPARC T2 includes a network interface unit (NIU), to which network
traffic management tasks can be off-loaded.
4.6.2 Instruction Set Architecture (ISA) Differences
There are a number of ISA differences between OpenSPARC T2 and
OpenSPARC T1, as follows:
* OpenSPARC T2 fully supports all VIS 2.0 instructions. OpenSPARC T1
supports a subset of VIS 1.0 plus the SIAM (Set Interval Arithmetic Mode)
instruction (on OpenSPARC T1, the remainder of VIS 1.0 and 2.0
instructions trap to software for emulation).
* OpenSPARC T2 supports the full CMP specification, as described in
UltraSPARC Architecture 2007. OpenSPARC T1 has its own version of
CMP control/status registers. OpenSPARC T2 consists of eight physical
cores, with eight virtual processors per physical core.
* OpenSPARC T2 does not support OpenSPARC T1’s idle state or its idle,
halt, or resume messages. Instead, OpenSPARC T2 supports parking and
unparking as specified in the CMP chapter of UltraSPARC Architecture
2007 Specification. Note that parking is similar to OpenSPARC T1’s idle
state. OpenSPARC T2 does support an equivalent to the halt state, which
on OpenSPARC T1 is entered by writing to HPR 1E₁₆. However,
OpenSPARC T2 does not support OpenSPARC T1’s STRAND_STS_REG
ASR, which holds the strand state. Halted state is not software-visible on
OpenSPARC T2.
* OpenSPARC T2 does not support the INT_VEC_DIS register (which
allows any OpenSPARC T1 strand to generate an interrupt, reset, idle, or
resume message to any strand). Instead, an alias to ASI_INTR_W is
provided, which allows only the generation of an interrupt to any strand.
* OpenSPARC T2 supports the ALLCLEAN, INVALW, NORMALW,
OTHERW, POPC, and FSQRT instructions in hardware.
* OpenSPARC T2’s floating-point unit generates fp_exception_other with
FSR.ftt = unfinished_FPop for most denorm cases and supports a nonstandard
mode that flushes denorms to zero. OpenSPARC T1 handles denorms in
hardware, never generates an unfinished_FPop, and does not support a
nonstandard mode.
* OpenSPARC T2 generates an illegal_instruction trap on any quad-precision
FP instruction, whereas OpenSPARC T1 generates an fp_exception_other
trap on numeric and move-FP-quad instructions. See Table 5-2 of the
UltraSPARC T2 Supplement to the “UltraSPARC Architecture 2007
Specification.”
* OpenSPARC T2 generates a privileged_action exception upon attempted
access to hyperprivileged ASIs by privileged software, whereas, in such
cases, OpenSPARC T1 takes a data_access_exception exception.
* OpenSPARC T2 supports PSTATE.tct; OpenSPARC T1 does not.
* OpenSPARC T2 implements the SAVE instruction similarly to all previous
UltraSPARC processors. OpenSPARC T1 implements a SAVE instruction
that updates the locals in the new window to be the same as the locals in
the old window, and swaps the ins (outs) of the old window with the outs
(ins) of the new window.
* PSTATE.am masking details differ between OpenSPARC T1 and
OpenSPARC T2, as described in Section 11.1.8 of the UltraSPARC T2
Supplement to the “UltraSPARC Architecture 2007 Specification.”
* OpenSPARC T2 implements PREFETCH fcn = 18₁₆ as a prefetch
invalidate cache entry, for efficient software cache flushing.
* The Synchronous Fault Status register (SFSR) is eliminated in OpenSPARC T2.
* OpenSPARC T1’s data_access_exception is replaced in OpenSPARC T2 by
multiple DAE_* exceptions.
* OpenSPARC T1’s instruction_access_exception is replaced in
OpenSPARC T2 by multiple IAE_* exceptions.
4.6.3 MMU Differences
The OpenSPARC T2 and OpenSPARC T1 MMUs differ as follows:
* OpenSPARC T2 has a 128-entry DTLB, whereas OpenSPARC T1 has a 64-
entry DTLB.
* OpenSPARC T2 supports a pair of primary context registers and a pair of
secondary context registers. OpenSPARC T1 supports a single primary
context register and a single secondary context register.
* OpenSPARC T2 does not support a locked bit in the TLBs.
OpenSPARC T1 supports a locked bit in the TLBs.
* OpenSPARC T2 supports only the sun4v (the architected interface between
privileged software and hyperprivileged software) TTE format for I/D-TLB
Data-In and Data-Access registers. OpenSPARC T1 supports both the
sun4v and the older sun4u TTE formats.
* OpenSPARC T2 is compatible with UltraSPARC Architecture 2007 with
regard to multiple flavors of data access exception (DAE_*) and instruction
access exception (IAE_*). As per UltraSPARC Architecture 2005,
OpenSPARC T1 uses the single flavor of data_access_exception and
instruction_access_exception, indicating the “flavors” in its SFSR
register.
* OpenSPARC T2 supports a hardware Table Walker to accelerate ITLB and
DTLB miss handling.
* The number and format of translation storage buffer (TSB) configuration
and pointer registers differs between OpenSPARC T1 and OpenSPARC T2.
OpenSPARC T2 uses physical addresses for TSB pointers; OpenSPARC T1
uses virtual addresses for TSB pointers.
* OpenSPARC T1 and OpenSPARC T2 support the same four page sizes (8
Kbyte, 64 Kbyte, 4 Mbyte, 256 Mbyte). OpenSPARC T2 generates an
unsupported_page_size trap when an illegal page size is programmed into
the TSB registers or an attempt is made to load such a page into the TLB. OpenSPARC T1
forces an illegal page size being programmed into TSB registers to be 256
Mbytes and generates a data_access_exception trap when a page with an
illegal size is loaded into the TLB.
* OpenSPARC T2 adds a demap real operation, which demaps all pages with
r = 1 from the TLB.
* OpenSPARC T2 supports an I-TLB probe ASI.
* Autodemapping of pages in the TLBs only demaps pages of the same size
or of a larger size in OpenSPARC T2. In OpenSPARC T1, autodemap
demaps pages of the same size, larger size, or smaller size.
* OpenSPARC T2 supports detection of multiple hits in the TLBs.
4.6.4 Performance Instrumentation Differences
Both OpenSPARC T1 and OpenSPARC T2 provide access to hardware
performance counters through the PIC and PCR registers. However, the
events captured by the hardware differ significantly between OpenSPARC T1
and OpenSPARC T2, with OpenSPARC T2 capturing a much larger set of
events, as described in Chapter 10 of the UltraSPARC T2 Supplement to the
“UltraSPARC Architecture 2007 Specification.” OpenSPARC T2 also supports
counting events in hyperprivileged mode; OpenSPARC T1 does not.
In addition, the implementation of pic_overflow differs between
OpenSPARC T1 and OpenSPARC T2. OpenSPARC T1 provides a disrupting
pic_overflow trap on the instruction following the one that caused the
overflow event. OpenSPARC T2 provides a disrupting pic_overflow on an
instruction that generates the event, but that instruction may occur within an
epsilon number of event-generating instructions of the actual overflow.
Both OpenSPARC T2 and OpenSPARC T1 support DRAM performance
counters.
4.6.5 Error Handling Differences
Error handling differs significantly between OpenSPARC T1 and
OpenSPARC T2. OpenSPARC T1 primarily employs hardware correction of
errors, whereas OpenSPARC T2 primarily employs software correction of
errors.
* OpenSPARC T2 uses the following traps for error handling:
* data_access_error
* data_access_MMU_error
* hw_corrected_error
* instruction_access_error
* instruction_access_MMU_error
* internal_processor_error
* store_error
* sw_recoverable_error