The Way of the great learning involves manifesting virtue, renovating the people, and abiding by the highest good.

Wednesday, December 31, 2008

voyage linux

Debian Lovers - Why I love Voyage Linux      http://www.adamsinfo.com/debian-lovers-why-i-love-voyage-linux/


I'm Adam Palmer, and I'm an embedded Linux/embedded hardware enthusiast. I spend most of my time managing server clusters and doing some PHP/MySQL development, whilst what little spare time I have is dedicated to playing with gadgets. I'm always happy to consult on or manage any Linux-related, web application or hosting project. Please get in touch with me!

For those Debian lovers out there, I have finally found a great embedded distro. I’ve always stayed away from the multitude of distros available, each with their own package manager or lack thereof, each with their own preinstalled software or, again, lack thereof, and each with their own caveats.

I began my journey into Linux with SuSE about 11 years ago at the time of writing, and have also given RedHat a fair chance in the past. In my first job I was forced to battle against Slackware for two years, and about 7 years ago, discovered Debian.

From that point on, my love for Debian has been absolutely unshakeable. Why do I love Debian? It’s perfectly crafted, clean, unbloated, exquisitely simple yet powerful even for the more novice user, and elegantly understated. I toyed with Ubuntu for some time as a desktop OS, as it’s Debian based and does most of the Xorg configuration hassle for you, although I usually find myself using a minimalist Windows XP Professional setup on my desktops.

Enter Voyage Linux. It must have been fate... I was browsing the ALIX web site and noticed, amongst many others, that ‘Voyage Linux’ was supported. Something other than its mediocre name must have drawn my attention to it, as a quick Google search led me to http://linux.voyage.hk

The first sentence sold it to me: “Voyage Linux is Debian derived distribution that is best run on a x86-based embedded platforms such as WRAP, ALIX and Soekris 45xx/48xx boards.”

I downloaded and installed Voyage on my ALIX robot board in minutes, and without delay was up and running. If Debian is clean and unbloated, Voyage is anorexic. It’s so lightweight by default that you don’t even need to consider removing packages from the default installation. It has a fully working apt package manager, and you can also happily install packages from the main Debian repositories. It looks, works and behaves just like Debian, with no annoyances or broken/badly behaved tools. It also ships with a working set of base utilities and does not rely on busybox as some embedded distros do. The default kernel is incredibly well crafted too, and includes madwifi support by default, which again is great as I use an Atheros chipset. I quickly and cleanly upgraded from the included 2.6.24 to a regular 2.6.27.6 from kernel.org, which was the newest available at the time, rebooted into 2.6.27.6 with no complaints, and began copying my C applications over to the new distro to operate the robot. Absolutely perfect and no complaints.
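For the curious, that upgrade amounts to a standard vanilla kernel build. The sketch below is my guess at the steps rather than the author's actual commands: the version number comes from the post, while the config path and the use of make oldconfig are assumptions about a typical Debian-style setup.

    # Fetch and unpack the vanilla source (version as mentioned above)
    wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.27.6.tar.bz2
    tar xjf linux-2.6.27.6.tar.bz2
    cd linux-2.6.27.6

    # Start from the running Voyage config (this path is an assumption),
    # then answer prompts for options that are new since 2.6.24
    cp /boot/config-2.6.24-voyage .config
    make oldconfig

    # Build and install the kernel plus modules, then reboot into it
    make
    make modules_install install
    reboot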

For a hardcore Debian lover who would never imagine installing anything else… ever… Voyage Linux is now immediately available on my Key chain USB mass storage stick, and I would be hard pushed for a reason to go back to Debian on any embedded/minimalist hardware project.

About Voyage Linux

Voyage Linux is a Debian-derived distribution that is best run on x86-based embedded platforms such as WRAP, ALIX and Soekris 45xx/48xx boards.

It can also run on low-end x86 PC platforms. A typical installation requires 128MB of disk space, although larger storage allows more packages to be installed. Voyage Linux is so small that it is best suited to running a full-featured firewall, wireless access point, VoIP gateway or network storage device.



http://sumancolumbia.blogspot.com/2008/07/long-voyage-modpost-to-work-and-compile.html    

Long "voyage": modpost to work and compile

Ah... after weeks of hacking through kernel code and Makefiles, I am finally able to compile and run modpost, which I need for compiling my kernel modules on Voyage Linux 0.4.

This is the story: I need to install some additional kernel modules for the WORKIT project on Voyage. All of the kernel modules pass step 1 of the Makefile and compile, but then I get this error:

Building modules, stage 2.
MODPOST 1 modules
/bin/sh: scripts/mod/modpost: No such file or directory


A long search for modpost led me to find that modpost is used for compiling modules, and is part of the kernel development tools (sorry, missing the link for that post, and couldn't find it again on Google - tells you how exotic this modpost is.)

After a lot more trial and error - which included downloading and trying to recompile the Voyage Linux kernels - I found two posts today that helped immensely in coming to the last step:
  • No modpost directory - Linux Forums: this helped me realize how to build modpost (though it referenced the wrong directory; on my Voyage Linux, the code is at /lib/modules/2.6.20-486-voyage/build). Still, I got the exact same error message as in the forum post: unrecognized command line option "-m".

I copied the makefile command, removed the -m directive, and tried compiling again; this time I got a "missing elfconfig.h" error message and other errors dependent on it.

A search for "elfconfig.h" led me to a post on the Linux kernel mailing list about the missing elfconfig.h, which hints at running "make modules" and "make scripts".

Running those two commands in the /lib/modules/2.6.20-486-voyage/build directory solves the problem - it creates the elfconfig.h (!) file and compiles modpost. (Expect this to take a while - it makes all modules and scripts.)
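For reference, here is the whole fix in one place - a sketch assuming, as in the post, that the kernel source tree for the running kernel lives under /lib/modules/2.6.20-486-voyage/build. The module directory name in the last step is a made-up placeholder, not a path from the post.

    # Build modpost inside the kernel tree (this also generates elfconfig.h)
    cd /lib/modules/2.6.20-486-voyage/build
    make scripts        # compiles scripts/mod/modpost
    make modules        # expect this to take a while

    # Then rebuild the out-of-tree module against that tree
    # ("~/workit-module" is a hypothetical path)
    cd ~/workit-module
    make -C /lib/modules/2.6.20-486-voyage/build M=$(pwd) modules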
Added information:

The Internet Real-Time Lab (IRT) in the Computer Science Department at Columbia University conducts research in the areas of Internet and multimedia services: Internet telephony, wireless and mobile networks, streaming, quality of service, resource reservation, dynamic pricing for the Internet, network measurement and reliability, service location, network security, media on demand, content distribution networks, multicast networks and ubiquitous and context-aware computing and communication.

Tuesday, December 30, 2008

Linux: The 0.01 Release


Submitted by Jeremy on July 26, 2007 - 3:56pm.

"This is a free minix-like kernel for i386(+) based AT-machines," began the Linux version 0.01 release notes in September of 1991 for the first release of the Linux kernel. "As the version number (0.01) suggests this is not a mature product. Currently only a subset of AT-hardware is supported (hard-disk, screen, keyboard and serial lines), and some of the system calls are not yet fully implemented (notably mount/umount aren't even implemented)." Booting the original 0.01 Linux kernel required bootstrapping it with minix, and the keyboard driver was written in assembly and hard-wired for a Finnish keyboard. The listed features were mostly presented as a comparison to minix and included, efficiently using the 386 chip rather than the older 8088, use of system calls rather than message passing, a fully multithreaded FS, minimal task switching, and visible interrupts. Linus Torvalds noted, "the guiding line when implementing linux was: get it working fast. I wanted the kernel simple, yet powerful enough to run most unix software." In a section titled "Apologies :-)" he noted:

"This isn't yet the 'mother of all operating systems', and anyone who hoped for that will have to wait for the first real release (1.0), and even then you might not want to change from minix. This is a source release for those that are interested in seeing what linux looks like, and it's not really supported yet."

Linus had originally intended to call the new kernel "Freax". According to Wikipedia, the name Linux was actually coined by Ari Lemmke, who maintained the ftp.funet.fi FTP server from which the kernel was originally distributed.

The initial post that Linus made about Linux was to the comp.os.minix Usenet group titled, "What would you like to see most in minix". It began:

"I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things)."

Later in the same thread, Linus went on to talk about how unportable the code was:

"Simply, I'd say that porting is impossible. It's mostly in C, but most people wouldn't call what I write C. It uses every conceivable feature of the 386 I could find, as it was also a project to teach me about the 386. As already mentioned, it uses a MMU, for both paging (not to disk yet) and segmentation. It's the segmentation that makes it REALLY 386 dependent (every task has a 64Mb segment for code & data - max 64 tasks in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).

"It also uses every feature of gcc I could find, specifically the __asm__ directive, so that I wouldn't need so much assembly language objects. Some of my 'C'-files (specifically mm.c) are almost as much assembler as C. It would be 'interesting' even to port it to another compiler (though why anybody would want to use anything other than gcc is a mystery).

"Unlike minix, I also happen to LIKE interrupts, so interrupts are handled without trying to hide the reason behind them (I especially like my hard-disk-driver. Anybody else make interrupts drive a state-machine?). All in all it's a porters nightmare. "

Indeed, Linux 1.0 was released on March 13th, 1994 supporting only the 32-bit i386 architecture. However, by the release of Linux 1.2 on March 7th, 1995 it had already been ported to 32-bit MIPS, 32-bit SPARC, and the 64-bit Alpha. By the release of Linux 2.0 on June 9th, 1996 support had also been added for the 32-bit m68k and 32-bit PowerPC architectures. And jumping forward to the Linux 2.6 kernel, first released in December of 2003, it has been and continues to be ported to numerous additional architectures.



Linux 0.01 release notes:

Notes for linux release 0.01

0. Contents of this directory

linux-0.01.tar.Z - sources to the kernel
bash.Z           - compressed bash binary if you want to test it
update.Z         - compressed update binary
RELNOTES-0.01    - this file

1. Short intro

This is a free minix-like kernel for i386(+) based AT-machines. Full source is included, and this source has been used to produce a running kernel on two different machines. Currently there are no kernel binaries for public viewing, as they have to be recompiled for different machines. You need to compile it with gcc (I use 1.40, don't know if 1.37.1 will handle all __asm__-directives), after having changed the relevant configuration file(s).

As the version number (0.01) suggests this is not a mature product. Currently only a subset of AT-hardware is supported (hard-disk, screen, keyboard and serial lines), and some of the system calls are not yet fully implemented (notably mount/umount aren't even implemented). See comments or readme's in the code.

This version is also meant mostly for reading - ie if you are interested in how the system looks like currently. It will compile and produce a working kernel, and though I will help in any way I can to get it working on your machine (mail me), it isn't really supported. Changes are frequent, and the first "production" version will probably differ wildly from this pre-alpha-release.

Hardware needed for running linux:
- 386 AT
- VGA/EGA screen
- AT-type harddisk controller (IDE is fine)
- Finnish keyboard (oh, you can use a US keyboard, but not without some practise :-)

The Finnish keyboard is hard-wired, and as I don't have a US one I cannot change it without major problems. See kernel/keyboard.s for details. If anybody is willing to make an even partial port, I'd be grateful. Shouldn't be too hard, as it's tabledriven (it's assembler though, so ...)

Although linux is a complete kernel, and uses no code from minix or other sources, almost none of the support routines have yet been coded. Thus you currently need minix to bootstrap the system. It might be possible to use the free minix demo-disk to make a filesystem and run linux without having minix, but I don't know...

2. Copyrights etc

This kernel is (C) 1991 Linus Torvalds, but all or part of it may be redistributed provided you do the following:

- Full source must be available (and free), if not with the distribution then at least on asking for it.
- Copyright notices must be intact. (In fact, if you distribute only parts of it you may have to add copyrights, as there aren't (C)'s in all files.) Small partial excerpts may be copied without bothering with copyrights.
- You may not distibute this for a fee, not even "handling" costs.

Mail me at [email blocked] if you have any questions.

Sadly, a kernel by itself gets you nowhere. To get a working system you need a shell, compilers, a library etc. These are separate parts and may be under a stricter (or even looser) copyright. Most of the tools used with linux are GNU software and are under the GNU copyleft. These tools aren't in the distribution - ask me (or GNU) for more info.

3. Short technical overview of the kernel.

The linux kernel has been made under minix, and it was my original idea to make it binary compatible with minix. That was dropped, as the differences got bigger, but the system still resembles minix a great deal. Some of the key points are:

- Efficient use of the possibilities offered by the 386 chip. Minix was written on a 8088, and later ported to other machines - linux takes full advantage of the 386 (which is nice if you /have/ a 386, but makes porting very difficult)
- No message passing, this is a more traditional approach to unix. System calls are just that - calls. This might or might not be faster, but it does mean we can dispense with some of the problems with messages (message queues etc). Of course, we also miss the nice features :-p.
- Multithreaded FS - a direct consequence of not using messages. This makes the filesystem a bit (a lot) more complicated, but much nicer. Coupled with a better scheduler, this means that you can actually run several processes concurrently without the performance hit induced by minix.
- Minimal task switching. This too is a consequence of not using messages. We task switch only when we really want to switch tasks - unlike minix which task-switches whatever you do. This means we can more easily implement 387 support (indeed this is already mostly implemented)
- Interrupts aren't hidden. Some people (among them Tanenbaum) think interrupts are ugly and should be hidden. Not so IMHO. Due to practical reasons interrupts must be mainly handled by machine code, which is a pity, but they are a part of the code like everything else. Especially device drivers are mostly interrupt routines - see kernel/hd.c etc.
- There is no distinction between kernel/fs/mm, and they are all linked into the same heap of code. This has it's good sides as well as bad. The code isn't as modular as the minix code, but on the other hand some things are simpler. The different parts of the kernel are under different sub-directories in the source tree, but when running everything happens in the same data/code space.

The guiding line when implementing linux was: get it working fast. I wanted the kernel simple, yet powerful enough to run most unix software. The file system I couldn't do much about - it needed to be minix compatible for practical reasons, and the minix filesystem was simple enough as it was. The kernel and mm could be simplified, though:

- Just one data structure for tasks. "Real" unices have task information in several places, I wanted everything in one place.
- A very simple memory management algorithm, using both the paging and segmentation capabilities of the i386. Currently MM is just two files - memory.c and page.s, just a couple of hundreds of lines of code.

These decisions seem to have worked out well - bugs were easy to spot, and things work.

4. The "kernel proper"

All the routines handling tasks are in the subdirectory "kernel". These include things like 'fork' and 'exit' as well as scheduling and minor system calls like 'getpid' etc. Here are also the handlers for most exceptions and traps (not page faults, they are in mm), and all low-level device drivers (get_hd_block, tty_write etc). Currently all faults lead to a exit with error code 11 (Segmentation fault), and the system seems to be relatively stable ("crashme" hasn't - yet).

5. Memory management

This is the simplest of all parts, and should need only little changes. It contains entry-points for some things that the rest of the kernel needs, but mostly copes on it's own, handling page faults as they happen. Indeed, the rest of the kernel usually doesn't actively allocate pages, and just writes into user space, letting mm handle any possible 'page-not-present' errors.

Memory is dealt with in two completely different ways - by paging and segmentation. First the 386 VM-space (4GB) is divided into a number of segments (currently 64 segments of 64Mb each), the first of which is the kernel memory segment, with the complete physical memory identity-mapped into it. All kernel functions live within this area.

Tasks are then given one segment each, to use as they wish. The paging mechanism sees to filling the segment with the appropriate pages, keeping track of any duplicate copies (created at a 'fork'), and making copies on any write. The rest of the system doesn't need to know about all this.

6. The file system

As already mentioned, the linux FS is the same as in minix. This makes crosscompiling from minix easy, and means you can mount a linux partition from minix (or the other way around as soon as I implement mount :-). This is only on the logical level though - the actual routines are very different.

NOTE! Minix-1.6.16 seems to have a new FS, with minor modifications to the 1.5.10 I've been using. Linux won't understand the new system.

The main difference is in the fact that minix has a single-threaded file-system and linux hasn't. Implementing a single-threaded FS is much easier as you don't need to worry about other processes allocating buffer blocks etc while you do something else. It also means that you lose some of the multiprocessing so important to unix.

There are a number of problems (deadlocks/raceconditions) that the linux kernel needed to address due to multi-threading. One way to inhibit race-conditions is to lock everything you need, but as this can lead to unnecessary blocking I decided never to lock any data structures (unless actually reading or writing to a physical device). This has the nice property that dead-locks cannot happen. Sadly it has the not so nice property that race-conditions can happen almost everywhere. These are handled by double-checking allocations etc (see fs/buffer.c and fs/inode.c). Not letting the kernel schedule a task while it is in supervisor mode (standard unix practise), means that all kernel/fs/mm actions are atomic (not counting interrupts, and we are careful when writing those) if you don't call 'sleep', so that is one of the things we can count on.

7. Apologies :-)

This isn't yet the "mother of all operating systems", and anyone who hoped for that will have to wait for the first real release (1.0), and even then you might not want to change from minix. This is a source release for those that are interested in seeing what linux looks like, and it's not really supported yet. Anyone with questions or suggestions (even bug-reports if you decide to get it working on your system) is encouraged to mail me.

8. Getting it working

Most hardware dependancies will have to be compiled into the system, and there a number of defines in the file "include/linux/config.h" that you have to change to get a personalized kernel. Also you must uncomment the right "equ" in the file boot/boot.s, telling the bootup-routine what kind of device your A-floppy is. After that a simple "make" should make the file "Image", which you can copy to a floppy (cp Image /dev/PS0 is what I use with a 1.44Mb floppy). That's it.

Without any programs to run, though, the kernel cannot do anything. You should find binaries for 'update' and 'bash' at the same place you found this, which will have to be put into the '/bin' directory on the specified root-device (specified in config.h). Bash must be found under the name '/bin/sh', as that's what the kernel currently executes. Happy hacking.

		Linus Torvalds		[email blocked]

		Petersgatan 2 A 2
		00140 Helsingfors 14
		FINLAND


First posting about Linux:

From: Linus Benedict Torvalds
Newsgroups: comp.os.minix
Subject: Gcc-1.40 and a posix-question
Date: 3 Jul 91 10:00:50 GMT

Hello netlanders,

Due to a project I'm working on (in minix), I'm interested in the posix standard definition. Could somebody please point me to a (preferably) machine-readable format of the latest posix rules? Ftp-sites would be nice.

As an aside for all using gcc on minix - the new version (1.40) has been out for some weeks, and I decided to test what needed to be done to get it working on minix (1.37.1, which is the version you can get from plains is nice, but 1.40 is better :-). To my surpice, the answer turned out to be - NOTHING! Gcc-1.40 compiles as-is on minix386 (with old gcc-1.37.1), with no need to change source files (I changed the Makefile and some paths, but that's it!). As default this results in a compiler that uses floating point insns, but if you'd rather not, changing 'toplev.c' to define DEFAULT_TARGET from 1 to 0 (this is from memory - I'm not at my minix-box) will handle that too. Don't make the libs, use the old gnulib&libc.a. I have successfully compiled 1.40 with itself, and everything works fine (I got the newest versions of gas and binutils at the same time, as I've heard of bugs with older versions of ld.c). Makefile needs some chmem's (and gcc2minix if you're still using it).

		Linus Torvalds		[email blocked]

PS. Could someone please try to finger me from overseas, as I've installed a "changing .plan" (made by your's truly), and I'm not certain it works from outside? It should report a new .plan every time.



First Linux announcement:

From: Linus Benedict Torvalds [email blocked]
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 Aug 91 20:57:08 GMT

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I'll get something practical within a few months, and I'd like to know what features most people would want. Any suggestions are welcome, but I won't promise I'll implement them :-)

		Linus (torva... at kruuna.helsinki.fi)

PS. Yes - it's free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.


From: Jyrki Kuoppala [email blocked]
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 Aug 91 23:44:50 GMT

In article Linus Benedict Torvalds writes:

>I've currently ported bash(1.08) and gcc(1.40), and things seem to work.
>This implies that I'll get something practical within a few months, and
>I'd like to know what features most people would want. Any suggestions
>are welcome, but I won't promise I'll implement them :-)

Tell us more! Does it need a MMU?

>PS. Yes - it's free of any minix code, and it has a multi-threaded fs.
>It is NOT protable (uses 386 task switching etc)

How much of it is in C? What difficulties will there be in porting? Nobody will believe you about non-portability ;-), and I for one would like to port it to my Amiga (Mach needs a MMU and Minix is not free).

As for the features; well, pseudo ttys, BSD sockets, user-mode filesystems (so I can say cat /dev/tcp/kruuna.helsinki.fi/finger), window size in the tty structure, system calls capable of supporting POSIX.1. Oh, and bsd-style long file names.

//Jyrki


From: Linus Benedict Torvalds [email blocked]
Newsgroups: comp.os.minix
Subject: Re: What would you like to see most in minix?
Date: 26 Aug 91 11:06:02 GMT

In article Jyrki Kuoppala writes:

>> [re: my post about my new OS]
>Tell us more! Does it need a MMU?

Yes, it needs a MMU (sorry everybody), and it specifically needs a 386/486 MMU (see later).

>>PS. Yes - it's free of any minix code, and it has a multi-threaded fs.
>>It is NOT protable (uses 386 task switching etc)
>How much of it is in C? What difficulties will there be in porting?
>Nobody will believe you about non-portability ;-), and I for one would
>like to port it to my Amiga (Mach needs a MMU and Minix is not free).

Simply, I'd say that porting is impossible. It's mostly in C, but most people wouldn't call what I write C. It uses every conceivable feature of the 386 I could find, as it was also a project to teach me about the 386. As already mentioned, it uses a MMU, for both paging (not to disk yet) and segmentation. It's the segmentation that makes it REALLY 386 dependent (every task has a 64Mb segment for code & data - max 64 tasks in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).

It also uses every feature of gcc I could find, specifically the __asm__ directive, so that I wouldn't need so much assembly language objects. Some of my "C"-files (specifically mm.c) are almost as much assembler as C. It would be "interesting" even to port it to another compiler (though why anybody would want to use anything other than gcc is a mystery).

Unlike minix, I also happen to LIKE interrupts, so interrupts are handled without trying to hide the reason behind them (I especially like my hard-disk-driver. Anybody else make interrupts drive a state-machine?). All in all it's a porters nightmare.

>As for the features; well, pseudo ttys, BSD sockets, user-mode
>filesystems (so I can say cat /dev/tcp/kruuna.helsinki.fi/finger),
>window size in the tty structure, system calls capable of supporting
>POSIX.1. Oh, and bsd-style long file names.

Most of these seem possible (the tty structure already has stubs for window size), except maybe for the user-mode filesystems. As to POSIX, I'd be delighted to have it, but posix wants money for their papers, so that's not currently an option. In any case these are things that won't be supported for some time yet (first I'll make it a simple minix-lookalike, keyword SIMPLE).

		Linus [email blocked]

PS. To make things really clear - yes I can run gcc on it, and bash, and most of the gnu [bin/file]utilities, but it's not very debugged, and the library is really minimal. It doesn't even support floppy-disks yet. It won't be ready for distribution for a couple of months. Even then it probably won't be able to do much more than minix, and much less in some respects. It will be free though (probably under gnu-license or similar).


From: Alan Barclay [email blocked]
Newsgroups: comp.os.minix
Subject: Re: What would you like to see most in minix?
Date: 27 Aug 91 14:34:32 GMT

In article Linus Benedict Torvalds writes:

>yet) and segmentation. It's the segmentation that makes it REALLY 386
>dependent (every task has a 64Mb segment for code & data - max 64 tasks
>in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).

Is that max 64 64Mb tasks or max 64 tasks no matter what their size?

--
Alan Barclay
iT                   | E-mail : [email blocked]
Barker Lane          | BANG-STYLE : [email blocked]
CHESTERFIELD S40 1DY | VOICE : +44 246 214241


From: Linus Benedict Torvalds [email blocked]
Newsgroups: comp.os.minix
Subject: Re: What would you like to see most in minix?
Date: 28 Aug 91 10:56:19 GMT

In article Alan Barclay writes:

>In article Linus Benedict Torvalds writes:
>>yet) and segmentation. It's the segmentation that makes it REALLY 386
>>dependent (every task has a 64Mb segment for code & data - max 64 tasks
>>in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).
>Is that max 64 64Mb tasks or max 64 tasks no matter what their size?

I'm afraid that is 64 tasks max (and one is used as swapper), no matter how small they should be. Fragmentation is evil - this is how it was handled. As the current opinion seems to be that 64 Mb is more than enough, but 64 tasks might be a little crowded, I'll probably change the limits be easily changed (to 32Mb/128 tasks for example) with just a recompilation of the kernel. I don't want to be on the machine when someone is spawning >64 processes, though :-)

		Linus



Early Linux installation guide:

Installing Linux on your system

Ok, this is a short guide for those people who actually want to get a running system, not just look at the pretty source code :-). You'll certainly need minix for most of the steps.

0. Back up any important software. This kernel has been working beautifully on my machine for some time, and has never destroyed anything on my hard-disk, but you never can be too careful when it comes to using the disk directly. I'd hate to get flames like "you destroyed my entire collection of Sam Fox nude gifs (all 103 of them), I'll hate you forever", just because I may have done something wrong.

Double-check your hardware. If you are using other than EGA/VGA, you'll have to make the appropriate changes to 'linux/kernel/console.c', which may not be easy. If you are able to use the at_wini.c under minix, linux will probably also like your drive. If you feel comfortable with scan-codes, you might want to hack 'linux/kernel/keyboard.s' making it more practical for your [US|German|...] keyboard.

1. Decide on what root device you'll be using. You can use any (standard) partition on any of your harddisks, the numbering is the same as for minix (ie 0x306, which I'm using, means partition 1 on hd2). It is certainly possible to use the same device as for minix, but I wouldn't recommend it. You'd have to change pathnames (or make a chroot in init) to get minix and linux to live together peacefully.

I'd recommend making a new filesystem, and filling it with the necessary files: You need at least the following:

- /dev/tty0 (same as under minix, ie mknod ...)
- /dev/tty (same as under minix)
- /bin/sh (link to bash)
- /bin/update (I guess this should be /etc/update ...)

Note that linux and minix binaries aren't compatible, although they use the same (gcc-)header (for ease of cross-compiling), so running one under the other will result in errors.

2. Compile the source, making necessary changes into the makefiles and linux/include/linux/config.h and linux/boot/boot.s. I'm using a slightly hacked gcc-1.40, to which I have added a -mstring-insns flag, which uses the i386 string instructions for structure copy etc. Removing the flag from all makefiles should do the trick for you. NOTE! I'm using -Wall, and I'm not seeing many warnings (2 I think, one about _exit returning although it's volatile - it's ok.) If you get more warnings when compiling, something's wrong.

3. Copy the resultant code to a diskette of the right type. Use 'cp Image /dev/PS0' or equivalent.

4. Boot with the new diskette. If you've done everything right (and if *I've* done everything right), you should now be running bash as root. You can't do much (alias ls='echo *' is a good idea :-), but if you do run, most other things should work. I'd be happy to hear from anybody that has come this far - and I'll send any ported binaries you might want (and I have). I'll also put them out for ftp if there is enough interest. With gcc, make and uemacs, I've been able to stop crosscompiling and actually compile natively under linux. (I also have a term-emu, sz/rz, sed, etc ...)

The boot-sequence should start with "Loading system...", and then a "Partition table ok" followed by some root-dev info. If you forget to make the /dev/tty0-character device, you'll never see anything but the "loading" message. Hopefully errors will be told to the console, but if there are problems at boot-up there is a distinct possibility that the machine just hangs.

5. Check the new filesystem regularly with (minix) fsck. I haven't got any errors for some time now, but I cannot guarantee that this means it will never happen. Due to slight differences in 'unlink', fsck will report "mode inode XXX not cleared", but that isn't an error, and you can safely ignore it (if you don't like it, do a fsck -a every once in a while). Minix "restore" will not work on a file deleted with linux - so be extra careful if you have a tendency to delete files you don't really want to. Logging out from the "login-shell" will automatically do a sync, and will leave you hanging without any processes (except update, which isn't much fun), so do the "three-finger-salute" to restart dos/minix/linux or whatever.

6. Mail me and ask about problems/updates etc. Even more welcome are success-reports (yeah, sure), and bugreports or even patches (or pointers to corrections).

NOTE!!! I haven't included diffs with the binaries I've posted for the simple reason that there aren't any - I've had this silly idea that I'd rather change the OS than do a lot of porting. All source to the binaries can be found on nic.funet.fi under /pub/gnu or /pub/unix. Changes have been to makefiles or configuration files, and anybody interested in them might want to contact me. Mostly it's been a matter of adding a -DUSG to makefiles.

The one exception if gcc - I've made some hacks on it (string-insns), and have got it (with the gracious help of Bruce Evans) to correctly emit software floating point. I haven't got diffs to that one either, as my hard-disk is overflowing and I cannot accomodate both originals and changes, but as per the GNU copyleft I'll make them available if someone wants them. I hope nobody want's them :-)

		Linus [email blocked]



README about early pictures of Linus Torvalds:

I finally got these made, and even managed to persuade Linus into allowing me to publish three pictures instead of only the first one. (He still vetoes the one with the toy moose... :-)

linus1.gif, linus2.gif, linus3.gif

Three pictures of Linus Torvalds, showing what a despicable figure he is in real life. The beer is from the pre-Linux era, so it's not virtual.

In nic.funet.fi: pub/OS/Linux/doc/PEOPLE.

-- Lars.Wirzenius [email blocked]
(finger wirzeniu at klaava.helsinki.fi)

MS-DOS, you can't live with it, you can live without it.
Attachment Size
linux-0.01.tar.bz2 61.88 KB
linux-0.01.tar.bz2.sign 248 bytes
linux-0.01.tar.gz 71.38 KB
linux-0.01.tar.gz.sign 248 bytes
linus1.gif 104.93 KB
linus2.gif 72.24 KB
linus3.gif 123.49 KB

http://kerneltrap.org/node/14002

10 Linux Predictions for 2009

Dec 29th, 2008, 2:20 pm
Everyone wants to know what's going to happen in the new year, as if anyone can accurately predict these things. However, one can deduce, with reasonable accuracy, that there will be innovations designed to get our attention. This is my list of Linux-oriented predictions for 2009.
The keyword for 2009 is Innovation.

1. Buyouts/Mergers - 2009 will see its share of company buyouts and mergers--all innovation-related. Larger companies will buy up smaller ones with innovative products and services. Many new open source millionaires will be created through these transitions.

2. Gadgets, Gadgets, and more Gadgets - This will be the Year of the Gadget, and the gadgets will be Linux-powered. You'll see dozens of new gadgets, from phones to home appliances to weather stations, come out in 2009, all designed to attract your attention and your money. Watch for rapidly falling prices on these little gems along the way too.

3. Virtualization - Linux-powered virtualization in the form of virtual appliances, virtual services, and hosted solutions is going to overwhelm even the most enthusiastic virtualization aficionados among you. I will have plenty of fodder for my Virtualization column at linux-mag.com as well as posts here on DaniWeb. I expect to see weekly announcements for new products, new services, and new companies popping up to solve our problems.

4. Desktop Innovations - Ahh, the pet peeve of every IT jock in the business: Desktop Linux. Well, hold on to your shorts, naysayers, this is going to be one helluva ride through the dark recesses of the Desktop nether regions. Expect big things in the Linux Desktop in 2009 as a true Microsoft killer emerges from an unlikely source. Windows 7 will be laughable by comparison.

5. Portable Servers - Say what? Oh yeah, get ready for this: Portable Servers. I've written articles on this and now it will happen from commercial sources. Truly portable services on portable servers. You'll be able to provide services to any group, anywhere, any time with these. Want to have a LAN party at a Community Center? Take your WiFi-enabled portable server and get to it.

6. Embedded Systems - This is one area that will enjoy quantum leaps of innovation in the coming year. Embedded Linux systems will power microwave ovens, regular ovens, sprinkler systems, robotic maids, and much more. Get ready for the embedded revolution.

7. Game Console - A major game console manufacturer will switch to Linux for their operating system to power the most innovative game console yet. You think the Wii is cool? Just wait till you see what's brewing elsewhere.

8. Home Automation - New homes will not only be built with green technology, green materials, and more efficient fixtures; those fixtures and technologies will be powered by Linux. Your new home will resemble something from The Jetsons rather than something from contemporary life. For existing homes, there will be packages available to upgrade your home to a "smart" home but it will still pale in comparison to a freshly built home with the innovations built-in.

9. Automobiles - Auto companies need to trim their budgets and, executive salaries aside, they need to use some innovative technologies to remedy some of their money angst. Linux-powered car brains with pluggable and programmable modules will arise as one solution. I wouldn't be a bit surprised to see new car companies emerge using green technologies and Linux as part of the mix.

10. Cloud Computing - 2009 is going to be a big year for The Cloud and Cloud Computing. Linux-powered Cloud vendors will win out over those who employ other operating systems. Why? Not just costs, but also because the major virtualization vendors use Linux for their virtual platforms (VMware and Xen). Cloud vendors are going to use technology that's cheap, easy to maintain, commercially supported, and mature--in other words, Linux.

There you have them: my Linux predictions for 2009. The proof of my prophecy won't be available until this time next year. Stay tuned and keep referring to this post over the coming year to check my status. I will refer back to it as innovations emerge.

Do you have any Linux predictions for 2009?

How Small Can Computers Get? Computing in a Molecule


Posted by ScuttleMonkey on Tuesday December 30, @05:36AM
from the nano-pc dept.
ScienceDaily looks at what the future might bring for atomic-scale computing. "Joachim, the head of the CEMES Nanoscience and Picotechnology Group (GNS), is currently coordinating a team of researchers from 15 academic and industrial research institutes in Europe whose groundbreaking work on developing a molecular replacement for transistors has brought the vision of atomic-scale computing a step closer to reality. Their efforts, a continuation of work that began in the 1990s, are today being funded by the European Union in the Pico-Inside project. [...] The team has managed to design a simple logic gate with 30 atoms that performs the same task as 14 transistors, while also exploring the architecture, technology and chemistry needed to achieve computing inside a single molecule and to interconnect molecules."

http://www.sciencedaily.com/releases/2008/12/081222113532.htm

How Small Can Computers Get? Computing In A Molecule

ScienceDaily (Dec. 30, 2008) — Over the last 60 years, ever-smaller generations of transistors have driven exponential growth in computing power. Could molecules, each turned into a minuscule computer component, trigger even greater growth in computing over the next 60?

Atomic-scale computing, in which computer processes are carried out in a single molecule or using a surface atomic-scale circuit, holds vast promise for the microelectronics industry. It allows computers to continue to increase in processing power through the development of components at the nano- and picoscale. In theory, atomic-scale computing could put computers more powerful than today’s supercomputers in everyone’s pocket.

“Atomic-scale computing researchers today are in much the same position as transistor inventors were before 1947. No one knows where this will lead,” says Christian Joachim of the French National Scientific Research Centre’s (CNRS) Centre for Material Elaboration & Structural Studies (CEMES) in Toulouse, France.

Joachim, the head of the CEMES Nanoscience and Picotechnology Group (GNS), is currently coordinating a team of researchers from 15 academic and industrial research institutes in Europe whose groundbreaking work on developing a molecular replacement for transistors has brought the vision of atomic-scale computing a step closer to reality. Their efforts, a continuation of work that began in the 1990s, are today being funded by the European Union in the Pico-Inside project.

In a conventional microprocessor – the “motor” of a modern computer – transistors are the essential building blocks of digital circuits, creating logic gates that process true or false signals. A few transistors are needed to create a single logic gate and modern microprocessors contain billions of them, each measuring around 100 nanometres.

Transistors have continued to shrink in size since Intel co-founder Gordon E. Moore famously predicted in 1965 that the number that can be placed on a processor would double roughly every two years. But there will inevitably come a time when the laws of quantum physics prevent any further shrinkage using conventional methods. That is where atomic-scale computing comes into play with a fundamentally different approach to the problem.

“Nanotechnology is about taking something and shrinking it to its smallest possible scale. It’s a top-down approach,” Joachim says. He and the Pico-Inside team are turning that upside down, starting from the atom, the molecule, and exploring if such a tiny bit of matter can be a logic gate, memory source, or more. “It is a bottom-up or, as we call it, 'bottom-bottom' approach because we do not want to reach the material scale,” he explains.

Joachim’s team has focused on taking one individual molecule and building up computer components, with the ultimate goal of hosting a logic gate in a single molecule.

How many atoms to build a computer?

“The question we have asked ourselves is how many atoms does it take to build a computer?” Joachim says. “That is something we cannot answer at present, but we are getting a better idea about it.”

The team has managed to design a simple logic gate with 30 atoms that performs the same task as 14 transistors, while also exploring the architecture, technology and chemistry needed to achieve computing inside a single molecule and to interconnect molecules.

They are focusing on two architectures: one that mimics the classical design of a logic gate but in atomic form, including nodes, loops, meshes etc., and another, more complex, process that relies on changes to the molecule’s conformation to carry out the logic gate inputs and quantum mechanics to perform the computation.

The logic gates are interconnected using scanning-tunnelling microscopes and atomic-force microscopes – devices that can measure and move individual atoms with resolutions down to 1/100 of a nanometre (that is one hundred millionth of a millimetre!). As a side project, partly for fun but partly to stimulate new lines of research, Joachim and his team have used the technique to build tiny nano-machines, such as wheels, gears, motors and nano-vehicles each consisting of a single molecule.

“Put logic gates on it and it could decide where to go,” Joachim notes, pointing to what would be one of the world’s first implementations of atomic-scale robotics.

The importance of the Pico-Inside team’s work has been widely recognised in the scientific community, though Joachim cautions that it is still very much fundamental research. It will be some time before commercial applications emerge from it. However, emerge they all but certainly will.

“Microelectronics needs us if logic gates – and as a consequence microprocessors – are to continue to get smaller,” Joachim says.

The Pico-Inside researchers, who received funding under the ICT strand of the EU’s Sixth Framework Programme, are currently drafting a roadmap to ensure computing power continues to increase in the future.

OpenSPARC(TM) Internals abstract





Much of the material in this chapter was leveraged from L. Spracklen and S. G. Abraham, “Chip Multithreading: Opportunities and Challenges,” in 11th International Symposium on High-Performance Computer Architecture, 2005.

Over the last few decades microprocessor performance has increased exponentially, with processor architects successfully achieving significant gains in single-thread performance from one processor generation to the next. Semiconductor technology has been the main driver for this increase, with faster transistors allowing rapid increases in clock speed to today’s multi-GHz frequencies. In addition to these frequency increases, each new technology generation has essentially doubled the number of available transistors. As a result, architects have been able to aggressively chase increased single-threaded performance by using a range of expensive microarchitectural techniques, such as superscalar issue, out-of-order issue, on-chip caching, and deep pipelines supported by sophisticated branch predictors.

However, process technology challenges, including power constraints, the memory wall, and ever-increasing difficulties in extracting further instruction-level parallelism (ILP), are all conspiring to limit the performance of individual processors in the future. While recent attempts at improving single-thread performance through even deeper pipelines have led to impressive clock frequencies, these clock frequencies have not translated into significantly better performance in comparison with less aggressive designs. As a result, microprocessor frequency, which used to increase exponentially, has now leveled off, with most processors operating in the 2–4 GHz range.

This combination of limited realizable ILP, practical limits to pipelining, and a “power ceiling” imposed by cost-effective cooling considerations has conspired to limit future performance increases within conventional processor cores. Accordingly, processor designers are searching for new ways to effectively utilize their ever-increasing transistor budgets.



The techniques being embraced across the microprocessor industry are chip multiprocessors (CMPs) and chip multithreaded (CMT) processors. A CMP, as the name implies, is simply a group of processors integrated onto the same chip. The individual processors typically have performance comparable to that of their single-core brethren, but for workloads with sufficient thread-level parallelism (TLP), the aggregate performance delivered by the processor can be many times that delivered by a single-core processor. Most current processors adopt this approach and simply replicate existing single-processor cores on a single die.

Moving beyond these simple CMP processors, chip multithreaded (CMT) processors go one step further and support many simultaneous hardware strands (or threads) of execution per core via simultaneous multithreading (SMT) techniques. SMT effectively combats increasing latencies by enabling multiple strands to share many of the resources within the core, including the execution resources. With each strand spending a significant portion of time stalled waiting for off-chip misses to complete, each strand’s utilization of the core’s execution resources is extremely low. SMT improves the utilization of key resources and reduces the sensitivity of an application to off-chip misses. Similarly, as with CMP, multiple cores can share chip resources such as the memory controller, off-chip bandwidth, and the level-2/level-3 cache, improving the utilization of these resources.



The benefits of CMT processors are apparent in a wide variety of application spaces. For instance, in the commercial space, server workloads are broadly characterized by high levels of TLP, low ILP, and large working sets. The potential for further improvements in overall single-thread performance is limited; on-chip cycles per instruction (CPI) cannot be improved significantly because of low ILP, and off-chip CPI is large and growing because of relative increases in memory latency. However, typical server applications concurrently serve a large number of users or clients; for instance, a database server may have hundreds of active processes, each associated with a different client. Furthermore, these processes are currently multithreaded to hide disk access latencies. This structure leads to high levels of TLP. Thus, it is extremely attractive to couple the high TLP in the application domain with support for multiple threads of execution on a processor chip.

Though the arguments for CMT processors are often made in the context of overlapping memory latencies, memory bandwidth considerations also play a significant role. New memory technologies, such as fully buffered DIMMs (FBDs), have higher bandwidths (for example, 60 GB/s/chip), as well as higher latencies (for example, 130 ns), pushing up their bandwidth-delay product to 60 GB/s × 130 ns = 7800 bytes. The processor chip’s pins represent an expensive resource, and to keep these pins fully utilized (assuming a cache line size of 64 bytes), the processor chip must sustain 7800/64 or over 100 parallel requests. To put this in perspective, a single strand on an aggressive out-of-order processor core generates less than two parallel requests on typical server workloads; therefore, a large number of strands are required to sustain a high utilization of the memory ports.
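As a quick sanity check of that arithmetic (my own re-derivation, not from the book), the figures quoted above reproduce exactly in a shell:

    # Re-derive the bandwidth-delay figures quoted above
    # (inputs from the text: 60 GB/s, 130 ns, 64-byte cache lines)
    echo "(60 * 10^9) * 130 / 10^9" | bc   # 7800 bytes in flight
    echo "7800 / 64" | bc                  # 121 parallel cache-line requests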



Finally, power considerations also favor CMT processors. Given the almost cubic dependence between core frequency and power consumption, power drops dramatically with reductions in frequency. As a result, for workloads with adequate TLP, doubling the number of cores and halving the frequency delivers roughly equivalent performance while reducing power consumption by a factor of four (two cores each consuming (1/2)³ = 1/8 of the original power gives 2 × 1/8 = 1/4 in total).



Evolution of CMTs 



Given the exponential growth in transistors per chip over time, a rule of thumb is that a board design becomes a chip design in ten years or less. Thus, most industry observers expected that chip-level multiprocessing would eventually become a dominant design trend. The case for a single-chip multiprocessor was presented as early as 1996 by Kunle Olukotun’s team at Stanford University. Their Stanford Hydra CMP processor design called for the integration of four MIPS-based processors on a single chip. A DEC/Compaq research team proposed the incorporation of eight simple Alpha cores and a two-level cache hierarchy on a single chip (code-named Piranha) and estimated a simulated performance of three times that of a single-core, next-generation Alpha processor for on-line transaction processing workloads.

As early as the mid-1990s, Sun recognized the problems that would soon face processor designers as a result of the rapidly increasing clock frequencies required to improve single-thread performance. In response, Sun defined the MAJC architecture to target thread-level parallelism. Providing well-defined support for both CMP and SMT processors, the MAJC architecture was the industry’s first step toward general-purpose CMT processors. Shortly after publishing the MAJC architecture, Sun announced its first MAJC-compliant processor (MAJC-5200), a dual-core CMT processor with cores sharing an L1 data cache.


----------------------- Page 23-----------------------

4                             Chapter 1    Introducing Chip Multithreaded (CMT) Processors 



Subsequently, Sun moved its SPARC processor family toward the CMP design 

point. In 2003, Sun announced two CMP  SPARC processors: Gemini, a dual- 

core   UltraSPARC   II   derivative;   and  UltraSPARC   IV.   These   first-generation 

CMP processors were derived from earlier uniprocessor designs, and the two 

cores did not share any resources other than off-chip datapaths. In most CMP 

designs,     it  is  preferable   to  share   the  outermost     caches,   because     doing   so 

localizes    coherency      traffic  between    the   strands   and   optimizes     inter-strand 

communication         in  the  chip—allowing        very   fine-grained     thread   interaction 

(microparallelism).   In   2003,   Sun   also  announced   its   second-generation   CMP 

processor, UltraSPARC IV+, a follow-on to the UltraSPARC IV processor, in 

which   the   on-chip   L2   and   off-chip   L3   caches   are   shared   between   the   two 

cores. 



In 2006, Sun introduced a 32-way CMT SPARC processor, called UltraSPARC T1, for which the entire design, including the cores, is optimized for a CMT design point. UltraSPARC T1 has eight cores; each core is a four-way SMT with its own private L1 caches. All eight cores share a 3-Mbyte, 12-way level-2 cache. Since UltraSPARC T1 is targeted at commercial server workloads with high TLP, low ILP, and large working sets, the ability to support many strands and therefore many concurrent off-chip misses is key to overall performance. Thus, to accommodate eight cores, each core supports single issue and has a fairly short pipeline.



Sun's most recent CMT processor is the UltraSPARC T2 processor. The UltraSPARC T2 processor provides double the threads of the UltraSPARC T1 processor (eight threads per core), as well as improved single-thread performance, additional level-2 cache resources (increased size and associativity), and improved support for floating-point operations.



Sun's move toward the CMT design has been mirrored throughout the industry. In 2001, IBM introduced the dual-core POWER-4 processor and recently released second-generation CMT processors, the POWER-5 and POWER-6 processors, in which each core supports 2-way SMT. While this fundamental shift in processor design was initially confined to high-end server processors, where the target workloads are the most thread-rich, the change has recently begun to spread to desktop processors. AMD and Intel have also subsequently released multicore CMP processors, starting with dual-core CMPs and more recently quad-core CMP processors. Further, Intel has announced that its next-generation quad-core processors will support 2-way SMT, providing a total of eight threads per chip.



CMT is emerging as the dominant trend in general-purpose processor design, with manufacturers discussing their multicore plans beyond their initial quad-core offerings. Similar to the CISC-to-RISC shift that enabled an entire processor to fit on a single chip and internalized all communication between pipeline stages to within a chip, the move to CMT represents a fundamental shift in processor design that internalizes much of the communication between processors to within a chip.



Future CMT Designs 



An attractive proposition for future CMT design is simply to double the number of cores per chip every generation, since a new process technology essentially doubles the transistor budget. Little design effort is expended on the cores, and performance is almost doubled every process generation on workloads with sufficient TLP. Though reusing existing core designs is an attractive option, this approach may not scale well beyond a couple of process generations. Processor designs are already pushing the limits of power dissipation. For the total power consumption to be restrained, the power dissipation of each core must be halved in each generation. In the past, supply voltage scaling delivered most of the required power reduction, but indications are that voltage scaling will not be sufficient by itself. Though well-known techniques, such as clock gating and frequency scaling, may be quite effective in the short term, more research is needed to develop low-power, high-performance cores for future CMT designs.

Further, given the significant area cost associated with high-performance cores, for a fixed area and power budget, the CMP design choice is between a small number of high-performance (high-frequency, aggressively out-of-order, large-issue-width) cores and a larger number of simple (low-frequency, in-order, limited-issue-width) cores. For workloads with sufficient TLP, the simpler-core solution may deliver superior chipwide performance at a fraction of the power. However, for applications with limited TLP, unless speculative parallelism can be exploited, CMT performance will be poor. One possible solution is to support heterogeneous cores, potentially providing multiple simple cores for thread-rich workloads and a single more complex core to provide robust performance for single-threaded applications.



Another interesting opportunity for CMT processors is support for on-chip hardware accelerators. Hardware accelerators improve performance on certain specialized tasks and off-load work from the general-purpose processor. Additionally, on-chip hardware accelerators may be an order of magnitude more power efficient than the general-purpose processor and may be significantly more efficient than off-chip accelerators (for example, by eliminating the off-chip traffic required to communicate with an off-chip accelerator). Although high cost and low utilization typically make on-chip hardware accelerators unattractive for traditional processors, the cost of an accelerator can be amortized over many strands, thanks to the high degree of resource sharing associated with CMTs. While a wide variety of hardware accelerators can be envisaged, emerging trends make an extremely compelling case for supporting on-chip network off-load engines and cryptographic accelerators. Future processors will afford opportunities for accelerating other functionality as well. For instance, with the increasing use of XML-formatted data, it may become attractive to have hardware support for XML parsing and processing.



Finally, for the same amount of off-chip bandwidth to be maintained per core, the total off-chip bandwidth for the processor chip must also double every process generation. Processor designers can meet the bandwidth need by adding more pins or by increasing the bandwidth per pin. However, the maximum number of pins per package is increasing at a rate of only 10 percent per generation. Further, packaging cost per pin is barely decreasing with each new generation and increases significantly with pin count. As a result, efforts have recently focused on increasing the per-pin bandwidth through innovations in the processor-chip-to-DRAM interconnect, such as double data rate and fully buffered DIMMs. Additional benefits can be obtained by doing more with the available bandwidth: for instance, by compressing off-chip traffic, or by exploiting the "silentness" of write-backs to avoid transferring lines whose contents have not actually changed. Compression of the on-chip caches themselves can also improve performance, but the significant additional latency introduced by decompression must be carefully balanced against the benefit of the reduced miss rate, favoring adaptive compression strategies.
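The write-back idea is simple enough to sketch. The following C fragment is illustrative only; the function name and the memcmp-based check are ours, not from any OpenSPARC design (real hardware would track silentness against the clean copy of the line rather than re-reading memory):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    /* Write back a dirty cache line only if its contents actually changed.
     * A "silent" line (identical to the memory copy) costs no bandwidth. */
    static bool writeback_if_needed(const uint8_t line[LINE_BYTES],
                                    uint8_t mem_copy[LINE_BYTES])
    {
        if (memcmp(line, mem_copy, LINE_BYTES) == 0)
            return false;                   /* silent: skip the transfer */
        memcpy(mem_copy, line, LINE_BYTES); /* pay the off-chip cost */
        return true;
    }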



As a result, going forward we are likely to see an ever-increasing proportion of CMT processors designed from the ground up to deliver ever-increasing performance while satisfying these power and bandwidth constraints.



CHAPTER 2 



OpenSPARC Designs 



Sun Microsystems began shipping the UltraSPARC T1 chip multithreaded (CMT) processor in December 2005. Sun surprised the industry by announcing that it would not only ship the processor but also open-source it, an industry first. By March 2006, UltraSPARC T1 had been open-sourced in a distribution called OpenSPARC T1, available on http://OpenSPARC.net.



In 2007, Sun began shipping its newer, more advanced UltraSPARC T2 processor, and open-sourced the bulk of that design as OpenSPARC T2.



The "source code" for both designs offered on OpenSPARC.net is comprehensive, including not just millions of lines of the hardware description language (Verilog, a form of "register transfer logic," or RTL) for these microprocessors, but also scripts to compile ("synthesize") that source code into hardware implementations, source code of processor and full-system simulators, prepackaged operating system images to boot on the simulators, source code to the Hypervisor software layer, a large suite of verification software, and thousands of pages of architecture and implementation specification documents.



This book is intended as a "getting started" companion to both OpenSPARC T1 and OpenSPARC T2. In this chapter, we begin that association by addressing this question: Now that Sun has open-sourced OpenSPARC T1 and T2, what can they be used for?



One thing is certain: the real-world uses to which OpenSPARC will be put will be infinitely more diverse and interesting than anything that could be suggested in this book! Nonetheless, this short chapter offers a few ideas, in the hope that they will stimulate even more creative thinking …






2.1  Academic Uses for OpenSPARC

The utility of OpenSPARC in academia is limited only by students' imaginations.

The most common academic use of OpenSPARC to date is as a complete example processor architecture and/or implementation. It can be used in coursework areas such as computer architecture, VLSI design, compiler code generation/optimization, and general computer engineering.



In university lab courses, OpenSPARC provides a design that can be used as a known-good starting point for assigned projects.



OpenSPARC can be used as a basis for compiler research, such as code generation/optimization for highly threaded target processors or experimentation with instruction set changes and additions.



OpenSPARC is already in use in multiple FPGA-based projects at universities. For more information, visit:
     http://www.opensparc.net/fpga/index.html



For more information on programs supporting academic use of OpenSPARC, including availability of the Xilinx OpenSPARC FPGA Board, visit:
     http://www.OpenSPARC.net/edu/university-program.html



Specific questions about university programs can be posted on the OpenSPARC general forum at:
     http://forums.sun.com/forum.jspa?forumID=837
or emailed to OpenSPARC-UniversityProgram@sun.com.



Many of the commercial applications of OpenSPARC, mentioned in the following section, suggest corresponding academic uses.



2.2  Commercial Uses for OpenSPARC

OpenSPARC provides a springboard for the design of commercial processors. By starting from a complete, known-good design (including a full verification suite), the time-to-market for a new custom processor can be drastically slashed.





Derivative processors ranging from a simple single-core, single-thread design all the way up through an 8-core, 64-thread design can rapidly be synthesized from OpenSPARC T1 or T2.



2.2.1  FPGA Implementation



An OpenSPARC design can be synthesized and loaded into a field-programmable gate array (FPGA) device. This can be used in several ways:

*  An FPGA version of the processor can be used for product prototyping, allowing rapid design iteration.

*  An FPGA can be used to provide a high-speed simulation engine for a processor under development.

*  For extreme time-to-market needs where production cost per processor isn't critical, a processor could even be shipped in FPGA form. This could also be useful if the processor itself needs to be field-upgradable via a software download.



2.2.2  Design Minimization

Portions of a standard OpenSPARC design that are not needed for the target application can be stripped out to make the resulting processor smaller, cheaper, faster, and/or higher-yielding. For example, for a network routing application, hardware floating-point operations may be superfluous; in that case, the FPU(s) can be removed, saving die area and reducing verification effort.



2.2.3  Coprocessors

Specialized coprocessors can be incorporated into a processor based on OpenSPARC. OpenSPARC T2, for example, comes with a coprocessor containing two 10 Gbit/second Ethernet transceivers (the network interface unit, or "NIU"). Coprocessors can be added for any conceivable purpose, including (but hardly limited to) the following:

*  Network routing
*  Floating-point acceleration
*  Cryptographic processing
*  I/O compression/decompression engines
*  Audio compression/decompression (codecs)
*  Video codecs
*  I/O interface units for embedded devices such as displays or input sensors





2.2.4  OpenSPARC as Test Input to CAD/EDA Tools

The OpenSPARC source code (Verilog RTL) provides a large, real-world input dataset for CAD/EDA tools. It can be used to test the robustness of CAD tools and simulators. Many major commercial CAD/EDA tool vendors are already using OpenSPARC this way!



CHAPTER 3 



Architecture Overview 



OpenSPARC processors are based on a processor architecture named the UltraSPARC Architecture. The OpenSPARC T1 design is based on the UltraSPARC Architecture 2005, and OpenSPARC T2 is based on the UltraSPARC Architecture 2007. This chapter is intended as an overview of the architecture; more details can be found in the UltraSPARC Architecture 2005 Specification and the UltraSPARC Architecture 2007 Specification.



The UltraSPARC Architecture is descended from the SPARC V9 architecture and complies fully with the "Level 1" (nonprivileged) SPARC V9 specification.



The UltraSPARC Architecture supports 32-bit and 64-bit integer and 32-bit, 64-bit, and 128-bit floating-point as its principal data types. The 32-bit and 64-bit floating-point types conform to IEEE Std 754-1985. The 128-bit floating-point type conforms to IEEE Std 1596.5-1992. The architecture defines general-purpose integer, floating-point, and special state/status register instructions, all encoded in 32-bit-wide instruction formats. The load/store instructions address a linear, 2^64-byte virtual address space.



As used here, the word architecture refers to the processor features that are visible to an assembly language programmer or to a compiler code generator. It does not include details of the implementation that are not visible or easily observable by software, nor those that only affect timing (performance).



The chapter contains these sections:

*  The UltraSPARC Architecture
*  Processor Architecture
*  Instructions
*  Traps
*  Chip-Level Multithreading (CMT)




3.1  The UltraSPARC Architecture

This section briefly describes features, attributes, and components of the UltraSPARC Architecture and, further, describes correct implementation of the architecture specification and SPARC V9 compliance levels.



3.1.1  Features

The UltraSPARC Architecture, like its ancestor SPARC V9, includes the following principal features:



*  A linear 64-bit address space with 64-bit addressing.

*  32-bit wide instructions — These are aligned on 32-bit boundaries in memory. Only load and store instructions access memory and perform I/O.

*  Few addressing modes — A memory address is given as either "register + register" or "register + immediate".

*  Triadic register addresses — Most computational instructions operate on two register operands, or one register and a constant, and place the result in a third register.

*  A large windowed register file — At any one instant, a program sees 8 global integer registers plus a 24-register window of a larger register file. The windowed registers can be used as a cache of procedure arguments, local values, and return addresses.

*  Floating point — The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), and 16 quad-precision (128-bit) overlayed registers.

*  Fast trap handlers — Traps are vectored through a table.

*  Multiprocessor synchronization instructions — Multiple variations of atomic load-store memory operations are supported.

*  Predicted branches — The branch-with-prediction instructions allow the compiler or assembly language programmer to give the hardware a hint about whether a branch will be taken.

*  Branch elimination instructions — Several instructions can be used to eliminate branches altogether (for example, Move on Condition). Eliminating branches increases performance in superscalar and superpipelined implementations.





*  Hardware trap stack — A hardware trap stack is provided to allow nested traps. It contains all of the machine state necessary to return to the previous trap level. The trap stack makes the handling of faults and error conditions simpler, faster, and safer.



In addition, the UltraSPARC Architecture includes the following features that were not present in the SPARC V9 specification:

*  Hyperprivileged mode — This mode simplifies porting of operating systems, supports far greater portability of operating system (privileged) software, supports the ability to run multiple simultaneous guest operating systems, and provides more robust handling of error conditions. Hyperprivileged mode is described in detail in the Hyperprivileged version of the UltraSPARC Architecture 2005 Specification or the UltraSPARC Architecture 2007 Specification.

*  Multiple levels of global registers — Instead of the two 8-register sets of global registers specified in the SPARC V9 architecture, the UltraSPARC Architecture provides multiple sets; typically, one set is used at each trap level.

*  Extended instruction set — The UltraSPARC Architecture provides many instruction set extensions, including the VIS instruction set for "vector" (SIMD) data operations.

*  More detailed, specific instruction descriptions — UltraSPARC Architecture specifications provide many more details than did SPARC V9 regarding what exceptions can be generated by each instruction and the specific conditions under which those exceptions can occur. Also, detailed lists of valid ASIs are provided for each load/store instruction from/to alternate space.

*  Detailed MMU architecture — Although some details of the UltraSPARC MMU architecture are necessarily implementation-specific, UltraSPARC Architecture specifications provide a blueprint for the UltraSPARC MMU, including the software view (TTEs and TSBs) and MMU hardware control registers.

*  Chip-level multithreading (CMT) — The UltraSPARC Architecture provides a control architecture for highly threaded processor implementations.



3.1.2  Attributes

The UltraSPARC Architecture is a processor instruction set architecture (ISA) derived from SPARC V8 and SPARC V9, which in turn come from a reduced instruction set computer (RISC) lineage. As an architecture, the UltraSPARC




Architecture allows for a spectrum of processor and system implementations at a variety of price/performance points for a range of applications, including scientific/engineering, programming, real-time, and commercial applications. OpenSPARC further broadens the design space by opening up key implementations to be studied, enhanced, or redesigned by anyone in the community.



3.1.2.1  Design Goals

The UltraSPARC Architecture is designed to be a target for optimizing compilers and high-performance hardware implementations. The UltraSPARC Architecture 2005 and UltraSPARC Architecture 2007 Specification documents provide design specifications against which an implementation can be verified, using appropriate verification software.



3.1.2.2  Register Windows

The UltraSPARC Architecture is derived from the SPARC architecture, which was formulated at Sun Microsystems from 1984 through 1987. The SPARC architecture is, in turn, based on the RISC I and II designs engineered at the University of California at Berkeley from 1980 through 1982. The SPARC "register window" architecture, pioneered in the UC Berkeley designs, allows for straightforward, high-performance compilers and a reduction in memory load/store instructions.

Note that privileged software, not user programs, manages the register windows. Privileged software can save a minimum number of registers (approximately 24) during a context switch, thereby optimizing context-switch latency.



3.1.3  System Components

The UltraSPARC Architecture allows for a spectrum of subarchitectures, such as cache system, I/O, and memory management unit (MMU).



3.1.3.1  Binary Compatibility

An important mandate for the UltraSPARC Architecture is compatibility across implementations of the architecture for application (nonprivileged) software, down to the binary level. Binaries executed in nonprivileged mode should behave identically on all UltraSPARC Architecture systems when those





systems are running an operating system known to provide a standard execution environment. One example of such a standard environment is the SPARC V9 Application Binary Interface (ABI).



Although different UltraSPARC Architecture systems can execute nonprivileged programs at different rates, they will generate the same results as long as they are run under the same memory model. See Chapter 9, Memory, in an UltraSPARC Architecture specification for more information.


Additionally, UltraSPARC Architecture 2005 and UltraSPARC Architecture 2007 are upward-compatible from SPARC V9 for applications running in nonprivileged mode that conform to the SPARC V9 ABI, and upward-compatible from SPARC V8 for applications running in nonprivileged mode that conform to the SPARC V8 ABI.



An OpenSPARC implementation may or may not maintain the same binary compatibility, depending on how the implementation has been modified and what software execution environment is run on it.



3.1.3.2  UltraSPARC Architecture MMU

UltraSPARC Architecture defines a common MMU architecture (see Chapter 14, Memory Management, in any UltraSPARC Architecture specification for details). Some specifics are left implementation-dependent.



3.1.3.3  Privileged Software

UltraSPARC Architecture does not assume that all implementations must execute identical privileged software (operating systems) or hyperprivileged software (hypervisors). Thus, certain traits that are visible to privileged software may be tailored to the requirements of the system.



3.2  Processor Architecture

An UltraSPARC Architecture processor (and therefore an OpenSPARC processor) logically consists of an integer unit (IU) and a floating-point unit (FPU), each with its own registers. This organization allows for implementations with concurrent integer and floating-point instruction execution. Integer registers are 64 bits wide; floating-point registers are 32, 64, or 128 bits wide. Instruction operands are single registers, register pairs, register quadruples, or immediate constants.





A virtual processor (synonym: strand) is the hardware containing the state for execution of a software thread. A physical core is the hardware required to execute instructions from one or more software threads, including resources shared among strands. A complete processor comprises one or more physical cores and is the physical module that plugs into a system.



An OpenSPARC virtual processor can run in nonprivileged mode, privileged mode, or hyperprivileged mode. In hyperprivileged mode, the processor can execute any instruction, including privileged instructions. In privileged mode, the processor can execute nonprivileged and privileged instructions. In nonprivileged mode, the processor can execute only nonprivileged instructions. In nonprivileged or privileged mode, an attempt to execute an instruction requiring greater privilege than the current mode causes a trap to hyperprivileged software.



3.2.1  Integer Unit (IU)

An OpenSPARC implementation's integer unit contains the general-purpose registers and controls the overall operation of the virtual processor. The IU executes the integer arithmetic instructions and computes memory addresses for loads and stores. It also maintains the program counters and controls instruction execution for the FPU.



An UltraSPARC Architecture implementation may contain from 72 to 640 general-purpose 64-bit R registers. This corresponds to a grouping of the registers into a number of sets of global R registers plus a circular stack of N_REG_WINDOWS sets of 16 registers each, known as register windows. The number of register windows present (N_REG_WINDOWS) is implementation dependent, within the range of 3 to 32 (inclusive). In an unmodified OpenSPARC T1 or T2 implementation, N_REG_WINDOWS = 8.
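Those bounds are easy to check. In the C sketch below, the 16-registers-per-window figure comes from the text above; the assumption that global registers come in sets of 8 and that implementations provide from 3 to 16 such sets is inferred from the 72-to-640 range rather than stated here:

    #include <stdio.h>

    /* Total R registers = 8 per set of globals + 16 per register window
     * (assumed decomposition; see lead-in). */
    static int total_r_registers(int global_sets, int n_reg_windows)
    {
        return 8 * global_sets + 16 * n_reg_windows;
    }

    int main(void)
    {
        printf("min: %d\n", total_r_registers(3, 3));   /* 72  */
        printf("max: %d\n", total_r_registers(16, 32)); /* 640 */
        return 0;
    }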



3.2.2  Floating-Point Unit (FPU)

An OpenSPARC FPU has thirty-two 32-bit (single-precision) floating-point registers, thirty-two 64-bit (double-precision) floating-point registers, and sixteen 128-bit (quad-precision) floating-point registers, some of which overlap (as described in detail in UltraSPARC Architecture specifications).

If no FPU is present, then it appears to software as if the FPU is permanently disabled.





If the FPU is not enabled, then an attempt to execute a floating-point instruction generates an fp_disabled trap, and the fp_disabled trap handler software must either

*  enable the FPU (if present) and reexecute the trapping instruction, or
*  emulate the trapping instruction in software.



3.3  Instructions

Instructions fall into the following basic categories:

*  Memory access
*  Integer arithmetic / logical / shift
*  Control transfer
*  State register access
*  Floating-point operate
*  Conditional move
*  Register window management
*  SIMD (single instruction, multiple data) instructions

These classes are discussed in the following subsections.



3.3.1  Memory Access

Load, store, load-store, and PREFETCH instructions are the only instructions that access memory. They use either two R registers, or an R register and a signed 13-bit immediate value, to calculate a 64-bit, byte-aligned memory address. The integer unit appends an ASI to this address.
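That address arithmetic can be written out directly. Below is a minimal C sketch of the two instruction forms (illustrative; it is not taken from the OpenSPARC RTL), including the sign extension of the 13-bit immediate:

    #include <stdint.h>

    /* "register + register" form. */
    static uint64_t ea_reg_reg(uint64_t rs1, uint64_t rs2)
    {
        return rs1 + rs2;
    }

    /* "register + immediate" form: simm13 occupies bits 12:0 of the
     * instruction word and is sign-extended to 64 bits before the add. */
    static uint64_t ea_reg_imm(uint64_t rs1, uint32_t instr)
    {
        int64_t simm13 = ((int32_t)(instr << 19)) >> 19; /* arithmetic shift */
        return rs1 + (uint64_t)simm13;
    }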



The destination field of the load/store instruction specifies either one or two R registers, or one, two, or four F registers, that supply the data for a store or that receive the data from a load.



Integer load and store instructions support byte, halfword (16-bit), word (32-bit), and extended-word (64-bit) accesses. There are versions of integer load instructions that perform either sign-extension or zero-extension on 8-bit, 16-bit, and 32-bit values as they are loaded into a 64-bit destination register. Floating-point load and store instructions support word, doubleword, and quadword[1] memory accesses.



[1] OpenSPARC T1 and T2 processors do not implement the LDQF instruction in hardware; it generates an exception and is emulated in hyperprivileged software.





CASA, CASXA, and LDSTUB are special atomic memory access instructions that concurrent processes use for synchronization and memory updates.

    Note  The SWAP instruction is also specified, but it is deprecated and should not be used in newly developed software.

The (nonportable) LDTXA instruction supplies an atomic 128-bit (16-byte) load that is important in certain system software applications.



3.3.1.1  Memory Alignment Restrictions

A memory access on an OpenSPARC virtual processor must typically be aligned on an address boundary greater than or equal to the size of the datum being accessed. An improperly aligned address in a load, store, or load-store instruction may trigger an exception and cause a subsequent trap. For details, see the Memory Alignment Restrictions section in an UltraSPARC Architecture specification.
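Since each natural access size is a power of two, the restriction reduces to the usual mask test; a one-line C sketch (illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if addr is suitably aligned for an access of `size` bytes,
     * where size is a power of two (1, 2, 4, 8, or 16). */
    static bool is_aligned(uint64_t addr, uint64_t size)
    {
        return (addr & (size - 1)) == 0;
    }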



3.3.1.2  Addressing Conventions

An unmodified OpenSPARC processor uses big-endian byte order by default: the address of a quadword, doubleword, word, or halfword is the address of its most significant byte. Increasing the address means decreasing the significance of the unit being accessed. All instruction accesses are performed using big-endian byte order.

An unmodified OpenSPARC processor also supports little-endian byte order for data accesses only: the address of a quadword, doubleword, word, or halfword is the address of its least significant byte. Increasing the address means increasing the significance of the data unit being accessed.
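The difference is easy to observe from C. This small program prints the byte stored at the lowest address of a 32-bit word; on a big-endian SPARC data access it is the most significant byte (0x11), whereas on a little-endian host it is the least significant byte (0x44):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t word = 0x11223344;
        uint8_t *lowest = (uint8_t *)&word; /* byte at the word's address */
        printf("byte at lowest address: 0x%02x\n", *lowest);
        return 0;
    }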



3.3.1.3  Addressing Range

An OpenSPARC implementation supports a 64-bit virtual address space. The supported range of virtual addresses is restricted to two equal-sized ranges at the extreme upper and lower ends of 64-bit addresses; that is, for n-bit virtual addresses, the valid address ranges are 0 to 2^(n-1) - 1 and 2^64 - 2^(n-1) to 2^64 - 1.
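The validity check follows directly from that formula; a C sketch (illustrative; n is whatever virtual-address width the implementation supports, for example 48):

    #include <stdbool.h>
    #include <stdint.h>

    /* Valid ranges for n-bit virtual addresses:
     *   [0, 2^(n-1) - 1]  and  [2^64 - 2^(n-1), 2^64 - 1]. */
    static bool va_is_valid(uint64_t va, unsigned n)
    {
        uint64_t half = 1ULL << (n - 1);
        return va < half || va >= -half; /* -half wraps to 2^64 - half */
    }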



See the OpenSPARC T1 Implementation Supplement or the OpenSPARC T2 Implementation Supplement for details.





3.3.1.4  Load/Store Alternate

Versions of load/store instructions, the load/store alternate instructions, can specify an arbitrary 8-bit address space identifier (ASI) for the load/store data access.

Access to alternate spaces 0x00-0x2F is restricted to privileged and hyperprivileged software, access to alternate spaces 0x30-0x7F is restricted to hyperprivileged software, and access to alternate spaces 0x80-0xFF is unrestricted. Some of the ASIs are available for implementation-dependent uses. Privileged and hyperprivileged software can use the implementation-dependent ASIs to access special protected registers, such as MMU control registers, cache control registers, virtual processor state registers, and other processor-dependent or system-dependent values. See the Address Space Identifiers (ASIs) chapter in an UltraSPARC Architecture specification for more information.

Alternate space addressing is also provided for the atomic memory access instructions LDSTUBA, CASA, and CASXA.



3.3.1.5  Separate Instruction and Data Memories

The interpretation of addresses in an unmodified OpenSPARC processor is "split": instruction references use one caching and translation mechanism and data references use another, although the same underlying main memory is shared.



In such split-memory systems, the coherency mechanism may be split, so a write[1] into data memory is not immediately reflected in instruction memory.

For this reason, programs that modify their own instruction stream (self-modifying code[2]) and that wish to be portable across all UltraSPARC Architecture (and SPARC V9) processors must issue FLUSH instructions, or a system call with a similar effect, to bring the instruction and data caches into a consistent state.



An UltraSPARC Architecture virtual processor may or may not have coherent instruction and data caches. Even if an implementation does have coherent instruction and data caches, a FLUSH instruction is still required for self-modifying code: not for cache coherency, but to flush pipeline instruction buffers that may hold stale copies of instructions that have since been modified.
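On a SPARC target with a GCC-compatible compiler, the FLUSH instruction can be issued from C with inline assembly. The sketch below is illustrative (the function name and loop structure are ours); it assumes FLUSH covers at least the aligned doubleword containing the effective address, so it steps through the modified range 8 bytes at a time:

    #include <stddef.h>
    #include <stdint.h>

    /* After writing new instructions to `code`, make them visible to
     * instruction fetch before jumping to them (SPARC only). */
    static void flush_new_code(void *code, size_t len)
    {
        uintptr_t p   = (uintptr_t)code & ~(uintptr_t)7; /* align down */
        uintptr_t end = (uintptr_t)code + len;

        for (; p < end; p += 8)
            __asm__ __volatile__("flush %0" : : "r"(p) : "memory");
    }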



[1] This includes use of store instructions (executed on the same or another virtual processor) that write to instruction memory, or any other means of writing into instruction memory (for example, DMA).

[2] This is practiced, for example, by software such as debuggers and dynamic linkers.





3.3.1.6  Input/Output (I/O)

The UltraSPARC Architecture assumes that input/output registers are accessed through load/store alternate instructions, normal load/store instructions, or read/write Ancillary State register instructions (RDasr, WRasr).



3.3.1.7  Memory Synchronization

Two instructions are used for synchronization of memory operations: FLUSH and MEMBAR. Their operation is explained in the Flush Instruction Memory and Memory Barrier sections, respectively, of UltraSPARC Architecture specifications.



3.3.2  Integer Arithmetic / Logical / Shift Instructions

The arithmetic/logical/shift instructions perform arithmetic, tagged arithmetic, logical, and shift operations. With one exception, these instructions compute a result that is a function of two source operands; the result is either written into a destination register or discarded. The exception, SETHI, can be used in combination with other arithmetic and/or logical instructions to create a constant in an R register.

Shift instructions shift the contents of an R register left or right by a given number of bits ("shift count"). The shift distance is specified by a constant in the instruction or by the contents of an R register.
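SETHI deposits a 22-bit constant into bits 31:10 of the destination register and clears the remaining bits, so an arbitrary 32-bit constant is typically built with SETHI followed by an OR of the low 10 bits (the familiar %hi/%lo assembler idiom). A C sketch of the split (illustrative):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t value = 0xDEADBEEF;
        uint32_t hi22  = value >> 10;   /* SETHI operand     */
        uint32_t lo10  = value & 0x3FF; /* operand of the OR */

        /* The two-instruction sequence reconstructs the constant: */
        assert(((hi22 << 10) | lo10) == value);
        return 0;
    }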



3.3.3  Control Transfer

Control-transfer instructions (CTIs) include PC-relative branches and calls, register-indirect jumps, and conditional traps. Most of the control-transfer instructions are delayed; that is, the instruction immediately following a control-transfer instruction in logical sequence is dispatched before the control transfer to the target address is completed. Note that the next instruction in logical sequence may not be the instruction following the control-transfer instruction in memory.

The instruction following a delayed control-transfer instruction is called a delay instruction. Setting the annul bit in a conditional delayed control-transfer instruction causes the delay instruction to be annulled (that is, to have




no effect) if and only if the branch is not taken. Setting the annul bit in an unconditional delayed control-transfer instruction ("branch always") causes the delay instruction to be always annulled.

Branch and CALL instructions use PC-relative displacements. The jump and link (JMPL) and return (RETURN) instructions use a register-indirect target address. They compute their target addresses either as the sum of two R registers or as the sum of an R register and a 13-bit signed immediate value. The "branch on condition codes without prediction" instruction provides a displacement of ±8 Mbytes; the "branch on condition codes with prediction" instruction provides a displacement of ±1 Mbyte; the "branch on register contents" instruction provides a displacement of ±128 Kbytes; and the CALL instruction's 30-bit word displacement allows a control transfer to any address within ±2 gigabytes (±2^31 bytes).
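Those reaches follow from the widths of the displacement fields: each displacement counts 4-byte instruction words, so a w-bit signed field spans ±2^(w-1) words, or ±2^(w+1) bytes. A quick C check (the field widths of 22, 19, 16, and 30 bits are the SPARC V9 instruction-format widths for the four cases above):

    #include <stdint.h>
    #include <stdio.h>

    /* Reach, in bytes, of a w-bit signed word displacement. */
    static uint64_t reach_bytes(unsigned w)
    {
        return (1ULL << (w - 1)) * 4;
    }

    int main(void)
    {
        printf("22-bit branch: +/-%llu MB\n", (unsigned long long)(reach_bytes(22) >> 20));
        printf("19-bit branch: +/-%llu MB\n", (unsigned long long)(reach_bytes(19) >> 20));
        printf("16-bit branch: +/-%llu KB\n", (unsigned long long)(reach_bytes(16) >> 10));
        printf("30-bit CALL:   +/-%llu GB\n", (unsigned long long)(reach_bytes(30) >> 30));
        return 0;
    }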



    Note  The return-from-privileged-trap instructions (DONE and RETRY) get their target address from the appropriate TPC or TNPC register.



3.3.4  State Register Access

This section describes the following state registers:

*  Ancillary state registers
*  Read and write privileged state registers
*  Read and write hyperprivileged state registers



3.3.4.1  Ancillary State Registers

The read and write ancillary state register instructions read and write the contents of ancillary state registers visible to nonprivileged software (Y, CCR, ASI, PC, TICK, and FPRS) and some registers visible only to privileged and hyperprivileged software (PCR, SOFTINT, TICK_CMPR, and STICK_CMPR).



3.3.4.2  PR State Registers

The read and write privileged register instructions (RDPR and WRPR) read and write the contents of state registers visible only to privileged and hyperprivileged software (TPC, TNPC, TSTATE, TT, TICK, TBA, PSTATE, TL, PIL, CWP, CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, and WSTATE).




3.3.4.3  HPR State Registers

The read and write hyperprivileged register instructions (RDHPR and WRHPR) read and write the contents of state registers visible only to hyperprivileged software (HPSTATE, HTSTATE, HINTP, HVER, and HSTICK_CMPR).



3.3.5  Floating-Point Operate

Floating-point operate (FPop) instructions perform all floating-point calculations; they are register-to-register instructions that operate on the floating-point registers. FPops compute a result that is a function of one, two, or three source operands. The groups of instructions that are considered FPops are listed in the Floating-Point Operate (FPop) Instructions section of UltraSPARC Architecture specifications.



3.3.6  Conditional Move

Conditional move instructions conditionally copy a value from a source register to a destination register, depending on an integer or floating-point condition code or on the contents of an integer register. These instructions can be used to reduce the number of branches in software.



3.3.7  Register Window Management

Register window instructions manage the register windows. SAVE and RESTORE are nonprivileged and cause a register window to be pushed or popped. FLUSHW is nonprivileged and causes all of the windows except the current one to be flushed to memory. SAVED and RESTORED are used by privileged software to end a window spill or fill trap handler.



3.3.8  SIMD

An unmodified OpenSPARC processor includes SIMD (single instruction, multiple data) instructions, also known as "vector" instructions, which allow a single instruction to perform the same operation on multiple data items totaling 64 bits, such as eight 8-bit, four 16-bit, or two 32-bit data items. These operations are part of the "VIS" instruction set extensions.





3.4  Traps

A trap is a vectored transfer of control to privileged or hyperprivileged software through a trap table that may contain the first 8 instructions (32 for some frequently used traps) of each trap handler. The base address of the table is established by software in a state register (the Trap Base Address register, TBA, or the Hyperprivileged Trap Base register, HTBA). The displacement within the table is encoded in the type number of each trap and the level of the trap. Part of the trap table is reserved for hardware traps, and part of it is reserved for software traps generated by trap (Tcc) instructions.
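The displacement computation can be made concrete. Below is a hedged C sketch of SPARC V9-style trap vectoring: 32-byte table entries indexed by the trap type, with bit 14 selecting the half of the table used when the trap is taken at TL > 0. Exact details are implementation-specific; consult the specifications for the authoritative encoding.

    #include <stdint.h>

    /* Entry address of a trap handler, given the trap base address (TBA),
     * the trap level at the time of the trap (TL), and the trap type (TT). */
    static uint64_t trap_vector(uint64_t tba, unsigned tl, unsigned tt)
    {
        uint64_t base = tba & ~0x7FFFULL;          /* TBA supplies bits 63:15 */
        uint64_t half = tl > 0 ? (1ULL << 14) : 0; /* upper half when TL > 0  */
        return base | half | ((uint64_t)tt << 5);  /* 32 bytes per entry      */
    }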



A trap causes the current PC and NPC to be saved in the TPC and TNPC registers. It also causes the CCR, ASI, PSTATE, and CWP registers to be saved in TSTATE. TPC, TNPC, and TSTATE are entries in a hardware trap stack, where the number of entries in the trap stack is equal to the number of supported trap levels. A trap causes hyperprivileged state to be saved in the HTSTATE trap stack. A trap also sets bits in the PSTATE (and, in some cases, HPSTATE) register and typically increments the GL register. Normally, the CWP is not changed by a trap; on a window spill or fill trap, however, the CWP is changed to point to the register window to be saved or restored.



A trap can be caused by a Tcc instruction, an asynchronous exception, an instruction-induced exception, or an interrupt request not directly related to a particular instruction. Before executing each instruction, a virtual processor determines whether there are any pending exceptions or interrupt requests. If any are pending, the virtual processor selects the highest-priority exception or interrupt request and causes a trap.



See the Traps chapter in an UltraSPARC Architecture specification for a complete description of traps.



3.5  Chip-Level Multithreading (CMT)

An OpenSPARC implementation may include multiple virtual processor cores within the processor ("chip") to provide a dense, high-throughput system. This may be achieved by having a combination of multiple physical processor




cores and/or multiple strands (threads) per physical processor core; processors so organized are referred to as chip-level multithreaded (CMT) processors. CMT-specific hyperprivileged registers are used for identification and configuration of CMT processors.

The CMT programming model describes a common interface between hardware (CMT registers) and software.

The common CMT registers and the CMT programming model are described in the Chip-Level Multithreading (CMT) chapter in UltraSPARC Architecture specifications.



CHAPTER 4 



OpenSPARC T1 and T2 Processor Implementations



This chapter introduces the OpenSPARC T1 and OpenSPARC T2 chip-level multithreaded (CMT) processors in the following sections:

*  General Background
*  OpenSPARC T1 Overview
*  OpenSPARC T1 Components
*  OpenSPARC T2 Overview
*  OpenSPARC T2 Components
*  Summary of Differences Between OpenSPARC T1 and OpenSPARC T2



4.1  General Background

OpenSPARC T1 is the first chip multiprocessor that fully implements Sun's Throughput Computing initiative. OpenSPARC T2 is the follow-on chip multithreaded (CMT) processor to the OpenSPARC T1 processor. Throughput Computing is a technique that takes advantage of the thread-level parallelism that is present in most commercial workloads. Unlike desktop workloads, which often have a small number of threads concurrently running, most commercial workloads achieve their scalability by employing large pools of concurrent threads.

Historically, microprocessors have been designed to target desktop workloads, and as a result have focused on running a single thread as quickly as possible. Single-thread performance is achieved in these microprocessors by a combination of extremely deep pipelines (over 20 stages in Pentium 4) and execution of multiple instructions in parallel (referred to as instruction-level parallelism, or ILP). The basic tenet






behind Throughput Computing is that exploiting ILP and deep pipelining has reached the point of diminishing returns and, as a result, current microprocessors do not utilize their underlying hardware very efficiently.

For many commercial workloads, the physical processor core will be idle most of the time waiting on memory, and even when it is executing, it will often be able to utilize only a small fraction of its wide execution width. So rather than building a large and complex ILP processor that sits idle most of the time, one can build, in the same chip area, a number of small, single-issue physical processor cores that employ multithreading. Combining multiple physical processor cores on a single chip with multiple hardware-supported threads (strands) per physical processor core allows very high performance for highly threaded commercial applications. This approach is called thread-level parallelism (TLP). The difference between TLP and ILP is shown in FIGURE 4-1.



[FIGURE 4-1  Differences Between TLP and ILP. The figure contrasts four TLP strands, which interleave so that some strands execute while others are stalled on memory, with a single ILP strand that executes two instructions at a time and then stalls on memory.]



The memory stall time of one strand can often be overlapped with execution of other strands on the same physical processor core, and multiple physical processor cores run their strands in parallel. In the ideal case, shown in FIGURE 4-1, memory latency can be completely overlapped with execution of other strands. In contrast, instruction-level parallelism simply shortens the time to execute instructions and does not help much in overlapping execution with memory latency.[1]



[1] Processors that employ out-of-order ILP can overlap some memory latency with execution. However, this overlap is typically limited to shorter memory-latency events, such as L1 cache misses that hit in the L2 cache. Longer memory-latency events, such as main memory accesses, are rarely overlapped to a significant degree with execution by an out-of-order processor.





Given this ability to overlap execution with memory latency, why don't more processors utilize TLP? The answer is that designing processors is a mostly evolutionary process, and the ubiquitous deeply pipelined, wide-ILP physical processor cores of today are the evolutionary outgrowth from a time when the CPU was the bottleneck in delivering good performance.

With physical processor cores capable of multiple-GHz clocking, the performance bottleneck has shifted to the memory and I/O subsystems, and TLP has an obvious advantage over ILP for tolerating the large I/O and memory latency prevalent in commercial applications. Of course, every architectural technique has its advantages and disadvantages. The one disadvantage of employing TLP over ILP is that execution of a single strand may be slower on a TLP processor than on an ILP processor. With physical processor cores running at frequencies well over 1 GHz, a strand capable of executing only a single instruction per cycle is fully capable of completing tasks in the time required by the application, making this disadvantage a nonissue for nearly all commercial applications.



4.2  OpenSPARC T1 Overview

OpenSPARC T1 is a single-chip multiprocessor. OpenSPARC T1 contains eight SPARC physical processor cores. Each SPARC physical processor core has full hardware support for four virtual processors (or "strands"). These four strands run simultaneously, with the instructions from each of the four strands executed round-robin by the single-issue pipeline. When a strand encounters a long-latency event, such as a cache miss, it is marked unavailable and instructions are not issued from that strand until the long-latency event is resolved. Round-robin execution of the remaining available strands continues while the long-latency event of the first strand is resolved.
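The selection policy itself is simple. Here is a minimal C sketch (illustrative, and far simpler than the real pipeline logic) of round-robin issue that skips strands parked on long-latency events:

    #include <stdbool.h>

    #define NUM_STRANDS 4

    /* available[s] is cleared when strand s encounters a long-latency
     * event (such as a cache miss) and set again when the event resolves.
     * Returns the next strand to issue from, or -1 if all are stalled. */
    static int pick_next_strand(const bool available[NUM_STRANDS], int last)
    {
        for (int step = 1; step <= NUM_STRANDS; step++) {
            int s = (last + step) % NUM_STRANDS;
            if (available[s])
                return s;
        }
        return -1;
    }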



Each OpenSPARC T1 physical core has a 16-Kbyte, 4-way associative instruction cache (32-byte lines), an 8-Kbyte, 4-way associative data cache (16-byte lines), a 64-entry fully associative instruction Translation Lookaside Buffer (TLB), and a 64-entry fully associative data TLB, all shared by the four strands. The eight SPARC physical cores are connected through a crossbar to an on-chip unified 3-Mbyte, 12-way associative L2 cache (with 64-byte lines). The L2 cache is banked four ways to provide sufficient bandwidth for the eight OpenSPARC T1 physical cores. The L2 cache connects to four on-chip DRAM controllers, which directly interface to DDR2-SDRAM. In addition,

----------------------- Page 47-----------------------

28                            Chapter 4     OpenSPARC T1 and T2 Processor Implementations 



an on-chip J-Bus controller and several on-chip I/O-mapped control registers 

are accessible to the SPARC physical cores. Traffic from the J-Bus coherently 

interacts with the L2 cache. 



A block diagram of the OpenSPARC T1 chip is shown in FIGURE 4-2. 



[Figure: the eight SPARC cores connect through the cache crossbar (CCX) to L2 banks 0-3; each L2 bank pairs with a DRAM control channel driving DDR-II memory. A shared FPU, the eFuse block, the CTU (with JTAG port), the IOB (with CSRs), the J-Bus system interface (200 MHz), and the SSI ROM interface (50 MHz) complete the chip. Notes: (1) Blocks not scaled to physical size. (2) Bus widths are labeled as in#,out#, where "in" is into CCX or L2.]

FIGURE 4-2 OpenSPARC T1 Chip Block Diagram





4.3 OpenSPARC T1 Components



This section describes each component of OpenSPARC T1 in the following subsections:

* SPARC Physical Core
* Floating-Point Unit (FPU)
* L2 Cache
* DRAM Controller
* I/O Bridge (IOB) Unit
* J-Bus Interface (JBI)
* SSI ROM Interface
* Clock and Test Unit (CTU)
* EFuse



4.3.1 OpenSPARC T1 Physical Core



Each OpenSPARC T1 physical core has hardware support for four strands. This support consists of a full register file (with eight register windows) per strand, with most of the ASI, ASR, and privileged registers replicated per strand. The four strands share the instruction and data caches and TLBs. An autodemap[2] feature is included with the TLBs to allow the multiple strands to update the TLB without locking.

[2] Autodemap causes an existing TLB entry to be automatically removed when a new entry is installed with the same virtual page number (VPN) and the same page size.
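As a rough illustration of what autodemap buys, here is a C sketch of an install into a small fully associative TLB model; the entry layout and the names tlb_entry_t and tlb_install are invented for the example and do not reflect the actual T1 arrays.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    typedef struct {
        bool     valid;
        uint64_t vpn;        /* virtual page number */
        int      page_size;  /* encoded page size   */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Install a new translation. Autodemap first removes any entry with
       the same VPN and page size, so concurrent strands can update the
       TLB without taking a lock and without creating duplicates. */
    void tlb_install(int idx, uint64_t vpn, int page_size)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn &&
                tlb[i].page_size == page_size)
                tlb[i].valid = false;

        tlb[idx] = (tlb_entry_t){ .valid = true,
                                  .vpn = vpn,
                                  .page_size = page_size };
    }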



The core pipeline consists of six stages: Fetch, Switch, Decode, Execute, Memory, and Writeback. As shown in FIGURE 4-3, the Switch stage contains a strand instruction register for each strand. One of the strands is picked by the strand scheduler, and the current instruction for that strand is issued to the pipe. While this is done, the hardware fetches the next instruction for that strand and updates the strand instruction register.

The scheduled instruction proceeds down the rest of the stages of the pipe, similar to instruction execution in a single-strand RISC machine. It is decoded in the Decode stage; the register file access also happens at this time. In the Execute stage, all arithmetic and logical operations take place, and the memory address is calculated. The data cache is accessed in the Memory stage, and the instruction is committed in the Writeback stage. All traps are signaled in the Writeback stage.
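As a compact summary, the stage sequence can be written down as a C enumeration; the identifiers below are ours, chosen for the example rather than taken from the T1 design files.

    /* The six OpenSPARC T1 pipeline stages and the work done in each. */
    enum t1_stage {
        FETCH,      /* instruction fetched from the I-cache          */
        SWITCH,     /* strand scheduler picks a strand's instruction */
        DECODE,     /* decode; register file access                  */
        EXECUTE,    /* arithmetic/logic ops; memory address calc     */
        MEMORY,     /* data cache access                             */
        WRITEBACK   /* instruction commits; traps signaled here      */
    };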








Instructions are classified as either short- or long-latency instructions. Upon encountering a long-latency instruction or other stall condition in a certain strand, the strand scheduler stops scheduling that strand for further execution. Scheduling commences again when the long-latency instruction completes or the stall condition clears.



FIGURE 4-3 illustrates the OpenSPARC T1 physical core. 



[Figure: the I-cache feeds per-strand strand instruction registers, from which the strand scheduler selects an instruction for the Decode stage; the datapath comprises the ALU, register files, store buffers, and D-cache, with an external interface to the rest of the chip.]

FIGURE 4-3 OpenSPARC T1 Core Block Diagram



4.3.2 Floating-Point Unit (FPU)



A single floating-point unit is shared by all eight OpenSPARC T1 physical cores. The shared floating-point unit is sufficient for most commercial applications, in which fewer than 1% of instructions typically involve floating-point operations.





4.3.3 L2 Cache



The L2 cache is banked four ways, with the bank selection based on physical address bits 7:6. The cache is 3-Mbyte, 12-way set associative with pseudo-LRU replacement (replacement is based on a used-bit scheme), and has a line size of 64 bytes. Unloaded access time is 23 cycles for an L1 data cache miss and 22 cycles for an L1 instruction cache miss.
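Since bank selection uses physical address bits 7:6 and lines are 64 bytes, the mapping from address to bank reduces to a shift and a mask. A sketch, with an invented helper name:

    #include <stdint.h>

    /* Physical address bits 5:0 select the byte within a 64-byte line;
       bits 7:6 select one of the four L2 banks. */
    static inline unsigned t1_l2_bank(uint64_t paddr)
    {
        return (unsigned)((paddr >> 6) & 0x3);
    }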



4.3.4 DRAM Controller



OpenSPARC T1's DRAM Controller is banked four ways,[3] with each L2 bank interacting with exactly one DRAM Controller bank. The DRAM Controller is interleaved based on physical address bits 7:6, so each DRAM Controller bank must have the same amount of memory installed and enabled.

[3] A two-bank option is available for cost-constrained minimal memory configurations.



OpenSPARC T1 uses DDR2 DIMMs and can support one or two ranks of stacked or unstacked DIMMs. Each DRAM bank/port is two DIMMs wide (128-bit data + 16-bit ECC). All installed DIMMs on an individual bank/port must be identical, and the same total amount of memory (number of bytes) must be installed on each DRAM Controller port. The DRAM controller frequency is an exact ratio of the CMP core frequency, where the CMP core frequency must be at least 4x the DRAM controller frequency. The DDR (double data rate) data buses, of course, transfer data at twice the DRAM controller frequency.
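The clocking rules above are easy to mis-state, so a tiny check helps. The sketch below uses made-up clock values purely to exercise the two constraints (exact ratio, and core at least 4x the controller); the numbers are not T1's actual clocks.

    #include <assert.h>

    int main(void)
    {
        unsigned cmp_mhz  = 1200;  /* hypothetical CMP core clock        */
        unsigned dram_mhz = 200;   /* hypothetical DRAM controller clock */

        assert(cmp_mhz % dram_mhz == 0);  /* exact ratio                 */
        assert(cmp_mhz >= 4 * dram_mhz);  /* core >= 4x the controller   */

        /* DDR buses transfer at twice the controller frequency. */
        unsigned ddr_mts = 2 * dram_mhz;  /* 400 MT/s in this example    */
        (void)ddr_mts;
        return 0;
    }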



The DRAM Controller also supports a small memory configuration mode, using only two DRAM ports. In this mode, L2 banks 0 and 2 are serviced by DRAM port 0, and L2 banks 1 and 3 are serviced by DRAM port 1. The installed memory on each of these ports is still two DIMMs wide.
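The bank-to-port mapping in both configurations comes down to a single expression; the function below is an illustrative restatement with an invented name, not controller code.

    #include <stdbool.h>

    /* Map an L2 bank (0-3) to a DRAM port. In the small memory
       configuration, banks 0 and 2 share port 0 and banks 1 and 3
       share port 1; otherwise each bank has its own port. */
    static inline unsigned dram_port(unsigned l2_bank, bool small_mem)
    {
        return small_mem ? (l2_bank & 0x1) : l2_bank;
    }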



4.3.5 I/O Bridge (IOB) Unit



The IOB performs an address decode on I/O-addressable transactions and directs them to the appropriate internal block or to the appropriate external interface (J-Bus or SSI). In addition, the IOB maintains the register status for external interrupts.








4.3.6 J-Bus Interface (JBI)



J-Bus is the interconnect between OpenSPARC T1 and the I/O subsystem. It is a 200 MHz, 128-bit-wide, multiplexed address/data bus, used predominantly for DMA traffic, plus the PIO traffic to control it.



The JBI is the block that interfaces to J-Bus, receiving and responding to DMA requests, routing them to the appropriate L2 banks, issuing PIO transactions on behalf of the strands, and forwarding responses back.



4.3.7 SSI ROM Interface



OpenSPARC T1 has a 50 Mbit/s serial interface (SSI) that connects to an external FPGA, which in turn interfaces to the boot ROM. In addition, the SSI interface supports PIO accesses across the SSI, thus supporting optional CSRs or other interfaces within the FPGA.



4.3.8 Clock and Test Unit (CTU)



The CTU contains the clock generation, reset, and JTAG circuitry. 



OpenSPARC T1 has a single PLL, which takes the J-Bus clock as its input reference. The PLL output is divided down to generate the CMP core clocks (for OpenSPARC T1 and caches), the DRAM clock (for the DRAM controller and external DIMMs), and the internal J-Bus clock (for the IOB and JBI). Thus, all OpenSPARC T1 clocks are ratioed. Sync pulses are generated to control transmission of signals and data across clock domain boundaries.



The CTU has the state machines for internal reset sequencing, which include logic to reset the PLL and signal when the PLL is locked, update clock ratios on warm resets (if so programmed), enable clocks to each block in turn, and distribute reset so that its assertion is seen simultaneously in all clock domains.



The CTU also contains the JTAG block, which allows access to the shadow scan chains and has a CREG interface that allows JTAG to issue reads of any I/O-addressable register, some ASI locations, and any memory location while OpenSPARC T1 is in operation.





4.3.9 EFuse



The eFuse (electronic fuse) block contains configuration information that is electronically burned in as part of manufacturing, including the part serial number and strand-available information.



4.4 OpenSPARC T2 Overview



OpenSPARC T2 is a single-chip multithreaded (CMT) processor. OpenSPARC T2 contains eight SPARC physical processor cores. Each SPARC physical processor core has full hardware support for eight virtual processors (strands), two integer execution pipelines, one floating-point execution pipeline, and one memory pipeline. The floating-point and memory pipelines are shared by all eight strands. The eight strands are hard-partitioned into two groups of four, and the four strands within a group share a single integer pipeline.



While all eight strands run simultaneously, at any given time at most two strands will be active in the physical core, and those two strands will be issuing either a pair of integer pipeline operations, an integer operation and a floating-point operation, an integer operation and a memory operation, or a floating-point operation and a memory operation. Strands are switched on a cycle-by-cycle basis between the available strands within the hard-partitioned group of four, using a least-recently-issued priority scheme.



When a strand encounters a long-latency event, such as a cache miss, it is marked unavailable and instructions will not be issued from that strand until the long-latency event is resolved. Execution of the remaining available strands will continue while the long-latency event of the first strand is resolved.
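A least-recently-issued pick over one four-strand group can be sketched as below; the ready and last_issue arrays and the function name are invented stand-ins for the per-strand state the hardware keeps, not T2 design identifiers.

    #include <stdbool.h>

    #define GROUP_STRANDS 4

    static bool          ready[GROUP_STRANDS];       /* not stalled         */
    static unsigned long last_issue[GROUP_STRANDS];  /* cycle of last issue */

    /* Among the available strands in the group, pick the one that
       issued least recently; return -1 if the whole group is stalled. */
    int pick_strand_lri(void)
    {
        int best = -1;
        for (int s = 0; s < GROUP_STRANDS; s++)
            if (ready[s] && (best < 0 || last_issue[s] < last_issue[best]))
                best = s;
        return best;
    }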



Each OpenSPARC T2 physical core has a 16-Kbyte, 8-way associative instruction cache (32-byte lines), an 8-Kbyte, 4-way associative data cache (16-byte lines), a 64-entry fully associative instruction TLB, and a 128-entry fully associative data TLB that are shared by the eight strands. The eight OpenSPARC T2 physical cores are connected through a crossbar to an on-chip unified 4-Mbyte, 16-way associative L2 cache (64-byte lines).



The L2 cache is banked eight ways to provide sufficient bandwidth for the eight OpenSPARC T2 physical cores. The L2 cache connects to four on-chip DRAM Controllers, which directly interface to a pair of fully buffered DIMM (FBD) channels. In addition, two 1-Gbit/10-Gbit Ethernet MACs and several on-chip I/O-mapped control registers are accessible to the SPARC physical cores.



A block diagram of the OpenSPARC T2 chip is shown in FIGURE 4-4. 

[Figure: the eight SPARC cores connect through the cache crossbar (CCX) to eight L2 banks; pairs of L2 banks feed the four memory controller units (MCU 0-3), each driving a pair of fully buffered DIMM (FBD) channels (DIMMs 1-8, one or two ranks per DIMM; an optional dual-channel mode is noted). Clock domains shown include 1.4 GHz cores, 800 MHz memory interfaces, and 4.8 GHz FBD links. System-on-chip blocks include the TCU, CCU, eFuse, the NIU with two 10-Gb Ethernet MACs, the SIU, PCI-Express interfaces, an FCRAM interface, and the SSI ROM interface.]

FIGURE 4-4 OpenSPARC T2 Chip Block Diagram



4.5 OpenSPARC T2 Components



This section describes the major components in OpenSPARC T2. 





4.5.1 OpenSPARC T2 Physical Core



Each OpenSPARC T2 physical core has hardware support for eight strands. This support consists of a full register file (with eight register windows) per strand, with most of the ASI, ASR, and privileged registers replicated per strand. The eight strands share the instruction and data caches and TLBs. An autodemap feature is included with the TLBs to allow the multiple strands to update the TLB without locking.



Each OpenSPARC T2 physical core contains a floating-point unit, shared by all eight strands. The floating-point unit performs single- and double-precision floating-point operations, graphics operations, and integer multiply and divide operations.



4.5.2 L2 Cache



The L2 cache is banked eight ways. To provide for better partial-die recovery, OpenSPARC T2 can also be configured in 4-bank and 2-bank modes (with 1/2 and 1/4 the total cache size, respectively). Bank selection is based on physical address bits 8:6 for 8 banks, 7:6 for 4 banks, and 6 for 2 banks. The cache is 4 Mbytes, 16-way set associative, and uses index hashing. The line size is 64 bytes.
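The three bank-select options differ only in how many physical address bits above the 64-byte line offset are used. A sketch with an invented helper name (index hashing omitted for clarity):

    #include <stdint.h>

    /* Select an L2 bank on OpenSPARC T2: bits 8:6 of the physical
       address in 8-bank mode, 7:6 in 4-bank mode, and bit 6 in
       2-bank mode. */
    static inline unsigned t2_l2_bank(uint64_t paddr, unsigned num_banks)
    {
        switch (num_banks) {
        case 8:  return (unsigned)((paddr >> 6) & 0x7);
        case 4:  return (unsigned)((paddr >> 6) & 0x3);
        default: return (unsigned)((paddr >> 6) & 0x1);  /* 2 banks */
        }
    }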



4.5.3 Memory Controller Unit (MCU)



OpenSPARC T2 has four MCUs, one for each memory branch, with a pair of L2 banks interacting with exactly one DRAM branch. The branches are interleaved based on physical address bits 7:6 and support 1-16 DDR2 DIMMs. Each memory branch is two FBD channels wide. A branch may use only one of the FBD channels in a reduced-power configuration.



Each DRAM branch operates independently and can have a different memory size and a different kind of DIMM (for example, a different number of ranks or different CAS latency). Software should not use an address space larger than four times the lowest memory capacity in a branch, because the cache lines are interleaved across branches. The DRAM controller frequency is the same as that of the DDR (double data rate) data buses, which is twice the DDR SDRAM clock frequency. The FBDIMM links run at six times the frequency of the DDR data buses.
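Putting the ratios together with DDR2-800-style numbers (the 400 MHz base clock here is an illustrative value chosen to be consistent with the 800 MHz and 4.8 GHz labels in FIGURE 4-4, not a stated T2 requirement):

    #include <stdio.h>

    int main(void)
    {
        unsigned ddr_clk_mhz  = 400;               /* DDR2 base clock     */
        unsigned ddr_data_mts = 2 * ddr_clk_mhz;   /* data bus: 800 MT/s  */
        unsigned fbd_link_mhz = 6 * ddr_data_mts;  /* FBD links: 4800 MHz */

        printf("data bus %u MT/s, FBD links %u MHz\n",
               ddr_data_mts, fbd_link_mhz);
        return 0;
    }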



The OpenSPARC T2 MCU implements a DDR2 FBD design model that is based on various JEDEC-approved DDR2 SDRAM and FBDIMM standards. JEDEC has received information that certain patents or patent applications may be relevant to the FBDIMM Advanced Memory Buffer standard (JESD82-20) as well as to other standards related to FBDIMM technology (JESD206); for more information, see http://www.jedec.org/download/search/FBDIMM/Patents.xls. Sun Microsystems does not provide any legal opinions as to the validity or relevancy of such patents or patent applications. Sun Microsystems encourages prospective users of the OpenSPARC T2 MCU design to review all information assembled by JEDEC and develop their own independent conclusion.



4.5.4 Noncacheable Unit (NCU)



The NCU performs an address decode on I/O-addressable transactions and directs them to the appropriate block (for example, DMU, CCU). In addition, the NCU maintains the register status for external interrupts.



4.5.5 System Interface Unit (SIU)



The SIU connects the DMU and the L2 cache. The SIU is the L2 cache access point for the network subsystem.



4.5.6 SSI ROM Interface (SSI)



OpenSPARC T2 has a 50 Mbit/s serial interface (SSI), which connects to an external boot ROM. In addition, the SSI supports PIO accesses across the SSI, thus supporting optional Control and Status registers (CSRs) or other interfaces attached to the SSI.



4.6 Summary of Differences Between OpenSPARC T1 and OpenSPARC T2



OpenSPARC T2 follows the CMT philosophy of OpenSPARC T1, but adds more execution capability to each physical core, as well as significant system-on-a-chip components and an enhanced L2 cache.





4.6.1 Microarchitectural Differences



The following lists the microarchitectural differences. 



* Each OpenSPARC T2 physical core contains two integer execution pipelines and a single floating-point pipeline. OpenSPARC T1 has a single integer execution pipeline, and all of its cores share a single floating-point pipeline.

* Each physical core in OpenSPARC T2 supports eight strands, which all share the floating-point pipeline. The eight strands are partitioned into two groups of four strands, each of which shares an integer pipeline. OpenSPARC T1 shares its single integer pipeline among four strands.

* The pipeline in OpenSPARC T2 is eight stages, two stages longer than OpenSPARC T1's.

* The instruction cache is 8-way associative, compared to 4-way in OpenSPARC T1.

* The L2 cache is 4-Mbyte, 8-banked, and 16-way associative, compared to 3-Mbyte, 4-banked, and 12-way associative in OpenSPARC T1.

* The data TLB is 128 entries, compared to 64 entries in OpenSPARC T1.

* The memory interface in OpenSPARC T2 supports fully buffered DIMMs (FBDs), providing higher capacity and memory clock rates.

* The OpenSPARC T2 memory channels support a single-DIMM option for low-cost configurations.

* OpenSPARC T2 includes a network interface unit (NIU), to which network traffic management tasks can be off-loaded.



4.6.2 Instruction Set Architecture (ISA) Differences



There are a number of ISA differences between OpenSPARC T2 and OpenSPARC T1, as follows:



* OpenSPARC T2 fully supports all VIS 2.0 instructions. OpenSPARC T1 supports a subset of VIS 1.0 plus the SIAM (Set Interval Arithmetic Mode) instruction (on OpenSPARC T1, the remainder of the VIS 1.0 and 2.0 instructions trap to software for emulation).

* OpenSPARC T2 supports the full CMP specification, as described in UltraSPARC Architecture 2007. OpenSPARC T1 has its own version of the CMP control/status registers. OpenSPARC T2 consists of eight physical cores, with eight virtual processors per physical core.





* OpenSPARC T2 does not support OpenSPARC T1's idle state or its idle, halt, or resume messages. Instead, OpenSPARC T2 supports parking and unparking as specified in the CMP chapter of the UltraSPARC Architecture 2007 Specification. Note that parking is similar to OpenSPARC T1's idle state. OpenSPARC T2 does support an equivalent to the halt state, which on OpenSPARC T1 is entered by writing to HPR 1E₁₆. However, OpenSPARC T2 does not support OpenSPARC T1's STRAND_STS_REG ASR, which holds the strand state. The halted state is not software-visible on OpenSPARC T2.

* OpenSPARC T2 does not support the INT_VEC_DIS register (which allows any OpenSPARC T1 strand to generate an interrupt, reset, idle, or resume message to any strand). Instead, an alias to ASI_INTR_W is provided, which allows only the generation of an interrupt to any strand.

* OpenSPARC T2 supports the ALLCLEAN, INVALW, NORMALW, OTHERW, POPC, and FSQRT instructions in hardware.

* OpenSPARC T2's floating-point unit generates fp_unfinished_other with FSR.ftt unfinished_FPop for most denorm cases and supports a nonstandard mode that flushes denorms to zero. OpenSPARC T1 handles denorms in hardware, never generates an unfinished_FPop, and does not support a nonstandard mode.

* OpenSPARC T2 generates an illegal_instruction trap on any quad-precision FP instruction, whereas OpenSPARC T1 generates an fp_exception_other trap on numeric and move-FP-quad instructions. See Table 5-2 of the UltraSPARC T2 Supplement to the "UltraSPARC Architecture 2007 Specification."

* OpenSPARC T2 generates a privileged_action exception upon attempted access to hyperprivileged ASIs by privileged software, whereas, in such cases, OpenSPARC T1 takes a data_access_exception exception.

* OpenSPARC T2 supports PSTATE.tct; OpenSPARC T1 does not.

* OpenSPARC T2 implements the SAVE instruction similarly to all previous UltraSPARC processors. OpenSPARC T1 implements a SAVE instruction that updates the locals in the new window to be the same as the locals in the old window, and swaps the ins (outs) of the old window with the outs (ins) of the new window.

* PSTATE.am masking details differ between OpenSPARC T1 and OpenSPARC T2, as described in Section 11.1.8 of the UltraSPARC T2 Supplement to the "UltraSPARC Architecture 2007 Specification."

* OpenSPARC T2 implements PREFETCH fcn = 18₁₆ as a prefetch invalidate cache entry, for efficient software cache flushing.

* The Synchronous Fault Status register (SFSR) is eliminated in OpenSPARC T2.





* OpenSPARC T1's data_access_exception is replaced in OpenSPARC T2 by multiple DAE_* exceptions.

* OpenSPARC T1's instruction_access_exception exception is replaced in OpenSPARC T2 by multiple IAE_* exceptions.



4.6.3 MMU Differences



The OpenSPARC T2 and OpenSPARC T1 MMUs differ as follows: 



* OpenSPARC T2 has a 128-entry DTLB, whereas OpenSPARC T1 has a 64-entry DTLB.

* OpenSPARC T2 supports a pair of primary context registers and a pair of secondary context registers. OpenSPARC T1 supports a single primary context register and a single secondary context register.

* OpenSPARC T2 does not support a locked bit in the TLBs. OpenSPARC T1 supports a locked bit in the TLBs.

* OpenSPARC T2 supports only the sun4v (the architected interface between privileged software and hyperprivileged software) TTE format for the I/D-TLB Data-In and Data-Access registers. OpenSPARC T1 supports both the sun4v and the older sun4u TTE formats.

* OpenSPARC T2 is compatible with UltraSPARC Architecture 2007 with regard to multiple flavors of data access exception (DAE_*) and instruction access exception (IAE_*). As per UltraSPARC Architecture 2005, OpenSPARC T1 uses the single flavor of data_access_exception and instruction_access_exception, indicating the "flavors" in its SFSR register.

* OpenSPARC T2 supports a hardware Table Walker to accelerate ITLB and DTLB miss handling.

* The number and format of translation storage buffer (TSB) configuration and pointer registers differ between OpenSPARC T1 and OpenSPARC T2. OpenSPARC T2 uses physical addresses for TSB pointers; OpenSPARC T1 uses virtual addresses for TSB pointers.

* OpenSPARC T1 and OpenSPARC T2 support the same four page sizes (8 Kbyte, 64 Kbyte, 4 Mbyte, 256 Mbyte). OpenSPARC T2 generates an unsupported_page_size trap when an illegal page size is programmed into the TSB registers or an attempt is made to load a page of illegal size into the TLB. OpenSPARC T1 forces an illegal page size being programmed into the TSB registers to 256 Mbytes and generates a data_access_exception trap when a page with an illegal size is loaded into the TLB.

* OpenSPARC T2 adds a demap real operation, which demaps all pages with r = 1 from the TLB.





* OpenSPARC T2 supports an I-TLB probe ASI.

* Autodemapping of pages in the TLBs demaps only pages of the same size or of a larger size in OpenSPARC T2. In OpenSPARC T1, autodemap demaps pages of the same size, larger size, or smaller size.

* OpenSPARC T2 supports detection of multiple hits in the TLBs.



4.6.4 Performance Instrumentation Differences



Both OpenSPARC T1 and OpenSPARC T2 provide access to hardware performance counters through the PIC and PCR registers. However, the events captured by the hardware differ significantly between OpenSPARC T1 and OpenSPARC T2, with OpenSPARC T2 capturing a much larger set of events, as described in Chapter 10 of the UltraSPARC T2 Supplement to the "UltraSPARC Architecture 2007 Specification." OpenSPARC T2 also supports counting events in hyperprivileged mode; OpenSPARC T1 does not.



In addition, the implementation of pic_overflow differs between OpenSPARC T1 and OpenSPARC T2. OpenSPARC T1 provides a disrupting pic_overflow trap on the instruction following the one that caused the overflow event. OpenSPARC T2 provides a disrupting pic_overflow on an instruction that generates the event, but one that occurs within an epsilon number of event-generating instructions of the actual overflow.



Both OpenSPARC T2 and OpenSPARC T1 support DRAM performance counters.



4.6.5 Error Handling Differences



Error handling differs quite a bit between OpenSPARC T1 and OpenSPARC T2. OpenSPARC T1 primarily employs hardware correction of errors, whereas OpenSPARC T2 primarily employs software correction of errors.



* OpenSPARC T2 uses the following traps for error handling:
  * data_access_error
  * data_access_MMU_error
  * hw_corrected_error
  * instruction_access_error
  * instruction_access_MMU_error
  * internal_processor_error
  * store_error
  * sw_recoverable_error