The Way of the great learning involves manifesting virtue, renovating the people, and abiding by the highest good.

Wednesday, December 31, 2008

voyage linux

Debian Lovers - Why I love Voyage Linux      http://www.adamsinfo.com/debian-lovers-why-i-love-voyage-linux/


I'm Adam Palmer, and I'm an embedded Linux/embedded hardware enthusiast. I spend most of my time managing server clusters and doing some PHP/MySQL development, whilst what little spare time I have is dedicated to playing with gadgets. I'm always happy to consult on or manage any Linux-related, web application or hosting project. Please get in touch with me!

For those Debian lovers out there, I have finally found a great embedded distro. I’ve always stayed away from the multitude of distros available, each with their own package manager or lack thereof, each with their own preinstalled software or, again, lack thereof, and each with their own caveats.

I began my journey into Linux with SuSE about 11 years ago at the time of writing, and have also given RedHat a fair chance in the past. In my first job I was forced to battle against Slackware for two years, and about 7 years ago, discovered Debian.

From that point on, my love for Debian has been absolutely unshakeable. Why do I love Debian? It’s perfectly crafted, clean, unbloated, exquisitely simple yet powerful even for the more novice user, and elegantly understated. I toyed with Ubuntu for some time as a desktop OS, as it’s Debian based and does most of the Xorg configuration hassle for you, although I usually find myself using a minimalist Windows XP Professional setup on my desktops.

Enter Voyage Linux. It must have been fate... I was browsing the ALIX web site and noticed, amongst many others, that ‘Voyage Linux’ was supported. Something other than its mediocre name must have drawn my attention to it, as a quick Google search led me to http://linux.voyage.hk

The first sentence sold it to me: “Voyage Linux is Debian derived distribution that is best run on a x86-based embedded platforms such as WRAP, ALIX and Soekris 45xx/48xx boards.”

I downloaded and installed Voyage on my ALIX robot board in minutes, and without delay was up and running. If Debian is clean and unbloated, Voyage is anorexic. It’s so lightweight by default that you don’t even need to consider removing packages from the default installation. It has a fully working apt package manager, and you can also happily install packages from the main Debian repositories. It looks, works and behaves just like Debian, with no annoyances or broken/badly behaved tools. It also ships with a working set of base utilities and does not rely on busybox as some embedded distros do. The default kernel is incredibly well crafted too, and includes madwifi support by default, which again is great as I use an Atheros chipset. I quickly and cleanly upgraded from the included 2.6.24 to a regular 2.6.27.6 from kernel.org, which was the newest available at the time, rebooted into 2.6.27.6 with no complaints, and began copying my C applications over to the new distro to operate the robot. Absolutely perfect and no complaints.
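For the curious, that upgrade amounts to a standard vanilla kernel build. The sketch below is my guess at the steps rather than the author's actual commands: the version number comes from the post, while the config path and the use of make oldconfig are assumptions about a typical Debian-style setup.

    # Fetch and unpack the vanilla source (version as mentioned above)
    wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.27.6.tar.bz2
    tar xjf linux-2.6.27.6.tar.bz2
    cd linux-2.6.27.6

    # Start from the running Voyage config (this path is an assumption),
    # then answer prompts for options that are new since 2.6.24
    cp /boot/config-2.6.24-voyage .config
    make oldconfig

    # Build and install the kernel plus modules, then reboot into it
    make
    make modules_install install
    reboot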

For a hardcore Debian lover who would never imagine installing anything else… ever… Voyage Linux is now immediately available on my Key chain USB mass storage stick, and I would be hard pushed for a reason to go back to Debian on any embedded/minimalist hardware project.

About Voyage Linux

Voyage Linux is a Debian-derived distribution that is best run on x86-based embedded platforms such as WRAP, ALIX and Soekris 45xx/48xx boards.

It can also run on low-end x86 PC platforms. A typical installation requires 128MB of disk space, although larger storage allows more packages to be installed. Voyage Linux is so small that it is best suited to running a full-featured firewall, wireless access point, VoIP gateway or network storage device.



http://sumancolumbia.blogspot.com/2008/07/long-voyage-modpost-to-work-and-compile.html    

Long "voyage": modpost to work and compile

Ah... after weeks of hacking through kernel code and Makefiles, I am finally able to compile and run modpost, which I need for compiling my kernel modules on Voyage Linux 0.4.

This is the story: I need to install some additional kernel modules for the WORKIT project on Voyage. All of the kernel modules pass step 1 of the Makefile and compile, but then I get this error:

Building modules, stage 2.
MODPOST 1 modules
/bin/sh: scripts/mod/modpost: No such file or directory


A long search for modpost led me to find that modpost is used for compiling modules, and is part of the kernel development tools (sorry, missing the link for that post, and couldn't find it again on Google - tells you how exotic this modpost is.)

After a lot more trial and error - which included downloading and trying to recompile the Voyage Linux kernels - I found two posts today that helped immensely in coming to the last step:
  • No modpost directory - Linux Forums: this helped me realize how to build modpost (though it referenced the wrong directory; on my Voyage Linux, the code is at /lib/modules/2.6.20-486-voyage/build). Still, I got the exact same error message as in the forum post: unrecognized command line option "-m".

I copied the makefile command, removed the -m directive, and tried compiling again; this time I got a "missing elfconfig.h" error message and other errors dependent on it.

A search for "elfconfig.h" led me to a post on the Linux kernel mailing list about the missing elfconfig.h, which hints at running "make modules" and "make scripts".

Running those two commands in the /lib/modules/2.6.20-486-voyage/build directory solves the problem - it creates the elfconfig.h (!) file and compiles modpost. (Expect this to take a while - it makes all modules and scripts.)
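For reference, here is the whole fix in one place - a sketch assuming, as in the post, that the kernel source tree for the running kernel lives under /lib/modules/2.6.20-486-voyage/build. The module directory name in the last step is a made-up placeholder, not a path from the post.

    # Build modpost inside the kernel tree (this also generates elfconfig.h)
    cd /lib/modules/2.6.20-486-voyage/build
    make scripts        # compiles scripts/mod/modpost
    make modules        # expect this to take a while

    # Then rebuild the out-of-tree module against that tree
    # ("~/workit-module" is a hypothetical path)
    cd ~/workit-module
    make -C /lib/modules/2.6.20-486-voyage/build M=$(pwd) modules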
Added information:

The Internet Real-Time Lab (IRT) in the Computer Science Department at Columbia University conducts research in the areas of Internet and multimedia services: Internet telephony, wireless and mobile networks, streaming, quality of service, resource reservation, dynamic pricing for the Internet, network measurement and reliability, service location, network security, media on demand, content distribution networks, multicast networks and ubiquitous and context-aware computing and communication.

Tuesday, December 30, 2008

Linux: The 0.01 Release


Submitted by Jeremy on July 26, 2007 - 3:56pm.

"This is a free minix-like kernel for i386(+) based AT-machines," began the Linux version 0.01 release notes in September of 1991 for the first release of the Linux kernel. "As the version number (0.01) suggests this is not a mature product. Currently only a subset of AT-hardware is supported (hard-disk, screen, keyboard and serial lines), and some of the system calls are not yet fully implemented (notably mount/umount aren't even implemented)." Booting the original 0.01 Linux kernel required bootstrapping it with minix, and the keyboard driver was written in assembly and hard-wired for a Finnish keyboard. The listed features were mostly presented as a comparison to minix and included, efficiently using the 386 chip rather than the older 8088, use of system calls rather than message passing, a fully multithreaded FS, minimal task switching, and visible interrupts. Linus Torvalds noted, "the guiding line when implementing linux was: get it working fast. I wanted the kernel simple, yet powerful enough to run most unix software." In a section titled "Apologies :-)" he noted:

"This isn't yet the 'mother of all operating systems', and anyone who hoped for that will have to wait for the first real release (1.0), and even then you might not want to change from minix. This is a source release for those that are interested in seeing what linux looks like, and it's not really supported yet."

Linus had originally intended to call the new kernel "Freax". According to Wikipedia, the name Linux was actually coined by Ari Lemmke, who maintained the ftp.funet.fi FTP server from which the kernel was originally distributed.

The initial post that Linus made about Linux was to the comp.os.minix Usenet group titled, "What would you like to see most in minix". It began:

"I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things)."

Later in the same thread, Linus went on to talk about how unportable the code was:

"Simply, I'd say that porting is impossible. It's mostly in C, but most people wouldn't call what I write C. It uses every conceivable feature of the 386 I could find, as it was also a project to teach me about the 386. As already mentioned, it uses a MMU, for both paging (not to disk yet) and segmentation. It's the segmentation that makes it REALLY 386 dependent (every task has a 64Mb segment for code & data - max 64 tasks in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).

"It also uses every feature of gcc I could find, specifically the __asm__ directive, so that I wouldn't need so much assembly language objects. Some of my 'C'-files (specifically mm.c) are almost as much assembler as C. It would be 'interesting' even to port it to another compiler (though why anybody would want to use anything other than gcc is a mystery).

"Unlike minix, I also happen to LIKE interrupts, so interrupts are handled without trying to hide the reason behind them (I especially like my hard-disk-driver. Anybody else make interrupts drive a state-machine?). All in all it's a porters nightmare. "

Indeed, Linux 1.0 was released on March 13th, 1994 supporting only the 32-bit i386 architecture. However, by the release of Linux 1.2 on March 7th, 1995 it had already been ported to 32-bit MIPS, 32-bit SPARC, and the 64-bit Alpha. By the release of Linux 2.0 on June 9th, 1996 support had also been added for the 32-bit m68k and 32-bit PowerPC architectures. And jumping forward to the Linux 2.6 kernel, first released in December of 2003, it has been and continues to be ported to numerous additional architectures.



Linux 0.01 release notes:

Notes for linux release 0.01

0. Contents of this directory

linux-0.01.tar.Z - sources to the kernel
bash.Z           - compressed bash binary if you want to test it
update.Z         - compressed update binary
RELNOTES-0.01    - this file

1. Short intro

This is a free minix-like kernel for i386(+) based AT-machines. Full source is included, and this source has been used to produce a running kernel on two different machines. Currently there are no kernel binaries for public viewing, as they have to be recompiled for different machines. You need to compile it with gcc (I use 1.40, don't know if 1.37.1 will handle all __asm__-directives), after having changed the relevant configuration file(s).

As the version number (0.01) suggests this is not a mature product. Currently only a subset of AT-hardware is supported (hard-disk, screen, keyboard and serial lines), and some of the system calls are not yet fully implemented (notably mount/umount aren't even implemented). See comments or readme's in the code.

This version is also meant mostly for reading - ie if you are interested in how the system looks like currently. It will compile and produce a working kernel, and though I will help in any way I can to get it working on your machine (mail me), it isn't really supported. Changes are frequent, and the first "production" version will probably differ wildly from this pre-alpha-release.

Hardware needed for running linux:
- 386 AT
- VGA/EGA screen
- AT-type harddisk controller (IDE is fine)
- Finnish keyboard (oh, you can use a US keyboard, but not without some practise :-)

The Finnish keyboard is hard-wired, and as I don't have a US one I cannot change it without major problems. See kernel/keyboard.s for details. If anybody is willing to make an even partial port, I'd be grateful. Shouldn't be too hard, as it's tabledriven (it's assembler though, so ...)

Although linux is a complete kernel, and uses no code from minix or other sources, almost none of the support routines have yet been coded. Thus you currently need minix to bootstrap the system. It might be possible to use the free minix demo-disk to make a filesystem and run linux without having minix, but I don't know...

2. Copyrights etc

This kernel is (C) 1991 Linus Torvalds, but all or part of it may be redistributed provided you do the following:

- Full source must be available (and free), if not with the distribution then at least on asking for it.
- Copyright notices must be intact. (In fact, if you distribute only parts of it you may have to add copyrights, as there aren't (C)'s in all files.) Small partial excerpts may be copied without bothering with copyrights.
- You may not distibute this for a fee, not even "handling" costs.

Mail me at [email blocked] if you have any questions.

Sadly, a kernel by itself gets you nowhere. To get a working system you need a shell, compilers, a library etc. These are separate parts and may be under a stricter (or even looser) copyright. Most of the tools used with linux are GNU software and are under the GNU copyleft. These tools aren't in the distribution - ask me (or GNU) for more info.

3. Short technical overview of the kernel.

The linux kernel has been made under minix, and it was my original idea to make it binary compatible with minix. That was dropped, as the differences got bigger, but the system still resembles minix a great deal. Some of the key points are:

- Efficient use of the possibilities offered by the 386 chip. Minix was written on a 8088, and later ported to other machines - linux takes full advantage of the 386 (which is nice if you /have/ a 386, but makes porting very difficult)
- No message passing, this is a more traditional approach to unix. System calls are just that - calls. This might or might not be faster, but it does mean we can dispense with some of the problems with messages (message queues etc). Of course, we also miss the nice features :-p.
- Multithreaded FS - a direct consequence of not using messages. This makes the filesystem a bit (a lot) more complicated, but much nicer. Coupled with a better scheduler, this means that you can actually run several processes concurrently without the performance hit induced by minix.
- Minimal task switching. This too is a consequence of not using messages. We task switch only when we really want to switch tasks - unlike minix which task-switches whatever you do. This means we can more easily implement 387 support (indeed this is already mostly implemented)
- Interrupts aren't hidden. Some people (among them Tanenbaum) think interrupts are ugly and should be hidden. Not so IMHO. Due to practical reasons interrupts must be mainly handled by machine code, which is a pity, but they are a part of the code like everything else. Especially device drivers are mostly interrupt routines - see kernel/hd.c etc.
- There is no distinction between kernel/fs/mm, and they are all linked into the same heap of code. This has it's good sides as well as bad. The code isn't as modular as the minix code, but on the other hand some things are simpler. The different parts of the kernel are under different sub-directories in the source tree, but when running everything happens in the same data/code space.

The guiding line when implementing linux was: get it working fast. I wanted the kernel simple, yet powerful enough to run most unix software. The file system I couldn't do much about - it needed to be minix compatible for practical reasons, and the minix filesystem was simple enough as it was. The kernel and mm could be simplified, though:

- Just one data structure for tasks. "Real" unices have task information in several places, I wanted everything in one place.
- A very simple memory management algorithm, using both the paging and segmentation capabilities of the i386. Currently MM is just two files - memory.c and page.s, just a couple of hundreds of lines of code.

These decisions seem to have worked out well - bugs were easy to spot, and things work.

4. The "kernel proper"

All the routines handling tasks are in the subdirectory "kernel". These include things like 'fork' and 'exit' as well as scheduling and minor system calls like 'getpid' etc. Here are also the handlers for most exceptions and traps (not page faults, they are in mm), and all low-level device drivers (get_hd_block, tty_write etc). Currently all faults lead to a exit with error code 11 (Segmentation fault), and the system seems to be relatively stable ("crashme" hasn't - yet).

5. Memory management

This is the simplest of all parts, and should need only little changes. It contains entry-points for some things that the rest of the kernel needs, but mostly copes on it's own, handling page faults as they happen. Indeed, the rest of the kernel usually doesn't actively allocate pages, and just writes into user space, letting mm handle any possible 'page-not-present' errors.

Memory is dealt with in two completely different ways - by paging and segmentation. First the 386 VM-space (4GB) is divided into a number of segments (currently 64 segments of 64Mb each), the first of which is the kernel memory segment, with the complete physical memory identity-mapped into it. All kernel functions live within this area.

Tasks are then given one segment each, to use as they wish. The paging mechanism sees to filling the segment with the appropriate pages, keeping track of any duplicate copies (created at a 'fork'), and making copies on any write. The rest of the system doesn't need to know about all this.

6. The file system

As already mentioned, the linux FS is the same as in minix. This makes crosscompiling from minix easy, and means you can mount a linux partition from minix (or the other way around as soon as I implement mount :-). This is only on the logical level though - the actual routines are very different.

NOTE! Minix-1.6.16 seems to have a new FS, with minor modifications to the 1.5.10 I've been using. Linux won't understand the new system.

The main difference is in the fact that minix has a single-threaded file-system and linux hasn't. Implementing a single-threaded FS is much easier as you don't need to worry about other processes allocating buffer blocks etc while you do something else. It also means that you lose some of the multiprocessing so important to unix.

There are a number of problems (deadlocks/raceconditions) that the linux kernel needed to address due to multi-threading. One way to inhibit race-conditions is to lock everything you need, but as this can lead to unnecessary blocking I decided never to lock any data structures (unless actually reading or writing to a physical device). This has the nice property that dead-locks cannot happen. Sadly it has the not so nice property that race-conditions can happen almost everywhere. These are handled by double-checking allocations etc (see fs/buffer.c and fs/inode.c). Not letting the kernel schedule a task while it is in supervisor mode (standard unix practise), means that all kernel/fs/mm actions are atomic (not counting interrupts, and we are careful when writing those) if you don't call 'sleep', so that is one of the things we can count on.

7. Apologies :-)

This isn't yet the "mother of all operating systems", and anyone who hoped for that will have to wait for the first real release (1.0), and even then you might not want to change from minix. This is a source release for those that are interested in seeing what linux looks like, and it's not really supported yet. Anyone with questions or suggestions (even bug-reports if you decide to get it working on your system) is encouraged to mail me.

8. Getting it working

Most hardware dependancies will have to be compiled into the system, and there a number of defines in the file "include/linux/config.h" that you have to change to get a personalized kernel. Also you must uncomment the right "equ" in the file boot/boot.s, telling the bootup-routine what kind of device your A-floppy is. After that a simple "make" should make the file "Image", which you can copy to a floppy (cp Image /dev/PS0 is what I use with a 1.44Mb floppy). That's it.

Without any programs to run, though, the kernel cannot do anything. You should find binaries for 'update' and 'bash' at the same place you found this, which will have to be put into the '/bin' directory on the specified root-device (specified in config.h). Bash must be found under the name '/bin/sh', as that's what the kernel currently executes. Happy hacking.

		Linus Torvalds		[email blocked]

		Petersgatan 2 A 2
		00140 Helsingfors 14
		FINLAND


First posting about Linux:

From: Linus Benedict Torvalds
Newsgroups: comp.os.minix
Subject: Gcc-1.40 and a posix-question
Date: 3 Jul 91 10:00:50 GMT

Hello netlanders,

Due to a project I'm working on (in minix), I'm interested in the posix standard definition. Could somebody please point me to a (preferably) machine-readable format of the latest posix rules? Ftp-sites would be nice.

As an aside for all using gcc on minix - the new version (1.40) has been out for some weeks, and I decided to test what needed to be done to get it working on minix (1.37.1, which is the version you can get from plains is nice, but 1.40 is better :-). To my surpice, the answer turned out to be - NOTHING! Gcc-1.40 compiles as-is on minix386 (with old gcc-1.37.1), with no need to change source files (I changed the Makefile and some paths, but that's it!). As default this results in a compiler that uses floating point insns, but if you'd rather not, changing 'toplev.c' to define DEFAULT_TARGET from 1 to 0 (this is from memory - I'm not at my minix-box) will handle that too. Don't make the libs, use the old gnulib&libc.a. I have successfully compiled 1.40 with itself, and everything works fine (I got the newest versions of gas and binutils at the same time, as I've heard of bugs with older versions of ld.c). Makefile needs some chmem's (and gcc2minix if you're still using it).

		Linus Torvalds		[email blocked]

PS. Could someone please try to finger me from overseas, as I've installed a "changing .plan" (made by your's truly), and I'm not certain it works from outside? It should report a new .plan every time.



First Linux announcement:

From: Linus Benedict Torvalds [email blocked]
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 Aug 91 20:57:08 GMT

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I'll get something practical within a few months, and I'd like to know what features most people would want. Any suggestions are welcome, but I won't promise I'll implement them :-)

		Linus (torva... at kruuna.helsinki.fi)

PS. Yes - it's free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.


From: Jyrki Kuoppala [email blocked]
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Date: 25 Aug 91 23:44:50 GMT

In article Linus Benedict Torvalds writes:

>I've currently ported bash(1.08) and gcc(1.40), and things seem to work.
>This implies that I'll get something practical within a few months, and
>I'd like to know what features most people would want. Any suggestions
>are welcome, but I won't promise I'll implement them :-)

Tell us more! Does it need a MMU?

>PS. Yes - it's free of any minix code, and it has a multi-threaded fs.
>It is NOT protable (uses 386 task switching etc)

How much of it is in C? What difficulties will there be in porting? Nobody will believe you about non-portability ;-), and I for one would like to port it to my Amiga (Mach needs a MMU and Minix is not free).

As for the features; well, pseudo ttys, BSD sockets, user-mode filesystems (so I can say cat /dev/tcp/kruuna.helsinki.fi/finger), window size in the tty structure, system calls capable of supporting POSIX.1. Oh, and bsd-style long file names.

//Jyrki


From: Linus Benedict Torvalds [email blocked]
Newsgroups: comp.os.minix
Subject: Re: What would you like to see most in minix?
Date: 26 Aug 91 11:06:02 GMT

In article Jyrki Kuoppala writes:

>> [re: my post about my new OS]
>Tell us more! Does it need a MMU?

Yes, it needs a MMU (sorry everybody), and it specifically needs a 386/486 MMU (see later).

>>PS. Yes - it's free of any minix code, and it has a multi-threaded fs.
>>It is NOT protable (uses 386 task switching etc)
>How much of it is in C? What difficulties will there be in porting?
>Nobody will believe you about non-portability ;-), and I for one would
>like to port it to my Amiga (Mach needs a MMU and Minix is not free).

Simply, I'd say that porting is impossible. It's mostly in C, but most people wouldn't call what I write C. It uses every conceivable feature of the 386 I could find, as it was also a project to teach me about the 386. As already mentioned, it uses a MMU, for both paging (not to disk yet) and segmentation. It's the segmentation that makes it REALLY 386 dependent (every task has a 64Mb segment for code & data - max 64 tasks in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).

It also uses every feature of gcc I could find, specifically the __asm__ directive, so that I wouldn't need so much assembly language objects. Some of my "C"-files (specifically mm.c) are almost as much assembler as C. It would be "interesting" even to port it to another compiler (though why anybody would want to use anything other than gcc is a mystery).

Unlike minix, I also happen to LIKE interrupts, so interrupts are handled without trying to hide the reason behind them (I especially like my hard-disk-driver. Anybody else make interrupts drive a state-machine?). All in all it's a porters nightmare.

>As for the features; well, pseudo ttys, BSD sockets, user-mode
>filesystems (so I can say cat /dev/tcp/kruuna.helsinki.fi/finger),
>window size in the tty structure, system calls capable of supporting
>POSIX.1. Oh, and bsd-style long file names.

Most of these seem possible (the tty structure already has stubs for window size), except maybe for the user-mode filesystems. As to POSIX, I'd be delighted to have it, but posix wants money for their papers, so that's not currently an option. In any case these are things that won't be supported for some time yet (first I'll make it a simple minix-lookalike, keyword SIMPLE).

		Linus [email blocked]

PS. To make things really clear - yes I can run gcc on it, and bash, and most of the gnu [bin/file]utilities, but it's not very debugged, and the library is really minimal. It doesn't even support floppy-disks yet. It won't be ready for distribution for a couple of months. Even then it probably won't be able to do much more than minix, and much less in some respects. It will be free though (probably under gnu-license or similar).


From: Alan Barclay [email blocked]
Newsgroups: comp.os.minix
Subject: Re: What would you like to see most in minix?
Date: 27 Aug 91 14:34:32 GMT

In article Linus Benedict Torvalds writes:

>yet) and segmentation. It's the segmentation that makes it REALLY 386
>dependent (every task has a 64Mb segment for code & data - max 64 tasks
>in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).

Is that max 64 64Mb tasks or max 64 tasks no matter what their size?

--
Alan Barclay
iT                   | E-mail : [email blocked]
Barker Lane          | BANG-STYLE : [email blocked]
CHESTERFIELD S40 1DY | VOICE : +44 246 214241


From: Linus Benedict Torvalds [email blocked]
Newsgroups: comp.os.minix
Subject: Re: What would you like to see most in minix?
Date: 28 Aug 91 10:56:19 GMT

In article Alan Barclay writes:

>In article Linus Benedict Torvalds writes:
>>yet) and segmentation. It's the segmentation that makes it REALLY 386
>>dependent (every task has a 64Mb segment for code & data - max 64 tasks
>>in 4Gb. Anybody who needs more than 64Mb/task - tough cookies).
>Is that max 64 64Mb tasks or max 64 tasks no matter what their size?

I'm afraid that is 64 tasks max (and one is used as swapper), no matter how small they should be. Fragmentation is evil - this is how it was handled. As the current opinion seems to be that 64 Mb is more than enough, but 64 tasks might be a little crowded, I'll probably change the limits be easily changed (to 32Mb/128 tasks for example) with just a recompilation of the kernel. I don't want to be on the machine when someone is spawning >64 processes, though :-)

		Linus



Early Linux installation guide:

Installing Linux on your system

Ok, this is a short guide for those people who actually want to get a running system, not just look at the pretty source code :-). You'll certainly need minix for most of the steps.

0. Back up any important software. This kernel has been working beautifully on my machine for some time, and has never destroyed anything on my hard-disk, but you never can be too careful when it comes to using the disk directly. I'd hate to get flames like "you destroyed my entire collection of Sam Fox nude gifs (all 103 of them), I'll hate you forever", just because I may have done something wrong.

Double-check your hardware. If you are using other than EGA/VGA, you'll have to make the appropriate changes to 'linux/kernel/console.c', which may not be easy. If you are able to use the at_wini.c under minix, linux will probably also like your drive. If you feel comfortable with scan-codes, you might want to hack 'linux/kernel/keyboard.s' making it more practical for your [US|German|...] keyboard.

1. Decide on what root device you'll be using. You can use any (standard) partition on any of your harddisks, the numbering is the same as for minix (ie 0x306, which I'm using, means partition 1 on hd2). It is certainly possible to use the same device as for minix, but I wouldn't recommend it. You'd have to change pathnames (or make a chroot in init) to get minix and linux to live together peacefully.

I'd recommend making a new filesystem, and filling it with the necessary files: You need at least the following:

- /dev/tty0 (same as under minix, ie mknod ...)
- /dev/tty (same as under minix)
- /bin/sh (link to bash)
- /bin/update (I guess this should be /etc/update ...)

Note that linux and minix binaries aren't compatible, although they use the same (gcc-)header (for ease of cross-compiling), so running one under the other will result in errors.

2. Compile the source, making necessary changes into the makefiles and linux/include/linux/config.h and linux/boot/boot.s. I'm using a slightly hacked gcc-1.40, to which I have added a -mstring-insns flag, which uses the i386 string instructions for structure copy etc. Removing the flag from all makefiles should do the trick for you. NOTE! I'm using -Wall, and I'm not seeing many warnings (2 I think, one about _exit returning although it's volatile - it's ok.) If you get more warnings when compiling, something's wrong.

3. Copy the resultant code to a diskette of the right type. Use 'cp Image /dev/PS0' or equivalent.

4. Boot with the new diskette. If you've done everything right (and if *I've* done everything right), you should now be running bash as root. You can't do much (alias ls='echo *' is a good idea :-), but if you do run, most other things should work. I'd be happy to hear from anybody that has come this far - and I'll send any ported binaries you might want (and I have). I'll also put them out for ftp if there is enough interest. With gcc, make and uemacs, I've been able to stop crosscompiling and actually compile natively under linux. (I also have a term-emu, sz/rz, sed, etc ...)

The boot-sequence should start with "Loading system...", and then a "Partition table ok" followed by some root-dev info. If you forget to make the /dev/tty0-character device, you'll never see anything but the "loading" message. Hopefully errors will be told to the console, but if there are problems at boot-up there is a distinct possibility that the machine just hangs.

5. Check the new filesystem regularly with (minix) fsck. I haven't got any errors for some time now, but I cannot guarantee that this means it will never happen. Due to slight differences in 'unlink', fsck will report "mode inode XXX not cleared", but that isn't an error, and you can safely ignore it (if you don't like it, do a fsck -a every once in a while). Minix "restore" will not work on a file deleted with linux - so be extra careful if you have a tendency to delete files you don't really want to. Logging out from the "login-shell" will automatically do a sync, and will leave you hanging without any processes (except update, which isn't much fun), so do the "three-finger-salute" to restart dos/minix/linux or whatever.

6. Mail me and ask about problems/updates etc. Even more welcome are success-reports (yeah, sure), and bugreports or even patches (or pointers to corrections).

NOTE!!! I haven't included diffs with the binaries I've posted for the simple reason that there aren't any - I've had this silly idea that I'd rather change the OS than do a lot of porting. All source to the binaries can be found on nic.funet.fi under /pub/gnu or /pub/unix. Changes have been to makefiles or configuration files, and anybody interested in them might want to contact me. Mostly it's been a matter of adding a -DUSG to makefiles.

The one exception if gcc - I've made some hacks on it (string-insns), and have got it (with the gracious help of Bruce Evans) to correctly emit software floating point. I haven't got diffs to that one either, as my hard-disk is overflowing and I cannot accomodate both originals and changes, but as per the GNU copyleft I'll make them available if someone wants them. I hope nobody want's them :-)

		Linus [email blocked]



README about early pictures of Linus Torvalds:

I finally got these made, and even managed to persuade Linus into allowing me to publish three pictures instead of only the first one. (He still vetoes the one with the toy moose... :-)

linus1.gif, linus2.gif, linus3.gif

Three pictures of Linus Torvalds, showing what a despicable figure he is in real life. The beer is from the pre-Linux era, so it's not virtual.

In nic.funet.fi: pub/OS/Linux/doc/PEOPLE.

-- Lars.Wirzenius [email blocked]
(finger wirzeniu at klaava.helsinki.fi)

MS-DOS, you can't live with it, you can live without it.
Attachment Size
linux-0.01.tar.bz2 61.88 KB
linux-0.01.tar.bz2.sign 248 bytes
linux-0.01.tar.gz 71.38 KB
linux-0.01.tar.gz.sign 248 bytes
linus1.gif 104.93 KB
linus2.gif 72.24 KB
linus3.gif 123.49 KB

http://kerneltrap.org/node/14002

10 Linux Predictions for 2009

Dec 29th, 2008, 2:20 pm
Everyone wants to know what's going to happen in the new year, as if anyone can accurately predict these things. However, one can deduce, with reasonable accuracy, that there will be innovations designed to get our attention. This is my list of Linux-oriented predictions for 2009.
The keyword for 2009 is Innovation.

1. Buyouts/Mergers - 2009 will see its share of company buyouts and mergers--all innovation-related. Larger companies will buy up smaller ones with innovative products and services. Many new open source millionaires will be created through these transitions.

2. Gadgets, Gadgets, and more Gadgets - This will be the Year of the Gadget, and the gadgets will be Linux-powered. You'll see dozens of new gadgets, from phones to home appliances to weather stations, come out in 2009, all designed to attract your attention and your money. Watch for rapidly falling prices on these little gems along the way too.

3. Virtualization - Linux-powered virtualization in the form of virtual appliances, virtual services, and hosted solutions is going to overwhelm even the most enthusiastic virtualization aficionados among you. I will have plenty of fodder for my Virtualization column at linux-mag.com as well as posts here on DaniWeb. I expect to see weekly announcements for new products, new services, and new companies popping up to solve our problems.

4. Desktop Innovations - Ahh, the pet peeve of every IT jock in the business: Desktop Linux. Well, hold on to your shorts, naysayers, this is going to be one helluva ride through the dark recesses of the Desktop nether regions. Expect big things in the Linux Desktop in 2009 as a true Microsoft killer emerges from an unlikely source. Windows 7 will be laughable by comparison.

5. Portable Servers - Say what? Oh yeah, get ready for this: Portable Servers. I've written articles on this and now it will happen from commercial sources. Truly portable services on portable servers. You'll be able to provide services to any group, anywhere, any time with these. Want to have a LAN party at a Community Center? Take your WiFi-enabled portable server and get to it.

6. Embedded Systems - This is one area that will enjoy quantum leaps of innovation in the coming year. Embedded Linux systems will power microwave ovens, regular ovens, sprinkler systems, robotic maids, and much more. Get ready for the embedded revolution.

7. Game Console - A major game console manufacturer will switch to Linux for their operating system to power the most innovative game console yet. You think the Wii is cool? Just wait till you see what's brewing elsewhere.

8. Home Automation - New homes will not only be built with green technology, green materials, and more efficient fixtures; those fixtures and technologies will be powered by Linux. Your new home will resemble something from The Jetsons rather than something from contemporary life. For existing homes, there will be packages available to upgrade your home to a "smart" home but it will still pale in comparison to a freshly built home with the innovations built-in.

9. Automobiles - Auto companies need to trim their budgets and, executive salaries aside, they need to use some innovative technologies to remedy some of their money angst. Linux-powered car brains with pluggable and programmable modules will arise as one solution. I wouldn't be a bit surprised to see new car companies emerge using green technologies and Linux as part of the mix.

10. Cloud Computing - 2009 is going to be a big year for The Cloud and Cloud Computing. Linux-powered Cloud vendors will win out over those who employ other operating systems. Why? Not just costs, but also because the major virtualization vendors use Linux for their virtual platforms (VMware and Xen). Cloud vendors are going to use technology that's cheap, easy to maintain, commercially supported, and mature--in other words, Linux.

There you have them: my Linux predictions for 2009. The proof of my prophecy won't be available until this time next year. Stay tuned and keep referring to this post over the coming year to check my status. I will refer back to it as innovations emerge.

Do you have any Linux predictions for 2009?

How Small Can Computers Get? Computing in a Molecule


Posted by ScuttleMonkey on Tuesday December 30, @05:36AM
from the nano-pc dept.
ScienceDaily looks at what the future might bring for atomic-scale computing. "Joachim, the head of the CEMES Nanoscience and Picotechnology Group (GNS), is currently coordinating a team of researchers from 15 academic and industrial research institutes in Europe whose groundbreaking work on developing a molecular replacement for transistors has brought the vision of atomic-scale computing a step closer to reality. Their efforts, a continuation of work that began in the 1990s, are today being funded by the European Union in the Pico-Inside project. [...] The team has managed to design a simple logic gate with 30 atoms that performs the same task as 14 transistors, while also exploring the architecture, technology and chemistry needed to achieve computing inside a single molecule and to interconnect molecules."

http://www.sciencedaily.com/releases/2008/12/081222113532.htm

How Small Can Computers Get? Computing In A Molecule

ScienceDaily (Dec. 30, 2008) — Over the last 60 years, ever-smaller generations of transistors have driven exponential growth in computing power. Could molecules, each turned into a minuscule computer component, trigger even greater growth in computing over the next 60?

Atomic-scale computing, in which computer processes are carried out in a single molecule or using a surface atomic-scale circuit, holds vast promise for the microelectronics industry. It allows computers to continue to increase in processing power through the development of components at the nano- and picoscale. In theory, atomic-scale computing could put computers more powerful than today’s supercomputers in everyone’s pocket.

“Atomic-scale computing researchers today are in much the same position as transistor inventors were before 1947. No one knows where this will lead,” says Christian Joachim of the French National Scientific Research Centre’s (CNRS) Centre for Material Elaboration & Structural Studies (CEMES) in Toulouse, France.

Joachim, the head of the CEMES Nanoscience and Picotechnology Group (GNS), is currently coordinating a team of researchers from 15 academic and industrial research institutes in Europe whose groundbreaking work on developing a molecular replacement for transistors has brought the vision of atomic-scale computing a step closer to reality. Their efforts, a continuation of work that began in the 1990s, are today being funded by the European Union in the Pico-Inside project.

In a conventional microprocessor – the “motor” of a modern computer – transistors are the essential building blocks of digital circuits, creating logic gates that process true or false signals. A few transistors are needed to create a single logic gate and modern microprocessors contain billions of them, each measuring around 100 nanometres.

Transistors have continued to shrink in size since Intel co-founder Gordon E. Moore famously predicted in 1965 that the number that can be placed on a processor would double roughly every two years. But there will inevitably come a time when the laws of quantum physics prevent any further shrinkage using conventional methods. That is where atomic-scale computing comes into play with a fundamentally different approach to the problem.

“Nanotechnology is about taking something and shrinking it to its smallest possible scale. It’s a top-down approach,” Joachim says. He and the Pico-Inside team are turning that upside down, starting from the atom, the molecule, and exploring if such a tiny bit of matter can be a logic gate, memory source, or more. “It is a bottom-up or, as we call it, 'bottom-bottom' approach because we do not want to reach the material scale,” he explains.

Joachim’s team has focused on taking one individual molecule and building up computer components, with the ultimate goal of hosting a logic gate in a single molecule.

How many atoms to build a computer?

“The question we have asked ourselves is how many atoms does it take to build a computer?” Joachim says. “That is something we cannot answer at present, but we are getting a better idea about it.”

The team has managed to design a simple logic gate with 30 atoms that performs the same task as 14 transistors, while also exploring the architecture, technology and chemistry needed to achieve computing inside a single molecule and to interconnect molecules.

They are focusing on two architectures: one that mimics the classical design of a logic gate but in atomic form, including nodes, loops, meshes etc., and another, more complex, process that relies on changes to the molecule’s conformation to carry out the logic gate inputs and quantum mechanics to perform the computation.

The logic gates are interconnected using scanning-tunnelling microscopes and atomic-force microscopes – devices that can measure and move individual atoms with resolutions down to 1/100 of a nanometre (that is one hundred millionth of a millimetre!). As a side project, partly for fun but partly to stimulate new lines of research, Joachim and his team have used the technique to build tiny nano-machines, such as wheels, gears, motors and nano-vehicles each consisting of a single molecule.

“Put logic gates on it and it could decide where to go,” Joachim notes, pointing to what would be one of the world’s first implementations of atomic-scale robotics.

The importance of the Pico-Inside team’s work has been widely recognised in the scientific community, though Joachim cautions that it is still very much fundamental research. It will be some time before commercial applications emerge from it. However, emerge they all but certainly will.

“Microelectronics needs us if logic gates – and as a consequence microprocessors – are to continue to get smaller,” Joachim says.

The Pico-Inside researchers, who received funding under the ICT strand of the EU’s Sixth Framework Programme, are currently drafting a roadmap to ensure computing power continues to increase in the future.

OpenSPARC(TM) Internals abstract





Much of the material in this chapter was leveraged from L. Spracklen and S. G. Abraham, “Chip Multithreading: Opportunities and Challenges,” in 11th International Symposium on High-Performance Computer Architecture, 2005.

Over the last few decades microprocessor performance has increased exponentially, with processor architects successfully achieving significant gains in single-thread performance from one processor generation to the next. Semiconductor technology has been the main driver for this increase, with faster transistors allowing rapid increases in clock speed to today’s multi-GHz frequencies. In addition to these frequency increases, each new technology generation has essentially doubled the number of available transistors. As a result, architects have been able to aggressively chase increased single-threaded performance by using a range of expensive microarchitectural techniques, such as superscalar issue, out-of-order issue, on-chip caching, and deep pipelines supported by sophisticated branch predictors.

However, process technology challenges, including power constraints, the memory wall, and ever-increasing difficulties in extracting further instruction-level parallelism (ILP), are all conspiring to limit the performance of individual processors in the future. While recent attempts at improving single-thread performance through even deeper pipelines have led to impressive clock frequencies, these clock frequencies have not translated into significantly better performance in comparison with less aggressive designs. As a result, microprocessor frequency, which used to increase exponentially, has now leveled off, with most processors operating in the 2–4 GHz range.

This combination of limited realizable ILP, practical limits to pipelining, and a “power ceiling” imposed by cost-effective cooling considerations has conspired to limit future performance increases within conventional processor cores. Accordingly, processor designers are searching for new ways to effectively utilize their ever-increasing transistor budgets.



The techniques being embraced across the microprocessor industry are chip multiprocessors (CMPs) and chip multithreaded (CMT) processors. A CMP, as the name implies, is simply a group of processors integrated onto the same chip. The individual processors typically have performance comparable to that of their single-core brethren, but for workloads with sufficient thread-level parallelism (TLP), the aggregate performance delivered by the processor can be many times that delivered by a single-core processor. Most current processors adopt this approach and simply replicate existing single-processor cores on a single die.

Moving beyond these simple CMP processors, chip multithreaded (CMT) processors go one step further and support many simultaneous hardware strands (or threads) of execution per core via simultaneous multithreading (SMT) techniques. SMT effectively combats increasing latencies by enabling multiple strands to share many of the resources within the core, including the execution resources. With each strand spending a significant portion of time stalled waiting for off-chip misses to complete, each strand’s utilization of the core’s execution resources is extremely low. SMT improves the utilization of key resources and reduces the sensitivity of an application to off-chip misses. Similarly, as with CMP, multiple cores can share chip resources such as the memory controller, off-chip bandwidth, and the level-2/level-3 cache, improving the utilization of these resources.



The benefits of CMT processors are apparent in a wide variety of application spaces. For instance, in the commercial space, server workloads are broadly characterized by high levels of TLP, low ILP, and large working sets. The potential for further improvements in overall single-thread performance is limited; on-chip cycles per instruction (CPI) cannot be improved significantly because of low ILP, and off-chip CPI is large and growing because of relative increases in memory latency. However, typical server applications concurrently serve a large number of users or clients; for instance, a database server may have hundreds of active processes, each associated with a different client. Furthermore, these processes are currently multithreaded to hide disk access latencies. This structure leads to high levels of TLP. Thus, it is extremely attractive to couple the high TLP in the application domain with support for multiple threads of execution on a processor chip.

Though the arguments for CMT processors are often made in the context of overlapping memory latencies, memory bandwidth considerations also play a significant role. New memory technologies, such as fully buffered DIMMs (FBDs), have higher bandwidths (for example, 60 GB/s/chip), as well as higher latencies (for example, 130 ns), pushing up their bandwidth-delay product to 60 GB/s × 130 ns = 7800 bytes. The processor chip’s pins represent an expensive resource, and to keep these pins fully utilized (assuming a cache line size of 64 bytes), the processor chip must sustain 7800/64 or over 100 parallel requests. To put this in perspective, a single strand on an aggressive out-of-order processor core generates less than two parallel requests on typical server workloads; therefore, a large number of strands are required to sustain a high utilization of the memory ports.
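As a quick sanity check of that arithmetic (my own re-derivation, not from the book), the figures quoted above reproduce exactly in a shell:

    # Re-derive the bandwidth-delay figures quoted above
    # (inputs from the text: 60 GB/s, 130 ns, 64-byte cache lines)
    echo "(60 * 10^9) * 130 / 10^9" | bc   # 7800 bytes in flight
    echo "7800 / 64" | bc                  # 121 parallel cache-line requests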



Finally, power considerations also favor CMT processors. Given the almost cubic dependence between core frequency and power consumption, power drops dramatically with reductions in frequency. As a result, for workloads with adequate TLP, doubling the number of cores and halving the frequency delivers roughly equivalent performance while reducing power consumption by a factor of four (two cores each consuming (1/2)³ = 1/8 of the original power gives 2 × 1/8 = 1/4 in total).



Evolution of CMTs 



Given the exponential growth in transistors per chip over time, a rule of thumb is that a board design becomes a chip design in ten years or less. Thus, most industry observers expected that chip-level multiprocessing would eventually become a dominant design trend. The case for a single-chip multiprocessor was presented as early as 1996 by Kunle Olukotun’s team at Stanford University. Their Stanford Hydra CMP processor design called for the integration of four MIPS-based processors on a single chip. A DEC/Compaq research team proposed the incorporation of eight simple Alpha cores and a two-level cache hierarchy on a single chip (code-named Piranha) and estimated a simulated performance of three times that of a single-core, next-generation Alpha processor for on-line transaction processing workloads.

As early as the mid-1990s, Sun recognized the problems that would soon face processor designers as a result of the rapidly increasing clock frequencies required to improve single-thread performance. In response, Sun defined the MAJC architecture to target thread-level parallelism. Providing well-defined support for both CMP and SMT processors, the MAJC architecture was the industry’s first step toward general-purpose CMT processors. Shortly after publishing the MAJC architecture, Sun announced its first MAJC-compliant processor (MAJC-5200), a dual-core CMT processor with cores sharing an L1 data cache.


----------------------- Page 23-----------------------

4                             Chapter 1    Introducing Chip Multithreaded (CMT) Processors 



Subsequently, Sun moved its SPARC processor family toward the CMP design 

point. In 2003, Sun announced two CMP  SPARC processors: Gemini, a dual- 

core   UltraSPARC   II   derivative;   and  UltraSPARC   IV.   These   first-generation 

CMP processors were derived from earlier uniprocessor designs, and the two 

cores did not share any resources other than off-chip datapaths. In most CMP 

designs,     it  is  preferable   to  share   the  outermost     caches,   because     doing   so 

localizes    coherency      traffic  between    the   strands   and   optimizes     inter-strand 

communication         in  the  chip—allowing        very   fine-grained     thread   interaction 

(microparallelism).   In   2003,   Sun   also  announced   its   second-generation   CMP 

processor, UltraSPARC IV+, a follow-on to the UltraSPARC IV processor, in 

which   the   on-chip   L2   and   off-chip   L3   caches   are   shared   between   the   two 

cores. 



In 2006, Sun introduced a 32-way CMT SPARC processor, called UltraSPARC T1, for which the entire design, including the cores, is optimized for a CMT design point. UltraSPARC T1 has eight cores; each core is a four-way SMT with its own private L1 caches. All eight cores share a 3-Mbyte, 12-way level-2 cache. Since UltraSPARC T1 is targeted at commercial server workloads with high TLP, low ILP, and large working sets, the ability to support many strands and therefore many concurrent off-chip misses is key to overall performance. Thus, to accommodate eight cores, each core supports single issue and has a fairly short pipeline.



Sun's most recent CMT processor is the UltraSPARC T2 processor. The UltraSPARC T2 processor provides double the threads of the UltraSPARC T1 processor (eight threads per core), as well as improved single-thread performance, additional level-2 cache resources (increased size and associativity), and improved support for floating-point operations.



Sun's move toward the CMT design has been mirrored throughout the industry. In 2001, IBM introduced the dual-core POWER-4 processor and recently released second-generation CMT processors, the POWER-5 and POWER-6 processors, in which each core supports 2-way SMT. While this fundamental shift in processor design was initially confined to high-end server processors, where the target workloads are the most thread-rich, the change has recently begun to spread to desktop processors. AMD and Intel have also subsequently released multicore CMP processors, starting with dual-core CMPs and more recently quad-core CMP processors. Further, Intel has announced that its next-generation quad-core processors will support 2-way SMT, providing a total of eight threads per chip.



CMT is emerging as the dominant trend in general-purpose processor design, with manufacturers discussing their multicore plans beyond their initial quad-core offerings. Similar to the CISC-to-RISC shift that enabled an entire processor to fit on a single chip and internalized all communication between pipeline stages to within a chip, the move to CMT represents a fundamental shift in processor design that internalizes much of the communication between processors to within a chip.



Future CMT Designs 



An attractive proposition for future CMT design is simply to double the number of cores per chip every generation, since a new process technology essentially doubles the transistor budget. Little design effort is expended on the cores, and performance is almost doubled every process generation on workloads with sufficient TLP. Though reusing existing core designs is an attractive option, this approach may not scale well beyond a couple of process generations. Processor designs are already pushing the limits of power dissipation. For the total power consumption to be restrained, the power dissipation of each core must be halved in each generation. In the past, supply voltage scaling delivered most of the required power reduction, but indications are that voltage scaling will not be sufficient by itself. Though well-known techniques, such as clock gating and frequency scaling, may be quite effective in the short term, more research is needed to develop low-power, high-performance cores for future CMT designs.

Further, given the significant area cost associated with high-performance cores, for a fixed area and power budget, the CMP design choice is between a small number of high-performance (high-frequency, aggressively out-of-order, large-issue-width) cores and a larger number of simple (low-frequency, in-order, limited-issue-width) cores. For workloads with sufficient TLP, the simpler-core solution may deliver superior chipwide performance at a fraction of the power. However, for applications with limited TLP, unless speculative parallelism can be exploited, CMT performance will be poor. One possible solution is to support heterogeneous cores, potentially providing multiple simple cores for thread-rich workloads and a single more complex core to provide robust performance for single-threaded applications.



Another interesting opportunity for CMT processors is support for on-chip hardware accelerators. Hardware accelerators improve performance on certain specialized tasks and off-load work from the general-purpose processor. Additionally, on-chip hardware accelerators may be an order of magnitude more power efficient than the general-purpose processor and may be significantly more efficient than off-chip accelerators (for example, by eliminating the off-chip traffic required to communicate with an off-chip accelerator). Although high cost and low utilization typically make on-chip hardware accelerators unattractive for traditional processors, the cost of an accelerator can be amortized over many strands, thanks to the high degree of resource sharing associated with CMTs. While a wide variety of hardware accelerators can be envisaged, emerging trends make an extremely compelling case for supporting on-chip network off-load engines and cryptographic accelerators. Future processors will afford opportunities for accelerating other functionality as well. For instance, with the increasing use of XML-formatted data, it may become attractive to have hardware support for XML parsing and processing.



Finally, for the same amount of off-chip bandwidth to be maintained per core, the total off-chip bandwidth for the processor chip must also double every process generation. Processor designers can meet the bandwidth need by adding more pins or by increasing the bandwidth per pin. However, the maximum number of pins per package is increasing at a rate of only 10 percent per generation. Further, packaging cost per pin is barely decreasing with each new generation and increases significantly with pin count. As a result, efforts have recently focused on increasing the per-pin bandwidth through innovations in the processor-chip-to-DRAM interconnect, such as double data rate and fully buffered DIMMs. Additional benefits can be obtained by doing more with the available bandwidth: for instance, by compressing off-chip traffic, or by exploiting the "silentness" of write-backs to avoid transferring lines whose contents have not actually changed. Compression of the on-chip caches themselves can also improve performance, but the significant additional latency introduced by decompression must be carefully balanced against the benefit of the reduced miss rate, favoring adaptive compression strategies.
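The write-back idea is simple enough to sketch. The following C fragment is illustrative only; the function name and the memcmp-based check are ours, not from any OpenSPARC design (real hardware would track silentness against the clean copy of the line rather than re-reading memory):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    /* Write back a dirty cache line only if its contents actually changed.
     * A "silent" line (identical to the memory copy) costs no bandwidth. */
    static bool writeback_if_needed(const uint8_t line[LINE_BYTES],
                                    uint8_t mem_copy[LINE_BYTES])
    {
        if (memcmp(line, mem_copy, LINE_BYTES) == 0)
            return false;                   /* silent: skip the transfer */
        memcpy(mem_copy, line, LINE_BYTES); /* pay the off-chip cost */
        return true;
    }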



As a result, going forward we are likely to see an ever-increasing proportion of CMT processors designed from the ground up to deliver ever-increasing performance while satisfying these power and bandwidth constraints.



CHAPTER 2 



OpenSPARC Designs 



Sun Microsystems began shipping the UltraSPARC T1 chip multithreaded (CMT) processor in December 2005. Sun surprised the industry by announcing that it would not only ship the processor but also open-source it, an industry first. By March 2006, UltraSPARC T1 had been open-sourced in a distribution called OpenSPARC T1, available on http://OpenSPARC.net.



In 2007, Sun began shipping its newer, more advanced UltraSPARC T2 processor, and open-sourced the bulk of that design as OpenSPARC T2.



The "source code" for both designs offered on OpenSPARC.net is comprehensive, including not just millions of lines of the hardware description language (Verilog, a form of "register transfer logic," or RTL) for these microprocessors, but also scripts to compile ("synthesize") that source code into hardware implementations, source code of processor and full-system simulators, prepackaged operating system images to boot on the simulators, source code to the Hypervisor software layer, a large suite of verification software, and thousands of pages of architecture and implementation specification documents.



This book is intended as a "getting started" companion to both OpenSPARC T1 and OpenSPARC T2. In this chapter, we begin that association by addressing this question: Now that Sun has open-sourced OpenSPARC T1 and T2, what can they be used for?



One thing is certain: the real-world uses to which OpenSPARC will be put will be infinitely more diverse and interesting than anything that could be suggested in this book! Nonetheless, this short chapter offers a few ideas, in the hope that they will stimulate even more creative thinking …






2.1  Academic Uses for OpenSPARC

The utility of OpenSPARC in academia is limited only by students' imaginations.

The most common academic use of OpenSPARC to date is as a complete example processor architecture and/or implementation. It can be used in coursework areas such as computer architecture, VLSI design, compiler code generation/optimization, and general computer engineering.



In university lab courses, OpenSPARC provides a design that can be used as a known-good starting point for assigned projects.



OpenSPARC can be used as a basis for compiler research, such as code generation/optimization for highly threaded target processors or experimentation with instruction set changes and additions.



OpenSPARC is already in use in multiple FPGA-based projects at universities. For more information, visit:
     http://www.opensparc.net/fpga/index.html



For more information on programs supporting academic use of OpenSPARC, including availability of the Xilinx OpenSPARC FPGA Board, visit:
     http://www.OpenSPARC.net/edu/university-program.html



Specific questions about university programs can be posted on the OpenSPARC general forum at:
     http://forums.sun.com/forum.jspa?forumID=837
or emailed to OpenSPARC-UniversityProgram@sun.com.



Many of the commercial applications of OpenSPARC, mentioned in the following section, suggest corresponding academic uses.



2.2  Commercial Uses for OpenSPARC

OpenSPARC provides a springboard for the design of commercial processors. By starting from a complete, known-good design (including a full verification suite), the time-to-market for a new custom processor can be drastically slashed.





Derivative processors ranging from a simple single-core, single-thread design all the way up through an 8-core, 64-thread design can rapidly be synthesized from OpenSPARC T1 or T2.



2.2.1  FPGA Implementation



An OpenSPARC design can be synthesized and loaded into a field-programmable gate array (FPGA) device. This can be used in several ways:

*  An FPGA version of the processor can be used for product prototyping, allowing rapid design iteration.

*  An FPGA can be used to provide a high-speed simulation engine for a processor under development.

*  For extreme time-to-market needs where production cost per processor isn't critical, a processor could even be shipped in FPGA form. This could also be useful if the processor itself needs to be field-upgradable via a software download.



2.2.2  Design Minimization

Portions of a standard OpenSPARC design that are not needed for the target application can be stripped out to make the resulting processor smaller, cheaper, faster, and/or higher-yielding. For example, for a network routing application, hardware floating-point operations may be superfluous; in that case, the FPU(s) can be removed, saving die area and reducing verification effort.



2.2.3  Coprocessors

Specialized coprocessors can be incorporated into a processor based on OpenSPARC. OpenSPARC T2, for example, comes with a coprocessor containing two 10 Gbit/second Ethernet transceivers (the network interface unit, or "NIU"). Coprocessors can be added for any conceivable purpose, including (but hardly limited to) the following:

*  Network routing
*  Floating-point acceleration
*  Cryptographic processing
*  I/O compression/decompression engines
*  Audio compression/decompression (codecs)
*  Video codecs
*  I/O interface units for embedded devices such as displays or input sensors





2.2.4  OpenSPARC as Test Input to CAD/EDA Tools

The OpenSPARC source code (Verilog RTL) provides a large, real-world input dataset for CAD/EDA tools. It can be used to test the robustness of CAD tools and simulators. Many major commercial CAD/EDA tool vendors are already using OpenSPARC this way!



CHAPTER 3 



Architecture Overview 



OpenSPARC processors are based on a processor architecture named the UltraSPARC Architecture. The OpenSPARC T1 design is based on the UltraSPARC Architecture 2005, and OpenSPARC T2 is based on the UltraSPARC Architecture 2007. This chapter is intended as an overview of the architecture; more details can be found in the UltraSPARC Architecture 2005 Specification and the UltraSPARC Architecture 2007 Specification.



The UltraSPARC Architecture is descended from the SPARC V9 architecture and complies fully with the "Level 1" (nonprivileged) SPARC V9 specification.



The UltraSPARC Architecture supports 32-bit and 64-bit integer and 32-bit, 64-bit, and 128-bit floating-point as its principal data types. The 32-bit and 64-bit floating-point types conform to IEEE Std 754-1985. The 128-bit floating-point type conforms to IEEE Std 1596.5-1992. The architecture defines general-purpose integer, floating-point, and special state/status register instructions, all encoded in 32-bit-wide instruction formats. The load/store instructions address a linear, 2^64-byte virtual address space.



As used here, the word architecture refers to the processor features that are visible to an assembly language programmer or to a compiler code generator. It does not include details of the implementation that are not visible or easily observable by software, nor those that only affect timing (performance).



The chapter contains these sections:

*  The UltraSPARC Architecture
*  Processor Architecture
*  Instructions
*  Traps
*  Chip-Level Multithreading (CMT)




3.1  The UltraSPARC Architecture

This section briefly describes features, attributes, and components of the UltraSPARC Architecture and, further, describes correct implementation of the architecture specification and SPARC V9 compliance levels.



3.1.1  Features

The UltraSPARC Architecture, like its ancestor SPARC V9, includes the following principal features:



*  A linear 64-bit address space with 64-bit addressing.

*  32-bit wide instructions — These are aligned on 32-bit boundaries in memory. Only load and store instructions access memory and perform I/O.

*  Few addressing modes — A memory address is given as either "register + register" or "register + immediate".

*  Triadic register addresses — Most computational instructions operate on two register operands, or one register and a constant, and place the result in a third register.

*  A large windowed register file — At any one instant, a program sees 8 global integer registers plus a 24-register window of a larger register file. The windowed registers can be used as a cache of procedure arguments, local values, and return addresses.

*  Floating point — The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), and 16 quad-precision (128-bit) overlayed registers.

*  Fast trap handlers — Traps are vectored through a table.

*  Multiprocessor synchronization instructions — Multiple variations of atomic load-store memory operations are supported.

*  Predicted branches — The branch-with-prediction instructions allow the compiler or assembly language programmer to give the hardware a hint about whether a branch will be taken.

*  Branch elimination instructions — Several instructions can be used to eliminate branches altogether (for example, Move on Condition). Eliminating branches increases performance in superscalar and superpipelined implementations.





*  Hardware trap stack — A hardware trap stack is provided to allow nested traps. It contains all of the machine state necessary to return to the previous trap level. The trap stack makes the handling of faults and error conditions simpler, faster, and safer.



In addition, the UltraSPARC Architecture includes the following features that were not present in the SPARC V9 specification:

*  Hyperprivileged mode — This mode simplifies porting of operating systems, supports far greater portability of operating system (privileged) software, supports the ability to run multiple simultaneous guest operating systems, and provides more robust handling of error conditions. Hyperprivileged mode is described in detail in the Hyperprivileged version of the UltraSPARC Architecture 2005 Specification or the UltraSPARC Architecture 2007 Specification.

*  Multiple levels of global registers — Instead of the two 8-register sets of global registers specified in the SPARC V9 architecture, the UltraSPARC Architecture provides multiple sets; typically, one set is used at each trap level.

*  Extended instruction set — The UltraSPARC Architecture provides many instruction set extensions, including the VIS instruction set for "vector" (SIMD) data operations.

*  More detailed, specific instruction descriptions — UltraSPARC Architecture specifications provide many more details than did SPARC V9 regarding what exceptions can be generated by each instruction and the specific conditions under which those exceptions can occur. Also, detailed lists of valid ASIs are provided for each load/store instruction from/to alternate space.

*  Detailed MMU architecture — Although some details of the UltraSPARC MMU architecture are necessarily implementation-specific, UltraSPARC Architecture specifications provide a blueprint for the UltraSPARC MMU, including the software view (TTEs and TSBs) and MMU hardware control registers.

*  Chip-level multithreading (CMT) — The UltraSPARC Architecture provides a control architecture for highly threaded processor implementations.



3.1.2  Attributes

The UltraSPARC Architecture is a processor instruction set architecture (ISA) derived from SPARC V8 and SPARC V9, which in turn come from a reduced instruction set computer (RISC) lineage. As an architecture, the UltraSPARC




Architecture allows for a spectrum of processor and system implementations at a variety of price/performance points for a range of applications, including scientific/engineering, programming, real-time, and commercial applications. OpenSPARC further broadens the design space by opening up key implementations to be studied, enhanced, or redesigned by anyone in the community.



3.1.2.1  Design Goals

The UltraSPARC Architecture is designed to be a target for optimizing compilers and high-performance hardware implementations. The UltraSPARC Architecture 2005 and UltraSPARC Architecture 2007 Specification documents provide design specifications against which an implementation can be verified, using appropriate verification software.



3.1.2.2  Register Windows

The UltraSPARC Architecture is derived from the SPARC architecture, which was formulated at Sun Microsystems from 1984 through 1987. The SPARC architecture is, in turn, based on the RISC I and II designs engineered at the University of California at Berkeley from 1980 through 1982. The SPARC "register window" architecture, pioneered in the UC Berkeley designs, allows for straightforward, high-performance compilers and a reduction in memory load/store instructions.

Note that privileged software, not user programs, manages the register windows. Privileged software can save a minimum number of registers (approximately 24) during a context switch, thereby optimizing context-switch latency.



3.1.3  System Components

The UltraSPARC Architecture allows for a spectrum of subarchitectures, such as cache system, I/O, and memory management unit (MMU).



3.1.3.1  Binary Compatibility

An important mandate for the UltraSPARC Architecture is compatibility across implementations of the architecture for application (nonprivileged) software, down to the binary level. Binaries executed in nonprivileged mode should behave identically on all UltraSPARC Architecture systems when those





systems are running an operating system known to provide a standard execution environment. One example of such a standard environment is the SPARC V9 Application Binary Interface (ABI).



Although different UltraSPARC Architecture systems can execute nonprivileged programs at different rates, they will generate the same results as long as they are run under the same memory model. See Chapter 9, Memory, in an UltraSPARC Architecture specification for more information.


Additionally, UltraSPARC Architecture 2005 and UltraSPARC Architecture 2007 are upward-compatible from SPARC V9 for applications running in nonprivileged mode that conform to the SPARC V9 ABI, and upward-compatible from SPARC V8 for applications running in nonprivileged mode that conform to the SPARC V8 ABI.



An OpenSPARC implementation may or may not maintain the same binary compatibility, depending on how the implementation has been modified and what software execution environment is run on it.



3.1.3.2  UltraSPARC Architecture MMU

UltraSPARC Architecture defines a common MMU architecture (see Chapter 14, Memory Management, in any UltraSPARC Architecture specification for details). Some specifics are left implementation-dependent.



3.1.3.3  Privileged Software

UltraSPARC Architecture does not assume that all implementations must execute identical privileged software (operating systems) or hyperprivileged software (hypervisors). Thus, certain traits that are visible to privileged software may be tailored to the requirements of the system.



3.2  Processor Architecture

An UltraSPARC Architecture processor (and therefore an OpenSPARC processor) logically consists of an integer unit (IU) and a floating-point unit (FPU), each with its own registers. This organization allows for implementations with concurrent integer and floating-point instruction execution. Integer registers are 64 bits wide; floating-point registers are 32, 64, or 128 bits wide. Instruction operands are single registers, register pairs, register quadruples, or immediate constants.





A virtual processor (synonym: strand) is the hardware containing the state for execution of a software thread. A physical core is the hardware required to execute instructions from one or more software threads, including resources shared among strands. A complete processor comprises one or more physical cores and is the physical module that plugs into a system.



An OpenSPARC virtual processor can run in nonprivileged mode, privileged mode, or hyperprivileged mode. In hyperprivileged mode, the processor can execute any instruction, including privileged instructions. In privileged mode, the processor can execute nonprivileged and privileged instructions. In nonprivileged mode, the processor can execute only nonprivileged instructions. In nonprivileged or privileged mode, an attempt to execute an instruction requiring greater privilege than the current mode causes a trap to hyperprivileged software.



3.2.1  Integer Unit (IU)

An OpenSPARC implementation's integer unit contains the general-purpose registers and controls the overall operation of the virtual processor. The IU executes the integer arithmetic instructions and computes memory addresses for loads and stores. It also maintains the program counters and controls instruction execution for the FPU.



An UltraSPARC Architecture implementation may contain from 72 to 640 general-purpose 64-bit R registers. This corresponds to a grouping of the registers into a number of sets of global R registers plus a circular stack of N_REG_WINDOWS sets of 16 registers each, known as register windows. The number of register windows present (N_REG_WINDOWS) is implementation dependent, within the range of 3 to 32 (inclusive). In an unmodified OpenSPARC T1 or T2 implementation, N_REG_WINDOWS = 8.
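Those bounds are easy to check. In the C sketch below, the 16-registers-per-window figure comes from the text above; the assumption that global registers come in sets of 8 and that implementations provide from 3 to 16 such sets is inferred from the 72-to-640 range rather than stated here:

    #include <stdio.h>

    /* Total R registers = 8 per set of globals + 16 per register window
     * (assumed decomposition; see lead-in). */
    static int total_r_registers(int global_sets, int n_reg_windows)
    {
        return 8 * global_sets + 16 * n_reg_windows;
    }

    int main(void)
    {
        printf("min: %d\n", total_r_registers(3, 3));   /* 72  */
        printf("max: %d\n", total_r_registers(16, 32)); /* 640 */
        return 0;
    }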



3.2.2  Floating-Point Unit (FPU)

An OpenSPARC FPU has thirty-two 32-bit (single-precision) floating-point registers, thirty-two 64-bit (double-precision) floating-point registers, and sixteen 128-bit (quad-precision) floating-point registers, some of which overlap (as described in detail in UltraSPARC Architecture specifications).

If no FPU is present, then it appears to software as if the FPU is permanently disabled.





If the FPU is not enabled, then an attempt to execute a floating-point instruction generates an fp_disabled trap, and the fp_disabled trap handler software must either

*  enable the FPU (if present) and reexecute the trapping instruction, or
*  emulate the trapping instruction in software.



3.3  Instructions

Instructions fall into the following basic categories:

*  Memory access
*  Integer arithmetic / logical / shift
*  Control transfer
*  State register access
*  Floating-point operate
*  Conditional move
*  Register window management
*  SIMD (single instruction, multiple data) instructions

These classes are discussed in the following subsections.



3.3.1  Memory Access

Load, store, load-store, and PREFETCH instructions are the only instructions that access memory. They use either two R registers, or an R register and a signed 13-bit immediate value, to calculate a 64-bit, byte-aligned memory address. The integer unit appends an ASI to this address.
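That address arithmetic can be written out directly. Below is a minimal C sketch of the two instruction forms (illustrative; it is not taken from the OpenSPARC RTL), including the sign extension of the 13-bit immediate:

    #include <stdint.h>

    /* "register + register" form. */
    static uint64_t ea_reg_reg(uint64_t rs1, uint64_t rs2)
    {
        return rs1 + rs2;
    }

    /* "register + immediate" form: simm13 occupies bits 12:0 of the
     * instruction word and is sign-extended to 64 bits before the add. */
    static uint64_t ea_reg_imm(uint64_t rs1, uint32_t instr)
    {
        int64_t simm13 = ((int32_t)(instr << 19)) >> 19; /* arithmetic shift */
        return rs1 + (uint64_t)simm13;
    }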



The destination field of the load/store instruction specifies either one or two R registers, or one, two, or four F registers, that supply the data for a store or that receive the data from a load.



Integer load and store instructions support byte, halfword (16-bit), word (32-bit), and extended-word (64-bit) accesses. There are versions of integer load instructions that perform either sign-extension or zero-extension on 8-bit, 16-bit, and 32-bit values as they are loaded into a 64-bit destination register. Floating-point load and store instructions support word, doubleword, and quadword[1] memory accesses.



[1] OpenSPARC T1 and T2 processors do not implement the LDQF instruction in hardware; it generates an exception and is emulated in hyperprivileged software.





CASA, CASXA, and LDSTUB are special atomic memory access instructions that concurrent processes use for synchronization and memory updates.

    Note  The SWAP instruction is also specified, but it is deprecated and should not be used in newly developed software.

The (nonportable) LDTXA instruction supplies an atomic 128-bit (16-byte) load that is important in certain system software applications.



3.3.1.1  Memory Alignment Restrictions

A memory access on an OpenSPARC virtual processor must typically be aligned on an address boundary greater than or equal to the size of the datum being accessed. An improperly aligned address in a load, store, or load-store instruction may trigger an exception and cause a subsequent trap. For details, see the Memory Alignment Restrictions section in an UltraSPARC Architecture specification.
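Since each natural access size is a power of two, the restriction reduces to the usual mask test; a one-line C sketch (illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if addr is suitably aligned for an access of `size` bytes,
     * where size is a power of two (1, 2, 4, 8, or 16). */
    static bool is_aligned(uint64_t addr, uint64_t size)
    {
        return (addr & (size - 1)) == 0;
    }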



3.3.1.2  Addressing Conventions

An unmodified OpenSPARC processor uses big-endian byte order by default: the address of a quadword, doubleword, word, or halfword is the address of its most significant byte. Increasing the address means decreasing the significance of the unit being accessed. All instruction accesses are performed using big-endian byte order.

An unmodified OpenSPARC processor also supports little-endian byte order for data accesses only: the address of a quadword, doubleword, word, or halfword is the address of its least significant byte. Increasing the address means increasing the significance of the data unit being accessed.
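The difference is easy to observe from C. This small program prints the byte stored at the lowest address of a 32-bit word; on a big-endian SPARC data access it is the most significant byte (0x11), whereas on a little-endian host it is the least significant byte (0x44):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t word = 0x11223344;
        uint8_t *lowest = (uint8_t *)&word; /* byte at the word's address */
        printf("byte at lowest address: 0x%02x\n", *lowest);
        return 0;
    }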



3.3.1.3  Addressing Range

An OpenSPARC implementation supports a 64-bit virtual address space. The supported range of virtual addresses is restricted to two equal-sized ranges at the extreme upper and lower ends of 64-bit addresses; that is, for n-bit virtual addresses, the valid address ranges are 0 to 2^(n-1) - 1 and 2^64 - 2^(n-1) to 2^64 - 1.
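The validity check follows directly from that formula; a C sketch (illustrative; n is whatever virtual-address width the implementation supports, for example 48):

    #include <stdbool.h>
    #include <stdint.h>

    /* Valid ranges for n-bit virtual addresses:
     *   [0, 2^(n-1) - 1]  and  [2^64 - 2^(n-1), 2^64 - 1]. */
    static bool va_is_valid(uint64_t va, unsigned n)
    {
        uint64_t half = 1ULL << (n - 1);
        return va < half || va >= -half; /* -half wraps to 2^64 - half */
    }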



See the OpenSPARC T1 Implementation Supplement or the OpenSPARC T2 Implementation Supplement for details.





3.3.1.4  Load/Store Alternate

Versions of load/store instructions, the load/store alternate instructions, can specify an arbitrary 8-bit address space identifier (ASI) for the load/store data access.

Access to alternate spaces 0x00-0x2F is restricted to privileged and hyperprivileged software, access to alternate spaces 0x30-0x7F is restricted to hyperprivileged software, and access to alternate spaces 0x80-0xFF is unrestricted. Some of the ASIs are available for implementation-dependent uses. Privileged and hyperprivileged software can use the implementation-dependent ASIs to access special protected registers, such as MMU control registers, cache control registers, virtual processor state registers, and other processor-dependent or system-dependent values. See the Address Space Identifiers (ASIs) chapter in an UltraSPARC Architecture specification for more information.

Alternate space addressing is also provided for the atomic memory access instructions LDSTUBA, CASA, and CASXA.



3.3.1.5  Separate Instruction and Data Memories

The interpretation of addresses in an unmodified OpenSPARC processor is "split": instruction references use one caching and translation mechanism and data references use another, although the same underlying main memory is shared.



In such split-memory systems, the coherency mechanism may be split, so a write[1] into data memory is not immediately reflected in instruction memory.

For this reason, programs that modify their own instruction stream (self-modifying code[2]) and that wish to be portable across all UltraSPARC Architecture (and SPARC V9) processors must issue FLUSH instructions, or a system call with a similar effect, to bring the instruction and data caches into a consistent state.



An UltraSPARC Architecture virtual processor may or may not have coherent instruction and data caches. Even if an implementation does have coherent instruction and data caches, a FLUSH instruction is still required for self-modifying code: not for cache coherency, but to flush pipeline instruction buffers that may hold stale copies of instructions that have since been modified.
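On a SPARC target with a GCC-compatible compiler, the FLUSH instruction can be issued from C with inline assembly. The sketch below is illustrative (the function name and loop structure are ours); it assumes FLUSH covers at least the aligned doubleword containing the effective address, so it steps through the modified range 8 bytes at a time:

    #include <stddef.h>
    #include <stdint.h>

    /* After writing new instructions to `code`, make them visible to
     * instruction fetch before jumping to them (SPARC only). */
    static void flush_new_code(void *code, size_t len)
    {
        uintptr_t p   = (uintptr_t)code & ~(uintptr_t)7; /* align down */
        uintptr_t end = (uintptr_t)code + len;

        for (; p < end; p += 8)
            __asm__ __volatile__("flush %0" : : "r"(p) : "memory");
    }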



[1] This includes use of store instructions (executed on the same or another virtual processor) that write to instruction memory, or any other means of writing into instruction memory (for example, DMA).

[2] This is practiced, for example, by software such as debuggers and dynamic linkers.





3.3.1.6  Input/Output (I/O)

The UltraSPARC Architecture assumes that input/output registers are accessed through load/store alternate instructions, normal load/store instructions, or read/write Ancillary State register instructions (RDasr, WRasr).



3.3.1.7  Memory Synchronization

Two instructions are used for synchronization of memory operations: FLUSH and MEMBAR. Their operation is explained in the Flush Instruction Memory and Memory Barrier sections, respectively, of UltraSPARC Architecture specifications.



3.3.2  Integer Arithmetic / Logical / Shift Instructions

The arithmetic/logical/shift instructions perform arithmetic, tagged arithmetic, logical, and shift operations. With one exception, these instructions compute a result that is a function of two source operands; the result is either written into a destination register or discarded. The exception, SETHI, can be used in combination with other arithmetic and/or logical instructions to create a constant in an R register.

Shift instructions shift the contents of an R register left or right by a given number of bits ("shift count"). The shift distance is specified by a constant in the instruction or by the contents of an R register.
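SETHI deposits a 22-bit constant into bits 31:10 of the destination register and clears the remaining bits, so an arbitrary 32-bit constant is typically built with SETHI followed by an OR of the low 10 bits (the familiar %hi/%lo assembler idiom). A C sketch of the split (illustrative):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t value = 0xDEADBEEF;
        uint32_t hi22  = value >> 10;   /* SETHI operand     */
        uint32_t lo10  = value & 0x3FF; /* operand of the OR */

        /* The two-instruction sequence reconstructs the constant: */
        assert(((hi22 << 10) | lo10) == value);
        return 0;
    }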



3.3.3  Control Transfer

Control-transfer instructions (CTIs) include PC-relative branches and calls, register-indirect jumps, and conditional traps. Most of the control-transfer instructions are delayed; that is, the instruction immediately following a control-transfer instruction in logical sequence is dispatched before the control transfer to the target address is completed. Note that the next instruction in logical sequence may not be the instruction following the control-transfer instruction in memory.

The instruction following a delayed control-transfer instruction is called a delay instruction. Setting the annul bit in a conditional delayed control-transfer instruction causes the delay instruction to be annulled (that is, to have




no effect) if and only if the branch is not taken. Setting the annul bit in an unconditional delayed control-transfer instruction ("branch always") causes the delay instruction to be always annulled.

Branch and CALL instructions use PC-relative displacements. The jump and link (JMPL) and return (RETURN) instructions use a register-indirect target address. They compute their target addresses either as the sum of two R registers or as the sum of an R register and a 13-bit signed immediate value. The "branch on condition codes without prediction" instruction provides a displacement of ±8 Mbytes; the "branch on condition codes with prediction" instruction provides a displacement of ±1 Mbyte; the "branch on register contents" instruction provides a displacement of ±128 Kbytes; and the CALL instruction's 30-bit word displacement allows a control transfer to any address within ±2 gigabytes (±2^31 bytes).
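Those reaches follow from the widths of the displacement fields: each displacement counts 4-byte instruction words, so a w-bit signed field spans ±2^(w-1) words, or ±2^(w+1) bytes. A quick C check (the field widths of 22, 19, 16, and 30 bits are the SPARC V9 instruction-format widths for the four cases above):

    #include <stdint.h>
    #include <stdio.h>

    /* Reach, in bytes, of a w-bit signed word displacement. */
    static uint64_t reach_bytes(unsigned w)
    {
        return (1ULL << (w - 1)) * 4;
    }

    int main(void)
    {
        printf("22-bit branch: +/-%llu MB\n", (unsigned long long)(reach_bytes(22) >> 20));
        printf("19-bit branch: +/-%llu MB\n", (unsigned long long)(reach_bytes(19) >> 20));
        printf("16-bit branch: +/-%llu KB\n", (unsigned long long)(reach_bytes(16) >> 10));
        printf("30-bit CALL:   +/-%llu GB\n", (unsigned long long)(reach_bytes(30) >> 30));
        return 0;
    }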



    Note  The return-from-privileged-trap instructions (DONE and RETRY) get their target address from the appropriate TPC or TNPC register.



3.3.4  State Register Access

This section describes the following state registers:

*  Ancillary state registers
*  Read and write privileged state registers
*  Read and write hyperprivileged state registers



3.3.4.1  Ancillary State Registers

The read and write ancillary state register instructions read and write the contents of ancillary state registers visible to nonprivileged software (Y, CCR, ASI, PC, TICK, and FPRS) and some registers visible only to privileged and hyperprivileged software (PCR, SOFTINT, TICK_CMPR, and STICK_CMPR).



3.3.4.2  PR State Registers

The read and write privileged register instructions (RDPR and WRPR) read and write the contents of state registers visible only to privileged and hyperprivileged software (TPC, TNPC, TSTATE, TT, TICK, TBA, PSTATE, TL, PIL, CWP, CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, and WSTATE).




3.3.4.3  HPR State Registers

The read and write hyperprivileged register instructions (RDHPR and WRHPR) read and write the contents of state registers visible only to hyperprivileged software (HPSTATE, HTSTATE, HINTP, HVER, and HSTICK_CMPR).



3.3.5  Floating-Point Operate

Floating-point operate (FPop) instructions perform all floating-point calculations; they are register-to-register instructions that operate on the floating-point registers. FPops compute a result that is a function of one, two, or three source operands. The groups of instructions that are considered FPops are listed in the Floating-Point Operate (FPop) Instructions section of UltraSPARC Architecture specifications.



3.3.6  Conditional Move

Conditional move instructions conditionally copy a value from a source register to a destination register, depending on an integer or floating-point condition code or on the contents of an integer register. These instructions can be used to reduce the number of branches in software.



3.3.7  Register Window Management

Register window instructions manage the register windows. SAVE and RESTORE are nonprivileged and cause a register window to be pushed or popped. FLUSHW is nonprivileged and causes all of the windows except the current one to be flushed to memory. SAVED and RESTORED are used by privileged software to end a window spill or fill trap handler.



3.3.8  SIMD

An unmodified OpenSPARC processor includes SIMD (single instruction, multiple data) instructions, also known as "vector" instructions, which allow a single instruction to perform the same operation on multiple data items totaling 64 bits, such as eight 8-bit, four 16-bit, or two 32-bit data items. These operations are part of the "VIS" instruction set extensions.





3.4  Traps

A trap is a vectored transfer of control to privileged or hyperprivileged software through a trap table that may contain the first 8 instructions (32 for some frequently used traps) of each trap handler. The base address of the table is established by software in a state register (the Trap Base Address register, TBA, or the Hyperprivileged Trap Base register, HTBA). The displacement within the table is encoded in the type number of each trap and the level of the trap. Part of the trap table is reserved for hardware traps, and part of it is reserved for software traps generated by trap (Tcc) instructions.
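The displacement computation can be made concrete. Below is a hedged C sketch of SPARC V9-style trap vectoring: 32-byte table entries indexed by the trap type, with bit 14 selecting the half of the table used when the trap is taken at TL > 0. Exact details are implementation-specific; consult the specifications for the authoritative encoding.

    #include <stdint.h>

    /* Entry address of a trap handler, given the trap base address (TBA),
     * the trap level at the time of the trap (TL), and the trap type (TT). */
    static uint64_t trap_vector(uint64_t tba, unsigned tl, unsigned tt)
    {
        uint64_t base = tba & ~0x7FFFULL;          /* TBA supplies bits 63:15 */
        uint64_t half = tl > 0 ? (1ULL << 14) : 0; /* upper half when TL > 0  */
        return base | half | ((uint64_t)tt << 5);  /* 32 bytes per entry      */
    }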



A trap causes the current PC and NPC to be saved in the TPC and TNPC registers. It also causes the CCR, ASI, PSTATE, and CWP registers to be saved in TSTATE. TPC, TNPC, and TSTATE are entries in a hardware trap stack, where the number of entries in the trap stack is equal to the number of supported trap levels. A trap causes hyperprivileged state to be saved in the HTSTATE trap stack. A trap also sets bits in the PSTATE (and, in some cases, HPSTATE) register and typically increments the GL register. Normally, the CWP is not changed by a trap; on a window spill or fill trap, however, the CWP is changed to point to the register window to be saved or restored.



A trap can be caused by a Tcc instruction, an asynchronous exception, an instruction-induced exception, or an interrupt request not directly related to a particular instruction. Before executing each instruction, a virtual processor determines whether there are any pending exceptions or interrupt requests. If any are pending, the virtual processor selects the highest-priority exception or interrupt request and causes a trap.



See the Traps chapter in an UltraSPARC Architecture specification for a complete description of traps.



3.5  Chip-Level Multithreading (CMT)

An OpenSPARC implementation may include multiple virtual processor cores within the processor ("chip") to provide a dense, high-throughput system. This may be achieved by having a combination of multiple physical processor




cores and/or multiple strands (threads) per physical processor core; processors so organized are referred to as chip-level multithreaded (CMT) processors. CMT-specific hyperprivileged registers are used for identification and configuration of CMT processors.

The CMT programming model describes a common interface between hardware (CMT registers) and software.

The common CMT registers and the CMT programming model are described in the Chip-Level Multithreading (CMT) chapter in UltraSPARC Architecture specifications.



CHAPTER 4 



OpenSPARC T1 and T2 Processor Implementations



This chapter introduces the OpenSPARC T1 and OpenSPARC T2 chip-level multithreaded (CMT) processors in the following sections:

*  General Background
*  OpenSPARC T1 Overview
*  OpenSPARC T1 Components
*  OpenSPARC T2 Overview
*  OpenSPARC T2 Components
*  Summary of Differences Between OpenSPARC T1 and OpenSPARC T2



4.1  General Background

OpenSPARC T1 is the first chip multiprocessor that fully implements Sun's Throughput Computing initiative. OpenSPARC T2 is the follow-on chip multithreaded (CMT) processor to the OpenSPARC T1 processor. Throughput Computing is a technique that takes advantage of the thread-level parallelism that is present in most commercial workloads. Unlike desktop workloads, which often have a small number of threads concurrently running, most commercial workloads achieve their scalability by employing large pools of concurrent threads.

Historically, microprocessors have been designed to target desktop workloads, and as a result have focused on running a single thread as quickly as possible. Single-thread performance is achieved in these microprocessors by a combination of extremely deep pipelines (over 20 stages in Pentium 4) and execution of multiple instructions in parallel (referred to as instruction-level parallelism, or ILP). The basic tenet






behind Throughput Computing is that exploiting ILP and deep pipelining has reached the point of diminishing returns and, as a result, current microprocessors do not utilize their underlying hardware very efficiently.

For many commercial workloads, the physical processor core will be idle most of the time waiting on memory, and even when it is executing, it will often be able to utilize only a small fraction of its wide execution width. So rather than building a large and complex ILP processor that sits idle most of the time, one can build, in the same chip area, a number of small, single-issue physical processor cores that employ multithreading. Combining multiple physical processor cores on a single chip with multiple hardware-supported threads (strands) per physical processor core allows very high performance for highly threaded commercial applications. This approach is called thread-level parallelism (TLP). The difference between TLP and ILP is shown in FIGURE 4-1.



[FIGURE 4-1  Differences Between TLP and ILP. The figure contrasts four TLP strands, which interleave so that some strands execute while others are stalled on memory, with a single ILP strand that executes two instructions at a time and then stalls on memory.]



The memory stall time of one strand can often be overlapped with execution of other strands on the same physical processor core, and multiple physical processor cores run their strands in parallel. In the ideal case, shown in FIGURE 4-1, memory latency can be completely overlapped with execution of other strands. In contrast, instruction-level parallelism simply shortens the time to execute instructions and does not help much in overlapping execution with memory latency.[1]



[1] Processors that employ out-of-order ILP can overlap some memory latency with execution. However, this overlap is typically limited to shorter memory-latency events, such as L1 cache misses that hit in the L2 cache. Longer memory-latency events, such as main memory accesses, are rarely overlapped to a significant degree with execution by an out-of-order processor.





Given this ability to overlap execution with memory latency, why don't more processors utilize TLP? The answer is that designing processors is a mostly evolutionary process, and the ubiquitous deeply pipelined, wide-ILP physical processor cores of today are the evolutionary outgrowth from a time when the CPU was the bottleneck in delivering good performance.

With physical processor cores capable of multiple-GHz clocking, the performance bottleneck has shifted to the memory and I/O subsystems, and TLP has an obvious advantage over ILP for tolerating the large I/O and memory latency prevalent in commercial applications. Of course, every architectural technique has its advantages and disadvantages. The one disadvantage of employing TLP over ILP is that execution of a single strand may be slower on a TLP processor than on an ILP processor. With physical processor cores running at frequencies well over 1 GHz, a strand capable of executing only a single instruction per cycle is fully capable of completing tasks in the time required by the application, making this disadvantage a nonissue for nearly all commercial applications.



4.2  OpenSPARC T1 Overview

OpenSPARC T1 is a single-chip multiprocessor. OpenSPARC T1 contains eight SPARC physical processor cores. Each SPARC physical processor core has full hardware support for four virtual processors (or "strands"). These four strands run simultaneously, with the instructions from each of the four strands executed round-robin by the single-issue pipeline. When a strand encounters a long-latency event, such as a cache miss, it is marked unavailable and instructions are not issued from that strand until the long-latency event is resolved. Round-robin execution of the remaining available strands continues while the long-latency event of the first strand is resolved.
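The selection policy itself is simple. Here is a minimal C sketch (illustrative, and far simpler than the real pipeline logic) of round-robin issue that skips strands parked on long-latency events:

    #include <stdbool.h>

    #define NUM_STRANDS 4

    /* available[s] is cleared when strand s encounters a long-latency
     * event (such as a cache miss) and set again when the event resolves.
     * Returns the next strand to issue from, or -1 if all are stalled. */
    static int pick_next_strand(const bool available[NUM_STRANDS], int last)
    {
        for (int step = 1; step <= NUM_STRANDS; step++) {
            int s = (last + step) % NUM_STRANDS;
            if (available[s])
                return s;
        }
        return -1;
    }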



Each OpenSPARC T1 physical core has a 16-Kbyte, 4-way associative instruction cache (32-byte lines), an 8-Kbyte, 4-way associative data cache (16-byte lines), a 64-entry fully associative instruction Translation Lookaside Buffer (TLB), and a 64-entry fully associative data TLB, all shared by the four strands. The eight SPARC physical cores are connected through a crossbar to an on-chip unified 3-Mbyte, 12-way associative L2 cache (with 64-byte lines). The L2 cache is banked four ways to provide sufficient bandwidth for the eight OpenSPARC T1 physical cores. The L2 cache connects to four on-chip DRAM controllers, which directly interface to DDR2-SDRAM. In addition,

----------------------- Page 47-----------------------

28                            Chapter 4     OpenSPARC T1 and T2 Processor Implementations 



an on-chip J-Bus controller and several on-chip I/O-mapped control registers 

are accessible to the SPARC physical cores. Traffic from the J-Bus coherently 

interacts with the L2 cache. 



A block diagram of the OpenSPARC T1 chip is shown in FIGURE 4-2. 



[Figure: the eight SPARC cores connect through the cache crossbar (CCX) to L2 banks 0-3; each L2 bank pairs with a DRAM control channel driving DDR-II memory. A shared FPU, the eFuse block, the CTU (with JTAG port), the IOB (with CSRs), the J-Bus system interface (200 MHz), and the SSI ROM interface (50 MHz) complete the chip. Notes: (1) Blocks not scaled to physical size. (2) Bus widths are labeled as in#,out#, where "in" is into CCX or L2.]

FIGURE 4-2 OpenSPARC T1 Chip Block Diagram





4.3 OpenSPARC T1 Components



This section describes each component of OpenSPARC T1 in the following subsections:

* SPARC Physical Core
* Floating-Point Unit (FPU)
* L2 Cache
* DRAM Controller
* I/O Bridge (IOB) Unit
* J-Bus Interface (JBI)
* SSI ROM Interface
* Clock and Test Unit (CTU)
* EFuse



4.3.1 OpenSPARC T1 Physical Core



Each OpenSPARC T1 physical core has hardware support for four strands. This support consists of a full register file (with eight register windows) per strand, with most of the ASI, ASR, and privileged registers replicated per strand. The four strands share the instruction and data caches and TLBs. An autodemap[2] feature is included with the TLBs to allow the multiple strands to update the TLB without locking.

[2] Autodemap causes an existing TLB entry to be automatically removed when a new entry is installed with the same virtual page number (VPN) and the same page size.
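As a rough illustration of what autodemap buys, here is a C sketch of an install into a small fully associative TLB model; the entry layout and the names tlb_entry_t and tlb_install are invented for the example and do not reflect the actual T1 arrays.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    typedef struct {
        bool     valid;
        uint64_t vpn;        /* virtual page number */
        int      page_size;  /* encoded page size   */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Install a new translation. Autodemap first removes any entry with
       the same VPN and page size, so concurrent strands can update the
       TLB without taking a lock and without creating duplicates. */
    void tlb_install(int idx, uint64_t vpn, int page_size)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn &&
                tlb[i].page_size == page_size)
                tlb[i].valid = false;

        tlb[idx] = (tlb_entry_t){ .valid = true,
                                  .vpn = vpn,
                                  .page_size = page_size };
    }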



The core pipeline consists of six stages: Fetch, Switch, Decode, Execute, Memory, and Writeback. As shown in FIGURE 4-3, the Switch stage contains a strand instruction register for each strand. One of the strands is picked by the strand scheduler, and the current instruction for that strand is issued to the pipe. While this is done, the hardware fetches the next instruction for that strand and updates the strand instruction register.

The scheduled instruction proceeds down the rest of the stages of the pipe, similar to instruction execution in a single-strand RISC machine. It is decoded in the Decode stage; the register file access also happens at this time. In the Execute stage, all arithmetic and logical operations take place, and the memory address is calculated. The data cache is accessed in the Memory stage, and the instruction is committed in the Writeback stage. All traps are signaled in the Writeback stage.
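As a compact summary, the stage sequence can be written down as a C enumeration; the identifiers below are ours, chosen for the example rather than taken from the T1 design files.

    /* The six OpenSPARC T1 pipeline stages and the work done in each. */
    enum t1_stage {
        FETCH,      /* instruction fetched from the I-cache          */
        SWITCH,     /* strand scheduler picks a strand's instruction */
        DECODE,     /* decode; register file access                  */
        EXECUTE,    /* arithmetic/logic ops; memory address calc     */
        MEMORY,     /* data cache access                             */
        WRITEBACK   /* instruction commits; traps signaled here      */
    };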








Instructions are classified as either short- or long-latency instructions. Upon encountering a long-latency instruction or other stall condition in a certain strand, the strand scheduler stops scheduling that strand for further execution. Scheduling commences again when the long-latency instruction completes or the stall condition clears.



FIGURE 4-3 illustrates the OpenSPARC T1 physical core. 



[Figure: the I-cache feeds per-strand strand instruction registers, from which the strand scheduler selects an instruction for the Decode stage; the datapath comprises the ALU, register files, store buffers, and D-cache, with an external interface to the rest of the chip.]

FIGURE 4-3 OpenSPARC T1 Core Block Diagram



4.3.2 Floating-Point Unit (FPU)



A single floating-point unit is shared by all eight OpenSPARC T1 physical cores. The shared floating-point unit is sufficient for most commercial applications, in which fewer than 1% of instructions typically involve floating-point operations.





4.3.3 L2 Cache



The L2 cache is banked four ways, with the bank selection based on physical address bits 7:6. The cache is 3-Mbyte, 12-way set associative with pseudo-LRU replacement (replacement is based on a used-bit scheme), and has a line size of 64 bytes. Unloaded access time is 23 cycles for an L1 data cache miss and 22 cycles for an L1 instruction cache miss.
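Since bank selection uses physical address bits 7:6 and lines are 64 bytes, the mapping from address to bank reduces to a shift and a mask. A sketch, with an invented helper name:

    #include <stdint.h>

    /* Physical address bits 5:0 select the byte within a 64-byte line;
       bits 7:6 select one of the four L2 banks. */
    static inline unsigned t1_l2_bank(uint64_t paddr)
    {
        return (unsigned)((paddr >> 6) & 0x3);
    }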



4.3.4 DRAM Controller



OpenSPARC T1's DRAM Controller is banked four ways,[3] with each L2 bank interacting with exactly one DRAM Controller bank. The DRAM Controller is interleaved based on physical address bits 7:6, so each DRAM Controller bank must have the same amount of memory installed and enabled.

[3] A two-bank option is available for cost-constrained minimal memory configurations.



OpenSPARC T1 uses DDR2 DIMMs and can support one or two ranks of stacked or unstacked DIMMs. Each DRAM bank/port is two DIMMs wide (128-bit data + 16-bit ECC). All installed DIMMs on an individual bank/port must be identical, and the same total amount of memory (number of bytes) must be installed on each DRAM Controller port. The DRAM controller frequency is an exact ratio of the CMP core frequency, where the CMP core frequency must be at least 4x the DRAM controller frequency. The DDR (double data rate) data buses, of course, transfer data at twice the DRAM controller frequency.
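The clocking rules above are easy to mis-state, so a tiny check helps. The sketch below uses made-up clock values purely to exercise the two constraints (exact ratio, and core at least 4x the controller); the numbers are not T1's actual clocks.

    #include <assert.h>

    int main(void)
    {
        unsigned cmp_mhz  = 1200;  /* hypothetical CMP core clock        */
        unsigned dram_mhz = 200;   /* hypothetical DRAM controller clock */

        assert(cmp_mhz % dram_mhz == 0);  /* exact ratio                 */
        assert(cmp_mhz >= 4 * dram_mhz);  /* core >= 4x the controller   */

        /* DDR buses transfer at twice the controller frequency. */
        unsigned ddr_mts = 2 * dram_mhz;  /* 400 MT/s in this example    */
        (void)ddr_mts;
        return 0;
    }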



The DRAM Controller also supports a small memory configuration mode, using only two DRAM ports. In this mode, L2 banks 0 and 2 are serviced by DRAM port 0, and L2 banks 1 and 3 are serviced by DRAM port 1. The installed memory on each of these ports is still two DIMMs wide.
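The bank-to-port mapping in both configurations comes down to a single expression; the function below is an illustrative restatement with an invented name, not controller code.

    #include <stdbool.h>

    /* Map an L2 bank (0-3) to a DRAM port. In the small memory
       configuration, banks 0 and 2 share port 0 and banks 1 and 3
       share port 1; otherwise each bank has its own port. */
    static inline unsigned dram_port(unsigned l2_bank, bool small_mem)
    {
        return small_mem ? (l2_bank & 0x1) : l2_bank;
    }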



4.3.5 I/O Bridge (IOB) Unit



The IOB performs an address decode on I/O-addressable transactions and directs them to the appropriate internal block or to the appropriate external interface (J-Bus or SSI). In addition, the IOB maintains the register status for external interrupts.








4.3.6 J-Bus Interface (JBI)



J-Bus is the interconnect between OpenSPARC T1 and the I/O subsystem. It is a 200 MHz, 128-bit-wide, multiplexed address/data bus, used predominantly for DMA traffic, plus the PIO traffic to control it.



The JBI is the block that interfaces to J-Bus, receiving and responding to DMA requests, routing them to the appropriate L2 banks, issuing PIO transactions on behalf of the strands, and forwarding responses back.



4.3.7 SSI ROM Interface



OpenSPARC T1 has a 50 Mbit/s serial interface (SSI) that connects to an external FPGA, which in turn interfaces to the boot ROM. In addition, the SSI interface supports PIO accesses across the SSI, thus supporting optional CSRs or other interfaces within the FPGA.



4.3.8 Clock and Test Unit (CTU)



The CTU contains the clock generation, reset, and JTAG circuitry. 



OpenSPARC T1 has a single PLL, which takes the J-Bus clock as its input reference. The PLL output is divided down to generate the CMP core clocks (for OpenSPARC T1 and caches), the DRAM clock (for the DRAM controller and external DIMMs), and the internal J-Bus clock (for the IOB and JBI). Thus, all OpenSPARC T1 clocks are ratioed. Sync pulses are generated to control transmission of signals and data across clock domain boundaries.



The CTU has the state machines for internal reset sequencing, which include logic to reset the PLL and signal when the PLL is locked, update clock ratios on warm resets (if so programmed), enable clocks to each block in turn, and distribute reset so that its assertion is seen simultaneously in all clock domains.



The CTU also contains the JTAG block, which allows access to the shadow scan chains and has a CREG interface that allows JTAG to issue reads of any I/O-addressable register, some ASI locations, and any memory location while OpenSPARC T1 is in operation.





4.3.9 EFuse



The eFuse (electronic fuse) block contains configuration information that is electronically burned in as part of manufacturing, including the part serial number and strand-available information.



4.4 OpenSPARC T2 Overview



OpenSPARC T2 is a single-chip multithreaded (CMT) processor. OpenSPARC T2 contains eight SPARC physical processor cores. Each SPARC physical processor core has full hardware support for eight virtual processors (strands), two integer execution pipelines, one floating-point execution pipeline, and one memory pipeline. The floating-point and memory pipelines are shared by all eight strands. The eight strands are hard-partitioned into two groups of four, and the four strands within a group share a single integer pipeline.



While all eight strands run simultaneously, at any given time at most two strands will be active in the physical core, and those two strands will be issuing either a pair of integer pipeline operations, an integer operation and a floating-point operation, an integer operation and a memory operation, or a floating-point operation and a memory operation. Strands are switched on a cycle-by-cycle basis between the available strands within the hard-partitioned group of four, using a least-recently-issued priority scheme.



When a strand encounters a long-latency event, such as a cache miss, it is marked unavailable and instructions will not be issued from that strand until the long-latency event is resolved. Execution of the remaining available strands will continue while the long-latency event of the first strand is resolved.
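A least-recently-issued pick over one four-strand group can be sketched as below; the ready and last_issue arrays and the function name are invented stand-ins for the per-strand state the hardware keeps, not T2 design identifiers.

    #include <stdbool.h>

    #define GROUP_STRANDS 4

    static bool          ready[GROUP_STRANDS];       /* not stalled         */
    static unsigned long last_issue[GROUP_STRANDS];  /* cycle of last issue */

    /* Among the available strands in the group, pick the one that
       issued least recently; return -1 if the whole group is stalled. */
    int pick_strand_lri(void)
    {
        int best = -1;
        for (int s = 0; s < GROUP_STRANDS; s++)
            if (ready[s] && (best < 0 || last_issue[s] < last_issue[best]))
                best = s;
        return best;
    }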



Each OpenSPARC T2 physical core has a 16-Kbyte, 8-way associative instruction cache (32-byte lines), an 8-Kbyte, 4-way associative data cache (16-byte lines), a 64-entry fully associative instruction TLB, and a 128-entry fully associative data TLB that are shared by the eight strands. The eight OpenSPARC T2 physical cores are connected through a crossbar to an on-chip unified 4-Mbyte, 16-way associative L2 cache (64-byte lines).



The L2 cache is banked eight ways to provide sufficient bandwidth for the eight OpenSPARC T2 physical cores. The L2 cache connects to four on-chip DRAM Controllers, which directly interface to a pair of fully buffered DIMM (FBD) channels. In addition, two 1-Gbit/10-Gbit Ethernet MACs and several on-chip I/O-mapped control registers are accessible to the SPARC physical cores.



A block diagram of the OpenSPARC T2 chip is shown in FIGURE 4-4. 

[Figure: the eight SPARC cores connect through the cache crossbar (CCX) to eight L2 banks; pairs of L2 banks feed the four memory controller units (MCU 0-3), each driving a pair of fully buffered DIMM (FBD) channels (DIMMs 1-8, one or two ranks per DIMM; an optional dual-channel mode is noted). Clock domains shown include 1.4 GHz cores, 800 MHz memory interfaces, and 4.8 GHz FBD links. System-on-chip blocks include the TCU, CCU, eFuse, the NIU with two 10-Gb Ethernet MACs, the SIU, PCI-Express interfaces, an FCRAM interface, and the SSI ROM interface.]

FIGURE 4-4 OpenSPARC T2 Chip Block Diagram



4.5 OpenSPARC T2 Components



This section describes the major components in OpenSPARC T2. 





4.5.1 OpenSPARC T2 Physical Core



Each OpenSPARC T2 physical core has hardware support for eight strands. This support consists of a full register file (with eight register windows) per strand, with most of the ASI, ASR, and privileged registers replicated per strand. The eight strands share the instruction and data caches and TLBs. An autodemap feature is included with the TLBs to allow the multiple strands to update the TLB without locking.



Each OpenSPARC T2 physical core contains a floating-point unit, shared by all eight strands. The floating-point unit performs single- and double-precision floating-point operations, graphics operations, and integer multiply and divide operations.



4.5.2 L2 Cache



The L2 cache is banked eight ways. To provide for better partial-die recovery, OpenSPARC T2 can also be configured in 4-bank and 2-bank modes (with 1/2 and 1/4 the total cache size, respectively). Bank selection is based on physical address bits 8:6 for 8 banks, 7:6 for 4 banks, and 6 for 2 banks. The cache is 4 Mbytes, 16-way set associative, and uses index hashing. The line size is 64 bytes.
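The three bank-select options differ only in how many physical address bits above the 64-byte line offset are used. A sketch with an invented helper name (index hashing omitted for clarity):

    #include <stdint.h>

    /* Select an L2 bank on OpenSPARC T2: bits 8:6 of the physical
       address in 8-bank mode, 7:6 in 4-bank mode, and bit 6 in
       2-bank mode. */
    static inline unsigned t2_l2_bank(uint64_t paddr, unsigned num_banks)
    {
        switch (num_banks) {
        case 8:  return (unsigned)((paddr >> 6) & 0x7);
        case 4:  return (unsigned)((paddr >> 6) & 0x3);
        default: return (unsigned)((paddr >> 6) & 0x1);  /* 2 banks */
        }
    }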



4.5.3 Memory Controller Unit (MCU)



OpenSPARC T2 has four MCUs, one for each memory branch, with a pair of L2 banks interacting with exactly one DRAM branch. The branches are interleaved based on physical address bits 7:6 and support 1-16 DDR2 DIMMs. Each memory branch is two FBD channels wide. A branch may use only one of the FBD channels in a reduced-power configuration.



Each DRAM branch operates independently and can have a different memory size and a different kind of DIMM (for example, a different number of ranks or different CAS latency). Software should not use an address space larger than four times the lowest memory capacity in a branch, because the cache lines are interleaved across branches. The DRAM controller frequency is the same as that of the DDR (double data rate) data buses, which is twice the DDR SDRAM clock frequency. The FBDIMM links run at six times the frequency of the DDR data buses.
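Putting the ratios together with DDR2-800-style numbers (the 400 MHz base clock here is an illustrative value chosen to be consistent with the 800 MHz and 4.8 GHz labels in FIGURE 4-4, not a stated T2 requirement):

    #include <stdio.h>

    int main(void)
    {
        unsigned ddr_clk_mhz  = 400;               /* DDR2 base clock     */
        unsigned ddr_data_mts = 2 * ddr_clk_mhz;   /* data bus: 800 MT/s  */
        unsigned fbd_link_mhz = 6 * ddr_data_mts;  /* FBD links: 4800 MHz */

        printf("data bus %u MT/s, FBD links %u MHz\n",
               ddr_data_mts, fbd_link_mhz);
        return 0;
    }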



The OpenSPARC T2 MCU implements a DDR2 FBD design model that is based on various JEDEC-approved DDR2 SDRAM and FBDIMM standards. JEDEC has received information that certain patents or patent applications may be relevant to the FBDIMM Advanced Memory Buffer standard (JESD82-20) as well as to other standards related to FBDIMM technology (JESD206); for more information, see http://www.jedec.org/download/search/FBDIMM/Patents.xls. Sun Microsystems does not provide any legal opinions as to the validity or relevancy of such patents or patent applications. Sun Microsystems encourages prospective users of the OpenSPARC T2 MCU design to review all information assembled by JEDEC and develop their own independent conclusion.



4.5.4 Noncacheable Unit (NCU)



The NCU performs an address decode on I/O-addressable transactions and directs them to the appropriate block (for example, DMU, CCU). In addition, the NCU maintains the register status for external interrupts.



4.5.5 System Interface Unit (SIU)



The SIU connects the DMU and the L2 cache. The SIU is the L2 cache access point for the network subsystem.



4.5.6 SSI ROM Interface (SSI)



OpenSPARC T2 has a 50 Mbit/s serial interface (SSI), which connects to an external boot ROM. In addition, the SSI supports PIO accesses across the SSI, thus supporting optional Control and Status registers (CSRs) or other interfaces attached to the SSI.



4.6 Summary of Differences Between OpenSPARC T1 and OpenSPARC T2



OpenSPARC T2 follows the CMT philosophy of OpenSPARC T1, but adds more execution capability to each physical core, as well as significant system-on-a-chip components and an enhanced L2 cache.





4.6.1 Microarchitectural Differences



The following lists the microarchitectural differences. 



* Each OpenSPARC T2 physical core contains two integer execution pipelines and a single floating-point pipeline. OpenSPARC T1 has a single integer execution pipeline, and all of its cores share a single floating-point pipeline.

* Each physical core in OpenSPARC T2 supports eight strands, which all share the floating-point pipeline. The eight strands are partitioned into two groups of four strands, each of which shares an integer pipeline. OpenSPARC T1 shares its single integer pipeline among four strands.

* The pipeline in OpenSPARC T2 is eight stages, two stages longer than OpenSPARC T1's.

* The instruction cache is 8-way associative, compared to 4-way in OpenSPARC T1.

* The L2 cache is 4-Mbyte, 8-banked, and 16-way associative, compared to 3-Mbyte, 4-banked, and 12-way associative in OpenSPARC T1.

* The data TLB is 128 entries, compared to 64 entries in OpenSPARC T1.

* The memory interface in OpenSPARC T2 supports fully buffered DIMMs (FBDs), providing higher capacity and memory clock rates.

* The OpenSPARC T2 memory channels support a single-DIMM option for low-cost configurations.

* OpenSPARC T2 includes a network interface unit (NIU), to which network traffic management tasks can be off-loaded.



4.6.2 Instruction Set Architecture (ISA) Differences



There are a number of ISA differences between OpenSPARC T2 and OpenSPARC T1, as follows:



* OpenSPARC T2 fully supports all VIS 2.0 instructions. OpenSPARC T1 supports a subset of VIS 1.0 plus the SIAM (Set Interval Arithmetic Mode) instruction (on OpenSPARC T1, the remainder of the VIS 1.0 and 2.0 instructions trap to software for emulation).

* OpenSPARC T2 supports the full CMP specification, as described in UltraSPARC Architecture 2007. OpenSPARC T1 has its own version of the CMP control/status registers. OpenSPARC T2 consists of eight physical cores, with eight virtual processors per physical core.





* OpenSPARC T2 does not support OpenSPARC T1's idle state or its idle, halt, or resume messages. Instead, OpenSPARC T2 supports parking and unparking as specified in the CMP chapter of the UltraSPARC Architecture 2007 Specification. Note that parking is similar to OpenSPARC T1's idle state. OpenSPARC T2 does support an equivalent to the halt state, which on OpenSPARC T1 is entered by writing to HPR 1E₁₆. However, OpenSPARC T2 does not support OpenSPARC T1's STRAND_STS_REG ASR, which holds the strand state. The halted state is not software-visible on OpenSPARC T2.

* OpenSPARC T2 does not support the INT_VEC_DIS register (which allows any OpenSPARC T1 strand to generate an interrupt, reset, idle, or resume message to any strand). Instead, an alias to ASI_INTR_W is provided, which allows only the generation of an interrupt to any strand.

* OpenSPARC T2 supports the ALLCLEAN, INVALW, NORMALW, OTHERW, POPC, and FSQRT instructions in hardware.

* OpenSPARC T2's floating-point unit generates fp_unfinished_other with FSR.ftt unfinished_FPop for most denorm cases and supports a nonstandard mode that flushes denorms to zero. OpenSPARC T1 handles denorms in hardware, never generates an unfinished_FPop, and does not support a nonstandard mode.

* OpenSPARC T2 generates an illegal_instruction trap on any quad-precision FP instruction, whereas OpenSPARC T1 generates an fp_exception_other trap on numeric and move-FP-quad instructions. See Table 5-2 of the UltraSPARC T2 Supplement to the "UltraSPARC Architecture 2007 Specification."

* OpenSPARC T2 generates a privileged_action exception upon attempted access to hyperprivileged ASIs by privileged software, whereas, in such cases, OpenSPARC T1 takes a data_access_exception exception.

* OpenSPARC T2 supports PSTATE.tct; OpenSPARC T1 does not.

* OpenSPARC T2 implements the SAVE instruction similarly to all previous UltraSPARC processors. OpenSPARC T1 implements a SAVE instruction that updates the locals in the new window to be the same as the locals in the old window, and swaps the ins (outs) of the old window with the outs (ins) of the new window.

* PSTATE.am masking details differ between OpenSPARC T1 and OpenSPARC T2, as described in Section 11.1.8 of the UltraSPARC T2 Supplement to the "UltraSPARC Architecture 2007 Specification."

* OpenSPARC T2 implements PREFETCH fcn = 18₁₆ as a prefetch invalidate cache entry, for efficient software cache flushing.

* The Synchronous Fault Status register (SFSR) is eliminated in OpenSPARC T2.





* OpenSPARC T1's data_access_exception is replaced in OpenSPARC T2 by multiple DAE_* exceptions.

* OpenSPARC T1's instruction_access_exception exception is replaced in OpenSPARC T2 by multiple IAE_* exceptions.



4.6.3 MMU Differences



The OpenSPARC T2 and OpenSPARC T1 MMUs differ as follows: 



* OpenSPARC T2 has a 128-entry DTLB, whereas OpenSPARC T1 has a 64-entry DTLB.

* OpenSPARC T2 supports a pair of primary context registers and a pair of secondary context registers. OpenSPARC T1 supports a single primary context register and a single secondary context register.

* OpenSPARC T2 does not support a locked bit in the TLBs. OpenSPARC T1 supports a locked bit in the TLBs.

* OpenSPARC T2 supports only the sun4v (the architected interface between privileged software and hyperprivileged software) TTE format for the I/D-TLB Data-In and Data-Access registers. OpenSPARC T1 supports both the sun4v and the older sun4u TTE formats.

* OpenSPARC T2 is compatible with UltraSPARC Architecture 2007 with regard to multiple flavors of data access exception (DAE_*) and instruction access exception (IAE_*). As per UltraSPARC Architecture 2005, OpenSPARC T1 uses the single flavor of data_access_exception and instruction_access_exception, indicating the "flavors" in its SFSR register.

* OpenSPARC T2 supports a hardware Table Walker to accelerate ITLB and DTLB miss handling.

* The number and format of translation storage buffer (TSB) configuration and pointer registers differ between OpenSPARC T1 and OpenSPARC T2. OpenSPARC T2 uses physical addresses for TSB pointers; OpenSPARC T1 uses virtual addresses for TSB pointers.

* OpenSPARC T1 and OpenSPARC T2 support the same four page sizes (8 Kbyte, 64 Kbyte, 4 Mbyte, 256 Mbyte). OpenSPARC T2 generates an unsupported_page_size trap when an illegal page size is programmed into the TSB registers or an attempt is made to load a page of illegal size into the TLB. OpenSPARC T1 forces an illegal page size being programmed into the TSB registers to 256 Mbytes and generates a data_access_exception trap when a page with an illegal size is loaded into the TLB.

* OpenSPARC T2 adds a demap real operation, which demaps all pages with r = 1 from the TLB.





* OpenSPARC T2 supports an I-TLB probe ASI.

* Autodemapping of pages in the TLBs demaps only pages of the same size or of a larger size in OpenSPARC T2. In OpenSPARC T1, autodemap demaps pages of the same size, larger size, or smaller size.

* OpenSPARC T2 supports detection of multiple hits in the TLBs.



4.6.4 Performance Instrumentation Differences



Both OpenSPARC T1 and OpenSPARC T2 provide access to hardware performance counters through the PIC and PCR registers. However, the events captured by the hardware differ significantly between OpenSPARC T1 and OpenSPARC T2, with OpenSPARC T2 capturing a much larger set of events, as described in Chapter 10 of the UltraSPARC T2 Supplement to the "UltraSPARC Architecture 2007 Specification." OpenSPARC T2 also supports counting events in hyperprivileged mode; OpenSPARC T1 does not.



In addition, the implementation of pic_overflow differs between OpenSPARC T1 and OpenSPARC T2. OpenSPARC T1 provides a disrupting pic_overflow trap on the instruction following the one that caused the overflow event. OpenSPARC T2 provides a disrupting pic_overflow on an instruction that generates the event, but one that occurs within an epsilon number of event-generating instructions of the actual overflow.



Both OpenSPARC T2 and OpenSPARC T1 support DRAM performance counters.



4.6.5 Error Handling Differences



Error handling differs quite a bit between OpenSPARC T1 and OpenSPARC T2. OpenSPARC T1 primarily employs hardware correction of errors, whereas OpenSPARC T2 primarily employs software correction of errors.



* OpenSPARC T2 uses the following traps for error handling:
  * data_access_error
  * data_access_MMU_error
  * hw_corrected_error
  * instruction_access_error
  * instruction_access_MMU_error
  * internal_processor_error
  * store_error
  * sw_recoverable_error