The Way of the great learning involves manifesting virtue, renovating the people, and abiding by the highest good.

2009年3月31日星期二

关于“谷鸽鸟看”计划

2009年4月1日,总部位于美国加州山寨城(Mountain Village)的谷歌公司正式推出“谷鸽鸟看”计划。该计划旨在利用装备了 CADIE 芯片和软体,并被赋予了超智能信息处理能力的“谷鸽”,动态采集、整理和分享山寨信息,打造全球最大的山寨信息网。简言之,“谷鸽鸟看”计划的使命是:

鸟看全球信息,使人人皆可山寨并从中受益!

什么是“谷鸽鸟看”技术?

谷鸽

谷歌为专门训练的 31415926 只谷鸽装备了以 CADIE 计算技术为支撑的四大高科技系统:

  • 智能导航帽:在鸽子头脑清醒的时候提供 0.01 米精度全球定位导航信息,并在鸽子头昏脑胀的时候使用绿色纳米微波为鸽子进行泰式头部按摩。
  • 信息处理肚兜:纯棉工艺,不仅仅为了保暖,同时也是 CADIE 谷鸽版芯片和软体的运行平台。
  • 无线充电脚环:当谷鸽飞临专用太阳能充电站时,脚环在五分钟内以无线方式为谷鸽充满 16 小时工作电力。
  • 呼叫应答器:支持 2G/3G 网络协议,响应用户召唤,真正实现“思想有多远,山寨就有多远”的山寨主义理想。

类似谷歌街景(Street View)的采集技术,谷歌倾心打造的超智能谷鸽被赋予外出采集山寨信息的重要使命。这一方面可以大幅提高谷歌地球(Google Earth)和谷歌地图(Google Maps)的图像分辨率,另一方面也可以弥补网页搜索中山寨信息含量明显偏低的缺憾,实现搜索山寨化,山寨信息化,信息无废话。

天涯何处不山寨,就看谁的动作快!利用飞得高、看得远、耳朵灵、眼睛贼等特点,谷鸽将重点采集以下山寨信息:

  • 最具有震撼力的山寨新闻:例如,湖南某烟花厂最新研制成功无污染、无燃烧、无烟尘,适于在所有完工或未完工高层建筑安全燃放的绿色版山寨烟花的新闻。
  • 最有潜质的山寨明星:包括,上不了春晚一级的舞台,但有潜力成为网络人气偶像的型男、靓女;不懂得炒作,但却充满娱乐气质的宅男、宅女;没有出众外表,但有满腹心事的痴男、怨女……
  • 最适合山寨恋人约会的时间地点:例如,2月14日晚,多情谷下、断肠崖边的爱情烧饼屋。
  • 最有创意的山寨发明、创造:例如,能够从谷鸽音乐搜索中迅速找到可调解家庭矛盾、平息地区争端的“和平音乐编织机”。
  • 最有魅力的山寨流行语:类似2008年出现的“叉腰肌”、“囧”、 “谷鸽”等充满山寨活力的流行网络新词

由“谷鸽鸟看”技术采集的所有山寨信息将被 CADIE 集中处理并发布在谷鸽山寨搜索引擎上。新一代山寨搜索引擎将能够覆盖全球每个角落、每一时刻、每一种类型的山寨信息。网友可以使用谷歌地图提供的谷鸽飞行路线图功能查看谷鸽的飞行路线。

谷鸽飞行路线图


如何召唤谷鸽?

除谷鸽自动外出寻找山寨信息外,用户也可以主动召唤谷鸽采集身边的山寨信息。召唤方法如下:

  • 走到户外或楼顶超过20平方米的空地
  • 用支持上网功能的手机打开谷歌移动版http://g.cn/
  • 对着手机屏幕上出现的麦克风图标,使用鸽子的方法,“咕——咕——咕——”大叫三声
  • 耐心等待……

不出意外的话,谷鸽会在三十分钟内出现在您的身边。根据不完全统计,排除软件 Bug 和芯片抽风等影响因素,谷鸽响应召唤者的平均时间间隔是 21.04 分钟,响应成功率为 99.5865%。


如何保护隐私?

谷歌承诺,谷鸽只飞越公共区域,并只能像普通鸽子一样感知鸟类可感知的事物(想像一下你自己被装上翅膀,在天上飞翔时所能看到和听到的一切)。如果你不想谷鸽打扰你,可以从以下网址下载“谷鸽别烦我”的折纸挂件图样,自行制作后,站在窗前,一旦看到谷鸽,就高高挥舞“谷鸽别烦我”的折纸。祝你好运!

“谷鸽别烦我”折纸挂件图样


什么是“绿色谷鸽”计划?

谷歌计划在全国范围修建 271828 座由 CADIE 技术提供支持的太阳能充电站。 “谷鸽太阳能充电站”同时支持无线和有线两种充电模式,不但可以为佩戴“无线充电脚环”的谷鸽提供可持续能源,还可以为其他可充电电器设备提供服务。未来,经过谷鸽计划的充分检验后,这批充电站将投入民用,为手机、电动汽车、油电混合动力汽车、电动自行车等民用设备提供充电服务。

目前,谷歌正在与英国科学家积极合作,研发第二代能源供应设备——“鸽能发电”。这将是一种完全依靠谷鸽翅膀振动获得电力,并完全自给自足的供电模式。

未来,绿色、清洁、无污染的“鸽能发电”技术,及其推广版本如“狗能发电”、“马能发电”、 “麻雀能发电”等等,有望部分取代高污染、高能耗的火力发电技术,成为全球绿色能源体系的重要组成部分。


爱因斯坦也会错吗?


    前苏联的天才物理学家朗道曾经给20世纪最有名的一些物理学家打分,他的最高分的标准是:接近上帝的工作。朗道把这个荣誉只给了一个人,就是爱因斯坦。

(图1)爱因斯坦1919年在柏林的家里

    不过风云流转是常律,当爱因斯坦进入20世纪的二三十年代后,就逐渐离开了物理学界的主流,领风骚者换成了一群毛头小年轻,尽管这些年轻人仍然对爱因斯坦充满敬畏,但是在他们自己倾注热情的领域里面,开始把爱因斯坦视为僵化老人了。这是怎么回事呢?

    故事还是得从爱因斯坦开始。大家都知道爱因斯坦获得过诺贝尔物理学奖,然而诺贝尔委员会并不是因为他建立了相对论而授奖,而是因为他的另外一个也非常重要的工作,即圆满地解释了光电效应。

    所谓光电效应, 就是当用光线照射一种金属的表面时, 一定条件下会导致金属发射出阴极射线, 即电子流。爱因斯坦在1905年的一篇论文里面, 圆满地给出了理论解释。这个解释具有革命性的地方是, 爱因斯坦认为光是以量子的形式跟金属发生相互作用的, 形象地说, 就是光像子弹一样, 是一颗一颗地打在金属表面的, 而经典的电磁理论里面, 光是连续传播的电磁辐射场。
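
    用公式来表示,这就是爱因斯坦的光电方程(这里补充一个标准的表述,并非原文内容;其中 $h$ 为普朗克常数,$\nu$ 为入射光的频率,$W$ 为金属的逸出功,$E_k$ 为光电子的最大动能):

    $$ E_k = h\nu - W $$

    只有当 $h\nu > W$ 时才会有电子逸出,这正是"光像子弹一样一颗一颗打在金属表面"的定量表述。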

因此爱因斯坦可以被认为是量子理论的先驱,然而当量子力学最终建立起来之后,他却成了质疑量子力学最不遗余力的人,他的主要论战对手就是量子力学之父-玻尔。

(图2)1925年的玻尔

玻尔在20世纪10年代建立了完整的量子论,从他的理论出发能够完美地解释氢原子的光谱现象。随后在他的感召之下,以玻恩、薛定谔、海森堡、泡利等为代表的一群年轻人建立了具有颠覆性魅力的量子力学理论,他们(不包括薛定谔)把自己对于量子力学的物理认识称为“哥本哈根诠释”,奉玻尔为量子力学的精神领袖。而量子力学的这种对于物理世界的理解方式是爱因斯坦一直不能接受的。

所谓量子力学的“哥本哈根诠释”的基本出发点,是认为我们人类对于微观世界的物理过程,只能进行概率描述,如果说对于一块小石头,我们可以说出来它在某时某刻处于某个位置上,那么对于一个像电子这样的微观物体,我们就只能说它在某时某刻处于某个位置上的概率是多少。在这个意义上,我们永远无法对于一个量子对象,给出具有确切时间空间描述的轨迹。对于这样的物理理论,爱因斯坦至死都不同意,或者说,都不满意。

爱因斯坦对于量子力学的不满,其实在他的相对论里面就可以找到端倪。因为在爱因斯坦的对于物理世界的一般图像里面,这个世界是严格满足因果律的。在狭义相对论的世界里面,宇宙的一切都循着自己的世界线而前进,在世界线上面,前后的事件之间具有严格的因果关系,不存在任何的需要使用概率描述的东西;在广义相对论的世界里面,时空因为引力质量而弯曲,宇宙的一切同样遵循自己的既有轨道,同样看不到概率的影子。在爱因斯坦的世界观里面,因果律是一条不可动摇的律令,犹如康德的星空,犹如斯宾诺莎的上帝。

因此,对于无法确定电子运动轨迹的量子力学,爱因斯坦是极其不满的,认为一个电子的精确实在的运动轨迹并不是不存在,我们之所以不能精确地描述出来,那是因为量子力学还不完善,只是给出运动概率的量子力学是一种最终的完善的量子力学的暂时形式。所以他也并非是忘记了自己曾经成功地运用量子概念解释了光电效应,而是把量子这种离散图像看做是一个权宜之计。

对于玻尔来说,他所关注的不是形式上的因果律,而是人类在认识论上的地位。他认为人本身作为一个宏观物体,在认识论的意义上,是不可能获得微观世界的细节知识的,因为我们关于微观世界的物理知识,都是来自物理测量,而物理测量要能够给出为我们人类所接受的数值,一切测量结果最终都必须是一种宏观现象,这就意味着我们对于微观世界的认识,只有止步于我们所能够测量的物理量上面,而这些可测量的物理量都在本性上属于一些不可测量的物理量的概率平均值。正是沿着这个思路,海森堡迅速地建立了量子力学的理论形式,并立刻在应用当中显示了这个理论的惊人的刻画自然的能力。

因此,爱因斯坦和玻尔对于自然的理解有着尖锐的内在对立。他们只要有机会,就要为自己而辩护,对对方进行驳难,一个最有名的情节,就是连续多日,爱因斯坦施展他的善于设计理想实验的才能,不断在黑板上画出他要求玻尔“进行”的理想实验,试图导出悖论,以揭示量子力学的错误;然后玻尔苦思一番,总是能够最终让悖论无法成立,如此反复多日,爱因斯坦也没有达到目的,两人各自的信念依旧!

(图3)玻尔和爱因斯坦的经典论战1

(图4)玻尔和爱因斯坦的经典论战2

(图5)画着著名的“爱因斯坦盒子”理想实验的黑板

爱因斯坦一直到了他的晚年,还没有罢休,跟人合作提出了一个EPR实验,在当时也可以说是理想实验,可喜的是在现代条件下,EPR这种类型的实验可以实现了,然而不幸的是,迄今所有的类似实验都表明,玻尔是对的。

爱因斯坦有一句有感而发的名言,“上帝是不扔色子的。”确实,他的宏观相对论世界是没有色子的,而微观世界呢,人类迄今还没有排除掉那个色子,这种反差现在仍然反映在引力理论和量子力学的不协调上面。爱因斯坦的固执现在看来是错了,但将来呢?

2009年3月29日星期日

UltraEdit for Linux: UEX Development Update

The world's favorite Windows editor will soon be available for Linux/Mac
Co-Written by Richard Knott and Ian D. Mead

Few programs in the text/programmer's editor space offer the functionality UltraEdit does, so the goal of UEX is to offer the sum of UltraEdit's legendary functionality to the Linux/Mac user community - because a total solution should not be bound by OS.

UEX: UltraEdit for Linux

If you're a Windows user who has made the jump to Linux/Mac, UEX will seem very familiar. UEX uses all the standard (UltraEdit) hot keys that you've come to know (Ctrl-C to copy text, Ctrl-P to print, etc.)...

Perhaps two of the most popular hallmarks of UltraEdit are its extensive feature set and its famed configurability. Of course, both of these attributes are preserved in UEX, so users can enjoy the things they liked most about UltraEdit while leveraging its strengths for computing tasks on Linux/Mac platforms.

If you're exchanging text files with Windows or Mac OS systems, you'll find UEX's built-in file-type converter very handy. As you know, different operating systems encode line terminators differently. If you open a text file created in Linux in a Windows editor, you may see that the editor runs all the lines together. UEX's file-type converter handles the line endings internally so you can view the file as you expect to see it - you can convert the file permanently or leave it in its native format. Because UEX handles these formats automatically, you and your co-workers can switch effortlessly between Unix, Mac, and Windows text formats.
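
As a purely illustrative aside (this is not UEX code, just a hypothetical sketch of the underlying idea), normalizing DOS (CRLF) and old Mac (CR) line endings to Unix (LF) boils down to something like the following C++ function:

    #include <cstddef>
    #include <string>

    // Normalize line endings: "\r\n" (DOS) and "\r" (old Mac) both become "\n" (Unix).
    std::string to_unix_line_endings(const std::string& in) {
        std::string out;
        out.reserve(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (in[i] == '\r') {
                out.push_back('\n');
                if (i + 1 < in.size() && in[i + 1] == '\n') ++i;  // swallow the LF of a CRLF pair
            } else {
                out.push_back(in[i]);
            }
        }
        return out;
    }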

UltraEdit is heralded by the Windows community as the ideal text, HTML, HEX editor, advanced PHP, Perl, Java and JavaScript editor for programmers. We think delivering anything short of this to our Linux/Mac users would be a disappointment. As such, we didn't simply port UltraEdit (Windows) to Linux, rather we re-engineered the critical components to take advantage of the native Linux/Mac operating system - all this to give you the superior performance that UltraEdit is known for - on your operating system of choice.

Of course, for those of you who have become accustomed to UltraEdit on Windows, UltraEdit's legendary edit control has been re-engineered to run on Linux/OSX. The file handling and performance unique to UltraEdit (Windows) form the backbone of UEX, giving you the superior performance you expect. Besides the core of the editor, we have ported the essential features needed for users of all types: Cut, Copy, Paste, Find, Replace, Undo, Redo, Print, Projects, Macros, Scripting, code folding, workspace/file manager, output window, syntax highlighting, column/HEX editing, and so on...

Like its Windows counterpart, you can use UEX for anything you want, text editing, web development, application/software development, technical writing, system administration, and so on...

  • VI, VIM, Emacs replacement
  • Web developer
  • System administrator
  • Technical writer
  • Power user
  • Programmer/developer

Configure UEX to perform an array of specialized tasks, or simply use it as a general text editor. Load a 4GB+ file and mine the data with all the might of our Find/Find-in-Files (FIF) and, of course, Replace/Replace-in-Files. Need a quick spell check before publishing your files? Then run the integrated spell-checker... UEX is built with the single user in mind as well as the needs of a complete enterprise deployment.

As you know, many Linux editors won't run unless you install specific libraries, files, and other dependencies. Having specialized in text editing for more than 15 years, we understand that hunting down and installing specific libraries can be cumbersome and time-consuming. As such, we have taken steps to make the installation of UEX as seamless as possible, with no need to chase dependencies. To that end, we plan on providing distribution-specific packages for a seamless installation experience.

While we have done so much to date, and many are anticipating the nearing release, we are not done yet!

OSs

The initial distribution of UEX will be based on the Ubuntu platform. For Red Hat/Fedora, OpenSuse, Debian, and other Linux distribution users, a tarball will be provided so all can enjoy the editor.

Future steps will include the addition of a collection of IT tools such as integrated Telnet/SSH client, FTP/SFTP browser, and SQL/MySQL admin tools. Like its Windows counterpart, UEX will not be just a text editor, it will offer a total text editing solution...

And of course, our model for pricing has not changed. Affordability has always been a design requirement of any IDM product. The same will be true for UEX. There are few commercially available editors for Linux/Mac and the ones that offer any appreciable advanced functionality are very expensive. UEX will not follow this model.

Like its Windows counterpart, UEX will be highly affordable, come with one year of free upgrades and lifetime tech support, and be maintained by our dedicated in-house engineering division, so users can expect regular updates and new versions to keep pace with the ever-changing technological landscape.

Look for updates here in IDM Highlights each month and expect UEX's commercial release VERY soon.

On behalf of our dedicated UEX team, it's our hope that you are as excited about UEX as we are.

LaTeX bibliography

JabRef is an open source bibliography reference manager. The native file format used by JabRef is BibTeX, the standard LaTeX bibliography format. JabRef runs on the Java VM (version 1.5 or newer), and should work equally well on Windows, Linux and Mac OS X.

BibTeX is an application and a bibliography file format written by Oren Patashnik and Leslie Lamport for the LaTeX document preparation system. General information about BibTeX.

Bibliographies generated by LaTeX and BibTeX from a BibTeX file can be formatted to suit any reference list specifications through the use of different BibTeX style files. We support this initiative to build a searchable database of BibTeX style files, organized by journal names: LaTeX bibliography style database.

You can run JabRef instantly with Java Web Start: Run JabRef.

2009年3月28日星期六

Dracut -- Cross distribution initramfs infrastructure

As davej started talking about a few months ago at Kernel Summit and LPC, there's a lot of duplication between distros on the tools used to generate the initramfs as well as the contents and how the initramfs works. Ultimately, there's little reason for this not to be something that is shared and worked on by everyone. Added to this is the fact that everyone's infrastructures for this have grown up over a long-ish period of time without significant amounts of reworking for the way that the kernel and early boot work these days. Therefore I've started on a new project, dracut, to try to be a new initramfs tool that can be used across various distributions.

From the README: Unlike existing initramfs's, this is an attempt at having as little as possible hard-coded into the initramfs. The initramfs has (basically) one purpose in life -- getting the rootfs mounted so that we can transition to the real rootfs. This is all driven off of device availability. Therefore, instead of scripts hard-coded to do various things, we depend on udev to create device nodes for us, and then when we have the rootfs's device node, we mount and carry on. This helps to keep the time required in the initramfs as short as possible so that things like a 5-second boot aren't made impossible as a result of the very existence of an initramfs. It's likely that we'll grow some hooks for running arbitrary commands in the flow of the script, but it's worth trying to resist the urge as much as we can, as hooks are guaranteed to be the path to slow-down.

Also, there is an attempt to keep things as distribution-agnostic as possible. Every distribution has its own tool here and it's not something which is really interesting to have separate across them. So contributions to help decrease the distro-dependencies are welcome.

The git tree can be found at git://fedorapeople.org/~katzj/dracut.git for now. See the TODO file for things which still need to be done and HACKING for some instructions on how to get started using the tool. There is also a mailing list that is being used for the discussion -- initramfs@vger.kernel.org.

Currently, there are a few Fedora-isms which have crept in just as a result of it being the shortest path to solving some problems, but I'm actively trying to get those out sooner rather than later, as well as getting to where I'm using it to boot my laptop.

Comments and discussion welcome. -- Jeremy

Dracut looks to replace the initramfs patchwork

Creating initramfs images, for use by the kernel at "early boot" time, is a rather messy business. It is made more so by the fact that each individual distribution has its own tools to build the image, as well as its own set of tools inside it. At the 2008 Kernel Summit, Dave Jones spent some time discussing the problem along with his idea to start over by creating a cross-distribution initramfs. That has led to the Dracut project, which was announced by Jeremy Katz in December, and a new mailing list, aptly named "initramfs", in which to discuss it.

An initramfs is a cpio archive of the initial filesystem that gets loaded into memory when the kernel is loaded. That filesystem needs to contain all of the drivers and tools needed to mount the real root filesystem. It isn't strictly necessary to have an initramfs; a minimal /dev along with the required drivers built into the kernel is another alternative. Distributions, though, all use an initramfs and, over time, each has come up with its own way to handle this process. Jones, Katz, and others would like to see something more standardized that gets pushed upstream into the mainline kernel so that distributions can stop fussing with the problem.

There are a number of advantages to that approach. Building an initramfs from the kernel sources would eliminate problems that users who build their own kernels sometimes run into. If a distribution's initramfs scheme falls behind the pace of kernel development in some fashion, users can find themselves unable to build a kernel+initramfs combination that will work. There is also hope that dracut will help speed up the boot process by using udev, as Katz puts it:

By instead moving to where we're basing everything off of uevents we can hopefully move away from the massive shell scripts of doom, speed up boot and also maybe get to where a more general initramfs can be built _with the kernel_ instead of per-system.

Because initramfs is so integral to the early boot process—and so difficult to debug if problems arise—there is a concern about starting over. It is not surprising, then, that there is some resistance to throwing out years of hard-earned knowledge that is embodied in the various distributions' initramfs handling, leading Maximilian Attems to ask:

btw why do we need dracut at all? your blog has vague allusion against initramfs-tools, which is much better tested and has seen the field.

beside having more features and flexibility it does not hardcode udev usage, nor bash, why should it not be considered at first!?

It is a question that is frequently asked, but one that Jones has a ready answer for:

"why not use the ubuntu one?"
"why not use the suse one?"

they all have some good and bad tradeoffs. Distro X has feature Y which no-one else does. etc.

When the project began we spent some time looking at what everyone else already does, and "lets start over and hope others participate" seemed more attractive than taking an existing one and bending it to fit.

So, the Red Hat folks, at least, are proceeding with dracut. Jones recently posted a status report on his blog that outlined what is working and what still needs to be done. Though it currently is "Fedora-centric, with a few hardcoded assumptions in there, so it'll likely fall over on other distros", fixing that is clearly high on the to-do list. The status report is an effort to get people up-to-speed so that other distributions can start trying it out. In addition, he plans to start trying it on various distributions himself.

In its current form, dracut is rather minimal. It has a script named dracut that will generate a gzipped cpio file for the initramfs image, as well as an init shell script that ends up in that image. Jones says that init "achieves quite a lot in its 119 lines": setting up device nodes, starting udev, waiting for the root device to show up and mounting it, mounting /proc and /sys, and more. If anything goes wrong during that process, init will drop to a shell that will allow diagnosis of the problem. So far, it only supports the simpler cases for the location of the root filesystem:

Currently, dracut supports root on raw disks (/dev/sda), lvm (/dev/mapper...), and mounting root by label or uuid. If you have a more esoteric rootfs setup, such as root-on-nfs, right now it'll fail horribly.

There is only one remaining barrier to getting rid of the unlamented nash, and that is a utility to do a switch_root (i.e. switch to a new root directory and start an init from there). The plan is to write a standalone utility that would be added to the util-linux package. The environment provided by the initramfs would include util-linux and bash, and would use glibc, which doesn't sit well with some embedded folks. They generally prefer a statically linked busybox environment. Kay Sievers outlines the reasons for a standard environment:

Busybox is nice as an option to be able to rescue/hack. It should definitely be provided as an optional "plugin" for people who need it. But there is no chance to depend on it by default, for the very same reason klibc, or any other libc is not an option.

Full-featured distros who make their money with support, can just not afford to support tools compiled differently from the tools in the real rootfs. SUSE used klibc for one release, and stopped doing that immediately, because you go crazy if you run into problems with bootup problems on [customer] setups you can not reproduce with the tools from the real rootfs.

There is plenty to do to make dracut into a real tool for creating initramfs images—at least ones that work on more than just Fedora—more root filesystem types need to be handled, hibernation signatures need to be recognized and handled, the udev rules need to be cleaned up, kdump images need to be supported, etc. But the overriding question is: will other distributions start working on dracut as well? If and when Jones (or others) get things at least limping along on Debian/Ubuntu and/or SUSE, will those distributions start getting on board? So far, there is not a lot of evidence of anyone other than Red Hat working on dracut.

But, the plan is to eventually submit dracut upstream to the mainline kernel, so that make initramfs works in a standard kernel tree. It would seem that many kernel hackers see the need for standardizing initramfs and eventually moving it into the kernel, as Ted Ts'o notes:

[...] So the idea that was explored was adding a common mkinitramfs with basic functionality into kernel sources, with the ability for distributions to add various "value add" enhancements if they like. This way if the kernel wants to move more functionality (for example, in the area of resuming from hibernation) out of the kernel into initramfs, it can do so without breaking the ability of older distributions from being able to use kernel.org kernels.

So IMHO, it's important not only that the distributions standardize on a single initramfs framework, but that framework get integrated into the kernel sources.

No one is very happy about losing their particular version of the tools to build an initramfs—if only because of familiarity—but a standardized solution is something whose time has come. Probably any of the existing tools could have been used as a starting point, but for political reasons, it makes sense to start anew. There is a fair amount of cruft that has built up in the existing tools as well, which folks are unlikely to miss, so there are also technical reasons to start over. It should come as no surprise that a project started by Red Hat might be somewhat Fedora-centric in its early form, but the clear intent is to make it distribution-agnostic. It would seem the right time for other distributions and constituencies (embedded for example) to get involved to help shape dracut into something useful for all. 

Interix

Interix is the name of an optional, full-featured POSIX and Unix environment subsystem for Microsoft's Windows NT-based operating systems. It is a component of the Services for Unix (SFU) releases 3.0 and 3.5 (the latter is distributed free of charge). The most recent releases of Interix, 5.2 and 6.0, are components of the Windows Server 2003 R2 and Windows Vista Enterprise and Ultimate editions under the name SUA [1] (Subsystem for Unix-based Applications).[2] Like the Microsoft POSIX subsystem in Windows NT, Interix is an implementation of an environment subsystem running atop the Windows kernel. Interix provides numerous open source utilities, much like the Cygwin project.

The complete installation of Interix includes:

  • Over 350 Unix utilities such as vi, ksh, csh, ls, cat, awk, grep, kill, etc.
  • A complete set of manual pages for utilities and APIs
  • GCC 3.3 compiler, includes, and libraries
  • A cc/c89-like wrapper for the Microsoft Visual Studio command-line C/C++ compiler
  • GNU Debugger
  • X11 client applications and libraries (no X server included, though)
  • Unix "root" capabilities (i.e., setuid files)
  • Support for pthreads, shared libraries, DSOs, job control, signals, sockets, and shared memory

The development environment includes support for C, C++ and Fortran. Threading is supported using the Pthreads model. Additional languages can be obtained (Python, Ruby, Tcl, etc.). The pkgsrc software packaging/build system was ported to work with Interix 3.5, and may work with newer versions (not yet tested).
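
As a quick, generic illustration of the Pthreads support mentioned above (this is a plain POSIX example, not anything Interix-specific; it simply assumes the usual pthread API and the -lpthread link flag are available), a minimal threaded program looks like this:

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: print the argument handed over by the main thread. */
    static void *worker(void *arg) {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        long i;
        for (i = 0; i < 4; ++i)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (i = 0; i < 4; ++i)
            pthread_join(threads[i], NULL);
        return 0;
    }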

Starting with release 5.2 the following capabilities were added:[3]

  • "Mixed mode" for linking Unix programs with Windows DLLs
  • 64-bit CPU support (in addition to 32-bit)
  • Large file system support on 64-bit systems
  • System V utilities can be optionally installed instead of the default BSD-based utilities

With release 6.0 the following new features can be expected:

  • IPv6 support
  • Updates to utilities are planned
  • MSVC debugging plug-in

Interix is also slated to be included only with Vista Ultimate and Enterprise (not other Vista editions) from the next version onward.

2009年3月27日星期五

STL之父访谈录


1995年3月,Dr.Dobb's Journal特约记者, 著名技术书籍作家Al Stevens采访了STL创始人Alexander
Stepanov. 这份访谈纪录是迄今为止对于STL发展历史的最完备介绍, 侯捷先生在他的STL有关文章里
推荐大家阅读这篇文章. 因此我将该文全文翻译如下:

Q: 您对于generic programming进行了长时间的研究, 请就此谈谈.
A: 我开始考虑有关GP的问题是在7O年代末期, 当时我注意到有些算法并不依赖于数据结构的
   特定实现,而只是依赖于该结构的几个基本的语义属性. 于是我开始研究大量不同的算法,
   结果发现大部分算法可以用这种方法从特定实现中抽象出来, 而且效率无损. 对我来说,
   效率是至关重要的, 要是一种算法抽象在实例化时会导致性能的下降, 那可不够棒.
  
   当时我认为这项研究的正确方向是创造一种编程语言. 我和我的两个朋友一起开始干起来.
   一个是现在的纽约州立大学教授Deepak Kapur, 另一个是伦塞里尔技术学院教授David Musser.
   当时我们三个在通用电器公司研究中心工作. 我们开始设计一种叫Tecton的语言. 该语言
   有一种我们称为"通用结构"的东西, 其实不过是一些形式类型和属性的集合体, 人们可以
   用它来描述算法. 例如一些数学方面的结构允许人们在其上定义一个代数操作, 精化之,
   扩充之, 做各种各样的事.
 
   虽然有很多有趣的创意, 最终该项研究没有取得任何实用成果, 因为Tecton语言是函数型
   语言. 我们信奉Backus的理念,相信自己能把编程从von Neumann风格中解放出来. 我们
   不想使用副效应, 这一点限制了我们的能力, 因为存在大量需要使用诸如"状态", "副效
   应"等观念的算法.  

   我在70年代末期在Tecton上面所认识到了一个有趣的问题: 被广泛接受的ADT观念有着根本
   性的缺陷. 人们通常认为ADT的特点是只暴露对象行为特征, 而将实现隐藏起来. 一项操作
   的复杂度被认为是与实现相关的属性, 所以抽象的时候应予忽略. 我则认识到, 在考虑一
   个(抽象)操作时, 复杂度(或者至少是一般观念上的复杂度)必须被同时考虑在内. 这一点
   现在已经成了GP的核心理念之一.

   例如一个抽象的栈stack类型,  仅仅保证你push进去的东西可以随后被pop出来是不够的,
   同样极端重要的是, 不管stack有多大, 你的push操作必须能在常数时间内完成. 如果我
   写了一个stack, 每push一次就慢一点, 那谁都不会用这个烂玩艺.

   我们是要把实现和界面分开, 但不能完全忽略复杂度. 复杂度必须是, 而且也确实是横陈
   于模块的使用者与实现者之间的不成文契约. ADT观念的引入是为了允许软件模块相互可
   替换. 但除非另一个模块的操作复杂度与这个模块类似, 否则你肯定不愿意实现这种互换.
   如果我用另外一个模块替换原来的模块, 并提供完全相同的接口和行为, 但就是复杂度不
   同, 那么用户肯定不高兴. 就算我费尽口舌介绍那些抽象实现的优点, 他肯定还是不乐意
   用. 复杂度必须被认为是接口的一部分.

   1983年左右, 我转往纽约布鲁克林技术大学任教. 开始研究的是图的算法, 主要的合作伙
   伴是现在IBM的Aaron Kershenbaum. 他在图和网络算法方面是个专家, 我使他相信高序(high
   order)的思想和GP能够应用在图的算法中. 他支持我与他合作开始把这些想法用于实际的
   网络算法. 某些图的算法太复杂了, 只进行过理论分析, 从来没有实现过. 他企图建立一个
   包含有高序的通用组件的工具箱, 这样某些算法就可以实现了. 我决定使用Lisp语言的一个
   变种Scheme语言来建立这样一个工具箱. 我们俩建立了一个巨大的库, 展示了各种编程技术.
   网络算法是首要目标. 不久当时还在通用电器的David Musser加了进来, 开发了更多的组件,
   一个非常大的库. 这个库供大学里的本科生使用, 但从未商业化. 在这项工作中, 我了解到
   副效应是很重要的, 不利用副效应, 你根本没法进行图操作. 你不能每次修改一个顶点(vertex)
   时都在图上兜圈子. 所以, 当时得到的经验是在实现通用算法时可以把高序技术和副效应结
   合起来. 副效应不总是坏的, 只有在被错误使用时才是.

   1985年夏, 我回到通用电器讲授有关高序程序设计的课程. 我展示了在构建复杂算法时这项
   技术的应用. 有一个听课的人叫陈迩, 当时是信息系统实验室的主任. 他问我是否能用Ada语
   言实现这些技术, 形成一个工业强度的库, 并表示可以提供支持. 我是个穷助教, 所以尽管我
   当时对于Ada一无所知, 我还是回答"好的". 我跟Dave Musser一起建立这个Ada库. 这是很重
   要的一个时期, 从像Scheme那样的动态类型语言(dynamically typed language)转向Ada这
   样的强类型语言, 使我认识到了强类型的重要性. 谁都知道强类型有助于纠错. 我则发现在
   Ada的通用编程中, 强类型是获取设计思想的有力工具. 它不仅是查错工具, 而且是思想工具.
   这项工作给了我对于组件空间进行正交分解的观念. 我认识到, 软件组件各自属于不同的类别.
   OOP的狂热支持者认为一切都是对象. 但我在Ada通用库的工作中认识到, 这是不对的. 二分查找
   就不是个对象, 它是个算法. 此外, 我还认识到, 通过将组件空间分解到几个不同的方向上, 我
   们可以减少组件的数量, 更重要的是, 我们可以提供一个设计产品的概念框架.

   随后, 我在贝尔实验室C++组中得到一份工作, 专事库研究. 他们问我能不能用C++做类似的事.
   我那时还不懂C++, 但当然, 我说我行. 可结果我不行, 因为1987年时, C++中还没有模板, 这玩
   艺在通用编程中是个必需品. 结果只好用继承来获取通用性, 那显然不理想.

   直到现在C++继承机制也不大用在通用编程中, 我们来说说为什么. 很多人想用继承实现数据结构
   和容器类, 结果几乎全部一败涂地. C++的继承机制及与之相关的编程风格有着戏剧性的局限. 用
   这种方式进行通用编程, 连等于判断这类的小问题都解决不了. 如果你以X类作为基类, 设计了
   一个虚函数operator==, 接受一个X类对象, 并由X派生出类Y, 那么Y的operator==是在拿Y类对象与
   X类对象做比较. 以动物为例, 定义animal类, 派生giraffe(长颈鹿)类. 定义一个成员函数
   mate(), 实现与另一个哺乳动物的交配操作, 返回一个animal对象. 现在看看你的派生类giraffe,
   它当然也有一个mate()方法, 结果一个长颈鹿同一个动物交配, 返回一个动物对象. 这成何体统?
   当然, 对于C++程序员来说, 交配函数没那么重要, 可是operator==就很重要了.

   对付这种问题, 你得使用模板. 用模板机制, 一切如愿.
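
   （下面补充一段简化的示意代码,帮助理解上面这段话:继承方式下 operator== 只能以基类为参数,派生类无法"收紧"参数类型;而模板方式下比较双方静态地就是同一类型。代码纯属示意,类名均为虚构。）

    #include <cassert>

    // 继承方式: operator== 只能接受基类引用, 比较时"看不见"派生类新增的信息
    struct Animal {
        virtual bool operator==(const Animal&) const { return true; }
        virtual ~Animal() {}
    };
    struct Giraffe : Animal {
        int neck_length;
        // 想写 bool operator==(const Giraffe&) 来覆盖基类版本是不行的: 参数类型不同, 不构成覆盖
    };

    // 模板方式: 两个参数静态地就是同一类型 T, 不存在上述问题
    template <class T>
    bool equal_value(const T& a, const T& b) { return a == b; }

    int main() {
        assert(equal_value(3, 3));   // T 被推导为 int, 直接使用 int 的 operator==
        return 0;
    }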

   尽管没有模板, 我还是搞出来一个巨大的算法库, 后来成了Unix System Laboratory Standard
   Component Library的一部分. 在Bell Lab, 我从像Andy Koenig, Bjarne Stroustrup(Andrew
   Koenig, 曾任ISO C++标准文档的项目编辑; Bjarne Stroustrup, C++之父 -- 译者)这类专家
   身上学到很多东西. 我认识到C/C++的重要, 它们的一些成功之处是不能被忽略的. 特别是我发
   现指针是个好东东. 我不是说空悬的指针, 或是指向栈的指针. 我是说指针这个一般观念. 地
   址的观念被广泛使用着. 没有指针我们就没法描述并行算法.

   我们现在来探讨一下为什么说C是一种伟大的语言. 通常人们认为C是编程利器并且获得如此成功,
   是因为UNIX是用C写的. 我不同意. 计算机的体系结构是长时间发展演变的结果, 不是哪一个聪明
   的人创造的. 事实上是广大程序员在解决实际问题的过程中提出的要求推动了那些天才提出这些
   体系. 计算机经过多次进化, 现在只需要处理字节地址索引的内存, 线性地址空间和指针. 这个
   进化结果是对于人们要求解决问题的自然反映. Dennis Ritchie天才的作品C, 正反映了演化了
   30年的计算机的最小模型. C当时并不是什么利器. 但是当计算机被用来处理各种问题时, 作为
   最小模型的C成了一种非常强大的语言, 在各个领域解决各种问题时都非常高效. 这就是C可移植
   性的奥秘, C是所有计算机的最佳抽象模型, 而且这种抽象确确实实是建立在实际的计算机, 而
   不是假想的计算机上的. 人们可以比较容易的理解C背后的机器模型, 比理解Ada和Scheme语言背
   后的机器模型要容易的多. C的成功是因为C做了正确的事, 不是因为AT&T的极力鼓吹和UNIX.

   C++的成功是因为Bjarne Stroustrup以C为出发点来改进C, 引入更多的编程技术, 但始终保持在
   C所定义的机器模型框架之内, 而不是闭门造车地自己搞出一个新的机器模型来. C的机器模型非
   常简单. 你拥有内存, 对象保存在那里面, 你又有指向连续内存空间的指针, 很好理解. C++保留
   了这个模型, 不过大大扩展了内存中对象的范畴, 毕竟C的数据类型太有限了, 它允许你建立新的
   类型结构, 但不允许你定义类型方法. 这限制了类型系统的能力. C++把C的机器模型扩展为真正
   类型系统.

   1988年我到惠普实验室从事通用库开发工作. 但实际上好几年我都是在做磁盘驱动器. 很有趣但跟
   GP毫不相关. 92年我终于回到了GP领域, 实验室主任Bill Worley建立了一个算法研究项目, 由我
   负责. 那时候C++已经有模板了. 我发现Bjarne的模板设计方案是非常天才的. 在Bell Lab时, 我参
   加过有关模板设计的几个早期的讨论, 跟Bjarne吵得很凶, 我认为C++的模板设计应该尽可能向Ada的
   通用方案看齐. 我想可能我吵得太凶了, 结果Bjarne决定坚决拒绝我的建议. 我当时就认识到在C++
   中设置模板函数的必要性了, 那时候好多人都觉得最好只有模板类. 不过我觉得一个模板函数在使用
   之前必须先显式实例化, 跟Ada似的. Bjarne死活不听我的, 他把模板函数设计成可以用重载机制来
   隐式实例化. 后来这个特别的技术在我的工作中变得至关重要, 我发现它容许我做很多在Ada中不可能
   的任务. 非常高兴Bjarne当初没听我的.

Q: 您是什么时候第一次构思STL的, 最初的目的是什么?
A: 92年那个项目建立时有8个人, 渐渐地人越来越少, 最后剩下俩, 我和李梦, 而且李小姐是这个领域的新
   手. 在她的专业研究中编译器是主要工作, 不过她接受了GP研究的想法, 并且坚信此项研究将带给软件开
   发一个大变化, 要知道那时候有这个信念的人可是寥寥无几. 没有她, 我可不敢想象我能搞定STL, 毕竟
   STL标着两个人的名字:Stepanov和Lee. 我们写了一个庞大的库, 庞大的代码量, 庞大的数据结构组件,
   函数对象, 适配器类, 等等. 可是虽然有很多代码, 却没有文档. 我们的工作被认为是一个验证性项目,
   其目的是搞清楚到底能不能在使算法尽可能通用化的前提下仍然具有很高的效率. 我们花了很多时间来
   比较, 结果发现, 我们的算法不仅最通用, 而且效率与手写代码一样高, 这种程序设计风格在性能上是
   不打折扣的! 这个库在不断成长, 但是很难说它是什么时候成为一个"项目"的. STL的诞生是好几件事情
   的机缘巧合才促成的.

Q: 什么时候, 什么原因促使您决定建议使STL成为ANSI/ISO标准C++一部分的?
A: 1993年夏, Andy Koenig跑到斯坦福来讲C++课, 我把一些有关的材料给他看, 我想他当时确实是很兴奋.
   他安排我9月到圣何塞给C++标准委员会做一个演讲. 我演讲的题目是"C++程序设计的科学", 讲得很理
   论化, 要点是存在一些C++的基本元素所必须遵循的, 有关基本操作的原则. 我举了一些例子, 比如构
   造函数, 赋值操作, 相等操作. 作为一种语言,  C++没有什么限制. 你可以用operator==()来做乘法.
   但是相等操作就应该是相等操作. 它要有自反性,  A == A; 它要有对称性, A == B 则 B == A; 它还
   要有传递性. 作为一个数学公理, 相等操作对于其他操作是基本的要素. 构造函数和相等操作之间的联
   系就有公理性的东西在里边. 你用拷贝构造函数生成了一个新对象, 那么这个对象和原来那个就应该是
   相等的. C++是没有做强行要求, 但是这是我们都必须遵守这个规则. 同样的, 赋值操作也必须产生相等
   的对象. 我展示了一些基本操作的"公理", 还讲了一点迭代子(iterator), 以及一些通用算法怎样利用迭
   代子来工作. 我觉得那是一个两小时的枯燥演讲, 但却非常受欢迎. 不过我那时并没有想把这个东西塞在
   标准里, 它毕竟是太过先进的编程技术, 大概还不适于出现在现实世界里, 恐怕那些做实际工作的人对它
   没什么兴趣.
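
   （下面把这段话里提到的几条"公理"写成几行示意性的断言代码,假设 T 是一个行为良好的值类型;这只是帮助理解的示意,并非原文内容。）

    #include <cassert>

    template <class T>
    void check_regular_axioms(const T& a, const T& b) {
        T c(a);                       // 拷贝构造
        assert(c == a);               // 公理: 拷贝构造产生相等的对象
        c = b;                        // 赋值
        assert(c == b);               // 公理: 赋值产生相等的对象
        assert(a == a);               // 自反性
        if (a == b) assert(b == a);   // 对称性
    }

    int main() {
        check_regular_axioms(1, 2);   // 以 int 为例
        return 0;
    }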

   我是在9月做这个演讲的, 直到次年(1994年)1月, 我都没往ANSI标准上动过什么脑筋. 1月6日, 我收到
   Andy Koenig的一封信(他那时是标准文档项目编辑), 信中说如果我希望STL成为标准库的一部分, 可以
   在1月25日之前提交一份建议到委员会. 我的答复是:"Andy, 你发疯了吗?", 他答复道:"不错, 是的我
   发疯了, 为什么咱们不疯一次试试看?"

   当时我们有很多代码, 但是没有文档, 更没有正式的建议书. 李小姐和我每星期工作80小时, 终于在
   期限之前写出一份正式的建议书. 当是时也, 只有Andy一个人知道可能会发生些什么. 他是唯一的支
   持者, 在那段日子里他确实提供了很多帮助. 我们把建议寄出去了, 然后就是等待. 在写建议的过程
   中我们做了很多事. 当你把一个东西写下来, 特别是想到你写的可能会成为标准, 你就会发现设计中
   的所有纰漏. 寄出标准后,我们不得不一段一段重写了库中间的代码, 以及几百个组件, 一直到3月份
   圣迭戈会议之前. 然后我们又重新修订了建议书, 因为在重新写代码的过程中, 我们又发现建议书中
   间的很多瑕疵.

Q: 您能描述一下当时委员会里的争论吗? 建议一开始是被支持呢, 还是反对?
A: 我当时无法预料会发生些什么. 我做了一个报告, 反响很好. 但当时有许多反对意见. 主要的意见是:
   这是一份庞大的建议, 而且来得太晚, 前一次会议上已经做出决议, 不再接受任何大的建议. 而这个
   东西是有史以来最大的建议, 包括了一大堆新玩艺. 投票的结果很有趣, 压倒多数的意见认为应对
   建议进行再考虑, 并把投票推迟到下次会议, 就是后来众所周知的滑铁卢会议.

   Bjarne Stroustrup成了STL的强有力支持者. 很多人都通过建议、更改和修订的方式给予了帮助。
   Bjarne干脆跑到这来跟我们一起工作了一个礼拜。Andy更是时时刻刻都在帮助我们。C++是一种复杂
   的语言,不是总能搞得清楚确切的含义的。差不多每天我都要问Andy和Bjarne C++能不能干这干那。
   我得把特殊的荣誉归于Andy, 是他提出把STL作为C++标准库的一部分;而Bjarne也成了委员会中
   STL的主要鼓吹者。其他要感谢的人还有:Mike Vilot,标准库小组的负责人; Rogue Wave公司的
   Nathan Myers(Rogue Wave是Borland C++Builder中STL方案的提供商 —— 译者),Andersen咨询公
   司的Larry Podmolik。确实有好多人要致谢。

   在圣迭戈提出的STL是用当时已有的C++写成的。后来我们被要求用新的ANSI/ISO C++语言特性重写STL,
   这些特性中有一些是尚未实现的。为了正确使用这些新的、未实现的C++特性,Bjarne和Andy花了无以计数的
   时间来帮助我们。

   人们希望容器独立于内存模式,这有点过分,因为语言本身并没有包括内存模式。所以我们得要想出
   一些机制来抽象内存模式。在STL的早期版本里,假定容器的容积可以用size_t类型来表示,迭代子
   之间的距离可以用ptrdiff_t来表示。现在我们被告知,你为什么不抽象的定义这些类型?这个要求
   比较高,连语言本身都没有抽象定义这些类型,而且C/C++数组还不能被这些类型定义所限定。我们
   发明了一个机制称作“allocator”,封装了内存模式的信息。这个机制深刻地影响了库中的每一个
   组件。你可能疑惑:内存模式和算法或者容器类接口有什么关系?如果你使用size_t这样的东西,你
   就无法使用 T* 对象,因为存在不同的指针类型(T*, T huge *, 等等)。这样你就不能使用引用,因
   为内存模式不同的话,会产生不同的引用类型。这样就会导致标准库产生庞大的分支。
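
   （下面用一段极简的示意代码说明"allocator 封装内存模式"大致是什么意思:容器不把 size_t 和 T* 写死,而是从 allocator 的嵌套类型中取得它们。这只是概念性的示意,并非当年 HP 实现或标准库的真实代码。）

    #include <cstddef>

    // 一个最朴素的 allocator: 把"指针是什么类型、大小用什么类型表示"集中封装在这里
    template <class T>
    struct simple_allocator {
        typedef T*             pointer;
        typedef const T*       const_pointer;
        typedef T&             reference;
        typedef std::size_t    size_type;
        typedef std::ptrdiff_t difference_type;

        pointer allocate(size_type n) {
            return static_cast<pointer>(::operator new(n * sizeof(T)));
        }
        void deallocate(pointer p, size_type) { ::operator delete(p); }
    };

    // 容器只引用 allocator 暴露出来的类型, 换一种内存模式只需换一个 allocator
    template <class T, class Alloc = simple_allocator<T> >
    struct simple_vector {
        typedef typename Alloc::pointer   pointer;
        typedef typename Alloc::size_type size_type;
        pointer   start;
        size_type length;
    };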

   另外一件重要的事情是我们原先的关联类型数据结构被扩展了。这比较容易一些,但是要成为标准的东
   西总是很困难的,因为我们做的东西人们要使用很多年。从容器的观点看,STL做了十分清楚的二分
   法设计。所有的容器类被分成两种:顺序的和关联的,就好像常规的内存和按内容寻址的内存一般。
   这些容器的语义十分清楚。

   当我到滑铁卢以后,Bjarne用了不少时间来安慰我不要太在意成败与否,因为虽然看上去似乎不会成功,
   但是我们毕竟做到了最好。我们试过了,所以应该坦然面对。成功的期望很低。我们估计大部分的意见
   将是反对。但是事实上,确实有一些反对意见,但不占上风。滑铁卢投票的结果让人大跌眼镜,80%赞
   成,20%反对。所有人都预期会有一场恶战,一场大论战。结果是确实有争论,但投票是压倒性的。

Q: STL对于1994年2月发行的ANSI/ISO C++工作文件中的类库有何影响?
A: STL被放进了滑铁卢会议的工作文件里。STL文档被分解成若干部分,放在了文件的不同部分中。Mike
   Vilot负责此事。我并没有过多地参与编辑工作,甚至也不是C++委员会的成员。不过每次有关STL的
   建议都由我来考虑。委员会考虑还是蛮周到的。

Q: 委员会后来又做了一些有关模板机制的改动,哪些影响到了STL?
A: 在STL被接受之前,有两个变化影响到了我们修订STL。其一是模板类增加了包含成员模板函数的能力。STL
   广泛地使用了这个特性来允许你建立各种容纳容器的容器。一个单独的构造函数就能让你建立一个能容
   纳list或其他容器的vector。还有一个模板构造函数,从迭代子构造容器对象,你可以用一对迭代子
   当作参数传给它,这对迭代子之间的元素都会被用来构造新的容器类对象。另一个STL用到的新特性是
   把模板自身当作模板参数传给模板类。这项技术被用在刚刚提到的allocator中。
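
   （下面两行小例子示意这里说的"成员模板构造函数":用一对迭代子构造容器,以及容器的容器。仅为示意。）

    #include <list>
    #include <vector>

    int main() {
        std::list<int> l;
        l.push_back(1); l.push_back(2); l.push_back(3);

        // 模板构造函数: 用一对迭代子 [first, last) 之间的元素构造一个 vector
        std::vector<int> v(l.begin(), l.end());

        // 容器的容器: 一个装有三个 list 的 vector
        std::vector< std::list<int> > vl(3, l);
        return 0;
    }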

Q: 那么STL影响了模板机制吗?
A: 在弗基山谷的会议中,Bjarne建议给模板增加一个“局部特殊化”(partial specialization)的特性。
   这个特性可以让很多算法和类效率更高,但也会带来代码体积上的问题。我跟Bjarne在这个建议上共同
   研究了一段时间,这个建议就是为了使STL更高效而提出的。我们来解释一下什么是“局部特殊化”。
   你现在有一个模板函数 swap( T&, T& ),用来交换两个参数。但是当T是某些特殊的类型参数时,你想
   做一些特殊的事情。例如对于swap( int&, int& ),你想用一种特别的操作来交换数据。这一点在没有
   局部特殊化机制的情况下是不可能的。有了局部特殊化机制,你可以声明一个模板函数如下:
  
        template <class T> void swap( vector<T>&, vector<T>& );

   这种形式给vector容器类的swap操作提供了一种特别的办法。从性能的角度讲,这是非常重要的。如果
   你用通用的形式去交换vector,会使用三个赋值操作,vector被复制三次,时间复杂度是线性的。然而,
   如果我们有一个局部特殊化的swap版本专门用来交换两个vector,你可以得到一个时间复杂度为常数的,
   非常快的操作,只要移动vector头部的两个指针就OK。这能让vector上的sort算法运行得更快。没有局
   部特殊化,让某一种特殊的vector,例如vector<int>,运行得更快的唯一办法是让程序员自己定义一个特殊
   的swap函数,这行得通,但是加重了程序员的负担。在大部分情况下,局部特殊化机制能够让算法在某
   些通用类上表现得更高效。你有最通用的swap,不那么通用的swap,更不通用的swap,完全特殊的swap
   这么一系列重载的swap,然后你使用局部特殊化,编译器会自动找到最接近的那个swap。另一个例子是
   copy。现在我们的copy就是通过迭代子一个一个地拷贝。使用模板特殊化可以定义一个模板函数:

        template <class T> T** copy( T**, T**, T** );

   这可以用memcpy高效地拷贝一系列指针来实现,因为是指针拷贝,我们可以不必担心构造对象和析构
   对象的开销。这个模板函数可以定义一次,然后供整个库使用,而且用户不必操心。我们使用局部特殊
   化处理了一些算法。这是个重要的改进,据我所知在弗基山谷会议上得到了好评,将来会成为标准的一
   部分。(后来的确成了标准的一部分 —— 译者)
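
   （帮助理解的示意:通用的 swap 需要三次赋值,对整个 vector 来说代价是线性的;而针对 vector 的版本只需交换内部指针,复杂度是常数。用今天的 C++ 写出来,函数模板靠的是重载而不是真正的"局部特殊化";下面的代码只是概念示意,并非标准库实现。）

    // 通用版本: 三次赋值, 交换两个大对象(比如整个 vector)的代价是线性的
    template <class T>
    void my_swap(T& a, T& b) {
        T tmp = a;
        a = b;
        b = tmp;
    }

    // 一个极度简化的 vector, 只保留头尾两个指针
    template <class T>
    struct my_vector {
        T* first;
        T* last;
    };

    // 针对 my_vector 的版本: 只交换头部的指针, 常数时间完成
    template <class T>
    void my_swap(my_vector<T>& a, my_vector<T>& b) {
        my_swap(a.first, b.first);
        my_swap(a.last,  b.last);
    }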

Q: 除了标准类库外,STL对那一类的应用程序来说最有用处?
A: 我希望STL能够引导大家学习一种新的编程风格:通用编程。我相信这种风格适用于任何种类的应用程
   序。这种风格就是:用最通用的方式来写算法和数据结构。这些结构所要求的语义特性应该能够被清楚
   地归类和分类,而这些归类分类的原则应该是任何对象都能满足的。理解和发展这种技术还要很长时间,
   STL不过是这个过程的起点。
 
   我们最终会对通用的组件有一个标准的分类,这些组件具有精心定义的接口和复杂度。程序员们将不必
   在微观层次上编程。你再也不用去写一个二分查找算法。就是在现在,STL也已经提供了好几个通用的
   二分查找算法,凡是能用二分查找算法的场合,都可以使用这些算法。算法所要求的前提条件很少:你
   只要在代码里使用它。我希望所有的组件都能有这么一天。我们会有一个标准的分类,人们不用再重复
   这些工作。
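
   （作为"不必再自己写二分查找"的一个小例子,下面演示 STL 现成的 binary_search 和 lower_bound 的用法;前提只是区间已按同一比较准则排好序。）

    #include <algorithm>
    #include <cassert>
    #include <vector>

    int main() {
        std::vector<int> v;
        for (int i = 0; i < 10; ++i) v.push_back(i * 2);   // 0, 2, 4, ..., 18, 已排序

        // 判断某个值是否存在
        assert(std::binary_search(v.begin(), v.end(), 8));

        // 找到第一个不小于 7 的位置
        std::vector<int>::iterator it = std::lower_bound(v.begin(), v.end(), 7);
        assert(*it == 8);
        return 0;
    }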

   这就是Douglas McIlroy的梦想,他在1969年关于“构件工厂”的那篇著名文章中所提出来的东西。STL
   就是这种“构件工厂”的一个范例。当然,还需要有主流的力量介入这种技术的发展之中,光靠研究机
   构不行,工业界应该向程序员提供组件和工具,帮助他们找到所需的组件,把组件粘合到一起,然后
   确定复杂度是否达到预期。

Q: STL没有实现一个持久化(persistent)对象容器模型。map和multimap似乎是比较好的候选者,它们可以
   把对象按索引存入持久对象数据库。您在此方向上做了什么工作吗,或者对这类实现有何评论?
A:很多人都注意到这个问题。STL没实现持久化是有理由的。STL在当时已经是能被接受的最巨大的库了。
   再大一点的话,我认为委员会肯定不会接受。当然持久化确实是一些人提出的问题。在设计STL,特别
   是设计allocator时,Bjarne认为这个封装了内存模式的组件可以用来封装持久性内存模式。Bjarne的
   这个洞察非常重要而且有趣,好几个对象数据库公司正在盯着这项技术。1994年10月我参加了Object
   Database Management Group的一个会议,我做了一个演讲。他们非常感兴趣,想让他们正在形成
   中的组件库的接口与STL一致,但不包括allocator在内。不过该集团的某些成员仔细分析了allocator
   是否能够被用来实现持久化。我希望与STL接口一致的组件对象持久化方案能在接下来的一年里出现。

Q:set,multiset,map和multimap是用红黑树实现的,您试过用其他的结构,比如B*树来实现吗?
A:我不认为B*适用于内存中的数据结构,不过当然这件事还是应该去做的。应该对许多其他的数据结构,
   比如跳表(skip list)、伸展树(splay tree)、半平衡树(half-balanced tree)等,也实现STL容器的标
   准接口。应该做这样的研究工作,因为STL提供了一个很好的框架,可以用来比较这些结构的性能。接口
   是固定的,基本的复杂度是固定的,现在我们就可以对各种数据结构进行很有意义的比较了。在数据
   结构领域里有很多人用各种各样的接口来实现不同的数据结构,我希望他们能用STL框架来把这些数据
   结构变成通用的。
   (译者注:上面所提到的各种数据结构我以为大多并非急需,而一个STL没有提供而又是真正重要的数据
     结构是哈希结构。后来在Stepanov和Matt Austern等人的SGI*STL中增补了hashset,hashmap和
     hashtable三种容器,使得这个STL实现才比较完满。众所周知,红黑树的时间复杂度为O(logN), 而理
     想hash结构为O(1)。当然,如果实现了持久化,B+树也是必须的。)

Q:有没有编译器厂商跟您一起工作来把STL集成到他们的产品中去?
A:是的,我接到了很多厂家的电话。Borland公司的Peter Becker出的力特别大。他帮助我实现了对应
   Borland编译器的所有内存模式的allocator组件。Symantec打算为他们的Macintosh编译器提供一个STL
   实现。Edison设计集团也很有帮助。我们从大多数编译器厂商都得到了帮助。
    (译者注:以目前的STL版本来看,最出色的无疑是SGI*STL和IBM STL for AS/390,所有Windows下
      的STL实现都不令人满意。根据测试数据,Windows下最好的STL运行在PIII 500MHz上的速度远远
      落后于在250MHz SGI工作站(IRIX操作系统)上运行的SGI*STL。以我个人经验,Linux也是运行STL
     的极佳平台。而在Windows的STL实现中,又以Borland C++Builder的Rogue Wave STL为最差,其效率
     甚至低于JIT执行方式下的Java2。Visual C++中的STL是著名大师P. J. Plauger的个人作品,性能较
     好,但其queue组件效率很差,慎用)

Q:STL包括了对MS-DOS的16位内存模式编译器的支持,不过当前的重点显然是在32位上线性内存模式
   (flat model)的操作系统和编译器上。您觉得这种面向内存模式的方案以后还会有效吗?
A:抛开Intel的体系结构不谈,内存模式是一个对象,封装了有关指针的信息:这个指针的整型尺寸和
   距离类型是什么,相关的引用类型是什么,等等。如果我们想利用各种内存,比如持久性内存,共享
   内存等等,抽象化的工作就非常重要了。STL的一个很漂亮的特性是整个库中唯一与机器类型相关的
   部分——代表真实指针,真实引用的组件——被封装到大约16行代码里,其他的一切,容器、算法等
   等,都与机器无关(真是牛啊!)。从移植的观点看,所有机器相关的东西,像是地址记法,指针等
   等,都被封装到一个微小的,很好理解的机制里面。这样一来,allocator对于STL而言就不是那么
   重要了,至少不像对于基本数据结构和算法的分解那么重要。


Q:ANSI/ISO C标准委员会认为像内存模式这类问题是平台相关的,没有对此做出什么具体规定。C++委员
   会会不会采取不同的态度?为什么?
A:我认为STL在内存模式这一点上跟C++标准相比是超前的。但是在C和C++之间有着显著的不同。C++有构造
   函数和new操作符来对付内存模式问题,而且它们是语言的一部分。现在看来似乎让new操作符像STL容器
   使用allocator那样来工作是很有意义的。不过现在这个问题的重要性不像STL出现之前那么显著了,因为
   在大多数场合,STL数据结构将让new失业。大部分人不再需要分配一个数组,因为STL在做这类事情上
   更为高效。要知道我对效率的迷信是无以复加的,可我在我的代码里从不使用new,汇编代码表明其效率
   比使用new时更高。随着STL的广泛使用,new会逐渐淡出江湖。而且STL永远都会记住回收内存,因为当
   一个容器,比如vector退出作用域时,它的析构函数被调用,会把容器里的所有东西都析构。你也不必
   再担心内存泄漏了。STL可以戏剧性地降低对于垃圾收集机制的需求。使用STL容器,你可以为所欲为,
   不用关心内存的管理,自有STL构造函数和析构函数来对付。
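
   （下面的小例子示意这段话:用 vector 时,内存随对象离开作用域被析构函数自动释放;手工 new[] 则必须记得 delete[],一旦忘记或中途返回就会泄漏。仅为示意。）

    #include <cstddef>
    #include <vector>

    void with_stl(std::size_t n) {
        std::vector<double> buf(n);    // 构造时分配内存
        // ... 使用 buf ...
    }                                  // 离开作用域, 析构函数自动释放, 不会泄漏

    void with_new(std::size_t n) {
        double* buf = new double[n];
        // ... 使用 buf ...
        delete[] buf;                  // 必须手工释放
    }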


Q:C++标准库子委员会正在制订标准名空间(namespace)和异常处理机制。STL类会有名空间吗,会抛出异
   常吗?
A:是的。该委员会的几个成员正在考虑这件事,他们的工作非常卓越。

Q:现在的STL跟最终作为标准的STL会有多大不同?委员会会不会干预某些变化,新的设计会不会被严格地控
   制起来?
A:多数人的意见看起来是不希望对STL做任何重要的改变。

Q:在成为标准之前,程序员们怎样获得一些STL经验?
A:他们可以从butler.hpl.hp.com/stl下载STL头文件,在Borland和IBM或其他足够强劲的编译器中使用它。
   学习这种编程技术的唯一途径是编程,看看范例,试着用这种技术来编程。

Q:您正在和P. J. Plauger合作一本STL的书。那本书的重点是什么?什么时候面世?
A:计划95年夏天面世,重点是对STL实现技术的详解,跟他那本标准C库实现和标准C++库实现的书类似。他是
   这本书的第一作者。该书可以作为STL的参考手册。我希望跟Bjarne合作另写一本书,在C++/STL背景下介绍
   语言与库的交互作用。

   好多工作都等着要做。为了STL的成功,人们需要对这种编程技术进行更多的试验性研究,更多的文章和书籍
   应该对此提供帮助。要准备开设此类课程,写一些入门指南,开发一些工具帮助人们漫游STL库。STL是一个
   框架,应该有好的工具来帮助使用这个框架。
   (译者注:他说这番话时,并没有预计到在接下来的几年里会发生什么。由于Internet的大爆炸和Java、
     VB、Delphi等语言的巨大成功,工业界的重心一下子从经典的软件工程领域转移到Internet上。再加上
     标准C++直到98年才制订,完全符合要求的编译器直到现在都还没有出现,STL并没有立刻成为人们心中的
     关注焦点。他提到的那本书也迟迟不能问世,直到前几天(2001年元旦之后),这本众人久已期盼的书
     终于问世,由P. J. Plauger, Alexander Stepanov, Meng Lee, David Musser四大高手联手奉献,
     Prentice Hall出版。不过该书主要关注的是STL的实现技术,不适用于普通程序员。

     另外就P. J. Plauger做一个简介:其人是标准C中stdio库的早期实现者之一,91年的一本关于标准
      C库的书使他名满天下。他现在是C/C++ Users Journal的主编,与Microsoft保持着良好的,甚至是
     过分亲密的关系,Visual C++中的STL和其他的一些内容就是出自他的那只生花妙笔。不过由于跟MS
     的关系已经影响到了他的中立形象,现在有不少人对他有意见。

     至于Stepanov想象中的那本与Stroustrup的书,起码目前是没听说。其实这两位都是典型的编程圣手,
     跟Ken Thompson和Dennis Ritchie是一路的,懒得亲自写书,往往做个第二作者。如果作为第一作者,
     写出来的书肯定是学院味十足,跟标准文件似的,不适合一般程序员阅读。在计算机科学领域,编程
     圣手同时又是写作高手的人是凤毛麟角,最著名的可能是外星人D. E. Knuth, C++领域里则首推前面
     提到的Andrew Koenig。可惜我们中国程序员无缘看到他的书。)

Q:通用编程跟OOP之间有什么关系?
A:一句话,通用编程是OOP基本思想的自然延续。什么是OOP的基本思想呢?把组件的实现和接口分开,并
   且让组件具有多态性。不过,两者还是有根本的不同。OOP强调在程序构造中语言要素的语法。你必须
   得继承,使用类,使用对象,对象传递消息。GP不关心你继承或是不继承,它的开端是分析产品的分类,
   有些什么种类,他们的行为如何。就是说,两件东西相等意味着什么?怎样正确地定义相等操作?不单
   单是相等操作那么简单,你往深处分析就会发现“相等”这个一般观念意味着两个对象的各个部分,或者至少
   基本部分是相等的,据此我们就可以有一个通用的相等操作。再说对象的种类。假设存在一个顺序序列
   和一组对于顺序序列的操作。那么这些操作的语义是什么?从复杂度权衡的角度看,我们应该向用户提
   供什么样的顺序序列?该种序列上存在那些操作?那种排序是我们需要的?只有对这些组件的概念型分
   类搞清楚了,我们才能提到实现的问题:使用模板、继承还是宏?使用什么语言和技术?GP的基本观点
   是把抽象的软件组件和它们的行为用标准的分类学分类,出发点就是要建造真实的、高效的和不取决于
   语言的算法和数据结构。当然最终的载体还是语言,没有语言没法编程。STL使用C++,你也可以用Ada
   来实现,用其他的语言来实现也行,结果会有所不同,但基本的东西是一样的。到处都要用到二分查找
   和排序,而这就是人们正在做的。对于容器的语义,不同的语言会带来轻微的不同。但是基本的区别很
   清楚,就在于GP所依存的语义以及语义分解。例如,我们决定需要一个组件swap,然后指出这个组件在不同的
   语言中如何工作。显然重点是语义以及语义分类。而OOP所强调的(我认为是过分强调的)是清楚地定义
   类之间的层次关系。OOP告诉了你如何建立层次关系,却没有告诉你这些关系的实质。
   (这段不太好理解,有一些术语可能要过一段时间才会有合适的中文翻译——译者)

Q:您对STL和GP的未来怎么看?
A:我刚才提到过,程序员们的梦想是拥有一个标准的组件仓库,其中的组件都具有良好的、易于理解的和标
   准的接口。为了达成这一点,GP需要有一门专门的科学来作为基础和支柱。STL在某种程度上开始了这项
   工作,它对于某些基本的组件进行了语义上的分类。我们要在这上面下更多的功夫,目标是要将软件工程
   从一种手工艺技术转化为工程学科。这需要一门对于基本概念的分类学,以及一些关于这些基本概念的定
   理,这些定理必须是容易理解和掌握的,每一个程序员即使不能很清楚的知道这些定理,也能正确地使用
   它。很多人根本不知道交换律,但只要上过学的人都知道2+5等于5+2。我希望所有的程序员都能学习一些
   基本的语义属性和基本操作:赋值意味着什么?相等意味着什么?怎样建立数据结构,等等。

   当前,C++是GP的最佳载体。我试过其他的语言,最后还是C++最理想地达成了抽象和高效的统一。但是
   我觉得可能设计出一种语言,基于C和很多C++的卓越思想,而又更适合于GP。它没有C++的一些缺陷,特别
   是不会像C++一样庞大。STL处理的东西是概念,什么是迭代子,不是类,不是类型,是概念。说得更正式
   一些,这是Bourbaki所说的结构类型(structure type),是逻辑学家所说的理念(theory),或是类型
   理论学派的人所说的种类(sort),这种东西在C++里没有语言层面上的对应物(原文是incarnation,直译
   为肉身——译者),但是可以有。你可以拥有一种语言,使用它你可以探讨概念,精化概念,最终用一种
   非常“程序化”(programmatic,直译为节目的,在这里是指符合程序员习惯的——译者)的手段把它们
   转化为类。当然确实有一些语言能处理种类(sorts),但是当你想排序(sort)时它们没什么用处。我们
   能够有一种语言,用它我们能定义叫做forward iterator(前向迭代子)的东西,在STL里这是个概念,没有
   C++对应物。然后我们可以从forward iterator中发展出bidirectional iterator(双向迭代子),再发展
   出random access iterator(随机访问迭代子)。可能设计出一种语言来大为简化GP,我完全相信该语言足够高效,其机器模型与C/C++充分
   接近。我完全相信能够设计出一种语言,一方面尽可能地靠近机器层面以达成绝对的高效,另一方面能够处
   理非常抽象化的实体。我认为该语言的抽象性能够超过C++,同时又与底层的机器之间契合得天衣无缝。我认
   为GP会影响到语言的研究方向,我们会有适于GP的实用语言。从这些话中你应该能猜出我下一步的计划。

Test Center: Slacker databases break all the old rules

Amazon SimpleDB, Apache CouchDB, Google App Engine, and Persevere, offering far greater simplicity than SQL, may have a better way of storing data for your Web app



By Peter Wayner


March 24, 2009

So you've got some data to store. In the past, the answer was simple: Hook up an official database, pour the data into it, and let the machine sort everything out for you while you spend your time writing big checks to the database manufacturer. Now things aren't so cut and dry. A fresh round of exciting new tools is tacking the two letters "db" onto a pile of code that breaks with the traditional relational model. Old database administrators call them "toys" and hint at terrible dangers to come from the follies of these young whippersnappers. The whippersnappers just tune out the warnings because the new tools are good enough and fast enough for what they need.


The non-relational upstarts are grabbing attention because they're willfully ignoring many of the rules that codify the hard lessons learned by the old database masters. The problem is that these belts-and-suspenders strictures often make it hard to create really, really big databases that suck up all of the cycles of a room full of machines. Because all Web application designers dream of building a startup that needs a really big room filled with machines to hold all of the data of all of the users, the rules need to be bent or even broken.


The first thing to go is the venerable old JOIN. College students used to dutifully work through exercises that taught them how to normalize the data by breaking the tables up into as many parts as practical. Disk space was expensive then, and a good normalization expert could really pack in the data. The problem is that JOINs are really, really slow when the data is spread out over several machines. Now that disk space is so cheap and many of the data models don't benefit as much from normalization, JOINs are easy to leave behind.

The next trick is to start using phrases like "eventual consistency." Amazon's documentation for SimpleDB includes this inexact promise: "Consistency is usually reached within seconds, but a high system load or network partition might increase this time." The new twerps really get those codgers steamed when they talk about how all of the computers in the cluster will get around to replicating the data and giving consistent answers when the machines are good and ready. For the kids, consistency is akin to cod liver oil or making your bed in the morning.

This distinction between immediate and eventual consistency is deeply philosophical and depends on how important the data happens to be. The old guard who start reaching for their heart medication at the news of these new databases are usually bank programmers who want to make sure that the accounts balance at the end of the day. After all, the bank's brilliant leaders can't turn around and "invest" the cash in subprime mortgages if there's one penny missing after a failed database transaction. At least they're not hauling the DBAs before Congress to explain where the cash went.

But many modern Web sites will sail on without a hiccup if some transaction fails. I see glitches on Facebook regularly. The world won't end if some snarky, anonymous comment on Slashdot disappears. None of these sites cares if the accounting is as good as a bank's, and they don't really need all of the power of a traditional database. (Some wags suggest that banks put the money from an Oracle license into a fund to compensate the people who actually lose money on a failed transaction from one of these newfangled data stores.)

To get an understanding of this expanding tier of non-traditional databases, I took a few out for a ride and built up some test applications with them. The field was surprisingly diverse despite the fact that the offerings are so stripped down that they really don't have more than three major commands: Insert, Update, and Delete. Some offer clustering. Some are available only as a service. Some have grand pretensions to take over the entire server stack. Some play better with AJAX tools than others. None of them is right for everyone, and all of them are completely wrong for the bankers out there. (See the sidebar, "Open source and SaaS offerings rethink the database.")

I also excluded a few interesting tools because of space or just because they were slightly different. Sun, for instance, is now bundling a version of a relational database called Derby with its Java VM. Oracle has its own embedded tool once known as Sleepycat's Berkeley DB but now called the Oracle Embedded Database. Some programmers are even creating very low-rent libraries that write the objects directly to the disk. One project, Prevayler, brags that all of the code from one version could fit legibly on a T-shirt. These products are also stretching the meaning of the two letters "db," but they didn't fit in this comparison.

Amazon SimpleDB
SimpleDB is one of the most advanced and most cloud-like components of Amazon's great push into cloud computing services. Once you sign up and get your secret password, you can ship off some Web service XML filled with pairs of keys and values to SimpleDB and it will store the data for you -- well, as long as you keep paying the bills shown on the meter. You don't need to think about installing anything or backing it up. Amazon hides all of that work for you behind its Web service wall.

SimpleDB comes with two levels of hierarchy on top of the piles of data pairs. The top level is the "domain" and the second level is the "item." After you choose the domain and item names, you pour in the pairs. SimpleDB's comparatively feature-rich API includes the ability to sort the data and even count the number of items that match the query. You can even write queries that exclude values that don't start with a certain string. This may not sound like much to someone who uses SQL Server or Oracle, but some of these low-rent databases can't even sort the data in the result set.

SimpleDB is meant to be used with Amazon's Simple Storage Service (S3), because each of the values in the pairs is limited to 1,024 bytes. That's enough for many strings, but it's not enough for many content engines. So you store a pointer to the data in S3. There are a few libraries like an extension of the Java Persistence Architecture that straddle the two clouds and handle this pointer juggling for you.

There are other limitations that can lead you to start doing JOIN-like things with multiple calls. Each query can only run 5 seconds. The answer can only hold 250 items. Each item can have only 250 pairs. Some people half joke about concatenating multiple values with keys like "description1," "description2," and "description3." There are many simple work-arounds for the limitations, but they start to make you wonder whether SimpleDB is supposed to make your life easier or harder.

Amazon is beginning to rewrite the APIs to push for more and better authentication. Come September 2009, calls to the SimpleDB (and a few other services) will run through SSL, providing both security and authentication. Amazon is also enhancing the signature mechanism to use more sophisticated hashing algorithms that pack together more of the request. This is just one of the ways that Amazon is slowly rolling out small improvements.

The company is also creating more libraries that make it simpler to use the service. There are dozens of packages that work with all of the major languages and some of the minor ones. The documentation is extensive. It's usually possible to start up and begin storing your data in little time.

The price is now easier to handle. There's a "free tier" of service that lets you burn up to 25 hours of computation time per month -- enough, Amazon estimates, to run a basic logging tool that processes less than 2 million requests a month. Plus, Amazon recently slashed the price for storage from $1.50 to 25 cents per gigabyte. The company appears committed to keeping the charges transparent so users will have the right incentives to structure their consumption.

Amazon has one of the more advanced terms of service. There are plenty of clauses that work through some of the problems you might encounter, and several caught my untrained eye. For instance, Amazon claims, "We may delete, without liability of any kind, any of your Amazon SimpleDB Content that has not been accessed in the previous 6 months." This may be perfectly acceptable for the people who are taking the system out for a spin with test data and not paying for it, but the phrasing suggests a bit of the omnipotence that Amazon probably feels it needs to keep its datacenter running.

There are other squishy issues. For instance, the terms of service include a long list of forbidden data, such as promoting illegal activities and discriminating on the basis of "race, sex, religion, nationality, disability, sexual orientation, or age." Imagine you're running a Web site for some church campaigning against gay marriage. That sounds like it might be dinged for discriminating against sexual orientation. But let's say you're campaigning for gay marriage by protesting these churches. Are you discriminating on the basis of religion?

I feel sorry for the lawyers who are going to parse the complaints, but at least they can rest easy knowing they can pretty much delete your data "for any reason or for no reason." Whew. If you're just using the free service, Amazon doesn't have to give you any notice, but it promises a 60-day notice if you're a paying customer. You can get your data back -- if you pay the storage charges that keep accruing.

Google App Engine
Google App Engine isn't a database per se. It's a cloud for distributing Python applications, and it comes with its own database hidden away inside. It's not really possible to access the database without going through the application layer first. But it's not hard to wrap up a database call and format the data for the request, so it might be proper to think of App Engine as a database with a layer of embedded procedures that are written in Python.

This extra layer of customizability is often quite useful. Many of the complaints about the other toy databases revolve around how a missing feature makes it impossible to find the right data. If you want to add a bit more functionality to the database here, you can whip up many of the features locally in Python. If you want a JOIN, you can synthesize one in Python and probably customize the memory cache at the same time. This is especially useful for Web applications that let users store their data in the service. If you need to add security to restrict each user to the right data, you can code that in Python too.

The App Engine data store is much more structured than Amazon's SimpleDB, and it gets much of this structure from Python's object model. You don't store key-value pairs, but Python objects, and those are defined with something that's pretty similar to an SQL schema. You can set the type of each column, make some of them required, and then ask for indexing across the columns that you'll need. The transaction mechanism is also deeply entwined with Python because each transaction is really just a Python function. This is a bit of a simplistic statement because there is a list of restrictions on what can happen inside this function (including rules such as each item can be updated only once). The good news is that the Google team is building special transaction methods that abstract away some of the common behavior (such as "Create or Update" a row).

Searching is deliberately set up to be SQL-like; in fact, Google offers its own SQL-like language, GQL, that's parsed into queries. There's also a Python-based set of methods that can be chained together to handle the data selection and querying. You don't need to waste the cycles parsing the query.

It's worth pointing out that the Python stack includes a number of features that aren't found in the best of databases. There's a library for manipulating image files by cropping and even a Google-esque "I feel lucky" function that will fix up the picture with some magic formula. If you want to e-mail someone, you can. You can also store data as Google documents, spreadsheets, and calendar items. It may seem like just a database at first, but it's easy to get sucked into the Google stack.

Until a few weeks ago, App Engine was beta and using it was free. It's still free as long as you stay within some basic quotas. After that, Google is charging with a mechanism that's pretty similar to Amazon's. The price for storage is cheaper (12 cents per gigabyte per month), but the charge for bandwidth is about the same (10 cents per gigabyte coming in.)

Google's terms of service carry a different set of responsibilities than Amazon's. You're required to formulate a privacy policy and guard the data of your users. If your users violate copyright rules, you must respond to DMCA (Digital Millennium Copyright Act) takedown notices or Google will do it for you. Google retains the right to delete any content at any time for any reason: "You agree that Google has no responsibility or liability for the deletion or failure to store any Content and other communications maintained or transmitted through use of the Service."

These terms have become more focused over the years. Google now promises to give you 90 days to get your data off of the servers if it decides to cancel your account -- something it can do "for any reason." Many of the changes I've noticed over that time seem to be focused on DMCA issues, which tie everyone in knots up and down the chain.

It's an interesting question what would happen if you decide to leave Google or Google asks you to leave. Google distributes a nice development tool that makes it easy for you to test your applications on your local machine. There's no technical reason why you couldn't host your service on your own server with these tools, except you would lose some of the cloud-like features. The data store included for testing wouldn't replicate itself automatically, but it seems to do everything else on my local machine. As always, there are some legal questions because "license is for the sole purpose of enabling you to use and enjoy the benefit of the Service."

Apache CouchDB
There's no reason why you need to work with a cloud to enjoy these new services. CouchDB is one of the many open source projects that build a simple database for storing key-value pairs. The project, written in Erlang, is supported under the aegis of the Apache Software Foundation. You can install it on any server by downloading the source files and compiling them. Then there are no more charges except paying for the server.

The CouchDB is similar to Amazon's tool, but it has some crucial differences. You still store key-value pairs as rows, but these pairs can be any of the standard JSON (JavaScript Object Notation) data types like Booleans and numbers. These values aren't limited to being 1,024-byte strings, something that makes it possible to store long values and even things like images. All of the requests and responses are formatted as JavaScript. There are no XML-based Web services, just JSON.

The biggest differences come when you're writing queries. CouchDB lets you write separate map functions and reduce functions using JavaScript. A simple query might just be a map function with a single "if" clause that tests to see whether the data is greater or less than some number. The reduce functions are only required if you're trying to compute some function across all of the data found by the map function. Counting the number of rows that are found is easy to do, and it's possible to carry off arbitrarily cool things as well, because the map function is limited only by what you can specify in JavaScript. This can be very powerful, although try as I might, I couldn't figure out any non-academic uses beyond counting the number of matches. The documentation includes one impressive reduction function that computes statistics, but I don't know if CouchDB is really the right tool for that kind of thing. If you need complex statistics, it may be better to stick with a traditional database with a traditional package for building reports with statistics.

There are still some limitations to this project. While the front page of the project calls it "a distributed, fault-tolerant, and schema-free document-oriented database," you won't get the distribution and fault-tolerance without some manual intervention. The nice AJAX interface to CouchDB includes a form that you can fill out to replicate the database. It's not automatic yet.

There are plans for an access control and security model, but these are not well-documented or even apparent in the APIs. They are designed to use pure JavaScript instead of SQL or some other language, which is a nice idea. You don't give or take away permissions to read documents; you just write a JavaScript function that returns true or false.

This approach isn't as limiting as it might seem. As I was working with these databases, I soon began to see how anyone could layer on a security model at the client with the judicious use of some encryption. Empowering the client reduces the need for much security work at the server, something I wrote about in Translucent Databases.

Observations like this are driving some of the more extreme users to push toward using CouchDB as the entire server stack. J. Chris Anderson, one of the committers on the project, wrote a fascinating piece arguing that CouchDB is all you need for an application server. The business logic for displaying and interacting with the data is written in JavaScript and downloaded from CouchDB as just another packet of JSON data.

In Anderson's eyes, there's no big reason to use Ruby, Python, Java, or PHP on the server when it can all be packaged in JavaScript. This may be a bit extreme because there will always be some business cases when the client machine can't be trusted to do the right thing, but they may be fewer than we know. Lightweight tools like CouchDB are encouraging people to rethink how much code we really need to get the job done.

Persevere
At first glance, the Persevere database looks like most of the others. You push pairs of keys and values into it, and it stores them away. But that's just the beginning. Persevere provides a well-established hierarchy for objects that makes it possible to add much more structure to your database, giving it much of the form we traditionally associate with the last generation of databases. Persevere is more of a back-end storage facility for JavaScript objects created by AJAX toolkits like Dojo, a detail that makes sense given that some of the principal developers work for SitePen, a consulting firm with a core of Dojo devotees.

Persevere is not like some of the other databases in this space that seem proud of the fact that they're "schema-free." It lets you add as much schema as you want to bring structure to your pairs. Instead of calling the top level of the hierarchy a domain (SimpleDB) or a document (CouchDB), Persevere calls them objects and even lets you create subclasses of the objects. If you want to enforce rules, you can insist that certain fields be filled with certain types, but nothing requires you to. The schema rules are optional.

The roots in the Dojo team are apparent because Dojo comes with a class, JsonRestStore, that can connect with Persevere and a number of other databases, including CouchDB. (Dojo 1.2 will also connect with Amazon's S3, though not SimpleDB, and with Google's Feed and Search APIs, though not App Engine, at least out of the box.) The "Store" is sophisticated and has some surprising facilities. When I was playing with it initially, I hadn't given the clients permission to store data directly. The tool stored the data locally as if I were offline and had no connection to the database. When I granted the correct permissions later, the changes streamed through as if I had reconnected.

Persevere gains a great deal of leverage from this tight integration with Dojo. You can create grid and tree widgets, then link them directly to the JsonRestStore; the widgets will let you edit the data. Voila! You've got remote access to a database in about 20 lines of JavaScript.
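Roughly, the wiring looks like the sketch below. It's written from memory of the Dojo 1.2-era API, so treat the module names and constructor options as assumptions rather than gospel; the "/Customer/" target and the column layout are invented.

```typescript
// Sketch: pointing a Dojo grid at a Persevere-backed JsonRestStore.
// Module paths and options follow my recollection of Dojo 1.2 and may not
// be exact; the target URL and the columns are invented for illustration.
declare const dojo: any;
declare const dojox: any;

dojo.require("dojox.data.JsonRestStore");
dojo.require("dojox.grid.DataGrid");

dojo.addOnLoad(() => {
  const store = new dojox.data.JsonRestStore({ target: "/Customer/" });

  const grid = new dojox.grid.DataGrid({
    store: store,
    structure: [
      { field: "name", name: "Name", width: "200px" },
      { field: "email", name: "Email", width: "200px" },
    ],
  }, "gridNode"); // assumes a <div id="gridNode"> somewhere in the page

  grid.startup(); // edits made in the grid flow back through the store
});
```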

I encountered a number of small glitches that were probably due more to my lack of experience than to underlying bugs. Some things just started working correctly when I figured out exactly what to do. It's not so much Persevere itself you need to master as the AJAX framework you're using in front of it. Dojo's documentation is better than that of most AJAX frameworks, but it will take some time for the documentation to catch up with the underlying complexity hidden beneath Persevere's smooth surface.

Cloud or cluster
After playing with these databases, I can understand why some people will keep using the word "toys" to describe them. They do very little, and their newness limits your options. There were a number of times when I realized that a fairly standard feature from the SQL world would make life simpler. Many of the standard SQL-based tools, like the reporting engines, can't connect with these oddities. There are a great many things that can be done with MySQL or Oracle out of the box.

But that doesn't mean that I'm not thinking of using them for one of my upcoming projects. They are solid data stores and so tightly integrated with AJAX that they make development very easy. Most Web sites don't need all of the functions of a MySQL or Oracle, and JOIN-free schemas are still pretty useful for many common data structures, including one-to-many and one-to-one relationships. Even many-to-one relationships are feasible until something needs to be changed. Given that database administrators are often denormalizing the tables to speed them up, you might say that these non-relational tools just save them a step.
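For example, a one-to-many relationship that would normally call for a JOIN can simply be folded into the parent record; the order and its line items below are invented, but they show the trade-off.

```typescript
// Sketch: a JOIN-free, denormalized document for a one-to-many relation.
// In a relational schema the line items would live in their own table and
// be JOINed back in; here they simply ride along inside the order.
const order = {
  _id: "order-1001",
  customer: "Ada Lovelace",
  placed: "2009-03-20",
  lineItems: [
    { sku: "couch-red", qty: 1, price: 499.0 },
    { sku: "lamp-brass", qty: 2, price: 35.5 },
  ],
  // A denormalized copy of the shipping address saves another lookup, at
  // the price of updating every order if the customer ever moves.
  shippingAddress: "10 Analytical Way, London",
};
```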

One of the trickier questions is whether to use a cloud or build your own cluster of machines. Both Google and Amazon offer multimachine promises that CouchDB and Persevere can't match. You've got to push the buttons yourself with CouchDB, and the Persevere team only talks about scaling in the future. But it can be hard to guess how good the promises of Amazon and Google really are. What happens if Amazon or Google loses a disk? What if they lose a rack? They still don't make explicit guarantees, and their terms of service explicitly disclaim any real responsibility.

Amazon's terms, for instance, repeat this sentiment a number of times: "We are not responsible for any unauthorized access to, alteration of, or the deletion, destruction, damage, loss or failure to store any of, Your Content (as defined in Section 10.2), your Applications, or other data which you submit or use in connection with your account or the Services."

I can't say I blame Amazon or Google, because who knows who is ultimately responsible for a lost transaction? It could be any programmer in the stack, and it would be practically impossible to decide who trashed something. But it would be nice to have more information. Is the data in SimpleDB stored on a RAID array? Is a copy kept in another geographic area unlikely to be hit by the same earthquake, hurricane, or wildfire? The online backup community is starting to offer these kinds of details, but the clouds have not been so forthcoming.

All of these considerations make it clear to me that these are still toy databases that are best suited for applications that can survive a total loss of data. They're noble experiments that do a good job of making the limitations of scale apparent to programmers by forcing them to work with a data model that does a better job of matching the hardware. They are fun, fast, and so reasonable in price that you can forget about writing big checks and concentrate on figuring out how to work around the lack of JOINs.

2009年3月25日星期三

Neural Information Processing Systems

Neural Information Processing Systems (NIPS) is a machine learning and computational neuroscience conference held every December in Vancouver, Canada. It began in 1987 as a computational cognitive science conference, and was held in Denver, Colorado until 2000.

Papers in early NIPS proceedings tended to use neural networks as a tool for understanding how the human brain works, which attracted researchers with interests in biological learning systems as well as those interested in artificial learning systems. Since then, the biological and artificial systems research streams have diverged, and recent NIPS proceedings are dominated by papers on machine learning, artificial intelligence and statistics, although computational neuroscience remains an aspect of the conference.

Besides machine learning and neuroscience, a number of other fields are represented at NIPS, including cognitive science, psychology, computer vision, statistical linguistics, and information theory. The NIPS conference has a tendency towards a very rapid turnover of "hot topics", so neural networks are now rarely seen at NIPS, having declined in popularity compared to tools such as support vector machines and Bayes nets. As a result, the "Neural" in the NIPS acronym is now something of a historical relic, and the conference covers a much wider range of topics than its name suggests.

The proceedings from the conferences have been published in book form by MIT Press and Morgan Kaufmann under the name Advances in Neural Information Processing Systems, and they are freely available on the Internet at http://www.nips.cc/.

The Open Company - Running your business as if it were an Open Source Project.

The promise of Open Source

The Open Source movement has shown that loose groups of people, each working of their own accord on whatever they feel is important or interesting, can create great software. Not only has this worked for small hobby projects, but also for huge, well-known projects such as Linux, Firefox and OpenOffice.

“beneath all this there is a titillating promise of an even more fundamental freedom”

It used to be hard to imagine that anything serious could be built without the creation of large hierarchical organizations. But if one thing has really been shown in these recent years, it is that self-organizing groups can in many cases outperform traditional organizations.

There is a lot of talk in the community about the various freedoms that open source confers. But beneath all this there is a titillating promise of an even more fundamental freedom. This is “the real freedom zero”:

The freedom to decide for yourself what you want to work on.

If you do not have this basic freedom, all the others are really irrelevant.

The central dilemma of Open Source is, and has always been, how to make a living doing it. And so far all the proposed solutions seem to have been a surrender of the individual's right to choose his own work.

Whether the idea is to create a company that offers support, or to go to work for a big company that has an interest in improving the product, you will always end up with a boss who has the final say in what you work on. Of course you might be lucky and find that it overlaps (at least for a time) with what you are passionate about, but the decision is out of your hands.

Very, very few people are in a position where someone is willing to pay them just for following their passions and doing whatever they find most rewarding. For most people (if they have even had the opportunity to find their passion), this has to be relegated to a hobby they pursue in their free time, while they make their living in a day job.

Is this really what we wish for? Working all day in more or less boring jobs to put bread on the table, and then hacking on what you are passionate about only in your precious free time, time you should really be spending with friends and family?

You could say that this must be how it is meant to be. How could it be otherwise, when the commonly understood meaning of the word "work" is doing something you don't really want to do in order to make a living? And isn't this how it is, and always has been, for everybody?

But this ignores the long history of the human race. If you look to anthropology, you will see that we spent the overwhelming part of our history in tribal bands of hunter-gatherers, where nobody really had the means to force others to work for them. Indeed, many tribal societies have been found where the whole concept of "work" is non-existent. They simply don't have a word for it.

“The mass of men lead lives of quiet desperation”

It was not until the agricultural revolution that it became possible for individuals to amass a surplus of resources, which in turn made it possible to pay (and force) others to work for them.

There is a very good case to be made that we are not evolutionarily well adapted to working for others (with others, yes, but not for others), and we only have to look around us to see that it causes a lot of misery. This is what Thoreau alluded to when he stated that "The mass of men lead lives of quiet desperation".

So this brings us back to the freedom to decide our own work. How do we make this titillating promise become reality? How do we make it possible for individuals to freely work together, just working on what they personally find important, while still making a safe living?

The Way it Ought to Be

One, not very optimal, solution is to start your own one-person company, producing and selling proprietary software (as I and many others have done). This ensures that you are the only one deciding what to do, but it also has several problems.

First of all, you are only a single person. This means that you have to do all the work, including the work you don't find interesting (though you might find it important enough to want to do it anyway). You are also yourself a liability to the company: if anything happens to you, everything stops (as happened to me when family issues brought development to a halt for several months).

Second, there is still a fundamental disrespect for your customers, who in a very real sense are also taking part in the company. They get a locked-down product which they cannot study, or modify beyond what you have explicitly provided for. And while they may be doing a lot of things that are hugely beneficial for the company (offering support on the forum, spreading the word, sending bug reports, and so on), they get no real reward for their efforts.

Fixing the product issue is fortunately quite easy: just give the users the source of the application. Then they can study it and modify it to their needs, and, if they want to, share their modifications with each other. A simple release form can have them share ownership of the changes with the company, so that the changes can be included in future versions (without making the contributors lose any rights).

The Open Company

“Totally open. No concept of bosses or employees. Anyone could join in at any time, doing whatever task they found interesting, for whatever time they found appropriate.”

The real question is how to make the users real participants in the company. There is a lot more to be done than just coding. Everything from support to design and marketing could in principle be opened up to free participation. Obviously there are some tasks where mistakes could have seriously adverse effects on the company, but that is where levels of certification would be appropriate (perhaps displayed somewhat like Stack Overflow's badges).

Imagine you had a company like this. Totally open. No concept of bosses or employees. Anyone could join in at any time, doing whatever task they found interesting, for whatever time they found appropriate. How could you possibly find a way to compensate them fairly?

The key is a technology called Trust Metrics. In essence, this is a technique for rating each other, but with the key distinction that the way the ratings are calculated makes cheating ineffective. It is a young technology that has not been applied for this purpose before, but it has already proven itself as the underlying principle behind such well-known technologies as Google's PageRank and the certifications on Advogato.

By basing compensation on continuous rating by your peers, it becomes possible to start out by just participating a bit in your free time and then gradually, as your ratings increase, spend more and more time on the project. It may eventually supplant your day job entirely and become your primary source of income, or you may choose to keep it as something you do on the side. And not only can nobody stop you from participating, nobody can fire you either. This makes it a far more secure way to make a living, one where your standing depends solely on your own ability and effort rather than on the arbitrary decisions of some superior.

“not only can nobody stop you from participating, there is nobody who can fire you either.”

You could question the fairness of being rated by your peers like this, but keep in mind that the way it is done in companies now is almost completely opaque, with some boss judging you in a more or less arbitrary manner. At least here you will have full disclosure of why and how you are being rated. It is also not completely unprecedented: there are companies, like W. L. Gore, that have used peer ratings as the sole basis for compensation for decades. But they have obviously not been open to free participation.

Making It Real

Throughout time, many people have brought up more or less utopian plans for new ways to make a living. But if they are never realized, they amount to nothing more than hot air. So to make this real, I am putting my own company (from which I currently make my living) on the line. Over the next few months I will gradually be transforming the company behind the e text editor into an Open Company.

Since this is an established company, with an accomplished product and a large user base, it has a good foundation to build on. The transformation will therefore have to be done step by step:

“Over the next few months I will gradually be transforming the company of the e text editor into an Open Company.”

1st step: Releasing the source

The source will be made available so that users can study and modify the application for their own needs. If they want to contribute their changes back, they can submit them for review. To discourage piracy, a tiny but essential core (which also contains the licensing code) will be kept private (at least until users reach a certain rating). This will gradually be followed by a similar opening of the rest of the company (web site, documentation, bug tracking, and so on).

2nd step: Building the Trust Metric

The basic infrastructure will be set up so that participants can start rating each other. The algorithms and code will be released as open source, so that they can be studied and discussed (and used by others). It will probably take quite some time and tweaking before we reach a fair balance.

3rd step: Compensating Participants

All income in the company (minus operating expenses) will be passed through the trust metric and distributed to the participants.
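As a toy sketch of just that distribution step (the hard part, a cheat-resistant trust metric, is not shown here, and all the names and numbers are invented), payouts would simply be proportional to each participant's rating:

```typescript
// Toy sketch of the payout step only: income minus operating expenses is
// split in proportion to each participant's trust-metric score. Computing
// cheat-resistant scores is the real work, and is not shown here.
function distribute(
  income: number,
  expenses: number,
  ratings: Map<string, number>, // participant -> trust score
): Map<string, number> {
  const pool = income - expenses;
  let total = 0;
  ratings.forEach((score) => { total += score; });

  const payouts = new Map<string, number>();
  ratings.forEach((score, person) => {
    payouts.set(person, total > 0 ? (pool * score) / total : 0);
  });
  return payouts;
}

// Invented example: 10,000 of income, 2,000 of expenses, three participants.
const example = distribute(10000, 2000, new Map([
  ["alice", 5.0],
  ["bob", 2.5],
  ["carol", 0.5],
]));
console.log(example); // alice 5000, bob 2500, carol 500
```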

The Future

“a future where everybody has the opportunity to find (or start) one or more open companies in alignment with their passions, and make a living doing what they love.”

Throughout the entire process, I will be blogging about the experience and the individual parts of the transformation. This is something of a grand experiment, but my hope is that it will inspire others either to join in and participate or to form their own open companies, so that even more opportunities are created.

The end goal is to make "the real freedom zero" a reality: creating a future where everybody has the opportunity to find (or start) one or more open companies in alignment with their passions, and make a living doing what they love.

If you want to participate in this, join us on the forum, and help us shape the future.