I'd like to direct your attention to the concept of user interface. It took us a while to appreciate the importance of the concept; for many years "un-interfaces" such as DOS ruled the computer world. But at long last we seem to have accepted the utility of graphical user interfaces. Now it's time to take the next step. Yes, windows and scrollbars and menus and buttons are all very well and good, but after designing with these constructs for more than a decade, I now recognize their many limitations.
Perhaps the clearest way to articulate the nature of the problem is to think about a user interface system as a language of communication between user and computer. We then ask ourselves how rich and expressive any such language is. A simple-minded way to quantify the concept is to talk about vocabulary size, setting aside less quantifiable issues of grammatical richness.
By this way of thinking, joystick videogames afford vocabularies of only five words: up, down, left, right, and fire. Yes, with cleverness you could theoretically extend this to as many as 19 distinct words -- but that was really stretching it.
The text-oriented parser interfaces such as DOS are next in vocabulary size. Theoretically, they can support infinitely large vocabularies, but in practice the vocabularies of such systems are confined by the capacity of the user's memory -- after all, the user has to remember all those commands. Thus, most such systems have vocabulary sizes of under a hundred words.
The real value of the GUIs is that they make larger vocabularies practicable. Menu structures make it unnecessary to remember all the commands, and cleverly designed icons, scrollbars, radiobuttons, checkboxes, and pushbuttons all suggest their functions in ways that extend the expressive vocabulary of the user. For example, my current word processor (MacWrite Pro, a mid-range product) supports about 150 fundamental commands, plus additional commands buried inside dialog boxes, plus font and color choices, which could range up to several hundred options themselves. This is a great deal of expressive power for a word processor whose primary selling point is a clean, easy-to-use interface.
We are already pushing the upper limits of what GUIs can do. There is still room for improvement, but in the end you can only nest submenus so deep; you can only have so many dialog boxes; you can only have so many icons.
So where do we go from here? How will we obtain larger vocabularies so that our users can have richer interactions with the computer?
Consider the nature of interactive entertainment in light of my definition of interactivity as listening, thinking, and talking. We've got the talking part down pat: why, we can do all sorts of wondrous video, sound, and animation. We're starting to get pretty good with the thinking part, too; we've seen smarter internal processing in some of the better products of the last few years.
But it's the listening part that's killing us. Our users can't say anything interesting to our programs because we just don't have a decent vocabulary to offer them. We talk about writing great works of art on the computer, about tugging at users' heartstrings and making them feel glorious feelings, but how can we emotionally engage users with vocabularies consisting of "up, down, left, right, and fire"? We need to give them vocabularies that let them involve themselves at a human level, and a few score or even a few hundred words in a vocabulary won't be adequate for this task.
One proposed solution is to use natural language; after all, we've been using natural language interfaces (in the form of text parsers) ever since Adventure.
The catch, of course, is that none of our natural language interfaces actually work. Yes, there are many impressive parsers out there that can understand an amazing amount of language. But, as with so many other subfields of AI research, we're starting to realize that getting actual working results is a lot tougher than we originally thought.
In the case of natural language, a strong case can be made that we will never crack the problem. Why not? Because natural language mirrors reality, and you can only understand natural language when you understand the reality that spawned it. A sentence of natural language is not an absolute declaration of truth independent of the context in which it takes place. All utterances in natural languages include a wealth of subtle assumptions about the universe in which they take place.
In other words, any successful natural language processor must incorporate the same knowledge of the universe that the speaker possesses. Without that knowledge, natural language cannot be understood. I think we can all agree that giving a computer the same knowledge of the universe that a typical human possesses is a task beyond even our hubris at present. Until we can solve this problem, we have to dismiss the possibility of natural language user interfaces.
But can't we somehow build a natural language parser that understands most of a language, and then have the computer just say, "Huh?" when the user gets a little too eloquent? Yes, this is certainly possible -- and in fact it is exactly the strategy that has been used in text adventures. The problem with this approach is that there are so many valid ways to express an idea in English, and so few of them can be handled by any reasonable parser, that the human ends up feeling betrayed by the parser. By communicating with the user in his natural language, we suggest that he can communicate with us in the same manner, but when we respond to half of his utterances with "Huh?", the user feels that our interface has misrepresented itself.
The user must above all feel safe, confident in his interactions with the computer. If the user does not have confidence that input X will yield result Y, then the user will be reluctant to involve himself with the interface. The arbitrariness of partial natural language parsers robs the user of the confidence necessary to interact richly.
I believe that a better answer lies in an earlier body of work which has now fallen on hard times: the artificial language movement. In particular, I think that we can dust off a 180-year-old design and use it for completely new purposes. That design is an artificial language called Solresol.
The language was designed in 1817 by Francois Sudre. It is based on the seven notes of the musical scale: do re mi fa sol la ti. The use of only seven symbols instead of the two dozen or so in conventional alphabets has profound implications. First, unlike conventional alphabets, any combination of letters is usable. Thus, even with so few letters, a large vocabulary can be built up without overly long words; a vocabulary of more than 100,000 six-letter words is possible.
Moreover, by confining the alphabet to only seven letters, we gain multimodality of expression, because other human sensory organs, not so finely attuned as the ear, can now be used with the language. Consider the many modalities of expression available to such a language: we could replace each symbol with a single letter (i.e., do re mi fa sol la ti becomes d r m f s l t) for greater conciseness. In computer terms, the whole thing maps cleanly to octal, with the zero reserved for a space. Moreover, you can use the musical notes in place of the pronounced syllables -- you can sing solresol rather than speak it. Or you can assign a color to each of the seven symbols (red, orange, yellow, green, blue, purple, and brown), thereby making it possible to denote words as sequences of colors.
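To make the octal idea concrete, here is a minimal sketch in Python. The essay only says that zero is reserved for a space; the assignment of the digits 1 through 7 to do through ti, and the names encode and decode, are assumptions made purely for illustration.

```python
# Assumed digit assignment: do=1, re=2, mi=3, fa=4, sol=5, la=6, ti=7.
# Only "zero means space" comes from the essay; the rest is hypothetical.
SYLLABLES = ["do", "re", "mi", "fa", "sol", "la", "ti"]
DIGIT = {s: str(i) for i, s in enumerate(SYLLABLES, start=1)}
SYLLABLE = {d: s for s, d in DIGIT.items()}


def encode(phrase):
    """Turn a list of words (each a list of syllables) into octal digits.

    Words are separated by the digit 0, which stands for a space.
    """
    return "0".join("".join(DIGIT[s] for s in word) for word in phrase)


def decode(digits):
    """Invert encode(): split on 0, then map each digit back to a syllable."""
    return [[SYLLABLE[d] for d in word] for word in digits.split("0")]


if __name__ == "__main__":
    phrase = [["sol", "do"], ["la"]]
    packed = encode(phrase)
    print(packed)
    print(decode(packed))
```

Under this assumed assignment every symbol, including the space, fits in three bits, which is what makes the octal mapping attractive.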
Such a language could readily be used by the handicapped. For the deaf, a truly simple finger-spelling language is possible, with three fingers reserved for punctuation and the others assigned to the syllables. The blind, of course, could sing it or hear it, but a form of braille would be much easier to design and support. Consider also the universality of this system; once the language has been defined, all users, whatever their handicap, have equal access to the language. A blind person can sing and hear in solresol almost as quickly and efficiently as a deaf person can type and read; and both have complete access to the exact same language that regular people use.
Consider the ways in which input could be handled. We could use a standard keyboard to type out the syllables or just the letters. Even better, we could have a simple hand-held device, more like a mouse than a keyboard, that contains all the buttons needed. Four finger buttons and a thumb-shift key would allow complete expression in solresol.
But that's not the only form of input. We could just as easily create a simpler form of voice recognition than we already have operational, and the problems of voice discrimination would be dramatically reduced. It should not be difficult to achieve speaker-independence with so simple a system. And any speaker having difficulty getting through to the computer could always sing out the scale (do re mi fa sol la ti do) as a means of synchronizing the computer to the peculiarities of his voice.
Now let's talk about output. The computer could speak, sing, draw, or color its output, or all of the above. A user could tell the computer to confine output to any combination of the sensory modalities.
But now we come to the really ugly problem with any such language: the vocabulary. How are we going to get people to learn a vocabulary with thousands of words such as milasolsol or tilatido? The trick, I think, is to grow the language from scratch. Remember, every combination of syllables makes sense. So why don't we start off with the seven single syllables representing some of the simplest and most fundamental interface words, like so:
These seven should not be difficult to learn, and all by themselves, they would greatly facilitate user interface. In other words, if we implemented a voice-driven command language that used just these seven words, it would be immediately useful.
But we don't have to stop there. Next, we define the 49 two-syllable words at about the level of DOS commands: format, print, quit application, switch window, and so forth. Right there you have a functioning operating system with 56 commands that addresses all the basic needs of a computer system. Next, you add the 343 three-syllable words and you've got yourself a system with more expressive power than most GUIs offer. A decade or two down the road, after you've developed a solid basis of user experience, you start using the 2,401 four-syllable words.
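The arithmetic behind these figures is just powers of seven, and it can be checked with a short Python sketch. The function names are illustrative assumptions, not part of any existing system.

```python
ALPHABET_SIZE = 7  # do re mi fa sol la ti


def words_of_length(n):
    """Number of distinct words that are exactly n syllables long."""
    return ALPHABET_SIZE ** n


def words_up_to(n):
    """Number of distinct words of 1 through n syllables combined."""
    return sum(words_of_length(k) for k in range(1, n + 1))


if __name__ == "__main__":
    # Columns: syllables per word, words of that length, running total.
    for n in range(1, 7):
        print(n, words_of_length(n), words_up_to(n))
```

The running total after two syllables is the 56 commands mentioned above, and the six-syllable row is where the count of words of a single length first passes 100,000.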
Granted, nobody is going to learn all those words from scratch. But right now computing doesn't need several thousand words of expression; it's getting along just fine with a few hundred. People can start off learning a few basic commands and expand their vocabulary only as their needs expand. Occasional users need only learn to speak two-syllable words. More frequent users might learn many of the three-syllable words. And the power users can boast of their mastery of the four-syllable words. There's no reason why this language can't continue to grow for decades, into five- and even six-syllable word vocabularies. The learning curve for this language is smooth and shallow -- and there's no ceiling.
Realize too that a solresol language of computer interface need not replace any existing interfaces; it can be tacked onto the existing GUIs. I imagine a graphic artist sitting in front of keyboard, mouse, and microphone, painting an intricate picture with the mouse, but singing out "la" when she wants to save the file, "soldo" when she wants to print it, and "mifare" when she wants to change palettes. And wouldn't it be wonderful to be able to dismiss all those idiotic "are you really sure you want to do this?" messages by singing "do" for no and "ti" for yes? Twenty years from now, that same graphic artist might be saying "domisolfa" when she wants a Gaussian smoothing applied to her image -- but she never bothers to learn that "domisolre" tells an email application to encrypt a file before transmitting it. She doesn't need to learn that stuff.
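As a rough illustration of how such a command layer might sit alongside an existing GUI, here is a small Python sketch. It assumes the recognizer (voice or keyboard) has already produced a run-together string such as "soldo". The word meanings come from the examples in this essay; the action names and the functions split_syllables and dispatch are hypothetical.

```python
SYLLABLES = ("do", "re", "mi", "fa", "sol", "la", "ti")

# Meanings taken from the examples in the text; action names are made up.
COMMANDS = {
    ("la",): "save_file",
    ("sol", "do"): "print_file",
    ("mi", "fa", "re"): "change_palette",
    ("do",): "no",
    ("ti",): "yes",
    ("do", "mi", "sol", "fa"): "gaussian_smooth",
}


def split_syllables(word):
    """Split a word such as 'mifare' into ['mi', 'fa', 're'].

    Each syllable starts with a different letter, so the split is
    unambiguous. Raises ValueError if the text is not made of syllables.
    """
    out = []
    i = 0
    while i < len(word):
        for s in SYLLABLES:
            if word.startswith(s, i):
                out.append(s)
                i += len(s)
                break
        else:
            raise ValueError(f"not a solresol word: {word!r}")
    return out


def dispatch(word):
    """Look up the action for a word; unassigned words map to 'unknown'."""
    return COMMANDS.get(tuple(split_syllables(word)), "unknown")


if __name__ == "__main__":
    for spoken in ("la", "soldo", "mifare", "tilatido"):
        print(spoken, "->", dispatch(spoken))
```

A word the application has not defined simply falls through to "unknown", which fits the idea that each user learns only the words she needs.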
Oh yes, I didn't mention the intrinsic internationality of solresol. With the Internet taking on a truly international character, isn't it time to stop burdening it with English parochialism? A solresol language could be expanded into an international form of communication, finally satisfying the dreams of those early language-designers.
Which brings us to the "Academie Francaise" objection. The French have this crazy notion that their language should be defined and controlled by a committee of "top people" who pass judgement on the "Frenchiness" of any word or expression. Anglo-Saxon barbarisms such as "hamburger" and "rock 'n roll" are replaced with suitably Frenchified terms such as "deux tout-boeuf patois en la baguette a la sesame seed" or "Elvis Presley". We Anglo-Saxons, happy in our anarchistic squalor, just keep making up new terms and linguistic fads, because this provides our lexicographers with full employment. If we're going to invent a language for human-computer interface, who will define and control the language -- some "Academie Solresol" of "top people" who impose their linguistic tastes on the computing public?
Truth be told, we've had such academies from the very beginning; the biggest is known as "Academie du Microsoft" and not only do they impose their notions of user interface on the computing community, but we pay them billions of dollars for the privilege!
This does raise the possibility that Microsoft might not do the job right. Anybody who has used Windows can imagine what havoc this company could wreak on the musical scale. Would they define an eight-note scale, just to make it cleanly octal? Would they name that eighth note "bill"? What if they decided to replace "mi" with a C# because Bill can't hit a B-note? As we have seen so many times before, the opportunities for screwing up are manifold. While the theoretical possibilities of a little language like solresol are exhilarating, the likely realities of implementation are disheartening.
Despite my cynicism, I'm very excited about the possibilities of a solresol-type language. This could be the next big jump in user interface design. Yes, we're likely to screw it up, and some greedy bastard will probably figure out a way to use our concepts to enrich himself, but that's what happened to Gutenberg, James Watt, Alexander Bell, and Alan Kay -- why should we be any different? The important thing at this point is to figure out if this idea is worth exploring. Hey, maybe I'm crazy; maybe this idea just won't work. For now, my intention is to place this idea before the community and ask if others like its possibilities.
I hereby disclaim any proprietary interest in the ideas presented in this essay. I am placing these proposals in the public domain, and waive any rights to control or derive benefit from them. Some ideas can and should be personal property, but this is not one of them. Its success requires the joint efforts of many people, and there's no obvious way to provide financial inducements for all those people. The only beneficiaries of solresol will be the users -- and, thus, the industry as a whole.
So I put the ball into your court. Is this crazy or brilliant? I'll monitor feedback and, if interest warrants it, I'll put together whatever organization, formal or informal, the community seems to desire.