"Sound Engine Roundup" (Part 2: 3D APIs)

by Alexander Brandon

In our last issue we took a look at some of the engines in use today for music in interactive media. With this issue we begin to examine 3D sound APIs.

Although 3D sound and multi-channel audio are relatively new developments in PC-based audio, they have certainly generated a lot of interest from end users. Many game players are scrambling to purchase multi-speaker systems and are enjoying their favorite new titles with headphones, awash in sounds that appear to come from all corners of the room. This experience has helped to broaden awareness of 3D audio and multi-channel sound delivery formats, but it is also prompting users to ask some questions.

For example, in Chris Grigg's 1999 GDC Interactive Composition Roundtable, 3D audio was seen by many content creators as an interesting prospect, but as something that was more in the hands of programmers than of sound designers. If you are a programmer, when you look at an audio API for your game, ask yourself as well as your API provider: "How easy will it be for my sound designers to use these features in our project?" If you are a sound designer or composer, ask the same question.

In this section, we offer comments from two providers of today's most advanced 3D audio technology: Scott Willing from QSound Labs, and Brian Schmidt of Microsoft.


In looking over QSound's Web site and product literature, we found the following two quotes, which should serve as a decent introduction to their technology.

"QSound signal processing algorithms provide positional 3D audio effects (Q3D tm) and soundfield enhancement (QXpander tm, Q2Xpander tm) through conventional stereo (i.e. two channel) sound systems. QSound also has algorithms for more advanced applications as well."

"QSound Labs has been involved in interactive entertainment for nearly a decade. The tools and technology developed for interactive entertainment are also directly applicable or easily adapted to a wide range of applications including virtual reality, training simulators and communications."

From here, we contacted Scott Willing (scott.willing@qsound.com); what follows is a question-and-answer session in which he discusses the terminology and misconceptions of 3D and multi-channel sound, and how QSound delivers both technologies.

Q: How does 3D sound differ from "surround sound"?

A: There is often much debate over audio terminology, typically tainted by marketing angles. I'll try to give you my view and place it in the context of the other main view I've heard expressed.

Let's agree on a general goal: in the context of artificial sound (re)production using electronics and electric-to-acoustic transducers, we want to be able to synthesize the apparent spatial location of sound sources in arbitrary positions relative to the listener - in the ideal case, without restriction - in order to mimic the real world. [AB note: In my opinion, nobody has achieved the ideal case in a practical consumer technology-or even the impractical technologies I've heard. ;-) ]

There are two general methods that can be used to approach this goal:

- use more transducers, driven by more discrete audio channels in more real physical locations, or,
- apply signal-processing techniques that extend the range of effective placement provided by a given number of transducers.

As compared to the two-channel stereo case, by my convention the former approach is termed "surround" and the latter is "3D audio". To put it in painfully simple terms, surround means more channels of output; 3D means adding signal processing to the channels you have.

Note, by my definition:
- the two technologies are not mutually exclusive by any means. That is, 3D techniques can be combined with surround techniques, as indeed they are in real products.
- soundfield enhancement (stereo expansion or surround enhancement) and surround "virtualization" are included along with positional 3D technologies under the umbrella of "3D audio".

At this point it's useful to examine another view of what constitutes "3D audio". There is a school of thought that suggests the term "3D audio" should be reserved exclusively for positional audio synthesis, and specifically, for positional audio based on the binaural synthesis technique: by nature, a two-output process. There have been impassioned arguments over this contention, which in my view is really a waste of time.

First let's deal with the positional aspect. I think it's widely accepted that the term "positional" audio has become synonymous with processing individual audio streams in an effort to place them independently in arbitrary apparent spatial positions defined in terms of azimuth, elevation and range (x,y,z) with respect to the listener. I can see where some would say, "if it ain't (x,y,z) it can't be 3D". The term "positional audio" would therefore apparently exclude soundfield enhancement, which typically operates on a mix rather than individual sounds, and involves no elevation or range-related processing. By extension, soundfield enhancement is also excluded from the definition of "3D" under this convention, which is at odds with my definition.
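The (azimuth, elevation, range) and (x, y, z) descriptions Scott mentions are two views of the same position. As a minimal sketch - under an assumed right-handed convention (y up, z straight ahead, azimuth positive to the listener's right, angles in radians), which is our own illustrative choice and not any particular API's - the conversion might look like this:

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Convert a listener-relative (azimuth, elevation, range) position
// to Cartesian (x, y, z). Convention (an assumption for this sketch):
// azimuth 0 = straight ahead, elevation 0 = ear level, y points up.
Vec3 polar_to_cartesian(double azimuth, double elevation, double range) {
    Vec3 v;
    v.x = range * std::cos(elevation) * std::sin(azimuth);
    v.y = range * std::sin(elevation);
    v.z = range * std::cos(elevation) * std::cos(azimuth);
    return v;
}
```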

As long as everyone understands what the terms of reference are, IMHO it doesn't matter exactly where you draw the line. However, when people (such as vendors of 3D technology who lack soundfield enhancement in their product line-up) get on a high horse about "true 3D" I like to raise the following points:

The meaning of the term "stereophonic" has evolved in the collective consciousness to mean "two-channel". However, formed from the Greek word "stereos" (solid), it was originally intended to mean "solid sound" i.e. "sound having three-dimensional qualities". (Tongue stuck out) Bleah! So there! Even plain stereo is "3D". ;-)

The other bone I have to pick with the "true 3D" camp is that a given process that accepts a mono input and (x,y,z) control inputs doesn't necessarily succeed in convincingly placing the sound at (x,y,z). Should it earn the term "3D audio" if it isn't fully effective, for a broad range of listeners, under representative real-world conditions? (Note: If it doesn't work on my desktop, but theoretically might work, or work better, in an anechoic chamber with my head nailed to a post, I don't really care.) This is where the issue of binaural synthesis and its effectiveness should be examined, but I'll table that until later.

A further complication in defining 3D, especially as distinct from surround, is created by the substantial differences between interactive and passive audio. Generally I prefer to restrict the use of "surround" to refer to multi-channel (i.e. >2-channel) passive delivery formats and their associated gear. Dolby Surround, Dolby Digital, multi-channel MPEG etc. are intended for the delivery of a small number of streams that are separated out by a decoder and fed to multiple physical transducers. At production time, the content of these streams is pre-ordained as they are aimed at passive listening. They almost universally represent a mix of different sources, and they are intended to be directed to the listener from specific fixed physical locations.

But what then do we call the use of multiple speakers in an interactive environment? Multi-speaker 3D? (By my definition that would only apply if it involved filtering.) Interactive surround? Real-time surround? Multi-channel interactive sound, perhaps? Another debate.

In an interactive scenario, as I'm sure you're aware, it is common to abstract the actual method of sound placement in the interface used by the programmer. Microsoft's DirectSound3D API is the perfect example. Anyone can write an application that says "place this sound here" using DS3D commands.

It is up to the hardware how it attempts to achieve this. It may employ binaural synthesis and headphones, or binaural synthesis corrected for speaker playback and two speakers, or QSound's 2-speaker algorithm, or a QSound "3D+surround" multi-speaker algorithm, or it might just use "panning" (i.e. output channel volume ratios) and 600 speakers! The game is using "3D sound" in the sense that it is passing requests for sound positioning that, directly or indirectly, specify the position of various objects in (x,y,z) coordinates. The sound card that interprets this request using 600 speakers would technically not be employing 3D technology, yet it is compatible with a 3D API standard and might well be the most effective (if not the most practical) of the lot.
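The "panning" fallback Scott describes - distributing gain across output channels by volume ratios alone, with no filtering - can be sketched roughly as below. The weighting scheme (project each speaker direction onto the source direction, then normalize for constant power) and all names are our own illustrative assumptions, not any vendor's actual algorithm; the point is only that such an engine satisfies a positional (x,y,z) request without any "3D technology" in Scott's sense.

```cpp
#include <cmath>
#include <vector>

struct Speaker { double x, y, z; };  // unit direction from the listener

// Toy volume-ratio panner: given a source position and any number of
// speakers (two, four, or six hundred), return a gain per speaker.
std::vector<double> pan_gains(double sx, double sy, double sz,
                              const std::vector<Speaker>& speakers) {
    double len = std::sqrt(sx * sx + sy * sy + sz * sz);
    if (len == 0.0) len = 1.0;  // source at the listener: direction is arbitrary
    std::vector<double> g(speakers.size(), 0.0);
    double power = 0.0;
    for (std::size_t i = 0; i < speakers.size(); ++i) {
        // Weight each speaker by how closely it faces the source direction.
        double dot = (sx * speakers[i].x + sy * speakers[i].y
                      + sz * speakers[i].z) / len;
        g[i] = dot > 0.0 ? dot : 0.0;
        power += g[i] * g[i];
    }
    // Normalize for roughly constant loudness regardless of direction.
    if (power > 0.0) {
        double norm = 1.0 / std::sqrt(power);
        for (std::size_t i = 0; i < g.size(); ++i) g[i] *= norm;
    }
    return g;
}
```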

Q: If QSound uses only two speakers, how does it claim to make sounds appear behind the user?

A: First, we don't use "two speakers only" but let's consider the two-speaker case. Actually QSound is the only 3D company that doesn't make this claim. I personally consider it The Big Lie of 3D audio.

We have been mercilessly slammed by competitors because our audio preprocessing tools (the QSystem, various plug-ins, QCreator etc.) do NOT provide a full 3D user interface. At this time, they're all designed for two-speaker output, and by their nature are intended for the production of canned content. (The reason for this statement will shortly become clear.)

It has been our opinion from the get-go that 2-speaker 3D (by whatever method) is primarily effective at placing sounds in a wide arc in front of the listener which extends well past the speakers. Though this is a dramatic, valuable, and exciting improvement over plain stereo production, rear placement effects and elevation effects are weak at best. Thus we have excluded rear placement and elevation from our user interfaces in such products. If we ever produce 3D preprocessing tools that support headphone 3D or multi-channel output, this will have to change.

I have never experienced any two-speaker 3D technology that I would present to someone, for example, like Pink Floyd mixer James Guthrie, and try to convince this professional audio engineer that I can put a guitar on the wall directly behind his head or stick a violin under the chair. Yet many 3D vendors are only too happy to make this sort of claim to sound-card buyers. How do they intend to get away with this? In my opinion there are a couple of issues.

The first is that these claims rely on the comparative inexperience and lack of professional-level discriminating audio savvy of typical sound card buyers. And for a while, it worked. But though these folks may not be record mixers, they aren't all deaf or stupid. Pretty soon we started to see threads pop up in news groups with subject lines like: "3D Audio, Is it a Joke?" This really pissed us off, because the unnecessary and insupportable marketing claims made everyone in the business look bad.

The other issue is that the perceived effectiveness of any 3D audio process is dramatically extended by creative use, and in typical interactive applications, all the methods are available: Interactivity itself, supporting visual cues, preconditioning and motion (often lots of it) can all add up to a very convincing net illusion. There's nothing wrong with this per se - that's where the art comes in - but credit where credit's due, please. So, a good demo can lead to the impression of capabilities that aren't really there.

So: Turn off the video monitor, turn out the lights, take the interface away from the user and place arbitrary, unfamiliar sounds in arbitrary static locations. Ask the listener to identify their positions. An interesting story comes to light if it hasn't already.

Repeat the test with headphones. You'll find that binaural synthesis, the darling of 3D "purists", fails miserably at creating convincing placement in the front hemisphere. (Only one well-known 3D purist I know of has admitted this in print - though the same fellow once told us 3D over speakers was "impossible" but now heads the 3D audio department at a Big Company that sells A Lot of Speakers. <g>)

Add crosstalk cancellation to the same binaural filtering algorithm, thus theoretically adapting it for speaker playback, and the resulting output displays the characteristic weaknesses of speaker algorithms: front placement is fine, but elevation (which can work quite well on headphones) now sounds like tonal variations - i.e. "flavor" rather than effect, and rear placement... just isn't there.

Near the beginning of my post I told you I don't personally think anyone has come close to the ideal of completely successful arbitrary sound placement synthesis. I'll stand by that contention until proven otherwise.

[Sidebar: In an interactive scenario, one MUST provide a full 3D API to the application programmer that assumes no rendering engine limitations. You can't tell the programmer that it's impossible for the enemy battle cruiser to be behind and above the listener. In turn, the engine must do the best it can (logical behavior at least) with whatever algo's and output configuration it supports to attempt to produce an appropriate result. Another reason the API should assume no limitations is because the user has a selection of run-time rendering engines, and some are better at certain aspects of sound placement than others.]

Q: Does QSound have more than 2 speaker technology available or in development?

A: Yes. First of all, there's the case of surround formats. Our QSurround process has combined 3D with multi-channel formats for some time, both for "virtualization" (surround rendering over two speakers or headphones) and for surround enhancement (multiple physical output channels + 3D processing).

For interactive apps, there are lots of four-speaker sound cards out there now; next-gen sound cards will very typically be sporting "5.1" output features. Our first positional (Q3D) 4-channel implementations used simple panning (not 3D) but that's all changed now.

[Sidebar: I personally think that a better arrangement for interactive multi-channel audio via six outputs would be "2/2/2" rather than "3/2" + sub. (That is: 2 front / 2 side / 2 back vs L,R,C, Ls, Rs + sub) However, since people will apparently be using their PC's to watch canned surround content authored to the 5.1 format (?) I guess we're kind of stuck with that arrangement in the interest of compatibility.]

Q: In your unbiased opinion, would you consider QSound or A3D to be the more popular choice for game developers?

A: In a very real sense, it's impossible to make an easy comparison - these are apples and oranges. First, bear in mind that you have to separate the idea of the API from that of the rendering engine. (This is where my FAQ has a lot of relevant info.) Our philosophy with respect to 3D sound hardware on the PC has been that we support industry-standard, open API's. That means all sound cards employing Q3D may be controlled entirely via Microsoft's DS3D API for positional 3D, and the EAX 1.0 DS3D property set extension for reverberation. We are currently adding support for the I3DL2 guidelines, which pick up where EAX 1.x left off and, as you may know, are loosely based on EAX 2.0.

Bottom line: to support our hardware, the developer does NOT have to use a QSound API. They can use anything that supports DS3D / EAX protocols, including third-party SDKs from folks like RAD (Miles Sound System) etc. The idea of a proprietary API for hardware is ostensibly to provide support for proprietary features that an open API like DS3D doesn't address. In fact, DS3D has a mechanism called property sets for this very purpose. (Like MIDI Sysex.) The real idea of a proprietary API is to claim titles that support the API as your own. This is purely a marketing exercise.

QSound also has software development kits, and thus our own API. Why? Bear in mind that QSound was providing real-time 3D capabilities back in DOS days. (Terra Nova: Strike Force Centauri, for example.) Our QMixer API was born under Windows 3.1, believe it or not. There WAS no hardware to support, no DS3D, no Win 95... only our Q3D algorithms running in software. (First QMixer title: Zork Nemesis by Activision.)

Some developers consider 3D audio to be an important-enough component of their titles to want to ship functional software 3D that will work with any stereo sound card. We provide this in the form of the QMixer SDK. Since QMixer contains our patented 3D software engine, we charge a significant license fee for it.

I think that willingness to belly up and pay good money for a technology qualifies as a true endorsement, don't you?

You may have noticed that we have a free SDK as well, however. QMDX has the same high-level, full-featured API that QMixer does, but its internal engine only mixes to stereo in the absence of a hardware 3D accelerator. Both QMDX and QMixer use DS3D and (in the next version) EAX commands to use any manufacturer's compliant 3D hardware to render in 3D, with reverb, if the capabilities reside in the end-users' hardware. QMDX is offered for free, to underline our commitment to open API's and to assisting developers with supporting the broadest possible range of hardware.

We don't even ask developers to register with us to get QMDX. It's freely available for download. There's no logo requirement. We have no idea of how many dev's are using it.

Q: Some other 3D audio APIs, like Creative's "Environmental Audio", use multi-speaker systems to achieve surround sound. This would seem to be a more effective way of getting genuine 3D / surround sound in games, and it seems to be working with the less expensive satellite speaker systems they are distributing. Would the next step for 3D sound then be to increase the number of speakers, or to develop new technology beyond the standard two-speaker system to create 3D sound?

A: OK, let's first step back a bit. Some people think that headphones are the only way to go. I personally hate 'em. In my passive entertainment environment (living room) I would implement multi-speaker surround with QSurround enhancement, but on my PC set-up at home, multiple speakers really aren't convenient. On the PC I would stick with two speakers, and I put on headphones only when I'm starting to piss my girlfriend off - or I need to test a headphone algo!

My point here is that there is room for many opinions, and for reasons of preference or practicality, there is no perfect solution for every user. The real beauty, IMHO, of interactive audio on the PC is the fact that rendering takes place in real-time, and there is a separation between API and rendering engine. The app doesn't have to care whether the output goes to headphones or 47 speakers or a freakin' bone implant. ;-)

If you want to compete in the PC sound card market, you have to give the user all the choices you can. I'll say it again: next-gen products from virtually (no pun intended) all manufacturers will have at least four and up to six outputs, but will or should support two-speaker, four-speaker and headphone 3D as well. In any set-up of up to "5.1" channels, the addition of 3D processing to the multiple output channels definitely adds value. This is probably an area that deserves further research by all companies, as the lion's share of attention in the 3D camp has obviously been focused on the assumption of two-channel output.

(BTW, I think it's a bit funny to watch the purist camp adapt to the reality of public preference for multiple speakers. The pure binaural + crosstalk cancellation model is very much at odds with this output scenario, for one thing. Providing multi-speaker support also forces a tacit admission that two-speaker 3D isn't, at least in the public's opinion, up to the marketing claims that typically come with it. We've never played the "3D is perfect" tune, so we're totally comfortable with this situation - in fact we think it's a really healthy thing.)

If manufacturers get into a "number of speakers" race, you quickly get to the point where the additional value that might theoretically be added by 3D processing to simple "volume ratio" signal distribution amongst output channels would be minimal. However, there's no technical reason why someone couldn't produce a sound card with a horrific number of outputs. It's pretty easy; the question is how many channels of D/A, and speakers to go with them, is the public prepared to buy?

<Ed. Note: Thanks to Scott for his effort in answering these questions as fully as anyone could expect!>

For more information contact:

Scott Willing
QSound (www.qsound.com)
Suite 400, 3115-12 Street North East
Calgary, AB.
Canada T2E 7J2
403.291.2492 (voice)
403.250.1521 (fax)

Microsoft DirectSound3D

Next, we took a look at DirectSound3D. The following passage, found on Microsoft's DirectX web site (http://www.microsoft.com/directx/overview/dsound3d/default.asp), will serve as our introduction.

"The Microsoft® DirectX® audio playback services are designed to support the demanding requirements of games and interactive media applications for Microsoft Windows®. Microsoft DirectSound3D allows you to transcend the limits of ordinary left-right balancing by making it possible to position sounds anywhere in three-dimensional space. And it can be accelerated in hardware or implemented in software when 3-D hardware is not available."

We got in touch via email with Brian Schmidt, the Program Manager for both DirectSound and DirectMusic. He supplied us with the following comments:

"DirectSound3D is a neutral API. It merely specifies x,y,z locations of sounds along with being able to specify a directional sound (sound cones...for example, if you're facing away from me and talking, you'll be softer than if you were facing me and talking). It also lets you control the amount of Doppler effect and set the maximum and minimum distances for a sound. These aspects are almost 100% non-controversial.

The "DirectSound3D renderer" takes the position information from the API and creates the sound. If there is hardware available, the DS3D renderer is not used; DirectSound uses the hardware instead of doing its own processing.

DirectSound can also be extended to include processing that the hardware can do but the DirectSound3D renderer cannot. A good example of this is EAX reverb. Suppose a game uses DirectSound3D; on cards that support 3D and EAX, the hardware gives the user 3D and EAX. If the card only supports 'Frank's 3D', the user gets 'Frank's 3D'. If the card doesn't support any 3D, the user gets DirectSound's 3D. The goal is that if a user has any 3D sound card, and a game wants 3D audio, the user should hear whatever their card can do (after all, they chose that card). So the API should be neutral, and not tied to any specific piece of hardware. This is in fact the essence of all the DirectX APIs.

(As a side note: new in DirectX 7 is the ability of a programmer to choose from three different DirectSound 3D renderer algorithms: plain stereo pan + Doppler + volume, and two different HRTF algorithms licensed from Harman Industries, called 'VMAx'.)"
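The two attenuation concepts Brian names - minimum/maximum distance rolloff and directional sound cones - can be sketched roughly as below. This simplified model (inverse-distance rolloff clamped between the minimum and maximum distances, and a linear blend between the cone's inside and outside volumes) is our own illustrative approximation of the general idea, not the actual DirectSound3D implementation, and all names are ours.

```cpp
#include <cmath>

// Gain from distance: full volume inside min_dist, inverse rolloff
// beyond it (gain halves with each doubling of distance), and no
// further attenuation past max_dist.
double distance_gain(double dist, double min_dist, double max_dist) {
    if (dist < min_dist) dist = min_dist;
    if (dist > max_dist) dist = max_dist;
    return min_dist / dist;
}

// Gain from a sound cone: 1.0 inside the inner angle, outside_gain
// beyond the outer angle, linearly blended in between. 'angle' is how
// far (in radians) the listener sits off the source's facing direction,
// so a character talking away from you comes out quieter.
double cone_gain(double angle, double inner, double outer,
                 double outside_gain) {
    if (angle <= inner) return 1.0;
    if (angle >= outer) return outside_gain;
    double t = (angle - inner) / (outer - inner);
    return 1.0 + t * (outside_gain - 1.0);
}
```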

Next month, we hope to add additional information on 3D Audio Engines from other leading companies.

<Ed. Note: Thanks to Brian for his more in-depth explanation of DirectSound3D, one of the most widely used 3D audio APIs in games today.>

For more information on DirectSound3D, contact:

Brian Schmidt
DirectSound/DirectMusic Program Manager
Corporate headquarters:
One Microsoft Way
Redmond, WA 98052-6399
Telephone: (425) 882-8080