Beyond Music: Audio for VR
Have you ever thought about working in audio for games or virtual reality? Did you know that movie trailers are not made by the same people who make the movies? Sound Recordist and Designer Stephan Schütze provides some background to how things are done for these fascinating aspects of the audio industry.
In over 20 years of working as a location recordist and sound designer, I have recorded and produced audio for interactive media, sound libraries, movie trailers, film, virtual reality and augmented reality. Yes, that is as fun as it sounds! In doing so, I’ve discovered that each of these formats has different needs and requires a different approach to creating the content. In this article, I’m going to share what I’ve learnt.
The most important thing when recording is to capture a great sound; we have to hunt down those rare and interesting sound opportunities like a photographer chasing the perfect sunset. Location recording is much more than just pointing a microphone at something noisy – it is about discovering a world so rich with incredible sound sources that we will never run out of things to record.
The source sound is everything, and whatever can capture that sound becomes the best tool at the time. I’ve had people laugh when they learn that I sometimes use a Zoom H1, and I chuckle back knowing that my H1 recordings are in the official Captain Marvel movie trailer. Remember, Academy Award-winning Hollywood sound designer Ben Burtt (Star Wars, Indiana Jones, etc.) did some of his best work with analogue tape and razor blades…
The core consideration is to start with clean usable sounds, and I’ve always strived to achieve this while recording. Our recording skills are critical to what we do, but the software that supports us frees our creativity. Advances in technology, such as iZotope’s RX, mean that I can record kookaburras on my porch and not worry too much about a bit of wind noise in the trees because I can remove it later.
Once we have a selection of amazing raw material for a particular project, we need to prepare it for the platform we’re working on. Each platform has its own requirements, as I’ll describe below.
Games are interactive, making them quite different to other forms of media. Interactive formats are constructed to behave much like the real world does, so there can never be a single ‘final’ mix for a game; the mix is created automatically by the game’s ‘sound engine’, in real time, in response to the player’s actions. That means you can’t just drop in a convenient stereo ambience of a forest, for example. The player is likely to be walking through that forest and can choose their own path, and that will determine the sounds that are heard and how they are mixed together. In the forest example, we need to have individual bird sounds placed in the trees that the player moves between. The stereo or surround impression the player hears is from multiple mono sounds positioned in a 3D world. So when the player walks between two trees the bird sounds will be heard on either side, and these sounds will grow louder as the player approaches them and softer as the player moves away from them. All of that real-time manipulation is done by the game’s sound engine. The audio team can establish guidelines for how loud each individual sound or group of sounds is, how they blend together and so on, but it is the actions of the player and their specific position at any one time that defines the mix they will hear. This makes balancing game audio hugely challenging because you cannot blend individual sound elements to cover gaps or weak transitions. Most of the individual sound elements are exposed to the player, and the player’s journey through the game world determines how they are blended together.
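To make the idea of a position-driven mix concrete, here is a minimal Python sketch of how two mono bird sounds might be weighted for a listener walking between them. It is an illustration only, not how any particular sound engine works: the function name `mix_for_listener`, the inverse-distance rolloff, the constant-power pan and the `min_dist`/`max_dist` values are all assumptions chosen for simplicity.

```python
import math

def mix_for_listener(listener, sources, min_dist=1.0, max_dist=50.0):
    """Return {name: (left_gain, right_gain)} for a listener facing +y.

    listener is an (x, y) position; sources maps a name to an (x, y) position.
    Hypothetical helper: real engines use configurable attenuation curves.
    """
    mix = {}
    for name, (sx, sy) in sources.items():
        dx, dy = sx - listener[0], sy - listener[1]
        dist = math.hypot(dx, dy)
        # Assumed inverse-distance rolloff: full level inside min_dist,
        # silent at or beyond max_dist.
        gain = 0.0 if dist >= max_dist else min_dist / max(dist, min_dist)
        # Left/right offset only: -1 is hard left, +1 is hard right.
        # Sources straight ahead or straight behind land in the centre.
        pan = 0.0 if dist == 0 else dx / dist
        angle = (pan + 1.0) * math.pi / 4.0  # constant-power pan law
        mix[name] = (gain * math.cos(angle), gain * math.sin(angle))
    return mix

trees = {"bird_left": (-4.0, 0.0), "bird_right": (4.0, 0.0)}
for position in [(0.0, 0.0), (3.0, 0.0)]:
    print(position, mix_for_listener(position, trees))
```

Calling the function again with a new listener position changes every gain, which is the point made above: the same set of sounds yields a different mix for every path the player takes.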
Obviously, all sounds used within a game need to be clean and very isolated. Going back to my forest example, I’ll need separate bird sounds, a separate wind sound, a separate stream sound, and separate versions of any other sound elements; they all need to be highly isolated, and placed individually. When I prepare bird sounds to use in a game, for example, I need to cut them up into individual calls and place them into the desired place within the game’s 3D world. A common tool for this purpose is Wwise, produced by Canadian company Audiokinetic. Wwise allows me to take a selection of individual bird sounds and define their behaviour so that I can create a volumetric area in which birds will twitter randomly in real time. I essentially simulate real-world bird behaviour inside the game world.
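The behaviour described above, individual calls triggered at random inside a volume, can be sketched in a few lines of Python. This is not Wwise's API or project format; `schedule_bird_calls`, the box-shaped area, the file names and the timing rule are hypothetical and only illustrate the concept.

```python
import random

def schedule_bird_calls(calls, box_min, box_max, duration, mean_gap, seed=None):
    """Return a list of (time_s, call_name, (x, y, z)) events.

    calls: names of individual, isolated bird-call files (assumed to exist).
    box_min / box_max: opposite corners of the area the birds occupy.
    mean_gap: average number of seconds between calls (assumed timing rule).
    """
    rng = random.Random(seed)
    events = []
    t = rng.uniform(0.0, mean_gap)
    while t < duration:
        # Pick a random point inside the box for this call.
        position = tuple(rng.uniform(lo, hi) for lo, hi in zip(box_min, box_max))
        events.append((t, rng.choice(calls), position))
        # Vary the gap so the calls never fall into a regular rhythm.
        t += rng.uniform(0.5 * mean_gap, 1.5 * mean_gap)
    return events

calls = ["wren_01.wav", "wren_02.wav", "wren_03.wav"]  # hypothetical file names
for event in schedule_bird_calls(calls, (0, 0, 2), (20, 20, 8), 30.0, 4.0, seed=1):
    print(event)
```

Each event is a single mono call at a single position, which is why the source recordings have to be cut into clean, isolated calls first.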
VIRTUAL & AUGMENTED REALITY
The New Reality formats are hugely different to every other platform, even games, with which they share some similarities. The key reason for this is the desire to create a realistic spherical sound field for the audience. Remember, we are placing our audience inside the virtual world; unlike a game, they are no longer just looking into that world through a screen, they are immersed within it.
The various technologies we use to achieve this attempt to simulate how we localise sound in the real world. They simulate the interaural amplitude differences, the interaural time and phase differences, and the filtering of sounds reaching us from different directions, all to create a realistic immersive sound experience. The technology is still evolving, but that’s what it aims to do. Many companies are trying to create the perfect technology solution for this functionality. Companies such as Two Big Ears were doing an excellent job of this, and that is why they were bought by Facebook – their technology now forms the basis of Oculus’ audio system. We do not yet fully understand the science of how humans triangulate sound sources, so it is no wonder that the different technology solutions vary in effectiveness.
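As a rough illustration of one of these cues, the interaural time difference can be estimated with the classic Woodworth spherical-head approximation, ITD = (a / c)(θ + sin θ). The sketch below is a textbook simplification, not the method used by any of the products mentioned; the head radius and speed of sound are assumed typical values.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed speed of sound in air at about 20 °C

def interaural_time_difference(azimuth_deg):
    """Woodworth spherical-head estimate of ITD in seconds.

    azimuth_deg: source angle from straight ahead, 0 to 90 degrees.
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("this approximation is defined here for 0..90 degrees")
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))

for azimuth in (0, 15, 45, 90):
    print(azimuth, "deg ->", round(interaural_time_difference(azimuth) * 1e6), "microseconds")
```

Even at the extreme the difference is well under a millisecond, which gives a sense of how fine the timing cues are that these systems try to reproduce.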
The important thing about all of the above is that the amplitude and frequency makeup of a sound are critical aspects of how our brains calculate where a sound is coming from. This means we cannot just boost the midrange to give a sound more presence or add a sub channel for added impact, because any enhancement like this could interfere with the directional nature of spatial positioning. Essentially, if we want a sound to be located 45 degrees to the audience’s right and 15 degrees above the horizon, any alterations we make to that sound may influence the perception of where it originates from. Elements that might add excitement to a project, such as enhanced low frequency content, need to be carefully designed to highlight the audio without working against the illusion of realism.
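The direction in that example can be written down numerically. The sketch below converts an azimuth and elevation into a unit vector, assuming a coordinate convention of x to the listener's right, y straight ahead and z up; conventions differ between tools, so treat the axes and the function name `direction_vector` as assumptions.

```python
import math

def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector for a source direction.

    Assumed convention: azimuth is measured from straight ahead towards the
    listener's right, elevation upwards from the horizon; x = right,
    y = forward, z = up.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.sin(az) * math.cos(el),  # right
        math.cos(az) * math.cos(el),  # forward
        math.sin(el),                 # up
    )

# 45 degrees to the right and 15 degrees above the horizon
print(direction_vector(45.0, 15.0))
```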
As with game audio, I start with clean and well-isolated mono files and place them into the simulated ‘world’, allowing the technology to handle distance attenuation and occlusion filtering. Then I layer in combinations of stereo, 3D, ambisonic and even binaural sound material to achieve the end result. For example, VR Regatta, the VR sailing game for Oculus Rift, uses an array of over a dozen layers that play simultaneously in real time – just to create the spherical wind sounds! The layers are all interacting in real time within the spherical sound field; in many ways, it is more like a live performance than a pre-produced product.
Much of what makes this content sound great is achieved during implementation. Virtual Reality (VR) and Augmented Reality (AR) are both going to be incredible formats in the future, and I honestly think no one has gotten close to realising their potential yet. Audio production for these formats is difficult and challenging, but it is also incredibly satisfying as we are literally creating a whole new form of media and discovering what can be effective. Good quality isolated audio is critical to the immersive effectiveness of these formats. If you plan on working in the VR, AR or 360 video space, I highly recommend you check out some of the discussion groups on Facebook that talk about the processes. The technology is changing so quickly that it can be a huge benefit to keep in touch with others working in this field, and seek their advice when necessary. We are all fairly new to these formats, and sharing the knowledge helps everyone.
The approach to creating great audio for movie trailers is very different to games, VR, AR, and, surprisingly, film. It is odd and a little counter-intuitive to discover that the trailer for a film will have a very different sound production process than the film it is promoting. For a start, it’s a completely different team that works on the trailers. Movie trailers, especially in Hollywood, are big business – there are studios dedicated entirely to producing high-impact trailers! And that is the core of it: it’s a specialised process because film trailers have 30 to 60 seconds to grab the audience’s attention and get them excited and engaged. While the visuals and dialogue for a trailer are cuts from the film, the sound and music are created specifically for the trailer.
Sound Effects (SFX) for trailers are super hyped. If the dial goes up to 10, then trailer SFX sit somewhere around 12 or 13! There is certainly an element of the loudness wars here, but it is more nuanced than just turning up the volume and compressing the hell out of everything. The sounds themselves need to simultaneously achieve two things: they must be punchy and really cut through, but they also need to stay well out of the way of the trailer’s voiceover and music. It takes many, many layers of sounds to achieve the end result, and the editors do an incredible job of blending all this together for the audience. As someone who has been recording and using my own raw material for 20 years, I know my content really well. Despite this, there are a handful of my sounds in the trailer for the latest Fast & Furious movie (‘Hobbs & Shaw’) that I cannot even recognise because of how densely the sound has been layered and mixed to get that high impact end result.
When creating trailer SFX there is a significant emphasis on mid-range content, which is often boosted to what would normally be considered silly levels. The trailer music usually occupies much of the frequency range we want to use; it dominates the very low and very high end of the spectrum, forcing us to tailor the SFX to have maximum impact between those two extremes. When preparing the sounds, I have a template where I layer my own sounds and do a lot of work to boost that mid-range frequency content. I find this tricky because the easiest way to make a sound cut through a mix is to add or boost high-frequency content so the crispness of the sound carries it through. Without being able to rely on the lovely high frequencies, I need to paint with a broad brush across the middle of the frequency spectrum. Compression is essential, but I do a lot of that work manually. If I am layering eight to ten sound files I will build a custom volume curve for each sound, and tune each one so that I hear the exact elements I want to hear at the exact time I want to hear them.
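The idea of a custom volume curve per layer can be sketched as a set of (time, gain) breakpoints applied to each layer before the layers are summed. This is a simplified illustration in plain Python, not a description of any DAW's automation system; the function names and the linear interpolation between breakpoints are assumptions.

```python
def envelope_gain(points, t):
    """Gain at time t from (time_s, gain) breakpoints sorted by time.

    Interpolates linearly between breakpoints and holds the first and
    last gain outside their range.
    """
    if t <= points[0][0]:
        return points[0][1]
    for (t0, g0), (t1, g1) in zip(points, points[1:]):
        if t <= t1:
            return g0 + (g1 - g0) * (t - t0) / (t1 - t0)
    return points[-1][1]

def mix_layers(layers, sample_rate):
    """Sum layers, each given as (samples, breakpoints), into one track."""
    length = max(len(samples) for samples, _ in layers)
    out = [0.0] * length
    for samples, points in layers:
        for i, sample in enumerate(samples):
            out[i] += sample * envelope_gain(points, i / sample_rate)
    return out

# Two toy layers at a sample rate of 4 Hz: one fades out, one is held at half level.
impact = ([1.0, 1.0, 1.0, 1.0], [(0.0, 1.0), (1.0, 0.0)])
whoosh = ([1.0, 1.0], [(0.0, 0.5)])
print(mix_layers([impact, whoosh], sample_rate=4))
```

Shaping each layer by hand like this lets one element dominate for a moment and then duck out of the way of the next, rather than leaving a compressor to decide.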
I mentioned earlier that game sound consists of many different layers that are all automatically mixed together by the game engine to provide a real-time experience. Film trailers are the polar opposite: the layers are not just hammered together, they are surgically grafted with each other to allow each element to do its best work and then move out of the way for the next element. The challenges are very different for trailers, but the intensity of the audio content you are working with really creates excitement. Working on trailers is fun but very challenging, and, of course, it’s always cool to hear your content on the big screen! Just be careful of your ears because that high intensity can be really fatiguing.
I am not going to include television as I have not worked for TV beyond commercials, and my film experience is not as significant as my game experience. Nonetheless, there are some fundamental differences between working with sound for linear media such as film, and for non-linear media such as games, and these differences influence my approach to each platform.
The main thing for me with film is to achieve and maintain sonic consistency over time. One of the key aspects of film production I have encountered is the room tone/atmos and ambience setup. It’s a little like the audio equivalent of colour grading, where you have to take all of these individual shots, from different angles and with different lighting, and try to create a consistent lighting look across all the edits. It’s the same with the room tones and atmos. You need to remove all the background content from the dialogue tracks, as it is often really different and distracting to the audience, and replace it with a nice consistent tone across each scene. It’s like applying a smooth undercoat before you start painting.
With film, I tend to build up sounds from the rear to the front. I start with long, smooth, clean edits and transitions, and use spot sound effects to provide support for actions and events. Again, all the content I am editing and preparing needs to be super clean because, unlike non-linear media that is mixed in real time, film has no limits to the number of sounds you can combine simultaneously. Layers and layers can be combined, but every single layer has the potential to contribute to the overall level of unwanted noise. Combining many layers risks becoming an utter mess of unwanted noise if the sounds are not super clean to begin with.
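The way noise builds up across layers can be estimated with a little arithmetic. Assuming the background noise in each layer is uncorrelated with the others, the noise powers add, so N layers with the same noise floor raise it by 10·log10(N) dB. The sketch below is a back-of-the-envelope estimate under that assumption, not a measurement of any real session.

```python
import math

def combined_noise_floor_db(layer_noise_db):
    """Combined noise floor in dB for layers with uncorrelated noise.

    layer_noise_db: noise floor of each layer in dB (for example dBFS).
    """
    if not layer_noise_db:
        raise ValueError("at least one layer is required")
    # Convert each level to linear power, add, and convert back to dB.
    total_power = sum(10.0 ** (level / 10.0) for level in layer_noise_db)
    return 10.0 * math.log10(total_power)

# Twenty layers, each with an assumed noise floor of -70 dB
print(round(combined_noise_floor_db([-70.0] * 20), 1))
```

Under this assumption every doubling of the layer count costs about 3 dB of noise floor, which is why a slightly noisy source that is fine on its own can become a problem in a dense mix.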
Unlike non-linear media, you have much more control in film because you can set a scene exactly how you want it to sound and it will play back the same every time. After years of working in games, I find this very different: you can push things to their limits and know they will never exceed those limits. This is why films can be dynamically more intense than games. Games get their impact from their interactive and immersive nature, not from a super tight mix.
SEEN & UNSEEN
The process of laying out tracks and building up sonic worlds allows for lots of creativity, but the underlying consideration is communicating with the audience about things seen and unseen. Audio is our principal emotional sense, and the soundtrack for any media needs to enhance and support the narrative of the dialogue and music while also transporting the audience into the worlds we create.
Each delivery format has its own challenges, strengths and weaknesses. Understanding the best way to approach each format allows us to enjoy the process of creating an audio world, rather than fighting with the content to make it fit. Never be afraid to try new things or crazy ideas. So much of the best creative work comes from folk who think and act outside the box.
We all work in sound because we could not imagine ourselves doing anything else! As someone who records raw material for sound libraries and also uses that material in sound production, I’m experienced with the entire workflow – from capture of raw sounds to delivery of finished product. This provides me with a useful perspective, and I honestly think it makes me better at both jobs. When recording raw sounds on location, I consider all the ways I might want to manipulate those sounds as part of production. I will often record sounds that are very ordinary in their raw format, but I know are going to be a great basis for sound design. Likewise, when doing sound design I am aware of the limitations of recording on location and just how hard it can be to capture really clean content. I am going to explore these concepts in more detail in future issues of AudioTechnology. Stay tuned!