AI Audio Explained (+ Bonus Resources)

October 9, 2023 · Paul Virostek

As we all know by now, AI has leapt into the public consciousness. While AI (or artificial intelligence) has long been discussed, it was only in 2022 that it became part of common culture: Stable Diffusion showcased incredible text-to-image creation in August, and in November OpenAI released ChatGPT to astounding results.

The Pole team wondered: can these advances be used for audio? How could artificial intelligence be applied to sound design, sonic creativity, and audio recording?

So, in May, Pole Position Production held an AI audio workshop. We spoke with Elias Forsblom, data scientist consultant at Sopra Steria. We walked away knowing more about AI and how it can work with audio – and excited for the future.

Let’s learn more.

What is AI?

So what is AI audio?

To start, let’s learn about AI generally. We’ll take a quick look at the types of AI and the different things it can do. Afterwards, we’ll see how AI applies to audio, what cool and exciting AI audio tools have appeared, and what’s in store for the future.

Types of AI

During the Pole workshop, Elias Forsblom explained that AI comes in a number of different forms.

Narrow AI, or “weak AI”, is limited: it can answer your questions or provide you with facts. If you’ve used Alexa, or Siri on your phone, to fetch the weather, you’ve used narrow AI.

Generative AI (genAI) is much more capable. It uses inputs (images, text, audio, etc.) together with pattern recognition to create new content. Some genAI tools chain together many smaller narrow AI models that work together to solve a task. Stable Diffusion and DALL-E take text prompts and create new pictures from existing images. (The image at the top of this post was created with AI using a text prompt.) A similar tool, Midjourney, has used text prompts and existing images to create “what every country would look like if it was a villain”, “most stereotypical person in a country”, and other photorealistic text prompt-based mashups such as Mortal Kombat and celebrities:

Artificial General Intelligence (AGI) is far beyond either of these. AGI is Strong AI that would be capable of reason, planning, learning, and much more. It’s a hypothetical concept for now.

What Can AI Do?

To help us understand how AI works, Forsblom broke down some broad categories of what AI can do.

AI powers recommendation systems based on customer data, as with YouTube, Spotify, and Netflix. It drives image recognition, such as Apple’s Face ID or object identification in self-driving cars. It is also capable of natural language processing: analyzing text to build spam filters, or studying writing to generate new dialogue in a desired tone (friendly, corporate, colloquial, etc.).

What’s Changed with AI?

So what’s with the recent buzz about AI?

Forsblom explained to us that the rapid progress in recent years has come about because there is more data to draw from and greater infrastructure to handle all the data. Major tech companies such as Google, Facebook, Baidu, Microsoft, and others are in a race to develop AI tools.

ChatGPT is one of the most popular of these. It is narrow AI with general AI tendencies – some researchers have even described it as showing sparks of AGI.

AI Audio

But what about using AI in audio tools?

Forsblom explained why generative AI has immense potential for use in sound: sound is easily quantifiable. It comprises decibels, frequency, duration, phase, and more. These are all measurable, meaning they can be sampled, learned from, and applied to new tasks.
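To make that concrete, here’s a minimal sketch (in Python with NumPy, using a synthetic 440 Hz tone in place of a real recording) of how a few of those properties – duration, peak level in decibels, dominant frequency – can be measured directly from a waveform:

```python
import numpy as np

# Toy illustration: measure basic properties of a synthesized 440 Hz tone.
sample_rate = 44100                      # samples per second
duration_s = 0.5
t = np.arange(int(sample_rate * duration_s)) / sample_rate
signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Duration: number of samples divided by the sample rate.
duration = len(signal) / sample_rate

# Peak level in decibels relative to full scale (dBFS).
peak_db = 20 * np.log10(np.max(np.abs(signal)))

# Dominant frequency via the Fourier transform.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
dominant_hz = freqs[np.argmax(spectrum)]

print(duration, round(peak_db, 1), dominant_hz)  # → 0.5 -6.0 440.0
```

Exactly because these numbers are so easy to extract, audio is fertile ground for models that learn patterns from measured data.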

One application you’ve likely used before is audio processing, such as noise removal. For example, iZotope’s RX can take a sample via its “Learn” button to feed the software a dataset, analyze the general pattern of the noise, then separate the noise from the signal with spectral subtraction.
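The core idea behind spectral subtraction can be sketched in a few lines. This is a deliberately crude approximation, not iZotope’s actual algorithm: it “learns” an average noise spectrum from a noise-only sample, then subtracts that spectrum from each frame of the noisy signal, keeping the original phase:

```python
import numpy as np

def spectral_subtraction(noisy, noise_sample, frame=1024):
    """Crude spectral subtraction: learn an average noise magnitude
    spectrum from a noise-only sample, then subtract it frame by frame."""
    # "Learn" step: average magnitude spectrum over noise-only frames.
    chunks = [noise_sample[i:i + frame]
              for i in range(0, len(noise_sample) - frame + 1, frame)]
    noise_mag = np.mean([np.abs(np.fft.rfft(c)) for c in chunks], axis=0)

    out = np.zeros_like(noisy)
    for i in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[i:i + frame])
        # Subtract the learned noise magnitude, flooring at zero,
        # and rebuild the frame with the original phase.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[i:i + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                                        n=frame)
    return out

# Demo: a 440 Hz tone buried in white noise, plus a separate noise-only clip.
rng = np.random.default_rng(0)
sr = 44100
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = tone + 0.1 * rng.standard_normal(sr)
cleaned = spectral_subtraction(noisy, 0.1 * rng.standard_normal(sr))

print(np.sqrt(np.mean((noisy - tone) ** 2)),    # error before cleaning
      np.sqrt(np.mean((cleaned - tone) ** 2)))  # error after (smaller)
```

Real tools add overlapping windows, smoothing, and over-subtraction control to avoid the “musical noise” artifacts this naive version produces, but the learn-then-subtract pattern is the same.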

It could potentially also be used for sound generation itself. Imagine a large library with carefully labeled sounds. Software can learn how the audio relates to the labels, learn relationships and patterns between the sounds, and generate something new. If a sound library is well-tagged with descriptive words like “swipe”, “airy”, “riser”, “impact”, “drum”, “cymbal”, and others, a prompt such as:

“create a short whoosh that grows from low to high and ends with a deep drum hit”

… could potentially find sounds based on those words and synthesize them into a new sound. Not what you want? Click the button again for a new version.
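The retrieval half of that idea can be shown with a trivial keyword-matching sketch (the library, filenames, and tags here are hypothetical; a real genAI tool would use learned embeddings rather than exact tag matches):

```python
# Toy sketch: rank sounds in a tagged library by how many of their
# tags appear in a text prompt. Filenames and tags are made up.
library = {
    "whoosh_01.wav": {"swipe", "airy", "riser"},
    "hit_03.wav":    {"impact", "drum", "deep"},
    "cymbal_02.wav": {"cymbal", "bright", "crash"},
}

def match_sounds(prompt, library):
    words = set(prompt.lower().replace(",", " ").split())
    scores = {name: len(tags & words) for name, tags in library.items()}
    # Keep only sounds with at least one matching tag, best match first.
    return [n for n, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]

print(match_sounds("create a short riser whoosh ending with a deep drum hit",
                   library))
# → ['hit_03.wav', 'whoosh_01.wav']
```

A generative system would then go a step further: instead of just returning the matched files, it would synthesize a new sound informed by them.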

In pro audio right now, plugins and apps such as Soundminer and BaseHead can assemble workflows that simulate AI tools via audio measurement, processing, and sound generation. Sure, at the moment these are multi-step workflows or scripts that involve a bit of manual input and training. But they serve as an approximation of AI audio and something to aim for – at least until more sophisticated tools evolve.

AI Audio Challenges

Using AI audio to flawlessly scrub away even the smallest sonic imperfections would make life easier for every sound pro. And sound design generation would be invaluable for those days where inspiration is hard to come by.

While it’s an exciting notion, Forsblom explained that AI audio isn’t simple. Why?

Well, the first issue is data quality – producing anything of value requires good source material. Even the most powerful tools can’t repair garbage audio, and it’s impossible to create amazing sound from poor-quality sound effects. No matter how powerful the AI is, it will always be limited by its training and the inputs it receives.

Perhaps most important to game audio and film and TV editors, AI audio isn’t creative. Math (with apologies to the data scientists) isn’t very artistic. It can’t create genuinely new things; its output is a product of the input it has been given. For instance, if we have a sound library without any weapon sounds and ask an AI to create a gun sound, we won’t have much luck. Why? Without a reference point for what a gun sounds like, no AI tool can create any sonic approximation of gunfire. It simply won’t know what it is.

There’s also the issue of interpretability. What one person may call an “airy” whoosh, someone else may label “breathy”. This subjectivity makes it difficult to simply tell an AI audio tool to “create a sound like Y in Z style”.

Finally, there’s a problematic side to creating with genAI. In essence, generative AI creates by sampling existing text, audio, images, and video. This runs the risk of basing new creations on copyrighted material – perhaps someone’s voice, their writing, and so on. Safeguarding jobs against advancements in AI technology was one of the key motivations behind the 2023 Hollywood labour disputes. After all, it won’t be long before an actor’s legacy footage, a random interview, or even paparazzi shots can be reused for new content – or before a writer’s collected scripts are used as a template to create a new screenplay without their involvement, for a fraction of the cost.

Examples of AI Audio

Despite these challenges, examples of AI audio have begun appearing.

Some of the first have been deepfakes in music. A growing number of examples have sampled artists to create new songs in the style of Kanye, or Drake and the Weeknd.

You can see for yourself how easy it is by using the Dadabots AI tool.

A bit closer to home, game engine provider Unity recently revealed a number of AI tools to help develop games. While not specifically aimed at game AI audio development, many other AI audio tools have already appeared.

Some common examples are AI audio enhancers, cleaners, and noise reduction tools. Adobe’s Enhanced Speech can make problematic dialogue sound like it was recorded in a studio. That’s ideal for podcast creators or budget filmmakers.

iZotope’s RX can reconstruct missing parts of audio, such as (small) gaps in sound, and restore clipped portions of audio. The Goya app uses AI to strip reverb, noise, and voices from audio. Waves’ Clarity Vx uses AI to remove noise; check out this video example:

Another popular example is AI audio generators. Speechify uses text-to-speech AI to read aloud user prompts or existing text. Listnr is another example, and there are many more, such as offerings from ElevenLabs, NOVA A.I., PlayHT, Synthesia, and Amazon Polly. There are even voice changers, like the ones from Voice AI and Metavoice.

How well does it work? You be the judge. We tried inserting this article’s text into the ElevenLabs AI audio generator. Here’s the result – with no other work than trimming and spreading out a few clips:

Facebook’s AudioCraft takes text prompts and creates music or dialogue, and its Voicebox AI can perform speech generation tasks like editing, sampling, and styling, as well as noise removal (think horns and dog barks), and more. Mubert can create a song from a text prompt. And this is just the beginning.

While AI noise restoration tools for game audio, feature film, and episodic television exist and continue to improve, truly transformative, pro audio-specific AI tools have yet to evolve.

Nevertheless, that hasn’t stopped some tools from entering the marketplace of procedural sound effects generation.

Adobe Firefly can analyze the video content of a scene to discover which sound effects it needs, then simply drop them in the proper place:

Krotos Studio is another similar tool that “creates Hollywood sound in seconds”:

For now, these tools lack the insight and craftsmanship that skilled audio professionals provide. But who knows? At the rate these tools are developing, everything from mundane editing to inspiring, creative sound design may not be far from being a part of every sound pro’s workflow.

Read More & Resources

