Ear Training and Audio Programming Course: Compression I (v0.3) - Programming WAV files
Last updated: July 31st 2023
Introduction #
Welcome to my Ear Training and Audio Programming course.
(I'm publishing it in parts, in no particular order. The full course will eventually get a whole page and a suggested learning order.)
Features in v0.3 #
Ability to export as WAV file #
Maybe you're tinkering with the compressor and find a sound you like and actually want to save, either to literally use in a production of yours, or to send as an example to someone, perhaps your producer.
Your Mission #
...should you choose to accept it, is to:
- Click any course file to load it.
- Hit play.
- Click the "+" to add a compressor.
- Click on the parameter names for guided tinkering.
- (Optionally: Upload any file of yours.)
- (But the hints and mini tutorials are obviously based on the course files. So for the guided tinkering part, you wanna use those.)
Main course files:
- Zander Noriega drum loops:
- Alternatively, load your own audio:
What is a WAVE file? #
Long story short: WAVE appeared as the native file format for audio on Windows. Developed by IBM and Microsoft in August 1991, it eventually became widely used for all things audio.
And even though there's nothing stopping anyone from putting compressed audio data into a WAV file, the common use and assumption is that it's a file storing audio data in the "uncompressed LPCM" format (which I will explain below.)
But let's leave the history for later, and get technical ASAP.
Spoiler alert: This is the code #
If you already know everything you need to know about the WAVE format and just want the code, here's the important part:
class IndexedBufferView {
constructor(length) {
this.buffer = new ArrayBuffer(length);
this.view = new DataView(this.buffer);
this.pos = 0;
}
setUint16(data) {
this.view.setUint16(this.pos, data, true);
this.pos += 2;
}
setUint32(data) {
this.view.setUint32(this.pos, data, true);
this.pos += 4;
}
}
const bufferToWAVE = (audioBuffer) => {
const { sampleRate, duration, numberOfChannels } = audioBuffer;
const samplesPerChannel = sampleRate * duration;
const bytesPerSample = 2;
const headerLengthInBytes = 44;
const fileLength = samplesPerChannel * numberOfChannels * bytesPerSample + headerLengthInBytes;
const byteRate = sampleRate * bytesPerSample * numberOfChannels;
const channels = [];
const content = new IndexedBufferView(fileLength);
content.setUint32(0x46464952); // "RIFF" chunk
content.setUint32(fileLength - 8); // file length - 8 (size of this and previous fields)
content.setUint32(0x45564157); // "WAVE"
content.setUint32(0x20746d66); // "fmt " chunk
content.setUint32(16); // length = 16
content.setUint16(1); // PCM (uncompressed)
content.setUint16(numberOfChannels); // usually 2 (for stereo left and right)
content.setUint32(sampleRate); // usually 44,100 or 48,000
content.setUint32(byteRate); // avg. bytes/sec
content.setUint16(numberOfChannels * 2); // block-align aka. frame size
content.setUint16(16); // 16-bit (hardcoded in this demo)
content.setUint32(0x61746164); // "data" - chunk
content.setUint32(fileLength - content.pos - 4); // length of *just* the data we're about to put in
// gather all channels at once into an array of channel data
// so we don't call getChannelData() millions of times below
for(let i = 0; i < numberOfChannels; i++)
channels.push(audioBuffer.getChannelData(i));
// actually write the data field of the "data" chunk
let sampleIndex = 0;
while(content.pos < fileLength) {
for(let i = 0; i < numberOfChannels; i++) {
const float32Sample = channels[i][sampleIndex];
const scaledSample = float32Sample * 32767;
const roundedSample = Math.round(scaledSample);
const clampedInt16Sample = Math.max(-32768, Math.min(roundedSample, 32767));
content.setUint16(clampedInt16Sample);
}
sampleIndex++;
}
return new Blob([content.buffer], {type: "audio/wav"});
}
It assumes an audioBuffer, ie. a buffer created by the Web Audio API once you manage to make it successfully decode audio data. (Or, alternatively, once you manually synthesize some sound with it.)
It also uses a little IndexedBufferView class I made up to keep track of the byte position in the file as we write data to it with setUint32 and setUint16. Feel free to approach that differently.
With that, you can then have a data blob that you can use to create a download link with something like:
const fileDataURL = URL.createObjectURL(bufferToWAVE(audioBuffer));
Now, if you want to understand the WAVE standard from an audio engineer's and an audio software programmer's perspective, keep on reading.
Anatomy of a WAVE file #
As you know, a file is just a bunch of bytes. (And a byte is 8 bits.)
In WAVE terminology, the parts of a file are called "chunks." And the parts of a chunk are called "fields."
Some chunks come after a chunk that is conceptually/functionally considered to be their parent. We call those "subchunks." But note that at the file level it's still just a flat list of bytes.
The basic, "canonical" WAVE format has 3 chunks in total (counting both "chunk" and "subchunk"), holding 14 fields between them:
- The "RIFF" chunk, containing the first 3 things you need to write to the file:
- The "fmt" chunk, containing the 8 next things you need to write to the file:
- The "data" chunk, containing the final 3 things you need to write to the file:
You can see that the actual audio data is the very last (but definitely not least!) thing we write to the file.
All these "chunks" and "subchunks" are just groups of bytes in the file arranged together one after the other.
Note: We will be talking about the most basic, canonical WAVE structure. Extended WAVE formats are a discussion / programming session for another day.
Clarification on these variable-looking field names #
I'm using field names such as Subchunk1ID and BitsPerSample because I've seen them elsewhere, but we will not be writing those labels to any file.
A field name such as Subchunk1Size is conceptual. Ie. the string "Subchunk1Size" is never something we write to the file.
Why people use these programming-code-looking names when talking about these parts of a WAV file, I don't know. But I'm continuing the tradition.
Reason for the chunk structure #
The pattern is this: Chunk ID, chunk size (in bytes), which together we conceptualize as "the header." And then the chunk data, if any.
It's designed this way so programs can:
- Read the chunk ID and its size, ie. the header, and decide:
- If I recognize this chunk: Process the data.
- If I don't recognize this chunk: Skip the bytes according to its size.
That way, a program that was programmed to handle 1995 WAV files won't explode when given a WAV file generated in 2023 with perhaps some new chunk that has data for some audio feature that might not have even existed back in 1995.
Of course, the most obvious use case of the "Header with Chunk ID and Chunk Size, followed by the Chunk data" pattern is for programs that just need to eg. display a file list. They only care about reading the header, to show the file type and size to the user, skipping the rest of the file.
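To make the "read the header, then decide" logic concrete, here's a minimal sketch (not part of the course code; the listChunks name is made up) of how a program could walk a WAV file's chunks with a DataView, collecting the ones it recognizes and skipping the rest:
// Minimal sketch: walk the chunk list of a RIFF/WAVE file held in an ArrayBuffer.
// Assumes a well-formed file; real code would also validate the "RIFF" and "WAVE" tags.
const listChunks = (arrayBuffer) => {
  const view = new DataView(arrayBuffer);
  const fourCC = (pos) => String.fromCharCode(
    view.getUint8(pos), view.getUint8(pos + 1), view.getUint8(pos + 2), view.getUint8(pos + 3));
  const chunks = [];
  let pos = 12; // skip the 12-byte RIFF descriptor ("RIFF" + size + "WAVE")
  while (pos + 8 <= view.byteLength) {
    const id = fourCC(pos);                     // eg. "fmt " or "data"
    const size = view.getUint32(pos + 4, true); // chunk size in bytes, little-endian
    chunks.push({ id, size, dataStart: pos + 8 });
    pos += 8 + size + (size % 2);               // chunk data is padded to an even byte count
  }
  return chunks;
};
A program that only understands "fmt " and "data" can simply ignore every other id in that list.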
The "RIFF" descriptor #
Ie. the "toplevel header."
WAVE files are a type of "structured binary file" that implements the RIFF file structure.
"Strictly speaking, RIFF is not a file format, but a file structure that defines a class of more specific file formats, some of which are listed here as subtypes. The basic building block of a RIFF file is called a chunk."
RIFF file structure, Library of Congress.
The short story is: Each group of bytes is called a "chunk." Each chunk begins with an ID and a size. Everything else is optional.
If a program / "application" doesn't know what a chunk is for, it can skip it (read its size, skip through that exact amount of bytes) and continue reading the following chunks, using only the ones it can make sense of.
(Fun fact: The RIFF structure was developed by Microsoft and IBM for Windows 3.1.)
ChunkID #
The letters "RIFF" encoded as 4 bytes (of their respective ASCII), in big-endian form.
In hex: 0x52494646.
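By the way, if you compare this with the code above, you'll notice the code writes content.setUint32(0x46464952), not 0x52494646. That's because our setUint32 passes true (little-endian) to DataView#setUint32, so the byte-reversed constant still lands in the file as the ASCII bytes for "RIFF". A quick sketch of the equivalence (just an illustration, separate from the course code):
// "RIFF" as ASCII codes: R = 0x52, I = 0x49, F = 0x46, F = 0x46.
const a = new DataView(new ArrayBuffer(4));
a.setUint32(0, 0x52494646, false); // big-endian constant, big-endian write
const b = new DataView(new ArrayBuffer(4));
b.setUint32(0, 0x46464952, true);  // byte-reversed constant, little-endian write
console.log(new Uint8Array(a.buffer)); // [82, 73, 70, 70], ie. "R","I","F","F"
console.log(new Uint8Array(b.buffer)); // [82, 73, 70, 70], ie. the same four bytes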
ChunkSize #
This is the size of the entire file in bytes, minus 8 bytes for the two fields not included in this count – ChunkID (see above) and ChunkSize (this field itself) – stored in little-endian form.
Format #
The letters "WAVE" encoded as 4 bytes (of their respective ASCII), in big-endian form.
In hex: 0x57415645.
The "fmt" subchunk #
This subchunk, which must always come before the data subchunk, specifies the format of the audio data.
Subchunk1ID #
The letters "fmt " (yes, that's a space at the end) encoded as 4 bytes (of their respective ASCII), in big-endian form.
In hex: 0x666d7420.
Subchunk1Size #
The size of the rest of the subchunk which follows this field.
It will be 16 for PCM, because 16 is the total size, in bytes, of the AudioFormat (2 bytes), NumChannels (2 bytes), SampleRate (4 bytes), ByteRate (4 bytes), BlockAlign (2 bytes), and BitsPerSample (2 bytes) fields in this subchunk.
AudioFormat #
The number 1, indicating uncompressed LPCM.
Also called "Microsoft PCM (uncompressed)" in some references out there.
Other values would indicate various other formats, eg. 0xF1AC for "Free Lossless Audio Codec FLAC."
Linear Pulse Code Modulated Audio (LPCM) #
The audio data we're always working with is a digital representation of sound, following the "Linear Pulse Code Modulation" standard.
And the idea is that the magnitude of the analog signal is sampled regularly at uniform intervals by some hardware.
The hardware then gives us those numbers, which we will call "samples" from here on, in a digital (usually binary) form, the streams of which we call "bitstreams."
To be clear, we don't care about individual bits. We're going to be working with, and speaking in terms of, bytes. A byte is a group of 8 bits.
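To make "sampling at uniform intervals" concrete, here's a tiny sketch (not part of the course code) that fills a Float32Array with one second of a 440 Hz sine wave, ie. exactly the kind of per-channel sample array the Web Audio API's AudioBuffer hands us:
// One second of a 440 Hz sine wave, sampled 44,100 times at uniform intervals.
const sampleRate = 44100;
const frequency = 440;
const samples = new Float32Array(sampleRate); // one channel, one second of samples
for (let n = 0; n < samples.length; n++) {
  const t = n / sampleRate;                           // time of sample n, in seconds
  samples[n] = Math.sin(2 * Math.PI * frequency * t); // a value in the [-1, 1] range
}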
NumChannels #
1 for mono, 2 for stereo, etc.
SampleRate #
Eg. 44100.
The meaning of SampleRate #
The highest part of a sound wave (or any wave) is called the crest (more commonly in audio: "peak"), and the lowest part the trough.
So in order to represent a wave digitally, you need to capture, at the very least, two "samples" for those two things: Peak and trough.
Imagine a digital audio recorder, capturing "samples" of the sound waves (more accurately: samples of the state of voltage of its microphone.) If you want to capture a 20,000 Hz wave (ie. a medium vibration that causes the voltage to go from peak to trough and back 20,000 times per second), then the recorder better be able to sample the voltage at 40,000 times per second at the very least.
I keep saying "at the very least" because, if you can capture the peak and trough voltage, that's cool, but still leaves out lot of information of what happens "in between." So the more samples (in addition to peak and trough) the better.
(Of course, at some point you start to get diminishing returns: Too high a sampling rate for what the listener can actually perceive.)
ByteRate #
Basically SampleRate * NumChannels * BitsPerSample/8.
Let's delve into that calculation, in case it's not clear.
The meaning of ByteRate #
The ByteRate is the average number of bytes per second of audio data.
SampleRate brings time – ie. the "per second" part of the result of this multiplication – into the picture.
The other two components of the operation bring in channels and bytes.
So basically you have samples per second * channels * size of each sample, giving you, as the common naming suggests, the rate (per second) of bytes.
For example, with a sample rate of 44,100 Hz, you know you're going to have 44,100 samples per second, per channel. Each sample has a size, which is defined by the resolution, eg. 16 bits. (Or, in bytes, 2.) (1 byte = 8 bits.)
But you have the channels too, right? In stereo, you have left and right. Ie. 2 channels.
So the total amount of data per second is going to be, on average: 44,100 * 2 * 2 = 176,400 bytes.
Or, generically, in terms of the WAVE file: SampleRate * NumChannels * BitsPerSample/8.
(BitsPerSample/8 to get the number of bytes per sample.)
That's the ByteRate.
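As a sanity check, here's that formula as a tiny function (just an illustration; the computeByteRate name is made up). It matches the byteRate that bufferToWAVE() computes as sampleRate * bytesPerSample * numberOfChannels:
// ByteRate = SampleRate * NumChannels * BitsPerSample / 8
const computeByteRate = (sampleRate, numChannels, bitsPerSample) =>
  sampleRate * numChannels * (bitsPerSample / 8);
console.log(computeByteRate(44100, 2, 16)); // 176400 bytes per second for 16-bit stereo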
BlockAlign #
The number of bytes for one sample across all channels. Ie. the size of one "frame."
This is useful for playback software, so that it knows how many bytes it needs to grab in order to send "the next moment of audio" or "the next slice of the waveform," so to speak, ie. the next frame, to all speakers.
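For example (a small sketch, not part of the course code): with 16-bit stereo, a frame is 4 bytes, and a player can find any frame in the data chunk with one multiplication:
// BlockAlign = NumChannels * BitsPerSample / 8, ie. bytes per frame.
const blockAlign = (numChannels, bitsPerSample) => numChannels * (bitsPerSample / 8);
console.log(blockAlign(2, 16)); // 4 bytes: 2 for the left sample + 2 for the right
// Byte offset of frame N inside the data chunk (dataStart = where the Data field begins):
const frameOffset = (dataStart, frameIndex, numChannels, bitsPerSample) =>
  dataStart + frameIndex * blockAlign(numChannels, bitsPerSample);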
Frames vs. Samples #
As far as the AudioBuffer is concerned, and in the context of the normie (stereo, 16-bit, 44.1kHz) WAV we want to write:
- An audio is two arrays (for the left and right channels) of 32-bit floating-point numbers which we call samples. (Which describe where a sound speaker needs to be at a particular moment in time.)
- The "sample rate" is how many of those samples (or "frames," or "sample-frames") to play per second.
So, how many samples need to be processed/played/copied/sent/etc. for 1 second of stereo audio at a 44,100 Hz sample rate? 88,200. (44,100 per channel.)
Additionally, you might come across the word "frame," sometimes used interchangeably with "sample."
To be accurate, "frame" refers to a collection of samples. A frame is all the samples, from all channels, describing a point in time, Ie. the state of all speakers (your left and right earbuds, for example) at one point in time.
So, in the case of mono, 1 frame = 1 sample. For stereo, 1 frame = 2 samples. For 5.1 surround, 1 = frame = 6 samples.
This is more than terminology: Even though samples are the basic unit of audio data, sample frames are the most handy for talking about a specific moment of time, for audio processing. And when you look an an audio on your DAW, typically as some horizontal block, each moment is a graphical representation of a frame, ie. what all the samples at that point in time. (Usually it's split in two, with the left channel samples at the top, and the right channel samples at the bottom.)
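In Web Audio API terms (a quick sketch, assuming you already have a decoded audioBuffer): AudioBuffer#length is the number of frames per channel, so the total sample count is that times numberOfChannels:
// Frames vs. samples for a decoded AudioBuffer.
const describeBuffer = (audioBuffer) => ({
  frames: audioBuffer.length,                                      // sample-frames per channel
  totalSamples: audioBuffer.length * audioBuffer.numberOfChannels, // all channels combined
  seconds: audioBuffer.duration,
});
// For 1 second of stereo at 44,100 Hz: { frames: 44100, totalSamples: 88200, seconds: 1 }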
BitsPerSample #
8 for 8 bits, 16 for 16 bits, etc.
As the name says, it's how many bits express the value of a sample. More bits means more possible values, thus more detailed audio. 16 bits is the standard for audio CDs, but as of this writing, 24-bit audio is common in a variety of other forms of playback.
Beyond the "high detail" part, higher resolution (say, 32-bit and beyond) is sometimes useful because the extra bits give the noisy artifacts of audio production processes room to sit at far lower levels than the signal the listener cares about.
The "data" subchunk #
Subchunk2ID #
The letters "data" encoded as 4 bytes (of their respective ASCII), in big-endian form.
In hex: 0x64617461.
Subchunk2Size #
The size of the actual audio data, encoded as 4 bytes.
Ie. the size of the rest of the subchunk which follows this field.
Data #
The actual audio data, ie. the uncompressed LPCM data, and nothing more.
From Web Audio data to WAV file #
Now that you're familiar with the anatomy of a WAV file, let's write one.
Spoiler: It's not straightforward #
I hoped it'd be something like const audioBuffer = audioContext.destination.buffer to get the audio as currently being heard by the user, with the result of all the processing in the node graph.
But it's not that easy.
Starting from the end #
I write code the same way I write music: I first pretend that something awesome exists, and start writing stuff around it, on the assumption that I will figure out the awesome thing later.
const onlineGraph = { audioNodes };
const offlineRenderedBuffer = await offlineBuffer(source.buffer, onlineGraph);
const fileDataURL = bufferToWAVDataURL(offlineRenderedBuffer);
// build DOM anchor with fileDataURL as href for a download link, etc. UI crap.
In this case, we're pretending there is:
- An offlineBuffer() function that magically:
  - Takes an online AudioBufferSourceNode.
  - Takes an onlineGraph (that reflects the state of our processing setup).
  - Magically "plays the processed audio into a buffer."
- A bufferToWAVDataURL() function that magically:
  - Takes the offline buffer.
  - Magically packs it into WAVE standard audio data for a browser "data url."
Neither of those magics exists, of course, so let's start with the one nearer the end: bufferToWAVDataURL().
Offline render to WAVE data #
If we continue with my self-bullshitting approach, and pretend that there exists a bufferToWAVE() function that magically converts an AudioBuffer into a "binary" blob:
const bufferToWAVDataURL = (audioBuffer) => {
const fileDataURL = URL.createObjectURL(bufferToWAVE(audioBuffer));
return fileDataURL;
}
Now onto bufferToWAVE(), which is where we finally have to write the proper magic (although here we again also rely on a bit of assumed magic, if you can catch it.)
Audio buffer to WAVE #
const bufferToWAVE = (audioBuffer) => {
const { sampleRate, duration, numberOfChannels } = audioBuffer;
const samplesPerChannel = sampleRate * duration;
const bytesPerSample = 2;
const headerLengthInBytes = 44;
const fileLength = samplesPerChannel * numberOfChannels * bytesPerSample + headerLengthInBytes;
const byteRate = sampleRate * bytesPerSample * numberOfChannels;
const channels = [];
const content = new IndexedBufferView(fileLength);
content.setUint32(0x46464952); // "RIFF" chunk
content.setUint32(fileLength - 8); // file length - 8 (size of this and previous fields)
content.setUint32(0x45564157); // "WAVE"
content.setUint32(0x20746d66); // "fmt " chunk
content.setUint32(16); // length = 16
content.setUint16(1); // PCM (uncompressed)
content.setUint16(numberOfChannels); // usually 2 (for stereo left and right)
content.setUint32(sampleRate); // usually 44,100 or 48,000
content.setUint32(byteRate); // avg. bytes/sec
content.setUint16(numberOfChannels * 2); // block-align aka. frame size
content.setUint16(16); // 16-bit (hardcoded in this demo)
content.setUint32(0x61746164); // "data" - chunk
content.setUint32(fileLength - content.pos - 4); // length of *just* the data we're about to put in
// gather all channels at once into an array of channel data
// so we don't call getChannelData() millions of times below
for(let i = 0; i < numberOfChannels; i++)
channels.push(audioBuffer.getChannelData(i));
// actually write the data field of the "data" chunk
let sampleIndex = 0;
while(content.pos < fileLength) {
for(let i = 0; i < numberOfChannels; i++) {
const float32Sample = channels[i][sampleIndex];
const scaledSample = float32Sample * 32767;
const roundedSample = Math.round(scaledSample);
const clampedInt16Sample = Math.max(-32768, Math.min(roundedSample, 32767));
content.setUint16(clampedInt16Sample);
}
sampleIndex++;
}
return new Blob([content.buffer], {type: "audio/wav"});
}
Notes:
- See the Appendix section on scaling the sample to better understand what's going on in the for loop.
- Audio programmers will note that I'm skipping dithering. Beyond scope.
- See the Appendix section on planar vs. interleaved sample organization, to understand why we're looping through the channels and basically laying out the samples this way: "LRLRLR..." (as opposed to "LLLRRR...")
Now, this big-ass function still assumes the existence of something that doesn't exist: That IndexedBufferView class.
So let's write that one.
The IndexedBufferView #
This class is just there to keep track of a position "in bytes" that increases depending on how many bytes we're writing, so that every subsequent write occurs at the right place in memory.
class IndexedBufferView {
constructor(length) {
this.buffer = new ArrayBuffer(length);
this.view = new DataView(this.buffer);
this.pos = 0;
}
setUint16(data) {
this.view.setUint16(this.pos, data, true);
this.pos += 2;
}
setUint32(data) {
this.view.setUint32(this.pos, data, true);
this.pos += 4;
}
}
Once again, the IndexedBufferView assumes that some magic exists: The classes ArrayBuffer and DataView. Briefly:
- ArrayBuffer: A way to ask the host environment (browser, Node.js, whatever application is being scripted with JavaScript) for a piece of memory to use for our "binary" needs.
- DataView: A wrapper around an ArrayBuffer (note how we pass it our ArrayBuffer to construct it) that has convenient methods for treating the buffer in a more semantically structured way.
Fortunately, both ArrayBuffer and DataView are real magic that already exists in all serious JavaScript implementations.
Finally, pos (which should probably be called index) is the position (in bytes) where the next write should occur. (Remember that 16 bits = 2 bytes, and 32 bits = 4 bytes. Thus the index increments you see in the code.)
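A quick sanity check of how pos advances (just an illustration, not part of the export path):
// Writing a 4-byte field and then a 2-byte field advances pos by 4 and then 2.
const scratch = new IndexedBufferView(8);
scratch.setUint32(0x46464952); // 4 bytes written, pos is now 4
scratch.setUint16(1);          // 2 more bytes, pos is now 6
console.log(scratch.pos);      // 6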
Checkpoint 1: We can make fileDataURL #
Alright, so now we have all the implementation of bufferToWAVDataURL().
Let's look at the final code again.
const onlineGraph = { audioNodes };
const offlineRenderedBuffer = await offlineBuffer(source.buffer, onlineGraph);
const fileDataURL = bufferToWAVDataURL(offlineRenderedBuffer);
// build DOM anchor with fileDataURL as href for a download link, etc. UI crap.
However, bufferToWAVDataURL() needs to be called with offlineRenderedBuffer, which is created by another piece of magic we haven't implemented: offlineBuffer().
The "offline" audio context #
The short story is this. We are going to:
- Create an "offline" audio context. (With the appropriately named
OfflineAudioContext
class.) - Recreate the audio processing graph in it. (Creating new nodes in the offline context, and copying the settings from their online cousins.)
- Use
OfflineAudioCtx#startRendering()
to get the data for the WAV file.
Online buffer to (equally processed) offline buffer #
const offlineBuffer = async (audioBuffer, onlineGraph) => {
const offlineAudioCtx = offlineAudioContextFromBuffer(audioBuffer);
const offlineSource = offlineAudioCtx.createBufferSource();
offlineSource.buffer = audioBuffer;
recreateOnlineGraph(offlineSource, offlineAudioCtx, onlineGraph);
offlineSource.start(0);
const renderedBuffer = await offlineAudioCtx.startRendering();
return renderedBuffer;
}
This one:
- Takes the original audio buffer (audioBuffer).
- Takes a description of the node processing connections (onlineGraph) (think diagram of studio equipment connections.)
- Creates a new audio processing environment / context (offlineAudioContextFromBuffer()).
- Creates a "source node," ie. the thing we can tell to "play" itself, ie. start streaming its audio data to the next node in the chain (offlineAudioCtx.createBufferSource()). Importantly, it creates it in the offline context.
- Tries to recreate the configuration of the online graph in the offline one (recreateOnlineGraph()).
- Tells the offline source to stream its audio data (offlineSource.start(0)).
- Finally, it tells the offline context to render its output silently and return the data as a "rendered buffer."
Note what it takes:
- The original audioBuffer.
  - This is the plain loaded file, before any audio node graph processing.
  - Ie. not the result of the online graph's processing.
  - (If we had that, we wouldn't have to write any of this!)
- An onlineGraph object.
  - It's just some object containing a list of audioNodes, which you're expected to keep track of yourself.
The two pieces of magic this one assumes (and which your own code will likely need to adapt from my dumb implementation for this first tutorial, because I don't know what type of graph management you will need) are recreateOnlineGraph() and offlineAudioContextFromBuffer(), so let's implement those.
Recreating the online graph offline #
This could get really complex (considering for example that your online graph might have auxiliary nodes for purely UI purposes that you might not want to have involved at all when rendering offline) so here I'm making two big assumptions:
- I assume you have an array audioNodes, in which you've been keeping track of the nodes that are processing your online graph.
  - Remember from the previous post that the Web Audio API doesn't keep track of anything for you (a node's inputs and outputs, the state of your node graph, etc.)
- I'm assuming one chain of serially connected nodes.
  - Ie. not a proper graph per se.
  - (I'll likely extend this to handle proper graphs later, as my own application evolves.)
Your real needs will obviously require something more complex than this. Anyway, here's this basic implementation:
const recreateOnlineGraph = (offlineSource, offlineAudioCtx, onlineGraph) => {
let lastConnectedOffline = offlineSource;
onlineGraph.audioNodes.forEach(audioNode => {
const recreatedNode = recreateOnlineNode(offlineAudioCtx, audioNode);
lastConnectedOffline.connect(recreatedNode);
lastConnectedOffline = recreatedNode;
});
lastConnectedOffline.connect(offlineAudioCtx.destination);
}
So, basically just assume the "graph" is really, at most, a chain of nodes serially connected, and conveniently stored as an array of nodes, with the order implied by their position in the array.
For each node, we "recreate it" and connect it to the offline context. (Remember we can't just connect the online context's nodes into the offline context. If we could, this recreation wouldn't be necessary.)
The dumb "magic" this magic relies on is recreateOnlineNode()
, to which we now turn.
Recreating the online nodes offline #
const recreateOnlineNode = (offlineAudioCtx, audioNode) => {
if (audioNode instanceof DynamicsCompressorNode) {
const offlineCompressor = offlineAudioCtx.createDynamicsCompressor();
forEachAudioParam(audioNode, (param, k) => {
offlineCompressor[k].value = param.value;
});
return offlineCompressor;
} else {
throw new Error("Don't know how to clone this type of audioNode.");
}
}
This one should be self-explanatory (and it should be clear that it's incomplete and should be extended!):
- If it's a compressor node (the only type of node my early-stage application cares about at the time of this writing), it creates a compressor node on the offline context, and then copies the parameter values.
- Else, literally explode. (Which works for me right now. You do you. Zero interest in "graceful" error handling or any of that crap. I want things to explode hard and loud and fast when I'm tinkering.)
The bit of magic in here is forEachAudioParam().
Iterating an AudioNode's params #
This is just a convenience function that does what the name says.
const noOp = () => {};
const forEachAudioParam = (node, f, g = noOp) => {
for (const p in node) {
if (node[p] instanceof AudioParam) {
f(node[p], p)
} else {
g(node[p], p)
}
}
}
If the object property is specifically an AudioParam, we call f; else we call g, whose default value is a function that does nothing.
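As a quick illustration (not part of the export path), you could use it to log a compressor's current parameter values, with the optional g showing you which properties are not AudioParams:
// Illustration: log a compressor's AudioParam values, and flag everything else.
const audioCtx = new AudioContext();
const compressor = audioCtx.createDynamicsCompressor();
forEachAudioParam(
  compressor,
  (param, name) => console.log(`${name}: ${param.value}`),    // threshold, knee, ratio, attack, release
  (value, name) => console.log(`(not an AudioParam) ${name}`) // eg. channelCount, numberOfInputs, etc.
);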
Offline audio context from online buffer #
This is the last piece of magic, which is thankfully trivial.
const offlineAudioContextFromBuffer = audioBuffer => {
const { sampleRate, numberOfChannels, duration } = audioBuffer;
const offlineAudioCtx = new OfflineAudioContext({
numberOfChannels: numberOfChannels,
length: sampleRate * duration,
sampleRate,
});
return offlineAudioCtx;
}
It simply creates an offline context and configures it based on properties of an audioBuffer which, remember, for us is the "initial," untouched audio buffer that the online context has been using, before any processing.
Because, again, there's no such thing as a "processed buffer" being automatically held for us in the online context. It's just not a thing.
Web Audio API does the right thing #
Think of it this way: Web Audio API loads an audio. Let's say the audio is large. Should Web Audio API automatically assume that you will want to export the audio as a file, and thus create an additional "processed audio" buffer with the whole thing? Ie. Occupy, by default, essentially 2x the memory it takes to load the audio for playback?
No. So the Web Audio API takes the audio source, passes it through the processor nodes, and then sends it out to the computer's audio ports. It's on us to program the offline rendering of the processed audio into additional memory, for our file-creation purposes.
The whole thing #
Here's everything together for convenience:
const noOp = () => {};
const forEachAudioParam = (node, f, g = noOp) => {
for (const p in node) {
if (node[p] instanceof AudioParam) {
f(node[p], p)
} else {
g(node[p], p)
}
}
}
const recreateOnlineNode = (offlineAudioCtx, audioNode) => {
if (audioNode instanceof DynamicsCompressorNode) {
const offlineCompressor = offlineAudioCtx.createDynamicsCompressor();
forEachAudioParam(audioNode, (param, k) => {
offlineCompressor[k].value = param.value;
});
return offlineCompressor;
} else {
throw new Error("Don't know how to clone this type of audioNode.");
}
}
const recreateOnlineGraph = (offlineSource, offlineAudioCtx, onlineGraph) => {
let lastConnectedOffline = offlineSource;
onlineGraph.audioNodes.forEach(audioNode => {
const recreatedNode = recreateOnlineNode(offlineAudioCtx, audioNode);
lastConnectedOffline.connect(recreatedNode);
lastConnectedOffline = recreatedNode;
});
lastConnectedOffline.connect(offlineAudioCtx.destination);
}
const offlineAudioContextFromBuffer = audioBuffer => {
const { sampleRate, numberOfChannels, duration } = audioBuffer;
const offlineAudioCtx = new OfflineAudioContext({
numberOfChannels: numberOfChannels,
length: sampleRate * duration,
sampleRate,
});
return offlineAudioCtx;
}
const offlineBuffer = async (audioBuffer, onlineGraph) => {
const offlineAudioCtx = offlineAudioContextFromBuffer(audioBuffer);
const offlineSource = offlineAudioCtx.createBufferSource();
offlineSource.buffer = audioBuffer;
recreateOnlineGraph(offlineSource, offlineAudioCtx, onlineGraph);
offlineSource.start(0);
const renderedBuffer = await offlineAudioCtx.startRendering();
return renderedBuffer;
}
class IndexedBufferView {
constructor(length) {
this.buffer = new ArrayBuffer(length);
this.view = new DataView(this.buffer);
this.pos = 0;
}
setUint16(data) {
this.view.setUint16(this.pos, data, true);
this.pos += 2;
}
setUint32(data) {
this.view.setUint32(this.pos, data, true);
this.pos += 4;
}
}
const bufferToWAVE = (audioBuffer) => {
const { sampleRate, duration, numberOfChannels } = audioBuffer;
const samplesPerChannel = sampleRate * duration;
const bytesPerSample = 2;
const headerLengthInBytes = 44;
const fileLength = samplesPerChannel * numberOfChannels * bytesPerSample + headerLengthInBytes;
const byteRate = sampleRate * bytesPerSample * numberOfChannels;
const channels = [];
const content = new IndexedBufferView(fileLength);
content.setUint32(0x46464952); // "RIFF" chunk
content.setUint32(fileLength - 8); // file length - 8 (size of this and previous fields)
content.setUint32(0x45564157); // "WAVE"
content.setUint32(0x20746d66); // "fmt " chunk
content.setUint32(16); // length = 16
content.setUint16(1); // PCM (uncompressed)
content.setUint16(numberOfChannels); // usually 2 (for stereo left and right)
content.setUint32(sampleRate); // usually 44,100 or 48,000
content.setUint32(byteRate); // avg. bytes/sec
content.setUint16(numberOfChannels * 2); // block-align aka. frame size
content.setUint16(16); // 16-bit (hardcoded in this demo)
content.setUint32(0x61746164); // "data" - chunk
content.setUint32(fileLength - content.pos - 4); // length of *just* the data we're about to put in
// gather all channels at once into an array of channel data
// so we don't call getChannelData() millions of times below
for(let i = 0; i < numberOfChannels; i++)
channels.push(audioBuffer.getChannelData(i));
// actually write the data field of the "data" chunk
let sampleIndex = 0;
while(content.pos < fileLength) {
for(let i = 0; i < numberOfChannels; i++) {
const float32Sample = channels[i][sampleIndex];
const scaledSample = float32Sample * 32767;
const roundedSample = Math.round(scaledSample);
const clampedInt16Sample = Math.max(-32768, Math.min(roundedSample, 32767));
content.setUint16(clampedInt16Sample);
}
sampleIndex++;
}
return new Blob([content.buffer], {type: "audio/wav"});
}
const bufferToWAVDataURL = (audioBuffer) => {
const fileDataURL = URL.createObjectURL(bufferToWAVE(audioBuffer));
return fileDataURL;
}
With these functions, your task is to modify whatever you're doing with Web Audio API, for which you should have:
- A successfully decoded audio, used to construct an AudioBufferSourceNode (variable source below).
- An audioNodes array in an onlineGraph object, representing the state of your online graph.
  - (Again, your online graph can be simple or complex. Complex node graph management is beyond scope here.)
const onlineGraph = { audioNodes };
const offlineRenderedBuffer = await offlineBuffer(source.buffer, onlineGraph);
const fileDataURL = bufferToWAVDataURL(offlineRenderedBuffer);
const exportFileName = "EXPORT.wav";
const downloadLinkEl = document.createElement("a");
downloadLinkEl.textContent = exportFileName;
downloadLinkEl.href = fileDataURL;
downloadLinkEl.download = exportFileName;
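To actually offer the download, the anchor still has to land somewhere in your page, and once you're done with it you can release the Blob's memory with URL.revokeObjectURL(). A minimal sketch (the "#export-links" container is an assumption; use whatever element your UI already has):
// Put the link somewhere visible; "#export-links" is just an assumed container element.
const containerEl = document.querySelector("#export-links");
containerEl.appendChild(downloadLinkEl);
// Optionally, once the link is no longer needed, free the Blob's memory:
// URL.revokeObjectURL(fileDataURL);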
Appendix #
Brushing up on AudioBuffer #
We use AudioBuffer for everything, so let's brush up on its interface.
To quote from the Web Audio API specification:
"This interface represents a memory-resident audio asset."
–Web Audio API specification.
Meaning, our audio data loaded into RAM.
"It can contain one or more channels with each channel appearing to be 32-bit floating-point linear PCM values with a nominal range of [−1,1] but the values are not limited to this range."
–Web Audio API specification.
"Channels" as in the left and right channels in stereo.
What is actually contained in the "data" chunk? #
As I hope is clear from the explanation and code above, the data chunk, specifically in our WAV file, contains 16-bit samples, ie. each sample occupies 2 bytes.
Scaling the sample #
There's a type issue to keep in mind.
Using audioBuffer.getChannelData(0)[0] as shown above will yield a sample with a value such as -0.008437649346888065 (real example from my tests.)
Let's quote the spec for getChannelData() and the (notes on the) spec for the WAVE format:
"According to the rules described in acquire the content either get a reference to or get a copy of the bytes stored in [[internal data]] in a new Float32Array."
–AudioBuffer#getChannelData() specification.
"8-bit samples are stored as unsigned bytes, ranging from 0 to 255. 16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767."
–WAVE PCM soundfile format notes
So I'm doing what seems to be the common way to convert a Float32 to an Int16 for our 16-bit audio file.
Assuming a sample which we grab with channels[i][sampleIndex] in the loop shown earlier, this is what we're going to do:
const float32Sample = channels[i][sampleIndex];
const scaledSample = float32Sample * 32767;
const roundedSample = Math.round(scaledSample);
const clampedInt16Sample = Math.max(-32768, Math.min(roundedSample, 32767));
(The last line says "make sure the value is within the [-32768, 32767] range," ie. clamp it.)
In case it's not clear: We're mapping the lowest Float32, ie. -1.0, to the lowest Int16, ie. -32768, the highest float, ie. 1.0, to the highest Int16, ie. 32767, and the 0.0 float to the 0 int. Everything in between is some float "scaled" by the highest int (and then rounded.)
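Plugging a few concrete values into those three lines (wrapped here in a hypothetical floatToInt16 helper) shows the mapping. Note that with this particular scaling, -1.0 actually lands on -32767; the clamp only kicks in for stray floats outside the nominal [-1, 1] range:
// Hypothetical helper wrapping the scale / round / clamp lines above.
const floatToInt16 = (f) => Math.max(-32768, Math.min(Math.round(f * 32767), 32767));
console.log(floatToInt16(-0.008437649346888065)); // -276 (the sample from my tests)
console.log(floatToInt16(1.0));   //  32767
console.log(floatToInt16(0.0));   //  0
console.log(floatToInt16(-1.0));  // -32767
console.log(floatToInt16(-1.03)); // -32768 (out-of-range float, clamped)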
Skipping dithering #
The conversion from audio float to integer inherently produces "quantization errors" which may or may not be perceived by your ears. They can be (partially) handled by adding white noise, in a process called "dithering."
We're not gonna be doing any dithering here.
(And in fact, in most professional audio software dithering is an optional setting. So I'm gonna leave that topic for another day / future version of my own application.)
Planar vs. interleaved #
We can't (well, we can, but we don't want to) just write all the samples in channel 0 (stereo left), and then all the samples in channel 1 (stereo right), on the data chunk.
The WAV file, being for playback, is expected to have an "interleaved" layout of samples.
Basically, if we have 4 digitized moments in time (ie. 4 frames of audio, ie. 8 samples in stereo):
- Web Audio API organizes the samples this way, ie. "LLLLRRRR" (Planar.)
- WAV file organizes the samples this way, ie. "LRLRLRLR" (Interleaved.)
The WAVE format is for playback. The samples are there to tell speakers what to do.
So it makes sense to tell all speakers what to do at each point in time, ie. send "L" and "R" before proceeding to the next pair. That way all speakers react as simultaneously as possible, and the listener gets the intended stereo perception/experience.
(As opposed to listening to a song N times, once per channel!)
That's exactly what the sample-writing loop in bufferToWAVE() above does: it "interleaves" the samples as it writes them into the data chunk of the WAVE file.
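As a standalone illustration of that "LRLR..." layout (a sketch, separate from the bufferToWAVE() loop, which does the same interleaving while also converting each sample to Int16):
// Interleave two planar channel arrays ("LLLL" and "RRRR") into one "LRLR..." array.
// Assumes both channels have the same length (one sample per frame each).
const interleave = (left, right) => {
  const interleaved = new Float32Array(left.length + right.length);
  for (let frame = 0; frame < left.length; frame++) {
    interleaved[frame * 2] = left[frame];      // L sample of this frame
    interleaved[frame * 2 + 1] = right[frame]; // R sample of this frame
  }
  return interleaved;
};
console.log(interleave(new Float32Array([1, 2, 3, 4]), new Float32Array([5, 6, 7, 8])));
// Float32Array [1, 5, 2, 6, 3, 7, 4, 8], ie. L R L R L R L R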
References #
- (Text/HTML) WAV @ Wikipedia.org
- (Text/HTML) WAVE Specifications @ McGill.ca
- (Text/PDF) Multimedia Programming Interface and Data Specifications 1.0 @ McGill.ca
- (Text/HTML) WAVE PCM soundfile format @ soundfile.sapp.org
- (Text/HTML) Wave File Format @ Sonicspot.com (Archived)
- (Text/HTML) RIFF (Resource Interchange File Format) @ LOC.gov
- (Text/HTML) Format chunk (of a Wave file) @ RecordingBlogs.com
- (Text/HTML) Linear Pulse Code Modulated Audio (LPCM) @ LOC.gov
- (Text/HTML) 1.4. The AudioBuffer Interface @ Web Audio API, 29 March 2023
- (Text/HTML) 1.3. The OfflineAudioContext Interface @ Web Audio API, 29 March 2023 @ MDN
- (Text/HTML) 1.4.3. AudioBuffer Methods @ Web Audio API, 29 March 2023
- (Text/HTML) BJORG @ Blog.BjornRoche.Com
- (Text/HTML) OfflineAudioContext @ MDN
- (Text/HTML) How to Convert an AudioBuffer to an Audio File with JavaScript @ RussellGood
- (Text/HTML) How to manipulate the contents of an audio tag and create derivative audio tags from it? @ StackOverflow
- (Text/HTML) Basic concepts behind Web Audio API @ MDN