SoundMusic is a suite of AI-powered tools that generate music from audio files.
SoundMusic is a working name, for the time being...
This showcase is under construction.
28/06/2020
Over the past few months we've put together many of the previously explored concepts, along with some new ones, into a system capable of generating novel electroacoustic music (or at least sound art) from previously recorded audio. The system we have right now is versatile and was designed with human-machine collaboration in mind, allowing the human to intervene in many phases of the creation process or even take over at some points. SoundMusic splits the creative process into five distinct phases, each with a tangible result that can be used on its own or incorporated into the final output of the system.
The first phase of the process involves extracting salient sounds from the source audio. This is done by programmatically tweaking the parameters of a silence detection algorithm in order to maximize the number of non-silent sections detected within a certain duration interval. Below are some examples of the kinds of sounds that are extracted, followed by a rough sketch of the tuning loop.
Splash
Fragment 1
Fragment 2
Fragment 3
Birds
Fragment 1
Fragment 2
Fragment 3
Swans
Fragment 1
Fragment 2
Fragment 3
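As an illustration of the extraction step above, the tuning loop might look something like the following minimal Python sketch; the threshold range, duration bounds and function names here are illustrative, not the exact ones the system uses.

    import numpy as np
    import librosa

    def extract_fragments(path, min_dur=0.1, max_dur=2.0):
        # Sweep the silence threshold and keep the setting that yields the most
        # non-silent sections whose duration falls inside [min_dur, max_dur].
        y, sr = librosa.load(path, sr=None)
        best_intervals, best_count = [], -1
        for top_db in range(10, 80, 5):  # candidate thresholds, in dB below peak
            intervals = librosa.effects.split(y, top_db=top_db)
            durations = (intervals[:, 1] - intervals[:, 0]) / sr
            count = int(np.sum((durations >= min_dur) & (durations <= max_dur)))
            if count > best_count:
                best_count, best_intervals = count, intervals
        return [y[start:end] for start, end in best_intervals]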
The purpose of the synthesis phase is to create novel sounds to be used in the final piece, creating some detachment from the source audio. This phase uses a parametrized synth capable of generating a sound from a source audio through the combination of multiple synthesis techniques. This synth is composed of many sub-synths, each dedicated to a different kind of synthesis. This design makes it easy to introduce new sub-synths into the combined synth, leading to the creation of different sounds. The following synthesis techniques have been implemented:
Additive Synthesis
Additive synthesis is a technique that generates new sounds by adding together many waves, typically sine waves, to recreate the sound of an instrument. In our system, however, we also allow the combination of waves that aren't sine waves. The amplitude, frequency and phase of the combined waves are controlled by the input sound; in particular, the N bins with the highest average amplitude over time are considered. This synth has the following parameters:
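Whatever the exact parameter set, the core of the idea can be sketched like this, restricted to sine partials and with illustrative names and defaults:

    import numpy as np
    import librosa

    def additive_resynth(y, sr, n_partials=8, n_fft=2048, hop=512):
        # Rebuild a sound from the n_partials STFT bins with the highest average magnitude.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        bins = np.argsort(mag.mean(axis=1))[-n_partials:]   # strongest bins on average
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
        t = np.arange(len(y)) / sr
        frame_times = np.arange(mag.shape[1]) * hop
        out = np.zeros(len(y))
        for b in bins:
            # Stretch this bin's per-frame magnitude into a per-sample amplitude envelope.
            env = np.interp(np.arange(len(y)), frame_times, mag[b])
            out += env * np.sin(2 * np.pi * freqs[b] * t + phase[b, 0])  # initial phase only
        return out / np.max(np.abs(out))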
FM Synthesis
Frequency Modulation synthesis is a technique for generating sounds by modulating the frequency of a wave with another wave. In our specific case, a regular wave's frequency is modulated by the values resulting from running a pitch tracking algorithm on the original sound. We also modulate the amplitude of the wave with the amplitude of the original wave. This synth has the following parameters:
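A minimal sketch of this arrangement, assuming librosa's yin pitch tracker and an RMS envelope as a stand-in for the source amplitude (the actual implementation and parameters may differ):

    import numpy as np
    import librosa

    def fm_from_source(y, sr, carrier_hz=220.0, mod_depth=0.5, frame=2048, hop=512):
        # Track the source's pitch and loudness frame by frame.
        f0 = librosa.yin(y, fmin=50, fmax=2000, sr=sr, frame_length=frame, hop_length=hop)
        rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
        n = len(y)
        f0_s = np.interp(np.arange(n), np.arange(len(f0)) * hop, f0)     # per-sample pitch
        amp_s = np.interp(np.arange(n), np.arange(len(rms)) * hop, rms)  # per-sample loudness
        inst_freq = carrier_hz + mod_depth * f0_s                        # frequency modulation
        phase = 2 * np.pi * np.cumsum(inst_freq) / sr
        return amp_s * np.sin(phase)                                     # amplitude modulation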
PM Synthesis
Phase Modulation synthesis is similar to the previously described Frequency Modulation synthesis and achieves similar results. The main difference is that instead of modulating the frequency of a wave, we modulate its phase. We treat the two separately because of the way our implementations interact with the source sound: in this case, we use a regular wave to modulate the phase of the playback of the source wave, leading to very different results from the previous technique. This synth has the following parameters:
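In sketch form, modulating the playback phase of the source with a sine could look like this (a simplified illustration, not the exact implementation):

    import numpy as np

    def pm_playback(y, sr, mod_hz=3.0, depth_s=0.01):
        # Read the source back with a playback position wobbled by a sine wave,
        # i.e. a time-varying offset of up to depth_s seconds.
        n = len(y)
        t = np.arange(n)
        idx = np.clip(t + depth_s * sr * np.sin(2 * np.pi * mod_hz * t / sr), 0, n - 1)
        lo = np.floor(idx).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        frac = idx - lo
        return (1 - frac) * y[lo] + frac * y[hi]   # linear interpolation between samples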
Granular Synthesis
Granular synthesis is a technique for creating new sounds from a source sound by manipulating it on a very small temporal scale. Our specific implementation works by copying small segments from the source audio into random locations of the target audio, creating a new texture from the source material. The following parameters control this synth:
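A bare-bones version of this scattering process might look like the following; the grain size, grain count and Hann window are illustrative choices:

    import numpy as np

    def granular_texture(y, sr, out_dur=5.0, grain_dur=0.05, n_grains=2000, seed=0):
        # Copy short windowed grains from random positions in the source into
        # random positions of an initially silent target buffer.
        rng = np.random.default_rng(seed)
        grain_len = int(grain_dur * sr)
        out = np.zeros(int(out_dur * sr))
        window = np.hanning(grain_len)                      # avoid clicks at grain edges
        for _ in range(n_grains):
            src = rng.integers(0, len(y) - grain_len)
            dst = rng.integers(0, len(out) - grain_len)
            out[dst:dst + grain_len] += window * y[src:src + grain_len]
        return out / np.max(np.abs(out))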
Spectral Granular Synthesis
This technique is similar to the previous one, with the difference that instead of working with parts of the source sound wave, it works on the spectral representations of the source and target. This means that the grains can be shifted not only in the time domain but also in the frequency domain, which can create interesting new features from the source audio. The following parameters control this synth:
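The same idea transposed to the spectral domain can be sketched roughly as follows; the patch sizes and counts are arbitrary placeholders:

    import numpy as np
    import librosa

    def spectral_granular(y, n_grains=400, grain_bins=32, grain_frames=8, seed=0):
        # Copy small time-frequency patches of the source STFT into random
        # positions of a target STFT, shifting them in both time and frequency.
        S = librosa.stft(y)
        out = np.zeros_like(S)
        rng = np.random.default_rng(seed)
        n_bins, n_frames = S.shape
        for _ in range(n_grains):
            b_src = rng.integers(0, n_bins - grain_bins)
            f_src = rng.integers(0, n_frames - grain_frames)
            b_dst = rng.integers(0, n_bins - grain_bins)       # frequency shift
            f_dst = rng.integers(0, n_frames - grain_frames)   # time shift
            out[b_dst:b_dst + grain_bins, f_dst:f_dst + grain_frames] += \
                S[b_src:b_src + grain_bins, f_src:f_src + grain_frames]
        return librosa.istft(out, length=len(y))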
Combined Synthesis
All of the above techniques are combined into a single synth that contains an instance of each of the previous synths. This synth inherits all of the parameters of its sub-synths, plus enough parameters to generate a wave that modulates the amplitude of the output of each sub-synth. For each of the previously described synths, this synth also has the following parameters:
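Conceptually, the combined synth just sums its sub-synths' outputs, each gated by its own modulating wave. A toy version, with hypothetical callables and a simple sine modulator, might be:

    import numpy as np

    def combined_synth(y, sr, sub_synths, mod_params):
        # sub_synths: list of callables taking (y, sr); mod_params: one (freq_hz, depth)
        # pair per sub-synth, controlling its amplitude-modulating wave.
        out = None
        for synth, (freq, depth) in zip(sub_synths, mod_params):
            s = synth(y, sr)
            t = np.arange(len(s)) / sr
            s = s * (1.0 - depth * 0.5 * (1 + np.sin(2 * np.pi * freq * t)))
            out = s if out is None else out[:len(s)] + s[:len(out)]
        return out / np.max(np.abs(out))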
The final synth's parameters can either be completely random or generated through a Genetic Algorithm. The Genetic Algorithm treats the list of parameters controlling each individual synth as its genotype, and an example of sound produced by the synth as its phenotype. Since each synth can generate a different sound for each source audio, we evaluate n randomly selected examples and take the average of their scores as the synth's final score. Evaluating the quality of the sounds is a subjective task, and as such it is one where human input is needed. The first option is running a standard Interactive Genetic Algorithm, in which the user evaluates each sample's fitness. While this gives the user a lot of control over the result of this phase, it is a rather time-consuming process, so we also offer the possibility of delegating it to a machine learning component. SVMs have proven effective in audio classification tasks and perform well with relatively few examples. The user can train an SVM-based regression model on sounds they have rated themselves, and then use this model as the fitness function in the genetic algorithm. We have used this option to quickly generate examples from our large dataset and it has produced desirable results. Furthermore, multiple passes of the synths can be applied to the audio, using the output of the first pass as input for the second, achieving different sounds that can be seen as compositions of multiple instances of the base synth. While the program can go on to produce an entire composition, it can also be interesting to stop here and use the generated sounds as the basis for human-made compositions.
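In sketch form, the machine-learning shortcut looks roughly like this; the mean-MFCC features and the helper names are placeholders for whatever the system actually feeds the SVM:

    import numpy as np
    import librosa
    from sklearn.svm import SVR

    def sound_features(y, sr):
        # Summarise a sound as its mean MFCC vector.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    def train_fitness(rated_sounds, ratings, sr):
        # Fit an SVM regressor on user-rated sounds and return it as a fitness function.
        X = np.array([sound_features(y, sr) for y in rated_sounds])
        model = SVR().fit(X, ratings)
        return lambda y: float(model.predict(sound_features(y, sr).reshape(1, -1))[0])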
The sounds generated in the synthesis phase are then loaded into samplers that serve as the "instruments" throughout the generated composition. These samplers organize sounds in a Cartesian space with three dimensions: pitch (in Hz), duration (in seconds) and volume (in dB). Each sampler is then controlled by a stream of commands in the form of three-dimensional vectors indicating the pitch, duration and volume of the desired sound. For each command, the sampler produces a sound by interpolating the n closest sounds to the desired point in the space, pitch-shifting the result to the desired pitch, and applying a volume envelope with the desired duration.
There are two methods to achieve the interpolation of the sounds. The preferred method, as it is quicker and provides better results in the context of this thesis, is directly interpolating the samples that represent each of the sounds. An alternative method uses the NSynth WaveNet auto-encoder from the Magenta project: the samples from both sounds are encoded, and the encoded representations are interpolated and decoded into the final sound. While we believe this method should result in interesting sounds, the observed results are currently sub-par compared to the more straightforward method. As our sounds are rather different from the ones the Magenta team used to train the auto-encoder, we believe that retraining it on more similar examples should lead to better results; however, the resources and time required to do so are beyond the scope of this work.
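As an illustration of the direct method, a sampler lookup could be sketched like this; the bank structure, the inverse-distance weighting and the omission of pitch shifting and enveloping are all simplifications:

    import numpy as np

    def sampler_render(bank, command, n_neighbors=2):
        # bank: dict mapping (pitch_hz, dur_s, vol_db) points to sample arrays.
        # command: the desired (pitch_hz, dur_s, vol_db) vector.
        points = np.array(list(bank.keys()))
        sounds = list(bank.values())
        dist = np.linalg.norm(points - np.array(command), axis=1)
        nearest = np.argsort(dist)[:n_neighbors]
        weights = 1.0 / (dist[nearest] + 1e-9)
        weights /= weights.sum()
        length = min(len(sounds[i]) for i in nearest)
        return sum(w * sounds[i][:length] for w, i in zip(weights, nearest))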
After the samplers are generated, we need a sequence of commands to feed the samplers in order to complete this phase. This sequence of commands can be derived from a MIDI file provided by the user, or can be generated from the input audio. The process to generate meaningful sequences of commands from the input audio is as follows:
The results of this phase can also be used on their own, as part of a human-composed piece, or can be used by the following phases of the process to be integrated into the final output of the system.
Khunan
Khunan Fragments
Surf
Surf Fragments
Passing Train
Passing Train Fragments
This phase of the process is entirely optional and works on a completely different plane, in order to solve an aesthetic problem with the product of the previous phase. Due to the way the commands are generated, undesirable chunks of silence can end up in the final result. To fix this, the system generates an evolving drone that serves as a backdrop to the piece. The drone is created by taking short segments of the source audio, applying a band-pass filter, and looping them. The amplitude of each loop is then modulated by sine waves with different amplitudes, frequencies and durations. A drone is therefore characterized by:
A population of drones is generated through a genetic algorithm, using as a fitness function the average distance between the spectrogram of the generated drone and the spectrograms of random segments of the source audio. The final population of drones is combined by modulating the frequency of each drone with a lower-frequency sine wave, each with a different phase, and adding up the resulting waves. The final result is a slowly evolving texture that resembles the texture of the original sound without focusing on any particular moment.
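Leaving the genetic algorithm and the final mixing aside, a single drone voice can be sketched as a band-passed loop under a slow sine modulator; the parameter names and filter order here are illustrative:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def drone_voice(y, sr, start_s, seg_dur, low_hz, high_hz, lfo_hz, lfo_depth, out_dur=30.0):
        # One drone voice: a short band-passed segment of the source, looped and
        # amplitude-modulated by a slow sine wave.
        seg = y[int(start_s * sr):int((start_s + seg_dur) * sr)]
        sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
        seg = sosfilt(sos, seg)
        loop = np.tile(seg, int(np.ceil(out_dur * sr / len(seg))))[:int(out_dur * sr)]
        t = np.arange(len(loop)) / sr
        lfo = 1.0 - lfo_depth * 0.5 * (1 + np.sin(2 * np.pi * lfo_hz * t))
        return loop * lfo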
Phaser
Phaser Drone
Effects
Effects Drone
Lullaby
Lullaby Drone
The final phase involves joining the fragments and the drone into a coherent final product. A fade-in and fade-out are applied to the drone to give the piece a sense of beginning and ending. The drone also fades out slightly while the fragments are playing, to help that top layer stand out.
Volleyball
Volleyball Composition
Strings
Strings Composition
Underwater
Underwater Composition
In order to further glue the composition together and add more interest, it is possible to add reverb. Instead of adding the same reverb to all the pieces, we use a convolution reverb. Convolution reverb works by taking an impulse response recording of a space and convolving it with the recording to which we want to apply the reverb. In our case, instead of using a pre-recorded response, we try to estimate the response of the space captured in the source audio. This is done by running an onset detection algorithm on the audio and taking samples of the sound that follows each onset but precedes the next one, or in musical terms, the space between notes. We then apply a decay to the result, as the source audio doesn't have a real decay in most cases. The system also outputs this room sound, so that it can be applied to other compositions.
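A sketch of this estimation and of applying the resulting reverb, assuming librosa's onset detector and a plain FFT convolution (the tail length, decay constant and wet/dry mix are made up):

    import numpy as np
    import librosa
    from scipy.signal import fftconvolve

    def room_response(y, sr, tail_s=0.3, decay=4.0):
        # Average the audio that follows each onset (but ends before the next one)
        # and impose an artificial exponential decay. Assumes enough spaced onsets.
        onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples", backtrack=True)
        n_tail = int(tail_s * sr)
        tails = [y[s:s + n_tail] for s, nxt in zip(onsets[:-1], onsets[1:])
                 if nxt - s >= n_tail]
        ir = np.mean(tails, axis=0)
        return ir * np.exp(-decay * np.arange(n_tail) / sr)

    def apply_reverb(dry, ir, wet=0.3):
        wet_sig = fftconvolve(dry, ir)[:len(dry)]
        return (1 - wet) * dry + wet * wet_sig / (np.max(np.abs(wet_sig)) + 1e-9)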
Football
Football Composition
Football Composition with Reverb
Loons
Loons Composition
Loons Composition with Reverb
Finally, while the system was initially built with a single-channel output in mind, basic stereo support was added. Stereo works by generating two different compositions, one for the right channel and another for the left. Each composition has a different drone and different fragments, but the density of fragments is half of what it was in the mono version. This results in a piece with the same overall density of fragments when combining the two channels, except that half the fragments come from each channel. Simply playing each piece on its own channel results in a less-than-ideal experience, as the sounds from each channel feel disconnected, possibly even disorienting the listener. To fix this, a bit of the right channel is mixed into the left channel, with reversed polarity, and vice versa. This helps tie both channels together, resulting in a more satisfying stereo experience. While this process is rather simple and required very few modifications to the existing system, it is not without its flaws. All of the sounds come either from the left or from the right, instead of from different points in space, and there is no underlying logic to where the sounds come from, as the two channels are generated independently.
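The cross-feed itself is essentially one line per channel; something along these lines, where the amount is an illustrative value:

    import numpy as np

    def crossfeed(left, right, amount=0.15):
        # Mix a little of each channel into the other with reversed polarity
        # to tie the two independently generated compositions together.
        n = min(len(left), len(right))
        left, right = left[:n], right[:n]
        return left - amount * right, right - amount * left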
Sports Commentary
Sports Commentary Stereo Composition
Raindrops
Raindrops Stereo Composition
Guineafowls
Guineafowls Stereo Composition
Loons
Loons Stereo Composition
Birds
Birds Stereo Composition
To finish off this tour of the SoundMusic system as it exists now, we present a composition that combines output from the system and a human touch.
SC-1
22/04/2020
In the previous examples, the sounds used as the source for the process were quite limited, as we hadn't yet built a proper dataset. Well, that has changed. We now have a dataset of 119 sounds ranging from 30 seconds to 30 minutes in length. The sounds come in different formats, all supported by librosa. While we were first planning to use an automated script to collect the dataset, the fact that we would still have to check the entries manually afterwards turned us off that idea, so we ended up gathering the dataset by hand. The sounds all come from freesound.org and were collected by using each of the search terms in the list below and selecting appropriate results from the first page.
27/03/2020
When implementing the Ghosts Generator into the system, I at first made a mistake in the sorting of the frequencies, which resulted in taking the frequencies with the least energy instead of the ones with the most energy. While this was clearly not the intended behaviour, it was still interesting, and it got me thinking about the concept of complementary sounds: two sounds such that, when we sum their corresponding DFT amplitude matrices, every entry in the resulting matrix reaches the maximum value.
Thankfully, this means it is pretty easy to calculate the DFT of the complementary sound of a given sound: all you have to do is take a matrix where every value is the maximum amplitude value in the original DFT and subtract the original amplitudes from it. I also decided to invert all the values in the phase matrix, just for good measure. Also thankfully, once you have a DFT, estimating a signal that produces that DFT is somewhat trivial. The results of this were… underwhelming. It turns out most frequencies have really low values, so the complement of nearly every sound ends up rich in a lot of frequencies, and therefore super noisy.
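In code, the whole "true complement" amounts to a few lines on top of an STFT; this sketch uses a straight inverse STFT to get back to a signal, and the window sizes are arbitrary:

    import numpy as np
    import librosa

    def true_complement(y, n_fft=2048, hop=512):
        # Subtract every magnitude from the global maximum, flip the phases,
        # and invert the transform.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        comp = (mag.max() - mag) * np.exp(-1j * phase)
        return librosa.istft(comp, hop_length=hop, length=len(y))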
True Complement of Splashy Fishes
Attention: Super loud!
But then it hit me! What I want is a sound with high energy in the frequencies where the original sound has low energy, low energy in the frequencies where the original sound has high energy, and zero energy in the frequencies where the original sound has zero or near-zero energy. All I had to do to get that was create a binary mask from the entries in the original DFT that are higher than a threshold and apply that mask to the resulting DFT. This threshold is calculated by multiplying the average amplitude of the original DFT by a constant. In the examples, the constant was 10, as it worked well across the whole dataset.
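The masked version only adds the threshold and the mask; again a sketch, with the inverse STFT standing in for whatever signal estimation is used:

    import numpy as np
    import librosa

    def masked_complement(y, threshold_mult=10.0, n_fft=2048, hop=512):
        # Zero out the complement wherever the original's magnitude falls below
        # threshold_mult times its mean value.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        mask = mag > threshold_mult * mag.mean()
        comp = (mag.max() - mag) * mask * np.exp(-1j * phase)
        return librosa.istft(comp, hop_length=hop, length=len(y))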
Masked Complement of Splashy Fishes
Masked Complement of Ducks
Masked Complement of Chirpy Birds
Masked Complement of Swans
Masked Complement of the City
Masked Complement of Rain
Masked Complement of Sports
The examples below were generated by calculating the complement of the original audio, as described above, and then using the result as the input for a Ghosts Generator with the same parameters as the first one described here. I also extracted samples of 0.2 seconds or more from the input, by detecting onsets and backtracking them, and shifted their positions randomly by up to 5 seconds, to get some elements from the complement of the audio into the final result.
The Other Splashy Fishes
The Other Ducks
The Other Chirpy Birds
The Other Swans
The Other City
The Other Rain
The Other Sports
Fixes (27/03/2020)
Added an envelope to the samples to fix the clicking.
Fixed a bug in the Ghosts Generator's sorting of frequencies.
26/03/2020
This point marks a shift in SoundMusic, as we turn our attention from the traditional concept of notes to a more generic concept of sound. Beginning here, the basic building block of the SoundMusic output is not a note, characterized by a pitch and a duration, but a sound, stored internally as an array of samples representing a wave, the rate for those samples, and the point in time when that sound should be played relative to the beginning of the piece. As a first experiment with this revamped system, I decided to work with the simplest wave there is - the sine wave.
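In code, this basic block boils down to something like the following; the field names are hypothetical, only the three pieces of information matter:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Sound:
        samples: np.ndarray   # the wave itself, as an array of samples
        rate: int             # samples per second
        start: float          # when the sound plays, in seconds from the start of the piece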
The examples below are all composed of a series of sine waves generated from a source audio file. We start by applying a band-pass filter to the signal to get rid of unwanted frequencies. We then take the DFT of the signal and split it into an amplitude matrix and a phase matrix, normalizing the amplitudes by dividing every value in the matrix by the maximum amplitude. We also detect the onsets in the signal and select the frames of the DFT that correspond to those onsets. For each of these frames, we generate a sine wave of arbitrary length for each of the N frequencies with the highest amplitude and apply an ADSR envelope to it. The generated sine waves use the normalized amplitude and the phase taken from the DFT.
In the examples below, this process was applied twice to the input signal: first with a band-pass from 100 Hz to 2000 Hz, an amplitude multiplier of 0.5, selecting the 3 highest frequencies and generating sine waves of 1 second in duration; then with a band-pass from 20 Hz to 20,000 Hz, an amplitude multiplier of 0.7, selecting only the highest frequency and generating sine waves of half a second in duration.
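Put together, one pass of this generator can be sketched roughly as follows; the frequency-domain band mask stands in for the real band-pass filter, the ADSR envelope is omitted, and the names and defaults are illustrative:

    import numpy as np
    import librosa

    def ghost_sines(y, sr, low_hz, high_hz, amp_mult, n_sines, dur_s, n_fft=2048, hop=512):
        # For every onset frame, emit a sine at each of the n_sines strongest
        # in-band frequencies, scaled by the normalised amplitude.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        mag = mag / mag.max()
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
        band = (freqs >= low_hz) & (freqs <= high_hz)
        onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop)
        t = np.arange(int(dur_s * sr)) / sr
        sounds = []
        for frame in onset_frames:
            col = np.where(band, mag[:, frame], 0.0)
            for b in np.argsort(col)[-n_sines:]:
                wave = amp_mult * col[b] * np.sin(2 * np.pi * freqs[b] * t + phase[b, frame])
                sounds.append((frame * hop / sr, wave))   # (start time in seconds, samples)
        return sounds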
Fixes (27/03/2020)
Fixed a bug in the sorting of frequencies.
12/03/2020
The following examples were generated by a precursor to our system. It takes as input an audio file and a MIDI file, and renders the MIDI with sounds extracted from the audio. In the first two examples, the sounds were used unchanged; only some reverb and EQ were applied to hide artifacts from the manipulation. In the last two examples, the sounds were split into small parts of about 80 ms, which were then reassembled randomly to synthesize new sounds used to render the MIDI files. The pitch of some of these sounds is hard to identify, making it harder to recognise the original melodies in the result; however, the generated sounds have interesting textural properties.
The Mii Channel Music by Sampled Swans
Fauré's Pavane by Sampled Swans
The Mii Channel Music by Granulated Birds
Fauré's Pavane by Granulated Water