SoundMusic is a suite of AI-powered tools that generate music from audio files.
SoundMusic is a working name, for the time being...
This showcase is under construction.
28/06/2020
Over the past few months we've put together many of the previously explored concepts, along with some new ones, into a system capable of generating novel electroacoustic music (or at least sound art) from previously recorded audio. The system we have right now is versatile and was designed with human-machine collaboration in mind, allowing the human to intervene in many phases of the creation process or even take over at some points. SoundMusic splits the creative process into five distinct phases, each with a tangible result that can be used on its own or incorporated into the final output of the system.
The first phase of the process involves extracting salient sounds from the source audio. This is done by programmatically tweaking the parameters of a silence detection algorithm in order to maximize the number of non-silent sections detected within a certain duration interval. Below are some examples of the kinds of sounds that are extracted, followed by a rough sketch of the tuning loop.
Splash
Fragment 1
Fragment 2
Fragment 3
Birds
Fragment 1
Fragment 2
Fragment 3
Swans
Fragment 1
Fragment 2
Fragment 3
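As an illustration of the extraction step above, the tuning loop might look something like the following minimal Python sketch; the threshold range, duration bounds and function names here are illustrative, not the exact ones the system uses.

    import numpy as np
    import librosa

    def extract_fragments(path, min_dur=0.1, max_dur=2.0):
        # Sweep the silence threshold and keep the setting that yields the most
        # non-silent sections whose duration falls inside [min_dur, max_dur].
        y, sr = librosa.load(path, sr=None)
        best_intervals, best_count = [], -1
        for top_db in range(10, 80, 5):  # candidate thresholds, in dB below peak
            intervals = librosa.effects.split(y, top_db=top_db)
            durations = (intervals[:, 1] - intervals[:, 0]) / sr
            count = int(np.sum((durations >= min_dur) & (durations <= max_dur)))
            if count > best_count:
                best_count, best_intervals = count, intervals
        return [y[start:end] for start, end in best_intervals]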
The purpose of the synthesis phase is to create novel sounds to be used in the final piece, creating some detachment from the source audio. This phase uses a parametrized synth capable of generating a sound from a source audio through the combination of multiple synthesis techniques. This synth is composed of many sub-synths, each dedicated to a different kind of synthesis. This design makes it easy to introduce new sub-synths into the combined synth, leading to the creation of different sounds. The following synthesis techniques have been implemented:
Additive Synthesis
Additive synthesis is a technique that generates new sounds by adding together many waves, typically sine waves, to recreate the sound of an instrument. In our system, however, we also allow the combination of waves that aren't sine waves. The amplitude, frequency and phase of the combined waves are controlled by the input sound; in particular, the N bins with the highest average amplitude over time are considered. This synth has the following parameters:
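Whatever the exact parameter set, the core of the idea can be sketched like this, restricted to sine partials and with illustrative names and defaults:

    import numpy as np
    import librosa

    def additive_resynth(y, sr, n_partials=8, n_fft=2048, hop=512):
        # Rebuild a sound from the n_partials STFT bins with the highest average magnitude.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        bins = np.argsort(mag.mean(axis=1))[-n_partials:]   # strongest bins on average
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
        t = np.arange(len(y)) / sr
        frame_times = np.arange(mag.shape[1]) * hop
        out = np.zeros(len(y))
        for b in bins:
            # Stretch this bin's per-frame magnitude into a per-sample amplitude envelope.
            env = np.interp(np.arange(len(y)), frame_times, mag[b])
            out += env * np.sin(2 * np.pi * freqs[b] * t + phase[b, 0])  # initial phase only
        return out / np.max(np.abs(out))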
FM Synthesis
Frequency Modulation synthesis is a technique for generating sounds by modulating the frequency of a wave with another wave. In our specific case, a regular wave's frequency is modulated by the values resulting from running a pitch tracking algorithm on the original sound. We also modulate the amplitude of the wave with the amplitude of the original wave. This synth has the following parameters:
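A minimal sketch of this arrangement, assuming librosa's yin pitch tracker and an RMS envelope as a stand-in for the source amplitude (the actual implementation and parameters may differ):

    import numpy as np
    import librosa

    def fm_from_source(y, sr, carrier_hz=220.0, mod_depth=0.5, frame=2048, hop=512):
        # Track the source's pitch and loudness frame by frame.
        f0 = librosa.yin(y, fmin=50, fmax=2000, sr=sr, frame_length=frame, hop_length=hop)
        rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
        n = len(y)
        f0_s = np.interp(np.arange(n), np.arange(len(f0)) * hop, f0)     # per-sample pitch
        amp_s = np.interp(np.arange(n), np.arange(len(rms)) * hop, rms)  # per-sample loudness
        inst_freq = carrier_hz + mod_depth * f0_s                        # frequency modulation
        phase = 2 * np.pi * np.cumsum(inst_freq) / sr
        return amp_s * np.sin(phase)                                     # amplitude modulation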
PM Synthesis
Phase Modulation synthesis is similar to the previously described Frequency Modulation synthesis and achieves similar results. The main difference is that instead of modulating the frequency of a wave, we modulate its phase. We treat the two separately because of the way our implementations interact with the source sound: in this case, we use a regular wave to modulate the phase of the playback of the source wave, leading to very different results from the previous technique. This synth has the following parameters:
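In sketch form, modulating the playback phase of the source with a sine could look like this (a simplified illustration, not the exact implementation):

    import numpy as np

    def pm_playback(y, sr, mod_hz=3.0, depth_s=0.01):
        # Read the source back with a playback position wobbled by a sine wave,
        # i.e. a time-varying offset of up to depth_s seconds.
        n = len(y)
        t = np.arange(n)
        idx = np.clip(t + depth_s * sr * np.sin(2 * np.pi * mod_hz * t / sr), 0, n - 1)
        lo = np.floor(idx).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        frac = idx - lo
        return (1 - frac) * y[lo] + frac * y[hi]   # linear interpolation between samples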
Granular Synthesis
Granular synthesis is a technique for creating new sounds from a source sound by manipulating it on a very small temporal scale. Our specific implementation works by copying small segments from the source audio into random locations of the target audio, creating a new texture from the source material. The following parameters control this synth:
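A bare-bones version of this scattering process might look like the following; the grain size, grain count and Hann window are illustrative choices:

    import numpy as np

    def granular_texture(y, sr, out_dur=5.0, grain_dur=0.05, n_grains=2000, seed=0):
        # Copy short windowed grains from random positions in the source into
        # random positions of an initially silent target buffer.
        rng = np.random.default_rng(seed)
        grain_len = int(grain_dur * sr)
        out = np.zeros(int(out_dur * sr))
        window = np.hanning(grain_len)                      # avoid clicks at grain edges
        for _ in range(n_grains):
            src = rng.integers(0, len(y) - grain_len)
            dst = rng.integers(0, len(out) - grain_len)
            out[dst:dst + grain_len] += window * y[src:src + grain_len]
        return out / np.max(np.abs(out))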
Spectral Granular Synthesis
This technique is similar to the previous one, with the difference that instead of working with parts of the source sound wave, it works on the spectral representations of the source and target. This means that the grains can be shifted not only in the time domain but also in the frequency domain, which can create interesting new features from the source audio. The following parameters control this synth:
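The same idea transposed to the spectral domain can be sketched roughly as follows; the patch sizes and counts are arbitrary placeholders:

    import numpy as np
    import librosa

    def spectral_granular(y, n_grains=400, grain_bins=32, grain_frames=8, seed=0):
        # Copy small time-frequency patches of the source STFT into random
        # positions of a target STFT, shifting them in both time and frequency.
        S = librosa.stft(y)
        out = np.zeros_like(S)
        rng = np.random.default_rng(seed)
        n_bins, n_frames = S.shape
        for _ in range(n_grains):
            b_src = rng.integers(0, n_bins - grain_bins)
            f_src = rng.integers(0, n_frames - grain_frames)
            b_dst = rng.integers(0, n_bins - grain_bins)       # frequency shift
            f_dst = rng.integers(0, n_frames - grain_frames)   # time shift
            out[b_dst:b_dst + grain_bins, f_dst:f_dst + grain_frames] += \
                S[b_src:b_src + grain_bins, f_src:f_src + grain_frames]
        return librosa.istft(out, length=len(y))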
Combined Synthesis
All of the above techniques are combined into a single synth that contains an instance of each of the previous synths. This synth inherits all of the parameters of its sub-synths, plus enough parameters to generate a wave that modulates the amplitude of the output of each sub-synth. For each of the previously described synths, this synth also has the following parameters:
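Conceptually, the combined synth just sums its sub-synths' outputs, each gated by its own modulating wave. A toy version, with hypothetical callables and a simple sine modulator, might be:

    import numpy as np

    def combined_synth(y, sr, sub_synths, mod_params):
        # sub_synths: list of callables taking (y, sr); mod_params: one (freq_hz, depth)
        # pair per sub-synth, controlling its amplitude-modulating wave.
        out = None
        for synth, (freq, depth) in zip(sub_synths, mod_params):
            s = synth(y, sr)
            t = np.arange(len(s)) / sr
            s = s * (1.0 - depth * 0.5 * (1 + np.sin(2 * np.pi * freq * t)))
            out = s if out is None else out[:len(s)] + s[:len(out)]
        return out / np.max(np.abs(out))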
The final synth's parameters can either be completely random or generated through a Genetic Algorithm. The Genetic Algorithm treats the list of parameters controlling each individual synth as its genotype, and an example of sound produced by the synth as its phenotype. Since each synth can generate a different sound for each source audio, we evaluate n randomly selected examples and take the average of their scores as the synth's final score. Evaluating the quality of the sounds is a subjective task, and as such it is one where human input is needed. The first option is running a standard Interactive Genetic Algorithm, in which the user evaluates each sample's fitness. While this gives the user a lot of control over the result of this phase, it is a rather time-consuming process, so we also offer the possibility of delegating it to a machine learning component. SVMs have proven effective in audio classification tasks and perform well with relatively few examples. The user can train an SVM-based regression model on sounds they have rated themselves, and then use this model as the fitness function in the genetic algorithm. We have used this option to quickly generate examples from our large dataset and it has produced desirable results. Furthermore, multiple passes of the synths can be applied to the audio, using the output of the first pass as input for the second, achieving different sounds that can be seen as compositions of multiple instances of the base synth. While the program can go on to produce an entire composition, it can also be interesting to stop here and use the generated sounds as the basis for human-made compositions.
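In sketch form, the machine-learning shortcut looks roughly like this; the mean-MFCC features and the helper names are placeholders for whatever the system actually feeds the SVM:

    import numpy as np
    import librosa
    from sklearn.svm import SVR

    def sound_features(y, sr):
        # Summarise a sound as its mean MFCC vector.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    def train_fitness(rated_sounds, ratings, sr):
        # Fit an SVM regressor on user-rated sounds and return it as a fitness function.
        X = np.array([sound_features(y, sr) for y in rated_sounds])
        model = SVR().fit(X, ratings)
        return lambda y: float(model.predict(sound_features(y, sr).reshape(1, -1))[0])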
The sounds generated in the synthesis phase are then loaded into samplers that serve as the "instruments" throughout the generated composition. These samplers organize sounds in a Cartesian space with three dimensions: pitch (in Hz), duration (in seconds) and volume (in dB). Each sampler is then controlled by a stream of commands in the form of three-dimensional vectors indicating the pitch, duration and volume of the desired sound. For each command, the sampler produces a sound by interpolating the n closest sounds to the desired point in the space, pitch-shifting the result to the desired pitch, and applying a volume envelope with the desired duration.
There are two methods to achieve the interpolation of the sounds. The preferred method, as it is quicker and provides better results in the context of this thesis, is directly interpolating the samples that represent each of the sounds. An alternative method uses the NSynth WaveNet auto-encoder from the Magenta project: the samples from both sounds are encoded, and the encoded representations are interpolated and decoded into the final sound. While we believe this method should result in interesting sounds, the observed results are currently sub-par compared to the more straightforward method. As our sounds are rather different from the ones the Magenta team used to train the auto-encoder, we believe that retraining it on more similar examples should lead to better results; however, the resources and time required to do so are beyond the scope of this work.
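As an illustration of the direct method, a sampler lookup could be sketched like this; the bank structure, the inverse-distance weighting and the omission of pitch shifting and enveloping are all simplifications:

    import numpy as np

    def sampler_render(bank, command, n_neighbors=2):
        # bank: dict mapping (pitch_hz, dur_s, vol_db) points to sample arrays.
        # command: the desired (pitch_hz, dur_s, vol_db) vector.
        points = np.array(list(bank.keys()))
        sounds = list(bank.values())
        dist = np.linalg.norm(points - np.array(command), axis=1)
        nearest = np.argsort(dist)[:n_neighbors]
        weights = 1.0 / (dist[nearest] + 1e-9)
        weights /= weights.sum()
        length = min(len(sounds[i]) for i in nearest)
        return sum(w * sounds[i][:length] for w, i in zip(weights, nearest))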
After the samplers are generated, we need a sequence of commands to feed the samplers in order to complete this phase. This sequence of commands can be derived from a MIDI file provided by the user, or can be generated from the input audio. The process to generate meaningful sequences of commands from the input audio is as follows:
The results of this phase can also be used on their own, as part of a human-composed piece, or can be used by the following phases of the process to be integrated into the final output of the system.
Khunan
Khunan Fragments
Surf
Surf Fragments
Passing Train
Passing Train Fragments
This phase of the process is entirely optional and works on a completely different plane, in order to solve an aesthetic problem with the product of the previous phase. Due to the way the commands are generated, undesirable chunks of silence can end up in the final result. To fix this, the system generates an evolving drone that serves as a backdrop to the piece. The drone is created by taking short segments of the source audio, applying a band-pass filter, and looping them. The amplitude of each loop is then modulated by sine waves with different amplitudes, frequencies and durations. A drone is therefore characterized by:
A population of drones is generated through a genetic algorithm, using as a fitness function the average distance between the spectrogram of the generated drone and the spectrograms of random segments of the source audio. The final population of drones is combined by modulating the frequency of each drone with a lower-frequency sine wave, each with a different phase, and adding up the resulting waves. The final result is a slowly evolving texture that resembles the texture of the original sound without focusing on any particular moment.
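Leaving the genetic algorithm and the final mixing aside, a single drone voice can be sketched as a band-passed loop under a slow sine modulator; the parameter names and filter order here are illustrative:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def drone_voice(y, sr, start_s, seg_dur, low_hz, high_hz, lfo_hz, lfo_depth, out_dur=30.0):
        # One drone voice: a short band-passed segment of the source, looped and
        # amplitude-modulated by a slow sine wave.
        seg = y[int(start_s * sr):int((start_s + seg_dur) * sr)]
        sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
        seg = sosfilt(sos, seg)
        loop = np.tile(seg, int(np.ceil(out_dur * sr / len(seg))))[:int(out_dur * sr)]
        t = np.arange(len(loop)) / sr
        lfo = 1.0 - lfo_depth * 0.5 * (1 + np.sin(2 * np.pi * lfo_hz * t))
        return loop * lfo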
Phaser
Phaser Drone
Effects
Effects Drone
Lullaby
Lullaby Drone
The final phase involves joining the fragments and the drone into a coherent final product. A fade-in and fade-out are applied to the drone to give the piece a sense of beginning and ending. The drone also fades out slightly while the fragments are playing, to help that top layer stand out.
Volleyball
Volleyball Composition
Strings
Strings Composition
Underwater
Underwater Composition
In order to further glue the composition together and add more interest, it is possible to add reverb. Instead of adding the same reverb to all the pieces, we use a convolution reverb. Convolution reverb works by taking an impulse response recording of a space and convolving it with the recording to which we want to apply the reverb. In our case, instead of using a pre-recorded response, we try to estimate the response of the space captured in the source audio. This is done by running an onset detection algorithm on the audio and taking samples of the sound that follows each onset but precedes the next one, or in musical terms, the space between notes. We then apply a decay to the result, as the source audio doesn't have a real decay in most cases. The system also outputs this room sound, so that it can be applied to other compositions.
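A sketch of this estimation and of applying the resulting reverb, assuming librosa's onset detector and a plain FFT convolution (the tail length, decay constant and wet/dry mix are made up):

    import numpy as np
    import librosa
    from scipy.signal import fftconvolve

    def room_response(y, sr, tail_s=0.3, decay=4.0):
        # Average the audio that follows each onset (but ends before the next one)
        # and impose an artificial exponential decay. Assumes enough spaced onsets.
        onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples", backtrack=True)
        n_tail = int(tail_s * sr)
        tails = [y[s:s + n_tail] for s, nxt in zip(onsets[:-1], onsets[1:])
                 if nxt - s >= n_tail]
        ir = np.mean(tails, axis=0)
        return ir * np.exp(-decay * np.arange(n_tail) / sr)

    def apply_reverb(dry, ir, wet=0.3):
        wet_sig = fftconvolve(dry, ir)[:len(dry)]
        return (1 - wet) * dry + wet * wet_sig / (np.max(np.abs(wet_sig)) + 1e-9)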
Football
Football Composition
Football Composition with Reverb
Loons
Loons Composition
Loons Composition with Reverb
Finally, while the system was initially built with a single-channel output in mind, basic stereo support was added. Stereo works by generating two different compositions, one for the right channel and another for the left. Each composition has a different drone and different fragments, but the density of fragments is half of what it was in the mono version. This results in a piece with the same overall density of fragments when combining the two channels, except that half the fragments come from each channel. Simply playing each piece on its own channel results in a less-than-ideal experience, as the sounds from each channel feel disconnected, possibly even disorienting the listener. To fix this, a bit of the right channel is mixed into the left channel, with reversed polarity, and vice versa. This helps tie both channels together, resulting in a more satisfying stereo experience. While this process is rather simple and required very few modifications to the existing system, it is not without its flaws. All of the sounds come either from the left or from the right, instead of from different points in space, and there is no underlying logic to where the sounds come from, as the two channels are generated independently.
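The cross-feed itself is essentially one line per channel; something along these lines, where the amount is an illustrative value:

    import numpy as np

    def crossfeed(left, right, amount=0.15):
        # Mix a little of each channel into the other with reversed polarity
        # to tie the two independently generated compositions together.
        n = min(len(left), len(right))
        left, right = left[:n], right[:n]
        return left - amount * right, right - amount * left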
Sports Commentary
Sports Commentary Stereo Composition
Raindrops
Raindrops Stereo Composition
Guineafowls
Guineafowls Stereo Composition
Loons
Loons Stereo Composition
Birds
Birds Stereo Composition
To finish off this tour of the SoundMusic system as it exists now, we present a composition that combines output from the system and a human touch.
SC-1
22/04/2020
In the previous examples, the sounds used as the source for the process were quite limited, as we hadn't yet built a proper dataset. Well, that has changed. We now have a dataset of 119 sounds ranging from 30 seconds to 30 minutes in length. The sounds come in different formats, all supported by librosa. While we were first planning to use an automated script to collect the dataset, the fact that we would still have to check the entries manually afterwards turned us off that idea, so we ended up gathering the dataset by hand. The sounds all come from freesound.org and were collected by using each of the search terms in the list below and selecting appropriate results from the first page.
27/03/2020
When implementing the Ghosts Generator into the system, I at first made a mistake in the sorting of the frequencies, which resulted in taking the frequencies with the least energy instead of the ones with the most energy. While this was clearly not the intended behaviour, it was still interesting, and it got me thinking about the concept of complementary sounds: two sounds such that, when we sum their corresponding DFT amplitude matrices, every entry in the resulting matrix reaches the maximum value.
Thankfully, this means it is pretty easy to calculate the DFT of the complementary sound of a given sound: all you have to do is take a matrix where every value is the maximum amplitude value in the original DFT and subtract the original amplitudes from it. I also decided to invert all the values in the phase matrix, just for good measure. Also thankfully, once you have a DFT, estimating a signal that produces that DFT is somewhat trivial. The results of this were… underwhelming. It turns out most frequencies have really low values, so the complement of nearly every sound ends up rich in a lot of frequencies, and therefore super noisy.
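In code, the whole "true complement" amounts to a few lines on top of an STFT; this sketch uses a straight inverse STFT to get back to a signal, and the window sizes are arbitrary:

    import numpy as np
    import librosa

    def true_complement(y, n_fft=2048, hop=512):
        # Subtract every magnitude from the global maximum, flip the phases,
        # and invert the transform.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        comp = (mag.max() - mag) * np.exp(-1j * phase)
        return librosa.istft(comp, hop_length=hop, length=len(y))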
True Complement of Splashy Fishes
Attention: Super loud!
But then it hit me! What I want is a sound with high energy in the frequencies where the original sound has low energy, low energy in the frequencies where the original sound has high energy, and zero energy in the frequencies where the original sound has zero or near-zero energy. All I had to do to get that was create a binary mask from the entries in the original DFT that are higher than a threshold and apply that mask to the resulting DFT. This threshold is calculated by multiplying the average amplitude of the original DFT by a constant. In the examples, the constant was 10, as it worked well across the whole dataset.
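The masked version only adds the threshold and the mask; again a sketch, with the inverse STFT standing in for whatever signal estimation is used:

    import numpy as np
    import librosa

    def masked_complement(y, threshold_mult=10.0, n_fft=2048, hop=512):
        # Zero out the complement wherever the original's magnitude falls below
        # threshold_mult times its mean value.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        mask = mag > threshold_mult * mag.mean()
        comp = (mag.max() - mag) * mask * np.exp(-1j * phase)
        return librosa.istft(comp, hop_length=hop, length=len(y))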
Masked Complement of Splashy Fishes
Masked Complement of Ducks
Masked Complement of Chirpy Birds
Masked Complement of Swans
Masked Complement of the City
Masked Complement of Rain
Masked Complement of Sports
The examples below were generated by calculating the complement of the original audio, as described above, and then using the result as the input for a Ghosts Generator with the same parameters as the first one described here. I also extracted samples of 0.2 seconds or more from the input, by detecting onsets and backtracking them, and shifted their positions randomly by up to 5 seconds, to get some elements from the complement of the audio into the final result.
The Other Splashy Fishes
The Other Ducks
The Other Chirpy Birds
The Other Swans
The Other City
The Other Rain
The Other Sports
Fixes (27/03/2020)
Added an envelope to the samples to fix the clicking.
Fixed a bug in the Ghosts Generator's sorting of frequencies.
26/03/2020
This point marks a shift in SoundMusic, as we turn our attention from the traditional concept of notes to a more generic concept of sound. Beginning here, the basic building block of the SoundMusic output is not a note, characterized by a pitch and a duration, but a sound, stored internally as an array of samples representing a wave, the rate for those samples, and the point in time when that sound should be played relative to the beginning of the piece. As a first experiment with this revamped system, I decided to work with the simplest wave there is - the sine wave.
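In code, this basic block boils down to something like the following; the field names are hypothetical, only the three pieces of information matter:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Sound:
        samples: np.ndarray   # the wave itself, as an array of samples
        rate: int             # samples per second
        start: float          # when the sound plays, in seconds from the start of the piece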
The examples below are all composed of a series of sine waves generated from a source audio file. We start by applying a band-pass filter to the signal to get rid of unwanted frequencies. We then take the DFT of the signal and split it into an amplitude matrix and a phase matrix, normalizing the amplitudes by dividing every value in the matrix by the maximum amplitude. We also detect the onsets in the signal and select the frames of the DFT that correspond to those onsets. For each of these frames, we generate a sine wave of arbitrary length for each of the N frequencies with the highest amplitude and apply an ADSR envelope to it. The generated sine waves use the normalized amplitude and the phase taken from the DFT.
In the examples below, this process was applied twice to the input signal: first with a band-pass from 100 Hz to 2000 Hz, an amplitude multiplier of 0.5, selecting the 3 highest frequencies and generating sine waves of 1 second in duration; then with a band-pass from 20 Hz to 20,000 Hz, an amplitude multiplier of 0.7, selecting only the highest frequency and generating sine waves of half a second in duration.
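Put together, one pass of this generator can be sketched roughly as follows; the frequency-domain band mask stands in for the real band-pass filter, the ADSR envelope is omitted, and the names and defaults are illustrative:

    import numpy as np
    import librosa

    def ghost_sines(y, sr, low_hz, high_hz, amp_mult, n_sines, dur_s, n_fft=2048, hop=512):
        # For every onset frame, emit a sine at each of the n_sines strongest
        # in-band frequencies, scaled by the normalised amplitude.
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        mag = mag / mag.max()
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
        band = (freqs >= low_hz) & (freqs <= high_hz)
        onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop)
        t = np.arange(int(dur_s * sr)) / sr
        sounds = []
        for frame in onset_frames:
            col = np.where(band, mag[:, frame], 0.0)
            for b in np.argsort(col)[-n_sines:]:
                wave = amp_mult * col[b] * np.sin(2 * np.pi * freqs[b] * t + phase[b, frame])
                sounds.append((frame * hop / sr, wave))   # (start time in seconds, samples)
        return sounds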
Fixes (27/03/2020)
Fixed a bug in the sorting of frequencies.
12/03/2020
The following examples were generated by a precursor to our system. It takes as input an audio file and a MIDI file, and renders the MIDI with sounds extracted from the audio. In the first two examples, the sounds were used unchanged; only some reverb and EQ were applied to hide artifacts from the manipulation. In the last two examples, the sounds were split into small parts of about 80 ms, which were then reassembled randomly to synthesize new sounds used to render the MIDI files. The pitch of some of these sounds is hard to identify, making it harder to recognise the original melodies in the result; however, the generated sounds have interesting textural properties.
The Mii Channel Music by Sampled Swans
Fauré's Pavane by Sampled Swans
The Mii Channel Music by Granulated Birds
Fauré's Pavane by Granulated Water