I think we can all agree that more great musicians are dead than alive. Modern technology cannot resurrect Ian Curtis, Kurt Cobain and Lemmy, but we believe that through AI more songs inspired by these great (but dead!) musicians can be written.
Hey Ho, Let's Go!
In this project we used 130 songs in MIDI format (60 unique songs and 70 variations thereof) by the most amazing punk rock band ever, The Ramones, as well as the lyrics of all 178 of their songs. A deep learning neural network was then trained on all that data; not big data, but still. After about 45 minutes of training on an AWS GPU instance, the network had learned to write guitar and bass lines for Ramones-inspired music.
Regular deep learning neural networks are great for static data, for example for classifying photos (cats vs. dogs). Music and text have a temporal component, so we need stateful networks, i.e. deep learning networks with memory. For this project (lyrics & music generation) we chose Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs). More information on these networks can be found on the blog of Andrej Karpathy.
Please also check out the slides of the final presentation of this project on Slideshare or YouTube, and the code on GitHub.
First we downloaded all the MIDI-formatted songs we could find on the internet and transposed all of them to C major. The recommended software for dealing with MIDI files is MuseScore, but Apple's GarageBand can also open the files easily. Be aware that MIDI is an old and ugly format; dealing with it is a bit of a nightmare.
For the whole project we used Python versions 2.7 and 3.6. Our library of choice for manipulating the MIDI input data was music21. The library is powerful but also buggy, and we needed several workarounds to extract all the tracks we wanted.
As not enough vocal tracks could be found, we decided to focus solely on the guitar and bass tracks, which we extracted from all the transposed songs and concatenated serially. After inspecting the music, we quantized the songs in sixteenth (1/16) notes, i.e. the base note became a "1/16-th note". In addition to the 6 parallel tracks of the guitar (one per string) and the one track of the bass, we added 2 bits for "mute" and 2 bits for "hold". The "hold" bits allow notes longer than 1/16-th to be encoded: such a note is repeated over consecutive 1/16-th slots with the hold bit set to "1".
The numbers encoding the pitch (0..127) of each note were directly taken from the MIDI files.
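The hold-bit scheme can be illustrated with a toy sketch (the function name and the tuple layout below are our own inventions for illustration, not the project's actual code):

```python
SIXTEENTH = 0.25  # duration of a 1/16-th note, measured in quarter notes

def encode_note(pitch, quarter_length):
    """Expand one MIDI pitch (0..127) into 1/16-th slots with a hold bit.

    The first slot is the note onset (hold=0); every further slot repeats
    the pitch with hold=1, so note lengths survive the quantization.
    """
    n_slots = int(round(quarter_length / SIXTEENTH))
    return [(pitch, 0)] + [(pitch, 1)] * (n_slots - 1)

# a half note (2 quarter notes) on MIDI pitch 40 (E2, the low E of a bass)
slots = encode_note(40, 2.0)  # 8 slots: one onset, seven "held" repeats
print(slots)
```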
By chaining all 130 songs we ended up with about 150'000 lines of 1/16-th notes, of which about 2'200 were unique combinations of guitar, bass and mute/hold bits. These ~2'200 unique combinations were one-hot encoded ("bag of words") and used to train an LSTM neural network. The chosen input length was 32 or 64, i.e. 2 or 4 bars were used as inputs.
The training data (X) was generated by splitting the notes into blocks of either 32 or 64 one-hot-encoded, 1/16-th-quantized chords (guitar & bass); the expected output (y) corresponding to each input block was the chord that followed it. The output vector (y) was one-hot encoded as well.
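A minimal sketch of this splitting and one-hot encoding, run on a toy chord sequence (the names and the tiny vocabulary are illustrative, not the project's real data):

```python
import numpy as np

maxlen, step = 32, 3        # block length and stride between training samples
chords = [0, 1, 2, 3] * 30  # toy sequence of chord indices (real data: ~150'000)
vocab = 4                   # number of unique chord combinations (real data: ~2'200)

X_idx, y_idx = [], []
for i in range(0, len(chords) - maxlen, step):
    X_idx.append(chords[i:i + maxlen])  # input: maxlen consecutive 1/16-th chords
    y_idx.append(chords[i + maxlen])    # target: the chord that follows the block

# one-hot encode inputs (X) and targets (y)
X = np.zeros((len(X_idx), maxlen, vocab), dtype=bool)
y = np.zeros((len(y_idx), vocab), dtype=bool)
for s, seq in enumerate(X_idx):
    for t, c in enumerate(seq):
        X[s, t, c] = True
    y[s, y_idx[s]] = True

print(X.shape, y.shape)
```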
The networks used were LSTM-RNNs programmed with Keras on TensorFlow; sizes varied, but the following is one configuration that provided good results:
The code, written in Python with Keras:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

maxlen = 32  # max. sequence length: 32 or 64 1/16-th notes
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(maxlen, len(words))))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(len(words)))  # len(words): number of unique chord combinations
model.add(Activation('softmax'))
```
The following LSTM-RNNs were implemented and trained, resulting in the losses listed below. All used the model structure shown above, varying only the sequence length (maxlen), the stride between training samples (step) and the layer configuration:

| Implementation | maxlen | step | Layers | Min. loss | Total params |
|---|---|---|---|---|---|
| 1 | 64 | 3 | 2 × LSTM(128) | 0.8142 | 1'700'136 |
| 2 | 32 | 3 | 2 × LSTM(128) | 0.9395 | 1'700'136 |
| 3 | 32 | 1 | 2 × LSTM(128) | 0.9009 | 1'700'136 |
| 4 | 64 | 1 | 2 × LSTM(128) | 0.9119 | 1'700'136 |
| 5 | 32 | 1 | 2 × LSTM(64) | 1.1968 | 802'088 |
| 6 | 32 | 1 | 1 × LSTM(128) | 0.8712 | 1'568'552 |
For all implementations the batch size was kept constant at 128. No clear relation between the minimum loss and the structure of the LSTM-RNN or the training data was apparent. It can, however, be seen that the loss is larger when there are fewer parameters to train (cf. Implementation 5). All RNNs were trained until the loss seemed to have reached (or was stuck in) a (local) minimum. The training data of about 150'000 chord sequences is probably too small for reliable results and more meaningful interpretations.
After about 50 to 120 epochs of training (depending on the network), the error rate stopped decreasing; this took around 20-50 minutes on an NVIDIA Tesla K80 GPU. The amount of training data could be reduced by increasing the step size; this had a big (linear) impact on the training times but did not affect the final error much.
No validation data was put aside for verification/testing.
Not only the weights of the fully trained networks were stored, but also intermediate results. In some cases these produced better results; this effect was, however, not investigated further.
We scraped websites to download the lyrics of all 178 Ramones songs; this was a rather straightforward task using the Python library BeautifulSoup.
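A hedged sketch of the scraping step; the HTML below is an invented stand-in for the real lyrics pages, whose markup we do not reproduce here:

```python
from bs4 import BeautifulSoup

# hypothetical page structure: a <div class="lyrics"> with <br/>-separated lines
html = """
<div class="lyrics">
  Hey ho, let's go!<br/>
  They're forming in a straight line<br/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# collect the non-empty text fragments inside the lyrics container
lines = list(soup.find(class_="lyrics").stripped_strings)
print(lines)
```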
Some first statistics were then computed from the lyrics:
Statistics from the lyrics of the 178 songs:

| | |
|---|---|
| Corpus size | 6'361 lines, 29'976 words, 154'207 characters |
| Most frequent word | “I” - 1351 occurrences |
| Most frequent noun | “baby” - 272 occurrences |
| Most frequent verb | “go” - 235 occurrences |
| Other notable words | “yeah” (196), “love” (168), “punk” (98) |
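Such statistics are easy to reproduce with the Python standard library; the two-line sample below is illustrative only (the real numbers above come from the full 178-song corpus):

```python
import re
from collections import Counter

lyrics = "Hey ho, let's go!\nThey're forming in a straight line\n"

lines = lyrics.strip().split("\n")          # line count
words = re.findall(r"[a-z']+", lyrics.lower())  # crude word tokenizer
counts = Counter(words)                     # word frequencies

print(len(lines), len(words), len(lyrics))
print(counts.most_common(3))
```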
> all of a sudden i feel son gone a gas i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, i'm alive, yeah, oh yeah
LSTM-RNNs produce impressive results, but they also require vast amounts of data and computational power. We should not forget that Markov chains, too, can "learn" from a training text and then write "original" texts. An interesting blog entry on Markov chains for text generation can be found here: The unreasonable effectiveness of Character-level Language Models.
Examples of Markov chain model outputs:

**4-gram:**

> m child all / he's prime / baby, but there my family cretin family / everything's rock, rocker / i'll never that you can / stay after go / but your high / and i love here i turn town on now / it's geek / do your temployment / i fell / police day. that i wash you're to walk to don't care / i said ever go to you can't want to the club / you're goes it just can humanking my poor body put my tears i'll all good tired potato turnsey / someo / leave me somehow i gets so be / because i'm nothing in my before complete / well my eyes / make the othere's that can't he law and place / in the to there / we drag / fast

**6-gram:**

> ay-o / misfits, twilight zone / i really killing lies hypnotized / twenty-twenty-twenty-twenty-twenty-twenty four hours to go / oohh! / go lil' camaro go / no it's like charles manson / go to be lonely / just pass it by. / when i knew it away from the concert when the time, sharona? / oo younger kind / maybe i wanna be that crime / i want the bbi / i let me walking down the airwaves / we want you up / 'cause she's the end, the corner / sheena is a punk a punk rocker / sheena is a punk rocker / sheena is a punk punk, a punk rocker now / your eye

**10-gram:**

> she's that kinda girl / oh baby, do you wanna dance / yeah, yeah, yeah, yeah, there's no success for me / involved in a robberry / there's no reason to live. / i'm the man who make / the street his home. / and my lean, mean heart / i just wanna have to shout it out / i don't have brain damage / i'm not your enemy / girl, i am your friend / come with us and find / the pleasures of a journey to the center of the mirror see your stupid face / what a disgrace man and you know your biggest problem / is the water as high as it'll get / super wet, mom told me, / every chance i have seen / trying to forget / and your friend / come with us and find / the pleasure and pain / i used to be on an endless run. /
Already with a 10-gram Markov chain, the example output above contains only a single typo, and this at a computational cost that a regular PC handles in a tenth of a second (vs. an LSTM-RNN, which takes tens of minutes to train on a dedicated GPU).
On the other hand, comparing the generated output to the original training text (the lyrics of the 178 songs), it becomes obvious that many lines of the above texts are simply copy-pasted and rearranged.
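To make the comparison concrete, a character-level n-gram Markov chain fits in a few lines of plain Python (a generic sketch, not the code behind the outputs above):

```python
import random
from collections import defaultdict

def train_markov(text, n):
    # map every n-character context to the list of characters that followed it
    model = defaultdict(list)
    for i in range(len(text) - n):
        model[text[i:i + n]].append(text[i + n])
    return model

def generate(model, seed, length, n):
    out = seed
    for _ in range(length):
        choices = model.get(out[-n:])  # characters seen after the last n chars
        if not choices:
            break
        out += random.choice(choices)  # frequent continuations are drawn more often
    return out

# train on a (very) small corpus and continue from a seed
text = "hey ho, let's go! " * 3
model = train_markov(text, n=4)
print(generate(model, "hey ", length=9, n=4))
```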
Do I believe that we are ready to ditch our MP3 music collections and start listening to computer-generated music?
Probably not!
Do I think that computer-generated "AI" music can be used as an input for creative musicians, as a basis to create good music?
Definitely! And that's what we did...
Hey Ho, Let's Go!
The various LSTM-RNN implementations generated lyrics as well as music in the MIDI format. These were then used by a real musician as raw material for proper punk rock songs.
The various configurations produced several Ramones-inspired songs in the MIDI format, i.e. the guitar and bass tracks. The quality of the generated music varied greatly, and the "variability" of the songs could be adjusted via the "diversity" (temperature) of the output sampling. Three examples that were then used as inspiration to write a punk rock song are shown below:
Three MIDI songs generated by three LSTM-RNN implementations:

| | Implementation 1 | Implementation 2 | Implementation 3 |
|---|---|---|---|
| LSTM-RNN | single layer 128 | single layer 64 | double layer 128 |
| Weights | leading to lowest loss | leading to lowest loss | leading to medium loss |
| Diversity | 0.9 | 1.1 | 1.1 |
| Downloads | Scoresheet (PDF), MIDI song | Scoresheet (PDF), MIDI song | Scoresheet (PDF), MIDI song |
The initial seed for all three songs was the same 2 bars (32 1/16-th notes) of the Ramones song "Beat on the Brat". Depending on the implementation as well as the "diversity", the songs turned out very differently; the higher the "diversity", the more "chaotic" or "creative" the output becomes. Using weights that did not lead to the lowest loss may produce more "creative" results as well. Many parameters can be adjusted, leading to a wealth of results.
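The "diversity" knob corresponds to the usual temperature-scaled sampling of the softmax output, as popularized by Karpathy's char-rnn examples; the function below is a generic sketch, not the project's exact code:

```python
import numpy as np

def sample(preds, temperature=1.0):
    """Draw the next token index from a distribution reweighted by temperature.

    temperature < 1 sharpens the distribution (safer, more repetitive output);
    temperature > 1 flattens it (more "chaotic"/"creative" output).
    """
    preds = np.asarray(preds, dtype=np.float64)
    logits = np.log(preds + 1e-12) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(np.random.multinomial(1, probs, 1)))
```

At very low temperature this almost always picks the most probable chord; at high temperature it explores the whole distribution.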
**LSTM neural network generated:**

> fight for money, fight for fun i want you and i want you and you wanna see a hold i don't wanna be all my baby i hell never was gotta want to be don't want to be alright i wanted it a good from me i don't wanna be all i want

**Markov chain generated:**

> i'm the man who make the street his home. and my lean, mean heart i just wanna have to shout it out i don't have brain damage i'm not your enemy girl, i am your friend come with us and find the pleasures of a journey to the center
**Diversity: 0.1**

> we got to stop this crazy care i don't wanna be a real me the street the street the time the thing the tale i don't wanna be a girl in the street the danger the time i don't wanna be a place the street the tood the street the

**Diversity: 0.2**

> we got to stop this crazy care i don't wanna be a get the street the threed the street the danger now i don't wanna be a pink and the tame and i don't wanna be a good the girl i am a tio the thing the street the time i want

**Diversity: 0.5**

> we got to stop this crazy care it's a lot to be on the place i can't get a lot to me to die, i was baby, baby i'm a the friends to poy i can't let the street to do now i wanna be a place
*(Score sheet images: pre-chorus & middle part, and verse & chorus)*
**I'm alive**

> No reason to live
> Mommy told me
> No success for me
> Blow every chance I see
> But my lean, mean heart
> Just wanna shout it out
> i'm alive, i'm alive, i'm alive,
> I'm alive, yeah, yeah, yeah!
>
> I fight for money
> I'm not your enemy
> I fight for fun
> I am your friendy
> Your biggest problem is
> I don't want to be alright
> i'm alive, i'm alive, i'm alive,
> I'm alive, yeah, yeah, yeah!
>
> Do you wanna dance?
> Come on take a chance
> Come see your stupid face
> In the center of the mirror
> i'm alive, i'm alive, i'm alive,
> I'm alive, yeah, yeah, yeah!
>
> I'm alive!
Follow us on Twitter & receive lyrics from THE RAiMONES every 6 hours!
(powered by AWS Lambda; consuming 3'708 ms of compute per tweet)