This work presents a music timbre transfer model that transfers the style of a music clip while preserving its semantic content. In contrast to existing music timbre transfer models, our model achieves many-to-many timbre transfer between different instruments.
The proposed method involves only an autoencoder framework, which comprises two pretrained encoders and one decoder trained in an unsupervised manner. To learn more representative features for the encoders, we produced a parallel dataset, called MI-Para, synthesized from MIDI files with digital audio workstations (DAWs).
Both the objective and the subjective evaluation results show the effectiveness of the proposed framework. To broaden the application scenario, we also demonstrate that our model can achieve style transfer when trained in a semi-supervised manner with a smaller parallel dataset.
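For readers who want a concrete picture of the two-encoder/one-decoder setup described above, the following is a minimal PyTorch sketch. All module names, layer sizes, the latent dimensions, and the mel-spectrogram input shape are illustrative assumptions, not the exact architecture used in this work.

```python
# Minimal sketch of a two-encoder / one-decoder autoencoder.
# Layer sizes, names, and the mel-spectrogram input shape are assumptions
# made for illustration, not the exact architecture of this work.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_mels=80, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, latent_dim, kernel_size=5, padding=2),
        )

    def forward(self, mel):              # mel: (batch, n_mels, frames)
        return self.net(mel)             # (batch, latent_dim, frames)

class Decoder(nn.Module):
    def __init__(self, n_mels=80, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * latent_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, content_z, timbre_z):
        # Concatenate content and timbre codes, then reconstruct the spectrogram.
        return self.net(torch.cat([content_z, timbre_z], dim=1))

content_enc = Encoder()   # pretrained, kept fixed while the decoder is trained
timbre_enc = Encoder()    # pretrained, kept fixed while the decoder is trained
decoder = Decoder()

mel = torch.randn(4, 80, 200)                       # dummy batch of mel spectrograms
recon = decoder(content_enc(mel), timbre_enc(mel))  # (4, 80, 200)
loss = nn.functional.l1_loss(recon, mel)            # unsupervised reconstruction loss
```

At transfer time, the same decoder would be fed the content code of one clip and the timbre code of a clip from a different instrument.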
MI-Para Dataset
We collected the MIDI files of 21 Western pop songs and cartoon songs (2,700 seconds in total) from Bitmidi, and synthesized audio files from them with a digital audio workstation (DAW). This ensures that the only difference between any two renditions of the same song is the timbre. Each song in the dataset was rendered with four kinds of instruments: piano, acoustic guitar, electric guitar, and bass.
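The sketch below shows how one could reproduce this rendering step programmatically. The actual MI-Para dataset was rendered with a DAW; pretty_midi plus FluidSynth are used here only as a stand-in, and the General MIDI program numbers, soundfont path, and sample rate are assumptions for illustration.

```python
# Render one MIDI file into four single-timbre audio clips, one per instrument.
# The real dataset was rendered with a DAW; this uses pretty_midi + FluidSynth
# as an approximation. Program numbers and the soundfont path are assumptions.
import pretty_midi
import soundfile as sf

INSTRUMENTS = {
    "piano": 0,             # Acoustic Grand Piano
    "acoustic_guitar": 24,  # Acoustic Guitar (nylon)
    "electric_guitar": 27,  # Electric Guitar (clean)
    "bass": 33,             # Electric Bass (finger)
}

def render_versions(midi_path, sf2_path, sample_rate=22050):
    for name, program in INSTRUMENTS.items():
        midi = pretty_midi.PrettyMIDI(midi_path)
        # Force every non-drum track to the same instrument so the only
        # difference between the four renditions is the timbre.
        for track in midi.instruments:
            if not track.is_drum:
                track.program = program
        audio = midi.fluidsynth(fs=sample_rate, sf2_path=sf2_path)
        sf.write(f"{name}.wav", audio, sample_rate)

render_versions("song.mid", "soundfont.sf2")
```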
Demo
There are four kinds of instruments in the MI-Para dataset, so our experiment covers twelve transfer tasks, one for each ordered pair of distinct source and target instruments (4 × 3 = 12).
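The enumeration of the twelve source-to-target pairs can be written out directly; instrument names here simply follow the dataset description.

```python
# Enumerate the twelve transfer tasks: every ordered pair of distinct instruments.
from itertools import permutations

instruments = ["piano", "acoustic guitar", "electric guitar", "bass"]
tasks = list(permutations(instruments, 2))
print(len(tasks))  # 12
for src, tgt in tasks:
    print(f"{src} -> {tgt}")
```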