Introduction

Pop2Piano is a Transformer network designed to generate piano covers from the audio waveforms of pop music. It allows users to create piano renditions directly from a song's audio without the need for melody and chord extraction modules.

Architecture

Pop2Piano is an encoder-decoder Transformer model based on T5. The raw audio waveform is passed to the encoder, which transforms it into a latent representation. The decoder then generates token IDs autoregressively, where each token ID corresponds to one of four token types: time, velocity, note, or special. Finally, the generated tokens are decoded into a MIDI file.
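
As a quick illustration of the T5-style design, the sketch below loads the pretrained checkpoint and prints a few configuration fields; the exact values are whatever the sweetcocoa/pop2piano config defines.

    from transformers import Pop2PianoForConditionalGeneration

    # Load the pretrained checkpoint and inspect its T5-style configuration.
    model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
    print(model.config.model_type)   # model family
    print(model.config.vocab_size)   # size of the MIDI-like token vocabulary (time/velocity/note/special)
    print(model.config.num_layers, model.config.num_decoder_layers)  # encoder / decoder depth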

Training

Pop2Piano was trained primarily on Korean pop music, but it also performs well on Western pop and hip hop songs. The same audio can be rendered into different arrangements by selecting a different composer token during generation, as sketched below.
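
The following is a minimal sketch of composer conditioning. It assumes model, processor, and inputs have already been prepared as in the guide below; the valid composer names (such as "composer1") are those listed under composer_to_feature_token in the checkpoint's generation config.

    # Sketch: produce two arrangements of the same clip by switching composer tokens.
    # Assumes `model`, `processor`, and `inputs` were created as in the guide below.
    for composer in ["composer1", "composer2"]:  # names must appear in model.generation_config.composer_to_feature_token
        output = model.generate(input_features=inputs["input_features"], composer=composer)
        midi = processor.batch_decode(
            token_ids=output, feature_extractor_output=inputs
        )["pretty_midi_objects"][0]
        midi.write(f"{composer}.mid")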

Guide: Running Locally

  1. Installation:
    Install necessary libraries:

    pip install git+https://github.com/huggingface/transformers.git
    pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
    
  2. Using Your Own Audio:
    Load and process your own audio file:

    import os

    import librosa
    from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

    # Load the audio at 44.1 kHz; other sampling rates also work, as long as the
    # same value is passed to the processor below.
    audio, sr = librosa.load("<your_audio_file_here>", sr=44100)
    model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
    processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

    # Extract input features and generate MIDI token IDs conditioned on a composer token.
    inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
    model_output = model.generate(input_features=inputs["input_features"], composer="composer1")

    # Decode the generated token IDs into a pretty_midi.PrettyMIDI object and save it.
    tokenizer_output = processor.batch_decode(
        token_ids=model_output, feature_extractor_output=inputs
    )["pretty_midi_objects"][0]
    os.makedirs("./Outputs", exist_ok=True)  # make sure the output directory exists
    tokenizer_output.write("./Outputs/midi_output.mid")
    
  3. Using Audio from Hugging Face Hub:
    Process audio from a dataset:

    import os

    from datasets import load_dataset
    from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

    model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
    processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
    # Load a small test dataset whose examples contain raw audio arrays and sampling rates.
    ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")

    # Pass the first example's audio array and sampling rate to the processor.
    inputs = processor(
        audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
    )
    model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
    # Decode the generated token IDs into a pretty_midi.PrettyMIDI object and save it.
    tokenizer_output = processor.batch_decode(
        token_ids=model_output, feature_extractor_output=inputs
    )["pretty_midi_objects"][0]
    os.makedirs("./Outputs", exist_ok=True)  # make sure the output directory exists
    tokenizer_output.write("./Outputs/midi_output.mid")
    
  4. Cloud GPU Recommendation:
    For faster processing, run the model on a GPU, for example via cloud services with GPU support such as AWS, Google Cloud, or Azure, as sketched below.
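
    A minimal device-placement sketch, assuming PyTorch with CUDA is installed and that model, processor, audio, and sr were set up as in step 2:

    import torch

    # Pick the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
    model_output = model.generate(
        input_features=inputs["input_features"].to(device), composer="composer1"
    ).cpu()  # bring the generated token IDs back to the CPU for decoding

    Decoding the tokens back to MIDI then proceeds exactly as in step 2.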

License

Licensing details for the Pop2Piano model are available in the original repository.
