SovitsTokenizer

SovitsTokenizer: A low-bitrate audio tokenizer that converts speech into discrete tokens (as low as 25 tokens per second) while preserving semantic and prosodic richness. It builds on the pre-trained SoVITS model from GPT-SoVITS, fine-tuning its VQ-VAE layers for efficient audio compression and using HuBERT for robust semantic extraction. By combining HuBERT’s linguistic representations with the VQ-VAE’s capture of prosodic and phonetic features, together with the decoupled text embedding from the MTRE module, SovitsTokenizer produces compact yet expressive discrete speech units. These units can also be used for voice conversion: decoding the tokens with a reference audio adapts the output to that speaker.
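To make the "25 tokens per second" figure concrete, the sketch below estimates how many tokens a clip produces and what bitrate that implies. The token rate comes from the description above; the codebook size of 1024 is an assumption for illustration only (this README does not state it), and the helper names `token_count` and `bitrate_bps` are hypothetical, not part of the package.

```python
import math

TOKENS_PER_SECOND = 25  # token rate stated in the description above
CODEBOOK_SIZE = 1024    # ASSUMPTION: illustrative value, not stated in this README


def token_count(duration_s: float, tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Approximate number of tokens for a clip of the given duration."""
    return round(duration_s * tokens_per_second)


def bitrate_bps(tokens_per_second: int, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bits per second if every token indexes one entry of a codebook."""
    return tokens_per_second * num_codebooks * math.log2(codebook_size)


print(token_count(21.52))                             # 538
print(bitrate_bps(TOKENS_PER_SECOND, CODEBOOK_SIZE))  # 250.0
```

Under these assumptions a 21.52-second clip maps to 538 tokens, which matches the sequence length shown in the Usage section, and the stream would cost about 250 bits per second.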

Installation

git clone https://github.com/hon9kon9ize/sovits-tokenizer.git
cd sovits-tokenizer
pip install -e .

Usage

import math

import torch
from sovits_tokenizer import SovitsTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generator_weights = "pretrained_models/s2G2333k.pth" # download from https://huggingface.co/lj1995/GPT-SoVITS/tree/main/gsv-v2final-pretrained
hubert_base_path = "pretrained_models/chinese-hubert-base" # download from https://huggingface.co/lj1995/GPT-SoVITS/tree/main/chinese-hubert-base

speech_tokenizer = SovitsTokenizer(generator_weights, hubert_base_path, device=device)

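# `codes` holds the discrete tokens for the input audio; `outputs` is the waveform decoded from them
print(codes.shape) # (1, 1, 538) batch_size, num_codebooks, seq_len
print(outputs.shape) # (688640,) samples at 32 kHz
print("duration", outputs.shape[0] / 32000) # duration 21.52
print("TBS", codes.shape[-1] / math.ceil(outputs.shape[0] / 32000)) # tokens per second: 538 / 22 ≈ 24.45, i.e. about 25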

# Reconstruction and Voice Conversion
reference_audio = "path/to/reference_audio.wav"
recon_wav = speech_tokenizer.decode(codes, reference_audio)
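To listen to the result, the decoded waveform has to be written to disk. The following is a minimal sketch using only the Python standard library. It assumes `recon_wav` is a 1-D sequence of float samples in [-1, 1] at 32 kHz, which is what the shapes above suggest but is not confirmed here; a tensor on a GPU would first need to be moved to the CPU and flattened. `save_wav` is a hypothetical helper, not part of `sovits_tokenizer`, and a synthetic tone stands in for `recon_wav` so the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path, sample_rate=32000):
    """Write mono float samples in [-1, 1] as 16-bit PCM; return the sample count."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in for `recon_wav`: one second of a 440 Hz tone at 32 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 32000) for n in range(32000)]
out_path = os.path.join(tempfile.gettempdir(), "recon_demo.wav")
n_written = save_wav(tone, out_path)
print(n_written / 32000)  # 1.0
```

With the real tokenizer, the same call would be `save_wav(recon_wav, "recon.wav")`, provided the assumption about the output type holds.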

Example

Input audio:

original.mp4

Reconstructed audio:

recon.mp4

Acknowledgment and Inspiration

This project is inspired by and builds upon GPT-SoVITS. We borrow key ideas and components from the original repository, such as using VQ-VAE and HuBERT for speech representation and semantic extraction. GPT-SoVITS provides the foundation of SovitsTokenizer, which extends it to efficient tokenization and voice conversion using reference audio.

Special thanks to the original developers of GPT-SoVITS for their groundbreaking contributions to the field!

