Commit
Showing 22 changed files with 134,259 additions and 46 deletions.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2019

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,3 @@
recursive-include tacotron *.yaml
recursive-include tacotron *.txt
include LICENSE
@@ -1 +1,101 @@
# Tacotron
# Tacotron (with Dynamic Convolution Attention)

A PyTorch implementation of [Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis](https://arxiv.org/abs/1910.10288). Audio samples can be found [here](https://bshall.github.io/tacotron/).

<div align="center">
    <img width="655" height="390" alt="Tacotron (with Dynamic Convolution Attention)"
      src="https://raw.githubusercontent.com/bshall/Tacotron/main/tacotron.png"><br>
    <sup><strong>Fig 1:</strong> Tacotron (with Dynamic Convolution Attention).</sup>
</div>

<div align="center">
    <img width="897" height="154" alt="Example Mel-spectrogram and attention plot"
      src="https://raw.githubusercontent.com/bshall/Tacotron/main/example.png"><br>
    <sup><strong>Fig 2:</strong> Example Mel-spectrogram and attention plot.</sup>
</div>

## Quick Start

Ensure you have Python 3.6 or greater and PyTorch 1.7 or greater installed. Then install this package with:
```
pip install tacotron
```
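
The version requirements above can also be checked programmatically before installing. The following is a minimal sketch, not part of this package: `meets_minimum` is a hypothetical helper that compares the leading numeric components of a version string, so local build suffixes such as `+cu110` are ignored.

```python
import re
import sys


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Return True if a dotted version string is at least `minimum`.

    Hypothetical helper for illustration; only the leading numeric
    components are compared, so "1.7.0+cu110" is read as (1, 7, 0).
    """
    numbers = re.match(r"\d+(?:\.\d+)*", version)
    if numbers is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parsed = tuple(int(part) for part in numbers.group().split("."))
    return parsed >= minimum


# Python itself can be compared directly against the documented minimum.
python_ok = sys.version_info >= (3, 6)

# For PyTorch, pass torch.__version__ once torch is importable, e.g.
# meets_minimum(torch.__version__, (1, 7)).
print(python_ok, meets_minimum("1.7.0+cu110", (1, 7)))
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where `"1.10"` sorts before `"1.7"`.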

## Example Usage

```python
import torch
import soundfile as sf
from univoc import Vocoder
from tacotron import load_cmudict, text_to_id, Tacotron

# download pretrained weights for the vocoder (and optionally move to GPU)
vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()

# download pretrained weights for tacotron (and optionally move to GPU)
tacotron = Tacotron.from_pretrained(
    "https://github.com/bshall/Tacotron/releases/download/v0.1/tacotron-ljspeech-yspjx3.pt"
).cuda()

# load cmudict and add pronunciation of PyTorch
cmudict = load_cmudict()
cmudict["PYTORCH"] = "P AY1 T AO2 R CH"

text = "A PyTorch implementation of Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis."

# convert text to phone ids
text = torch.LongTensor(text_to_id(text, cmudict)).unsqueeze(0).cuda()

# synthesize audio
with torch.no_grad():
    mel, _ = tacotron.generate(text)
    wav, sr = vocoder.generate(mel.transpose(1, 2))

# save output
sf.write("location_relative_attention.wav", wav, sr)
```
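
The custom `cmudict` entry in the example follows the CMU Pronouncing Dictionary convention: space-separated ARPAbet phones, with vowels carrying a trailing stress digit (0 for no stress, 1 for primary, 2 for secondary). As a minimal sketch of that format, the hypothetical helper below (not part of the `tacotron` package) splits such a string into phones and stress levels, which can help sanity-check an entry before adding it.

```python
def split_pronunciation(pronunciation: str):
    """Split an ARPAbet string into (phone, stress) pairs.

    Hypothetical helper for illustration only. Stress is an int for
    vowels (0, 1 or 2) and None for consonants.
    """
    pairs = []
    for symbol in pronunciation.split():
        if symbol[-1].isdigit():
            # Vowel: the trailing digit is the stress marker.
            pairs.append((symbol[:-1], int(symbol[-1])))
        else:
            # Consonant: no stress marker.
            pairs.append((symbol, None))
    return pairs


phones = split_pronunciation("P AY1 T AO2 R CH")
print(phones)
```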

## Train from Scratch

1. Clone the repo:
    ```
    git clone https://github.com/bshall/Tacotron
    cd ./Tacotron
    ```
2. Install requirements:
    ```
    pip install -r requirements.txt
    ```
3. Download and extract the [LJ-Speech dataset](https://keithito.com/LJ-Speech-Dataset/):
    ```
    wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
    tar -xvjf LJSpeech-1.1.tar.bz2
    ```
4. Download the train split [here](https://github.com/bshall/Tacotron/releases/tag/v0.1) and extract it in the root directory of the repo.
5. Extract Mel spectrograms and preprocess audio:
    ```
    python preprocess.py in_dir=path/to/LJSpeech-1.1 out_dir=datasets/LJSpeech-1.1
    ```
6. Train the model:
    ```
    python train.py checkpoint_dir=ljspeech dataset_dir=datasets/LJSpeech-1.1 text_path=path/to/LJSpeech-1.1/metadata.csv
    ```
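
The `metadata.csv` file passed to the training command is the LJ-Speech transcript index: one record per line, pipe-delimited, with the fields ID, transcription, and normalized transcription. The sketch below shows one way to read that layout with the standard library. It is an illustration only (the loader used by `train.py` is not shown in this commit and may differ), and the sample records are made up.

```python
import csv
import io


def read_metadata(lines):
    """Yield (clip_id, normalized_text) pairs from LJ-Speech style metadata.

    Each record is `ID|transcription|normalized transcription`. Quoting is
    disabled because transcriptions may contain literal quote characters.
    """
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield row[0], row[-1]


# Made-up records in the same layout, standing in for the real file.
sample = io.StringIO(
    "EX001-0001|Chapter 1 begins.|Chapter one begins.\n"
    'EX001-0002|He said "hello".|He said "hello".\n'
)
entries = list(read_metadata(sample))
print(entries)
```

With the real dataset, the same function can be applied to an open file handle, for example `open("path/to/LJSpeech-1.1/metadata.csv", encoding="utf-8")`.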

## Pretrained Models

Pretrained weights for the LJSpeech model are available [here](https://github.com/bshall/Tacotron/releases/tag/v0.1).

## Notable Differences from the Paper

1. Trained using a batch size of 64 on a single GPU (using automatic mixed precision).
2. Used a gradient clipping threshold of 0.05 as it seems to stabilize the alignment with the smaller batch size.
3. Used a different learning rate schedule (again to deal with the smaller batch size).
4. Used 80-bin (instead of 128-bin) log-Mel spectrograms.
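
To make the learning rate schedule concrete: the training config added in this commit sets `lr: 1e-3`, `gamma: 0.5`, and milestones at 20k, 40k, 100k, 150k, and 200k steps. The sketch below assumes these values drive a step-decay scheduler in the style of PyTorch's `MultiStepLR`. The training loop is not visible in this commit, so treat that as an assumption rather than a description of `train.py`.

```python
from bisect import bisect_right

# Values copied from the `train` section of the config in this commit.
BASE_LR = 1e-3
MILESTONES = [20000, 40000, 100000, 150000, 200000]
GAMMA = 0.5


def learning_rate(step: int) -> float:
    """Step decay: multiply the base rate by GAMMA for every milestone reached."""
    n_decays = bisect_right(MILESTONES, step)
    return BASE_LR * GAMMA ** n_decays


for step in (0, 20000, 100000, 249999):
    print(step, learning_rate(step))
```

Under this assumption the rate halves five times over the 250k-step run, ending at 1/32 of its initial value.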

## Acknowledgements

- https://github.com/keithito/tacotron
- https://github.com/PetrochukM/PyTorch-NLP
- https://github.com/fatchord/WaveRNN
@@ -0,0 +1,3 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
@@ -0,0 +1,7 @@
librosa>=0.8.0
numpy>=1.18.0
tqdm>=4.41
hydra-core>=1.0.3
pyloudnorm>=0.1.0
tensorboard>=2.3.0
importlib-resources
This file was deleted.
@@ -0,0 +1,41 @@
[metadata]
name = tacotron
version = 0.1.0
author = Benjamin van Niekerk
author_email = benjamin.l.van.niekerk@gmail.com
url = https://github.com/bshall/Tacotron
description = A PyTorch implementation of Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis.
long_description = file:README.md
long_description_content_type = text/markdown
project_urls =
    Source = https://github.com/bshall/Tacotron
    Samples = https://bshall.github.io/tacotron/
keywords =
    Speech Synthesis
    Tacotron
    Text-to-Speech
    PyTorch
classifiers =
    Natural Language :: English
    Intended Audience :: Science/Research
    License :: OSI Approved :: MIT License
    Operating System :: POSIX :: Linux
    Programming Language :: Python
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
    Topic :: Scientific/Engineering
    Topic :: Scientific/Engineering :: Artificial Intelligence

[options]
packages = tacotron
include_package_data = True
python_requires = >=3.6
install_requires =
    librosa>=0.8.0
    numpy>=1.18.0
    tqdm>=4.41
    requests
    importlib-resources
    omegaconf>=2.0.3
@@ -0,0 +1,3 @@
from .model import Tacotron
from .text import load_cmudict, text_to_id
from .dataset import TTSDataset, BucketBatchSampler, pad_collate
Empty file.
@@ -0,0 +1,72 @@
# @package _group_
preprocess:
  sr: 16000
  hop_length: 200
  win_length: 800
  n_fft: 2048
  n_mels: 80
  fmin: 50
  preemph: 0.97
  top_db: 80
  ref_db: 20
  mulaw:
    bits: 10

train:
  batch_size: 64
  bucket_size_multiplier: 5
  n_steps: 250000
  clip_grad_norm: 0.05
  optimizer:
    lr: 1e-3
  scheduler:
    milestones:
      - 20000
      - 40000
      - 100000
      - 150000
      - 200000
    gamma: 0.5
  checkpoint_interval: 5000
  n_workers: 8


model:
  encoder:
    n_symbols: 91
    embedding_dim: 256
    prenet:
      input_size: ${model.encoder.embedding_dim}
      hidden_size: 256
      output_size: 128
      dropout: 0.5
    cbhg:
      input_channels: ${model.encoder.prenet.output_size}
      K: 16
      channels: 128
      projection_channels: 128
      n_highways: 4
      highway_size: 128
      rnn_size: 128
  decoder:
    prenet:
      input_size: ${preprocess.n_mels}
      hidden_size: 256
      output_size: 128
      dropout: 0.5
    attention:
      attn_rnn_size: ${model.decoder.attn_rnn_size}
      hidden_size: 128
      static_channels: 8
      static_kernel_size: 21
      dynamic_channels: 8
      dynamic_kernel_size: 21
      prior_length: 11
      alpha: 0.1
      beta: 0.9
    input_size: ${model.encoder.cbhg.channels}
    n_mels: ${preprocess.n_mels}
    attn_rnn_size: 256
    decoder_rnn_size: 256
    reduction_factor: 2
    zoneout_prob: 0.1
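
The attention settings `prior_length: 11`, `alpha: 0.1`, and `beta: 0.9` match the prior filter described in the Location-Relative Attention paper, whose taps come from a beta-binomial distribution. The sketch below computes that distribution in plain Python, assuming `prior_length` taps correspond to `n = prior_length - 1` trials. How the repository turns these probabilities into the actual filter (ordering, log-space, clamping) is not visible in this commit, so this illustrates the distribution only.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(n: int, alpha: float, beta: float):
    """Return [P(k) for k in 0..n] for a beta-binomial distribution."""
    pmf = []
    for k in range(n + 1):
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        log_p = (
            log_choose
            + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta)
        )
        pmf.append(exp(log_p))
    return pmf


# Values copied from the attention section of the config in this commit.
PRIOR_LENGTH, ALPHA, BETA = 11, 0.1, 0.9

taps = beta_binomial_pmf(PRIOR_LENGTH - 1, ALPHA, BETA)
expected_step = sum(k * p for k, p in enumerate(taps))
print(len(taps), round(sum(taps), 6), round(expected_step, 6))
```

The mean of a beta-binomial is `n * alpha / (alpha + beta)`, so with these values the prior favors moving forward by about one encoder position per decoder step.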
@@ -0,0 +1,5 @@
defaults:
  - config

in_dir: ???
out_dir: ???
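
The preprocessing defaults pulled in by this file (`sr: 16000`, `hop_length: 200`, `win_length: 800`) together with the decoder's `reduction_factor: 2` fix the time resolution of the model. The arithmetic below is a rough sketch: it assumes `reduction_factor` means the number of Mel frames emitted per decoder step (the usual Tacotron meaning), and it ignores any padding the preprocessing script may apply, so frame counts are approximate.

```python
import math

# Values copied from the config in this commit.
SR = 16000            # samples per second
HOP_LENGTH = 200      # samples between successive frames
WIN_LENGTH = 800      # samples per analysis window
REDUCTION_FACTOR = 2  # assumed: Mel frames produced per decoder step

frame_shift_ms = 1000 * HOP_LENGTH / SR
window_ms = 1000 * WIN_LENGTH / SR
frames_per_second = SR / HOP_LENGTH


def n_decoder_steps(duration_s: float) -> int:
    """Approximate number of decoder steps for a clip (padding ignored)."""
    n_frames = math.ceil(duration_s * SR / HOP_LENGTH)
    return math.ceil(n_frames / REDUCTION_FACTOR)


print(frame_shift_ms, window_ms, frames_per_second, n_decoder_steps(10.0))
```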
@@ -0,0 +1,7 @@
defaults:
  - config

resume: false
checkpoint_dir: ???
text_path: ???
dataset_dir: ???
Empty file.