machineexperiments
Machine Experiments
4 posts
machineexperiments · 4 years ago
LJSpeech Ultimate
A recreation of the LJ Speech dataset at 44.1 kHz from Linda Johnson's LibriVox recordings; the files were automatically denoised, de-essed, and clipped with ASR + fuzzy string search. This addresses the three main flaws in LJSpeech: low sampling rate, noisy files, and sharp sibilants.
Download: https://drive.google.com/file/d/1-0ExTPwG0IfMIHFWwSoz2T2ojs_WhJa7/view?usp=sharing
Note that you'll have to re-encode the filelists as UTF-8 before loading them into Python (they're in ANSI).
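Assuming the ANSI encoding is Windows-1252, a minimal re-encoding sketch (the filename is a placeholder):

    from pathlib import Path

    # Re-encode a filelist from ANSI (assumed Windows-1252) to UTF-8 in place.
    # "metadata.csv" is a placeholder; point it at the actual filelist.
    src = Path("metadata.csv")
    text = src.read_text(encoding="cp1252")
    src.write_text(text, encoding="utf-8")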
machineexperiments · 4 years ago
Notes on finetuning GPT-J
A while ago I finetuned EleutherAI's GPT-J-6B on FIMFiction data with TPUs from the TRC program (inference Colab), and found some useful things along the way:
1. Use Python 3.7. Later versions hit a Keras import error.
2. If you want to create a training tfrecords index and aren't going to use anything for validation, gsutil ls gs://path/to/my/*.tfrecords > ../mesh-transformer-jax/data/your_dataset.train.index works just fine.
3. For the love of everything that is holy, with no validation set, make sure val_set is set to {} (see the sketch after this list). I spent hours chasing phantom bugs that turned out to be this one stupid mistake.
4. The TPU Research Cloud program is awesome.
5. Make sure your disk (the one you're loading TFRecords from and exporting checkpoints to) is in the same region, or at least on the same continent, as your TPU, or you will quickly burn through your credits: 90GB checkpoints plus cross-region data transfer charges add up fast. Trust me, I speak from experience.
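Here is a minimal sketch of the config changes from tips 2 and 3, assuming the stock mesh-transformer-jax JSON config layout (the config path and index filename are placeholders):

    import json

    # Point a mesh-transformer-jax config at the tfrecords index from tip 2
    # and disable validation per tip 3. Paths here are placeholders.
    config_path = "configs/6B_finetune.json"

    with open(config_path) as f:
        config = json.load(f)

    config["train_set"] = "data/your_dataset.train.index"
    config["val_set"] = {}  # tip 3: with no validation data, this must be an empty dict

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)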
There are probably more that I can't think of right now; I just drank my morning coffees.
machineexperiments · 4 years ago
LJSpeech Ultimate Progress [Developing]
Neural text-to-speech models like Tacotron 2 require data, often a lot of single-speaker data. The most prevalent large single-speaker dataset available for anyone to use is LJSpeech.
While it has served pretty well to train several models with good performance, the LJ Speech dataset presents a few glaring problems:
1) It is only distributed at a 22 kHz sampling rate, limiting the audio quality of open-source TTS.
2) The sibilants (SSS sounds) are very sharp, to the point where they become unpleasant to listen to above a certain volume.
3) Many of the clips have background noise, like the humming of fans.
The first stage of the plan involves extracting audio from Linda Johnson's LibriVox recordings and applying various filters and effects. The recordings are easily available on LibriVox, with 128kbps MP3s ready for download.
Then, after collecting a lot of recordings, I used batch processing to apply de-essing, de-noising, de-reverb, and de-clicking filters. I collected about 40 hours of processed audio, which is quite overkill, but the next step will discard a lot of it. My dataset will be at 44.1 kHz. Here is a sample of the processed audio.
Now we need a way to transcribe the audio. We could just run it through ASR and call it a day, but that's not enough for me. Since the speaker is reading public domain audiobooks, I can first transcribe with ASR, split automatically based on a threshold of pauses (most STT engines give timing information for each word), and use fuzzy string search to match the recognized text sequences to the actual audiobook text.
The original LJSpeech dataset was made by force-aligning texts to their audio and then splitting on silences. Let's see if my way works.
Update 05/08/2021: I tested my way of aligning and it worked well; here's a sample filelist and wavs made from a 6-minute cut. Currently processing the entire dataset. I'm using TensorSpeech/TensorFlowASR's Conformer model, BTW. Coqui's refuses to run in Colab and on Windows, and I have no interest in dealing with the Kaldi nightmare that is ESPNet.
06/08/2021:
Currently running the auto-collection script locally. I used glob + ffmpeg to split each audio file into 5-minute chunks; a rough sketch is below.
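This is an illustrative sketch of that splitting step using ffmpeg's segment muxer, not the exact script I ran (paths and output naming are assumptions):

    import glob
    import subprocess

    # Split each recording into 5-minute chunks with ffmpeg's segment muxer.
    for path in glob.glob("recordings/*.mp3"):
        out_pattern = path.rsplit(".", 1)[0] + "_%03d.wav"
        subprocess.run([
            "ffmpeg", "-i", path,
            "-f", "segment", "-segment_time", "300",  # 300 s = 5 minutes
            "-ar", "44100", "-ac", "1",               # 44.1 kHz mono output
            out_pattern,
        ], check=True)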
07/08/2021:
Done! About 22 hours of clips, at an efficiency rate of 35% to 60%; I didn't bother to collect more detailed statistics.
Here is a Google Drive link. I'm currently training a TensorFlowTTS Tacotron 2 on it.
Now, you might be wondering: why so inefficient? Which, by the way, is a good question. As I said earlier, my method uses timing information from ASR instead of audio silences to decide clip boundaries. I prefer this method because it gives me more precise control over clipping.
The two key parameters in the clipping script are cut_interval and max_dist. The former is the threshold gap between words before a new cut is made. I generally found that 0.62 was a good middle ground for this speaker: too high, and it extracts mostly long clips; too low, and the clips are too short.
max_dist is the maximum Levenshtein distance that the fuzzy string search (see fuzzysearch) will accept between the ASR'd clip and the actual book text. I had to set this to quite a conservative value: higher distances would extract a lot more perfectly acceptable transcripts, but they would also introduce problems like clips where the transcript misses a word or two. Since I don't want to go hunting for problems manually too much, and since I could afford it after collecting lots of data, I set it to be quite restrictive.
Another parameter of note is paddy, an interval in seconds added to the end of a clip so that the audio doesn't cut off abruptly. However, if the timing indicates that the next word would overlap, the padding is reduced accordingly, so some clips do end pretty harshly. A rough sketch of how these parameters interact is below.
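This is a minimal illustration of the clipping logic, not my actual script; the word-timing format is assumed to be (word, start_sec, end_sec) tuples, and the max_dist and paddy values here are placeholders:

    from fuzzysearch import find_near_matches

    CUT_INTERVAL = 0.62  # gap between words (seconds) that triggers a new cut
    MAX_DIST = 4         # max Levenshtein distance for the fuzzy search (placeholder value)
    PADDY = 0.15         # padding (seconds) appended to each clip's end (placeholder value)

    def make_clips(words, book_text):
        # words: list of (word, start_sec, end_sec) tuples from the ASR output
        clips, current = [], [words[0]]
        for prev, cur in zip(words, words[1:]):
            if cur[1] - prev[2] > CUT_INTERVAL:  # long enough pause -> close the clip
                clips.append(current)
                current = []
            current.append(cur)
        clips.append(current)

        results = []
        for clip in clips:
            asr_text = " ".join(w[0] for w in clip)
            matches = find_near_matches(asr_text, book_text, max_l_dist=MAX_DIST)
            if not matches:
                continue  # no close enough match in the book text, discard the clip
            start = clip[0][1]
            end = clip[-1][2] + PADDY  # the real script shrinks this if the next word would overlap
            results.append((start, end, matches[0].matched))
        return results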
I also noticed that the speed difference in Conformer ASR between a Tesla T4 in Colab and my Ryzen 5 3600 wasn't significant; in fact, it was almost negligible. Either that or my chronic overconsumption of coffee has completely fried my perception of time.
Final Denoising Pass
After finishing the clipping, I noticed that some clips still had a little bit of background noise. Thankfully, I already had a nice tool in the form of my program's batch-denoising feature.
This uses RNNoise, which is quick and effective against constant noises like fans, with minimal to no quality loss.
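My tool wraps RNNoise internally; a rough equivalent using the rnnoise_demo example binary from the RNNoise repo (which expects headerless 48 kHz mono 16-bit PCM) might look like this, with all paths being placeholders:

    import glob
    import subprocess

    for wav in glob.glob("clips/*.wav"):
        raw_in, raw_out = wav + ".raw", wav + ".denoised.raw"
        # Convert to raw 48 kHz mono 16-bit PCM for rnnoise_demo.
        subprocess.run(["ffmpeg", "-y", "-i", wav, "-f", "s16le", "-ar", "48000",
                        "-ac", "1", raw_in], check=True)
        # Denoise the raw stream.
        subprocess.run(["./rnnoise_demo", raw_in, raw_out], check=True)
        # Convert back to a 44.1 kHz wav.
        subprocess.run(["ffmpeg", "-y", "-f", "s16le", "-ar", "48000", "-ac", "1",
                        "-i", raw_out, "-ar", "44100", wav.replace(".wav", "_dn.wav")],
                       check=True)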
As I said earlier, here is the Google Drive link, at V0.2 since I had to re-export as 16-bit PCM wavs.
machineexperiments · 4 years ago
Progress on porting TensorflowTTS to TPU
TensorflowTTS is a very nice repo, not only due to its modularity but also because it uses TensorFlow 2, which makes it very easy to deploy. I want to be able to train it on TPUs because they're fast and easy to access (with the TRC program).
The first difference between GPU and TPU training is that they use different distribution strategies. This is the simplest part, and it required only a bit of change; a sketch is below.
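This is a minimal sketch of the TPU strategy setup in TensorFlow 2 (how the resolver finds the TPU depends on the Colab/TPU VM environment):

    import tensorflow as tf

    # Connect to the TPU and build a TPUStrategy instead of the GPU/default strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # "" usually auto-detects in Colab
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        # model and optimizer construction go here, same as the GPU code path
        pass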
The second is that, while not technically obligatory, TFRecords are used as the main method of loading data from disk, and in practice they are obligatory, because things like tf.numpy_function, used to assist in loading datasets, won't work on TPUs.
Thankfully, TensorflowTTS is really modular and well written, so modifying the dataloader was easy and straightforward after making a conversion notebook, which I will soon adapt into a script in the TPU branch. In addition, since TPUs don't support tf.string, I had to convert utt_id into a numerical int32 value.
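As an illustration of that conversion (the feature names and shapes are placeholders, not the actual TensorflowTTS schema): tf.train.Example only offers int64/float/bytes lists, so the numeric utt_id is written as int64 and cast down to int32 when the record is parsed.

    import tensorflow as tf

    def serialize_example(utt_id: int, input_ids, mel):
        # utt_id is numeric now; it is stored as an int64 feature and cast to
        # int32 on the parsing side, since TPUs can't handle tf.string inputs.
        feature = {
            "utt_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[utt_id])),
            "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=list(input_ids))),
            "mel": tf.train.Feature(float_list=tf.train.FloatList(value=list(mel.reshape(-1)))),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()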
Unfortunately, trying to train Multi-Band MelGAN-HF yields an error which I am pretty much lost on.
(7) Invalid argument: {{function_node __inference__one_step_forward_102607}} Output shapes of then and else branches do not match: (f32[64,<=8192], f32[64,<=8192]) vs. (f32[64,<=8192], f32[0])
According to an issue (the only instance I can find), it's a problem with the XLA compiler. I will seek support.