machineexperiments
Machine Experiments
4 posts
machineexperiments · 4 years ago
LJSpeech Ultimate
A recreation of the LJ Speech dataset at 44.1 kHz from Linda Johnson's LibriVox recordings; the files were automatically denoised, de-essed, and clipped with ASR + fuzzy string search. This addresses the three main flaws in LJSpeech: low sampling rate, noisy files, and sharp sibilants.
Download: https://drive.google.com/file/d/1-0ExTPwG0IfMIHFWwSoz2T2ojs_WhJa7/view?usp=sharing
Note that you'll have to re-encode the filelists as UTF-8 before loading them into Python (they're in ANSI).
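Assuming the ANSI encoding is Windows-1252, a minimal re-encoding sketch (the filename is a placeholder):

    from pathlib import Path

    # Re-encode a filelist from ANSI (assumed Windows-1252) to UTF-8 in place.
    # "metadata.csv" is a placeholder; point it at the actual filelist.
    src = Path("metadata.csv")
    text = src.read_text(encoding="cp1252")
    src.write_text(text, encoding="utf-8")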
machineexperiments · 4 years ago
Notes on finetuning GPT-J
A while ago I finetuned EleutherAI's GPT-J-6B on FIMFiction data with TPUs from the TRC program (inference Colab), and found some useful things along the way:
1. Use Python 3.7. Later versions hit a Keras import error.
2. If you want to create a training tfrecords index and aren't going to use anything for validation, gsutil ls gs://path/to/my/*.tfrecords > ../mesh-transformer-jax/data/your_dataset.train.index works just fine.
3. For the love of everything that is holy, with no validation set, make sure val_set is set to {} (see the sketch after this list). I spent hours chasing phantom bugs that turned out to be this one stupid mistake.
4. The TPU Research Cloud program is awesome.
5. Make sure your disk (the one you're loading TFRecords from and exporting checkpoints to) is in the same region, or at least on the same continent, as your TPU, or you will quickly burn through your credits: 90GB checkpoints plus cross-region data transfer charges add up fast. Trust me, I speak from experience.
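Here is a minimal sketch of the config changes from tips 2 and 3, assuming the stock mesh-transformer-jax JSON config layout (the config path and index filename are placeholders):

    import json

    # Point a mesh-transformer-jax config at the tfrecords index from tip 2
    # and disable validation per tip 3. Paths here are placeholders.
    config_path = "configs/6B_finetune.json"

    with open(config_path) as f:
        config = json.load(f)

    config["train_set"] = "data/your_dataset.train.index"
    config["val_set"] = {}  # tip 3: with no validation data, this must be an empty dict

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)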
There are probably more that I can't think of right now; I just drank my morning coffees.
machineexperiments · 4 years ago
LJSpeech Ultimate Progress [Developing]
Neural text-to-speech models like Tacotron 2 require data, often a lot of single-speaker data. The most prevalent large single-speaker dataset available for anyone to use is LJSpeech.
While it has served pretty well to train several models with good performance, the LJ Speech dataset presents a few glaring problems:
1) It is only distributed at a 22 kHz sampling rate, limiting the audio quality of open-source TTS.
2) The sibilants (SSS sounds) are very sharp, to the point where they become unpleasant to listen to above a certain volume.
3) Many of the clips have background noise, like the humming of fans.
The first stage of the plan involves extracting audio from Linda Johnson's LibriVox recordings and applying various filters and effects. The recordings are easily available on LibriVox, with 128kbps MP3s ready for download.
Then, after collecting a lot of recordings, I used batch processing to apply de-essing, de-noising, de-reverb, and de-clicking filters. I collected about 40 hours of processed audio, which is quite overkill, but the next step will discard a lot of it. My dataset will be at 44.1 kHz. Here is a sample of the processed audio.
Now we need a way to transcribe the audio. We could just run it through ASR and call it a day, but that's not enough for me. Since the speaker is reading public domain audiobooks, I can first transcribe with ASR, split automatically based on a threshold of pauses (most STT engines give timing information for each word), and use fuzzy string search to match the recognized text sequences to the actual audiobook text.
The original LJSpeech dataset was made by force-aligning texts to their audio and then splitting on silences. Let's see if my way works.
Update 05/08/2021: I tested my way of aligning and it worked well; here's a sample filelist and wavs made from a 6-minute cut. Currently processing the entire dataset. I'm using TensorSpeech/TensorFlowASR's Conformer model, BTW. Coqui's refuses to run in Colab and on Windows, and I have no interest in dealing with the Kaldi nightmare that is ESPNet.
06/08/2021:
Currently running the auto-collection script locally. I used glob + ffmpeg to split each audio file into 5-minute chunks; a rough sketch is below.
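This is an illustrative sketch of that splitting step using ffmpeg's segment muxer, not the exact script I ran (paths and output naming are assumptions):

    import glob
    import subprocess

    # Split each recording into 5-minute chunks with ffmpeg's segment muxer.
    for path in glob.glob("recordings/*.mp3"):
        out_pattern = path.rsplit(".", 1)[0] + "_%03d.wav"
        subprocess.run([
            "ffmpeg", "-i", path,
            "-f", "segment", "-segment_time", "300",  # 300 s = 5 minutes
            "-ar", "44100", "-ac", "1",               # 44.1 kHz mono output
            out_pattern,
        ], check=True)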
07/08/2021:
Done! About 22 hours of clips, at an efficiency rate of 35% to 60%; I didn't bother to collect more detailed statistics.
Here is a Google Drive link. I'm currently training a TensorFlowTTS Tacotron 2 on it.
Now, you might be wondering: why so inefficient? Which, by the way, is a good question. As I said earlier, my method uses timing information from ASR instead of audio silences to decide clip boundaries. I prefer this method because it gives me more precise control over clipping.
The two key parameters in the clipping script are cut_interval and max_dist. The former is the threshold gap between words before a new cut is made. I generally found that 0.62 was a good middle ground for this speaker: too high, and it extracts mostly long clips; too low, and the clips are too short.
max_dist is the maximum Levenshtein distance that the fuzzy string search (see fuzzysearch) will accept between the ASR'd clip and the actual book text. I had to set this to quite a conservative value: higher distances would extract a lot more perfectly acceptable transcripts, but they would also introduce problems like clips where the transcript misses a word or two. Since I don't want to go hunting for problems manually too much, and since I could afford it after collecting lots of data, I set it to be quite restrictive.
Another parameter of note is paddy, an interval in seconds added to the end of a clip so that the audio doesn't cut off abruptly. However, if the timing indicates that the next word would overlap, the padding is reduced accordingly, so some clips do end pretty harshly. A rough sketch of how these parameters interact is below.
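This is a minimal illustration of the clipping logic, not my actual script; the word-timing format is assumed to be (word, start_sec, end_sec) tuples, and the max_dist and paddy values here are placeholders:

    from fuzzysearch import find_near_matches

    CUT_INTERVAL = 0.62  # gap between words (seconds) that triggers a new cut
    MAX_DIST = 4         # max Levenshtein distance for the fuzzy search (placeholder value)
    PADDY = 0.15         # padding (seconds) appended to each clip's end (placeholder value)

    def make_clips(words, book_text):
        # words: list of (word, start_sec, end_sec) tuples from the ASR output
        clips, current = [], [words[0]]
        for prev, cur in zip(words, words[1:]):
            if cur[1] - prev[2] > CUT_INTERVAL:  # long enough pause -> close the clip
                clips.append(current)
                current = []
            current.append(cur)
        clips.append(current)

        results = []
        for clip in clips:
            asr_text = " ".join(w[0] for w in clip)
            matches = find_near_matches(asr_text, book_text, max_l_dist=MAX_DIST)
            if not matches:
                continue  # no close enough match in the book text, discard the clip
            start = clip[0][1]
            end = clip[-1][2] + PADDY  # the real script shrinks this if the next word would overlap
            results.append((start, end, matches[0].matched))
        return results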
I also noticed that the speed difference in Conformer ASR between a Tesla T4 in Colab and my Ryzen 5 3600 wasn't significant; in fact, it was almost negligible. Either that or my chronic overconsumption of coffee has completely fried my perception of time.
Final Denoising Pass
After finishing the clipping, I noticed that some clips still had a little bit of background noise. Thankfully, I already had a nice tool in the form of my program's batch-denoising feature.
This uses RNNoise, which is quick and effective against constant noises like fans, with minimal to no quality loss.
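My tool wraps RNNoise internally; a rough equivalent using the rnnoise_demo example binary from the RNNoise repo (which expects headerless 48 kHz mono 16-bit PCM) might look like this, with all paths being placeholders:

    import glob
    import subprocess

    for wav in glob.glob("clips/*.wav"):
        raw_in, raw_out = wav + ".raw", wav + ".denoised.raw"
        # Convert to raw 48 kHz mono 16-bit PCM for rnnoise_demo.
        subprocess.run(["ffmpeg", "-y", "-i", wav, "-f", "s16le", "-ar", "48000",
                        "-ac", "1", raw_in], check=True)
        # Denoise the raw stream.
        subprocess.run(["./rnnoise_demo", raw_in, raw_out], check=True)
        # Convert back to a 44.1 kHz wav.
        subprocess.run(["ffmpeg", "-y", "-f", "s16le", "-ar", "48000", "-ac", "1",
                        "-i", raw_out, "-ar", "44100", wav.replace(".wav", "_dn.wav")],
                       check=True)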
As I said earlier, here is the Google Drive link, at V0.2 since I had to re-export as 16-bit PCM wavs.
machineexperiments · 4 years ago
Progress on porting TensorflowTTS to TPU
TensorflowTTS is a very nice repo, not only due to its modularity but also because it uses TensorFlow 2, which makes it very easy to deploy. I want to be able to train it on TPUs because they're fast and easy to access (with the TRC program).
The first difference between GPU and TPU training is that they use different distribution strategies. This is the simplest part, and it required only a bit of change; a sketch is below.
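This is a minimal sketch of the TPU strategy setup in TensorFlow 2 (how the resolver finds the TPU depends on the Colab/TPU VM environment):

    import tensorflow as tf

    # Connect to the TPU and build a TPUStrategy instead of the GPU/default strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # "" usually auto-detects in Colab
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        # model and optimizer construction go here, same as the GPU code path
        pass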
The second is that, while not technically obligatory, TFRecords are used as the main method of loading data from disk, and in practice they are obligatory, because things like tf.numpy_function, used to assist in loading datasets, won't work on TPUs.
Thankfully, TensorflowTTS is really modular and well written, so modifying the dataloader was easy and straightforward after making a conversion notebook, which I will soon adapt into a script in the TPU branch. In addition, since TPUs don't support tf.string, I had to convert utt_id into a numerical int32 value.
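As an illustration of that conversion (the feature names and shapes are placeholders, not the actual TensorflowTTS schema): tf.train.Example only offers int64/float/bytes lists, so the numeric utt_id is written as int64 and cast down to int32 when the record is parsed.

    import tensorflow as tf

    def serialize_example(utt_id: int, input_ids, mel):
        # utt_id is numeric now; it is stored as an int64 feature and cast to
        # int32 on the parsing side, since TPUs can't handle tf.string inputs.
        feature = {
            "utt_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[utt_id])),
            "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=list(input_ids))),
            "mel": tf.train.Feature(float_list=tf.train.FloatList(value=list(mel.reshape(-1)))),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()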
Unfortunately, trying to train Multi-Band MelGAN-HF yields an error which I am pretty much lost on.
(7) Invalid argument: {{function_node __inference__one_step_forward_102607}} Output shapes of then and else branches do not match: (f32[64,<=8192], f32[64,<=8192]) vs. (f32[64,<=8192], f32[0])
According to an issue (the only instance I can find), it's a problem with the XLA compiler. I will seek support.