Data Preparation

After Environment setup, we can start preparing the data for training and decoding.

The first step is to prepare the data for training. We have already provided prepare.sh that would prepare everything required for training.

cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR

./prepare.sh

Note that in each recipe from icefall, there exists a file prepare.sh, which you should run before you run anything else.

That is all you need for data preparation.

For the more curious

If you are wondering how to prepare your own dataset, please refer to the following URLs for more details:

https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes

It contains recipes for a variety of dataset. If you want to add your own dataset, please read recipes in this folder first.

https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py

The yesno recipe in lhotse.

If you already have a Kaldi dataset directory, which contains files like wav.scp, feats.scp, then you can refer to https://lhotse.readthedocs.io/en/latest/kaldi.html#example.

A quick look to the generated files

./prepare.sh puts generated files into two directories:

download

data

download

The download directory contains downloaded dataset files:

tree -L 1 ./download/

./download/
|-- waves_yesno
`-- waves_yesno.tar.gz

Hint

Please refer to https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py#L41 for how the data is downloaded and extracted.

data

tree ./data/

./data/
|-- fbank
|   |-- yesno_cuts_test.jsonl.gz
|   |-- yesno_cuts_train.jsonl.gz
|   |-- yesno_feats_test.lca
|   `-- yesno_feats_train.lca
|-- lang_phone
|   |-- HLG.pt
|   |-- L.pt
|   |-- L_disambig.pt
|   |-- Linv.pt
|   |-- lexicon.txt
|   |-- lexicon_disambig.txt
|   |-- tokens.txt
|   `-- words.txt
|-- lm
|   |-- G.arpa
|   `-- G.fst.txt
`-- manifests
    |-- yesno_recordings_test.jsonl.gz
    |-- yesno_recordings_train.jsonl.gz
    |-- yesno_supervisions_test.jsonl.gz
    `-- yesno_supervisions_train.jsonl.gz

4 directories, 18 files

data/manifests:

This directory contains manifests. They are used to generate files in data/fbank.

To give you an idea of what it contains, we examine the first few lines of the manifests related to the train dataset.

cd data/manifests
gunzip -c  yesno_recordings_train.jsonl.gz  | head -n 3

The output is given below:

{"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}
{"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}
{"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}

Please refer to https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L300 for the meaning of each field per line.

gunzip -c  yesno_supervisions_train.jsonl.gz  | head -n 3

The output is given below:

{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}
{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}
{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}

Please refer to https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510 for the meaning of each field per line.

data/fbank:

This directory contains everything from data/manifests. Furthermore, it also contains features for training.

data/fbank/yesno_feats_train.lca contains the features for the train dataset. Features are compressed using lilcom.

data/fbank/yesno_cuts_train.jsonl.gz stores the CutSet, which stores RecordingSet, SupervisionSet, and FeatureSet.

To give you an idea about what it looks like, we can run the following command:

cd data/fbank

gunzip -c yesno_cuts_train.jsonl.gz | head -n 3

The output is given below:

{"id": "0_0_0_0_1_1_1_1-0", "start": 0, "duration": 6.35, "channel": 0, "supervisions": [{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 635, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.35, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "0,13000,3570", "channels": 0}, "recording": {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "0_0_0_1_0_1_1_0-1", "start": 0, "duration": 6.11, "channel": 0, "supervisions": [{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 611, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.11, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "16570,12964,2929", "channels": 0}, "recording": {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "0_0_1_0_0_1_1_0-2", "start": 0, "duration": 6.02, "channel": 0, "supervisions": [{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 602, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.02, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "32463,12936,2696", "channels": 0}, "recording": {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}, "type": "MonoCut"}

Note that yesno_cuts_train.jsonl.gz only stores the information about how to read the features. The actual features are stored separately in data/fbank/yesno_feats_train.lca.

data/lang:

This directory contains the lexicon.

data/lm:

This directory contains language models.