Background
Input: video files / audio files (signed 16-bit PCM)
Tools: VGGish / ffmpeg
Task: extract semantic embedding vectors from audio files.
Download: https://github.com/tensorflow/models/tree/master/research/audioset/vggish
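Introduction to VGGish
VGGish is a pretrained model trained on AudioSet, a dataset released by Google that can be thought of as the audio counterpart of ImageNet. The model converts audio into 128-dimensional semantic embedding vectors. Besides being used directly as a feature extractor, it can also be fine-tuned for a specific task.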
VGGish vs VGG
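VGGish is a variant of VGG, specifically the configuration with 11 weight layers, with the following changes:
- The input size is changed to 96x64, matching the log mel spectrogram audio input.
- The last group of conv and pool layers is dropped, leaving 4 groups instead of 5.
- The final fully connected layer (the compact embedding layer) is 128 wide, rather than 1000 wide as in the image model.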
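To make the effect of these changes concrete, here is a minimal sketch that walks a 96x64 input through four conv/pool groups and reports the resulting feature map shapes. The channel widths (64, 128, 256, 512), the number of convolutions per group, the SAME-padded 2x2 pooling, and the two 4096-wide fully connected layers are assumptions based on how vggish_slim.py appears to define the network, not something stated above; the helper names are hypothetical.

```python
import math

# Assumed layout: (number of 3x3 conv layers, output channels) per group.
CONV_GROUPS = [(1, 64), (1, 128), (2, 256), (2, 512)]
# Assumed fully connected widths; the last one is the 128-D embedding.
FC_WIDTHS = [4096, 4096, 128]


def pooled_size(size, stride=2):
    """Output length of a stride-2 pooling layer with SAME padding."""
    return math.ceil(size / stride)


def vggish_feature_shapes(height=96, width=64):
    """Return the (height, width, channels) shape after each conv/pool group."""
    shapes = []
    h, w = height, width
    for _num_convs, channels in CONV_GROUPS:
        # SAME-padded 3x3 convolutions keep h and w; only pooling halves them.
        h, w = pooled_size(h), pooled_size(w)
        shapes.append((h, w, channels))
    return shapes


def count_weight_layers():
    """Count conv and fully connected layers under the assumed layout."""
    return sum(num_convs for num_convs, _ in CONV_GROUPS) + len(FC_WIDTHS)
```

Under these assumptions the last pooled feature map is 6x4x512, so 12288 values are flattened into the fully connected layers, and dropping the fifth group leaves 9 weight layers out of the 11 in the original VGG configuration.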
VGGish dependencies
VGGish file structure:
- vggish_slim.py: Model definition in TensorFlow Slim notation.
- vggish_params.py: Hyperparameters.
- vggish_input.py: Converter from audio waveform into input examples.
- mel_features.py: Audio feature extraction helpers.
- vggish_postprocess.py: Embedding postprocessing.
- vggish_inference_demo.py: Demo of VGGish in inference mode.
- vggish_train_demo.py: Demo of VGGish in training mode.
- vggish_smoke_test.py: Simple test of a VGGish installation.
Using VGGish
Audio input: signed 16-bit PCM samples
Usage example: vggish_inference_demo.py
Compute the log mel spectrogram:
examples_batch = vggish_input.wavfile_to_examples(wav_file)
Extract embeddings with VGGish (sess is a TensorFlow session in which the VGGish graph and checkpoint have already been loaded, as in the demo script):
features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
[embedding_batch] = sess.run([embedding_tensor], feed_dict={features_tensor: examples_batch})
Postprocessing:
PCA transform + 8-bit quantization
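The postprocessing step can be sketched in NumPy as follows. This is an illustration, not the code in vggish_postprocess.py: the function name and argument layout are hypothetical, and the clipping range of [-2.0, 2.0] is an assumption based on what the defaults in vggish_params.py appear to be. In real use the PCA matrix and means would come from the released PCA parameters file.

```python
import numpy as np

# Assumed quantization range (believed to match the vggish_params.py defaults).
QUANTIZE_MIN_VAL = -2.0
QUANTIZE_MAX_VAL = 2.0


def postprocess_embeddings(embeddings, pca_matrix, pca_means):
    """Apply a PCA transform, then quantize to 8 bits.

    embeddings: float array of shape [num_examples, dim]
    pca_matrix: float array of shape [dim, dim]
    pca_means:  float array of shape [dim, 1]
    Returns a uint8 array of shape [num_examples, dim].
    """
    # Subtract the means and project: (matrix @ (x^T - means))^T.
    projected = np.dot(pca_matrix, embeddings.T - pca_means).T
    # Clip to the quantization range.
    clipped = np.clip(projected, QUANTIZE_MIN_VAL, QUANTIZE_MAX_VAL)
    # Map [min, max] linearly onto [0, 255]; the cast truncates fractions.
    scale = 255.0 / (QUANTIZE_MAX_VAL - QUANTIZE_MIN_VAL)
    return ((clipped - QUANTIZE_MIN_VAL) * scale).astype(np.uint8)
```

With an identity PCA matrix and zero means, the function reduces to clipping and quantization, which is a convenient way to check the scaling on a few hand-picked values.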
VGGish was trained with audio features computed as follows:
- All audio is resampled to 16 kHz mono.
- A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
- A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.
- A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.01) where the offset is used to avoid taking a logarithm of zero.
- These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
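The framing arithmetic in the list above can be checked with a short sketch. It uses only the numbers given there (16 kHz, 25 ms window, 10 ms hop, 0.96 s non-overlapping examples) and assumes that incomplete trailing frames are dropped rather than padded. That assumption and the helper names are illustrative and may not match every detail of mel_features.py.

```python
SAMPLE_RATE_HZ = 16000
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
EXAMPLE_WINDOW_SECONDS = 0.96
EXAMPLE_HOP_SECONDS = 0.96  # equal to the window: examples do not overlap


def count_frames(num_items, window_length, hop_length):
    """Number of complete windows that fit when the tail is dropped."""
    if num_items < window_length:
        return 0
    return 1 + (num_items - window_length) // hop_length


def count_examples(duration_seconds):
    """Number of 0.96 s examples produced from audio of the given length."""
    num_samples = int(round(duration_seconds * SAMPLE_RATE_HZ))
    window = int(round(SAMPLE_RATE_HZ * STFT_WINDOW_SECONDS))  # 400 samples
    hop = int(round(SAMPLE_RATE_HZ * STFT_HOP_SECONDS))        # 160 samples
    stft_frames = count_frames(num_samples, window, hop)
    # Each example spans 96 spectrogram frames of 10 ms each.
    frames_per_example = int(round(EXAMPLE_WINDOW_SECONDS / STFT_HOP_SECONDS))
    example_hop = int(round(EXAMPLE_HOP_SECONDS / STFT_HOP_SECONDS))
    return count_frames(stft_frames, frames_per_example, example_hop)
```

Under these assumptions, a 10-second clip yields 998 spectrogram frames and therefore 10 examples, each of shape 96x64, while a clip shorter than about one second yields none.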
Extracting audio from video
ffmpeg -y -i xxx.mp4 -ar 16000 -ac 1 xxx.wav
Key parameters:
-y: overwrite the output file if it exists
-ar rate: set the audio sampling rate (in Hz)
-ac channels: set the number of audio channels
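When many videos need converting, the same command can be driven from Python. The sketch below only wraps the ffmpeg invocation shown above. The function names are hypothetical, and it assumes ffmpeg is installed and available on PATH.

```python
import subprocess


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000, channels=1):
    """Build the argument list for the ffmpeg call shown above."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file
        "-i", video_path,         # input video
        "-ar", str(sample_rate),  # audio sampling rate in Hz
        "-ac", str(channels),     # number of audio channels
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
```

Passing the arguments as a list instead of a single shell string avoids quoting problems with file names that contain spaces.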
References
- Gemmeke, J. et al., AudioSet: An ontology and human-labelled dataset for audio events, ICASSP 2017
- Hershey, S. et al., CNN Architectures for Large-Scale Audio Classification, ICASSP 2017
- https://github.com/tensorflow/models/tree/master/research/audioset