Audio Feature Extraction

Background

Input: video files / audio (signed 16-bit PCM)

Tools: VGGish, ffmpeg

Task: extract semantic embedding vectors from audio files.

Download: https://github.com/tensorflow/models/tree/master/research/audioset/vggish

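VGGish Overview

VGGish is a pretrained model that Google released alongside AudioSet, an audio counterpart to ImageNet. It maps audio to 128-dimensional semantic embedding vectors. Besides using it directly as a feature extractor, you can also fine-tune it for a specific task.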

VGGish vs VGG

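VGGish is a variant of VGG, specifically Configuration A with 11 weight layers, with the following changes:

  • The input is a 96x64 log mel spectrogram patch computed from the audio.
  • The last group of conv and pool layers is dropped, leaving 4 groups instead of 5.
  • The final fully connected layer (the compact embedding layer) is 128 wide, rather than the 1000 used for image classification.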
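
To make the effect of keeping four conv/pool groups concrete, the sketch below traces how the 96x64 input shrinks before it reaches the fully connected layers. The channel widths (64, 128, 256, 512) and the 2x2, stride-2, SAME-padded pooling are assumptions recalled from vggish_slim.py; they are not stated in the text above, so check them against the source before relying on the numbers.

```python
import math

# Assumed layout (recalled from vggish_slim.py, not stated in this text):
# four conv groups with 64, 128, 256, 512 output channels, each followed by
# a 2x2 max-pool with stride 2 and SAME padding.
GROUP_CHANNELS = [64, 128, 256, 512]


def feature_map_shape(frames=96, mel_bins=64, group_channels=GROUP_CHANNELS):
    """Return (height, width, channels) after the last pooling layer."""
    height, width = frames, mel_bins
    for _ in group_channels:
        # SAME-padded stride-2 pooling halves each side, rounding up.
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
    return height, width, group_channels[-1]


shape = feature_map_shape()
flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)
```

Under these assumptions the last pooling layer yields a 6x4x512 map, that is 12288 values feeding the fully connected layers that end in the 128-wide embedding.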

VGGish Dependencies

VGGish file structure:
  • vggish_slim.py: Model definition in TensorFlow Slim notation.
  • vggish_params.py: Hyperparameters.
  • vggish_input.py: Converter from audio waveform into input examples.
  • mel_features.py: Audio feature extraction helpers.
  • vggish_postprocess.py: Embedding postprocessing.
  • vggish_inference_demo.py: Demo of VGGish in inference mode.
  • vggish_train_demo.py: Demo of VGGish in training mode.
  • vggish_smoke_test.py: Simple test of a VGGish installation.

Using VGGish

Audio file: signed 16-bit PCM samples
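
Before feeding a file to VGGish it can help to confirm that it really contains 16-bit PCM samples. The following is a minimal sketch using Python's standard wave module; the helper names describe_wav and is_16bit_pcm are hypothetical and not part of VGGish.

```python
import wave


def describe_wav(path):
    """Read the basic format fields of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def is_16bit_pcm(path):
    """True if each sample occupies 2 bytes, i.e. 16-bit PCM."""
    return describe_wav(path)["sample_width_bytes"] == 2
```

For example, is_16bit_pcm("xxx.wav") should return True for a file produced with ffmpeg's default WAV settings; wave.open raises wave.Error for files it cannot parse as PCM WAV.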

Usage example: vggish_inference_demo.py

  • Compute the log mel spectrogram
    examples_batch = vggish_input.wavfile_to_examples(wav_file)

  • Extract embeddings with VGGish

    features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
    [embedding_batch] = sess.run([embedding_tensor], feed_dict={features_tensor: examples_batch})

  • Postprocessing

    PCA transform followed by 8-bit quantization
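
The postprocessing step can be illustrated with a small pure-Python sketch. It is not the code in vggish_postprocess.py: the function name postprocess is hypothetical, the PCA matrix and means here are toy values (the released model loads its own parameters), and the clipping range of [-2.0, 2.0] is an assumption recalled from vggish_params.py.

```python
def postprocess(embedding, pca_matrix, pca_means, lo=-2.0, hi=2.0):
    """Apply a PCA projection, clip, and quantize to integers in 0..255.

    lo/hi are an assumed clipping range, not values taken from this text.
    """
    # Center the embedding with the PCA means.
    centered = [x - m for x, m in zip(embedding, pca_means)]
    # Project: one output value per row of the PCA matrix.
    projected = [sum(w * c for w, c in zip(row, centered)) for row in pca_matrix]
    # Clip to the fixed range so it can be mapped onto 8 bits.
    clipped = [min(max(v, lo), hi) for v in projected]
    # Scale [lo, hi] linearly onto [0, 255] and truncate to an integer.
    return [int((v - lo) * (255.0 / (hi - lo))) for v in clipped]


# Toy 3-dimensional example with an identity "PCA" and zero means.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(postprocess([-5.0, 0.0, 2.0], identity, [0.0, 0.0, 0.0]))
```

With the identity matrix the projection is a no-op, so the example only shows clipping and quantization: values at or below the lower bound map to 0 and values at or above the upper bound map to 255.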

VGGish was trained with audio features computed as follows:

  • All audio is resampled to 16 kHz mono.
  • A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
  • A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.
  • A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.01) where the offset is used to avoid taking a logarithm of zero.
  • These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
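
The framing arithmetic in the steps above can be checked with a short sketch. It only counts windows and applies the stabilized log; it does not compute a spectrogram. The function names are hypothetical, and the assumption that an incomplete trailing window is dropped should be verified against mel_features.py.

```python
import math

SAMPLE_RATE_HZ = 16000
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
EXAMPLE_FRAMES = 96  # 96 frames of 10 ms each = 0.96 s
LOG_OFFSET = 0.01


def count_windows(length, window, hop):
    """Number of complete windows of size `window` taken every `hop` items."""
    if length < window:
        return 0
    return 1 + (length - window) // hop


def count_examples(num_samples):
    """Number of 0.96 s examples; assumes an incomplete tail is dropped."""
    window = int(round(SAMPLE_RATE_HZ * STFT_WINDOW_SECONDS))  # 400 samples
    hop = int(round(SAMPLE_RATE_HZ * STFT_HOP_SECONDS))        # 160 samples
    stft_frames = count_windows(num_samples, window, hop)
    # Examples do not overlap, so the hop equals the example length.
    return count_windows(stft_frames, EXAMPLE_FRAMES, EXAMPLE_FRAMES)


def stabilized_log(mel_value):
    """log(mel + 0.01); the offset keeps the logarithm finite at zero."""
    return math.log(mel_value + LOG_OFFSET)


print(count_examples(10 * SAMPLE_RATE_HZ))
```

Under these assumptions a 10-second clip at 16 kHz produces 998 STFT frames and therefore 10 examples, each of which becomes one 128-dimensional embedding.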

Extracting Audio from Video

ffmpeg -y -i xxx.mp4 -ar 16000 -ac 1 xxx.wav

Main parameters:

-y: overwrite the output file
-ar rate: set the audio sampling rate (in Hz)
-ac channels: set the number of audio channels
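
When many videos need converting, the same command can be driven from Python. The sketch below is one possible wrapper, assuming ffmpeg is installed and on the PATH; the function names build_ffmpeg_command and extract_audio are hypothetical.

```python
import subprocess


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000, channels=1):
    """Build the ffmpeg argument list for extracting resampled audio."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input file
        "-ar", str(sample_rate),  # audio sampling rate in Hz
        "-ac", str(channels),     # number of audio channels
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)


print(" ".join(build_ffmpeg_command("xxx.mp4", "xxx.wav")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.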

References