Audio Feature Extraction

Input: video files, or audio files (signed 16-bit PCM)

Tools: VGGish / ffmpeg

Task: extract an embedding (semantic vector) from an audio file.

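AudioSet, released by Google, is the audio counterpart of ImageNet. VGGish, the pretrained model released with AudioSet, maps audio to 128-dimensional semantic vectors. Besides using it directly as a feature extractor, you can also fine-tune it for a specific task.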

VGGish is a variant of VGG with 11 weight layers and a few modifications to the original architecture.

Audio file format: signed 16-bit PCM samples
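As a rough illustration of what "signed 16-bit PCM" means in practice, the sketch below reads a WAV file with the Python standard library, checks the sample width, and scales the integers to floats in [-1.0, 1.0). The helper name read_pcm16_wav is hypothetical, and the division by 32768 and the channel averaging are assumptions about how such input is usually normalized, not a copy of the VGGish code.

```python
import wave

import numpy as np


def read_pcm16_wav(path):
    """Read a signed 16-bit PCM WAV file as mono floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    # WAV stores little-endian int16, interleaved by channel.
    data = np.frombuffer(raw, dtype="<i2").reshape(-1, channels)
    # Assumption: scale to [-1, 1) and average channels to get mono.
    samples = data.astype(np.float64) / 32768.0
    return samples.mean(axis=1), rate
```

If the sample width is not 2 bytes, the file is not 16-bit PCM and should be converted first.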

Usage example: vggish_inference_demo.py

Compute the log mel spectrogram examples:
examples_batch = vggish_input.wavfile_to_examples(wav_file)

Extract embeddings with VGGish (sess is a TensorFlow session with the VGGish graph and checkpoint already loaded):
features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
[embedding_batch] = sess.run([embedding_tensor], feed_dict={features_tensor: examples_batch})

Postprocessing:
PCA transform + 8-bit quantization
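The postprocessing step can be sketched in NumPy as follows. This is a minimal illustration, not the VGGish implementation: the function name postprocess is hypothetical, the PCA matrix and means are supplied by the caller (in practice they would come from the released PCA parameters), and the clipping range of [-2.0, 2.0] is an assumption.

```python
import numpy as np

# Assumed clipping range applied before quantization.
QUANTIZE_MIN, QUANTIZE_MAX = -2.0, 2.0


def postprocess(embeddings, pca_matrix, pca_means):
    """Apply a PCA transform, then quantize each value to 8 bits.

    embeddings: (n, d) float array of raw embeddings
    pca_matrix: (d, d) PCA projection matrix
    pca_means:  (d, 1) per-dimension means subtracted before projection
    """
    # Center, project, and return to (n, d) layout.
    projected = (pca_matrix @ (embeddings.T - pca_means)).T
    # Clip to the assumed range, then map it linearly onto [0, 255].
    clipped = np.clip(projected, QUANTIZE_MIN, QUANTIZE_MAX)
    scaled = (clipped - QUANTIZE_MIN) * (255.0 / (QUANTIZE_MAX - QUANTIZE_MIN))
    # astype truncates toward zero; values are already within [0, 255].
    return scaled.astype(np.uint8)
```

Quantizing to 8 bits trades a little precision for embeddings that are four times smaller than float32.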

VGGish was trained with audio features computed as follows:

All audio is resampled to 16 kHz mono.
A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.
A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.01) where the offset is used to avoid taking a logarithm of zero.
These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
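The feature pipeline described above can be sketched in plain NumPy. This is an illustrative approximation under stated assumptions rather than the VGGish code: the FFT length of 512 and the HTK-style mel formula are assumptions, and all names (hz_to_mel, mel_matrix, log_mel_examples) are hypothetical. The input is assumed to be 16 kHz mono float samples.

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = int(round(0.025 * SAMPLE_RATE))  # 25 ms -> 400 samples
HOP = int(round(0.010 * SAMPLE_RATE))     # 10 ms -> 160 samples
FFT_LEN = 512                             # assumption: next power of two >= WINDOW
NUM_MEL = 64
MEL_LO_HZ, MEL_HI_HZ = 125.0, 7500.0
LOG_OFFSET = 0.01
EXAMPLE_FRAMES = 96                       # 96 frames * 10 ms = 0.96 s


def hz_to_mel(hz):
    # Assumption: HTK-style mel scale.
    return 1127.0 * np.log1p(np.asarray(hz) / 700.0)


def mel_matrix():
    """Triangular mel filterbank, shape (FFT_LEN // 2 + 1, NUM_MEL)."""
    fft_mel = hz_to_mel(np.linspace(0.0, SAMPLE_RATE / 2.0, FFT_LEN // 2 + 1))
    edges = np.linspace(hz_to_mel(MEL_LO_HZ), hz_to_mel(MEL_HI_HZ), NUM_MEL + 2)
    weights = np.zeros((FFT_LEN // 2 + 1, NUM_MEL))
    for i in range(NUM_MEL):
        lo, mid, hi = edges[i:i + 3]
        rising = (fft_mel - lo) / (mid - lo)
        falling = (hi - fft_mel) / (hi - mid)
        weights[:, i] = np.maximum(0.0, np.minimum(rising, falling))
    return weights


def log_mel_examples(samples):
    """Return log mel examples of shape (n_examples, 96, 64)."""
    n_frames = max(0, 1 + (len(samples) - WINDOW) // HOP)
    idx = np.arange(WINDOW)[None, :] + HOP * np.arange(n_frames)[:, None]
    # Periodic Hann window (period equals the window length).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(WINDOW) / WINDOW)
    # STFT magnitudes, one row per 10 ms frame.
    spec = np.abs(np.fft.rfft(samples[idx] * window, FFT_LEN))
    # Stabilized log mel spectrogram.
    log_mel = np.log(spec @ mel_matrix() + LOG_OFFSET)
    # Non-overlapping 0.96 s examples; trailing frames are dropped.
    n_examples = n_frames // EXAMPLE_FRAMES
    return log_mel[:n_examples * EXAMPLE_FRAMES].reshape(
        n_examples, EXAMPLE_FRAMES, NUM_MEL)
```

With these settings, two seconds of audio produce 198 frames and therefore two complete examples; the remaining six frames are discarded.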

ffmpeg -y -i xxx.mp4 -ar 16000 -ac 1 xxx.wav

Meaning of the main options:

-y: overwrite the output file if it already exists
-ar rate: set the audio sampling rate (in Hz)
-ac channels: set the number of audio channels
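When many videos need converting, the ffmpeg command above can be wrapped in a small Python helper. This is a sketch that assumes ffmpeg is installed and on PATH; the function names build_ffmpeg_cmd and extract_audio are hypothetical.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, rate=16000, channels=1):
    """Build the ffmpeg argument list for converting src to a WAV file."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", str(src),        # input file
        "-ar", str(rate),      # audio sampling rate in Hz
        "-ac", str(channels),  # number of audio channels
        str(dst),
    ]


def extract_audio(src, dst):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.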