Queries about VGGish embeddings for audio classification

Hi all,

I’m using VGGish as a feature extractor for a project that analyses a specific type of audio recording. My aim is to extract distinctive features from these recordings for advanced classification. Although this audio type is covered in VGGish’s training, my focus is on identifying subtle differences within the recordings and using those nuanced labels, together with the extracted embeddings, for further model training.

Is anyone familiar with the VGGish feature extractor? How can I tell whether the model is running correctly and generating accurate embeddings for the use case described above?

I’m using the following setup (as specified in the smoke test script):

NumPy version: 1.24.3
TensorFlow version: 2.13.0
Resampy version: 0.2.2
Python version: 3.10.0
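
As a quick self-check that an environment actually matches these pins before running the smoke test, something like this works (a sketch; the expected strings are simply the versions listed above):

import sys
import numpy as np
import resampy
import tensorflow as tf

# Fail fast if the environment drifts from the pinned versions.
assert np.__version__ == '1.24.3', np.__version__
assert tf.__version__ == '2.13.0', tf.__version__
assert resampy.__version__ == '0.2.2', resampy.__version__
print(sys.version)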

When running the smoke test, I get the following output, including an error:

vggish_smoke_test.py 
NumPy version: 1.24.3
TensorFlow version: 2.13.0
Resampy version: 0.2.2
Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:27:15) [Clang 11.1.0 ]

Testing your install of VGGish

Resampling via resampy works!
Log Mel Spectrogram example:  [[-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 ...
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]]
2024-01-22 13:39:58.532065: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2024-01-22 13:39:58.532117: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2024-01-22 13:39:58.532131: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2024-01-22 13:39:58.533086: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-01-22 13:39:58.533360: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
/Users/astrid/miniforge3/envs/mlp/lib/python3.10/site-packages/tensorflow/python/keras/engine/base_layer_v1.py:1697: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
/Users/astrid/miniforge3/envs/mlp/lib/python3.10/site-packages/tensorflow/python/keras/legacy_tf_layers/core.py:325: UserWarning: `tf.layers.flatten` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Flatten` instead.
  warnings.warn('`tf.layers.flatten` is deprecated and '
2024-01-22 13:39:58.820285: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2024-01-22 13:39:58.834282: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-01-22 13:39:59.494764: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
Traceback (most recent call last):
  File "/Users/astrid/PycharmProjects/tensorflow-fork/research/audioset/vggish/vggish_smoke_test.py", line 109, in <module>
    np.testing.assert_allclose(
  File "/Users/astrid/miniforge3/envs/mlp/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/Users/astrid/miniforge3/envs/mlp/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/astrid/miniforge3/envs/mlp/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.1, atol=0

Mismatched elements: 2 / 2 (100%)
Max absolute difference: 4677.1648125
Max relative difference: 3642446.46411244
 x: array([-2393.0867,  4677.508 ], dtype=float32)
 y: array([0.000657, 0.343   ])
VGGish embedding:  [  4145.806    -1775.6422   -4421.003      283.77875    280.54425
   7062.1445   -2468.7595   -2667.1458    8583.2705    2895.051
   1085.9359     112.46885   -674.6059    -628.6546    -639.0709
  -5431.542    -8282.701    -7877.5376   -3698.3943   -5914.762
  -4910.4116    4006.3538   -3452.2876    2604.2393  -11533.147
  -9635.958    -5293.464    -9104.569    -2764.2964    1372.5797
  -2867.9656   -3831.2227   -6376.5195   -7480.3164  -12168.685
  -2922.416     2131.4229    1245.7706    4473.025     -991.0255
  -8729.306     -968.88434  -1651.0012    1473.9858   -4756.5674
   1671.9463  -12686.039    -7621.3125    2401.2542   -6626.9785
  -3241.2998   -2296.9084    3245.5037    7920.159    -7880.0103
  -3329.7896   -9680.734     1454.865    -1290.4586   -6949.8228
  -1032.4684     240.48125   6042.2363    1059.341    -3920.9705
  -1738.9835   -2570.2908   -2550.5198   -5161.6216   -3711.025
  -8867.201     1133.5637    2433.5835    3250.356    -2027.6403
  -4212.3267    2270.933    -8749.348    -5870.978   -10817.919
  -3732.8555   -5644.2427    4470.46      3122.015    -1798.985
  -7916.349    -3385.9292   -8242.644    -1519.1459  -10945.789
   2603.645    -8356.439     2241.9624   -7348.066    -8720.624
  -8423.815    -2160.2988    1883.056      582.0027   -3764.776
  -1615.0001   -5850.113    -3389.6348   -6375.688     2802.1833
   3134.701    -5422.21      3110.646    -5574.4707   -2304.9424
   -401.61966   3439.3347   -6006.554    -1890.2675    -468.6196
   2881.7344   -1482.3462   -9001.828     6980.145    -5279.7983
   1888.4413   -9818.397     3517.1284    -670.00903  -4047.3115
   4867.6484   -8735.253    -1674.2653 ]
embedding mean/stddev -2393.0867 4677.508
Postprocessed VGGish embedding:  [  0   0   0   0   0 255 255 255   0   0   0   0 255   0   0 255 255 255
 255 255   0 255   0 255   0   0 255   0 255 255 255   0 255   0   0 255
   0   0 255   0   0 255   0   0   0   0 255   0   0 255 255   0 255   0
 255   0 255   0 255 255   0 255   0 255 255 255 255   0   0   0 255   0
 255 255   0   0   0   0 255   0   0 255 255 255 255   0 255 255   0   0
   0   0 255   0   0   0   0   0 255   0   0 255 255 255   0   0 255   0
 255   0 255   0 255 255   0   0 255 255 255 255 255 255 255 255 255 255
   0   0]
postproc embedding mean/stddev 123.515625 127.43772893401457


I suspect that, even though I have followed the core package requirements, the test could still fail because of differences in the overall environment setup. Still, how can I tell whether the model itself is running correctly?
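
Since the log shows the Metal plugin optimizer kicking in right before the mismatch, one check I plan to try is hiding the GPU so the whole test runs on CPU (a sketch; tf.config.set_visible_devices has to run before any graph or session is created, e.g. at the top of vggish_smoke_test.py):

import tensorflow as tf

# Hide the Metal PluggableDevice so every op falls back to the CPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices('GPU'))  # should print []

If the smoke test passes on CPU, that would point at the tensorflow-metal plugin rather than at the VGGish code itself.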

This is vggish_inference_demo.py, which I have modified to read a processed .wav recording and to dump the raw and post-processed embeddings to CSV:

# Copyright 2017 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

r"""A simple demonstration of running VGGish in inference mode.

This is intended as a toy example that demonstrates how the various building
blocks (feature extraction, model definition and loading, postprocessing) work
together in an inference context.

A WAV file (assumed to contain signed 16-bit PCM samples) is read in, converted
into log mel spectrogram examples, fed into VGGish, the raw embedding output is
whitened and quantized, and the postprocessed embeddings are optionally written
in a SequenceExample to a TFRecord file (using the same format as the embedding
features released in AudioSet).

Usage:
  # Run a WAV file through the model and print the embeddings. The model
  # checkpoint is loaded from vggish_model.ckpt and the PCA parameters are
  # loaded from vggish_pca_params.npz in the current directory.
  $ python vggish_inference_demo.py --wav_file /path/to/a/wav/file

  # Run a WAV file through the model and also write the embeddings to
  # a TFRecord file. The model checkpoint and PCA parameters are explicitly
  # passed in as well.
  $ python vggish_inference_demo.py --wav_file /path/to/a/wav/file \
                                    --tfrecord_file /path/to/tfrecord/file \
                                    --checkpoint /path/to/model/checkpoint \
                                    --pca_params /path/to/pca/params

  # Run a built-in input (a sine wave) through the model and print the
  # embeddings. Associated model files are read from the current directory.
  $ python vggish_inference_demo.py
"""

from __future__ import print_function

import numpy as np
import pandas as pd
import six
import soundfile
import tensorflow.compat.v1 as tf

import vggish_input
import vggish_params
import vggish_postprocess
import vggish_slim

flags = tf.app.flags

flags.DEFINE_string(
    'wav_file', 'file.wav',
    'Path to a wav file. Should contain signed 16-bit PCM samples. '
    'Defaults to file.wav; the synthetic sine fallback below is only '
    'used if this is set to an empty value.')

flags.DEFINE_string(
    'checkpoint', 'vggish_model.ckpt',
    'Path to the VGGish checkpoint file.')

flags.DEFINE_string(
    'pca_params', 'vggish_pca_params.npz',
    'Path to the VGGish PCA parameters file.')

flags.DEFINE_string(
    'tfrecord_file', None,
    'Path to a TFRecord file where embeddings will be written.')

FLAGS = flags.FLAGS


def main(_):
  # In this simple example, we run the examples from a single audio file through
  # the model. If none is provided, we generate a synthetic input.
  if FLAGS.wav_file:
    wav_file = FLAGS.wav_file
  else:
    # Write a WAV of a sine wave into an in-memory file object.
    num_secs = 5
    freq = 1000
    sr = 44100
    t = np.arange(0, num_secs, 1 / sr)
    x = np.sin(2 * np.pi * freq * t)
    # Convert to signed 16-bit samples.
    samples = np.clip(x * 32768, -32768, 32767).astype(np.int16)
    wav_file = six.BytesIO()
    soundfile.write(wav_file, samples, sr, format='WAV', subtype='PCM_16')
    wav_file.seek(0)
  examples_batch = vggish_input.wavfile_to_examples(wav_file)
  print(examples_batch)

  # Prepare a postprocessor to munge the model embeddings.
  pproc = vggish_postprocess.Postprocessor(FLAGS.pca_params)

  # If needed, prepare a record writer to store the postprocessed embeddings.
  writer = tf.python_io.TFRecordWriter(
      FLAGS.tfrecord_file) if FLAGS.tfrecord_file else None

  with tf.Graph().as_default(), tf.Session() as sess:
    # Define the model in inference mode, load the checkpoint, and
    # locate input and output tensors.
    vggish_slim.define_vggish_slim(training=False)
    vggish_slim.load_vggish_slim_checkpoint(sess, FLAGS.checkpoint)
    features_tensor = sess.graph.get_tensor_by_name(
        vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = sess.graph.get_tensor_by_name(
        vggish_params.OUTPUT_TENSOR_NAME)

    # Run inference and postprocessing.
    [embedding_batch] = sess.run([embedding_tensor],
                                 feed_dict={features_tensor: examples_batch})
    print(embedding_batch)
    postprocessed_batch = pproc.postprocess(embedding_batch)
    print(postprocessed_batch)

    # For raw embeddings
    embeddings_df = pd.DataFrame(embedding_batch)
    embeddings_df.to_csv('raw_embeddings.csv', index=False)

    # For post-processed embeddings
    postprocessed_df = pd.DataFrame(postprocessed_batch)
    postprocessed_df.to_csv('postprocessed_embeddings.csv', index=False)

    # Write the postprocessed embeddings as a SequenceExample, in a similar
    # format as the features released in AudioSet. Each row of the batch of
    # embeddings corresponds to roughly a second of audio (96 10ms frames), and
    # the rows are written as a sequence of bytes-valued features, where each
    # feature value contains the 128 bytes of the whitened quantized embedding.
    seq_example = tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(
            feature_list={
                vggish_params.AUDIO_EMBEDDING_FEATURE_NAME:
                    tf.train.FeatureList(
                        feature=[
                            tf.train.Feature(
                                bytes_list=tf.train.BytesList(
                                    value=[embedding.tobytes()]))
                            for embedding in postprocessed_batch
                        ]
                    )
            }
        )
    )
    print(seq_example)
    if writer:
      writer.write(seq_example.SerializeToString())

  if writer:
    writer.close()

if __name__ == '__main__':
  tf.app.run()
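
To double-check what actually gets written when --tfrecord_file is set, here is one way to read the records back (a sketch; the output path is whatever was passed to --tfrecord_file):

import numpy as np
import tensorflow.compat.v1 as tf
import vggish_params

# Decode each SequenceExample and rebuild the per-second uint8 embeddings.
for record in tf.python_io.tf_record_iterator('embeddings.tfrecord'):  # hypothetical path
  seq = tf.train.SequenceExample.FromString(record)
  feats = seq.feature_lists.feature_list[
      vggish_params.AUDIO_EMBEDDING_FEATURE_NAME].feature
  batch = np.stack(
      [np.frombuffer(f.bytes_list.value[0], dtype=np.uint8) for f in feats])
  print(batch.shape)  # (num_seconds, 128)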

The raw embeddings:

[[ 2756.9326  -1199.8948  -2900.791     128.01233   182.32951  4674.142
  -1631.9989  -1846.7615   5743.619    1904.8942    740.13043   102.50553
   -457.76312  -324.87384  -355.0232  -3592.4082  -5487.47    -5329.719
  -2412.3958  -3869.4138  -3269.5317   2790.9202  -2257.775    1789.078
  -7727.908   -6410.806   -3456.2542  -6050.643   -1898.5787    953.5816
  -1897.1648  -2591.8594  -4316.8213  -5088.4717  -8087.8438  -1908.6724
   1354.7952    820.46063  2971.3604   -659.65045 -5878.903    -528.14923
  -1065.45     1184.1643  -3248.1177   1140.3062  -8460.46    -4983.223
   1698.3937  -4557.1777  -2265.003   -1537.0801   2131.691    5227.428
  -5297.137   -2192.3735  -6419.795     931.5189   -963.0291  -4663.866
   -712.4488    177.36548  3995.95      706.87085 -2463.083   -1164.5531
  -1746.5039  -1806.945   -3413.1536  -2459.2554  -5883.6753    785.6369
   1588.856    2182.188   -1261.9966  -2828.1328   1483.209   -5914.6494
  -3952.074   -7239.8335  -2502.116   -3689.356    2927.5308   2035.9961
  -1315.393   -5181.636   -2369.9778  -5498.729   -1013.7645  -7334.33
   1748.266   -5488.7085   1524.1348  -4884.8755  -5866.6685  -5661.888
  -1484.344    1223.3682    334.27832 -2474.7815  -1163.9679  -3900.7063
  -2239.5137  -4280.021    1826.0135   2059.7024  -3626.6555   2093.119
  -3686.918   -1527.8844   -279.44202  2218.9045  -3955.217   -1246.5123
   -319.53705  1861.3494  -1048.6719  -5969.5107   4681.886   -3515.276
   1330.0773  -6535.7925   2420.3372   -397.57562 -2741.638    3234.0208
  -5772.872   -1068.8192 ]
 [ 2457.0186  -1035.3966  -2673.0425    112.68889   173.80641  4238.5264
  -1470.5931  -1662.3215   5182.865    1729.4012    716.36383   175.79173
   -446.68665  -337.6289   -304.13583 -3293.7856  -4898.9624  -4797.105
  -2275.0662  -3531.8345  -3005.336    2505.2292  -2007.4995   1530.7433
  -7036.09    -5837.109   -3177.276   -5484.803   -1711.9209    900.7856
  -1688.9238  -2372.5234  -3952.628   -4687.217   -7249.36    -1744.0284
   1260.919     701.9071   2723.6436   -603.1626  -5366.029    -506.66946
  -1007.3063   1026.9431  -2937.075    1003.5136  -7670.8125  -4532.009
   1500.611   -4195.7603  -2028.9331  -1415.187    1922.1714   4728.35
  -4812.224   -1949.5833  -5796.676     864.06433  -910.46344 -4247.7227
   -711.0734    127.39277  3579.9036    613.7357  -2214.797   -1051.9226
  -1602.5641  -1597.7485  -2999.3923  -2341.1975  -5286.994     730.96924
   1435.4087   1934.6099  -1135.1221  -2594.132    1339.4258  -5310.2666
  -3659.4177  -6533.5186  -2275.5522  -3372.8804   2682.113    1822.5903
  -1211.9471  -4651.404   -2055.5298  -4964.5186   -897.50336 -6647.9585
   1617.6327  -5030.092    1386.7338  -4390.19    -5343.241   -5209.279
  -1409.8773   1116.4227    289.62033 -2203.8167  -1028.9636  -3541.5996
  -2068.3115  -3924.0996   1642.9878   1865.5432  -3222.662    1896.2085
  -3375.8242  -1408.7878   -285.27808  1964.8013  -3609.0054  -1150.9559
   -260.9037   1730.4817   -895.07855 -5430.4873   4233.935   -3256.5818
   1205.6384  -5877.106    2199.7102   -424.80838 -2538.0137   2910.4604
  -5191.5586   -895.8431 ]
 [ 2296.4055   -987.3507  -2485.729     124.33925   198.58687  4006.7634
  -1373.579   -1537.2711   4897.963    1620.9253    712.42035   158.63565
   -407.67886  -294.93695  -287.18225 -3083.267   -4687.002   -4513.9927
  -2132.918   -3365.512   -2796.5828   2291.0002  -1888.6941   1448.4585
  -6655.6436  -5462.2407  -2996.6199  -5169.9053  -1601.158     805.7517
  -1610.6764  -2211.6794  -3670.8662  -4372.468   -6873.888   -1596.1859
   1206.8888    686.93677  2577.5234   -580.5423  -5022.0396   -490.27615
   -948.062     873.1976  -2772.5422    939.6318  -7173.763   -4222.4956
   1416.5039  -3936.6462  -1894.5361  -1312.8009   1828.0969   4493.235
  -4524.694   -1841.0607  -5414.3467    828.20776  -816.3447  -4004.6897
   -586.1435    150.64182  3427.127     601.1416  -2082.328    -991.07983
  -1493.0211  -1558.3323  -2803.5042  -2179.5542  -4942.416     707.5284
   1367.0704   1839.8708  -1106.3308  -2454.5269   1290.5985  -4966.6074
  -3436.9673  -6132.8975  -2129.554   -3166.4424   2552.6677   1724.22
  -1057.1676  -4381.1265  -1899.9567  -4682.8975   -904.14594 -6266.201
   1570.8909  -4750.665    1302.5356  -4114.8413  -5005.37    -4887.6357
  -1294.8773   1068.483     300.72098 -2100.631    -993.5702  -3328.8289
  -1900.2711  -3673.7434   1556.2346   1743.7675  -3048.4165   1794.3772
  -3173.4653  -1323.4886   -250.27246  1872.7726  -3350.4636  -1078.0472
   -189.5445   1632.3423   -835.99084 -5105.101    4015.5042  -3000.1921
   1137.0171  -5556.279    2056.376    -355.69843 -2387.773    2715.4758
  -4915.04     -873.0093 ]]

And the post-processed embeddings:

[[  0   0   0   0   0 255 255 255   0   0   0   0 255   0   0 255 255 255
  255 255   0 255   0 255   0   0 255   0 255 255 255   0 255   0   0 255
    0   0 255   0   0 255   0   0   0   0 255   0   0 255 255   0 255   0
  255   0 255   0 255 255   0 255   0 255 255 255 255   0   0   0 255   0
  255 255   0   0   0   0 255   0   0 255 255 255 255   0 255 255   0   0
    0   0 255   0   0   0   0   0 255   0   0 255 255 255   0   0 255 255
  255   0 255   0 255 255   0   0 255 255 255 255 255 255 255 255 255 255
    0   0]
 [  0   0   0   0   0 255 255 255   0   0   0   0 255   0   0 255 255 255
  255 255   0 255   0 255   0   0 255   0 255 255 255   0 255   0   0 255
    0   0 255   0   0 255   0   0   0   0 255   0   0 255 255   0 255   0
  255   0 255   0 255 255   0 255   0 255 255 255 255   0   0   0 255   0
  255 255   0   0   0   0 255   0   0 255 255 255 255   0 255 255   0   0
    0   0 255   0   0   0   0   0 255   0   0 255 255 255   0   0 255   0
  255   0 255   0 255 255   0   0 255 255 255 255 255 255 255 255 255 255
    0   0]
 [  0   0   0   0   0 255 255 255   0   0   0   0 255   0   0 255 255 255
  255 255   0 255   0 255   0   0 255   0 255 255 255   0 255   0   0 255
    0   0 255   0   0 255   0   0   0   0 255   0   0 255 255   0 255   0
  255   0 255   0 255 255   0 255 255 255 255 255 255   0   0   0 255   0
  255 255   0   0   0   0 255   0   0 255 255 255 255   0 255 255   0   0
    0   0 255   0   0   0   0   0 255   0   0 255 255 255   0   0 255   0
  255   0 255   0 255 255   0   0 255 255 255 255 255 255 255 255 255 255
    0   0]]
feature_lists {
  feature_list {
    key: "audio_embedding"
    value {
      feature {
        bytes_list {
          value: "\000\000\000\000\000\377\377\377\000\000\000\000\377\000\000\377\377\377\377\377\000\377\000\377\000\000\377\000\377\377\377\000\377\000\000\377\000\000\377\000\000\377\000\000\000\000\377\000\000\377\377\000\377\000\377\000\377\000\377\377\000\377\000\377\377\377\377\000\000\000\377\000\377\377\000\000\000\000\377\000\000\377\377\377\377\000\377\377\000\000\000\000\377\000\000\000\000\000\377\000\000\377\377\377\000\000\377\377\377\000\377\000\377\377\000\000\377\377\377\377\377\377\377\377\377\377\000\000"
        }
      }
      feature {
        bytes_list {
          value: "\000\000\000\000\000\377\377\377\000\000\000\000\377\000\000\377\377\377\377\377\000\377\000\377\000\000\377\000\377\377\377\000\377\000\000\377\000\000\377\000\000\377\000\000\000\000\377\000\000\377\377\000\377\000\377\000\377\000\377\377\000\377\000\377\377\377\377\000\000\000\377\000\377\377\000\000\000\000\377\000\000\377\377\377\377\000\377\377\000\000\000\000\377\000\000\000\000\000\377\000\000\377\377\377\000\000\377\000\377\000\377\000\377\377\000\000\377\377\377\377\377\377\377\377\377\377\000\000"
        }
      }
      feature {
        bytes_list {
          value: "\000\000\000\000\000\377\377\377\000\000\000\000\377\000\000\377\377\377\377\377\000\377\000\377\000\000\377\000\377\377\377\000\377\000\000\377\000\000\377\000\000\377\000\000\000\000\377\000\000\377\377\000\377\000\377\000\377\000\377\377\000\377\377\377\377\377\377\000\000\000\377\000\377\377\000\000\000\000\377\000\000\377\377\377\377\000\377\377\000\000\000\000\377\000\000\000\000\000\377\000\000\377\377\377\000\000\377\000\377\000\377\000\377\377\000\000\377\377\377\377\377\377\377\377\377\377\000\000"
        }
      }
    }
  }
}

This is the postprocess function:

  def postprocess(self, embeddings_batch):
    """Applies postprocessing to a batch of embeddings.

    Args:
      embeddings_batch: An nparray of shape [batch_size, embedding_size]
        containing output from the embedding layer of VGGish.

    Returns:
      An nparray of the same shape as the input but of type uint8,
      containing the PCA-transformed and quantized version of the input.
    """
    assert len(embeddings_batch.shape) == 2, (
        'Expected 2-d batch, got %r' % (embeddings_batch.shape,))
    assert embeddings_batch.shape[1] == vggish_params.EMBEDDING_SIZE, (
        'Bad batch shape: %r' % (embeddings_batch.shape,))

    # Apply PCA.
    # - Embeddings come in as [batch_size, embedding_size].
    # - Transpose to [embedding_size, batch_size].
    # - Subtract pca_means column vector from each column.
    # - Premultiply by PCA matrix of shape [output_dims, input_dims]
    #   where both are equal to embedding_size in our case.
    # - Transpose result back to [batch_size, embedding_size].
    pca_applied = np.dot(self._pca_matrix,
                         (embeddings_batch.T - self._pca_means)).T

    # Quantize by:
    # - clipping to [min, max] range
    clipped_embeddings = np.clip(
        pca_applied, vggish_params.QUANTIZE_MIN_VAL,
        vggish_params.QUANTIZE_MAX_VAL)
    # - convert to 8-bit in range [0.0, 255.0]
    quantized_embeddings = (
        (clipped_embeddings - vggish_params.QUANTIZE_MIN_VAL) *
        (255.0 /
         (vggish_params.QUANTIZE_MAX_VAL - vggish_params.QUANTIZE_MIN_VAL)))
    # - cast 8-bit float to uint8
    quantized_embeddings = quantized_embeddings.astype(np.uint8)

    return quantized_embeddings

So this particular file is split into three segments: the raw embeddings vary significantly between segments, yet the post-processed ones are nearly identical. Is this typical behaviour for VGGish, or could it be a sign of an issue in the way the embeddings are post-processed? Or is it because all segments belong to the same class that VGGish learned during training and would therefore always map to the same post-processed embedding?
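
One thing I noticed while digging: vggish_params pins the quantization range to [-2.0, +2.0] (QUANTIZE_MIN_VAL / QUANTIZE_MAX_VAL), so values at the magnitude of my raw embeddings saturate the clip almost everywhere. A standalone check (a sketch that skips the PCA step for illustration):

import numpy as np

# Values in the thousands (like my raw embeddings) clip to +/-2.0 and
# quantize straight to 0 or 255; only unit-scale values land in between.
pca_applied = np.array([2756.9, -1199.9, -0.5, 1.3])
clipped = np.clip(pca_applied, -2.0, 2.0)
quantized = ((clipped - (-2.0)) * (255.0 / 4.0)).astype(np.uint8)
print(quantized)  # [255   0  95 210]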

To clarify, I have a dataset of the same type of audio carrying four different labels (characteristics), and I am trying to retrieve embeddings so that I can build a model that predicts these four classes. Each file corresponds to one such class, but I wonder whether VGGish might treat them all as the same since it’s the same type of data.
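
For what it’s worth, the downstream plan once the embeddings look sane is roughly a linear probe over per-file mean embeddings (a sketch; labels.csv and the per-file CSV naming are hypothetical):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical layout: one <name>_raw_embeddings.csv per recording plus a
# labels.csv with columns [file, label]; averaging the per-second rows
# turns each recording into a single 128-d feature vector.
labels = pd.read_csv('labels.csv')
X = np.stack([
    pd.read_csv(f'{name}_raw_embeddings.csv').to_numpy().mean(axis=0)
    for name in labels['file']])
y = labels['label'].to_numpy()

# A linear probe is a cheap first test of whether the embeddings
# separate the four classes at all.
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5))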

Any insights, suggestions, or pointers towards resources would be greatly appreciated!

#tensorflow #vggish #feature-extraction #machine-learning #audio-processing

To ensure VGGish runs correctly and generates accurate embeddings for your use case:

  1. Check Environment: Verify your setup is compatible with VGGish requirements, especially TensorFlow versions.
  2. Address Warnings: Update deprecated TensorFlow functions to their latest equivalents.
  3. Analyze Embeddings: Investigate if the VGGish embeddings capture the nuances you’re interested in by visually inspecting or statistically analyzing the embeddings.
  4. Fine-tune: Consider fine-tuning VGGish on your specific dataset to better capture subtle differences (see the sketch after this list).
  5. Consult Resources: Look into VGGish documentation and community forums for similar use cases and solutions.
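
For point 4, a minimal sketch of the usual fine-tuning recipe (mirroring vggish_train_demo.py from the same repo: build VGGish as trainable, restore the checkpoint, then train a small head for your four classes; the head and optimizer details here are assumptions, not the repo's exact code):

import tensorflow.compat.v1 as tf
import vggish_params
import vggish_slim

NUM_CLASSES = 4  # the four labels in your dataset

with tf.Graph().as_default(), tf.Session() as sess:
  # Trainable VGGish; returns the 128-d embedding tensor.
  embeddings = vggish_slim.define_vggish_slim(training=True)

  # Small classifier head on top of the embeddings (assumed, not from the repo).
  logits = tf.layers.dense(embeddings, NUM_CLASSES)
  labels = tf.placeholder(tf.int32, shape=(None,), name='labels')
  loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=labels, logits=logits))
  train_op = tf.train.AdamOptimizer(
      learning_rate=vggish_params.LEARNING_RATE).minimize(loss)

  # Initialize everything, then overwrite the VGGish weights from the
  # released checkpoint so only the head starts from scratch.
  sess.run(tf.global_variables_initializer())
  vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
  # ...feed batches of log mel examples and labels into train_op...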