Interpretable Latent Spaces Using Space-Filling Vector Quantization

by Mohammad Hassan Vali, April 2024


A new unsupervised method that combines two concepts of vector quantization and space-filling curves to interpret the latent space of DNNs

This post is a short explanation of our novel unsupervised distribution modeling technique called space-filling vector quantization [1], published at the Interspeech 2023 conference. For more details, please see the paper [1].

Image from StockSnap.io

Deep generative models are well-known neural-network-based architectures that learn a latent space whose samples can be mapped to sensible real-world data such as images, video, and speech. Such latent spaces act as a black box and are often difficult to interpret. In this post, we introduce our novel unsupervised distribution modeling technique, Space-Filling Vector Quantization (SFVQ), which combines the two concepts of space-filling curves and vector quantization (VQ). SFVQ helps make the latent space interpretable by capturing its underlying morphological structure. It is important to note that SFVQ is a generic tool for modeling distributions; its use is not restricted to any specific neural network architecture or data type (e.g., image, video, speech). In this post, we demonstrate the application of SFVQ to interpret the latent space of a voice conversion model. You do not need a technical background in speech signals to understand this post, because we explain everything in general terms. Before everything else, let me explain what the SFVQ technique is and how it works.

Space-Filling Vector Quantization (SFVQ)

Vector quantization (VQ) is a data compression technique, similar to the k-means algorithm, that can model any data distribution. The figure below shows VQ applied to a Gaussian distribution. VQ clusters this distribution (gray points) using 32 codebook vectors (blue points), i.e., clusters. Each Voronoi cell (green lines) contains one codebook vector, which is the closest codebook vector (in terms of Euclidean distance) to all data points located in that cell. In other words, each codebook vector is the representative of all data points located in its corresponding Voronoi cell. Therefore, applying VQ to this Gaussian distribution means mapping each data point to its closest codebook vector, i.e., representing each data point by its closest codebook vector. For more information about VQ and its other variants, you can check out this post.

Vector Quantization applied on a Gaussian distribution using 32 codebook vectors. (image by author)
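To make the nearest-codebook mapping concrete, here is a minimal NumPy sketch (an illustration, not the implementation used in the paper) that quantizes points from a 2-D Gaussian with a fixed codebook:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))      # samples from a 2-D Gaussian (gray points)
codebook = rng.normal(size=(32, 2))    # 32 codebook vectors (blue points)

def quantize(x, codebook):
    """Map each data point to the index of its closest codebook vector
    (Euclidean distance), i.e. assign it to a Voronoi cell."""
    dists = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)  # (N, K)
    return dists.argmin(axis=1)

indices = quantize(data, codebook)
reconstruction = codebook[indices]     # each point replaced by its codebook vector
```

In practice the codebook itself is learned (e.g., with k-means-style updates) rather than drawn at random as in this sketch.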

A space-filling curve is a piecewise continuous line generated by a recursive rule; if the recursion is repeated infinitely, the curve bends until it completely fills a multi-dimensional space. The following figure illustrates the Hilbert curve [2], a well-known type of space-filling curve whose corner points are defined by a specific mathematical formulation at each recursion iteration.

First five iterations of the Hilbert curve filling a 2D square distribution. (image by author)
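For illustration only (this is the standard textbook construction, not code from the paper), the corner points of a 2-D Hilbert curve on a 2^k × 2^k grid can be generated from their 1-D curve index:

```python
def hilbert_d2xy(order, d):
    """Map a 1-D index d (0 <= d < 4**order) to (x, y) on a 2**order x 2**order
    Hilbert curve, using the classic bit-manipulation construction."""
    n = 2 ** order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                     # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# corner points of the third-iteration curve, in traversal order
corners = [hilbert_d2xy(3, d) for d in range(4 ** 3)]
```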

Taking intuition from space-filling curves, we can think of vector quantization (VQ) as mapping input data points onto a space-filling curve (rather than mapping them exclusively onto codebook vectors, as in normal VQ). Therefore, we incorporate vector quantization into space-filling curves, such that our proposed space-filling vector quantizer (SFVQ) models a D-dimensional data distribution by a continuous piecewise linear curve whose corner points are vector quantization codebook vectors. The following figure illustrates VQ and SFVQ applied to a Gaussian distribution.

Codebook vectors (blue points) of a vector quantizer, and a space-filling vector quantizer (curve in black) on a Gaussian distribution (gray points). Voronoi regions for VQ are shown in green. (image by author)

For technical details on how to train SFVQ and how to map data points on SFVQ’s curve, please see section 2 in our paper [1].
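Purely as an illustration of the mapping idea (the actual training and mapping procedures are described in the paper), projecting a data point onto a piecewise-linear curve defined by consecutive codebook vectors could look like this minimal NumPy sketch:

```python
import numpy as np

def map_to_curve(x, corners):
    """Project a data point x onto the closest point of the piecewise-linear
    curve whose corner points are the (ordered) codebook vectors.
    Returns the projected point and a continuous curve coordinate."""
    best = (np.inf, None, None)
    for i in range(len(corners) - 1):
        a, b = corners[i], corners[i + 1]
        ab = b - a
        t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        p = a + t * ab                  # closest point on segment i -> i+1
        d = np.linalg.norm(x - p)
        if d < best[0]:
            best = (d, p, i + t)        # i + t: position along the curve
    return best[1], best[2]
```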

Note that when we train a normal VQ on a distribution, adjacent codebook vectors inside the learned codebook matrix can refer to totally different contents. For example, the first codebook element could refer to a vowel phone and the second one to a silent part of a speech signal. However, when we train SFVQ on a distribution, the learned codebook vectors are arranged such that adjacent elements in the codebook matrix (i.e., adjacent codebook indices) refer to similar contents in the distribution. We can use this property of SFVQ to interpret and explore the latent spaces of Deep Neural Networks (DNNs). As a typical example, in the following we explain how we used our SFVQ method to interpret the latent space of a voice conversion model [3].
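Before that, a quick way to check this ordering property on any learned codebook (a hedged diagnostic sketch, not an experiment from the paper) is to compare the average distance between index-adjacent codebook vectors with that of randomly paired ones; for an SFVQ codebook the former should be clearly smaller:

```python
import numpy as np

def adjacency_check(codebook, seed=0):
    """Compare the mean distance between index-adjacent codebook vectors
    (rows i and i+1) against randomly paired rows.  A much smaller adjacent
    distance indicates the index-ordered structure SFVQ is expected to produce."""
    rng = np.random.default_rng(seed)
    adjacent = np.linalg.norm(codebook[1:] - codebook[:-1], axis=1).mean()
    perm = rng.permutation(len(codebook))
    random_pairs = np.linalg.norm(codebook[perm[1:]] - codebook[perm[:-1]], axis=1).mean()
    return adjacent, random_pairs
```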

Voice Conversion

The following figure shows a voice conversion model [3] based on the vector quantized variational autoencoder (VQ-VAE) [4] architecture. In this model, the encoder takes the speech signal of speaker A as input and passes its output to the vector quantization (VQ) block, which extracts the phonetic information (phones) from the speech signal. Then, this phonetic information, together with the identity of speaker B, goes into the decoder, which outputs the converted speech signal. The converted speech contains the phonetic information (content) of speaker A with the identity of speaker B.

Voice conversion model based on VQ-VAE architecture. (image by author)
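To make the data flow concrete, here is a heavily simplified PyTorch sketch of such a pipeline (hypothetical class name and layer sizes; the straight-through gradient, commitment loss, and the actual architecture from [3] are omitted):

```python
import torch
import torch.nn as nn

class VQVoiceConversion(nn.Module):
    """Simplified sketch of the encoder -> VQ -> decoder voice conversion pipeline."""

    def __init__(self, n_mels=80, latent_dim=64, codebook_size=256, n_speakers=100):
        super().__init__()
        self.encoder = nn.Sequential(                        # content encoder (speaker A)
            nn.Conv1d(n_mels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, latent_dim, 3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)   # VQ codebook (bottleneck)
        self.speaker_emb = nn.Embedding(n_speakers, latent_dim)   # identity of speaker B
        self.decoder = nn.Sequential(
            nn.Conv1d(2 * latent_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, n_mels, 3, padding=1),
        )

    def forward(self, mel_a, speaker_b):
        z = self.encoder(mel_a)                              # (B, latent_dim, T)
        # nearest-codebook quantization: keep phonetic content, drop speaker details
        dists = torch.cdist(z.transpose(1, 2), self.codebook.weight.unsqueeze(0))
        idx = dists.argmin(dim=-1)                           # (B, T) discrete codes
        z_q = self.codebook(idx).transpose(1, 2)             # (B, latent_dim, T)
        spk = self.speaker_emb(speaker_b)[:, :, None].expand(-1, -1, z_q.size(-1))
        return self.decoder(torch.cat([z_q, spk], dim=1)), idx
```

Replacing the plain codebook lookup here with an SFVQ codebook (plus the curve mapping from the paper) is exactly the change discussed in the rest of this post.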

In this model, the VQ module acts as an information bottleneck that learns a discrete representation of speech capturing only the phonetic content and discarding speaker-related information. In other words, the VQ codebook vectors are expected to capture only the phone-related content of the speech. Here, the output of the VQ block is considered the latent space of this model. Our objective is to replace the VQ module with our proposed SFVQ method to interpret the latent space. By interpretation we mean figuring out which phone each latent vector (codebook vector) corresponds to.

Interpreting the Latent Space using SFVQ

We evaluate the performance of our space-filling vector quantizer (SFVQ) on its ability to find the structure of the latent space (which represents phonetic information) in the above voice conversion model. For our evaluations, we use the TIMIT dataset [5], since it contains phone-wise labeled data using the phone set from [6]. For our experiments, we use the following phonetic grouping:

  • Plosives (Stops): {p, b, t, d, k, g, jh, ch}
  • Fricatives: {f, v, th, dh, s, z, sh, zh, hh, hv}
  • Nasals: {m, em, n, nx, ng, eng, en}
  • Vowels: {iy, ih, ix, eh, ae, aa, ao, ah, ax, ax-h, uh, uw, ux}
  • Semi-vowels (Approximants): {l, el, r, er, axr, w, y}
  • Diphthongs: {ey, aw, ay, oy, ow}
  • Silence: {h#}.

To analyze the performance of our proposed SFVQ, we pass the labeled TIMIT speech files through the trained encoder and SFVQ modules and extract the codebook vector indices corresponding to all phones present in the speech. In other words, we pass a speech signal with labeled phones through the model and record the index of the learned SFVQ codebook vector that each phone is mapped to. As explained above, we expect our SFVQ to map similar phonetic contents next to each other (index-wise in the learned codebook matrix). To examine this expectation, in the following figure we visualize the spectrogram of the sentence “she had your dark suit”, together with its corresponding codebook vector indices for the ordinary vector quantizer (VQ) and our proposed SFVQ.
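As a rough sketch of this bookkeeping step (hypothetical variable names, assuming frame-level phone labels and per-frame codebook indices are already available), one could aggregate which codebook indices each phonetic group lands on and then plot a histogram per group:

```python
from collections import defaultdict

# Hypothetical inputs, one entry per speech frame:
# frame_phones:  phone labels, e.g. ["sh", "iy", "hh", ...]
# frame_indices: SFVQ codebook indices the frames were mapped to
PHONE_GROUPS = {
    "plosives":   {"p", "b", "t", "d", "k", "g", "jh", "ch"},
    "fricatives": {"f", "v", "th", "dh", "s", "z", "sh", "zh", "hh", "hv"},
    "nasals":     {"m", "em", "n", "nx", "ng", "eng", "en"},
    "vowels":     {"iy", "ih", "ix", "eh", "ae", "aa", "ao", "ah", "ax", "ax-h", "uh", "uw", "ux"},
    "semivowels": {"l", "el", "r", "er", "axr", "w", "y"},
    "diphthongs": {"ey", "aw", "ay", "oy", "ow"},
    "silence":    {"h#"},
}

def indices_per_group(frame_phones, frame_indices):
    """Collect the SFVQ codebook indices that each phonetic group is mapped to."""
    hist = defaultdict(list)
    for phone, idx in zip(frame_phones, frame_indices):
        for group, members in PHONE_GROUPS.items():
            if phone in members:
                hist[group].append(idx)
    return hist   # e.g. plot one histogram per group to inspect index clustering
```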
