Understanding t-SNE: Unveiling the Intricacies of Dimensionality Reduction

Vitality Learning
13 min read · Sep 8, 2024



Dimensionality reduction is a cornerstone technique in the realm of machine learning and data visualization. Among the myriad of methods available, t-distributed Stochastic Neighbor Embedding (t-SNE) stands out for its prowess in visualizing high-dimensional data in two or three dimensions. In this article, we’ll delve deep into what t-SNE is, the rationale behind its design, its mathematical foundations, strengths, limitations, and diverse applications. Additionally, we’ll explore two practical code examples to illustrate its utility.

The code for this post is available at this GitHub link.

Introduction to t-SNE

t-distributed Stochastic Neighbor Embedding, commonly known as t-SNE, is a powerful tool for reducing the dimensionality of large datasets, making it easier to visualize and interpret complex data structures. Introduced by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE has become a staple in exploratory data analysis, especially for visualizing high-dimensional data such as images, text embeddings, and biological data.

The Rationale Behind t-SNE

The primary objective of t-SNE is to map high-dimensional data into a lower-dimensional space (typically 2D or 3D) while preserving the structure of the data as much as possible. This involves maintaining the relative distances between points, ensuring that similar data points remain close to each other in the reduced space, and dissimilar points are adequately separated.

Traditional linear dimensionality reduction techniques like Principal Component Analysis (PCA) focus on preserving global structures but often fail to capture the nuanced local relationships within the data. t-SNE, on the other hand, emphasizes preserving local neighborhoods, making it particularly adept at uncovering hidden patterns and clusters within complex datasets.

t-SNE Overview

Distance Preservation

At its core, t-SNE strives to preserve the pairwise distances between data points when mapping from a high-dimensional space to a lower-dimensional one. By doing so, it ensures that the intrinsic relationships within the data are maintained, allowing for meaningful visualizations.

Local vs. Global Structure

One of the distinguishing features of t-SNE is its focus on local structure over global structure. While PCA captures the overall variance in the data, t-SNE zooms in on the local neighborhoods, ensuring that points that are close in high-dimensional space remain close in the reduced space. This makes t-SNE especially effective for visualizing clusters but may sometimes distort the global arrangement of these clusters relative to each other.

High-Level Process of t-SNE

The t-SNE algorithm can be broken down into the following high-level steps:

  1. Compute Pairwise Similarities in High-Dimensional Space: Determine how similar or dissimilar each pair of data points is in the original high-dimensional space.
  2. Define Pairwise Similarities in Low-Dimensional Space: Establish a probability distribution that reflects similarities in the reduced space.
  3. Optimize the Mapping: Adjust the positions of points in the low-dimensional space to minimize the divergence between the high-dimensional and low-dimensional similarity distributions.
  4. Visualization: Plot the resulting low-dimensional representation for analysis and interpretation.
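
As a quick preview of these steps in practice, here is a minimal sketch using scikit-learn's TSNE on synthetic data (the data shape and perplexity value are illustrative only); the two detailed examples later in the article apply the same call to MNIST.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Toy data: 500 points in 50 dimensions (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))

# Steps 1-3 happen inside fit_transform: high-dimensional similarities,
# low-dimensional similarities, and gradient-based optimization.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Step 4: visualize the 2D embedding
plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.show()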

Mathematical Details of t-SNE

To truly grasp how t-SNE operates, it is essential to delve into its mathematical underpinnings.

Pairwise Similarities in High-Dimensional Space

For a dataset with n data points, t-SNE first computes the conditional probability p_{j|i}, which reflects the probability that data point x_i would pick x_j as its neighbor if neighbors were chosen in proportion to a Gaussian distribution centered at x_i:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

Here, σ_i is the standard deviation of the Gaussian distribution, effectively controlling the “bandwidth” around x_i.

The Gaussian distribution is well-suited for modeling local neighborhood relationships because it concentrates the majority of probability mass around the mean (the center of a point’s local neighborhood) and decays rapidly as you move farther away. This ensures that nearby points are assigned high similarity probabilities, while distant points are assigned low probabilities.

Mathematically, for each point i, the similarity to another point j is modeled as a Gaussian with mean centered at i and variance (or bandwidth) determined by a perplexity parameter, which controls how large or small the local neighborhood should be.

To symmetrize the probabilities, t-SNE defines the joint probability:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

This symmetric joint probability p_{ij} represents the similarity between points x_i and x_j in the high-dimensional space.
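
The following is a didactic NumPy sketch of these two computations (not an optimized implementation: it uses O(n²) memory and assumes the bandwidths σ_i have already been determined, e.g., from the perplexity as discussed below).

import numpy as np

def high_dim_affinities(X, sigmas):
    """Conditional probabilities p_{j|i} and symmetrized joint probabilities p_ij."""
    n = X.shape[0]
    # Squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel with a per-point bandwidth sigma_i (row i uses sigma_i)
    P_cond = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P_cond, 0.0)                 # a point is not its own neighbor
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # row-normalize: p_{j|i}
    # Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2n)
    P = (P_cond + P_cond.T) / (2.0 * n)
    return P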

Pairwise Similarities in Low-Dimensional Space

In the reduced space (e.g., 2D), t-SNE models the similarity between points y_i and y_j using a Student's t-distribution with one degree of freedom (equivalent to a Cauchy distribution):

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

The Student’s t-distribution is used in the low-dimensional space because it has heavier tails than a Gaussian. This means that it assigns higher probabilities to points that are far apart, allowing for more flexibility in the placement of distant points.

The choice of the t-distribution helps t-SNE handle the crowding problem, allowing dissimilar points to be modeled effectively without overwhelming the visualization with too much overlap. Indeed, with a heavier-tailed distribution, t-SNE can place points that are moderately far apart in the high-dimensional space much farther apart in the low-dimensional map without paying a large penalty. This avoids overcrowding and yields better-separated clusters.
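
Continuing the NumPy sketch from above, the low-dimensional similarities are computed the same way, with the Gaussian kernel replaced by the heavy-tailed Student-t kernel.

def low_dim_affinities(Y):
    """Student-t (1 degree of freedom) similarities q_ij for the embedding Y."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # heavy-tailed kernel (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()            # normalize over all pairs
    return Q, inv                  # 'inv' is reused later in the gradient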

Kullback-Leibler Divergence

To measure how well the low-dimensional representation Q captures the high-dimensional similarities P, t-SNE employs the Kullback-Leibler (KL) divergence:

C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

The objective is to minimize this divergence, ensuring that the distribution of similarities in the low-dimensional space closely matches that of the high-dimensional space.
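
In the running NumPy sketch, the objective is a single scalar; a small epsilon guards against taking the logarithm of zero.

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs of points."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))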

Gradient Descent Optimization

Minimizing the KL divergence is achieved through an optimization technique known as gradient descent. During this iterative process, the positions of the points in the low-dimensional space are adjusted to reduce the divergence between P and Q. This results in a configuration where similar points cluster together, and dissimilar points are appropriately separated.
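
For reference, the gradient derived in the original paper has the closed form \partial C / \partial y_i = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}, which the sketch below implements. The learning rate and iteration count are illustrative values, and practical implementations add refinements such as momentum and early exaggeration.

def kl_gradient(P, Q, Y, inv):
    """Gradient of the KL divergence with respect to the embedding Y."""
    PQ = (P - Q) * inv   # (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^(-1)
    return 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y

# A bare-bones optimization loop (illustrative hyperparameters)
def tsne_optimize(P, n_points, n_iter=500, learning_rate=200.0):
    Y = 1e-4 * np.random.randn(n_points, 2)   # small random initialization
    for _ in range(n_iter):
        Q, inv = low_dim_affinities(Y)
        Y -= learning_rate * kl_gradient(P, Q, Y, inv)
    return Y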

Perplexity and Bandwidth

Perplexity is a crucial hyperparameter in t-SNE that influences the balance between local and global aspects of the data. It can be thought of as a smooth measure of the number of nearest neighbors considered for each point. Typical values range from 5 to 50, depending on the dataset’s complexity.

The bandwidth σ_i of each Gaussian is adjusted per point (typically by binary search) so that the entropy of the conditional distribution matches the chosen perplexity, ensuring an appropriate scale of similarity for each data point.
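
Below is a sketch of how σ_i can be found for a single point by binary search, so that the entropy of the conditional distribution matches log₂(perplexity); the tolerance, iteration cap, and search bounds are illustrative choices.

def sigma_from_perplexity(sq_dists_i, perplexity, tol=1e-4, max_iter=50):
    """Binary search for sigma_i; sq_dists_i holds squared distances from point i to all others."""
    lo, hi = 1e-10, 1e10
    target_entropy = np.log2(perplexity)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:   # neighborhood too wide -> shrink sigma
            hi = sigma
        else:                          # neighborhood too narrow -> enlarge sigma
            lo = sigma
    return sigma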

Strengths of t-SNE

Cluster Visualization

t-SNE excels at revealing clusters within data, making it invaluable for exploratory data analysis. By preserving local neighborhoods, it ensures that similar data points form distinct groups in the visualization, facilitating the identification of underlying patterns.

Nonlinearity

t-SNE is considered a nonlinear dimensionality reduction approach because it does not rely on a simple, fixed mathematical transformation, like linear methods do, to project high-dimensional data into a lower-dimensional space. Instead, it focuses on preserving local structures by adaptively changing how points are mapped based on their relationships with their neighbors. Let’s break this down in more detail:

Linear vs. Nonlinear Transformations

  • Linear dimensionality reduction methods, such as PCA (Principal Component Analysis), rely on linear transformations. These methods project data onto a lower-dimensional space using linear combinations of the original features. The relationships between the original dimensions and the reduced dimensions are fixed and linear, preserving global structure (variance) but often losing the nuances of local relationships.
  • Nonlinear methods, like t-SNE, adaptively transform the data in a way that captures complex, nonlinear relationships between data points. These methods do not assume that the data lies in a flat, Euclidean space; instead, they can map data that resides on curved or non-Euclidean manifolds. In the case of t-SNE, the mapping is flexible and depends on the specific distance and neighborhood relationships between data points, leading to a highly nonlinear projection.

Nonlinear Adaptation to Data

Because t-SNE adapts the mapping based on local neighborhood relationships, it can create complex, curvilinear boundaries between clusters in the lower-dimensional space. This adaptability is what makes it nonlinear — each region of the high-dimensional space can be treated differently depending on its local structure.

Thus, t-SNE’s ability to warp and reshape the space to focus on preserving local geometry over global geometry makes it fundamentally nonlinear. The relationship between the high-dimensional data and the reduced dimensions is not a straightforward equation but a result of an adaptive, iterative optimization process that can accommodate intricate, non-flat structures in the data.
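
To make the contrast concrete, the sketch below projects the same data with PCA and with t-SNE side by side; it assumes scikit-learn and uses the small digits dataset bundled with it.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Small 8x8-digit dataset (1,797 samples, 64 features)
digits = load_digits()
X, y = digits.data, digits.target

# Linear projection: PCA keeps the two directions of largest variance
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear projection: t-SNE adapts the mapping to local neighborhoods
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=5)
axes[0].set_title('PCA (linear)')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5)
axes[1].set_title('t-SNE (nonlinear)')
plt.show()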

Limitations of t-SNE

Non-Invertibility

t-SNE is non-invertible, meaning there’s no straightforward way to map points back from the reduced space to the original high-dimensional space. This limits its utility in scenarios where such mappings are required.

Global Structure Distortion

While t-SNE is adept at preserving local structures, it can sometimes distort global structures. This means that the relative positions of clusters might not accurately reflect their relationships in the high-dimensional space, potentially leading to misinterpretations.

Computational Complexity

t-SNE is computationally intensive, especially for large datasets. The pairwise similarity computations and iterative optimization process can be time-consuming, although approximate methods like Barnes-Hut t-SNE have been developed to mitigate this issue.
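
In scikit-learn, the Barnes-Hut approximation is exposed through the method argument (and is the default); the snippet below is a minimal illustration of the trade-off.

from sklearn.manifold import TSNE

# Barnes-Hut runs in roughly O(n log n) time at the cost of a small approximation error,
# controlled by the 'angle' parameter; 'exact' scales as O(n^2).
tsne_fast = TSNE(n_components=2, method='barnes_hut', angle=0.5, random_state=0)
tsne_exact = TSNE(n_components=2, method='exact', random_state=0)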

Interpretability

The reduced dimensions in t-SNE do not have a direct interpretation in terms of the original features. This can make it challenging to interpret the axes of the visualization in a meaningful way.

Scalability

Handling very large datasets with t-SNE can be challenging due to its computational demands. While optimizations exist, scaling t-SNE to millions of data points remains a non-trivial task.

Parameter Sensitivity

t-SNE’s performance is sensitive to its hyperparameters, particularly perplexity. Choosing an inappropriate value can lead to misleading visualizations, making it essential to experiment with different settings.
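
A common sanity check is to sweep several perplexity values and compare the resulting embeddings visually; here is a minimal sketch (assuming X is a data matrix such as the one used in the examples below).

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, perp in zip(axes, [5, 15, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], s=3)
    ax.set_title(f'perplexity = {perp}')
plt.show()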

Applications of t-SNE

Data Visualization

One of the most common uses of t-SNE is in data visualization. By projecting high-dimensional data into 2D or 3D, researchers can visually inspect the data for patterns, clusters, and anomalies, aiding in exploratory data analysis.

Understanding Neural Networks

t-SNE is a valuable tool for visualizing the internal representations of neural networks. By applying t-SNE to the activations of hidden layers, one can gain insights into how the network processes and clusters data internally, facilitating model interpretability and debugging.

Genomics and Computational Biology

In fields like genomics, t-SNE helps visualize high-dimensional biological data, such as gene expression profiles. This aids in identifying subpopulations of cells, understanding disease mechanisms, and discovering biomarkers.

Document Analysis in Natural Language Processing (NLP)

For tasks in NLP, t-SNE can visualize word embeddings or document representations, revealing semantic relationships and clusters of similar topics or meanings, thus enhancing the understanding of textual data.

Marketing and Consumer Behavior Analysis

In marketing, t-SNE assists in analyzing consumer behavior by visualizing high-dimensional customer data. This enables the identification of distinct customer segments, tailoring marketing strategies to target specific groups effectively.

Practical Code Examples

To illustrate the practical applications of t-SNE, let’s explore two code examples using the MNIST dataset — a classic benchmark in machine learning comprising handwritten digits.

Code Example 1: t-SNE with a Neural Network Decoder

import numpy as np
from sklearn.manifold import TSNE
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt

# 1. Data loading and pre-processing
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28*28)
x_test = x_test.reshape(-1, 28*28)

# 2. Application of t-SNE to get a reduced (2D) representation
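# Note: fitting t-SNE on all 60,000 training images is computationally heavy;
# in practice one often subsamples (e.g., a few thousand images) first.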
tsne = TSNE(n_components=2, random_state=42)
x_train_tsne = tsne.fit_transform(x_train)

# 3. Definition of the decoding net (attempt to reconstruct the original data)
input_tsne = Input(shape=(2,))
x = Dense(128, activation='relu')(input_tsne)
x = Dense(256, activation='relu')(x)
x = Dense(512, activation='relu')(x)
output_img = Dense(28*28, activation='sigmoid')(x)

decoder = Model(inputs=input_tsne, outputs=output_img)
decoder.compile(optimizer='adam', loss='mse')

# 4. Network training
decoder.fit(x_train_tsne, x_train, epochs=50, batch_size=128, validation_split=0.2)

# 5. Visualization of the reconstructed images
n = 10 # Number of original images to visualize
decoded_imgs = decoder.predict(x_train_tsne[:n])

# Show the original and reconstructed images
plt.figure(figsize=(20, 4))
for i in range(n):
    # Original images
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_train[i].reshape(28, 28), cmap='gray')
    plt.title("Original")
    plt.axis('off')

    # Reconstructed images
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    plt.title("Reconstructed")
    plt.axis('off')

plt.show()

Explanation:

Data Loading and Pre-processing:

  • The MNIST dataset is loaded and normalized to have pixel values between 0 and 1.
  • The images are reshaped from 28×28 matrices to 784-dimensional vectors to prepare for t-SNE.

Application of t-SNE:

  • t-SNE reduces the high-dimensional image data to a 2D representation. The parameter random_state=42 ensures reproducibility by initializing the random number generator with a fixed seed. The choice of 42 is arbitrary—it's a commonly used seed value, inspired by "The Hitchhiker's Guide to the Galaxy" where 42 is the "Answer to the Ultimate Question of Life, the Universe, and Everything."

Definition of the Decoding Network:

  • A neural network (decoder) is defined to map the 2D t-SNE representations back to the original 784-dimensional space, attempting to reconstruct the original images.

Network Training:

  • The decoder is trained to minimize the mean squared error (MSE) between the original and reconstructed images using the reduced t-SNE representations as inputs.

Visualization:

  • The original and reconstructed images are displayed side by side to assess the decoder’s performance. This demonstrates how, despite t-SNE’s non-invertibility, a neural network can approximate an inverse mapping.

Code Example 2: Visualizing Neural Network Layers with t-SNE

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.utils import to_categorical
import matplotlib.colors as mcolors
import matplotlib.cm as cm # Correct import for ScalarMappable

# 1. Loading and preprocessing the dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0 # Normalize pixel values to [0, 1]
x_test = x_test.astype('float32') / 255.0
# Adding a channel dimension for grayscale images (required by Conv2D)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# Convert labels to one-hot encoding
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# 2. Building and training the Convolutional Neural Network (CNN)
input_layer = Input(shape=(28, 28, 1))

# Convolutional layers
x = Conv2D(32, (3, 3), activation='relu', name='conv_layer_1')(input_layer)
x = MaxPooling2D((2, 2), name='maxpool_layer_1')(x)
x = Conv2D(64, (3, 3), activation='relu', name='conv_layer_2')(x)
x = MaxPooling2D((2, 2), name='maxpool_layer_2')(x)

# Dense (fully connected) layers
x = Flatten(name='flatten_layer')(x)
x = Dense(64, activation='relu', name='dense_layer')(x)
output_layer = Dense(10, activation='softmax', name='output_layer')(x)

# Compile the model
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))

# 3. Create a fixed color map for digits 0-9
# Use a fixed colormap where each number gets the same color across plots
colors = plt.get_cmap('tab10') # Tab10 has 10 distinct colors
color_map = {i: colors(i) for i in range(10)} # Map digits to colors

# Create a normalization object to scale label values between 0 and 9
norm = mcolors.Normalize(vmin=0, vmax=9)

# 4. Extract intermediate representations from each layer
layer_names = ['conv_layer_1', 'maxpool_layer_1', 'conv_layer_2', 'maxpool_layer_2', 'dense_layer']

# Loop through each layer and visualize with t-SNE
for layer_name in layer_names:
    # Create a model that outputs the representation of the current layer
    intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)

    # Get the output from the intermediate layer (for test set)
    intermediate_output = intermediate_layer_model.predict(x_test)

    # For convolutional layers, flatten the output to 2D for t-SNE (required for visualization)
    if len(intermediate_output.shape) > 2:
        intermediate_output = intermediate_output.reshape(intermediate_output.shape[0], -1)

    # 5. Apply t-SNE for dimensionality reduction to 2D space
    tsne = TSNE(n_components=2, random_state=42)
    x_test_tsne = tsne.fit_transform(intermediate_output)

    # 6. Visualize the t-SNE result for the current layer with fixed color map
    plt.figure(figsize=(8, 8))
    for i in range(10):  # Loop through each digit (0-9)
        indices = np.argmax(y_test, axis=1) == i  # Find indices for digit 'i'
        plt.scatter(x_test_tsne[indices, 0], x_test_tsne[indices, 1],
                    color=color_map[i], label=f'Digit {i}', alpha=0.6)

    # Use matplotlib.cm.ScalarMappable to fix the colorbar issue
    sm = cm.ScalarMappable(cmap='tab10', norm=norm)
    sm.set_array([])  # ScalarMappable expects an array, but it's not needed here

    # Get the current axes (plt.gca()) and pass it to the colorbar function
    plt.colorbar(sm, ticks=range(10), ax=plt.gca())

    plt.title(f't-SNE visualization of {layer_name} output with fixed colors')
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.legend()
    plt.show()

Explanation:

Loading and Preprocessing:

  • The MNIST dataset is loaded, normalized, and reshaped to include a channel dimension required by convolutional layers.
  • Labels are converted to one-hot encoding for classification.

Building and Training the CNN:

  • A simple Convolutional Neural Network (CNN) is constructed with two convolutional layers, each followed by a max-pooling layer, and two dense layers.
  • The model is compiled with the Adam optimizer and trained for 5 epochs.

Fixed Color Map Creation:

  • A fixed color map is established using Matplotlib’s 'tab10' colormap, ensuring consistent coloring for each digit across different plots.

Extracting Intermediate Representations:

  • The names of the layers from which representations will be extracted are specified.
  • For each layer, a new model is created that outputs the activations of that layer.
  • The activations for the test set are obtained and flattened if necessary.

Applying t-SNE:

  • t-SNE reduces the high-dimensional activations to 2D representations for visualization.

Visualization:

  • For each digit (0–9), points are plotted with consistent colors.
  • A colorbar is added using matplotlib.cm.ScalarMappable to map colors to digits correctly.
  • Each plot is titled according to the layer being visualized, showcasing how the representations evolve through the network layers.

References

  1. Van der Maaten, L., & Hinton, G. (2008). “Visualizing Data using t-SNE”. Journal of Machine Learning Research, 9(Nov), 2579–2605.
  2. Laurens van der Maaten. “t-Distributed Stochastic Neighbor Embedding (t-SNE)”. [Online]. Available: https://lvdmaaten.github.io/tsne/
  3. Rauber, P. E., Fadel, S. G., Araujo, A. A., & Telea, A. C. (2016). “Visualizing the hidden activity of artificial neural networks”. IEEE Transactions on Visualization and Computer Graphics, 23(1), 101–110.
  4. Belkin, M., & Niyogi, P. (2003). “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation”. Neural Computation, 15(6), 1373–1396.


Vitality Learning

We have been teaching, researching, and consulting on parallel programming for Graphics Processing Units (GPUs) since the release of CUDA. We also work with MATLAB and Python.