SOM Core Functions

Module contents

Self-Organizing Maps (SOMs) Module

What is a Self-Organizing Map?

A Self-Organizing Map (SOM) is a type of unsupervised artificial neural network used to group and visualize data points that are similar to each other. What makes SOMs unique is that they display these groups (or clusters) on a simple 2D grid, even though the original data they represent is high-dimensional. This makes SOMs very useful for exploring complex datasets and discovering hidden patterns in them.

How Does a SOM Work?

  • The 2D Grid: A SOM is visualized as a grid of “neurons,” arranged in rows and columns. Each neuron represents a part of the dataset being modeled.

  • Weight Vectors: Behind the scenes, each neuron is associated with a “weight vector.” This weight vector has the same number of dimensions as the original data. Think of the weight vector as the neuron’s “position” in the high-dimensional space, summarizing the features of the data it represents.

  • Training the SOM: During training, the SOM adjusts the weight vectors of the neurons to better represent the dataset. When a neuron’s weight vector is updated, its neighboring neurons on the grid are also updated. This ensures that similar data points are mapped to nearby neurons on the grid, preserving the topology of the dataset.

  • Mapping Data Points to the Grid: For each data point, the SOM calculates how “close” it is to each neuron’s weight vector using a distance metric (e.g., Euclidean distance). The data point is then assigned to the neuron with the closest weight vector. This process transforms the high-dimensional data points into positions on the 2D grid.
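The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used by this module: the function names, the `learning_rate` and `sigma` values, and the example weights are all assumptions chosen for readability, and a Gaussian neighborhood on a rectangular grid is assumed.

```python
import math

def find_bmu(weights, point):
    """Return the grid coordinate of the neuron whose weight vector is closest to point."""
    return min(weights, key=lambda coord: math.dist(weights[coord], point))

def update_step(weights, point, learning_rate=0.5, sigma=1.0):
    """Pull the BMU and its grid neighbors toward point (Gaussian neighborhood)."""
    bmu = find_bmu(weights, point)
    for coord, w in weights.items():
        # Squared distance on the 2D grid, not in the data space.
        grid_dist_sq = (coord[0] - bmu[0]) ** 2 + (coord[1] - bmu[1]) ** 2
        influence = math.exp(-grid_dist_sq / (2 * sigma ** 2))
        weights[coord] = [wi + learning_rate * influence * (xi - wi)
                          for wi, xi in zip(w, point)]
    return bmu

# A 2x2 grid of neurons with 3-dimensional weight vectors (made-up values).
weights = {
    (0, 0): [0.0, 0.0, 0.0],
    (1, 0): [1.0, 0.0, 0.0],
    (0, 1): [0.0, 1.0, 0.0],
    (1, 1): [1.0, 1.0, 1.0],
}
point = [0.9, 0.1, 0.0]
bmu = update_step(weights, point)
print(bmu)  # (1, 0)
```

The BMU moves the furthest toward the data point, and the other neurons move less the further they sit from the BMU on the grid. That is what keeps neighboring neurons similar to each other.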

How Does a SOM Compare to Other Clustering Methods?

SOMs are similar to k-means clustering in that they group data points by assigning them to a central representative (a “neuron” in SOMs, or a “centroid” in k-means). However, SOMs are more structured because neurons are part of a grid, and their updates are influenced by neighboring neurons. This constraint creates a smooth, organized map where similar clusters are close to each other, making SOMs ideal for data visualization.

class SOM.utils.som_utils.SOM(scale_method: str, x_dim: int, y_dim: int, topology: str, neighborhood_fnc: str, epochs: int, train_dat: DataFrame, other_dat: DataFrame = None)

Bases: object

A class for building, training, and visualizing Self-Organizing Maps (SOMs).

This class provides tools to train a Self-Organizing Map on high-dimensional data, including scaling and unscaling data, calculating various fit metrics (e.g., topographic error, percent variance explained), and visualizing the SOM grid after fitting.

Attributes (set before training):

train_dat (np.ndarray):

The raw (unscaled) training data used to fit the SOM.

other_dat (np.ndarray):

Additional, non-numeric data associated with each data point in train_dat, but not used to train the SOM.

train_dat_features (List[str]):

A list of feature names associated with train_dat.

other_dat_features (List[str]):

A list of feature names associated with other_dat.

xdim (int):

The number of neurons in the x-dimension of the SOM grid.

ydim (int):

The number of neurons in the y-dimension of the SOM grid.

topology (str):

The topology of the SOM grid (‘rectangular’ or ‘hexagonal’, associated with 4 and 6 neighbors, respectively).

neighborhood_fnc (str):

The neighborhood function used during training (‘gaussian’ or ‘bubble’).

epochs (int):

The number of training epochs (i.e., how many times the dataset is presented during the training process).

_scale (Callable):

Method for scaling train_dat (z-score or min/max).

_unscale (Callable):

Method for unscaling scaled data (z-score or min/max).

Attributes (set after training):

map (MiniSom):

The trained MiniSom instance, created after calling train_map.

observation_mapping (np.ndarray):

The mapping of data points to their closest neurons (i.e., each point’s best matching unit, or BMU).

neuron_coordinates (pd.DataFrame):

The coordinates of neurons in the SOM grid.

weights (pd.DataFrame):

The unscaled weight vectors of each neuron after training.

weights_scaled (pd.DataFrame):

The scaled weight vectors of each neuron after training.

Methods:

__init__:

Initializes the SOM instance with training data, configuration parameters, and scaling methods.

_validate_inputs:

Validates the input parameters for the SOM class, ensuring data types, ranges, and formats are correct.

train_map:

Trains the SOM using scaled training data. Includes initializing neuron weights and mapping data points to neurons.

calculate_percent_variance_explained:

Calculates the Percent Variance Explained (PVE) of the fit SOM, indicating how well the SOM represents the variance in the data.

calculate_topographic_error:

Calculates the topographic error of the fit SOM, a measure of how well the SOM preserves the topology of the data.

plot_component_planes:

Generates and saves component plane plots for each feature in the training data.

plot_categorical_data:

Generates and saves the distribution of categorical data (from other_dat) across the SOM grid.

plot_map_grid:

Plots a blank SOM grid, optionally labeling neurons with their indices.

_zscore_scale:

Scales the training data (train_dat) using z-score normalization.

_minmax_scale:

Scales the training data (train_dat) using min-max normalization.

_zscore_unscale:

Reverses z-score normalization to unscale data back to its original range.

_minmax_unscale:

Reverses min-max scaling to unscale data back to its original range.

_get_observation_neuron_mappings:

Maps each data point in the training set (train_dat) to its Best Matching Unit (BMU).

_get_neuron_coordinates:

Retrieves the x and y coordinates of all neurons in the SOM grid.

_get_weights:

Retrieves the weight vectors of all neurons in the SOM grid after training.

_draw_circle:

Draws a circle at specified coordinates in a matplotlib plot, representing a neuron.

_map_value_to_color:

Maps numeric feature values to colors using a specified colormap.

_add_colorbar:

Adds a colorbar to a matplotlib plot to represent feature value ranges.

_get_distinct_colors:

Generates distinct colors for categorical data using a seaborn palette.

_check_output_path:

Ensures the specified directory exists, creating it if necessary.

_are_nodes_adjacent:

Determines if two neurons in the SOM grid are adjacent based on their coordinates, used when calculating topographic error.

Initialize the Self-Organizing Map (SOM) class with user-defined parameters.

This constructor sets up the SOM by validating inputs, scaling the training data, and preparing the attributes needed for training and analysis.

Parameters:

  • train_dat (pd.DataFrame) – A DataFrame containing the numerical training data for the SOM.

  • other_dat (pd.DataFrame) – A DataFrame with non-numeric data associated with each row in train_dat, not used for training the SOM but for analysis.

  • scale_method (str) – Method for scaling the data; must be either ‘zscore’ or ‘minmax’.

  • x_dim (int) – The number of neurons in the x-dimension of the SOM grid. Must be a positive integer.

  • y_dim (int) – The number of neurons in the y-dimension of the SOM grid. Must be a positive integer.

  • topology (str) – Topology of the SOM grid; must be ‘rectangular’ or ‘hexagonal’.

  • neighborhood_fnc (str) – Neighborhood function to use during training; must be ‘gaussian’ or ‘bubble’.

  • epochs (int) – Number of epochs (iterations over the dataset) for training the SOM. Must be a positive integer.
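The two scale_method options correspond to standard normalizations that can each be reversed exactly. The sketch below shows the round trip for a single feature column. The function names are hypothetical, and the use of the population standard deviation is an assumption, since this documentation does not say which variant the class uses.

```python
import statistics

def zscore_scale(column):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population std, not sample std
    return [(v - mean) / std for v in column], (mean, std)

def zscore_unscale(scaled, params):
    mean, std = params
    return [v * std + mean for v in scaled]

def minmax_scale(column):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column], (lo, hi)

def minmax_unscale(scaled, params):
    lo, hi = params
    return [v * (hi - lo) + lo for v in scaled]

heights_cm = [150.0, 160.0, 170.0, 180.0]  # made-up feature column
z, z_params = zscore_scale(heights_cm)
m, m_params = minmax_scale(heights_cm)
print(m[0], m[-1])  # 0.0 1.0
```

Keeping the scaling parameters (mean and standard deviation, or minimum and maximum) is what makes it possible to report neuron weights in the original units after training.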

calculate_percent_variance_explained() → float

Calculate the Percent Variance Explained (PVE) for the trained SOM.

PVE is the proportion of the total variance in the data explained by the SOM. It is computed as: PVE = ((TSS - WCSS) / TSS) * 100, where:

  • TSS = total sum of squares

  • WCSS = within cluster (i.e., within neuron) sum of squares

Returns:

Percent Variance Explained (PVE)

Return type:

float
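As a worked illustration of the PVE formula, the sketch below computes TSS and WCSS for four two-dimensional points assigned to two neurons. It assumes WCSS is measured from the mean of the points mapped to each neuron; the method itself may measure it differently (for example, from each neuron's weight vector). The function name and data are hypothetical.

```python
def percent_variance_explained(points, assignments):
    """PVE = ((TSS - WCSS) / TSS) * 100 for points grouped by neuron."""
    def mean_vector(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_sq(rows, center):
        return sum((v - c) ** 2 for row in rows for v, c in zip(row, center))

    # Total sum of squares around the overall mean.
    tss = sum_sq(points, mean_vector(points))

    # Group points by the neuron they were assigned to.
    groups = {}
    for point, neuron in zip(points, assignments):
        groups.setdefault(neuron, []).append(point)

    # Within-neuron sum of squares around each group's own mean (assumption).
    wcss = sum(sum_sq(rows, mean_vector(rows)) for rows in groups.values())
    return (tss - wcss) / tss * 100

points = [[1.0, 1.0], [1.0, 3.0], [9.0, 1.0], [9.0, 3.0]]
assignments = [(0, 0), (0, 0), (1, 0), (1, 0)]
pve = percent_variance_explained(points, assignments)
print(round(pve, 1))  # 94.1
```

Here TSS is 68 and WCSS is 4, so the two neurons account for 64/68 of the variance, or about 94.1%.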

calculate_topographic_error() → float

Calculate the topographic error of the trained SOM.

Topographic error measures the proportion of data points whose two closest neurons (the best and second-best matching units) are not adjacent on the SOM grid. A lower topographic error indicates better topology preservation.

Returns:

The topographic error, ranging from 0 (perfect preservation) to 1 (poor preservation).

Return type:

float
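The definition above can be sketched as follows for a rectangular grid, where two neurons are treated as adjacent if they are one step apart horizontally or vertically (the 4-neighbor case). The helper names are hypothetical and are not the module's own `_are_nodes_adjacent`. Hexagonal grids need a different adjacency test and are not covered here.

```python
import math

def two_closest_neurons(weights, point):
    """Return the grid coordinates of the best and second-best matching units."""
    ranked = sorted(weights, key=lambda coord: math.dist(weights[coord], point))
    return ranked[0], ranked[1]

def are_adjacent(a, b):
    """Rectangular grid, 4 neighbors: adjacent when the Manhattan distance is 1."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

def topographic_error(weights, points):
    """Fraction of points whose two closest neurons are not grid neighbors."""
    errors = sum(not are_adjacent(*two_closest_neurons(weights, p)) for p in points)
    return errors / len(points)

# A deliberately "folded" 3x1 map: the two ends of the grid hold similar weights.
weights = {(0, 0): [0.0], (1, 0): [10.0], (2, 0): [1.0]}
points = [[0.4], [9.0]]
te = topographic_error(weights, points)
print(te)  # 0.5
```

The point 0.4 is closest to neurons (0, 0) and (2, 0), which sit at opposite ends of the grid, so it counts as an error. The point 9.0 is closest to (1, 0) and (2, 0), which are neighbors, so it does not.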

plot_categorical_data(output_dir: str)

Generate and save categorical data distribution plots across the SOM grid.

This method visualizes how the SOM, trained using train_dat, organizes the data points relative to the separate, categorical dataset (other_dat). While the SOM is trained only on numerical features in train_dat, the resulting clustering/organization of the data points may align with categories in other_dat. By examining these plots, we can explore whether and how the organization of the SOM reflects patterns in these categorical variables.

Categorical data plots help in assessing whether the SOM’s training on numerical data leads to meaningful grouping or separation of associated categorical variables. These plots can:

  • Reveal clusters of similar categories across the SOM grid.

  • Provide insights into the relationship between numerical and categorical data.

  • Suggest whether the SOM’s organization inherently captures distinctions present in the categorical dataset.

Parameters:

output_dir (str) – The directory where the categorical data distribution plots will be saved.

Saves:

One plot per categorical feature as a .png file in the specified directory. Each plot displays the SOM grid with neurons colored based on the value of the feature being plotted.
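The data behind such a plot can be thought of as a per-neuron tally of category labels. The sketch below is one possible way to compute it. The function names and data are hypothetical, and choosing the most frequent category per neuron is an assumption: this documentation does not specify how the method derives each neuron's color.

```python
from collections import Counter, defaultdict

def category_counts_per_neuron(observation_mapping, categories):
    """Tally how often each category lands on each neuron."""
    counts = defaultdict(Counter)
    for neuron, category in zip(observation_mapping, categories):
        counts[neuron][category] += 1
    return counts

def dominant_category(counts):
    """Pick the most frequent category for each neuron (assumed coloring rule)."""
    return {neuron: c.most_common(1)[0][0] for neuron, c in counts.items()}

# BMU coordinates for five observations, plus one categorical label each.
observation_mapping = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1)]
species = ["a", "a", "b", "c", "c"]

counts = category_counts_per_neuron(observation_mapping, species)
dominant = dominant_category(counts)
print(dominant[(0, 0)], dominant[(1, 1)])  # a c
```

If the same category dominates a contiguous region of the grid, the numerical features used for training carry information about that category.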

plot_component_planes(output_dir: str)

Generate and save component plane plots for each feature in the training data.

A component plane is a visualization that shows how the values of a specific feature in train_dat are distributed across the SOM grid. Each neuron on the grid is assigned a color based on the value of the feature it represents, allowing you to see patterns and relationships in the data.

Component planes help in understanding how individual features contribute to the organization of the SOM. By examining these plots, you can:

  • Identify clusters and trends in specific features.

  • Compare how different features vary across the grid, potentially highlighting relationships/correlations between features.

Parameters:

output_dir (str) – The directory where the component plane plots will be saved.

Saves:

One plot per feature as a .png file in the specified directory. Each plot displays the SOM grid with neurons colored based on the value of the feature being plotted.
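Conceptually, a component plane is one feature's slice of every neuron's weight vector, laid out on the grid. The sketch below builds that slice and rescales it to the range 0 to 1 so a colormap could be applied. The function names, the `plane[y][x]` layout, and the example weights are assumptions for illustration, not the method's actual internals.

```python
def component_plane(weights, feature_idx, x_dim, y_dim):
    """Arrange one feature's weight values on the grid as plane[y][x]."""
    return [[weights[(x, y)][feature_idx] for x in range(x_dim)]
            for y in range(y_dim)]

def normalize(plane):
    """Rescale values to [0, 1] so they can be mapped to colors."""
    flat = [v for row in plane for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo) for v in row] for row in plane]

# Unscaled weights for a 2x2 grid with two features: temperature, humidity.
weights = {
    (0, 0): [10.0, 0.2],
    (1, 0): [20.0, 0.4],
    (0, 1): [30.0, 0.6],
    (1, 1): [50.0, 0.8],
}
temperature_plane = component_plane(weights, feature_idx=0, x_dim=2, y_dim=2)
print(temperature_plane)             # [[10.0, 20.0], [30.0, 50.0]]
print(normalize(temperature_plane))  # [[0.0, 0.25], [0.5, 1.0]]
```

Comparing the planes of two features side by side shows whether they rise and fall in the same regions of the grid, which hints at a correlation between them.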

plot_map_grid(print_neuron_idx: bool = False) → Tuple[Figure, Axes]

Plot a blank SOM grid where each neuron is represented by a circle.

This method creates a visual representation of the SOM grid without any feature or category values. Optionally, you can display the index of each neuron within its circle.

Parameters:

print_neuron_idx (bool, optional) – Whether to display the neuron indices on the plot. Defaults to False.

Returns:

The matplotlib figure and axis objects for the plot.

Return type:

Tuple[plt.Figure, plt.Axes]

train_map()

Train the Self-Organizing Map on the scaled training data.

This method initializes the SOM using PCA-based weight initialization and trains it using the specified parameters. It also calculates neuron weights and maps observations to their corresponding neurons.

Sets the following attributes:
  • map: The trained MiniSom instance.

  • observation_mapping: Mapping of observations to neurons.

  • neuron_coordinates: Coordinates of neurons in the SOM grid.

  • weights_scaled: Scaled weights of each neuron.

  • weights: Unscaled weights of each neuron.