SOM-Seq Logo

SOM-Seq

GitHub release Tests Build codecov License Docs

SOM-Seq is a Python toolbox that combines synthetic single-cell sequencing data generation (Seq_Sim) with Self-Organizing Map (SOM) clustering and visualization (SOM). The two modules can be used together or independently.


Installation

From source:

git clone https://github.com/caterer-z-t/SOM_Seq_Sim.git
cd SOM_Seq_Sim
pip install .

Quick Start

Generate Sequencing Data

seq-sim --num_samples 30 --fold_change 0.5 --config_file Seq_Sim/config.yml

Or equivalently:

python Seq_Sim/seq_sim.py \
    --num_samples 30 \
    --fold_change 0.5 \
    --config_file Seq_Sim/config.yml

Output CSV files are written to the directory specified in config.yml (default: data/).

See Seq_Sim/config.yml for all configurable parameters (cell-type counts, batch structure, disease proportions, number of features, etc.).


Fit a SOM

Single hyperparameter set:

som-fit \
    -t data/seq_sim_training_data.csv \
    -c data/seq_sim_categorical_data.csv \
    -o output/ \
    -s zscore -x 5 -y 4 -p hexagonal -n gaussian -e 100

Hyperparameter tuning (pass multiple values; best combination is selected automatically):

som-fit \
    -t data/seq_sim_training_data.csv \
    -c data/seq_sim_categorical_data.csv \
    -o output/ \
    -s zscore minmax -x 3 5 7 -y 3 5 -p rectangular hexagonal -n gaussian -e 50 100

Or equivalently use python SOM/som.py with the same flags.

Python API:

import pandas as pd
from SOM.utils.som_utils import SOM

train = pd.read_csv("data/seq_sim_training_data.csv")
meta  = pd.read_csv("data/seq_sim_categorical_data.csv")

som = SOM(
    train_dat=train,
    other_dat=meta,
    scale_method="zscore",
    x_dim=5,
    y_dim=4,
    topology="hexagonal",
    neighborhood_fnc="gaussian",
    epochs=100,
)
som.train_map()

print(f"PVE:               {som.calculate_percent_variance_explained():.1f}%")
print(f"Topographic error: {som.calculate_topographic_error():.3f}")

som.plot_component_planes(output_dir="output/")
som.plot_categorical_data(output_dir="output/")

CLI Reference

seq-sim / python Seq_Sim/seq_sim.py

Flag

Description

--num_samples

Number of subjects to simulate

--fold_change

Disease-associated fold change magnitude

--config_file

Path to config.yml

som-fit / python SOM/som.py

Flag

Description

-t

Path to training data CSV (numeric features)

-c

Path to categorical/metadata CSV (optional)

-o

Output directory for plots and metrics

-s

Scaling method: zscore or minmax (one or more)

-x

SOM x-dimension (one or more integers)

-y

SOM y-dimension (one or more integers)

-p

Topology: rectangular or hexagonal (one or more)

-n

Neighborhood function: gaussian or bubble (one or more)

-e

Number of training epochs (one or more integers)

-m

Generate component plane plots (default: True)

When more than one value is provided for any hyperparameter, the CLI performs a grid search and selects the best combination by PVE 100 × topographic_error.


Documentation

Full API reference and tutorials: som-seq-sim.readthedocs.io


Testing

pytest test/ -v

Contributing

See CONTRIBUTING.md. Please follow the Code of Conduct.


Citation

If you use SOM-Seq in your research, please cite:

Caterer Z., Pernat M., Hurd V. (2024). SOM-Seq: A Python Toolbox for Single-Cell Sequencing Simulation and Self-Organizing Map Analysis.

A machine-readable citation is available in CITATION.cff.


License

MIT — see LICENSE.