Seq_Sim Core Functions

Module contents

Seq_Sim.utils.seq_sim_utils.add_differential_cell_types(celltype_df: DataFrame, dummy_data: DataFrame, n_cells: int, n_major_diff_celltypes: int, n_minor_diff_celltypes: int, fc_interact: float) List[str]

Add differential expression for cell types based on disease condition.

Parameters:
  • celltype_df – DataFrame of cell types.

  • dummy_data – DataFrame of dummy data.

  • n_cells – Total number of cells per individual.

  • n_major_diff_celltypes – Number of differentially expressed major cell types.

  • n_minor_diff_celltypes – Number of differentially expressed minor cell types.

  • fc_interact – Fold change for interacted cells.

Returns:

List of differential cell types.

Seq_Sim.utils.seq_sim_utils.add_rows_for_diff_cells(celltype_df: DataFrame, dummy_data: DataFrame, diff: int, cell_type: str) DataFrame

Add additional rows to the dataframe for the differentially expressed cell type.

Parameters:
  • celltype_df – DataFrame of cell types.

  • dummy_data – DataFrame of dummy data.

  • diff – Number of additional rows to add for the cell type.

  • cell_type – The cell type for which to add rows.

Returns:

The updated dataframe with added rows for the differential cell type.

Seq_Sim.utils.seq_sim_utils.apply_noise(pc_cluster, var_all, n_cells, cluster_ratio, ratio_variance)

Apply noise to the generated features.

Parameters:
  • pc_cluster (int) – Cluster features.

  • var_all (List) – List of variances.

  • n_cells (int) – Number of cells.

  • cluster_ratio (float) – Cluster ratio.

  • ratio_variance (float) – Ratio variance.

Returns:

Final feature with noise.

Return type:

np.ndarray

Seq_Sim.utils.seq_sim_utils.arg_parser()

Argument parser for the command-line arguments.

Returns:

Argument parser object.

Return type:

argparse.ArgumentParser

Seq_Sim.utils.seq_sim_utils.calculate_abundance(celltype_df: DataFrame, dummy_data: DataFrame, cell_type: str, n_cells: int) ndarray

Calculate the abundance of a specific cell type across individuals.

Parameters:
  • celltype_df – DataFrame of cell types.

  • dummy_data – DataFrame of dummy data.

  • cell_type – The specific cell type to calculate abundance for.

  • n_cells – Total number of cells per individual.

Returns:

Abundance of the cell type for each subject.
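
As a rough illustration of per-subject abundance, the sketch below assumes that abundance means the count of one cell type divided by the total cells per subject. That definition, the name `abundance_sketch`, and the list of `(subject_id, cell_type)` pairs standing in for the DataFrames are assumptions for illustration, not the package's implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def abundance_sketch(
    cells: List[Tuple[str, str]], cell_type: str, n_cells: int
) -> Dict[str, float]:
    """Fraction of each subject's n_cells that belong to cell_type (assumed definition)."""
    # Count cells of the requested type per subject.
    counts = Counter(subject for subject, ctype in cells if ctype == cell_type)
    # Include every subject, even those with zero cells of this type.
    subjects = sorted({subject for subject, _ in cells})
    return {subject: counts[subject] / n_cells for subject in subjects}


# Hypothetical toy data: two subjects with four cells each.
cells = [
    ("S1", "A"), ("S1", "A"), ("S1", "B"), ("S1", "B"),
    ("S2", "A"), ("S2", "B"), ("S2", "B"), ("S2", "B"),
]
abundance = abundance_sketch(cells, "A", n_cells=4)
print(abundance)  # {'S1': 0.5, 'S2': 0.25}
```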

Seq_Sim.utils.seq_sim_utils.calculate_diff_expression(abundance: ndarray, n_cells: int, fc_interact: float) int

Calculate the differential expression based on fold change.

Parameters:
  • abundance – Abundance of the cell type for each subject.

  • n_cells – Total number of cells per individual.

  • fc_interact – Fold change for interacted cells.

Returns:

The number of additional rows to add to the dataframe for the cell type.

Seq_Sim.utils.seq_sim_utils.case_when(condition_list: List[Tuple[bool, any]], default_value: any) any

Mimic dplyr’s case_when functionality.

Parameters:
  • condition_list – List of tuples with boolean conditions and associated return values.

  • default_value – The value returned if no condition matches.

Returns:

The value associated with the first matching condition, or the default value.
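
The first-match behaviour described above can be sketched as follows. This assumes, as the signature suggests, that each condition is already evaluated to a boolean. The name `case_when_sketch` is hypothetical and the body is an illustration, not the package's code.

```python
from typing import Any, List, Tuple


def case_when_sketch(condition_list: List[Tuple[bool, Any]], default_value: Any) -> Any:
    """Return the value paired with the first True condition, else the default."""
    for condition, value in condition_list:
        if condition:
            return value
    return default_value


n_cells = 3000
size_label = case_when_sketch(
    [(n_cells < 1000, "small"), (n_cells < 5000, "medium")],
    "large",
)
print(size_label)  # medium
```

Because the conditions are evaluated before the call, every expression in the list runs even when an earlier one matches; only the selection is short-circuited.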

Seq_Sim.utils.seq_sim_utils.combine_features(features, n_features)

Combine the features into a DataFrame.

Parameters:
  • features (List) – List of features.

  • n_features (int) – Number of features.

Returns:

Pseudo-features

Return type:

pd.DataFrame

Seq_Sim.utils.seq_sim_utils.create_celltype_dataframe(subject_ids: List[str], major_cell_counts: List[List[int]], rare_cell_counts: List[List[int]], n_major_cell_types: int, n_minor_cell_types: int) DataFrame

Create the dataframe for cell types based on the given counts.

Parameters:
  • subject_ids – List of subject IDs.

  • major_cell_counts – Major cell counts for each subject.

  • rare_cell_counts – Rare cell counts for each subject.

  • n_major_cell_types – Number of major cell types.

  • n_minor_cell_types – Number of minor cell types.

Returns:

A DataFrame containing the cell types for each subject.

Seq_Sim.utils.seq_sim_utils.create_directory_if_not_exists(directory: str) None

Create a directory if it does not already exist.

Parameters:

directory (str) – Path to the directory.
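
One common way to get this behaviour is `os.makedirs` with `exist_ok=True`, shown here as a minimal sketch. The helper name `ensure_directory` and the temporary paths are hypothetical; whether the package uses this call is an assumption.

```python
import os
import tempfile


def ensure_directory(directory: str) -> None:
    """Create the directory (and any missing parents); do nothing if it exists."""
    os.makedirs(directory, exist_ok=True)


# Work inside a throwaway temporary directory so the example leaves no clutter.
base = tempfile.mkdtemp()
target = os.path.join(base, "output", "features")
ensure_directory(target)
ensure_directory(target)  # calling again on an existing directory does not raise
created = os.path.isdir(target)
```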

Seq_Sim.utils.seq_sim_utils.create_new_rows_for_cell_type(subj_id: str, cell_type: str, count: int) DataFrame

Create new rows for a specific cell type for a given subject.

Parameters:
  • subj_id – The subject ID.

  • cell_type – The type of cell.

  • count – The number of cells of that type.

Returns:

DataFrame with new rows for the given cell type.

Seq_Sim.utils.seq_sim_utils.encode_categorical_columns(data, cluster_col, disease_col, individual_col)

Encode categorical columns for pseudo-feature generation.

Parameters:
  • data (pd.DataFrame) – Dummy data.

  • cluster_col (str) – Cluster column name.

  • disease_col (str) – Disease column name.

  • individual_col (str) – Individual column name.

Returns:

Tuple of encoded cell diseases, cell individuals, and cell clusters.
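
To make the idea of encoding concrete, the sketch below maps each distinct label to an integer code and returns the three encoded columns in the documented order (diseases, individuals, clusters). It uses a list of dicts in place of a DataFrame; the sorted-label coding scheme and the names `encode_labels` and `encode_columns_sketch` are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple


def encode_labels(labels: Sequence[str]) -> List[int]:
    """Map each distinct label to an integer code, assigned in sorted label order."""
    codes: Dict[str, int] = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels]


def encode_columns_sketch(
    rows: List[Dict[str, str]], cluster_col: str, disease_col: str, individual_col: str
) -> Tuple[List[int], List[int], List[int]]:
    """Return encoded (diseases, individuals, clusters), one code per row."""
    diseases = encode_labels([row[disease_col] for row in rows])
    individuals = encode_labels([row[individual_col] for row in rows])
    clusters = encode_labels([row[cluster_col] for row in rows])
    return diseases, individuals, clusters


# Hypothetical rows; the column names match the defaults of generate_pseudo_features.
rows = [
    {"cell_type": "B", "disease": "1", "batch": "b2"},
    {"cell_type": "A", "disease": "0", "batch": "b1"},
    {"cell_type": "B", "disease": "0", "batch": "b1"},
]
encoded = encode_columns_sketch(rows, "cell_type", "disease", "batch")
print(encoded)  # ([1, 0, 0], [1, 0, 0], [1, 0, 1])
```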

Seq_Sim.utils.seq_sim_utils.generate_all_features(n_features, data, seed, cluster_ratio, ratio_variance, cluster_col, disease_col, individual_col)

Generate all features for pseudo-feature generation.

Parameters:
  • n_features (int) – Number of features to generate.

  • data (pd.DataFrame) – Dummy data.

  • seed (int) – Seed for reproducibility.

  • cluster_ratio (float) – Cluster ratio.

  • ratio_variance (float) – Ratio variance.

  • cluster_col (str) – Cluster column name.

  • disease_col (str) – Disease column name.

  • individual_col (str) – Individual column name.

Returns:

List of generated features.

Seq_Sim.utils.seq_sim_utils.generate_and_save_features(num_samples: int, fold_change: float, config: Dict[str, Any]) None

Generate dummy data and pseudo-features, and save them according to the configuration.

Parameters:
  • num_samples (int) – Number of samples to generate.

  • fold_change (float) – Fold change for interaction effects.

  • config (Dict[str, Any]) – Configuration dictionary.

Seq_Sim.utils.seq_sim_utils.generate_cell_counts(subject_ids: List[str], n_cells: int, sd_celltypes: float, n_major_cell_types: int, n_minor_cell_types: int, relative_abundance: float) Tuple[List[List[int]], List[List[int]]]

Generate baseline cell counts for major and minor cell types for all subjects.

Parameters:
  • subject_ids – List of subject IDs.

  • n_cells – Total number of cells per subject.

  • sd_celltypes – Standard deviation of cell counts.

  • n_major_cell_types – Number of major cell types.

  • n_minor_cell_types – Number of minor cell types.

  • relative_abundance – Ratio between major and minor cell types.

Returns:

Major cell counts and rare cell counts for each subject.

Return type:

Tuple[List[List[int]], List[List[int]]]

Seq_Sim.utils.seq_sim_utils.generate_cell_counts_for_subject(n_cells: int, sd_celltypes: float, n_major_cell_types: int, n_minor_cell_types: int, relative_abundance: float) Tuple[List[int], List[int]]

Generate major and rare cell counts for a single subject.

Parameters:
  • n_cells – Total number of cells per subject.

  • sd_celltypes – Standard deviation of cell counts.

  • n_major_cell_types – Number of major cell types.

  • n_minor_cell_types – Number of minor cell types.

  • relative_abundance – Ratio between major and minor cell types.

Returns:

Tuple of major cell counts and rare cell counts.

Seq_Sim.utils.seq_sim_utils.generate_cell_types(n_major_cell_types: int, n_minor_cell_types: int) List[str]

Generate a list of cell types based on the number of major and minor cell types.

Parameters:
  • n_major_cell_types – Number of major cell types.

  • n_minor_cell_types – Number of minor cell types.

Returns:

List of cell types.

Seq_Sim.utils.seq_sim_utils.generate_cell_types_from_range(start_index: int, num_types: int) List[str]

Generate a list of cell types based on a starting index and number of types.

Parameters:
  • start_index – Starting index for the cell types.

  • num_types – Number of cell types to generate.

Returns:

List of cell types.

Seq_Sim.utils.seq_sim_utils.generate_cluster_features(cell_clusters, unique_clusters)

Generate features based on cluster data.

Parameters:
  • cell_clusters (np.ndarray) – Encoded cell clusters.

  • unique_clusters (int) – Number of unique clusters.

Returns:

Tuple of cluster features and variance.

Seq_Sim.utils.seq_sim_utils.generate_diff_cell_types(diff_clusters: List[int]) List[str]

Generate the list of differential cell types based on the clusters.

Parameters:

diff_clusters – List of indices for the differentially expressed clusters.

Returns:

List of differential cell types.

Seq_Sim.utils.seq_sim_utils.generate_disease_features(cell_diseases)

Generate features based on disease data.

Parameters:

cell_diseases (np.ndarray) – Encoded cell diseases.

Returns:

Variance of disease features.

Return type:

float

Seq_Sim.utils.seq_sim_utils.generate_dummy_data_wo_interaction(n_cells: int = 3000, sd_celltypes: float = 0.1, n_major_cell_types: int = 7, n_minor_cell_types: int = 3, relative_abundance: float = 0.1, n_major_diff_celltypes: int = 1, n_minor_diff_celltypes: int = 1, n_individuals: int = 30, n_batches: int = 4, prop_sex: float = 0.5, prop_disease: float = 0.5, fc_interact: float = 0.1, seed: int = 1234) Tuple[DataFrame, List[str]]

Generate dummy data for simulation without interactions.

Parameters:
  • n_cells – Number of cells of major cell types per individual.

  • sd_celltypes – Standard deviation of number of cells.

  • n_major_cell_types – Number of major cell types.

  • n_minor_cell_types – Number of minor cell types.

  • relative_abundance – Ratio between major and rare cell types.

  • n_major_diff_celltypes – Number of differentially expressed major cell types.

  • n_minor_diff_celltypes – Number of differentially expressed minor cell types.

  • n_individuals – Total number of individuals.

  • n_batches – Number of batches.

  • prop_sex – Proportion of sex.

  • prop_disease – Proportion of disease.

  • fc_interact – Fold change for interacted cells.

  • seed – Random seed for reproducibility.

Returns:

The generated dummy data and the list of differential cell types.

Return type:

Tuple[DataFrame, List[str]]

Seq_Sim.utils.seq_sim_utils.generate_feature(idx, data, seed, cluster_ratio, ratio_variance, cluster_col, disease_col, individual_col)

Process a single feature for pseudo-feature generation.

Parameters:
  • idx (int) – Feature index.

  • data (pd.DataFrame) – Dummy data.

  • seed (int) – Seed for reproducibility.

  • cluster_ratio (float) – Cluster ratio.

  • ratio_variance (float) – Ratio variance.

  • cluster_col (str) – Cluster column name.

  • disease_col (str) – Disease column name.

  • individual_col (str) – Individual column name.

Returns:

Generated feature.

Return type:

np.ndarray

Seq_Sim.utils.seq_sim_utils.generate_individual_features(cell_individual)

Generate features based on individual data.

Parameters:

cell_individual (np.ndarray) – Encoded cell individuals.

Returns:

Variance of individual features.

Return type:

float

Seq_Sim.utils.seq_sim_utils.generate_major_cell_counts(n_cells: int, sd_celltypes: float, n_major_cell_types: int) List[int]

Generate cell counts for major cell types based on a uniform distribution.

Parameters:
  • n_cells – Total number of cells per subject.

  • sd_celltypes – Standard deviation of cell counts.

  • n_major_cell_types – Number of major cell types.

Returns:

List of major cell counts.

Seq_Sim.utils.seq_sim_utils.generate_pseudo_features(data, n_features=20, cluster_ratio=0.25, ratio_variance=0.5, cluster_col='cell_type', disease_col='disease', individual_col='batch', seed=1234)

Generate pseudo-features for simulation.

Parameters:
  • data (pd.DataFrame) – Dummy data.

  • n_features (int, optional) – Number of features to generate. Defaults to 20.

  • cluster_ratio (float, optional) – Cluster ratio. Defaults to 0.25.

  • ratio_variance (float, optional) – Ratio variance. Defaults to 0.5.

  • cluster_col (str, optional) – Cluster column. Defaults to “cell_type”.

  • disease_col (str, optional) – Disease column. Defaults to “disease”.

  • individual_col (str, optional) – Individual column. Defaults to “batch”.

  • seed (int, optional) – Seed. Defaults to 1234.

Returns:

Pseudo-features.

Return type:

pd.DataFrame

Seq_Sim.utils.seq_sim_utils.generate_rare_cell_counts(n_cells: int, sd_celltypes: float, relative_abundance: float, n_minor_cell_types: int) List[int]

Generate cell counts for minor cell types based on a uniform distribution.

Parameters:
  • n_cells – Total number of cells per subject.

  • sd_celltypes – Standard deviation of cell counts.

  • relative_abundance – Ratio between major and minor cell types.

  • n_minor_cell_types – Number of minor cell types.

Returns:

List of rare cell counts.

Seq_Sim.utils.seq_sim_utils.generate_variance(data)

Generate variance for the given data.

Parameters:

data (np.ndarray) – The data based on which variance is to be generated.

Returns:

Mean of the generated variance.

Return type:

float

Seq_Sim.utils.seq_sim_utils.identify_diff_clusters(n_major_diff_celltypes: int, n_minor_diff_celltypes: int, n_cells: int) List[int]

Identify the indices of the differentially expressed clusters.

Parameters:
  • n_major_diff_celltypes – Number of differentially expressed major cell types.

  • n_minor_diff_celltypes – Number of differentially expressed minor cell types.

  • n_cells – Total number of cells per individual.

Returns:

List of indices representing the differentially expressed clusters.

Seq_Sim.utils.seq_sim_utils.load_config(config_file: str) dict

Load configuration settings from a YAML file.

Parameters:

config_file – Path to the YAML configuration file.

Returns:

A dictionary of configuration settings.

Seq_Sim.utils.seq_sim_utils.process_feature(x, data, seed, cluster_ratio, ratio_variance, cluster_col, disease_col, individual_col)

Process a single feature for pseudo-feature generation.

Parameters:
  • x (int) – Feature index.

  • data (pd.DataFrame) – Dummy data.

  • seed (int) – Seed for reproducibility.

  • cluster_ratio (float) – Cluster ratio.

  • ratio_variance (float) – Ratio variance.

  • cluster_col (str) – Cluster column name.

  • disease_col (str) – Disease column name.

  • individual_col (str) – Individual column name.

Returns:

Generated feature.

Return type:

np.ndarray

Seq_Sim.utils.seq_sim_utils.save_data_to_csv(data: DataFrame, file_path: str) None

Save a DataFrame to a CSV file.

Parameters:
  • data (pd.DataFrame) – DataFrame to save.

  • file_path (str) – Destination file path.

Seq_Sim.utils.seq_sim_utils.set_random_seed(x, seed)

Set random seed for reproducibility.

Parameters:
  • x (int) – Feature index.

  • seed (int) – Seed for reproducibility.
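
A per-feature seed is typically derived from the base seed and the feature index so that each feature gets its own reproducible random stream. The sketch below assumes the two are simply added and uses the standard-library `random` module. Both choices, and the names `set_random_seed_sketch` and `draw`, are assumptions rather than the package's behaviour.

```python
import random
from typing import List


def set_random_seed_sketch(x: int, seed: int) -> None:
    """Seed the global generator with the base seed offset by the feature index."""
    # Assumption: the offset is a plain sum; the real function may combine them differently.
    random.seed(seed + x)


def draw(x: int, seed: int, n: int = 3) -> List[float]:
    """Seed for feature x, then draw n uniform random numbers."""
    set_random_seed_sketch(x, seed)
    return [random.random() for _ in range(n)]


same = draw(0, 1234) == draw(0, 1234)       # same feature and seed repeat exactly
different = draw(0, 1234) != draw(1, 1234)  # a different feature index changes the stream
```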

Seq_Sim.utils.seq_sim_utils.validate_arguments(args: list[str]) tuple[int, float, str]

Validate and parse command-line arguments.

Parameters:

args (list[str]) – List of command-line arguments.

Returns:

Parsed number of samples, fold change, and config file path.

Return type:

tuple[int, float, str]

Raises:

ValueError – If arguments are invalid or insufficient.
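
The validation contract above can be sketched as follows. This assumes the list holds exactly the three values in the order of the documented return tuple (number of samples, fold change, config file path) and excludes the program name. That ordering, the positivity check, and the name `validate_arguments_sketch` are assumptions for illustration.

```python
from typing import List, Tuple


def validate_arguments_sketch(args: List[str]) -> Tuple[int, float, str]:
    """Parse [num_samples, fold_change, config_file] or raise ValueError."""
    if len(args) < 3:
        raise ValueError("expected: <num_samples> <fold_change> <config_file>")
    try:
        num_samples = int(args[0])
        fold_change = float(args[1])
    except ValueError as exc:
        # Re-raise with a message that points at the offending argument.
        raise ValueError(f"invalid numeric argument: {exc}") from exc
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return num_samples, fold_change, args[2]


parsed = validate_arguments_sketch(["30", "0.1", "config.yaml"])
print(parsed)  # (30, 0.1, 'config.yaml')
```

The parsed tuple lines up with the first two parameters of generate_and_save_features, with the third value being the path that load_config expects.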