Seq_Sim Core Functions
Module contents
- Seq_Sim.utils.seq_sim_utils.add_differential_cell_types(celltype_df: DataFrame, dummy_data: DataFrame, n_cells: int, n_major_diff_celltypes: int, n_minor_diff_celltypes: int, fc_interact: float) → List[str]
Add differential expression for cell types based on disease condition.
- Parameters:
celltype_df – DataFrame of cell types.
dummy_data – DataFrame of dummy data.
n_cells – Total number of cells per individual.
n_major_diff_celltypes – Number of differentially expressed major cell types.
n_minor_diff_celltypes – Number of differentially expressed minor cell types.
fc_interact – Fold change for interacted cells.
- Returns:
List of differential cell types.
- Seq_Sim.utils.seq_sim_utils.add_rows_for_diff_cells(celltype_df: DataFrame, dummy_data: DataFrame, diff: int, cell_type: str) → DataFrame
Add additional rows to the dataframe for the differentially expressed cell type.
- Parameters:
celltype_df – DataFrame of cell types.
dummy_data – DataFrame of dummy data.
diff – Number of additional rows to add for the cell type.
cell_type – The cell type for which to add rows.
- Returns:
The updated dataframe with added rows for the differential cell type.
- Seq_Sim.utils.seq_sim_utils.apply_noise(pc_cluster, var_all, n_cells, cluster_ratio, ratio_variance)
Apply noise to the generated features.
- Parameters:
pc_cluster (int) – Cluster features.
var_all (List) – List of variances.
n_cells (int) – Number of cells.
cluster_ratio (float) – Cluster ratio.
ratio_variance (float) – Ratio variance.
- Returns:
Final feature with noise.
- Return type:
np.ndarray
- Seq_Sim.utils.seq_sim_utils.arg_parser()
Argument parser for the command-line arguments.
- Returns:
Argument parser object.
- Return type:
argparse.ArgumentParser
- Seq_Sim.utils.seq_sim_utils.calculate_abundance(celltype_df: DataFrame, dummy_data: DataFrame, cell_type: str, n_cells: int) → ndarray
Calculate the abundance of a specific cell type across individuals.
- Parameters:
celltype_df – DataFrame of cell types.
dummy_data – DataFrame of dummy data.
cell_type – The specific cell type to calculate abundance for.
n_cells – Total number of cells per individual.
- Returns:
Abundance of the cell type for each subject.
- Seq_Sim.utils.seq_sim_utils.calculate_diff_expression(abundance: ndarray, n_cells: int, fc_interact: float) → int
Calculate the differential expression based on fold change.
- Parameters:
abundance – Abundance of the cell type for each subject.
n_cells – Total number of cells per individual.
fc_interact – Fold change for interacted cells.
- Returns:
The number of additional rows to add to the dataframe for the cell type.
- Seq_Sim.utils.seq_sim_utils.case_when(condition_list: List[Tuple[bool, any]], default_value: any) → any
Mimic dplyr’s case_when functionality.
- Parameters:
condition_list – List of tuples with boolean conditions and associated return values.
default_value – The value returned if no condition matches.
- Returns:
The value associated with the first matching condition, or the default value.
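The first-match semantics described for case_when can be illustrated with a minimal sketch. This is a stand-alone reimplementation written from the description above, not the package's source; the variable names in the usage lines are hypothetical.

```python
from typing import Any, List, Tuple


def case_when(condition_list: List[Tuple[bool, Any]], default_value: Any) -> Any:
    """Return the value paired with the first true condition, else the default."""
    for condition, value in condition_list:
        if condition:
            return value
    return default_value


# Hypothetical usage: conditions are evaluated eagerly, then checked in order.
n_cells = 120
size = case_when(
    [(n_cells > 1000, "large"), (n_cells > 100, "medium")],
    "small",
)
```

Because the conditions are plain booleans, they are all evaluated before the call, unlike dplyr's vectorised case_when; only the order of the tuples decides which value wins.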
- Seq_Sim.utils.seq_sim_utils.combine_features(features, n_features)
Combine the features into a DataFrame.
- Parameters:
features (List) – List of features.
n_features (int) – Number of features.
- Returns:
Pseudo-features
- Return type:
pd.DataFrame
- Seq_Sim.utils.seq_sim_utils.create_celltype_dataframe(subject_ids: List[str], major_cell_counts: List[List[int]], rare_cell_counts: List[List[int]], n_major_cell_types: int, n_minor_cell_types: int) → DataFrame
Create the dataframe for cell types based on the given counts.
- Parameters:
subject_ids – List of subject IDs.
major_cell_counts – Major cell counts for each subject.
rare_cell_counts – Rare cell counts for each subject.
n_major_cell_types – Number of major cell types.
n_minor_cell_types – Number of minor cell types.
- Returns:
A DataFrame containing the cell types for each subject.
- Seq_Sim.utils.seq_sim_utils.create_directory_if_not_exists(directory: str) → None
Creates a directory if it does not already exist.
- Parameters:
directory (str) – Path to the directory.
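A helper with this contract is commonly written around os.makedirs with exist_ok=True. The sketch below assumes that approach; the actual implementation may differ (for example, it may check os.path.exists first). The temporary path is only for illustration.

```python
import os
import tempfile


def create_directory_if_not_exists(directory: str) -> None:
    """Create `directory` (and any missing parents); do nothing if it exists."""
    os.makedirs(directory, exist_ok=True)


# Illustration in a throwaway location: calling twice is safe.
base = tempfile.mkdtemp()
target = os.path.join(base, "results", "run_01")
create_directory_if_not_exists(target)
create_directory_if_not_exists(target)  # second call is a no-op
```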
- Seq_Sim.utils.seq_sim_utils.create_new_rows_for_cell_type(subj_id: str, cell_type: str, count: int) → DataFrame
Create new rows for a specific cell type for a given subject.
- Parameters:
subj_id – The subject ID.
cell_type – The type of cell.
count – The number of cells of that type.
- Returns:
DataFrame with new rows for the given cell type.
- Seq_Sim.utils.seq_sim_utils.encode_categorical_columns(data, cluster_col, disease_col, individual_col)
Encode categorical columns for pseudo-feature generation.
- Parameters:
data (pd.DataFrame) – Dummy data.
cluster_col (str) – Cluster column name.
disease_col (str) – Disease column name.
individual_col (str) – Individual column name.
- Returns:
Tuple of encoded cell diseases, cell individuals, and cell clusters.
- Seq_Sim.utils.seq_sim_utils.generate_all_features(n_features, data, seed, cluster_ratio, ratio_variance, cluster_col, disease_col, individual_col)
Generate all features for pseudo-feature generation.
- Parameters:
n_features (int) – Number of features to generate.
data (pd.DataFrame) – Dummy data.
seed (int) – Seed for reproducibility.
cluster_ratio (float) – Cluster ratio.
ratio_variance (float) – Ratio variance.
cluster_col (str) – Cluster column name.
disease_col (str) – Disease column name.
individual_col (str) – Individual column name.
- Returns:
List of generated features.
- Seq_Sim.utils.seq_sim_utils.generate_and_save_features(num_samples: int, fold_change: float, config: Dict[str, Any]) → None
Generate dummy data and pseudo-features, and save them according to the configuration.
- Parameters:
num_samples (int) – Number of samples to generate.
fold_change (float) – Fold change for interaction effects.
config (Dict[str, Any]) – Configuration dictionary.
- Seq_Sim.utils.seq_sim_utils.generate_cell_counts(subject_ids: List[str], n_cells: int, sd_celltypes: float, n_major_cell_types: int, n_minor_cell_types: int, relative_abundance: float) → Tuple[List[List[int]], List[List[int]]]
Generate baseline cell counts for major and minor cell types for all subjects.
- Parameters:
subject_ids – List of subject IDs.
n_cells – Total number of cells per subject.
sd_celltypes – Standard deviation of cell counts.
n_major_cell_types – Number of major cell types.
n_minor_cell_types – Number of minor cell types.
relative_abundance – Ratio between major and minor cell types.
- Returns:
A tuple of two lists: the major cell counts and the rare cell counts for each subject.
- Return type:
Tuple[List[List[int]], List[List[int]]]
- Seq_Sim.utils.seq_sim_utils.generate_cell_counts_for_subject(n_cells: int, sd_celltypes: float, n_major_cell_types: int, n_minor_cell_types: int, relative_abundance: float) → Tuple[List[int], List[int]]
Generate major and rare cell counts for a single subject.
- Parameters:
n_cells – Total number of cells per subject.
sd_celltypes – Standard deviation of cell counts.
n_major_cell_types – Number of major cell types.
n_minor_cell_types – Number of minor cell types.
relative_abundance – Ratio between major and minor cell types.
- Returns:
Tuple of major cell counts and rare cell counts.
- Seq_Sim.utils.seq_sim_utils.generate_cell_types(n_major_cell_types: int, n_minor_cell_types: int) → List[str]
Generate a list of cell types based on the number of major and minor cell types.
- Parameters:
n_major_cell_types – Number of major cell types.
n_minor_cell_types – Number of minor cell types.
- Returns:
List of cell types.
- Seq_Sim.utils.seq_sim_utils.generate_cell_types_from_range(start_index: int, num_types: int) → List[str]
Generate a list of cell types based on a starting index and number of types.
- Parameters:
start_index – Starting index for the cell types.
num_types – Number of cell types to generate.
- Returns:
List of cell types.
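One plausible way the two label helpers fit together is for generate_cell_types to build the major labels from index 0 and the minor labels from the index where the major ones end. The sketch below assumes exactly that, and the "celltype_<n>" naming scheme is hypothetical; the labels the package actually emits are not stated on this page.

```python
from typing import List


def generate_cell_types_from_range(start_index: int, num_types: int) -> List[str]:
    # Hypothetical label scheme: "celltype_<n>", numbered from start_index + 1.
    return [f"celltype_{i + 1}" for i in range(start_index, start_index + num_types)]


def generate_cell_types(n_major_cell_types: int, n_minor_cell_types: int) -> List[str]:
    # Assumption: major labels come first, minor labels continue the numbering.
    major = generate_cell_types_from_range(0, n_major_cell_types)
    minor = generate_cell_types_from_range(n_major_cell_types, n_minor_cell_types)
    return major + minor


# With the documented defaults of 7 major and 3 minor cell types:
cell_types = generate_cell_types(7, 3)
```

Continuing the numbering keeps every label unique, so a later step can select differential cell types by index without ambiguity.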
- Seq_Sim.utils.seq_sim_utils.generate_cluster_features(cell_clusters, unique_clusters)
Generate features based on cluster data.
- Parameters:
cell_clusters (np.ndarray) – Encoded cell clusters.
unique_clusters (int) – Number of unique clusters.
- Returns:
Tuple of cluster features and variance.
- Seq_Sim.utils.seq_sim_utils.generate_diff_cell_types(diff_clusters: List[int]) → List[str]
Generate the list of differential cell types based on the clusters.
- Parameters:
diff_clusters – List of indices for the differentially expressed clusters.
- Returns:
List of differential cell types.
- Seq_Sim.utils.seq_sim_utils.generate_disease_features(cell_diseases)
Generate features based on disease data.
- Parameters:
cell_diseases (np.ndarray) – Encoded cell diseases.
- Returns:
Variance of disease features.
- Return type:
float
- Seq_Sim.utils.seq_sim_utils.generate_dummy_data_wo_interaction(n_cells: int = 3000, sd_celltypes: float = 0.1, n_major_cell_types: int = 7, n_minor_cell_types: int = 3, relative_abundance: float = 0.1, n_major_diff_celltypes: int = 1, n_minor_diff_celltypes: int = 1, n_individuals: int = 30, n_batches: int = 4, prop_sex: float = 0.5, prop_disease: float = 0.5, fc_interact: float = 0.1, seed: int = 1234) → Tuple[DataFrame, List[str]]
Generate dummy data for simulation without interactions.
- Parameters:
n_cells – Number of cells of major cell types per individual.
sd_celltypes – Standard deviation of number of cells.
n_major_cell_types – Number of major cell types.
n_minor_cell_types – Number of minor cell types.
relative_abundance – Ratio between major and rare cell types.
n_major_diff_celltypes – Number of differentially expressed major cell types.
n_minor_diff_celltypes – Number of differentially expressed minor cell types.
n_individuals – Total number of individuals.
n_batches – Number of batches.
prop_sex – Proportion of sex.
prop_disease – Proportion of disease.
fc_interact – Proportion of interacted cells.
seed – Random seed for reproducibility.
- Returns:
A tuple containing the generated dummy data (DataFrame) and the list of differential cell types.
- Return type:
Tuple[DataFrame, List[str]]
- Seq_Sim.utils.seq_sim_utils.generate_feature(idx, data, seed, cluster_ratio, ratio_variance, cluster_col, disease_col, individual_col)
Process a single feature for pseudo-feature generation.
- Parameters:
idx (int) – Feature index.
data (pd.DataFrame) – Dummy data.
seed (int) – Seed for reproducibility.
cluster_ratio (float) – Cluster ratio.
ratio_variance (float) – Ratio variance.
cluster_col (str) – Cluster column name.
disease_col (str) – Disease column name.
individual_col (str) – Individual column name.
- Returns:
Generated feature.
- Return type:
np.ndarray
- Seq_Sim.utils.seq_sim_utils.generate_individual_features(cell_individual)
Generate features based on individual data.
- Parameters:
cell_individual (np.ndarray) – Encoded cell individuals.
- Returns:
Variance of individual features.
- Return type:
float
- Seq_Sim.utils.seq_sim_utils.generate_major_cell_counts(n_cells: int, sd_celltypes: float, n_major_cell_types: int) → List[int]
Generate cell counts for major cell types based on a uniform distribution.
- Parameters:
n_cells – Total number of cells per subject.
sd_celltypes – Standard deviation of cell counts.
n_major_cell_types – Number of major cell types.
- Returns:
List of major cell counts.
- Seq_Sim.utils.seq_sim_utils.generate_pseudo_features(data, n_features=20, cluster_ratio=0.25, ratio_variance=0.5, cluster_col='cell_type', disease_col='disease', individual_col='batch', seed=1234)
Generate pseudo-features for simulation.
- Parameters:
data (pd.DataFrame) – Dummy data.
n_features (int, optional) – Number of features to generate. Defaults to 20.
cluster_ratio (float, optional) – Cluster ratio. Defaults to 0.25.
ratio_variance (float, optional) – Ratio variance. Defaults to 0.5.
cluster_col (str, optional) – Cluster column. Defaults to “cell_type”.
disease_col (str, optional) – Disease column. Defaults to “disease”.
individual_col (str, optional) – Individual column. Defaults to “batch”.
seed (int, optional) – Seed. Defaults to 1234.
- Returns:
Pseudo-features.
- Return type:
pd.DataFrame
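Going by the signatures documented on this page, the dummy-data generator and the pseudo-feature generator can be chained as sketched below. The wrapper name run_simulation is hypothetical, and the sketch assumes the package is installed and that the generated dummy data contains the default cell_type, disease, and batch columns; neither assumption is confirmed here.

```python
def run_simulation(n_individuals=30, n_features=20, seed=1234):
    """Hypothetical wrapper: generate dummy data, then pseudo-features from it."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from Seq_Sim.utils.seq_sim_utils import (
        generate_dummy_data_wo_interaction,
        generate_pseudo_features,
    )

    # Returns the dummy data and the list of differential cell types.
    dummy_data, diff_cell_types = generate_dummy_data_wo_interaction(
        n_individuals=n_individuals, seed=seed
    )
    # Relies on the documented default column names for cluster, disease, individual.
    pseudo_features = generate_pseudo_features(
        dummy_data, n_features=n_features, seed=seed
    )
    return dummy_data, diff_cell_types, pseudo_features
```

Passing the same seed to both calls keeps a run reproducible end to end, which appears to be the purpose of the seed parameters.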
- Seq_Sim.utils.seq_sim_utils.generate_rare_cell_counts(n_cells: int, sd_celltypes: float, relative_abundance: float, n_minor_cell_types: int) → List[int]
Generate cell counts for minor cell types based on a uniform distribution.
- Parameters:
n_cells – Total number of cells per subject.
sd_celltypes – Standard deviation of cell counts.
relative_abundance – Ratio between major and minor cell types.
n_minor_cell_types – Number of minor cell types.
- Returns:
List of rare cell counts.
- Seq_Sim.utils.seq_sim_utils.generate_variance(data)
Generate variance for the given data.
- Parameters:
data (np.ndarray) – The data based on which variance is to be generated.
- Returns:
Mean of the generated variance.
- Return type:
float
- Seq_Sim.utils.seq_sim_utils.identify_diff_clusters(n_major_diff_celltypes: int, n_minor_diff_celltypes: int, n_cells: int) → List[int]
Identify the indices of the differentially expressed clusters.
- Parameters:
n_major_diff_celltypes – Number of differentially expressed major cell types.
n_minor_diff_celltypes – Number of differentially expressed minor cell types.
n_cells – Total number of cells per individual.
- Returns:
List of indices representing the differentially expressed clusters.
- Seq_Sim.utils.seq_sim_utils.load_config(config_file: str) → dict
Load configuration settings from a YAML file.
- Parameters:
config_file – Path to the YAML configuration file.
- Returns:
A dictionary of configuration settings.
- Seq_Sim.utils.seq_sim_utils.process_feature(x, data, seed, cluster_ratio, ratio_variance, cluster_col, disease_col, individual_col)
Process a single feature for pseudo-feature generation.
- Parameters:
x (int) – Feature index.
data (pd.DataFrame) – Dummy data.
seed (int) – Seed for reproducibility.
cluster_ratio (float) – Cluster ratio.
ratio_variance (float) – Ratio variance.
cluster_col (str) – Cluster column name.
disease_col (str) – Disease column name.
individual_col (str) – Individual column name.
- Returns:
Generated feature.
- Return type:
np.ndarray
- Seq_Sim.utils.seq_sim_utils.save_data_to_csv(data: DataFrame, file_path: str) → None
Saves a DataFrame to a CSV file.
- Parameters:
data (pd.DataFrame) – DataFrame to save.
file_path (str) – Destination file path.
- Seq_Sim.utils.seq_sim_utils.set_random_seed(x, seed)
Set random seed for reproducibility.
- Parameters:
x (int) – Feature index.
seed (int) – Seed for reproducibility.
- Seq_Sim.utils.seq_sim_utils.validate_arguments(args: list[str]) → tuple[int, float, str]
Validates and parses command-line arguments.
- Parameters:
args (list[str]) – List of command-line arguments.
- Returns:
Parsed number of samples, fold change, and config file path.
- Return type:
tuple[int, float, str]
- Raises:
ValueError – If arguments are invalid or insufficient.
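The contract above (a list of strings in; an int, a float, and a path out; ValueError on bad input) can be sketched as follows. The positional order, the positivity check, and the error messages are assumptions for illustration. The real function may, for instance, expect the script name as the first element.

```python
from typing import List, Tuple


def validate_arguments(args: List[str]) -> Tuple[int, float, str]:
    """Parse [num_samples, fold_change, config_file] or raise ValueError."""
    # Assumption: exactly these three values, in this order, without the script name.
    if len(args) < 3:
        raise ValueError("expected: <num_samples> <fold_change> <config_file>")
    try:
        num_samples = int(args[0])
        fold_change = float(args[1])
    except ValueError as exc:
        raise ValueError(f"could not parse numeric arguments: {exc}") from exc
    # Assumption: a non-positive sample count is treated as invalid.
    if num_samples <= 0:
        raise ValueError("num_samples must be a positive integer")
    return num_samples, fold_change, args[2]


parsed = validate_arguments(["30", "0.1", "config.yaml"])
```

The returned tuple lines up with the num_samples, fold_change, and config parameters of generate_and_save_features once the config file has been loaded with load_config.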