The goal of this repository is to provide all code and utils necessary to generate the FedPyDESeq2 paper plots.
Start by cloning this repository and navigating to its root directory.
git clone https://github.com/owkin/fedpydeseq2-paper-experiments.git
cd fedpydeseq2-paper-experiments
Create a Conda environment with Python 3.11 to ensure reproducibility.
conda create -n fedpydeseq2_paper_experiments python=3.11
conda activate fedpydeseq2_paper_experiments
Install Poetry to manage dependencies.
conda install pip
pip install poetry==1.8.2
Install the necessary dependencies using Poetry, including the linting group if needed.
poetry install --with linting
To download the raw data, please use the provided command fedpydeseq2-download-data
to download it to the
desired path.
fedpydeseq2-download-data --raw_data_output_path <path>
You can pass the conda
activation path as an argument as well, for example:
fedpydeseq2-download-data --raw_data_output_path <path> --conda_activate_path /opt/miniconda/bin/activate
This repository provides three main commands once installed:
-
run-dge-gsea
:- Executes Differential Gene Expression (DGE) experiments and performs Gene Set Enrichment Analysis (GSEA) on the DGE results.
- Requires an experiment configuration file and a paths configuration file.
-
make-fedpydeseq2-remote-script
:- Generates the necessary scripts to run
fedpydeseq2
on a remote server for real Federated Learning (FL) execution. - Requires an experiment configuration file and a paths configuration file.
- Generates the necessary scripts to run
-
make-plots
:- Creates the desired figures based on the DGE and GSEA results, as specified in the configuration file.
- Requires an experiment configuration file and a paths configuration file.
Each of these commands can run on an experiment configuration file and a paths configuration file. Alternatively, you can provide just the experiment name which will point to the configuration file.
-
Experiment configuration file:
- Sets up an experiment to compare different DGE methods using the same settings defined by
pydeseq2_parameters
. - Configures the GSEA run for this comparison.
- Specifies the figures to be created from the DGE and GSEA results.
- Sets up an experiment to compare different DGE methods using the same settings defined by
-
Paths configuration file:
- Specifies the paths where the raw data is stored, which is used to run the DGE experiments.
- Defines where to save the results for each experiment.
In the package, we run several Differential Gene Expression (DGE) methods to compare their performance and results. The available methods are:
fedpydeseq2_simulated
: Simulates federated learning by splitting the data into multiple subsets and run the federated pipeline locally.meta_analysis
: Can be run using various techniques, including:- P-value combination methods: Stouffer's and Fisher's methods.
- Fixed effect model.
- Random effect model with either a one-step DerSimonian-Laird estimator or an iterated method.
pydeseq2_largest
: Runs the DGE analysis on the largest cohort available, providing a baseline for comparison.pydeseq2
: The standard approach, running the analysis on the entire dataset.fedpydeseq2_remote
: Designed for real federated learning scenarios, where the analysis is performed on remote servers, and each dataset is stored only on its corresponding remote server.
For now, we implement three types of figures to visualize the results:
- Heterogeneity plots: Visualize the variability in gene expression across different centers or batches.
- Cross tables: Compare the results of different DGE methods by displaying the overlap and differences in identified genes.
- GSEA plots: Visualize the results of Gene Set Enrichment Analysis, showing the enrichment scores and significance of gene sets.
pydeseq2_parameters:
design_factors:
- stage
continuous_factors: null
refit_cooks: null
contrast: null
keep_original_centers: true
datasets:
- TCGA-LUAD
- TCGA-LUSC
- TCGA-READ
- TCGA-COAD
- TCGA-SKCM
- TCGA-PAAD
- TCGA-BRCA
- TCGA-PRAD
heterogeneity: null
meta_analysis: META_ANALYSIS_PARAMETERS
run_dge:
- pydeseq2
- fedpydeseq2_simulated
- meta_analysis
- pydeseq2_largest
run_gsea:
- fedpydeseq2_simulated
- meta_analysis
- pydeseq2_largest
- pydeseq2
- fedpydeseq2_remote
plots:
cross_table:
log2fc_threshold: 2.0
padj_threshold: 0.05
method_pairs:
- ['fedpydeseq2_simulated', 'pydeseq2']
- ['fedpydeseq2_remote', 'pydeseq2']
- ['fedpydeseq2_simulated', 'fedpydeseq2_remote']
- ['meta_analysis', 'pydeseq2']
- ['pydeseq2_largest', 'pydeseq2']
gsea_plots:
method_pairs:
- ['fedpydeseq2_simulated', 'pydeseq2']
- ['fedpydeseq2_remote', 'pydeseq2']
- ['meta_analysis', 'pydeseq2']
- ['pydeseq2_largest', 'pydeseq2']
plot_parameters:
- with_diff: true
padj_threshold: 0.05
- with_diff: false
padj_threshold: 0.05
- with_diff: true
padj_threshold: 0.05
with_pvalues: true
- with_diff: false
padj_threshold: 0.05
with_pvalues: true
-
pydeseq2_parameters:
- design_factors: Specifies the factors used in the design matrix for the analysis. Here, it includes
stage
. - continuous_factors: Specifies continuous factors used in the analysis. Set to
null
indicating no continuous factors are used. - refit_cooks: Parameter for refitting Cook's distances. Set to
null
. - contrast: Specifies the contrast parameter for the analysis. Set to
null
.
- design_factors: Specifies the factors used in the design matrix for the analysis. Here, it includes
-
keep_original_centers:
- A boolean flag indicating whether to keep the original centers. Set to
true
.
- A boolean flag indicating whether to keep the original centers. Set to
-
datasets:
- Lists the datasets to be used in the analysis. Examples include
TCGA-LUAD
,TCGA-LUSC
,TCGA-READ
, etc.
- Lists the datasets to be used in the analysis. Examples include
-
heterogeneity:
- Parameter for heterogeneity analysis. Set to
null
.
- Parameter for heterogeneity analysis. Set to
-
meta_analysis:
- Placeholder for meta-analysis parameters. Set to
META_ANALYSIS_PARAMETERS
.
- Placeholder for meta-analysis parameters. Set to
-
run_dge:
- Specifies the DGE methods to be run. The methods listed are:
pydeseq2
fedpydeseq2_simulated
meta_analysis
pydeseq2_largest
- Specifies the DGE methods to be run. The methods listed are:
-
run_gsea:
- Specifies the Gene Set Enrichment Analysis (GSEA) methods to be run. The methods listed are:
fedpydeseq2_simulated
meta_analysis
pydeseq2_largest
pydeseq2
fedpydeseq2_remote
- Specifies the Gene Set Enrichment Analysis (GSEA) methods to be run. The methods listed are:
-
plots:
-
Specifies the plotting configurations for visualizing the results of the DGE and GSEA analyses.
-
cross_table:
- log2fc_threshold: Log2 fold change threshold for filtering results. Set to
2.0
. - padj_threshold: Adjusted p-value threshold for filtering results. Set to
0.05
. - method_pairs: Lists pairs of methods to be compared in the cross table. Examples include:
['fedpydeseq2_simulated', 'pydeseq2']
['fedpydeseq2_remote', 'pydeseq2']
['fedpydeseq2_simulated', 'fedpydeseq2_remote']
['meta_analysis', 'pydeseq2']
['pydeseq2_largest', 'pydeseq2']
- log2fc_threshold: Log2 fold change threshold for filtering results. Set to
-
gsea_plots:
- method_pairs: Lists pairs of methods to be compared in the GSEA plots. Examples include:
['fedpydeseq2_simulated', 'pydeseq2']
['fedpydeseq2_remote', 'pydeseq2']
['meta_analysis', 'pydeseq2']
['pydeseq2_largest', 'pydeseq2']
- plot_parameters: Specifies parameters for generating the GSEA plots. Examples include:
with_diff: true, padj_threshold: 0.05
with_diff: false, padj_threshold: 0.05
with_diff: true, padj_threshold: 0.05, with_pvalues: true
with_diff: false, padj_threshold: 0.05, with_pvalues: true
- method_pairs: Lists pairs of methods to be compared in the GSEA plots. Examples include:
-
You must configure the paths in the experiment_paths.yaml
in the repository. This specifies where you want to save the different experiment results, and where you want to store the remote experiment results if you want to run them. Note that these are not run using the run-dge-gsea
method, and must be run separately using the helper functions defined in the next section.
experiments:
single_factor:
results: "path/to/local/results"
remote_results: "path/to/remote/results"
remote_processed: "path/to/remote/processed"
remote_credentials: "path/to/remote/credentials"
multi_factor:
results: "path/to/local/results"
remote_results: "path/to/remote/results"
remote_processed: "path/to/remote/processed"
remote_credentials: "path/to/remote/credentials"
continuous_factor:
results: "path/to/local/results"
remote_results: "path/to/remote/results"
remote_processed: "path/to/remote/processed"
remote_credentials: "path/to/remote/credentials"
heterogeneity:
results: "path/to/local/results"
remote_results: "path/to/remote/results"
remote_processed: "path/to/remote/processed"
remote_credentials: "path/to/remote/credentials"
raw_data_path: "path/to/raw/data"
For a given experiment, the following directories will be created.
experiments[experiment_name][results]/
├── dge_results/
│ ├── pydeseq2/
│ │ ├── <experiment_id>/
│ │ └── artefacts/
│ ├── fedpydeseq2_simulated/
│ │ ├── <experiment_id>/
│ │ └── artefacts/
│ ├── meta_analysis/
│ │ ├── <experiment_id>/
│ │ └── artefacts/
│ ├── pydeseq2_largest/
│ │ ├── <experiment_id>/
│ │ └── artefacts/
├── gsea_results/
│ ├── pydeseq2/
│ │ └── <experiment_id>/
│ ├── fedpydeseq2_simulated/
│ │ └── <experiment_id>/
│ ├── meta_analysis/
│ │ └── <experiment_id>/
│ ├── pydeseq2_largest/
│ │ └── <experiment_id>/
│ └── fedpydeseq2_remote/
│ └── <experiment_id>/
├── figures/
│ ├── heterogeneity_plots/
│ ├── cross_tables/
│ └── gsea_plots/
We perform four main experiments in this repository, each configured with a file located in the experiment_specifications
directory at the root of the repository:
This experiment focuses on analyzing the effect of a single factor on gene expression. It uses the DESeq2 model to identify differentially expressed genes based on a single condition or factor. The configuration for this experiment is specified in the experiment_specifications/single_factor_specs.yaml
file. The datasets used include various TCGA datasets such as TCGA-LUAD, TCGA-LUSC, and others.
This experiment extends the single factor analysis by considering multiple factors simultaneously. It uses the DESeq2 model to identify differentially expressed genes while accounting for the combined effects of multiple conditions or factors. The configuration for this experiment is specified in the experiment_specifications/multi_factor_specs.yaml
file. Factors such as stage and gender are considered, and datasets include TCGA-LUAD, TCGA-LUSC, TCGA-READ, and others.
This experiment investigates the effect of continuous factors on gene expression. It uses the DESeq2 model to identify differentially expressed genes based on continuous variables. The configuration for this experiment is specified in the experiment_specifications/continuous_factor_specs.yaml
file. Continuous factors such as CPE are considered, and datasets include TCGA-LUAD, TCGA-LUSC, TCGA-READ, and others.
This experiment investigates the heterogeneity in gene expression across different centers or batches. It uses the DESeq2 model to identify differentially expressed genes while accounting for batch effects and other sources of heterogeneity. The configuration for this experiment is specified in the experiment_specifications/heterogeneity_specs.yaml
file. The heterogeneity method used is "binomial", and datasets include TCGA-NSCLC and TCGA-CRC.
Running Differential Gene Expression with different methods and Gene Set Enrichment Analysis (GSEA) on top
The dge_gsea_pipe.py
script is used to run Differential Gene Expression (DGE) for different methods to be compared and Gene Set Enrichment Analysis (GSEA) methods on top on specified datasets. This script is located in the fedpydeseq2-paper-experiments/paper_experiments/run_dge_gsea_methods/
directory.
NB: this does not concern the running of fedpydeseq2 on a remote server. The dedicated section can be found below.
The script performs the following tasks:
- Loads the configuration and paths files.
- Runs the specified DGE methods on the datasets (but not the remote fedpydeseq2).
- Runs GSEA on the results of the DGE methods (including the remote fedpydeseq2 results, if specified).
The script requires a configuration file that specifies the parameters for the DGE and GSEA methods. This configuration file is typically located in the experiment_specifications
directory and is named according to the experiment being run (e.g., single_factor_specs.yaml
, multi_factor_specs.yaml
).
Ensure that the experiment_paths.yaml
file is up to date. This file specifies where the raw data is stored and where to save the results for each experiment.
To run the run-dge-gsea
command, follow these steps:
-
Ensure Configuration Files are Ready:
- Prepare the experiment configuration file in the
experiment_specifications
directory. - Ensure the
experiment_paths.yaml
file is correctly set up.
- Prepare the experiment configuration file in the
-
Execute the Command:
- Use the following command to run the script, replacing
<experiment_name>
with the name of your experiment configuration file (without the.yaml
extension).
- Use the following command to run the script, replacing
run-dge-gsea --experiment_name <experiment_name>
This command will load the specified configuration, run the DGE methods on the datasets (excluding the remote fedpydeseq2), and perform GSEA on the results (including the remote fedpydeseq2 results, if specified).
To run one of the experiments remotely, you must fill in your experiment_paths.yaml
with the fields related to the remote setting. For example, for the heterogeneity experiment:
experiments:
heterogeneity:
results: ""
remote_results: "my path"
remote_processed: "path to processed data"
remote_credentials: "path to remote credentials"
raw_data_path: "important"
- results: Local path to save the results.
- remote_results: Remote path to save the results.
- remote_processed: Remote path to the processed data.
- remote_credentials: Path to the credentials for accessing remote resources.
The credentials are what allow you to access the Substra client (see Substra documentation).
After you fill in the configuration file, run the following command:
make-fedpydeseq2-remote-scripts --experiment_name <experiment_name>
- experiment_name: Must be a key in the
experiments
part of theexperiment_paths.yaml
file and have a corresponding<experiment_name>_specs.yaml
file in theexperiment_specifications
directory.
This command will create a bash script in the fedpydeseq2_remote_scripts
subfolder of this repository.
Here is an example of what the generated script might look like. Note that all explicit paths have been replaced with placeholders:
# You might need to run
# `chmod 777 /path/to/fedpydeseq2_remote_scripts/single_factor_remote.sh`
# to give execution rights to the script
for dataset in 'TCGA-LUAD' 'TCGA-LUSC' 'TCGA-READ' 'TCGA-COAD' 'TCGA-SKCM' 'TCGA-PAAD' 'TCGA-BRCA' 'TCGA-PRAD'
do
fedpydeseq2-tcga-pipe \
--register_data \
--backend "remote" \
--dataset_name "${dataset}" \
--keep_original_centers \
--save_filepath "/path/to/results" \
--raw_data_path "/path/to/raw_data" \
--processed_data_path "/path/to/processed_data" \
--credentials_path "/path/to/credentials.yaml" \
--design_factors "stage" \
--force_preprocessing
done
- Ensure you are in the Conda environment created during setup.
- Navigate to the
fedpydeseq2_remote_scripts
subfolder. - Execute the generated bash script.
cd fedpydeseq2_remote_scripts
bash <generated_script>.sh
This will run the fedpydeseq2
experiment remotely using the specified configuration.
The make-plots
command is configured to run the main
function from the plot_pipe
module located in the paper_experiments.figures
package.
Run the make-plots
command to generate the new plot:
make-plots --experiment_name your_experiment_name
-
Command Execution:
- When you run
make-plots
from the command line, it triggers themain
function in theplot_pipe
module.
- When you run
-
Loading Configuration:
- The
main
function typically starts by loading the experiment configuration file and the paths configuration file. These files contain the necessary parameters and paths for generating the plots.
- The
-
Generating Plots:
- The
main
function then processes the DGE and GSEA results based on the configurations specified in the YAML file. - It uses the plotting parameters defined in the
plots
section of the configuration file to generate the desired figures.
- The
-
Saving Plots:
- The generated plots are saved to the specified locations, as defined in the paths configuration file.
The cross_table
plot is designed to compare the results of different Differential Gene Expression (DGE) methods. Here's a detailed breakdown of how it works:
In the YAML configuration file, the cross_table
section specifies the parameters for generating the cross table plots:
plots:
cross_table:
log2fc_threshold: 2.0
padj_threshold: 0.05
method_pairs:
- ['fedpydeseq2_simulated', 'pydeseq2']
- ['fedpydeseq2_remote', 'pydeseq2']
- ['fedpydeseq2_simulated', 'fedpydeseq2_remote']
- ['meta_analysis', 'pydeseq2']
- ['pydeseq2_largest', 'pydeseq2']
- log2fc_threshold: Log2 fold change threshold for filtering results.
- padj_threshold: Adjusted p-value threshold for filtering results.
- method_pairs: List of method pairs to be compared in the cross table.
The run_plot_pipe
function in plot_pipe.py
handles the generation of the cross table plots:
- Load Configuration: Extracts the
cross_table
configuration from the YAML file. - Prepare Paths: Creates the directory for saving cross table plots.
- Filter Results: Filters the DGE results based on the specified thresholds.
- Generate Cross Tables: Compares the results of the specified method pairs and generates the cross tables.
- Save Cross Tables: Saves the generated cross tables to the specified path.
The build_test_vs_ref_cross_tables
function is responsible for generating the cross table plots. It:
- Loads results for the specified method pairs.
- Filters the results based on the log fold change and adjusted p-value thresholds.
- Creates the cross table by comparing the filtered results.
- Saves the cross table to the specified path.
The gsea_plots
section in the configuration file specifies the parameters for generating Gene Set Enrichment Analysis (GSEA) plots. Here's a detailed breakdown of how it works:
In the YAML configuration file, the gsea_plots
section specifies the parameters for generating the GSEA plots:
plots:
gsea_plots:
method_pairs:
- ['fedpydeseq2_simulated', 'pydeseq2']
- ['fedpydeseq2_remote', 'pydeseq2']
- ['meta_analysis', 'pydeseq2']
- ['pydeseq2_largest', 'pydeseq2']
plot_parameters:
- with_diff: true
padj_threshold: 0.05
- with_diff: false
padj_threshold: 0.05
- with_diff: true
padj_threshold: 0.05
with_pvalues: true
- with_diff: false
padj_threshold: 0.05
with_pvalues: true
- method_pairs: List of method pairs to be compared in the GSEA plots.
- plot_parameters: List of parameters for generating the GSEA plots, such as whether to include differential expression (
with_diff
) and the adjusted p-value threshold (padj_threshold
).
The run_plot_pipe
function in plot_pipe.py
handles the generation of the GSEA plots:
- Load Configuration: Extracts the
gsea_plots
configuration from the YAML file. - Prepare Paths: Creates the directory for saving GSEA plots.
- Check Methods: Ensures that the methods specified in
method_pairs
are included in the list of methods to run for GSEA. - Generate GSEA Plots: Calls the utility function to generate the GSEA plots for each method pair and set of plot parameters.
- Save GSEA Plots: Saves the generated GSEA plots to the specified path.
The make_gsea_plot_method_pairs
function is responsible for generating the GSEA plots. It:
- Load Results: Loads the GSEA results for the specified method pairs.
- Filter Results: Filters the results based on the specified plot parameters.
- Generate Plots: Creates the GSEA plots by comparing the filtered results of the method pairs.
- Save Plots: Saves the generated plots to the specified path.
The heterogeneity
plot is designed to analyze and visualize the effect of the heterogeneity of centers in the results of different Differential Gene Expression (DGE) methods. Here's a detailed breakdown of how it works:
In the YAML configuration file, the heterogeneity
section specifies the parameters for generating the heterogeneity plots:
plots:
heterogeneity:
method_test: ['fedpydeseq2_simulated', 'pydeseq2']
method_ref: 'pydeseq2'
scoring_function_names: ['scoring_function_1', 'scoring_function_2']
- method_test: List of methods to be tested for heterogeneity.
- method_ref: Reference method against which the test methods are compared.
- scoring_function_names: List of scoring functions used to evaluate heterogeneity.
The run_plot_pipe
function in plot_pipe.py
handles the generation of the heterogeneity plots:
- Load Configuration: Extracts the
heterogeneity
configuration from the YAML file. - Prepare Paths: Creates the directory for saving heterogeneity plots.
- Generate Heterogeneity Plots: Calls the utility function to generate the heterogeneity plots for each dataset and scoring function.
- Save Heterogeneity Plots: Saves the generated heterogeneity plots to the specified path.
The build_test_vs_ref_heterogeneity_plot
function is responsible for generating the heterogeneity plots. It:
- Load Results: Loads the DGE results for the test methods and the reference method.
- Score Methods: Uses the specified scoring functions to score the test methods vs the reference methods.
- Generate Plots: Creates the heterogeneity plots based on the calculated scores.
- Save Plots: Saves the generated plots to the specified path.
To run the heterogeneity experiment remotely, you need to configure the experiment_paths.yaml
file with the necessary paths and credentials:
experiments:
heterogeneity:
results: ""
remote_results: "my path"
remote_processed: "path to processed data"
remote_credentials: "path to remote credentials"
raw_data_path: "important"
- results: Path to save the local results.
- remote_results: Path to save the remote results.
- remote_processed: Path to the processed data on the remote server.
- remote_credentials: Path to the credentials for accessing the remote server.
- raw_data_path: Path to the raw data.
To add a new plot to the plot_pipe.py
script, follow these steps:
-
Define the New Plot Configuration:
- Add a new section in the configuration file (YAML) for the new plot.
-
Update the
run_plot_pipe
Function:- Add logic to handle the new plot configuration.
- Call the appropriate utility function to generate the new plot.
-
Create the Utility Function:
- Implement a new utility function to generate the plot.
-
Update the Documentation:
- Update the README.md file to include information about the new plot.
Add a new section in your YAML configuration file for the new plot. For example:
plots:
new_plot:
parameter1: value1
parameter2: value2
Modify the run_plot_pipe
function to handle the new plot configuration:
def run_plot_pipe(
config: dict,
paths: dict,
raw_data_path: str | Path,
) -> None:
...
plots_config = config["plots"]
...
# Generate new plot
if "new_plot" in plots_config:
new_plot_config = plots_config["new_plot"]
new_plot_path = plot_results_path / "new_plot"
new_plot_path.mkdir(parents=True, exist_ok=True)
# Call the utility function to generate the new plot
generate_new_plot(
config=new_plot_config,
save_path=new_plot_path,
dataset_names=dataset_names,
design_factors=design_factors,
continuous_factors=continuous_factors,
meta_analysis_parameters=meta_analysis_parameters,
**pydeseq2_kwargs,
)
Implement the utility function to generate the new plot:
def generate_new_plot(
config: dict,
save_path: Path,
dataset_names: list,
design_factors: list,
continuous_factors: list,
meta_analysis_parameters: dict,
**kwargs,
) -> None:
"""Generate the new plot.
Parameters
----------
config : dict
The configuration dictionary for the new plot.
save_path : Path
The path to save the new plot.
dataset_names : list
The list of dataset names.
design_factors : list
The list of design factors.
continuous_factors : list
The list of continuous factors.
meta_analysis_parameters : dict
The meta analysis parameters.
kwargs : dict
Additional keyword arguments.
"""
# Implement the logic to generate the new plot
...
Update the README.md file to include information about the new plot:
### New Plot
This section describes how to configure and generate the new plot.
#### Configuration
Add the following section to your YAML configuration file: