- coad_clinical.csv: Clinical information for colon tumors from the TCGA-COAD dataset.
- read_clinical.csv: Clinical information for rectal tumors from the TCGA-READ dataset.
These files contain clinical data that we will use to categorize tumors for further analysis.
The two clinical files share similar structures and contain the following important columns:
- Tumor_Sample_Barcode: A unique identifier for each tumor sample.
- MSI_TMB: A combined column that encodes both the MSI (microsatellite instability) status and the TMB (tumor mutational burden) category. It assigns each tumor to one of three groups:
  - MSI-H (microsatellite instability-high)
  - MSS/TMB-H (microsatellite stable with high tumor mutational burden)
  - MSS/TMB-L (microsatellite stable with low tumor mutational burden)
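As a minimal sketch of how the two clinical tables might be read and summarized by group, the snippet below uses Python's standard csv module on small in-memory stand-ins for coad_clinical.csv and read_clinical.csv. The barcodes and rows are invented for illustration, and the exact label strings stored in MSI_TMB are assumed to match the three group names listed above. The helper name load_clinical is hypothetical.

```python
import csv
import io
from collections import Counter

# Toy stand-ins for the real files; barcodes and rows are invented for illustration.
COAD_TOY = """Tumor_Sample_Barcode,MSI_TMB
TOY-COAD-01,MSI-H
TOY-COAD-02,MSS/TMB-L
TOY-COAD-03,MSS/TMB-H
"""
READ_TOY = """Tumor_Sample_Barcode,MSI_TMB
TOY-READ-01,MSS/TMB-L
TOY-READ-02,MSS/TMB-L
"""

def load_clinical(handle, cohort):
    """Return one dict per sample: barcode, MSI_TMB group, and cohort label."""
    return [
        {
            "barcode": row["Tumor_Sample_Barcode"],
            "group": row["MSI_TMB"],
            "cohort": cohort,
        }
        for row in csv.DictReader(handle)
    ]

# With the real data, pass open("coad_clinical.csv", newline="") instead of StringIO.
samples = (
    load_clinical(io.StringIO(COAD_TOY), "COAD")
    + load_clinical(io.StringIO(READ_TOY), "READ")
)

# Count how many tumors fall into each (cohort, MSI_TMB group) pair.
group_counts = Counter((s["cohort"], s["group"]) for s in samples)
for (cohort, group), n in sorted(group_counts.items()):
    print(cohort, group, n)
```

Keeping the cohort label next to each sample makes it easy to analyze the two cohorts together or separately later on.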
You can download these files from Google Drive.
In the gene_expression folder, we have gene expression data for both the TCGA-COAD and TCGA-READ cohorts, each located in their respective subfolders:
- coad/: Contains gene expression data for colon tumors.
- read/: Contains gene expression data for rectal tumors.
Each subfolder includes several files with different types of normalization:
- counts.csv: Contains the raw gene expression counts for each sample.
- fpkm.csv: Contains gene expression data normalized using FPKM (Fragments Per Kilobase of transcript per Million mapped reads).
- fpkm_uq.csv: Contains Upper Quartile (UQ) FPKM normalized data, which adjusts for sequencing depth and variability across samples.
- tmm.csv: Contains gene expression data normalized using the TMM (Trimmed Mean of M-values) method, often used to correct for library size differences.
- tpm.csv: Contains gene expression data normalized by TPM (Transcripts Per Million), a method that accounts for gene length and sequencing depth.
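To make the difference between the raw counts and two of these normalizations concrete, here is a small worked sketch of the textbook FPKM and TPM definitions for a single sample. The counts and gene lengths are invented toy values, and the provided files may have been produced by a pipeline whose details differ. FPKM-UQ and TMM are left out because they depend on choices (reference quantile, trimming, reference sample) that this section does not specify.

```python
def fpkm(counts, lengths_bp):
    """FPKM for one sample: count * 1e9 / (gene length in bp * total counts)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """TPM for one sample: length-normalized rates rescaled to sum to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Toy values for three genes in one sample (invented for illustration).
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

for gene, f, t in zip(["geneA", "geneB", "geneC"], fpkm_values, tpm_values):
    print(f"{gene}: FPKM={f:.1f} TPM={t:.1f}")

# TPM values always sum to one million within a sample; FPKM values generally do not.
print(f"sum of TPM = {sum(tpm_values):.1f}")
```

Because TPM sums to the same total in every sample, relative abundances are easier to compare across samples than with FPKM.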