Gene Set Enrichment Analysis (GSEA) Details

This folder contains intermediate data files required for the GSEA analysis. If either the FPKM values or the phenotype data sets change, please re-run the following script to update the expression and phenotype files:

$ Rscript ./input/GSEA_Aggregate_Expression_Data.R input/expression_fpkm.txt > input/all_phenotype.cls

The above script:

  1. Reads the cohort metadata and the FPKM values for each patient in the discovery set;
  2. Collapses the data by gene symbol (taking the median when a sample has multiple data points for the same symbol);
  3. Removes low-quality data points (FPKM_Status == FAIL or FPKM_cinterval == 0);
  4. Removes non-informative genes that do not vary across samples (genes with standard deviation == 0);
  5. Writes a single matrix of FPKM values, with genes as rows and samples as columns, in a GSEA-compatible format;
  6. Outputs the phenotype information required for GSEA from the metadata (benefit vs no_benefit, tissue type, and post- vs pre-treatment biopsy).
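
As a rough illustration of steps 2 to 5 above, the following Python sketch works on a handful of in-memory records rather than the real input files. It is not the R script itself. The record field names (gene, sample, fpkm, status, cinterval), the helper names, the choice to drop low-quality points before taking the median, the skipping of genes that lose a sample to filtering, and the use of the population standard deviation are all assumptions made for the example. The output layout follows GSEA's tab-delimited TXT expression format (NAME and DESCRIPTION columns followed by one column per sample), which is assumed to be what "GSEA-compatible" means here.

```python
from collections import defaultdict
from statistics import median, pstdev


def build_matrix(records, samples):
    """Return {gene: [value per sample]} from per-measurement records.

    Each record is a dict with the hypothetical keys
    gene, sample, fpkm, status and cinterval.
    """
    # Step 3 (applied first here, which is an assumption about ordering):
    # drop low-quality data points.
    kept = [r for r in records
            if r["status"] != "FAIL" and r["cinterval"] != 0]

    # Step 2: collapse by gene symbol, taking the median per (gene, sample).
    grouped = defaultdict(list)
    for r in kept:
        grouped[(r["gene"], r["sample"])].append(r["fpkm"])
    collapsed = {key: median(vals) for key, vals in grouped.items()}

    matrix = {}
    for gene in sorted({g for g, _ in collapsed}):
        row = [collapsed.get((gene, s)) for s in samples]
        # Assumption: a gene with no usable value in some sample is skipped.
        if None in row:
            continue
        # Step 4: drop genes that do not vary across samples.
        if pstdev(row) == 0:
            continue
        matrix[gene] = row
    return matrix


def to_gsea_txt(matrix, samples):
    """Step 5: render the matrix as tab-delimited text, genes as rows."""
    lines = ["\t".join(["NAME", "DESCRIPTION"] + list(samples))]
    for gene, row in matrix.items():
        lines.append("\t".join([gene, "na"] + [f"{v:g}" for v in row]))
    return "\n".join(lines) + "\n"


def rec(gene, sample, fpkm, status="OK", cinterval=0.5):
    """Small constructor for toy records."""
    return {"gene": gene, "sample": sample, "fpkm": fpkm,
            "status": status, "cinterval": cinterval}


toy_samples = ["S1", "S2"]
toy_records = [
    rec("GENE_A", "S1", 1.0), rec("GENE_A", "S1", 3.0), rec("GENE_A", "S2", 5.0),
    rec("GENE_B", "S1", 4.0), rec("GENE_B", "S2", 4.0),            # no variation
    rec("GENE_C", "S1", 2.0, status="FAIL"), rec("GENE_C", "S2", 3.0),
    rec("GENE_D", "S1", 1.0, cinterval=0), rec("GENE_D", "S2", 2.0),
]
toy_matrix = build_matrix(toy_records, toy_samples)
print(to_gsea_txt(toy_matrix, toy_samples), end="")
```

With the toy records, only GENE_A survives: GENE_B has no variation, and GENE_C and GENE_D each lose their S1 value to the quality filter.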

To run the GSEA analysis:

  1. Download the GSEA software from the Broad Institute;
  2. Load analysis/GSEA/expression_fpkm.txt and any of the analysis/GSEA/*_phenotype.cls into the software;
  3. Load either the Hallmark (fewer gene sets; h.all.v5.0.symbols.gmt) or the Reactome (more gene sets; c2.cp.reactome.v5.0.symbols.gmt) gene set collection;
  4. Load GENE_SYMBOL.chip as your chip file;
  5. Run the analysis, making sure the permutation type is set to gene_set rather than phenotype (since we have very few samples).
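
Before loading the files in step 2, it can be worth confirming that a phenotype file carries exactly one label per sample column of the expression file. The sketch below is a hypothetical helper, not part of this folder. It assumes the categorical CLS layout (a first line with the sample count, the class count and the number 1; a second line starting with # that lists the class names; a third line with one label per sample) and an expression header that begins with NAME and DESCRIPTION columns.

```python
def parse_cls(text):
    """Parse categorical CLS text and return (class_names, labels)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("expected 3 non-empty lines in a categorical CLS file")

    # Line 1: <number of samples> <number of classes> 1
    n_samples, n_classes, _ = (int(tok) for tok in lines[0].split())

    # Line 2: '#' followed by the class names
    if not lines[1].startswith("#"):
        raise ValueError("second line must start with '#'")
    class_names = lines[1].lstrip("#").split()

    # Line 3: one label per sample, in column order
    labels = lines[2].split()

    if len(class_names) != n_classes:
        raise ValueError("class count does not match the header line")
    if len(labels) != n_samples:
        raise ValueError("label count does not match the header line")
    if len(set(labels)) != n_classes:
        raise ValueError("distinct labels do not match the class count")
    return class_names, labels


def check_match(expression_header, cls_text):
    """Raise ValueError unless the CLS has one label per sample column."""
    # Assumption: the first two columns are NAME and DESCRIPTION.
    sample_columns = expression_header.rstrip("\n").split("\t")[2:]
    _, labels = parse_cls(cls_text)
    if len(labels) != len(sample_columns):
        raise ValueError(
            f"{len(sample_columns)} sample columns but {len(labels)} labels")
    return True


# Toy inputs standing in for the real files.
toy_header = "NAME\tDESCRIPTION\tS1\tS2\tS3\n"
toy_cls = "3 2 1\n# benefit no_benefit\nbenefit no_benefit benefit\n"
print(check_match(toy_header, toy_cls))
```

In practice the header would be the first line of analysis/GSEA/expression_fpkm.txt and the CLS text the full contents of one of the *_phenotype.cls files.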

Results

Analysis | Link to Results | Description
-------- | --------------- | -----------
1        | results         | Uses FPKM values and runs only on the Hallmark gene sets; exploratory.
2        | results         | Stratifies patients into more categories for better comparisons.
3        | results         | Includes patients from the validation set and expands the analysis to compare patient groups that contain at least 3 patients. Also includes a leading edge analysis to assess the degree of overlap across enriched gene sets for the benefit vs no_benefit comparison.