Data Preparation Protocol

Create empty directories for a new "version" and/or a new Set. The relevant versions are:
- version_7 : regular GSEA with Ratio_of_Classes ranking
  Note: This setting is biased towards the first phenotype when used with linear-scale expression values (e.g. GSEA(mutant_versus_wt) != GSEA(wt_versus_mutant)).
- version_8 : preRanked GSEA - all "Set_0x" folders use the moderated t-Statistic (details on how the t-Statistic was obtained can be found on the description page of each Set).
  Older efforts that used a score derived from q-values of a (limma) t-test or ANOVA now carry the suffix "_qValScore".
- version_9 : regular GSEA with log2_Ratio_of_Classes ranking (all other settings are identical to version_7, but this ranking does not show the bias noted there)
- version_10 : regular GSEA with tTest ranking (all other settings identical to version_9)
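The difference between the version_7 and version_9 rankings can be illustrated with a small arithmetic sketch. It is not GSEA's implementation; the function names and the class means are hypothetical and only show why a plain ratio is asymmetric under a swap of the two phenotypes, while a log2 ratio merely changes sign.

```python
import math

def ratio_of_classes(mean_a, mean_b):
    """Plain ratio of class means (linear-scale values)."""
    return mean_a / mean_b

def log2_ratio_of_classes(mean_a, mean_b):
    """log2 of the ratio of class means."""
    return math.log2(mean_a / mean_b)

# Hypothetical linear-scale class means: (mutant, wt)
genes = {"gene_up": (200.0, 100.0), "gene_down": (100.0, 200.0)}

for name, (mutant, wt) in genes.items():
    print(name,
          "ratio m/wt:", ratio_of_classes(mutant, wt),
          "ratio wt/m:", ratio_of_classes(wt, mutant),
          "log2 m/wt:", log2_ratio_of_classes(mutant, wt),
          "log2 wt/m:", log2_ratio_of_classes(wt, mutant))
```

A two-fold increase gives a ratio of 2 and a two-fold decrease a ratio of 0.5, so the two directions are not mirror images of each other on the ratio scale. On the log2 scale they are +1 and -1, and swapping the phenotypes only flips the sign.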
regular GSEA
- Expression Matrices
  - Affymetrix datasets:
    - convert CEL files to GCT format while performing RMA adjustment and converting from log2 to linear scale, with the following R code:
      http://baderlab.org/CancerStemCellProject/CodeRepository#cel2gct.R
    - create one expression matrix for each phenotype combination of interest
      (and a matching phenotype declaration "CLS" file)
    - collapse the expression matrices from Affy_IDs to Gene Symbols with the Chip Annotation file "Mouse430_2.chip" (available from the Broad Institute website), using the "max_probe" algorithm implemented in:
      http://www.baderlab.org/Software/EnrichmentMap/CollapseExpressionMatrix
  - Illumina datasets:
    - extract the already normalized expression values from "2008_GuidosLeukemiaData/03_T-ALL_Notch_GEP/ExpressionData/03_ATM_TALL_normalized_withq.csv" and "2008_GuidosLeukemiaData/04_DM progression_GEP/ExpressionData/04_normalized_withq_6gp.csv",
      and convert them from log2 to linear scale
    - create one expression matrix for each phenotype combination of interest
      (and a matching phenotype declaration "CLS" file)
    - collapse the expression matrices from Lumi_IDs to Gene Symbols with the Chip Annotation file "Illumina_MusRef8_v1_1.chip" (available from the Broad Institute website), using the "max_probe" algorithm implemented in:
      http://www.baderlab.org/Software/EnrichmentMap/CollapseExpressionMatrix
- copy all input files (Expression Matrix "GCT", phenotype declaration "CLS", Gene sets Collections "GMT") to a "gsea_data" folder.
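The conversion and file-writing steps above can be sketched as follows. This is a minimal Python illustration, not the cel2gct.R code referenced above; the function names, probe IDs, sample names and values are hypothetical. It assumes the usual tab-separated GCT 1.2 layout ("#1.2", a dimensions line, a header row, then one row per probe) and the three-line categorical CLS layout.

```python
def log2_to_linear(values):
    """Convert log2-scale expression values back to linear scale."""
    return [2.0 ** v for v in values]

def format_gct(sample_names, rows):
    """Build GCT text; rows is a list of (probe_id, description, values)."""
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description"] + list(sample_names)))
    for probe_id, description, values in rows:
        lines.append("\t".join([probe_id, description] + [f"{v:g}" for v in values]))
    return "\n".join(lines) + "\n"

def format_cls(labels):
    """Build CLS text from one phenotype label per sample (in sample order)."""
    classes = list(dict.fromkeys(labels))  # unique labels, first-seen order
    header = f"{len(labels)} {len(classes)} 1"
    return "\n".join([header, "# " + " ".join(classes), " ".join(labels)]) + "\n"

# Hypothetical example: two probes, two mutant and two wt samples
samples = ["mutant_1", "mutant_2", "wt_1", "wt_2"]
log2_rows = [("probe_A", "na", [8.0, 9.0, 7.0, 7.0]),
             ("probe_B", "na", [5.0, 5.0, 6.0, 6.0])]
linear_rows = [(p, d, log2_to_linear(v)) for p, d, v in log2_rows]

print(format_gct(samples, linear_rows))
print(format_cls(["mutant", "mutant", "wt", "wt"]))
```

The order of the labels in the CLS file has to match the column order of the GCT file, which is why both are derived from the same sample list here.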
preRanked GSEA
- Ranked Gene Lists with probe-set IDs and t-statistics were created for all phenotype combinations of a given data set, as described on the set's page, using the appropriate R code as documented on: http://baderlab.org/CancerStemCellProject/CodeRepository
- Expression Matrices with probe-set IDs and normalized linear expression values were taken from the regular GSEA analyses.
- Ranked Gene Lists and Expression Matrices were collapsed together with the tool available from http://www.baderlab.org/Software/EnrichmentMap/CollapseExpressionMatrix, choosing for both files the probe set with the largest absolute t-score (i.e. the maximum deviation from zero).
  For a phenotype combination "mutant_versus_wt", two ranked gene lists are created ("mutant_versus_wt" and "wt_versus_mutant") but only one expression matrix is needed. The files are therefore collapsed together only for the first case, and the additional ranked gene list is collapsed on its own.
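The joint collapse described above can be sketched in a few lines. This is not the CollapseExpressionMatrix tool itself; the function names, probe IDs, gene symbols and values are hypothetical, and dropping probes without an annotation is an assumption of this sketch.

```python
def pick_probes_by_max_abs_t(ranked, probe_to_gene):
    """For each gene keep the probe whose t-score deviates most from zero.

    ranked: dict probe_id -> t-score
    probe_to_gene: dict probe_id -> gene symbol (from the chip annotation)
    Returns dict gene -> (probe_id, t-score).
    """
    best = {}
    for probe, t in ranked.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # assumption: probes without annotation are dropped
        if gene not in best or abs(t) > abs(best[gene][1]):
            best[gene] = (probe, t)
    return best

def collapse_together(ranked, expression, probe_to_gene):
    """Collapse a ranked list and an expression matrix with the same probe choice."""
    best = pick_probes_by_max_abs_t(ranked, probe_to_gene)
    collapsed_rank = {gene: t for gene, (probe, t) in best.items()}
    collapsed_expr = {gene: expression[probe]
                      for gene, (probe, t) in best.items() if probe in expression}
    return collapsed_rank, collapsed_expr

# Hypothetical data: two probes map to GeneA, one to GeneB
probe_to_gene = {"p1": "GeneA", "p2": "GeneA", "p3": "GeneB"}
ranked = {"p1": 1.5, "p2": -3.2, "p3": 0.4}
expression = {"p1": [10.0, 12.0], "p2": [50.0, 20.0], "p3": [7.0, 8.0]}

rank, expr = collapse_together(ranked, expression, probe_to_gene)
print(rank)
print(expr)
```

Because the probe is chosen by absolute value, the selection is the same for "mutant_versus_wt" and "wt_versus_mutant" whenever the second list is the sign-flipped first one, which is consistent with needing only one collapsed expression matrix.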
Create shell scripts "SetX_run_gsea_YY.sh". Each of these scripts iterates over a selection of phenotype combinations for data set X and performs the GSEA analysis with a given gene-set collection YY, otherwise using the same settings.
See GSEA Settings for details.
Inside these shell scripts, several variables of the form ${VARIABLE_NAME} are used to define file names and labels. The existence of most of the required files is also checked, and the loop is aborted with an error message if something is not found.
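The actual scripts are shell scripts; the following Python sketch only illustrates the check-then-abort logic of such a loop. The function names, the file naming scheme (one ".gct" and one ".cls" per phenotype combination) and the `run_gsea` callback are hypothetical.

```python
import os

def missing_inputs(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]

def run_all(combinations, data_dir, gmt_file, run_gsea):
    """Run GSEA for each phenotype combination; stop at the first missing input."""
    completed = []
    for combo in combinations:
        required = [
            os.path.join(data_dir, combo + ".gct"),  # hypothetical naming scheme
            os.path.join(data_dir, combo + ".cls"),
            gmt_file,
        ]
        missing = missing_inputs(required)
        if missing:
            print("ERROR: missing input file(s): " + ", ".join(missing))
            break  # abort the loop, as the shell scripts do
        run_gsea(combo)
        completed.append(combo)
    return completed
```

Checking before each run means a single missing file stops the batch early with a clear message instead of producing a partial set of results.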
At the end of each GSEA run, the generated GSEA parameter file (RPT) is copied to the "enrichment_maps" folder. For preRanked GSEA, two additional (custom) parameters are automatically added to simplify the generation of the Enrichment Maps:
- param→phenotypes→mutant_versus_wt
  will populate the phenotype labels instead of just using na_pos and na_neg
- param→expressionMatrix→/.../version_8/Set_XX/gsea_data/gct/Set_XX_mutant_versus_wt.gct
  will populate the Expressions-Box with the Expression-Matrix (GCT file) instead of the Ranked-Gene-List.
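Appending the two custom parameters can be sketched as below. This assumes that the arrows in the entries above stand for tab characters and that RPT parameter lines are tab-separated ("param", name, value); the function name and the sample RPT content are hypothetical, and the elided path is kept as written above.

```python
def add_custom_params(rpt_text, phenotypes, expression_matrix):
    """Append the two custom 'param' lines to the text of an RPT file."""
    lines = rpt_text.splitlines()
    # Assumption: fields are separated by tabs, matching the arrows in the text
    lines.append("\t".join(["param", "phenotypes", phenotypes]))
    lines.append("\t".join(["param", "expressionMatrix", expression_matrix]))
    return "\n".join(lines) + "\n"

# Hypothetical one-line RPT content for illustration only
original = "param\tcollapse\tfalse\n"
updated = add_custom_params(
    original,
    "mutant_versus_wt",
    "/.../version_8/Set_XX/gsea_data/gct/Set_XX_mutant_versus_wt.gct",
)
print(updated)
```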