create empty directories for a new "version" and/or a new Set. The relevant versions are:
version_7 : regular GSEA with Ratio_of_Classes ranking
Note: This setting is biased towards the first phenotype when used with linear-scale expression values (e.g. GSEA(mutant_versus_wt) !=
GSEA(wt_versus_mutant)).
version_8 : preRanked GSEA - all "Set_0x" folders use the moderated t-statistic (details on how the t-statistic was obtained can
be found on the description page of each Set).
Older efforts that used a score derived from q-values of a (limma) t-test or ANOVA now carry the suffix "_qValScore".
version_9 : regular GSEA with log2_Ratio_of_Classes ranking (all other settings are identical to version_7,
but the results do not show the bias mentioned above)
version_10 : regular GSEA with tTest ranking (all other settings identical to version_9)
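The difference between version_7 and version_9 can be illustrated with a minimal sketch. The gene names and class means below are made up, and the two functions are only assumed stand-ins for the ranking metrics, not GSEA's own code. On linear-scale values the plain ratio gives scores of very different magnitude depending on which phenotype comes first, while the log2 ratio only changes sign:

```python
import math

def ratio_of_classes(mean_a, mean_b):
    """Plain ratio of the class means (linear-scale input assumed)."""
    return mean_a / mean_b

def log2_ratio_of_classes(mean_a, mean_b):
    """log2 of the ratio of the class means."""
    return math.log2(mean_a / mean_b)

# Hypothetical linear-scale class means: (mutant, wt)
genes = {"geneUp": (400.0, 100.0), "geneDown": (100.0, 400.0)}

for name, (mutant, wt) in genes.items():
    print(
        name,
        "ratio m/wt:", ratio_of_classes(mutant, wt),
        "ratio wt/m:", ratio_of_classes(wt, mutant),
        "log2 m/wt:", log2_ratio_of_classes(mutant, wt),
        "log2 wt/m:", log2_ratio_of_classes(wt, mutant),
    )
```

A 4-fold change scores 4 in one direction and 0.25 in the other with the plain ratio, but +2 and -2 with the log2 ratio, so swapping the phenotypes mirrors the log2 scores exactly.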
extract the already normalized expression values from
"2008_GuidosLeukemiaData/03_T-ALL_Notch_GEP/ExpressionData/03_ATM_TALL_normalized_withq.csv" or
"2008_GuidosLeukemiaData/04_DM progression_GEP/ExpressionData/04_normalized_withq_6gp.csv", respectively,
and convert them from log2 to linear scale
create one expression matrix for each phenotype-combination of interest
(and matching phenotype declaration "CLS" files)
collapse the expression matrices from Lumi_IDs to Gene Symbols with the "max_probe" algorithm implemented in: http://www.baderlab.org/Software/EnrichmentMap/CollapseExpressionMatrix
(the Chip Annotation file "Illumina_MusRef8_v1_1.chip" used for this is available from the Broad Institute website)
copy all input files (Expression Matrix "GCT", phenotype declaration "CLS", Gene Set Collections "GMT") to a "gsea_data" folder.
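Two of the preparation steps above can be sketched in a few lines. This is only an illustration under assumptions: the function names are hypothetical, the sample labels are invented, and the CLS layout follows the categorical format as generally documented for GSEA (a counts line, a "#" line with the class names, and one label per sample).

```python
def log2_to_linear(log2_values):
    """Convert log2-scale expression values back to linear scale."""
    return [2.0 ** v for v in log2_values]

def make_cls(sample_labels):
    """Build the text of a categorical CLS file from one label per sample.

    Classes are listed in order of first appearance, which should match
    the column order of the expression matrix.
    """
    classes = list(dict.fromkeys(sample_labels))
    header = f"{len(sample_labels)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    assignment = " ".join(sample_labels)
    return "\n".join([header, names, assignment]) + "\n"

# Hypothetical example: three samples, two phenotypes
print(log2_to_linear([0.0, 1.0, 3.0]))
print(make_cls(["mutant", "mutant", "wt"]), end="")
```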
preRanked GSEA
Ranked Gene Lists with probeset_IDs and t-statistics were created for all phenotype combinations of a given data-set, as described on
the set's page, using the R code documented at: http://baderlab.org/CancerStemCellProject/CodeRepository
Expression Matrices with probeset_IDs and normalized linear expression values were taken from the regular GSEA analyses.
Ranked Gene Lists and Expression Matrices were collapsed together with the tool available from http://www.baderlab.org/Software/EnrichmentMap/CollapseExpressionMatrix,
selecting for both files the probe-set with the largest absolute t-score (= the maximum deviation from zero).
Because a phenotype-combination "mutant_versus_wt" yields two ranked gene-lists ("mutant_versus_wt" and
"wt_versus_mutant") but needs only one expression matrix, the files are collapsed together only for the first list; the
additional ranked gene-list is collapsed on its own.
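The joint collapsing rule can be sketched as follows. This is not the code of the CollapseExpressionMatrix tool, only a minimal illustration of the selection rule described above; the function name, probe IDs, gene symbols and values are all hypothetical.

```python
def collapse_by_max_abs_t(t_scores, probe_to_gene, expression):
    """Collapse probe-level data to gene level.

    For each gene, keep the probe with the largest absolute t-score and
    use that same probe for both the ranked list and the expression matrix.
    """
    best = {}  # gene symbol -> chosen probe ID
    for probe, t in t_scores.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe without annotation is dropped
        if gene not in best or abs(t) > abs(t_scores[best[gene]]):
            best[gene] = probe
    ranked = {gene: t_scores[probe] for gene, probe in best.items()}
    matrix = {gene: expression[probe] for gene, probe in best.items()
              if probe in expression}
    return ranked, matrix

# Hypothetical input: two probes map to GeneA, one to GeneB
t_scores = {"p1": 1.5, "p2": -3.0, "p3": 0.7}
probe_to_gene = {"p1": "GeneA", "p2": "GeneA", "p3": "GeneB"}
expression = {"p1": [10.0, 12.0], "p2": [5.0, 20.0], "p3": [7.0, 7.5]}

ranked, matrix = collapse_by_max_abs_t(t_scores, probe_to_gene, expression)
print(ranked)
print(matrix)
```

For GeneA the probe "p2" is kept because |-3.0| is larger than |1.5|, and its expression row is the one carried into the collapsed matrix.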
create Shell-scripts "SetX_run_gsea_YY.sh" - each of these scripts iterates over a selection of phenotype-combinations for data-set
X and performs the GSEA analysis with a given gene-set collection YY, otherwise using the same settings.
See GSEA Settings for details. Inside these Shell-scripts, several variables of the form ${VARIABLE_NAME} are used to define file names and labels. The scripts also check that most of the
necessary files exist and abort the loop with an error message if something is not found.
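The actual scripts are Shell-scripts and are not reproduced here. Purely as an illustration of the "check the inputs, abort on the first problem" pattern, here is a minimal Python sketch; the function name and the file names in the usage example are hypothetical.

```python
import os

def check_inputs(paths):
    """Raise FileNotFoundError naming every required file that is missing."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("missing input file(s): " + ", ".join(missing))

# Hypothetical usage inside the loop over phenotype-combinations
for combination in ["mutant_versus_wt"]:
    needed = [
        os.path.join("gsea_data", combination + ".gct"),
        os.path.join("gsea_data", combination + ".cls"),
    ]
    try:
        check_inputs(needed)
    except FileNotFoundError as err:
        print("aborting:", err)
        break
    # the GSEA run for this combination would be started here
```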
At the end of each GSEA run the generated GSEA parameter file (RPT) is copied to the "enrichment_maps" folder. In case of
preRanked GSEA two additional (custom) parameters are automatically added to simplify the generation of the Enrichment Maps:
param→phenotypes→mutant_versus_wt
will populate the phenotype labels instead of just using na_pos and na_neg
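Adding such a custom parameter amounts to appending one line to the copied RPT file. The sketch below assumes that the arrows shown above stand for tab characters and that the RPT file is plain text with one "param" entry per line; the function name and the pre-existing example line are hypothetical.

```python
def add_custom_params(rpt_text, params):
    """Return the RPT text with one tab-separated 'param' line per entry appended."""
    lines = rpt_text.rstrip("\n").split("\n") if rpt_text.strip() else []
    for key, value in params.items():
        lines.append(f"param\t{key}\t{value}")
    return "\n".join(lines) + "\n"

# Hypothetical RPT content with a single existing parameter line
original = "param\tcollapse\tfalse\n"
print(add_custom_params(original, {"phenotypes": "mutant_versus_wt"}), end="")
```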