How to prepare data for MADS+
To start from CEL files, MADS+ requires the gene expression indexes and the probe intensities that GeneBASE calculates from the CEL files.
If the CEL files are not in GCOS text format, use Affymetrix Power Tools to convert them into GCOS text format.
apt-cel-convert -f text -o output_folder_text_cels *.CEL
Run GeneBASE to do background-correction, normalization and gene expression index calculation on CEL files.
ProbeEffects -par parameter.txt
Perform background correction and normalization of the CEL files. An example of parameter.txt is available here.
python hjay_reformat.py folder_fitted_cel_files
Reformat the ProbeEffects output for the junction array.
mkdir out
Create a folder for ProbeSelect output.
python ProbeSelect.py folder_reformatted_cel_files -r T -t F
Select probes for gene expression index calculation.
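The preparation steps above can be collected into a small dry-run script. This is only a sketch: the folder names are the placeholders used in the commands above, the function name preparation_commands and its arguments are hypothetical, and it assumes that the tools are on the PATH and that the scripts sit in the working directory. It prints the commands instead of running them, so each one can be checked first.

```python
import glob
import os
import shlex


def preparation_commands(cel_dir=".", convert=True):
    """Return the preparation commands, in order, as argument lists.

    cel_dir and convert are hypothetical helper arguments; folder names
    are the placeholders from the commands in this document.
    """
    commands = []
    if convert:
        # Only needed when the CEL files are not already in GCOS text format.
        cel_files = sorted(glob.glob(os.path.join(cel_dir, "*.CEL")))
        commands.append(
            ["apt-cel-convert", "-f", "text", "-o", "output_folder_text_cels", *cel_files]
        )
    commands.extend([
        # Background correction and normalization with GeneBASE.
        ["ProbeEffects", "-par", "parameter.txt"],
        # Reformat the ProbeEffects output for the junction array.
        ["python", "hjay_reformat.py", "folder_fitted_cel_files"],
        # Folder for the ProbeSelect output.
        ["mkdir", "out"],
        # Select probes for gene expression index calculation.
        ["python", "ProbeSelect.py", "folder_reformatted_cel_files", "-r", "T", "-t", "F"],
    ])
    return commands


if __name__ == "__main__":
    # Dry run: print each command as it would be typed in a shell.
    for command in preparation_commands():
        print(shlex.join(command))
```

To execute the steps instead of printing them, each argument list could be passed to subprocess.run(command, check=True), which stops at the first failing step.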
python MADS_plus.py gene_expression sample_info folder_of_probe_intensity HJAY_all_ps_location.txt HJAY_tc_annotated_location.txt CassetteExons 35alterExons MXExons output.txt -p -g -f -x
gene_expression file contains the gene expression indexes from GeneBASE "summary.selected". Example
sample_info file classifies the samples into two groups. Example
folder_of_probe_intensity is the folder of probe intensities from GeneBASE "out" folder. Example
HJAY_all_ps_location.txt and HJAY_tc_annotated_location.txt are in the HJAY library and annotation files.
CassetteExons, 35alterExons and MXExons are annotation files for the cassette exons, alternative 3'/5' splice sites and mutually exclusive exons. They are also available in the HJAY library and annotation files.
-p is the p value cutoff for each probeset. A probeset whose p value is smaller than this cutoff (default -p 0.01) is considered significant.
-g is the gene expression cutoff. Lowly expressed genes, whose average expression index is below the cutoff (default -g 500), are removed from downstream analysis.
-f is the gene expression fold change cutoff. Differentially expressed genes, whose expression index fold change between the sample groups is higher than the cutoff (default -f 2.0), are removed from downstream analysis.
-x is the probe extreme value cutoff. Probes with extremely high intensities, that is, intensities larger than a given fraction (default -x 0.95) of the probe intensities of all other core probes, are removed from downstream analysis.
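The -p, -g and -f cutoffs described above can be illustrated with a short sketch. It is not the MADS+ implementation: the function names are hypothetical, the average is assumed to be taken over all samples, and the fold change is assumed to be the ratio of the larger group mean to the smaller one. The -x cutoff is left out because its exact definition is not given here.

```python
from statistics import mean


def is_significant(p_value, p_cutoff=0.01):
    """A probeset counts as significant when its p value is below the cutoff (-p)."""
    return p_value < p_cutoff


def passes_gene_filters(group1, group2, min_expression=500.0, max_fold_change=2.0):
    """Return True if a gene is kept for downstream analysis.

    group1 and group2 hold the gene expression indexes of the two sample
    groups. Assumptions: the -g average is over all samples, and the -f
    fold change is the ratio of the larger group mean to the smaller one.
    """
    all_values = list(group1) + list(group2)
    if mean(all_values) < min_expression:
        return False  # lowly expressed gene (-g)
    low, high = sorted([mean(group1), mean(group2)])
    if low <= 0:
        return False  # fold change is undefined or unbounded
    return high / low <= max_fold_change  # drop genes above the -f cutoff


# Well expressed, similar in both groups: kept.
print(passes_gene_filters([600, 650, 620, 610], [640, 600, 630, 615]))
# Average expression below 500: removed.
print(passes_gene_filters([100, 120], [110, 130]))
# Threefold change between the groups: removed.
print(passes_gene_filters([500, 500], [1500, 1500]))
```

Genes that fail either filter are excluded so that splicing changes are not confounded with low signal or with large changes in overall gene expression.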
The order of samples must be the same in the gene expression file, the sample info file and the probe intensity files.
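A quick check of the sample order before running MADS+ might look like the sketch below. The function name check_sample_order is hypothetical, and the sample names are assumed to have been read from the three inputs already, since how to extract them depends on the file formats, which are not described here.

```python
def check_sample_order(expression_samples, info_samples, intensity_samples):
    """Raise ValueError unless all three inputs list the samples in the same order.

    Each argument is a sequence of sample names in the order in which they
    appear in the corresponding input (reading them is not shown).
    """
    reference = list(expression_samples)
    others = {
        "sample info file": list(info_samples),
        "probe intensity files": list(intensity_samples),
    }
    for source, samples in others.items():
        if samples != reference:
            raise ValueError(
                f"Sample order in the {source} {samples} differs from "
                f"the gene expression file {reference}"
            )
    return True


# Hypothetical sample names, used only for illustration.
samples = ["control_1", "control_2", "treated_1", "treated_2"]
check_sample_order(samples, samples, samples)
```

Comparing full lists rather than sets catches samples that are present in every input but listed in a different order.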
For sample data, the following command line should be run in the folder containing MADS_plus.py to get the results:
python MADS_plus.py gene_expression.xls sample.info out/ HJAY_all_ps_location.txt HJAY_tc_annotated_location.txt CassetteExons 35alterExons MXExons output.txt -p 0.01 -g 500 -f 2.0 -x 0.95
On a Linux system with an AMD Quad-Core 2350 and 16 GB of memory, our MADS+ analysis of the ESRP data set (4 replicates per condition for a total of 8 arrays; 549,464 probesets from 17,465 genes) takes ~453 minutes of CPU time.
HJAY_Plot_cassette/alt35/mx.pdf contains the graphs of alternatively spliced cassette exons, alternative 3'/5' splice sites and mutually exclusive exons. Each figure includes the p values and directions of splicing changes. The red lines in the figure represent the gene expression indexes. The blue lines in the figure represent the probe intensities.
For cassette exons,
up=upstream 'include' exon-exon junction probeset, down=downstream 'include' exon-exon junction probeset, exon=exon probeset and skip='skip' exon-exon junction probeset.
For alternative 3'/5' splice sites,
shorter_juc=exon-exon junction probeset of shorter isoform, longer_juc=exon-exon junction probeset of longer isoform, shorter_ex=exon probeset of shorter isoform and longer_ex=exon probeset of longer isoform.
For mutually exclusive exons,
up_juc, down_juc and exon_1 in the first row are the upstream 'include' exon-exon junction probeset, the downstream 'include' exon-exon junction probeset and the exon probeset for the first exon. The up_juc, down_juc and exon_2 in the second row are the corresponding probesets for the second exon.
HJAY_excel_cassette/alt35/mx.txt contains the gene names, coordinates and p values of each exon.