Multivariate analysis of variance test for gene set analysis.
Journal - Bioinformatics (Oxford, England)
MOTIVATION: Gene class testing (GCT) or gene set analysis (GSA) is a statistical approach to determine whether some functionally predefined sets of genes are differentially expressed under different experimental conditions. Shortcomings of Fisher's exact test for overrepresentation analysis are illustrated by an example. Most alternative GSA methods are developed for data collected from two experimental conditions, and most are based on a univariate gene-by-gene test statistic or assume independence among genes in the gene set. A multivariate analysis of variance (MANOVA) approach is proposed for studies with two or more experimental conditions.
RESULTS: When the number of genes in the gene set is greater than the number of samples, the sample covariance matrix is singular and ill-conditioned, and the use of standard multivariate methods can bias the analysis. The proposed MANOVA test uses a shrinkage estimator in place of the sample covariance matrix. The MANOVA test and six published GSA methods (principal component analysis, SAM-GS, analysis of covariance, Global, GSEA and MaxMean) are evaluated by simulation. The MANOVA test appears to perform best in terms of control of type I error and power under the models considered in the simulation. Several publicly available microarray datasets under two and three experimental conditions are analyzed to illustrate GSA. Most methods, except for GSEA and MaxMean, are generally comparable in power to identify significant gene sets.
AVAILABILITY: Free R code to perform the MANOVA test is available at http://mail.cmu.edu.tw/~catsai/research.htm.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ISSN: 1460-2059
Mesh Heading: Analysis of Variance; Computer Simulation; Gene Expression Profiling; Oligonucleotide Array Sequence Analysis; methods; statistics & numerical data
Mesh Heading Relevant: methods
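The key technical point of the abstract is that when the gene set has more genes than there are samples, the sample covariance matrix is singular, so classical MANOVA statistics cannot be computed directly; shrinking toward a well-conditioned target restores invertibility. The sketch below illustrates this in Python on simulated data. The fixed shrinkage intensity, the diagonal target, and the Hotelling-type statistic are illustrative assumptions only; the published method estimates the shrinkage intensity from the data, and the authors' own implementation is the R code at the URL above.

```python
# Illustrative sketch of the p > n covariance problem and a shrinkage fix.
# The shrinkage intensity and target here are assumptions, not the paper's choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                       # 10 samples, 50 genes in the gene set: p > n
X = rng.normal(size=(n, p))         # expression matrix (samples x genes)
groups = np.repeat([0, 1], n // 2)  # two experimental conditions

S = np.cov(X, rowvar=False)         # p x p sample covariance: rank <= n-1 < p, singular

lam = 0.5                           # shrinkage intensity (fixed here for illustration)
target = np.diag(np.diag(S))        # shrink toward the diagonal of S
S_shrunk = lam * target + (1 - lam) * S  # positive definite, hence invertible

# A Hotelling-type two-group statistic using the shrunk covariance; with the
# singular S this linear solve would be unstable or fail, with S_shrunk it is fine.
d = X[groups == 0].mean(axis=0) - X[groups == 1].mean(axis=0)
n1 = n2 = n // 2
T2 = (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S_shrunk, d)
```

The same shrunk covariance can feed Wilks-type statistics for more than two conditions, which is the setting the MANOVA test targets.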
Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data
Journal - BMC Bioinformatics
Background: Many researchers are concerned with the comparability and reliability of microarray gene expression data. The recent completion of the MicroArray Quality Control (MAQC) project provides a unique opportunity to assess reproducibility across multiple sites and comparability across multiple platforms. The analysis MAQC presented to support its conclusion of inter- and intra-platform comparability/reproducibility of microarray gene expression measurements is inadequate. We evaluate the reproducibility/comparability of the MAQC data for 12901 common genes in four titration samples generated from five high-density one-color microarray platforms and the TaqMan technology. We discuss problems with the use of the correlation coefficient as a metric for inter- and intra-platform reproducibility, and with the percent of overlapping genes (POG) as a measure for evaluating a gene selection procedure, as used by MAQC.
Results: A total of 293 arrays were used in the intra- and inter-platform analysis. A hierarchical cluster analysis shows distinct differences in the measured intensities among the five platforms. A number of genes show a small fold-change in one platform and a large fold-change in another, even though the correlations between platforms are high. An analysis of variance shows that thirty percent of the genes exhibit inconsistent expression patterns across the five platforms. We illustrate that POG does not reflect the accuracy of a selected gene list: a non-overlapping gene can be truly differentially expressed under a stringent cutoff, and an overlapping gene can be non-differentially expressed under a non-stringent cutoff. In addition, POG is unusable as a selection criterion: it can increase or decrease irregularly as the cutoff changes, and there is no criterion for choosing a cutoff that optimizes POG.
Conclusion: Using various statistical methods, we demonstrate that there are differences in the intensities measured by different platforms and by different sites within a platform. Within each platform, the patterns of expression are generally consistent, but there is site-to-site variability. Evaluation of data analysis methods for use in regulatory decisions should take the case of no treatment effect into consideration: when there is no treatment effect, "a fold-change cutoff with a non-stringent p-value cutoff" could result in 100% false-positive selections.
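Two of the abstract's claims, that platforms can correlate highly overall while individual genes show discordant fold-changes, and that POG depends on an arbitrary cutoff, can be reproduced on a small toy example. The data below are synthetic (not the MAQC data), and the POG definition used, overlap among the top-k genes ranked by absolute fold-change, is one plausible reading of the measure.

```python
# Toy illustration (synthetic data, not MAQC): high overall correlation can
# coexist with a gene-level discordance, and POG depends on the cutoff k.
import numpy as np

n_genes = 100
fc_a = np.linspace(-3.0, 3.0, n_genes)  # log fold-changes on platform A
fc_b = fc_a.copy()                      # platform B agrees on every gene...
fc_b[50] = fc_a[50] + 2.5               # ...except one: small FC in A, large in B

r = np.corrcoef(fc_a, fc_b)[0, 1]       # overall correlation remains high

def pog(a, b, k):
    """Percent of overlapping genes among the top-k ranked by |fold-change|."""
    top_a = set(np.argsort(-np.abs(a))[:k])
    top_b = set(np.argsort(-np.abs(b))[:k])
    return 100.0 * len(top_a & top_b) / k

# The reported overlap depends on the chosen cutoff k, and nothing in the
# measure itself says which k to use.
pog10, pog30 = pog(fc_a, fc_b, 10), pog(fc_a, fc_b, 30)
```

Despite a correlation near 0.99, gene 50 would be called unchanged on platform A and strongly changed on platform B, which is exactly the pattern the abstract warns the correlation coefficient hides.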