Normalization & Sources of Variability in Genomic Data
Developing statistical tools for the analysis of genomic data requires an understanding of the nature of the biological and technological variation within these data. Microarray technologies, for example, can measure the relative abundance of RNA transcripts within a sample of cells for tens of thousands of genes simultaneously. Microarrays thus measure the relative degree to which an average cell in the sampled population is making RNA copies of each gene being studied. Because the relative abundance of RNA transcripts reflects an intermediate step in the process of gene expression, it serves as an indirect measure of the relative expression of genes as proteins. The measure is indirect because not all RNA transcripts are translated into proteins at the same rate. Both the indirect nature of microarray data as measurements of gene expression and the fact that measured values correspond to population averages over as many as tens of millions of cells mean that biology itself introduces important variability into microarray experiments. Any statistical model based on microarray data should take this biological variability into account.
In addition to the variability associated with the biological nature of genomic data, the technologies used to gather genomic data introduce variation into measured values. The efficiency of dye labeling and the degree of hybridization between the labeled sample and the probes on the array vary from sample to sample, and inconsistencies in chip manufacture result in measurements of differing quality. Although one can hope that these sources of inconsistency between array measurements will be minimized as the underlying technology improves, for the present these "measurement errors" remain an important source of variability in microarray data. Like the biological variation in microarray data, these technological variations should be accounted for when building statistical models.
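To make the two sources of variability concrete, the following is a minimal sketch, not a method taken from the text above. It assumes an additive model on the log scale in which each measurement is a gene's true level plus biological noise, an array-wide technical shift, and per-spot measurement error. It then applies median centering, one simple normalization chosen here only for illustration. All function names, parameter names, and numeric values are hypothetical.

```python
import random
import statistics


def simulate_arrays(n_genes=200, n_arrays=4, bio_sd=0.3, tech_sd=0.2, seed=0):
    """Simulate log-scale intensities under an assumed additive model.

    measured = true level + biological noise + array shift + spot noise
    Every parameter value here is illustrative, not estimated from real data.
    """
    rng = random.Random(seed)
    # Hypothetical true log-abundance of each gene's transcript.
    true_level = [rng.gauss(8.0, 1.5) for _ in range(n_genes)]
    # One technical offset per array (e.g. labeling or hybridization efficiency).
    array_shift = [rng.gauss(0.0, 1.0) for _ in range(n_arrays)]
    data = []
    for a in range(n_arrays):
        column = [
            true_level[g]
            + rng.gauss(0.0, bio_sd)    # biological variation between samples
            + array_shift[a]            # array-wide technical effect
            + rng.gauss(0.0, tech_sd)   # per-spot measurement error
            for g in range(n_genes)
        ]
        data.append(column)
    return data


def median_center(data):
    """Subtract each array's median so that all arrays share a median of zero."""
    centered = []
    for column in data:
        m = statistics.median(column)
        centered.append([x - m for x in column])
    return centered


if __name__ == "__main__":
    raw = simulate_arrays()
    normalized = median_center(raw)
    print("raw medians:     ", [round(statistics.median(c), 2) for c in raw])
    print("centered medians:", [round(statistics.median(c), 2) for c in normalized])
```

Median centering removes only the array-wide shift in this sketch. The biological noise and the per-spot error remain in the centered values, which is why a statistical model still has to account for them.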