Filtering, FDR and power

Abstract

Background: In high-dimensional data analysis such as differential gene expression analysis, people often use filtering methods like fold-change or variance filters in an attempt to reduce the multiple testing penalty and improve power. However, filtering may introduce a bias on the multiple testing correction. The precise amount of bias depends on many quantities, such as fraction of probes filtered out, filter statistic and test statistic used.

Results: We show that a biased multiple testing correction results if non-differentially expressed probes are not filtered out with equal probability from the entire range of p-values. We illustrate our results using both a simulation study and an experimental dataset, where the FDR is shown to be biased mostly by filters that are associated with the hypothesis being tested, such as the fold change. Filters that induce little bias on the FDR yield less additional power of detecting differentially expressed genes. Finally, we propose a statistical test that can be used in practice to determine whether any chosen filter introduces bias on the FDR estimate used, given a general experimental setup.

Conclusions: Filtering out of probes must be used with care as it may bias the multiple testing correction. Researchers can use our test for FDR bias to guide their choice of filter and amount of filtering in practice.

© 2010 van Iterson et al; licensee BioMed Central Ltd.
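
The central claim, that bias arises when null features are not removed with equal probability across the whole p-value range, can be illustrated with a small simulation. The sketch below is not the authors' simulation study: it assumes a two-group design with n = 8 samples per group, a known unit variance (so a z-test stands in for the t-test), and only null features. All function names and parameter values are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean


def simulate_null_features(m=20000, n=8, seed=1):
    """Simulate m null features (no differential expression).

    Returns a list of (abs_mean_difference, p_value) pairs.
    Assumption: sigma = 1 is known, so a two-sided z-test is used.
    """
    rng = random.Random(seed)
    std_norm = NormalDist()
    se = (2.0 / n) ** 0.5  # standard error of the mean difference
    features = []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diff = abs(fmean(a) - fmean(b))
        p = 2.0 * (1.0 - std_norm.cdf(diff / se))
        features.append((diff, p))
    return features


def fold_change_filter(features, fraction_removed):
    """Drop the features with the smallest absolute mean difference."""
    ranked = sorted(features, key=lambda f: f[0])
    return ranked[int(len(ranked) * fraction_removed):]


def random_filter(features, fraction_removed, seed=2):
    """Drop the same number of features, chosen independently of the data."""
    rng = random.Random(seed)
    n_keep = len(features) - int(len(features) * fraction_removed)
    return rng.sample(features, n_keep)


def rejection_rate(features, alpha=0.05):
    """Fraction of (null) features with p < alpha, i.e. the type I error rate."""
    return sum(p < alpha for _, p in features) / len(features)


features = simulate_null_features()
fc_kept = fold_change_filter(features, 0.25)
rand_kept = random_filter(features, 0.25)

print("no filter:          ", round(rejection_rate(features), 3))
print("random filter:      ", round(rejection_rate(rand_kept), 3))
print("fold-change filter: ", round(rejection_rate(fc_kept), 3))
print("largest null p after fold-change filter:",
      round(max(p for _, p in fc_kept), 3))
```

In this simplified setting the fold-change filter removes exactly the null features with the largest p-values, so the surviving null p-values no longer cover the full unit interval and the nominal 5% error rate is exceeded. The random filter removes features uniformly over the p-value range and leaves the error rate close to 5%.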

Figures

  • Figure 1 Probability density functions (pdf) of p-values for two filters under the null hypothesis (t_{ν = 8}(δ = 0)) (left panel) or alternative hypothesis (t_{ν = 8}(δ = 1)) (right panel). For each filter 25% of the hypotheses are removed. The fold-change filter is shown as a solid line, the variance filter as a dashed line, and the pdf with no filtering is shown as a dashed line. For more details on how these were obtained see section “Filtering and p-values distribution” of Additional File 1.
  • Figure 2 Filter statistic distribution for all features (blue histogram) and separately for features for which H0 holds (blue line) and Ha holds (red line) for one simulated dataset. Each of these filters leaves out the features with small values of the filter statistic. The vertical gray lines mark deciles of the distribution for all features, so that if 10% of the features must be left out, then they are the ones with value of the filter statistic to the left of the first vertical gray line. The last vertical line leaves 1% of the data.
  • Figure 3 Proportion of non-differentially expressed genes as a function of the fraction filtered out x ≡ 1 - g. Each curve represents the mean of π0(x) over 1000 simulated datasets (error bars are small but not displayed for clarity). From bottom to top, curves represent the following situations: best filter (thin solid line), fold change filter (solid line, ○), signal filter (dashed-and-dotted line, ×), variance filter (dashed line, Δ) and the random filter (thin solid line).
  • Figure 4 Achieved FDR as a function of the fraction filtered out x for the different filter statistics, fixing the FDR with each method at 0.05. Values shown are the mean FDR over 1000 simulated datasets (the variability of the FDR is small and not shown). Filters are: dashed line = variance, dotted = signal, dashed-and-dotted = fold change. In all cases the proportion of non-differentially expressed genes is fixed at 0.8. The q-value method cannot be computed for the fold-change filter as in this case the p-value range changes.
  • Figure 5 Actual power (y-axis) displayed as a function of the Benjamini-Hochberg FDR fixed at various levels, for the different filter statistics. In each panel, one curve is displayed for each given fraction of features left out, varying from 0 (dark blue) to 0.9 (dark red) by steps of 0.1. In all cases the proportion of differentially expressed genes is fixed at 0.20. Note that all filters leave out some alternative features, so the maximum power achievable may be below 1 after filtering.
  • Figure 6 Boxplots generated by null p-values yielded after permutation is applied to the simulated data, for varying proportions of features left in the data (x-axis) using the fold-change filter. For comparison, the distribution yielded without filtering is shown (leftmost boxplot). Lines represent the achieved FDR using each of the methods aimed at 5% control level: BH (dashed line with triangles), aBH (dashed-and-dotted line with crosses), BY (dotted line). For comparison, the Bonferroni correction is also shown (solid line). Above each boxplot the p-value yielded by our test for FDR bias is given ('***' for < 0.001). The solid thin straight line at 5% represents the FDR threshold used.
  • Figure 7 Achieved FDR as a function of the fraction filtered out for the different filter statistics, using an FDR control level fixed at 0.05 (horizontal solid, black line). Computations are done using randomly selected subsets of n = 8, 16, 24 samples from each subtype considered. Differential expression is evaluated using limma, and p-values are FDR-corrected.
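
Several captions above refer to Benjamini-Hochberg (BH) FDR correction. For readers unfamiliar with it, the following is a minimal sketch of the standard BH step-up adjustment in plain Python. It is a generic textbook formulation, not code from the article, and the function name bh_adjust is a hypothetical choice.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the adjusted
    value is the minimum of p_(j) * m / j over all ranks j >= r.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(example))
```

A feature is declared significant at FDR level 0.05 when its adjusted p-value is at most 0.05. Because the adjustment multiplies by the number of tests m, removing features before testing lowers the penalty, which is the motivation for filtering described in the abstract.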

Cite

van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR and power. BMC Bioinformatics, 11. https://doi.org/10.1186/1471-2105-11-450
