NAME

fasda fold-change - Compute fold-change and P-values from normalized counts

SYNOPSIS

fasda fold-change [--output file.txt] [--near-exact]\


    normalized-counts1.tsv  normalized-counts2.tsv \


    [normalized-counts3.tsv ...]

OPTIONS

--output file.txt: Report fold-change and P-values to file.txt instead of the default stdout.
--near-exact: Force calculation of near-exact P-values, rather than the default Mann-Whitney (Wilcoxon) for 8 or more replicates. Be aware that this will take a long time.

DESCRIPTION

fasda fold-change computes fold-change and P-values for two or more conditions. The input is two or more tab-separated value (TSV) files containing normalized counts for all conditions. These files are generally produced by fasda normalize which is turn takes its input from kallisto or fasda abundance output.

The Mann-Whitney U-test (A.K.A. Wilcoxon rank sum test) is used to compute P-values for a minimum of 8 replicates per condition. Exact P-values are computed for 2 to 4 replicates and near-exact P-values for 5 to 7 replicates (the enormous space of possible count pairs is down-sampled to keep run time within reason).

For each pair of conditions, fasda fold-change reports the mean normalized counts for each condition, the standard deviation / mean counts for each condition, the percent agreement across replicates as to whether the fold-change is up or down, the fold-change using total counts for each condition, and the P-value for this set of counts across the two conditions.

Feature                 MNC1    MNC2  SD/C1  SD/C2  FC 1-2 log2(FC) P-val
YPL071C_mRNA            29.6    31.1    0.3    0.2    1.05    0.07  0.72611
YLL050C_mRNA           399.5   543.8    0.2    0.2    1.36    0.44  0.02857
YMR172W_mRNA            60.6    83.3    0.2    0.2    1.38    0.46  0.02167
YOR185C_mRNA            57.1    56.1    0.3    0.2    0.98   -0.03  0.92562
YLL032C_mRNA            33.9    21.6    0.3    0.3    0.64   -0.65  0.14138
YBR225W_mRNA            61.4    75.4    0.2    0.2    1.23    0.30  0.11281

P-values will generally be lower when fold-changes are higher, when mean normalized counts are higher and when standard deviation is lower. We report standard deviation divided by mean normalized counts to provide an immediate sense of how variable the counts are across replicates for each feature. E.g. the actual standard deviation for condition 1 in YPL071C_mRNA (using rounded output) would be 34.4 * 0.4 = 13.76.

Interpreting Results

P-values from any differential analysis tool should never be taken too seriously. There are countless uncontrollable biological variables that can affect the RNA abundance in a cell. There are also numerous sources of experimental error in sample prep and sequencing that can lead to inaccuracy in read counts. Technical replicates (replicates from the same biological sample) and spike-in controls can reveal some of these technical issues, but do not address biological variations.

Another problem is that many biology experiments use only 3 replicates. We simply cannot draw high confidence from any statistics based on 3 samples.

P-value calculations typically make the same assumptions about all genes. In reality, a 2-fold change in expression could be hugely significant for one gene under certain conditions and completely meaningless for a different gene or different conditions. Statistical routines have no knowledge of the biology that determines this.

There is huge variability on the computational side as well. Well-established differential analysis tools commonly report very different sets of genes as differentially expressed. Li, et al (https://doi.org/10.1186/s13059-022-02648-4) reported that 23.71% to 75% of the DEGs identified by DESeq2 were missed by edgeR. In one data set tested, DESeq2 and edgeR had only an 8% overlap in the DEGs they identified.

Hence, simply assuming that P-values < 0.05 represent significant changes while others do not would be foolish. Rather than try too hard to produce adjusted P-values that you can take on faith, we provide simple, honest statistics and leave it to you to consider them carefully.

From our preliminary experiments, we have found that P-values between 0.05 and 0.20 are often questionable and may warrant a closer look at the raw data. A quick look at the individual read counts for each replicate in these cases can be enlightening. You will usually see a high variance in counts across individuals, sometimes with up-regulation in some individuals and down-regulation in others.

For example, consider the following gene, for which Sleuth reported a P-value of 0.0132, while the exact P-value computed by FASDA is 0.116. The kallisto estimated counts show that one replicate was up-regulated almost 4-fold, another almost 2-fold, and the third was slightly down-regulated. A P-value of 0.01 would not likely make anyone suspect this situation. 0.116, on the other hand, tells us that there is a good chance this is significant, but maybe we should take a minute or two to look at the read counts and consider the biology behind them. This is a tiny investment that will help us better decide whether a costly experimental verification is warranted.

		    FASDA                     Sleuth
	   Feature    MNC1    MNC2  FC   1-2    SC1    SC2  SFC    SPV
ENSMUST00000017610  7620.5 16006.7 2.1 0.116  191.4  433.5  2.3 0.0132
Kallisto estimated counts:
	     R1      R2      R3
MNC1    5382.16  8567.4 6986.43
MNC2    21519.9 16196.7 6307.52

Conversely, Sleuth produced a P-value of 0.12 for the following, which looks like a slam-dunk given the counts.

		    FASDA                     Sleuth
	   Feature    MNC1    MNC2  FC   1-2    SC1    SC2  SFC    SPV
ENSMUST00000036928  1165.0  4064.3 3.5 0.024   70.5  300.4  4.3 0.1203
	     R1      R2      R3
MNC1    1045.41 942.707 1220.02
MNC2    2835.7  4167.81 3718.39

The bottom line is, while the 0.05 rule is a good one mathematically, we cannot count on experimental and computational results reflecting the biology with that much accuracy. Give the results of any differential analysis a generous margin of error, and examine the data more closely for anything within that margin.

FILES

abundance.tsv: Input with normalized counts for all replicates of 1 condition
file.txt: Output containing fold-change and P-values

BUGS

Please report bugs to the author and send patches in unified diff format. (man diff for more information)

AUTHOR

J. Bacon