Statistics Colloquium : Dr. John Spouge
NIH
Title: "Which terms in a sum are too large? An application of combinatorial probability to microRNA motifs"
Abstract: Data about herpesvirus microRNA motifs on human circular RNAs suggested a specific question in combinatorial probability. The motifs are simply fixed words in the RNA alphabet, whereas the circular RNAs are like circular lattices, or "rings". In biological terms, therefore, fix the total number of microRNA motifs found on the rings and assume each of the motif placements on the rings is equally likely. For motifs that do not self-overlap, the problem is equivalent to counting the placements of a fixed number of points on a fixed number of rings of differing size. For each ring, the corresponding p-value describes the chance that the ring has as many or more motifs than are actually observed. The p-values directed human experimentation towards discovering the potential biological function of the human circular RNAs enriched for specific herpesvirus microRNA motifs.
To generalize mathematically, consider independent random counts, not necessarily identically distributed. Conditioned on their sum, compute the p-value that each count is as large or larger than the value observed. When the sum contains n terms, and n is large, a naive algorithm for computing the p-values requires quadratic time, proportional to n*n: for each of the n random counts, the naive algorithm convolves it with the remaining (n-1) counts. A segment tree is a general data structure that improves to linearithmic time, proportional to n*log(n). The special leave-one-out structure, however, permits a Jackknife Product algorithm for the p-values, with linear time and storage, both proportional to n.
In the herpesvirus application, the Jackknife Product algorithm required 15 min; standard segment tree algorithms would have taken an estimated 3 h; and the quadratic algorithm, an estimated 1 month. The Jackknife Product algorithm may have many possible uses in statistics.