| RNA Folding with Nossinov-Jacobsen Algorithm |
Introduction
RNA molecules often fold back on themselves, forming stable double-helix structures akin to the famous DNA double helix, with G-C and A-U pairs forming, so called Watson-Crick base pairs. (Here, we’ll ignore other pairs, which sometimes form , too.) A commonly used set of rules for what structures may form is: ri rj allowed only if i < j − 4 (it can't bend too sharply) and if ri rj and rio rjo are two pairs with i i0 then either
- i = i0 and j = j0 (ri can't pair with two different bases)
- j < i0 (one pair completely precedes the other), or
- j0 < j (one is completely nested within the other)
Two pairs violating conditions 2-3 are called a “pseudoknot”. Forbidding
pseudoknots makes algorithms both simpler and faster, and means that there is a
very simple linear representation of the structure using nested parentheses. For
example, the structure depicted above is:
GCCCACCUUCGAAAAGACUGGAUGACCAUGGGCCAUGAUU
( ( ( ( ( . ( ( ( . . . . ) ) ) . .
( ( ( . . . . ) ) ) ) ) ) ) ) . ( . . . . )
12 Pairs.
For a given RNA sequence, there will usually be a huge number of structures satisfying the above restrictions. Nature usually favors the “most stable” structure. One approximation to “most stable” is the structure having the maximum possible number of pairs (subject to the above restrictions).
Nossinov-Jacobsen Algorithm (for Watson-Crick base pairs)
The Nussinov-Jacobsen algorithm, based on the following recurrence, computes, for each 1 £ i < j £ n the quantity B[i, j] which is the maximum number of pairs in any folding of the substring xi xi+1 . . . xj of the input:
max [ B[i+1, j-1] + pi,j , max { B[i,k] + B[k+1, j] : i £ k < j } ] , if i < j − 4,
B[i, j] is most easily represented as an n by n array, of which the
lower-left triangle will be all zero. Since B[i, j] depends on entries to
its left in the same row and entries below it in the same column, the
convenient order in which to compute entries is to fill columns in left to
right order, each column filled from the diagonal upwards. B[1, n] will hold
the maximum number of pairs.
Implementation in Java
The algorithm is implemented by Dynamic Programming in Java. The program
takes a RNA sequence from a standard input prompt.
NOTE :
The program doesn't have any input validation except the maximum length of a
RNA sequence, which is 2000. So, you should enter a valid RNA sequence which
consists of G, C, A, and U.
- source files : RNAFolding.java
- pre-compiled executable :
runNAFolding.zip (Download and extract the files in the same
directory, then double click on the batch file named runRNAFolding.bat)
see screen shot
Time Complexity Analysis
With Dynamic Programming, the Nussinov-Jacobsen algorithm takes Q(n3) worst-case time; three nested loops of order n. The recursive algorithm for traceback takes Q(n). Thus, the total complexity is Q(n3) worst-case time. The following graph illustrates actual running-times on various RNA sequences of different lengths (20 to 1800). The best-fit curve is plotted with a cubic polynomial. As shown on the graph, the actual running-time reasonably follows the theoretical analysis of the complexity of the algorithm.

by Hyung-Joon Kim, CSE 417, University of Washington

