RNA Folding with Nossinov-Jacobsen Algorithm

Introduction

RNA molecules often fold back on themselves, forming stable double-helix structures akin to the famous DNA double helix, with G-C and A-U pairs forming, so called Watson-Crick base pairs. (Here, we’ll ignore other pairs, which sometimes form , too.) A commonly used set of rules for what structures may form is: ri rj allowed only if i < j − 4 (it can't bend too sharply) and if ri rj and rio rjo are two pairs with i i0 then either

  1. i = i0 and j = j0 (ri can't pair with two different bases)
  2. j < i0 (one pair completely precedes the other), or
  3. j0 < j (one is completely nested within the other)

Two pairs violating conditions 2-3 are called a “pseudoknot”. Forbidding pseudoknots makes algorithms both simpler and faster, and means that there is a very simple linear representation of the structure using nested parentheses. For example, the structure depicted above is:
GCCCACCUUCGAAAAGACUGGAUGACCAUGGGCCAUGAUU
 ( ( ( ( ( . ( ( ( . . . . ) ) ) . . ( ( ( . . . . ) ) ) ) ) ) ) ) . ( . . . . )
12 Pairs.

For a given RNA sequence, there will usually be a huge number of structures satisfying the above restrictions. Nature usually favors the “most stable” structure. One approximation to “most stable” is the structure having the maximum possible number of pairs (subject to the above restrictions).

 
 

Nossinov-Jacobsen Algorithm (for Watson-Crick base pairs)

The Nussinov-Jacobsen algorithm, based on the following recurrence, computes, for each 1 £  i < j £  n the quantity B[i, j] which is the maximum number of pairs in any folding of the substring xi xi+1 . . . xj of the input:

B[i, j] = 0  , if i  ³  j − 4
            max [ B[i+1, j-1] + pi,j , max { B[i,k] + B[k+1, j] : i £ k < j } ]  , if i <  j − 4,
Where pi,j = 1 if xi could pair with xj (i.e., G-C or A-U pairs), and pi,j = 0 otherwise.

B[i, j] is most easily represented as an n by n array, of which the lower-left triangle will be all zero. Since B[i, j] depends on entries to its left in the same row and entries below it in the same column, the convenient order in which to compute entries is to fill columns in left to right order, each column filled from the diagonal upwards. B[1, n] will hold the maximum number of pairs.

Implementation in Java

The algorithm is implemented by Dynamic Programming in Java. The program takes a RNA sequence from a standard input prompt.

NOTE :  The program doesn't have any input validation except the maximum length of a RNA sequence, which is 2000. So, you should enter a valid RNA sequence which consists of G, C, A, and U.

Also, try the following applet to see what the output is like. However, the applet version is a bit simplified, so it limits the max length of a sequence to 40 and doesn't display the running-time nor the matrix which is computed to find the max number of pairs in RNA folding.

 
Enter a string of RNA sequence below :  (e.g., GACGUAGCUAGCUUAGCAU )

Time Complexity Analysis

With Dynamic Programming, the Nussinov-Jacobsen algorithm takes Q(n3) worst-case time; three nested loops of order n. The recursive algorithm for traceback takes Q(n). Thus, the total complexity is Q(n3) worst-case time. The following graph illustrates actual running-times on various RNA sequences of different lengths (20 to 1800). The best-fit curve is plotted with a cubic polynomial. As shown on the graph, the actual running-time reasonably follows the theoretical analysis of the complexity of the algorithm.



by Hyung-Joon Kim, CSE 417, University of Washington