RNAlib-2.2.7
Input / Output File Formats

File formats for Secondary Structure Constraints

Constraints Definition File

RNAlib can parse and apply data from constraint definition text files, where each constraint is given as one line of whitespace-delimited fields. The syntax extends the one used in mfold / UNAfold, where each line begins with a command character followed by a set of positions.
In addition, we introduce several new commands and allow two optional fields: a loop type context specifier, given as a sequence of characters, and an orientation flag that forces a nucleotide to pair upstream or downstream.

Constraint commands

The following set of commands is recognized:

  • F $ \ldots $ Force
  • P $ \ldots $ Prohibit
  • C $ \ldots $ Conflicts/Context dependency
  • A $ \ldots $ Allow (for non-canonical pairs)
  • E $ \ldots $ Soft constraints for unpaired position(s), or base pair(s)

Specification of the loop type context

The optional loop type context specifier [WHERE] may be a combination of the following:

  • E $ \ldots $ Exterior loop
  • H $ \ldots $ Hairpin loop
  • I $ \ldots $ Interior loop (enclosing pair)
  • i $ \ldots $ Interior loop (enclosed pair)
  • M $ \ldots $ Multibranch loop (enclosing pair)
  • m $ \ldots $ Multibranch loop (enclosed pair)
  • A $ \ldots $ All loops

If no [WHERE] flags are set, all contexts are considered (equivalent to A).

Controlling the orientation of base pairing

For particular nucleotides that are forced to pair, the following [ORIENTATION] flags may be used:

  • U $ \ldots $ Upstream
  • D $ \ldots $ Downstream

If no [ORIENTATION] flag is set, both directions are considered.
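
The two optional specifiers can be illustrated with a minimal sketch. It is not part of RNAlib; the function names `decode_where` and `decode_orientation` and the descriptive strings are hypothetical, chosen only to mirror the flag tables above, including the defaults that apply when a specifier is omitted.

```python
# Hypothetical helpers (not RNAlib API) that mirror the flag tables above.
LOOP_CONTEXTS = {
    "E": "exterior loop",
    "H": "hairpin loop",
    "I": "interior loop (enclosing pair)",
    "i": "interior loop (enclosed pair)",
    "M": "multibranch loop (enclosing pair)",
    "m": "multibranch loop (enclosed pair)",
}
ORIENTATIONS = {"U": "upstream", "D": "downstream"}


def decode_where(spec=""):
    """Return the set of loop contexts selected by a [WHERE] string."""
    unknown = set(spec) - set(LOOP_CONTEXTS) - {"A"}
    if unknown:
        raise ValueError(f"unknown loop type flag(s): {sorted(unknown)}")
    # No flags, or the A flag, selects every loop context.
    if spec == "" or "A" in spec:
        return set(LOOP_CONTEXTS.values())
    return {LOOP_CONTEXTS[c] for c in spec}


def decode_orientation(flag=""):
    """Return the pairing directions permitted by an [ORIENTATION] flag."""
    # No flag means both directions are considered.
    if flag == "":
        return set(ORIENTATIONS.values())
    if flag not in ORIENTATIONS:
        raise ValueError(f"unknown orientation flag: {flag!r}")
    return {ORIENTATIONS[flag]}
```

For example, `decode_where("Hi")` selects hairpin loops and enclosed pairs of interior loops, while `decode_where("")` and `decode_where("A")` both select all six contexts.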

Sequence coordinates

Sequence positions of nucleotides/base pairs are $ 1 $-based and are given as three positions $ i $, $ j $, and $ k $. Alternatively, four positions may be provided as a pair of position ranges $ [i:j] $ and $ [k:l] $, using the '-' sign as delimiter within each range, i.e. $ i-j $ and $ k-l $.
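
As a rough illustration of the two coordinate forms, the following sketch reads the fields that follow the command character. It is an assumption-laden example rather than the parser used by RNAlib: the function name `parse_positions`, the returned tuple layout, and the validation rules (for instance, accepting $ j = 0 $ in the three-position form, as used by several commands) are hypothetical.

```python
# Hypothetical sketch (not RNAlib code): read the coordinate fields that
# follow the command character of a constraint line.
def parse_positions(fields):
    """Return (kind, positions, remaining_fields) for a list of string fields."""
    # Range form: two tokens "i-j" and "k-l".
    if len(fields) >= 2 and "-" in fields[0] and "-" in fields[1]:
        i, j = (int(x) for x in fields[0].split("-"))
        k, l = (int(x) for x in fields[1].split("-"))
        if min(i, j, k, l) < 1:
            raise ValueError("sequence positions are 1-based")
        return "ranges", (i, j, k, l), fields[2:]
    # Three-position form: "i j k", where j may be 0.
    if len(fields) >= 3:
        i, j, k = (int(x) for x in fields[:3])
        if i < 1 or j < 0 or k < 1:
            raise ValueError("expected i >= 1, j >= 0, k >= 1")
        return "triple", (i, j, k), fields[3:]
    raise ValueError("expected 'i j k' or 'i-j k-l'")


# Usage: split a line on whitespace and drop the command character first.
kind, positions, rest = parse_positions("P 1-4 10-14 H".split()[1:])
```

Any fields left over after the coordinates (here `["H"]`) would then be interpreted as the optional specifiers or, for soft constraints, as the energy value.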

Valid constraint commands

Combining commands, coordinates, and the optional specifiers yields the following general cases, which are considered valid constraints:

  1. "Forcing a range of nucleotide positions to be paired":
    Syntax:
    F i 0 k [WHERE] [ORIENTATION]

    Description:
    Enforces the set of $ k $ consecutive nucleotides starting at position $ i $ to be paired. The optional loop type specifier [WHERE] can be used to force them to appear as closing/enclosed pairs of certain types of loops.
  2. "Forcing a set of consecutive base pairs to form":
    Syntax:
    F i j k [WHERE] 

    Description:
    Enforces the base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $ to form. The optional loop type specifier [WHERE] specifies the loop context in which the base pairs must appear.
  3. "Prohibiting a range of nucleotide positions to be paired":
    Syntax:
    P i 0 k [WHERE] 

    Description:
    Prohibits a set of $ k $ consecutive nucleotides, starting at position $ i $, from participating in base pairing, i.e. makes these positions unpaired. The optional loop type specifier [WHERE] can be used to force the nucleotides to appear within loops of specific types.
  4. "Prohibiting a set of consecutive base pairs to form":
    Syntax:
    P i j k [WHERE] 

    Description:
    Prohibits the base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $ from forming. The optional loop type specifier [WHERE] specifies the types of loop in which they may not appear as the closing or an enclosed pair.
  5. "Prohibiting two ranges of nucleotides to pair with each other":
    Syntax:
    P i-j k-l [WHERE]

    Description:
    Prohibits any nucleotide $ p \in [i:j] $ from pairing with any nucleotide $ q \in [k:l] $. The optional loop type specifier [WHERE] specifies the types of loop in which such pairs may not appear as the closing or an enclosed pair.
  6. "Enforce a loop context for a range of nucleotide positions":
    Syntax:
    C i 0 k [WHERE]

    Description:
    This command enforces nucleotides to be unpaired, similar to prohibiting them from pairing as described above. It also marks the corresponding nucleotides as unpaired; however, the [WHERE] flag can be used to enforce the specific loop types the nucleotides must appear in.
  7. "Remove pairs that conflict with a set of consecutive base pairs":
    Syntax:
    C i j k 

    Description:
    Remove all base pairs that conflict with a set of consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $. Two base pairs $ (i,j) $ and $ (p,q) $ conflict with each other if $ i < p < j < q $, or $ p < i < q < j $.
  8. "Allow a set of consecutive (non-canonical) base pairs to form":
    Syntax:
    A i j k [WHERE]

    Description:
    This command enables the formation of the consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $, regardless of whether they are canonical or non-canonical. In contrast to the above F and C commands, which remove conflicting base pairs, the A command does not. Therefore, it may be used to allow non-canonical base pair interactions. Since RNAlib does not contain free energy contributions $ E_{ij} $ for non-canonical base pairs $ (i,j) $, they are scored as the maximum of similar, known contributions. In terms of a Nussinov-like scoring function, the free energy of non-canonical base pairs is therefore estimated as

    \[ E_{ij} = \min \left[ \max_{(i,k) \in \{GC, CG, AU, UA, GU, UG\}} E_{ik}, \max_{(k,j) \in \{GC, CG, AU, UA, GU, UG\}} E_{kj} \right]. \]

    The optional loop type specifier [WHERE] specifies the loop context in which the base pairs may appear.
  9. "Apply pseudo free energy to a range of unpaired nucleotide positions":
    Syntax:
    E i 0 k e

    Description:
    Use this command to apply a pseudo free energy of $ e $ to the set of $ k $ consecutive nucleotides, starting at position $ i $. The pseudo free energy is applied only if these nucleotides are considered unpaired in the recursions, or evaluations, and is expected to be given in $ kcal / mol $.
  10. "Apply pseudo free energy to a set of consecutive base pairs":
    Syntax:
    E i j k e

    Description:
    Use this command to apply a pseudo free energy of $ e $ to the set of base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $. Energies are expected to be given in $ kcal / mol $.
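
Several of the commands above address a set of consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $, and the conflict-removing C command relies on the crossing condition $ i < p < j < q $ or $ p < i < q < j $. The sketch below spells out both definitions. It only illustrates the notation and is not RNAlib code; the function names `helix_pairs` and `pairs_conflict` and the example coordinates are hypothetical.

```python
# Hypothetical helpers (not RNAlib API) illustrating the notation used above.
def helix_pairs(i, j, k):
    """Expand 'i j k' into the k consecutive pairs (i, j), ..., (i+k-1, j-(k-1))."""
    return [(i + n, j - n) for n in range(k)]


def pairs_conflict(a, b):
    """Return True if base pairs a = (i, j) and b = (p, q) cross each other."""
    (i, j), (p, q) = a, b
    return i < p < j < q or p < i < q < j


# Usage: find which candidate pairs conflict with the pairs of 'C 2 20 3'.
constrained = helix_pairs(2, 20, 3)            # [(2, 20), (3, 19), (4, 18)]
candidates = [(10, 25), (5, 15), (21, 30), (1, 3)]
conflicting = [c for c in candidates
               if any(pairs_conflict(c, h) for h in constrained)]
```

In this example, the candidates $ (10,25) $ and $ (1,3) $ cross at least one pair of the helix, whereas $ (5,15) $ is nested inside it and $ (21,30) $ lies entirely downstream, so neither of those conflicts.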