Welcome to the advanced guide to Trinity. You are probably reading this because you want to better understand what all those zillions of output files and output directories correspond to, and how to troubleshoot certain processes. We aim to provide much of that information here.
Below, the individual Trinity processes (Inchworm, Chrysalis, and then Butterfly) are described, including their expected output files and data formats. For Butterfly, additional options are presented for exploring transcript graphs and tracking Butterfly's progress as it works its way through these graphs.
Trinity was written as a collective effort by three individuals (Inchworm by Brian Haas, Chrysalis by Manfred Grabherr, and Butterfly by Moran Yassour). In addition to being engineered separately, the tools use different code bases: Inchworm and Chrysalis were written in C++, and Butterfly was written in Java. They interact only through well-defined file formats for intermediate products used as inputs by downstream processes.
Meryl is a third-party tool developed by Brian Walenz at the J. Craig Venter Institute and is part of the kmer distribution. We include it in Trinity as a way of rapidly building a k-mer catalog from a set of sequences. It is impressively fast and hugely useful. If, for whatever reason, you choose not to use it, you can run Trinity.pl with the --no_meryl option, in which case Inchworm will build the k-mer catalog itself. You will obtain faster runtimes by using Meryl and having Inchworm simply parse the provided k-mer catalog rather than build it. Thanks to Michael Ott (now a Trinity Developer) and Alexie Papanicolaou for introducing Meryl to Trinity.
Inchworm reads in a set of sequences, decomposes them into overlapping k-mers (adjacent k-mers overlap by k-1 bases), and stores each k-mer in memory in a hash table. The keys are the k-mers and the values are the abundances of the corresponding k-mers. Each k-mer is stored as an unsigned integer using 2 bits per base, which on 64-bit architectures allows k-mers of up to 32 bases (32 x 2 = 64 bits). We find that 25-mers work very well for both highly and lowly expressed transcripts, and so Trinity uses 25-mers as a fixed setting. If you run Inchworm separately, you can use any k-mer length up to 32, but only 25-mers are compatible with the rest of the Trinity package.
When using strand-specific RNA-Seq data, only the transcribed strand is loaded into memory (keyed in the hash table). When using double-stranded RNA-Seq, both strands are processed.
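The k-mer catalog described above can be sketched in a few lines of Python. This is only an illustration, not Inchworm's C++ implementation: the particular 2-bit base mapping, the function names, and the choice to count both the read and its reverse complement for non-strand-specific data are all assumptions made for the example.

```python
from collections import Counter

# 2 bits per base; this specific mapping is an assumption for illustration.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode_kmer(kmer):
    """Pack a k-mer (k <= 32) into an integer using 2 bits per base."""
    if len(kmer) > 32:
        raise ValueError("a 64-bit word holds at most 32 bases")
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def reverse_complement(seq):
    """Return the reverse complement of an ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(reads, k=25, strand_specific=False):
    """Map each encoded k-mer to its abundance across all reads."""
    counts = Counter()
    for read in reads:
        # Assumption: double-stranded data is handled by also counting
        # the reverse complement of every read.
        strands = [read] if strand_specific else [read, reverse_complement(read)]
        for seq in strands:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Skip k-mers containing anything other than A, C, G, T.
                if all(base in BASE_TO_BITS for base in kmer):
                    counts[encode_kmer(kmer)] += 1
    return counts
```

A read of length n contributes n - k + 1 k-mers per strand, so with strand_specific=True a 76-base read yields 52 25-mers.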
When Trinity is executed, the Inchworm process is launched first. The progress of Inchworm is captured in the specified output directory as monitor.out. By running tail -f output_dir/monitor.out, you can monitor the progress of Inchworm as it reads in sequences (populating the hash table), sorts the k-mers, and then outputs the draft assembled contigs.
The Inchworm-assembled contigs are reported as the file output_dir/inchworm.K25.L48.fa, with the name of the file indicating that a k-mer length of 25 was used and that contigs at least 48 bases long were reported. The fasta sequence accession for each contig contains information that is used by Chrysalis in the next step. For example, the following header for an Inchworm fasta assembly:
>a1;142 K: 25 length: 3697
GGAGCTGGAGGCCCCCAGGCAACTACACCGTCCACGTA....
indicates that the sequence entry a1;142 has an average k-mer abundance value of 142. This value is a proxy for transcript expression.
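As a small sketch of how a downstream script could pull these values out of an Inchworm header, the snippet below parses the example shown above. The function name and the regular expression are hypothetical; they assume headers always follow the layout of that single example.

```python
import re

# Pattern modeled on the one example header in this guide (an assumption).
INCHWORM_HEADER_RE = re.compile(
    r"^>(?P<accession>\S+;(?P<abundance>\d+))"
    r"\s+K:\s*(?P<k>\d+)"
    r"\s+length:\s*(?P<length>\d+)"
)


def parse_inchworm_header(line):
    """Return accession, average k-mer abundance, k, and contig length."""
    match = INCHWORM_HEADER_RE.match(line)
    if match is None:
        raise ValueError("not an Inchworm-style header: %r" % line)
    return {
        "accession": match.group("accession"),
        "avg_kmer_abundance": int(match.group("abundance")),
        "k": int(match.group("k")),
        "length": int(match.group("length")),
    }


info = parse_inchworm_header(">a1;142 K: 25 length: 3697")
print(info["accession"], info["avg_kmer_abundance"], info["k"], info["length"])
```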
Inchworm does a very good job at reconstructing full-length transcripts from RNA-Seq data, but since it leverages only unique k-mers for contig construction, it can only report the parts of alternatively spliced isoforms that are unique. Subsequent Trinity steps reconstruct the full-length alternatively spliced transcripts.
Inchworm is not very fast (as the name might imply, but this was not intentional). We are exploring ways of speeding this process up and lowering its memory requirements.
Chrysalis groups together Inchworm contigs that are related, either as potential alternatively spliced transcripts or as potential paralogous transcripts that share subsequences. A de Bruijn graph is constructed from the clustered Inchworm contigs, and the original sequence reads are mapped to positions within the de Bruijn graph. The many Inchworm contigs (hundreds of thousands) typically yield tens of thousands of separately defined de Bruijn graphs. Each graph and its metadata are written as separate output files, to be used as input by Butterfly in the next step. The files generated for each graph include:
*chrysalis intermediate outputs*
chrysalis/RawComps.0/comp9.raw.graph : de Bruijn graph based on Inchworm contigs only
chrysalis/RawComps.0/comp9.raw.fasta : raw RNA-Seq reads that map to the Inchworm contigs based on k-mer composition

*chrysalis outputs to be used by Butterfly as inputs*
chrysalis/RawComps.0/comp9.out : the de Bruijn graph with edge weights incorporating the mapped reads
chrysalis/RawComps.0/comp9.reads : the read sequences and anchor points within the above graph
The format of the .out file is like so:
Component 2
0 -1 0 GAGCTCTTCAGGAGGGGGAATGTG 0
1 0 3 AGCTCTTCAGGAGGGGGAATGTGC 0
2 1 3 GCTCTTCAGGAGGGGGAATGTGCT 0
3 2 3 CTCTTCAGGAGGGGGAATGTGCTT 0
4 3 3 TCTTCAGGAGGGGGAATGTGCTTG 0
5 4 3 CTTCAGGAGGGGGAATGTGCTTGT 0
...
with fields: node_id, from_node_id, edge_support, node_kmer_seq (each example row also ends with a fifth column that is not described here).
Node identifier -1 is a start node with no k-mer sequence.
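To make the field layout concrete, here is a minimal Python sketch that reads rows in this format. It assumes one node record per line, whitespace-separated fields, and an integer edge_support as in the example; the function name is hypothetical, and any columns beyond the four documented fields are ignored.

```python
def parse_graph_rows(lines):
    """Parse a Chrysalis-style .out listing into (component, rows).

    Each row is (node_id, from_node_id, edge_support, node_kmer_seq).
    """
    component = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "...":
            continue  # skip blank lines and the elision marker
        if fields[0] == "Component":
            component = int(fields[1])
            continue
        node_id, from_node_id, edge_support, kmer = fields[:4]
        rows.append((int(node_id), int(from_node_id), int(edge_support), kmer))
    return component, rows


example = [
    "Component 2",
    "0 -1 0 GAGCTCTTCAGGAGGGGGAATGTG 0",
    "1 0 3 AGCTCTTCAGGAGGGGGAATGTGC 0",
]
component, rows = parse_graph_rows(example)
# Nodes whose predecessor is -1 hang directly off the start node.
roots = [node_id for node_id, from_id, _, _ in rows if from_id == -1]
```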
The format of the .reads file is like so:
Component 2
>61DFRAAXX100204:2:25:3750:2732/2 0 1833 51 1884 GGGAAGGCACTTTCCGGATGATCCCGTATCCCCTGGAGAAGGGACACCTATTTTATCCATACCCAATCTGTACAGA +
>61DFRAAXX100204:2:25:7347:5444/2 0 202 51 253 GACTGCAGTCTCTGCTGCTGCTCGCAGACCTGCCCTGCGCTAGCTACCTAGCCCTGCCTCACTGCATCCCTCAAGA +
>61DFRAAXX100204:2:25:8933:8122/2 0 2418 51 1183 CTTGGAGATAAACGAGTGTGCAACTGCGTACATTCTCTTGGCGGAAGAAGAAGCGACAACTATTGCTGAAGCAGAA +
>61DFRAAXX100204:2:26:11187:19799/2 0 1324 51 1375 CTATATCAAAAGAAGGCTGGCGATGTGTGCCCGGAGACTTGGAAGGACCAGAGAAGCAGTGAAGATGATGAGAGAT +
>61DFRAAXX100204:2:26:12653:14528/2 14 1432 51 1469 CTCCTAAGCATGTACAATATCCATGAGAACCTTCTAGAAGCTCTTCTGGAACTCCAAGCTTATGCTGATGTTCAGG +
>61DFRAAXX100204:2:26:12686:3440/2 15 843 51 879 CAGAATGCAAAGTAAGGCGAAATCCACTGAATCTGTTTAGGGGTGCGGAATATAATCGGTACACTTGGGTCACAGG +
>61DFRAAXX100204:2:26:16242:3695/2 14 279 51 316 GCATCCCTTAAGAACCGCGGCAGCCTTTCCTTGCCTGCTGGATTTTGAGAAGCAGCTCTTCGATTTGGGCTGGTGT +
>61DFRAAXX100204:2:26:16448:13715/2 0 1753 51 1804 TGAAGCGATAGCATATGCATTCTTTCATCTTGCACACTGGAAGAGGGTGGAAGGGGCTTTGAATCTCTTGCATTGT +
>61DFRAAXX100204:2:26:16861:10738/2 0 2865 51 622 CGACAACCTGAGCACAGTGAGCATGTTTTTGAACACGTTAACCCCAAAGTTCTACGTGGCCCTGACAGGCACTTCC +
>61DFRAAXX100204:2:26:17369:11435/2 0 1005 51 1056 TGCAAAAAGCTTGGAGAGAAAGGAACCCTCAAGCCAGGATTTCTGCAGCTCATGAAGCCTTGGAGATAAACGAAAT +
...
with fields: read_accession, start_in_read, start_node_id, end_in_read, end_node_id, read_sequence, read_orientation_in_graph
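The seven fields listed above map naturally onto a small record type. The following Python sketch is illustrative only: it assumes one read record per line with whitespace-separated fields, and the ReadAnchor type and parse_reads function are hypothetical names, not part of Trinity.

```python
from typing import NamedTuple


class ReadAnchor(NamedTuple):
    accession: str
    start_in_read: int
    start_node_id: int
    end_in_read: int
    end_node_id: int
    sequence: str
    orientation: str


def parse_reads(lines):
    """Yield one ReadAnchor per read record, skipping all other lines."""
    for line in lines:
        fields = line.split()
        # Read records start with '>' and carry exactly seven fields.
        if len(fields) != 7 or not fields[0].startswith(">"):
            continue
        acc, s_read, s_node, e_read, e_node, seq, orient = fields
        yield ReadAnchor(acc[1:], int(s_read), int(s_node),
                         int(e_read), int(e_node), seq, orient)


example = [
    "Component 2",
    ">61DFRAAXX100204:2:25:3750:2732/2 0 1833 51 1884 "
    "GGGAAGGCACTTTCCGGATGATCCCGTATCCCCTGGAGAAGGGACACCTATTTTATCCATACCCAATCTGTACAGA +",
]
anchors = list(parse_reads(example))
```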
When Chrysalis completes, it creates a file called butterfly_commands that contains the minimal command string to execute Butterfly on these components. The Trinity.pl wrapper modifies these commands to include Java settings (such as heap size initialization) and any Butterfly parameters set at runtime. The modified command file butterfly_commands.adj contains the Butterfly commands that should be executed, either by the Trinity.pl script (when run with the --run_butterfly option) or in parallel on a computing grid (your mechanism may vary).
The Butterfly commands generated by Chrysalis above (and ultimately written to butterfly_commands.adj) are executed in parallel on your local server if you enabled the --run_butterfly parameter of Trinity.pl. The included script:
bhaas@niveum$ trinityrnaseq/util/cmd_process_forker.pl
#################################################################################
#
#  -c <string>    filename containing a list of commands to execute, one cmd per line.
#
#  --CPU          Default: 1
#
###################################################################################
executes these Butterfly commands, throttling them at the --CPU setting (which is passed on from the Trinity.pl --CPU parameter setting).
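The general idea of running a command file with a cap on concurrent jobs can be sketched as follows. This is not how cmd_process_forker.pl is implemented; it is a hypothetical Python analogue in which run_commands and its cpu parameter are names chosen for the example, each line of the file is handed to the shell, and a thread pool limits how many commands run at once.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(cmd_file, cpu=1):
    """Run each line of cmd_file as a shell command, at most `cpu` at a time.

    Returns the list of commands that exited with a non-zero status.
    """
    with open(cmd_file) as handle:
        commands = [line.strip() for line in handle if line.strip()]

    def run_one(cmd):
        # shell=True because each line is a complete shell command string.
        return cmd, subprocess.run(cmd, shell=True).returncode

    # The pool size is the throttle: no more than `cpu` commands run together.
    with ThreadPoolExecutor(max_workers=cpu) as pool:
        results = list(pool.map(run_one, commands))

    return [cmd for cmd, code in results if code != 0]
```

Collecting the failed commands makes it easy to rerun only those, which is useful when a few of many thousands of Butterfly jobs fail.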
Butterfly consumes the .out and .reads files for a given Chrysalis component. Butterfly traces the paths that reads and pairs of reads take within the graph and reports the most probable transcripts as a fasta file. The result file for component 2 would be named comp2_allProbPaths.fasta. The format of the fasta file is like so:
>comp2_c0_seq1 len=2364 path=[0:0-587 588:588-1076 1146:1077-2363]
GAGCTCTTCAGGAGGGGGAATGTGCTTGTGGTTTTTGGTCTTGTGCATTTTGTGACAAAG
GAATTCCCTTTTGAATCGCGCTGTTCCCTTGAAACCCTGGAGCCTCTGGTTCAAGCAGCG
CAGTCAGTCTGTGCAGTGTCCCTGACGTCATCCGGCGTATGCATAAGCTCTGCTATTGTC
TTACCGCTAGAGCAGGGCTGAGGACTGCAGTCTCTGCTGCTGCTCGCAGACCTGCCCTGC
...
The accession of each fasta entry is bundled with information, and is broken down like so:
>comp2_c0_seq1 len=2364 path=[0:0-587 588:588-1076 1146:1077-2363]
comp2: contig is derived from Chrysalis component # 2
c0: contig also corresponds to Butterfly sub-component # 0 (during graph compaction and pruning, some components are partitioned into disconnected subcomponents).
seq1: contig sequence count from Chrysalis component 2, Butterfly subcomponent zero. If this subcomponent yields multiple sequences, these will have different seq numbers.
len: length of the transcript contig
path: list of vertices in the compacted graph that represent the final transcript sequence, together with the range within the assembled sequence that each node corresponds to. For example, node 0 spans positions 0-587 and connects to node 588, which spans positions 588-1076 within the transcript, and so on. It is coincidental in this case that the node identifiers match the start positions within the sequence; this is not always the case, as shown by the third node of this path (node 1146 starts at position 1077).
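A short Python sketch shows how the pieces of this accession could be separated programmatically. The function name and regular expressions are hypothetical and are modeled only on the example header above.

```python
import re

# Patterns modeled on the single example header in this guide (an assumption).
BUTTERFLY_HEADER_RE = re.compile(
    r"^>comp(\d+)_c(\d+)_seq(\d+)\s+len=(\d+)\s+path=\[([^\]]*)\]"
)
PATH_NODE_RE = re.compile(r"(-?\d+):(\d+)-(\d+)")


def parse_butterfly_header(line):
    """Split a Butterfly fasta header into its component parts."""
    match = BUTTERFLY_HEADER_RE.match(line)
    if match is None:
        raise ValueError("not a Butterfly-style header: %r" % line)
    comp, sub, seq, length, path_text = match.groups()
    # Each path element is node_id:start-end within the transcript.
    path = [(int(node), int(start), int(end))
            for node, start, end in PATH_NODE_RE.findall(path_text)]
    return {
        "component": int(comp),
        "subcomponent": int(sub),
        "seq": int(seq),
        "len": int(length),
        "path": path,
    }


header = ">comp2_c0_seq1 len=2364 path=[0:0-587 588:588-1076 1146:1077-2363]"
info = parse_butterfly_header(header)
```

In this example the ranges are zero-based and inclusive: the last node ends at position 2363, one less than len=2364, and each node starts where the previous one ended plus one.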
The operations of Butterfly become more transparent if you execute the Butterfly command with a verbose setting of at least 5. In that case, in addition to yielding the most probable transcript contigs, it reports the underlying compacted graph structure and describes the vertices that are visited during transcript reconstruction. For example, the following Butterfly command reports:
RUNNING: java -Xmx1000M -jar /Users/bhaas/sVN/trinityrnaseq/Butterfly/Butterfly.jar -N 28363 -L 305 -F 280 -C chrysalis/RawComps.0/comp25 --edge-thr=0.05 --stderr -V 5
fixExtremelyHighSingleEdges()
method: combineSimilarPathsThatEndAtV(-1)
method: combineSimilarPathsThatEndAtV(-1)
method: combineSimilarPathsThatEndAtV(-1)
method: combineSimilarPathsThatEndAtV(256)
method: combineSimilarPathsThatEndAtV(256)
method: combineSimilarPathsThatEndAtV(0)
method: combineSimilarPathsThatEndAtV(0)
method: combineSimilarPathsThatEndAtV(73)
method: combineSimilarPathsThatEndAtV(73)
method: combineSimilarPathsThatEndAtV(73)
method: combineSimilarPathsThatEndAtV(247)
method: combineSimilarPathsThatEndAtV(247)
method: combineSimilarPathsThatEndAtV(100)
method: combineSimilarPathsThatEndAtV(100)
method: combineSimilarPathsThatEndAtV(109)
method: combineSimilarPathsThatEndAtV(109)
method: combineSimilarPathsThatEndAtV(-2)
The graph vertices that are being visited are given in the parentheses above, starting with (-1), which is the start node that all initial vertices link to, and ending at (-2), which is a final sink node.
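If you capture this verbose output to a file, the order in which vertices are visited can be summarized with a few lines of Python. This sketch assumes only that the log contains the combineSimilarPathsThatEndAtV(...) messages shown above; the function name is hypothetical.

```python
import re
from itertools import groupby

VISIT_RE = re.compile(r"combineSimilarPathsThatEndAtV\((-?\d+)\)")


def visited_vertices(log_text):
    """Return visited vertex ids in order, collapsing consecutive repeats."""
    ids = [int(vertex) for vertex in VISIT_RE.findall(log_text)]
    return [vertex for vertex, _ in groupby(ids)]


log = """
method: combineSimilarPathsThatEndAtV(-1)
method: combineSimilarPathsThatEndAtV(-1)
method: combineSimilarPathsThatEndAtV(256)
method: combineSimilarPathsThatEndAtV(0)
method: combineSimilarPathsThatEndAtV(-2)
"""
print(visited_vertices(log))  # prints [-1, 256, 0, -2]
```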
Butterfly, given the -V 5 setting, creates a file called comp25_justBeforeFindingPaths.dot that represents the structure of the compacted graph. This graph can be viewed using GraphViz. The graph can be exported in PDF format for searching (not sure why GraphViz doesn't have a search function). The Preview software on Mac OS X works well for this (acroread doesn't, for some unknown reason). In the PDF file, you can search for node identifiers and find the corresponding vertex in the graph. The graph nodes are formatted like so: TTTACCTCAC…GATGGCTCAG\:1\(0)\[73], with the trailing three numbers corresponding to: average_node_coverage, node_id, sequence_length.
If you have very complex graphs that are taking an exceedingly long time to process (more than a day), consider increasing the --edge-thr Butterfly threshold to further simplify the graph before transcript reconstruction. Hopefully, this should not happen. Be sure to send us any ultra-long-running graphs so we can explore more efficient ways of processing them in Butterfly.