After downloading the software, simply type make in the base installation directory. This should build Inchworm and Chrysalis, both written in C++. Butterfly should not require any special compilation, as it's written in Java and already provided as portable precompiled software.
Trinity has been tested and is supported on Linux.
Trinity is run via the script: Trinity.pl in the base installation directory.
Usage info is as follows:
##################################################################
#
# Required:
#
# --seqType <string> :type of reads: (fq or fa)
#
# If paired reads:
#
# --left <string> :left reads
# --right <string> :right reads
#
# Or, if unpaired reads:
#
# --single <string> :single reads
#
#
# --output <string> :name of directory for output (will be created if it doesn't already exist)
# default( "trinity_out_dir" )
#
# if strand-specific data, set:
#
# --SS_lib_type <string> :if paired: RF or FR, if single: F or R
#
#
#
# Butterfly-related options:
#
# --bfly_opts <string> :parameters to pass through to butterfly (see butterfly documentation).
#
# --bflyHeapSpace <string> :java heap space setting for butterfly (default: 1000M) => yields command java -Xmx1000M -jar Butterfly.jar ... $bfly_opts
#
# --no_run_butterfly :stops after the Chrysalis stage. You'll need to run the Butterfly computes separately, such as on a computing grid.
# Then, concatenate all the Butterfly assemblies by running:
# find trinity_out_dir/ -name "*allProbPaths.fasta" -exec cat {} \; > trinity_out_dir/Trinity.fasta
#
#
# Inchworm-related options:
#
# --no_meryl :do not use meryl for computing the k-mer catalog (default: uses meryl, providing improved runtime performance)
# --min_kmer_cov <int> :min count for K-mers to be assembled by Inchworm (default: 1)
#
# Misc:
#
# --CPU <int> :number of CPUs to use, default: 2
#
# --min_contig_length <int> :minimum assembled contig length to report (def=200)
#
# --paired_fragment_length <int> :maximum length expected between fragment pairs (aim for 90% percentile) (def=300)
#
# --jaccard_clip :option, set if you have paired reads and you expect high gene density with UTR overlap (use FASTQ input file format for reads).
#
#
#####################################################################################################################################
Note: Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved.
If you have strand-specific data, specify the library type. There are four library types:
-
Paired reads:
-
RF: first read (/1) of fragment pair is sequenced as anti-sense (reverse(R)), and second read (/2) is in the sense strand (forward(F)); typical of the dUTP/UDG sequencing method.
-
FR: first read (/1) of fragment pair is sequenced as sense (forward), and second read (/2) is in the antisense strand (reverse)
-
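Unpaired (single) reads:
-
F: the single read is in the sense (forward) orientation
-
R: the single read is in the antisense (reverse) orientation
By setting the --SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.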
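The pairing between read layout and library type can be checked before launching a long run. The following is a minimal sketch, not part of Trinity: the helper name valid_ss_lib_type is hypothetical, and the accepted values are taken from the --SS_lib_type line in the usage text above.

```shell
# Hypothetical helper: succeeds only for the combinations listed in the usage text.
#   $1 = read layout, "paired" or "single"
#   $2 = intended --SS_lib_type value
valid_ss_lib_type() {
    case "$1:$2" in
        paired:RF|paired:FR|single:F|single:R) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use in a wrapper script (commented out so the sketch has no side effects):
# valid_ss_lib_type paired RF || { echo "bad --SS_lib_type for paired reads" >&2; exit 1; }
```

A wrapper script could call this once and refuse to start Trinity when the combination is not one of the four listed types.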
Other important considerations:
-
Whether you use FASTQ or FASTA formatted input files, be sure to keep the reads oriented as they are reported by Illumina if the data are strand-specific, because Trinity will properly orient the sequences according to the specified library type. If the data are not strand-specific, no worries, because the reads will be parsed in both orientations.
-
If you do not have strand-specific data, and you do not plan to use the --jaccard_clip option, you can combine all your reads into a single fastq or fasta file and use the --single option. You can also combine paired reads and single reads, as long as the paired reads are recognized by having the same accession prefix with /1 and /2 to discriminate between paired ends.
-
If you have multiple paired-end library fragment sizes, set the --paired_fragment_length according to the larger insert library. Pairings that exceed that distance will be treated as if they were unpaired by the Butterfly process. Trinity's defaults are tuned to a library with an ~300 base fragment length.
-
By setting the --CPU option, you are indicating:
-
the number of threads for Inchworm to use (in most cases, Inchworm multithreading does not currently lead to performance gains. In future releases, this may change).
-
most importantly, the number of Butterfly executions that will occur simultaneously.
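For the non-strand-specific case described above, where all reads are combined into one file for --single, the merge itself is a plain concatenation. The sketch below is one way to do it, with a simple guard against mixing FASTQ and FASTA inputs; the function name combine_reads and the file names in the comments are hypothetical.

```shell
# Hypothetical helper: concatenate read files into one file for use with --single.
#   $1 = output file, remaining arguments = input read files
# Refuses to mix formats, judged by the first character of each file
# ("@" for FASTQ, ">" for FASTA).
combine_reads() {
    out=$1
    shift
    first=""
    for f in "$@"; do
        c=$(head -n 1 "$f" | cut -c 1)
        if [ -z "$first" ]; then
            first=$c
        elif [ "$c" != "$first" ]; then
            echo "mixed FASTQ/FASTA input: $f" >&2
            return 1
        fi
    done
    cat "$@" > "$out"
}

# Example (file names are placeholders):
# combine_reads all_reads.fq reads_1.fq reads_2.fq
# Trinity.pl --seqType fq --single all_reads.fq --CPU 4
```

Paired reads keep their /1 and /2 name suffixes through the concatenation, which is what allows them to be recognized as pairs.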
A typical Trinity command for assembling non-strand-specific RNA-Seq data is shown below, running the entire process on a single high-memory server (requiring 1G RAM per 1M ~76 base Illumina paired reads).
First, set your stacksize to unlimited. The way to do this depends on your system, and more precisely on your shell: unlimit and limit are csh/tcsh built-ins, while ulimit is a bash/sh built-in:
CentOS: 'unlimit'
Ubuntu: 'ulimit -s unlimited'
And then verify your stacksize settings:
CentOS: 'limit'
Ubuntu: 'ulimit -a'
If you do not do this, there is a very good possibility that Chrysalis will fail.
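For sh/bash users, the stack size step can be folded into the script that launches Trinity. The following is a sketch under assumptions: the function name raise_stack_limit is hypothetical, and the fallback to the hard limit is an addition for systems where "unlimited" is refused.

```shell
# Hypothetical helper: raise the soft stack limit as far as the system allows,
# then print the resulting value so it can be verified.
raise_stack_limit() {
    # Try "unlimited" first; if that is refused, fall back to the hard limit.
    ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)" 2>/dev/null
    ulimit -s
}

# Call it directly (not inside $(...)) in the shell that will start Trinity,
# because the limit only applies to the current shell and its children:
# raise_stack_limit
```

If the printed value is still a small number rather than "unlimited", the hard limit on the machine is the constraint and may need to be raised by an administrator.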
Now, you would run Trinity:
Trinity.pl --seqType fq --left reads_1.fq --right reads_2.fq --CPU 4 --run_butterfly --bflyHeapSpace 10G
Example data and sample pipeline are provided and described here.
When Trinity completes, it will create a Trinity.fasta output file in the trinity_out_dir/ output directory (or output directory you specify).
If your transcriptome RNA-Seq data are derived from a gene-dense compact genome, such as from fungal genomes, where transcripts may often overlap in UTR regions, you can minimize fusion transcripts by leveraging the --jaccard_clip option if you have paired reads. Trinity will examine the consistency of read pairings and fragment transcripts at positions that have little read-pairing support. In expansive genomes of vertebrates and plants, this is unnecessary and not recommended. In compact fungal genomes, it is highly recommended. In addition to requiring paired reads, you must also have the Bowtie short read aligner installed. As part of this analysis, reads are aligned to the Inchworm contigs using Bowtie, and read pairings are examined across the Inchworm contigs, and contigs are clipped at positions of low pairing support. These clipped Inchworm contigs are then fed into Chrysalis for downstream processing. Be sure that your read names end with "/1" and "/2" for read name pairings to be properly recognized.
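The two prerequisites named above (Bowtie installed, read names ending in /1 and /2) can be checked before starting a run. This is a rough preflight sketch, not something Trinity provides: the function names are hypothetical, the input is assumed to be FASTQ with exactly four lines per record, and Bowtie is assumed to be reachable as bowtie on the PATH.

```shell
# Hypothetical helper: succeeds only if every FASTQ header line in file $1
# carries a read name that ends with the suffix $2 (for example /1 or /2).
# Assumes 4-line FASTQ records; text after the first space in a header is ignored.
check_read_suffix() {
    awk -v suffix="$2" '
        NR % 4 == 1 {
            name = $1
            if (substr(name, length(name) - length(suffix) + 1) != suffix) bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}

# Hypothetical helper: succeeds if an executable named bowtie is on the PATH.
have_bowtie() {
    command -v bowtie >/dev/null 2>&1
}

# Example preflight (file names are placeholders):
# have_bowtie || echo "bowtie not found on PATH" >&2
# check_read_suffix reads_1.fq /1 || echo "left read names do not all end in /1" >&2
# check_read_suffix reads_2.fq /2 || echo "right read names do not all end in /2" >&2
```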
The Inchworm and Chrysalis steps can be memory intensive. A basic recommendation is to have 1G of RAM per 1M pairs of Illumina reads. Simpler transcriptomes (lower eukaryotes) require less memory than more complex transcriptomes such as from vertebrates. Butterfly requires less memory and can be executed in parallel on a computing grid, but it's often easier to just execute it as a single process on a large memory server, where Butterfly processes are forked off to take advantage of multiple CPUs. The Chrysalis step can sometimes enter a deep recursion, in which case the stack memory can exceed default limits. Before running Trinity, set the stacksize to unlimited (or as high as you can). See above and the FAQ for more details.
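The 1G per 1M pairs rule of thumb can be turned into a quick estimate before choosing a machine. The sketch below only restates that rule; the function names are hypothetical, the input is assumed to be FASTQ with four lines per record, and real memory use will vary with transcriptome complexity as noted above.

```shell
# Hypothetical helper: number of records in a 4-line-per-record FASTQ file.
# For paired data, the record count of the left file equals the number of pairs.
count_fastq_records() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: suggested RAM in G for a given number of read pairs,
# at 1G per 1M pairs, rounded up to the next whole G.
ram_gb_for_pairs() {
    echo $(( ($1 + 999999) / 1000000 ))
}

# Example (file name is a placeholder):
# pairs=$(count_fastq_records reads_1.fq)
# echo "suggested RAM: $(ram_gb_for_pairs "$pairs")G"
```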
If you are able to run the entire Trinity process on a single high-memory multi-core server, indicate the number of Butterfly processes to run in parallel with the --CPU parameter (currently capped at 10, but you can force it higher). If you decide instead to run the Butterfly commands as distributed on a compute farm, set --no_run_butterfly to stop the pipeline after Chrysalis completes. A trinity_out_dir/chrysalis/butterfly_commands.adj file will be generated, and you can run these commands in parallel on your computing grid (from within the trinity_out_dir, since some paths are local rather than fully qualified). Most Butterfly jobs require minimal memory (<1G), but some read-rich graphs can require up to 10G of RAM or more. Butterfly requires that Java version 1.6 be installed. After successfully executing all Butterfly commands, you can capture all the assembled transcripts into a single file by running the following from within the trinity_out_dir/ directory:
find chrysalis/ -name "*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
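Before that concatenation step, the Butterfly commands themselves have to be executed. If no grid scheduler is available, one possible way to run them a few at a time on a single machine is sketched here. It assumes that butterfly_commands.adj holds one complete shell command per line (an assumption about the file layout), and the function name run_commands_parallel is hypothetical.

```shell
# Hypothetical helper: run every line of a command file as its own shell command,
# keeping up to N commands running at once.
#   $1 = command file, one command per line (assumed layout)
#   $2 = number of simultaneous commands (default 2)
# Lines are passed NUL-separated so quotes inside a command survive intact.
# xargs exits non-zero if any of the commands failed.
run_commands_parallel() {
    cmd_file=$1
    nprocs=${2:-2}
    tr '\n' '\0' < "$cmd_file" | xargs -0 -n 1 -P "$nprocs" sh -c
}

# Example, run from within the output directory because some paths are local:
# cd trinity_out_dir
# run_commands_parallel chrysalis/butterfly_commands.adj 8
```

A non-zero exit status means at least one command failed and is a cue to inspect the output before concatenating the results.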
If any Butterfly commands fail, try re-executing the failed commands with a higher Java heap size (such as java -Xmx10G …). There are often just a few out of tens of thousands of Butterfly commands that require more than the 1G default of RAM specified. If you specify --bflyHeapSpace 10G at Trinity.pl runtime, all Butterfly commands are more likely to succeed, and Trinity will automatically generate the Trinity.fasta file for you.
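Re-executing failed commands with a larger heap amounts to editing the -Xmx value in each command line. As a small sketch, assuming the commands contain a java -Xmx setting in the form shown in the usage text (for example -Xmx1000M), a filter like the following could rewrite them; the function name bump_heap and the file names in the comments are hypothetical.

```shell
# Hypothetical helper: read command lines on stdin and replace the first
# -Xmx<number><unit> setting on each line with -Xmx$1 (for example 10G).
bump_heap() {
    sed "s/-Xmx[0-9][0-9]*[kKmMgG]/-Xmx$1/"
}

# Example (file names are placeholders for a list of commands that failed):
# bump_heap 10G < failed_commands.txt > retry_commands.txt
```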
Our experience is that the entire process can require ~1 hour per million pairs of reads in the current implementation (see FAQ). We're striving to improve upon both memory and time requirements.
Since Trinity can easily take several days to complete, it is useful to be able to monitor the process and to know which stage (Inchworm, Chrysalis, Butterfly) Trinity is currently at. There are a few general ways to do this:
-
By running top, you'll be able to see which Trinity process is running and how much memory is being consumed.
-
Inchworm logs status information to the trinity_out_dir/monitor.out file. You can run tail -f on that file to continually monitor its status until it completes and finishes outputting the inchworm fasta file in that directory.
-
Chrysalis and the downstream process that runs the Butterfly commands will generate standard output. Be sure to capture stdout when you run the Trinity.pl script. You can tail -f that output file to follow the progress of the Chrysalis and Butterfly stages after the Inchworm stage completes.
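The monitoring approaches above depend on the output of Trinity.pl being captured in a file. One way to arrange that is sketched below; the helper name run_logged and the log file name are hypothetical, and the sketch also captures stderr, which goes slightly beyond the stdout capture described above.

```shell
# Hypothetical helper: start a command in the background with stdout and stderr
# written to a log file, and remember its process ID in RUN_LOGGED_PID.
#   $1 = log file, remaining arguments = the command to run
run_logged() {
    log_file=$1
    shift
    "$@" > "$log_file" 2>&1 &
    RUN_LOGGED_PID=$!
}

# Example (log file name is a placeholder):
# run_logged trinity.log Trinity.pl --seqType fq --left reads_1.fq --right reads_2.fq --CPU 4
# tail -f trinity_out_dir/monitor.out   # Inchworm stage
# tail -f trinity.log                   # Chrysalis and Butterfly stages
```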
The Trinity software distribution includes sample data in the sample_data/test_Trinity_Assembly/ directory. Simply run the included runMe.sh shell script to execute the Trinity assembly process with provided paired strand-specific Illumina data derived from mouse. Running Trinity on the sample data requires ~2G of RAM and should run on an ordinary desktop/laptop computer.
Visit the Advanced Guide to Trinity for more information regarding Trinity behavior, intermediate data files, and file formats.
Additional questions, comments, etc?
Subscribe to the email list here: https://lists.sourceforge.net/lists/listinfo/trinityrnaseq-users