find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
Ideally, you will have access to a large-memory server, having ~1G of RAM per 1M reads to be assembled. Inchworm, in particular, is memory hungry, and loads the entire k-mer composition of reads into physical memory. In double-stranded mode, it loads the forward and reverse-complemented k-mers into memory, so even more RAM is utilized. Future versions of the Inchworm software will focus on reducing its memory requirements.
We regularly run Trinity on our Dell 256G RAM 22-core server.
It depends on a number of factors including the number of reads to be assembled and the complexity of the transcript graphs. Inchworm and Chrysalis fairly reliably consume about 1 hour per million reads. Inchworm can take up to twice as long when non-strand-specific RNA-Seq data are used, since it loads both strands of the transcript data.
You can monitor the progress of the system in several ways. (1) status information is written to stdout, (2) while Inchworm is running, it logs status information into the output_dir/monitor.out file, which you can follow by running "tail -f trinity_out_dir/monitor.out", and (3) running top to determine which process is running, how long it's been running, and what computational resources are being consumed.
To troubleshoot individual graphs using Butterfly, see Advanced Guide to Trinity.
The Inchworm and Chrysalis steps need to be run on a single server as a single process, however Butterfly can be run in parallel on a computing grid. If you choose to run Butterfly on your computing grid, then do not use the Trinity.pl —run_butterfly parameter. When Chrysalis completes, it creates a file containing all the butterfly commands that need to be executed. The filename should be trinity_out_dir/chrysalis/butterfly_commands.adj (be sure to use the .adj extension, since this file includes other butterfly parameters passed through the Trinity.pl wrapper). Each command line in this file can be run independently on different nodes within your computing grid, with commands launched from within the trinity_out_dir/ directory (input files are specified by relative paths, so this is important). How to get each of these commands to run on your grid will depend on your computing infrastructure. We have several ways to do this via LSF, but they're all home-grown solutions and not included here.
Once all butterfly commands have been executed, you can retrieve all the resulting assembled transcripts by concatenating the individual result files together like so:
find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
Currently, the mappings of reads to transcripts are not reported. To obtain this information, we recommend realigning the reads to the assembled transcripts using Bowtie or BWA. A wrapper around Bowtie that will generate alignments for you is included in the Trinity package as:
Trinity_Release/util/bowtie_paired_read_aligner.pl
and example data nad example execution of this is included at:
Trinity_Release/misc/read_alignment/pairedSS, run the 'runBowtie.sh' script in that directory to demo the process.
For strand-specifid data: If you have RNA-Seq data from multiple libraries and you want to run them all through Trinity in a single pass, simply combine all your left.fq files into a single left.fq file, and combine all right.fq files into a single right.fq file. Then run Trinity using these concatenated inputs. If you have additional singletons, add them to the .fq file that they correspond to based on the sequencing method used (if they're equivalent to the left.fq entries, add them there, etc).
For non-strand-specifid data: You can combine all your .fq files into a single.fq file, and run Trinity with the —single parameter. The only exception is if you want to use the —jaccard_clip option (advanced option for gene-dense genomes where transcripts tend to overlap), in which case you'll need to keep the pairs separated into left.fq and right.fq files.
There is no good way to combine strand-specific data with non-strand-specific data.
In nearly every case examined thus far, this is because the stacksize memory grew beyond its maximal setting. Chrysalis can sometimes recurse fairly deeply, in which case the stacksize can grow substantially. Before running Trinity, be sure that your stacksize is set to unlimited. On CentOS linux, you can check your default settings like so (it may be different on your linux distro):
bhaas@hyacinth$ limit
cputime unlimited filesize unlimited datasize unlimited stacksize 100000 kbytes coredumpsize 0 kbytes memoryuse unlimited vmemoryuse unlimited descriptors 131072 memorylocked 256 kbytes maxproc 2102272
and you would set the stacksize (and other settings) to unlimited like so:
bhaas@hyacinth$ unlimit
and verify the new settings:
bhaas@hyacinth$ limit
cputime unlimited filesize unlimited datasize unlimited stacksize unlimited coredumpsize unlimited memoryuse unlimited vmemoryuse unlimited descriptors 131072 memorylocked 256 kbytes maxproc 2102272
On Solaris, Ubuntu linux, and perhaps Mac OSX, the syntax is different, and might be:
ulimit -s unlimited
On Ubuntu, type: ulimit -a to examine and verify your settings.
On snow leopard, you cannot set it to unlimited for some reason (older versions you could), so try to max it out.
An update to Chrysalis is in the works that will explore alternatives to the recursive processes that sometimes require altered stacksize settings.
Note
|
The problem of butterfly long-running processes has hopefully been resolved in Trinity releases since 2011-05-19. |
Occassionally (very rarely, such as one component per tens of thousands, if at all) Butterfly will encounter a complicated transcript graph and seems to take an eternity to process it. You will notice this by running top and seeing a java process that has been running for a very long time. For example, I'm running a dozen butterfly commands on my large server (22 cores, 256 GB RAM) and I can see various butterfly jobs running as java in the view:
Tasks: 500 total, 7 running, 493 sleeping, 0 stopped, 0 zombie top - 09:13:33 up 131 days, 21:07, 4 users, load average: 70.72, 53.70, 28.00Tasks: 510 total, 9 running, 501 sleeping, 0 stopped, 0 zombie Cpu(s): 89.1%us, 10.4%sy, 0.0%ni, 0.2%id, 0.0%wa, 0.1%hi, 0.2%si, 0.0%stMem: 264349428k total, 48345144k used, 216004284k free, 126640k buffers Swap: 8385920k total, 314336k used, 8071584k free, 18855720k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 7775 bhaas 16 0 1373m 302m 8724 S 201.2 0.1 0:04.02 java 7735 bhaas 17 0 1358m 329m 8776 S 171.1 0.1 0:04.47 java 7310 bhaas 17 0 1300m 359m 8804 S 140.9 0.1 0:07.84 java 8194 bhaas 17 0 1294m 165m 8680 S 125.8 0.1 0:01.88 java 8313 bhaas 18 0 1356m 36m 8580 S 98.1 0.0 0:00.73 java 8075 bhaas 17 0 1290m 53m 8668 S 93.1 0.0 0:01.18 java 10241 bhaas 18 0 1376m 604m 8820 S 88.0 0.2 4:31.80 java 32424 bhaas 18 0 1306m 474m 8816 S 88.0 0.2 0:58.53 java 8143 bhaas 17 0 1292m 48m 8664 S 85.5 0.0 0:01.23 java 8258 bhaas 17 0 1291m 48m 8656 S 80.5 0.0 0:01.07 java 1305 bhaas 17 0 1377m 509m 8820 S 78.0 0.2 0:56.11 java 10247 bhaas 18 0 1356m 1.0g 8812 S 78.0 0.4 4:26.23 java ...
A way to see exactly what jobs are running is to execute the following:
bhaas@hyacinth$ ps auxww | grep Butterfly bhaas 4588 50.3 0.1 1355708 435476 pts/4 Sl 09:17 0:38 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 9814096 -L 300 -F 300 -C chrysalis/RawComps.0/comp374 --edge-thr=0.16 bhaas 5920 51.3 0.1 1353604 409604 pts/4 Sl 09:18 0:33 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10114793 -L 300 -F 300 -C chrysalis/RawComps.0/comp412 --edge-thr=0.16 bhaas 7747 53.0 0.2 1325344 530752 pts/4 Sl 09:13 3:01 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 11032490 -L 300 -F 300 -C chrysalis/RawComps.0/comp127 --edge-thr=0.16 bhaas 10241 56.5 0.2 1409492 625972 pts/4 Sl 09:06 7:18 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10630881 -L 300 -F 300 -C chrysalis/RawComps.0/comp2 --edge-thr=0.16 bhaas 10247 51.9 0.4 1389204 1077640 pts/4 Sl 09:06 6:42 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10702374 -L 300 -F 300 -C chrysalis/RawComps.0/comp0 --edge-thr=0.16 bhaas 10249 51.8 0.4 1394704 1082764 pts/4 Sl 09:06 6:41 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10702374 -L 300 -F 300 -C chrysalis/RawComps.0/comp1 --edge-thr=0.16
Most of the butterfly commands have been running for only a short period of time (seconds), but there are a couple that have been running for several minutes. Most commands will take less than a few minutes to run, and some can take up to an hour. If you see a butterfly command (java) that has been running for many hours, you can consider killing it and trying it again later with altered butterfly parameters. There are a couple of ways to kill the process.
From the command line, you can kill it like so:
kill $pid
where $pid is the process ID in the first column of the top output or second column of the ps output.
From within top, you can kill it by typing k, enter, $pid, enter. (on linux, this is how it works; your system may vary).
Once a Butterfly command has finished (or you've killed it to retry it later), the next butterfly command in the queue will take its place.
If all Butterfly commands complete successfully, then the Trinity.pl wrapper script will report success and concatenate all the individual butterfly assembly outputs into a single file (Trinity.fasta). If any commands did not succeed, then the failed (or killed) commands will be reported and written to a file so that you can adjust the parameters (such as tweak the —edge-thr value to a higher setting) and rerun, or troubleshoot (see Advanced Guide to Trinity).
Once all butterfly commands have been executed, you can retrieve all the resulting assembled transcripts by concatenating the individual result files together like so:
find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
When butterfly commands fail, it's usually due to not enough heap memory being available. This tends happen with the largest component (comp0). Failed butterfly commands will be written to a failed_cmds.txt file and the Trinity process will terminate. Try rerunning the failed commands (which should be very few, such as 1-3 out of the tens of thousands of commands) manually and resetting the heap size -Xmx value to a higher number, such as 20G. After these failed processes complete successfully, you can combine all the Butterfly results into a single file like so:
find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
and consider the entire run a success (albeit with a little hand-holding).
There are a couple reasons why this error message might creep up.
all memory has been consumed on the machine. Each butterfly process wants to reserve 10G of heap space. If there's less than 10G of free memory on the machine per butterfly (—CPU setting), then java will not be able to initialize. The -Xmx10G setting indicates that 10G of heap memory should be reserved.
no more processes are available, due to too many threads being spawned, or too many zombie processes existing and not being properly collected. (cmd_process_forker.pl in the July and earlier releases was not collecting all finished child processes effectively, which could lead to the build-up of zombies. The August release will contain the updated util/cmd_process_forker.pl script. Zombie build-up can be observed within top,and it should normally stay at a low value, fewer that the —CPU setting.
NUMA architecture: one of our users found that the java invocation required: -XX:ParallelGCThreads=<Numerical Thread Count>, otherwise it would try to use too many threads.
Although Inchworm has the capability of running independently with different k-mer sizes up to 32 (the 64-bit limit with 2-bit base encoding), Chrysalis and Butterfly are current fixed at the 25mer k-mer size. In testing, we discovered early on that 25-mers appeared to be near-optimal across a different transcriptomes and different read abundance levels, and so fixed the value accordingly as part of the Trinity process. Future development will aim to expose the k-mer setting as an option.
You can do so, but you wouldn't be able to benefit as from the strand-specificity, since you'll need to run Trinity in non-strand-specific mode.
Yes. If your data are not strand-specific, then just combine all reads into a single.fq file and use the —single option. Contact the mailing list if you have a more complicated scenario.