Transcriptome assemblers reconstruct full-length transcripts from the short sequence fragments generated by RNA-Seq. Large consortia such as TCGA, ICGC, GTex, ENCODE, the Cancer Cell Line Encyclopedia (CCLE), and others have performed RNA-Seq on thousands of human tissues and cell lines, providing an unparalleled resource for investigating transcriptional diversity and complexity. Towards this end, we present Transcriptome Assemblies Combined into One (TACO), an algorithm that reconstructs a consensus transcriptome from a collection of individual assemblies. TACO employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. TACO vastly outperforms existing software tools such as Cuffmerge and Stringtie merge. Please refernece our manuscript in Nature Methods for further details and the results of our comparison analysis. TACO also contains an easy to use companion tool for comparing meta-assemblies to reference transcriptomes, assessing overlap with reference and also protein coding potential.
Multi-sample transcriptome assembly protocol
Varying the number of input assemblies
Performance metrics for TACO, Cuffmerge, and StringTie when merging different numbers of input assemblies. Recall (i.e., sensitivity), precision and the F-measure for all three tools were assessed for splicing patterns, splice junctions, and bases. Twenty batches were randomly selected at each number of assemblies. Points represent the mean statistic across the 20 runs, error bars represent the 95% confidence interval.Varying isoform fraction cutoff
The parameter --isoform-frac FRAC
configures TACO to only report isoforms with abundance ratio more than
FRAC
of the most abundant transcript in the locus
The precision-recall (PR) plots depict the performance for the three merging tools at 50 different isoform fraction cutoffs ranging from 0.001-0.999. Statistics describe the merger of the 55 breast cancer cell lines in the CCLE.
Performance for highly expressed transcripts
The precision-recall (PR) plots depict the performance for the three merging tools across subsets of the top 500, 1k, 1.5k, 2k, 3k, 5k, 7k, 10k, 12k, 15k, 18k, 20k, 22k, 25k, 27k, and 30k highest expressed transcripts in the merger of 55 breast cancer cell lines from the CCLE
Fraction of read-through transcripts
Scatterplot depicting the fraction of read-through genes produced in the multi-sample assembly from the three merge tools. Gene designated as being a read-through if it contains at least one transcript isoform bearing exonic overlap with two separate protein-coding genes on the same strand. Fraction of read-throughs was determined for assemblies at 50 different isoform fraction cutoffs ranging from 0.001-0.999.
Version | Date | Changes | Execuatble Binaries | Source Package |
---|---|---|---|---|
v0.7.3 | 9/25/2017 | Bug fix permitting spaces in GTF attributes | Mac Linux | Source |
v0.7.2 | 5/15/2017 | Added option to supply additional splice motifs to permit in splice junction filter | Mac Linux | Source |
v0.7.0 | 2/13/2017 | Fixed bug in refcomp code for annotation of intragenic overlap | Mac Linux | Source |
v0.6.3 | 2/01/2017 | Cleaner Input Instructions | Mac Linux | Source |
v0.6.2 | 1/31/2017 | Refcomp Optimized | Mac Linux | Source |
v0.6.1 | 1/28/2017 | Added Refcomp Source Code and Added Mac Support | Mac Linux | Source |
v0.6.0 | 12/13/2016 | Added Reference Comparison Tool | Mac Linux | Source |
v0.5.1 | 4/28/2016 | Version used for manuscript submission | Mac Linux | Source |
The TACO GitHub repository is located here.
 More information on the build procedure is located here.
 Instructions on testing the installation are here.
 For your convenience we are providing links for human references. Be sure to gunzip the file before running with TACO or TACO-refcomp.
Unpack the downloaded binary package:
$ tar -xpvfz taco-<RELEASE>.tar.gzOn some machines, it may be necessary to change the permissions of the executable
$ cd taco-<RELEASE>/ $ chmod +x taco_run
To test installatoin, download the following two test data files and place them in the same directory as the TACO executable:
In the directory where you placed the two downloaded files and unpacked the binary file, type:
./taco_run --gtf-expr-attr expr taco_test_files.txt
Note:
Our sample dataset has already been processed by TACO, which converts
the attribute specified by ‑‑gtf-expr-attr
(e.g. FPKM)
into expr
. Hence, we use the argument
‑‑gtf-expr-attr expr
here.
You should see the following output:
2016-03-30 14:54:02,097 pid=89792 INFO - taco version 0.5.0 2016-03-30 14:54:02,097 pid=89792 INFO - ------------------------------------------------------------------------------ 2016-03-30 14:54:02,097 pid=89792 INFO - verbose logging: False 2016-03-30 14:54:02,097 pid=89792 INFO - num processes: 1 2016-03-30 14:54:02,097 pid=89792 INFO - output directory: taco 2016-03-30 14:54:02,097 pid=89792 INFO - min transfrag length: 200 2016-03-30 14:54:02,098 pid=89792 INFO - min expression: 0.5 2016-03-30 14:54:02,098 pid=89792 INFO - reference GTF file: None 2016-03-30 14:54:02,098 pid=89792 INFO - guided assembly mode: False 2016-03-30 14:54:02,098 pid=89792 INFO - guided strand mode: False 2016-03-30 14:54:02,098 pid=89792 INFO - guided ends mode: False 2016-03-30 14:54:02,098 pid=89792 INFO - GTF expression attribute: expr 2016-03-30 14:54:02,098 pid=89792 INFO - isoform fraction: 0.05 2016-03-30 14:54:02,098 pid=89792 INFO - max_isoforms: 0 2016-03-30 14:54:02,098 pid=89792 INFO - change point: True 2016-03-30 14:54:02,098 pid=89792 INFO - change point pvalue: 0.001 2016-03-30 14:54:02,098 pid=89792 INFO - change point fold change: 0.85 2016-03-30 14:54:02,098 pid=89792 INFO - change point trim: True 2016-03-30 14:54:02,098 pid=89792 INFO - path graph loss threshold: 0.1 2016-03-30 14:54:02,098 pid=89792 INFO - path frac: 0.0 2016-03-30 14:54:02,099 pid=89792 INFO - max paths: 0 2016-03-30 14:54:02,099 pid=89792 INFO - Samples: 1 2016-03-30 14:54:02,099 pid=89792 INFO - Aggregating GTF files 2016-03-30 14:54:02,210 pid=89792 INFO - Sorting GTF 2016-03-30 14:54:02,236 pid=89792 INFO - Indexing Loci 2016-03-30 14:54:02,239 pid=89792 INFO - Assembling GTF files 2016-03-30 14:54:02,239 pid=89792 INFO - Assembling in parallel using 1 processes 2016-03-30 14:54:02,729 pid=89792 INFO - Merging output files 2016-03-30 14:54:02,739 pid=89792 INFO - Removing temporary files 2016-03-30 14:54:02,740 pid=89792 INFO - Done
Source files can be obtained from
download section. Unpack the release file (.tar.gz
or
.zip
) into source directory taco-X.Y.Z
. For example:
$ tar -zxvf taco-X.Y.Z.tar.gz 
If pyinstaller is installed, the build.sh
bash script can be used to compile, install, and generate executable binary files. (note: it might be necessay to chmod +x build.sh
to enable the build file to be executed).
e.g.:
$ cd taco-<RELEASE> $ ./build.shTwo binary files called
taco_run
and taco_refcomp
are generated in the unpacked directory that can be executed to run TACO and TACO-refcomp.
 
To install TACO and run without creating an executable, run the following:
$ cd taco-<RELEASE> $ python setup.py buildThe TACO code and reference comparison script can be found in
<install_dir>/build/lib*/taco/
as taco_run.py
and taco_refcomp.py
. You must either run these scripts from the <install_dir>/build/lib*/taco/
directory:
$ cd build/lib*/taco $ python taco_run.py -h $ python taco_refcomp.py -hor add the build directory to your python path to call this script elsewhere:
$ export PYTHONPATH=<insatll_dir>/build/lib*/taco:$PYTHONPATH
Run TACO from the command line as follows (from either the downloaded platform-specific binary or building from the source code):
$ ./taco_run <options> <gtf_files.txt>
A detailed description of the parameters used to control TACO is below:
<gtf_files.txt>
This required input to TACO is a text file containing paths to files in the GTF 2.2 format, one per line, produced by an ab initio assembler. For example, the text file passed to TACO could look like this:
/path/to/my/assemblies/mcf7.gtf /path/to/my/assemblies/bt474.gtf /path/to/my/assemblies/hela.gtf
-h/--help
Prints the help message and exits
-v/--verbose
Enables detailed logging for debugging purposes
-p/--num-processes N
(default=1)
Run TACO in parallel mode with N processes
-o/--output-dir DIR
Directory where output files will be stored. This directory must not already exist.
--gtf-expr-attr ATTR
(default=FPKM)
GTF attribute field containing expression estimate. The default setting is
FPKM
for Cufflinks GTF input.
--filter-min-length N
(default 200bp)
Pre-filters input transfrags with length < N
prior to assembly. Set to 0
to disable this filter.
--filter-min-expr X
(default 0.5)
Pre-filters input transfrags with expression < X
. The units of the expression cutoff value X
correspond to the units specified by the gtf-expr-attr
parameter, which is FPKM
by default. Set
to 0.0
to disable this filter.
--isoform-frac FRAC
(default 0.05)
Report transcript isoforms with expression fraction >=FRAC
relative to the highest expressed gene. For each gene, the highest abundance isoform will be reported with a FRAC
of 1.0
.
TACO has several advanced options that can be used to configure specific aspects of its behavior, such as change point detection and splicing pattern network construction. These parameters have been chosen to optimize performance for human transcriptomes. We recommend that users leave these options at their default settings.
TACO writes output to the directory specified by the
-o
command line option. Within this directory, the
import output files are:
This GTF file contains TACO's assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized (“gene_id”, and “transcript_id”). There one GTF record per row, and each record represents either a transcript or an exon within a transcript. The columns are defined as follows:
Column number | Column name | Example | Description |
---|---|---|---|
1 | seqname | chrX | Chromosome or contig name |
2 | source | taco | The name of the program that generated this file (always taco) |
3 | feature | exon | The type of record (always either “transcript” or “exon”. |
4 | start | 77696957 | The leftmost coordinate of this record (where 1 is the leftmost possible coordinate) |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 77712009 | The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM) |
7 | strand | + | TACO's guess for which strand the isoform came from. Always one of “+”, “-“, “.” |
7 | frame | . | TACO does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used. |
8 | attributes | … | See below. |
Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
---|---|---|
gene_id | G7 | TACO gene id |
transcript_id | TU56 | TACO transcript id |
locus_id | L1 | TACO locus id |
tss_id | TSS31 | TACO transcription start site id |
expr | 2.441 | Isoform-level abundance. The units correspond to the expression units of the input transfrags (usually FPKM or TPM) |
rel_frac | 0.7647 | Relative abundance of isoform compared to the major isoform in the gene. The most abundant isoform for each gene is assigned a rel_frac of 1.0. Minor isoforms are scored by the ratio (minor expr/major expr) |
abs_frac | 0.7647 | Relative abundance of isoform compared to the total expression of all isoforms in the gene. Isoforms are scored by the ratio (expr / sum(expr(x) for each isoform x)). |
This BED file contains TACO's assembled isoforms. Please refer to
the
UCSC genome browser's detailed description of the BED
format. The name
column (Column 4) contains
a string of the format gene_id|transcript_id(expr)
,
e.g. G7|TU56(2.77)
.
TACO outputs 3 bedGraph files with the coverage profile of the
input transfrags on the forward, reverse, and unknown/unspecified
strands. Please refer to the
UCSC genome browser's detailed description of the bedGraph
format. These files can be converted to
bigWig format
using the free conversion tool bedGraphToBigWig
for
viewing on genome browsers such as IGV or UCSC.
A UCSC BED track of junctions reported by TACO. Each junction consists of two connected BED blocks. The score is the sum of the expression values of transfrags supporting the junction. This file can be converted to bigBed track format for viewing on genome browsers such as IGV or UCSC.
This tool can take any assembly in GTF format and compare to a reference, and provide details on overlap with reference transcripts, nature of that overlap, and can also run the CPAT tool to determine coding potential.
We recommend using the GENCODE reference, which can be found here.
Run reference comparison tool as follows:
$ ./taco_refcomp -o <output_directory> -r <reference_gtf> -t <test_gtf> --cpat (optional flag to run coding potential prediction)
-h/--help
Prints the help message and exits
-v/--verbose
Enables detailed logging for debugging purposes
-p/--num-processes N
(default=1)
Run TACO-refcomp in parallel mode with N processes. Note that parallelization done at chromosome level.
-r REF
Path to reference gtf to be used in comparison."
-t TEST
Path to gtf being tested (e.g., a TACO produced meta-assembly)."
-o/--output-dir DIR
Directory where output files will be stored.
--cpat
(default=false)
Run CPAT tool for coding potential scoring. CPAT function only supported for human, mouse, and zebrafish. The CPAT tool can take over an hour depending on the size of the transcriptome being analyzed.
--cpat-species (human, mouse, zebrafish)
(default=human)
Only needed if using --cpat
flag.
Species information needed to use approrpriate reference for CPAT algorithm
--cpat-genome GENOME
Only needed if using --cpat
flag.
Path to genome fasta file (GENOME
) that CPAT utilizes to generate sequence
from assembly. (Genome must be same as that used to generate assemblies)
Note: If --cpat
flag is used, CPAT(v1.2.2) will be run. Please cite the CPAT tool
(reference).
Q. Can you compare a TACO generated assembly to an annotated reference??
A. We provide a reference comparison tool (taco_refcomp
) that can be found bundled
with the taco
tool. This tool can be used to compare any transcriptome to a reference. The tool also
supports running the CPAT coding potential tool. If CPAT is used, please cite the
CPAT manuscript.
Q. How much memory does TACO require?
A. TACO memory usage scales with the size of the input and the number of parallel processes. Specifically, TACO partitions the input transfrags into independent loci, and transfrags within each loci are stored in memory. Running TACO with multiple processes allows it to assemble multiple loci in parallel at the expense of additional memory overhead. To examine peak memory usage, we tested TACO on >10,000 datsets. Some of the largest loci contained over 10 million transfrags each, and memory per process ranged from 3-10GB.
Q. How much time does TACO take to run?
A. TACO runtime also scales with the size of the input and the number of parallel processes. We can offer several examples: 1) TACO processed 55 breast cancer cell line datasets using a Mac OSX laptop computer with 4 cores in about 10 minutes, and 2) TACO processed ~1000 datasets from the CCLE on a 64-core Linux workstation in about 2 hours and 30 minutes.
Q. Why do I receive a WARNING message that my sort
command does not support the --parallel
flag?
A. TACO uses the operating system's command line sort
to concatenate and sort the input GTF files before beginning assembly. For large jobs (~1000 samples), sorting the GTF can consume a significant portion of
total runtime. Newer versions of GNU coreutils provide a parallelized version of sort. TACO checks to see if sort can be parallelized and defaults to single-threaded sort if this is not supported.
Please send any questions/comments regarding tool to yniknafs@med.umich.edu
.
If using TACO, please cite Niknafs et al., Nature Methods, 2017.
Niknafs, Y. S., Pandian, B., Iyer, H. K., Chinnaiyan, A. M., & Iyer, M. K. (2017). TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat Meth, 14(1), 68–70.
Cufflinks ab initio assemblies of the RNA-Seq datasets from the Cancer Cell Line Encyclopedia (CCLE) are provided for user download. RNA-Seq data were aligned using STAR prior to running Cufflinks.
Dataset | Samples | Size (Uncompressed) | Download Link |
---|---|---|---|
CCLE Breast Cancer Cell Lines | 55 | 285MB (4GB) | ccle_brca_cufflinks.tar.gz |
CCLE All Cell Lines | 935 | 4.4GB (80GB) | ccle_cufflinks.tar.gz |
GTEX Tissue | 51 | 273MB (5GB) | gtex_tissue.tar.gz |
GTEX Brain | 50 | 238MB (5GB) | gtex_brain.tar.gz |
First, untar/uncompress the downloaded dataset file:
tar -zxvf ccle_cufflinks.tar.gz
Files will uncompress into a directory e.g. ccle_cufflinks
. Within this directory is the file list_of_files.txt
which is a list relative paths to the GTF files.
Change your directory to uncompressed directory:
cd ccle_cufflinks
Run TACO with the following command:
./taco_run --gtf-expr-attr FPKM -o my_output_dir list_of_files.txt
Dataset | Reference | Download Link |
---|---|---|
Tissue - PacBio | Sharon et al., 2013 | tissue.pacbio.GRCh38.reference.gtf.tar.gz |
Brain - SRL-RNA-seq | Tilgner et al., 2015 | brain.slr.GRCh38.reference.gtf.tar.gz |
Copyright (c) 2015-2018 Matthew Iyer, Yashar Niknafs, Balaji Pandian, Chinnaiyan Lab, University of Michigan Medical School. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
© 2017 Michigan Center for Translational Pathology, University of Michigan Health System