CAFE (Co-Assembly of stranded and unstranded RNA-seq data Followed by End-correction) is a high-performance pipeline for transcriptome assembly. CAFE predicts the directionality of unstranded RNA-seq reads using a kMC (k-order Markov Chain) model and generates RPDs (reads with a predicted direction), which increases both the sensitivity and the specificity of the assembly. Full-length transcripts are then obtained by updating exon junctions with exon-junction reads and by calculating maximum entropy scores from putative splicing signals. To improve annotations at transcript boundaries, transcription start sites (TSSs), determined from CAGE-seq, and cleavage and polyadenylation sites (CPSs), determined from 3P-seq, are incorporated into the relevant transcripts. CAFE should not only help to build comprehensive, precise transcriptome maps from complex genomes but also expand the universe of non-coding genomes.
|Species||-g <species>||Species used for assembly (either 'human' or 'mouse')|
|Assembly||-v <assembly>||Assembly version of the reference genome (e.g., 'hg19' or 'mm9')|
|Chromosome||-c <chromosome>||Chromosome to assemble (e.g., 'all' or 'chr1')|
|Type||-t <type>||Type of input BAM files. If you want to perform co-assembly with unstranded and stranded RNA-seq reads, specify '-t npsp'. (npsp: co-assembly of unstranded and stranded reads, np: assembly of unstranded reads alone, sp: assembly of stranded reads alone)|
|Unstranded.bam||-u <unstranded.bam>||Input BAM file with unstranded RNA-seq reads for assembly (not compatible with '-t sp')|
|Stranded.bam||-s <stranded.bam>||Input BAM file with stranded RNA-seq reads for assembly (not compatible with '-t np')|
|Outputdir||-o <outputdir>||Sets the output directory where CAFE will write all result files|
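The compatibility rules between '-t', '-u' and '-s' in the table above can be checked before a long run is started. The following is a minimal sketch, not part of CAFE itself: the function name `check_inputs` is hypothetical, and the rule that each type also requires its matching BAM file is an assumption inferred from the option descriptions.

```python
def check_inputs(assembly_type, unstranded_bam=None, stranded_bam=None):
    """Return a list of problems with the -t/-u/-s combination (empty if none)."""
    # Which inputs each -t value uses, as read from the options table.
    # Assumption: a type that uses an input also requires it.
    required = {
        "npsp": {"-u": True, "-s": True},
        "np": {"-u": True, "-s": False},
        "sp": {"-u": False, "-s": True},
    }
    if assembly_type not in required:
        return [f"unknown type {assembly_type!r}; expected one of npsp, np, sp"]

    given = {"-u": unstranded_bam, "-s": stranded_bam}
    problems = []
    for flag, needed in required[assembly_type].items():
        if needed and not given[flag]:
            problems.append(f"'-t {assembly_type}' needs {flag}")
        elif not needed and given[flag]:
            problems.append(f"{flag} is not compatible with '-t {assembly_type}'")
    return problems
```

An empty list means the combination is consistent with the table; otherwise each entry names the offending option.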
More detailed parameters can be modified by editing the config file in the source code. However, this is not recommended, since these parameters have already been optimized.
The CAFE source package provides example files for working through the CAFE pipeline.
The ‘example’ directory contains BAM files with unstranded (GEO accession numbers: GSM591659 and GSM591682) and stranded (GEO accession numbers: GSM546921, GSM546927, GSM591670, and GSM591671) RNA-seq reads sequenced from HeLa-S3 cells.
Run CAFE from the command line as below:
python cafe.py -g human -v hg19 -c chr22 -t npsp -u ./data/example/unstranded/hela_chr22.bam -s ./data/example/stranded/hela_chr22.bam -o ./results/
As it runs, CAFE prints its progress to standard output at each step. All result files produced by CAFE are stored in the ‘results’ directory.
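Because progress is reported only on standard output, it can be useful to keep a copy of it when CAFE is launched from a script. The sketch below is an illustration under stated assumptions, not part of the CAFE package: `build_cafe_command` and `run_cafe` are hypothetical helper names, the log path is chosen by the caller, and the script is assumed to be started from the directory that contains cafe.py. Only the options listed in the table above are used.

```python
import subprocess
import sys


def build_cafe_command(species, assembly, chromosome, assembly_type,
                       output_dir, unstranded_bam=None, stranded_bam=None):
    """Assemble the cafe.py argument list from the documented options."""
    cmd = ["python", "cafe.py",
           "-g", species, "-v", assembly,
           "-c", chromosome, "-t", assembly_type]
    if unstranded_bam:
        cmd += ["-u", unstranded_bam]
    if stranded_bam:
        cmd += ["-s", stranded_bam]
    cmd += ["-o", output_dir]
    return cmd


def run_cafe(cmd, log_path):
    """Run the command, echoing its output and copying it to log_path."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        for line in proc.stdout:
            sys.stdout.write(line)  # show progress as it is printed
            log.write(line)         # keep a copy for later inspection
        return proc.wait()          # exit status of the process
```

With the example data, `build_cafe_command("human", "hg19", "chr22", "npsp", "./results/", "./data/example/unstranded/hela_chr22.bam", "./data/example/stranded/hela_chr22.bam")` yields the same arguments as the command line shown above, and passing the result to `run_cafe` together with a log file path runs it while saving the progress messages.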