Example: assemble phage genome

After installing Sprai, assemble a small genome.

For example, we showed how to assemble the phage genome.

Prepare data

Go to pacbiotoca wiki and download phage data ( http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz ).

mkdir tmp
cd tmp
wget http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz
tar xvzf sampleData.tar.gz

Convert fasta to fastq. You can use a fa2fq.pl script in Sprai.

fa2fq.pl sampleData/pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fq

Prepare Sprai

mkdir tmp2
cd tmp2
ln -s ../pacbio.filtered_subreads.fq all.fq
cp /your/sprai/directory/asm.spec .
cp /your/sprai/directory/ec.spec .

Open ec.spec and change values.

#>- params -<#
input_fastq all.fq
estimated_genome_size 50000
partition 12
evalue 1e-50
trim 42
ca_path /path/to/your/wgs/Linux-amd64/bin/
word_size 11
#>- params -<#

input_fastq is your input file name.

estimated_genome_size is the number of nucleotides of your target. If you do not know it, set large number. For example, set 1e+12.

partition is the number of processors Sprai uses.

evalue is used by blastn.

trim is the number of nucleotides Sprai cut from both sides of alignments.

word_size is used by blastn.

Correct errors & assemble

fs2ctg_v4.pl ec.spec asm.spec > log.txt 2>&1 &

Sprai gives ID to each read in all.fq and outputs c00.idfq.gz.

Sprai partitions all.fq to c00_00xx.fa.

c00.nin, c00.nhr and c00.nsq are output by makeblastdb.

Sprai aligns c00_00xx.fa to c00.nxx database by using blastn, correct errors and output c00_00xx.dfq.gz. dfq format contains the IDs of reads, bases, aligned depths and quality values. (Current Sprai outputs dummy quality values.)

Sprai cut low depth regions of each read, delete ‘-‘ characters to convert to FASTQ format and outputs corrected reads in c00.fin.idfq.gz.

Sprai extracts longest 20X reads of the estimated_genome_size from c00.fin.idfq.gz. And feed them to Celera Assembler.

Celera Assembler outputs files into c01.fin.top20x directory.

Find contigs

You will find contigs in a c01.fin.top20x/9-terminator/c01.fin.top20x.ctg.fasta file. Files in c01.fin.top20x are produced by the wgs-assembler (Celera Assembler). Read the wgs-assembler document for details.

Get N50 value

If you install Statistics::Descriptive module to a directory Celera Assembler can see, Celera Assembler outputs N50 value and so on in a do_c01.fin.top20x.log file. You will find lines in the file like below:

[Contigs]
TotalContigsInScaffolds         1
TotalBasesInScaffolds           50618
TotalVarRecords                 0
MeanContigLength                50618
MinContigLength                 50618
MaxContigLength                 50618
N25ContigBases                  50618
N50ContigBases                  50618
N75ContigBases                  50618

These are the statistics of contigs of the assembly.