DISCLAIMER

The documentation is based on an interpretation of the java code and MAY contain errors. Also to simplify descriptions below some low level details of the algorithm have been omitted.

Overview

CAMERA blast uses NCBI blast (currently 2.2.18) to perform the alignments. To improve performance, both the query and reference/subject sequences are split into partitions. NCBI blast jobs are run on every combination of query and subject sequences utilizing various compute clusters. Code developed in house is then used to merge the outputs of the individual NCBI blasts into a single result. This document describes the algorithm used by CAMERA.

Partitioning of Reference/Subject sequences

Sequences are partitioned into 90 megabase bins for both nucleotide and proteins.

Partitioning of Query sequences

Splitting of query sequences are done by the following rules when reading through the user query fasta file. The algorithm starts writing sequences to the first file of query sequences switching to another file if 10,000 sequences have been written or the number of Bases in the partition multiplied by three times the average length of sequences in the partition plus 1,000 times the number of db alignments requested exceeds 200,000,000

In Pseudo code:

numBases = Number of Basepairs in current partition
avgSeqLen = Average length of sequences in current partition
dbAlignsRequested = Number of Database alignments requested

Start new partition if numBases x (3 x avgSeqLen + 1000) x dbAlignsRequested > 200,000,000

NCBI Blast job generation

An NCBI Blast job is generated for every combination of query sequence files to subject sequence files.

For Example:

If there are 3 query sequence files (q1,q2,q3) and two subject sequences databases/files (r1,r2) Then the following blast jobs would be generated:

blastall -d r1 -i q1
blastall -d r1 -i q2
blastall -d r1 -i q3
blastall -d r2 -i q1
blastall -d r2 -i q2
blastall -d r2 -i q3

Running jobs on compute resources

The NCBI Blast jobs above are submitted as individual jobs to CAMERA's internal compute cluster or to Triton. The output of the blast is compressed into a binary format and persisted to a shared filesystem.

Merge of Blast output

To get a single blast like output the alignments of the individual blast job outputs must be merged together. This merging phase requires sorting and elimination of lower quality hits. The following paragraphs describe the merge algorithm used by CAMERA blast.

The merge algorithm works by iterating over the individual blast outputs of all alignments for each query sequence partition file. Each hit is binned into an array of hits by the query sequence id. The hit is only added to the array if the array size is less then the maximum number of alignments specified (-b parameter) or if the hit has a better score(see scoring rules below) then the current worst hit in the array. At the completion of parsing a given blast output file the arrays of all hits are sorted and if necessary truncated to maximum number of alignments (-b parameter). After all the blast output files have been processed the hits are put into a single array and sorted again by (see scoring rules below) and written out to a binary file. The algorithm then iterates through the binary files writing out a master blast like xml file output.

Scoring rules

Sorting of hits uses the following cascade of rules where if the current comparison matches the code goes to the next rule. Query strings is included in the comparison for sorting the whole list of alignments which contain multiple query sequences.

1. Compare queryId and If they differ the better alignment is the one where queryId comes lexigraphically first.

2. Compare evalue and if they differ the alignment with smaller value is considered better.

3. Compare bitscore and if they differ the alignment with the larger value is considered better.

4. Compare lengthAlignment and if they differs the alignment with the larger value is considered better.

5. Compare numberSimilar and if they differ the alignment with the larger value is considered better.

6. Compare queryBegin and if they differ the alignment with the smaller value is considered better.

7. Compare subjectBegin and if they differ the alignment with the smaller value is considered better.

8. Compare queryEnd and if they differ the alignment with the smaller value is considered better.

9. Compare subjectEnd and if they differ the alignment with the smaller value is considered better.

10. If all above comparisons are equal order is entirely dependent on order read in from blast outputs

Values below in <...> denote xml tags in blast output and implies value of variable is the characters within the tag.

queryId = <Iteration_query-ID>
evalue = <Hsp_evalue>
bitscores = <Hsp_bit-score>
lengthAlignment = <Hsp_align-len>
numberSimilar = <Hsp_positive> - <Hsp_identity> (Subtract Hsp-positive from Hsp_identity if set)
queryBegin = <Hsp_query-from> subtract 1 and swapped with <Hsp_query-to> if orientation is reversed
subjectBegin = <Hsp_hit-to> subtract 1 and swapped with <Hsp_hit-from> if orientation is reversed
queryEnd = <Hsp_query-to> subtract 1 and swapped with <Hsp_query-from> if orientation is reversed
subjectEnd = <Hsp-hit-to> subtract 1 and swapped with <Hsp-hit-from> if orientation is reversed