Workflows

This page documents the workflows available in the CAMERA Portal. Additional information is available after logging in to the CAMERA Portal.



Data Preparation

QC Filter

Each base in a read has an associated quality score Q, defined as Q = -10*log10(p), where p is the probability that the base call is wrong. The average score across a read gives a quick measure of that read's overall quality. The "Quality Control Filter" workflow takes FASTA and QUAL files, or a FASTQ file, as input; calculates the average score for each read; keeps the high-quality reads; filters out reads shorter than the minimum read length; and generates summary statistics for the input reads.

Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.
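
The filtering logic described for the QC Filter can be sketched in Python as follows. This is a minimal illustration, not the CAMERA implementation: the function names, the in-memory read format, and both threshold values are assumptions chosen for the example.

```python
import math

def error_prob_to_q(p):
    """Quality score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def mean_quality(quals):
    """Arithmetic mean of the per-base quality scores of one read."""
    return sum(quals) / len(quals) if quals else 0.0

def qc_filter(reads, min_mean_q=20, min_length=50):
    """Keep reads that meet both thresholds.

    reads: iterable of (read_id, sequence, list_of_quality_scores).
    min_mean_q and min_length are illustrative values, not CAMERA defaults.
    """
    kept = []
    for read_id, seq, quals in reads:
        if len(seq) < min_length:
            continue  # shorter than the minimum read length
        if mean_quality(quals) < min_mean_q:
            continue  # average quality too low
        kept.append((read_id, seq, quals))
    return kept

reads = [
    ("r1", "ACGT" * 20, [30] * 80),  # long enough, high quality
    ("r2", "ACGT" * 20, [10] * 80),  # long enough, low quality
    ("r3", "ACGT" * 5, [35] * 20),   # high quality, too short
]
print([r[0] for r in qc_filter(reads)])  # prints ['r1']
```

A score of Q = 30 corresponds to p = 0.001, that is, one expected error per 1000 base calls.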

454 Duplicate Clustering

This workflow identifies duplicates among 454 reads, including exact duplicates and near-identical duplicates. In metagenomic samples these duplicates are mostly sequencing artifacts and should therefore be removed. Most duplicates in transcriptomic reads, however, are not artifacts, so running this workflow on transcriptomic datasets is not recommended.

Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.
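
As a rough illustration of the exact-duplicate case only, the Python sketch below groups reads that share an identical sequence and keeps the first read of each group as its representative. The function name and input format are assumptions made for this example. Detecting near-identical duplicates requires alignment-based clustering, which this sketch does not attempt.

```python
from collections import defaultdict

def group_exact_duplicates(reads):
    """Group read IDs by identical sequence (case-insensitive).

    reads: iterable of (read_id, sequence).
    Returns (representatives, clusters), where representatives holds the
    first read seen for each distinct sequence and clusters maps each
    representative ID to the IDs of all reads with that sequence.
    """
    first_id = {}                 # sequence -> ID of its representative
    clusters = defaultdict(list)  # representative ID -> member IDs
    representatives = []
    for read_id, seq in reads:
        key = seq.upper()
        if key not in first_id:
            first_id[key] = read_id
            representatives.append((read_id, seq))
        clusters[first_id[key]].append(read_id)
    return representatives, dict(clusters)

reps, clusters = group_exact_duplicates(
    [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
)
print(reps)
print(clusters)
```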

BLAST

BLASTn

BLASTN searches nucleotide databases using a nucleotide query. 

CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. 
Note: Using the following advanced parameters will yield results identical to those from BLAST on the NCBI web site.
Match reward: 2 
Mismatch penalty: -3 
Gap open cost: 5 
Gap extend cost: 2 
Only the top alignment per hit will be kept for blast jobs when (CAMERA_REF)NCBI Refseq Genomes (N) is used as the reference data set.
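
For readers running legacy BLAST locally, the BLASTn values listed above might be passed to blastall as in the Python sketch below. The flag letters (-p, -i, -d, -o, -r, -q, -G, -E) are those of the legacy blastall command line and should be checked against the installed version; the file and database names are hypothetical, and nothing here describes how the CAMERA Portal invokes BLAST internally.

```python
def blastn_command(query, database, output):
    """Build a legacy blastall argument list for a blastn search."""
    return [
        "blastall",
        "-p", "blastn",  # program
        "-i", query,     # query FASTA file (hypothetical name)
        "-d", database,  # formatted nucleotide database (hypothetical name)
        "-o", output,    # report file (hypothetical name)
        "-r", "2",       # match reward
        "-q", "-3",      # mismatch penalty
        "-G", "5",       # gap open cost
        "-E", "2",       # gap extend cost
    ]

cmd = blastn_command("reads.fasta", "refdb", "hits.txt")
print(" ".join(cmd))
# To execute (requires legacy NCBI BLAST to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```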

BLASTp

BLASTP searches protein databases using a protein query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site. 

Gap open cost: 11
Gap extend cost: 1 

BLASTx

BLASTX searches protein databases using a translated nucleotide query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site. 

Gap open cost: 11
Gap extend cost: 1 

MEGA Blast

Megablast is intended for comparing a query to closely related sequences. It is very fast and works best when the target percent identity is 95% or more. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search.

Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site. 
Mismatch penalty: -2
Only the top alignment per hit will be kept for blast jobs when (CAMERA_REF)NCBI Refseq Genomes (N) is used as the reference data set.

TBLASTn

TBLASTN searches translated nucleotide databases using a protein query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.

Gap open cost: 11
Gap extend cost: 1

TBLASTx

TBLASTX searches translated nucleotide databases using a translated nucleotide query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.

Gap open cost: 11
Gap extend cost: 1

BLAST Kegg

This workflow uses BLAST to search protein sequences against the KEGG protein database. The KEGG number and its pathway/functions will be returned. Note: This workflow does not have a graphical output, but the results can be downloaded to your computer for viewing and analysis. 

DNA Clustering

This workflow uses the cd-hit-est program to cluster DNA sequences. The non-redundant sequences, cluster file, cluster distribution, and cluster table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

RNA Prediction

rRNA prediction by hmmer

This workflow uses the hmmer 3.0 program to predict rRNA sequences from input DNA reads. The predicted rRNAs, masked input sequences, and predicted rRNA coordinate table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

rRNA prediction by blastn

This workflow uses the blastn program to predict rRNA sequences from input DNA reads. The predicted rRNAs, masked input sequences, and predicted rRNA coordinate table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

tRNA prediction

This workflow uses the tRNAscan-SE program to predict tRNA sequences from input DNA reads. The predicted tRNAs, masked input sequences, and predicted tRNA coordinate table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Clustering

DNA Clustering

This workflow uses the cd-hit-est program to cluster DNA sequences. The non-redundant sequences, cluster file, cluster distribution, and cluster table are produced as output.

Protein clustering

This workflow uses the cd-hit program (default sequence identity cutoff = 0.9) to cluster protein sequences in a single step. The non-redundant sequences, cluster file, cluster distribution, and cluster table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Hierarchical protein clustering

This workflow uses the cd-hit program to cluster protein sequences in two steps. The first step clusters the sequences at the default sequence identity cutoff of 0.9. The second step runs cd-hit again on the results of the first step, with a default sequence identity cutoff of 0.6. The non-redundant sequences, cluster file, cluster distribution, and cluster table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
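
The two-step procedure could be reproduced locally along the lines of the Python sketch below, which only builds the two cd-hit command lines. It assumes the standard cd-hit options -i (input), -o (output), -c (identity cutoff), and -n (word length), with the word lengths 5 and 4 that the cd-hit documentation suggests for cutoffs of 0.9 and 0.6. The file names are hypothetical, and how the CAMERA workflow actually invokes cd-hit is not stated on this page.

```python
def cdhit_command(infile, outfile, identity, word_length):
    """Build a cd-hit argument list for one clustering step."""
    return [
        "cd-hit",
        "-i", infile,            # input FASTA of protein sequences
        "-o", outfile,           # representative sequences (clusters go to outfile.clstr)
        "-c", str(identity),     # sequence identity cutoff
        "-n", str(word_length),  # word length suited to the cutoff
    ]

def hierarchical_commands(infile, prefix):
    """Return the two commands: 0.9 first, then 0.6 on the step-one output."""
    step1_out = f"{prefix}_90"
    step2_out = f"{prefix}_60"
    return [
        cdhit_command(infile, step1_out, 0.9, 5),
        cdhit_command(step1_out, step2_out, 0.6, 4),
    ]

for cmd in hierarchical_commands("orfs.faa", "orfs"):
    print(" ".join(cmd))
```

The representatives written by the first step are the input to the second, so the second step only has to compare sequences that were not already merged at 90% identity.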

454 Duplicate Clustering

This workflow identifies duplicates among 454 reads, including exact duplicates and near-identical duplicates. In metagenomic samples these duplicates are mostly sequencing artifacts and should therefore be removed. Most duplicates in transcriptomic reads, however, are not artifacts, so running this workflow on transcriptomic datasets is not recommended.

Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.

Sequence Assembly

Assembly

This workflow assembles 454 reads using a meta-assembler developed by CAMERA. The meta-assembler first runs a set of assembly programs to generate a pool of contigs, then re-assembles those contigs into the final results. Our analysis showed that the meta-assembler performs better than any of its component assembly programs.

Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.

Orf Prediction

Orf finder by fraggene_scan

This workflow uses the fraggene_scan program to predict ORFs from input DNA reads. The ORF sequences and predicted ORF coordinate table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Orf finder by metagene

This workflow uses the metagene program to predict ORFs from input reads. The ORF sequences and predicted ORF coordinate table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Orf finder by six-reading-frame

This workflow uses the six-reading-frame translation technique to predict ORFs from input DNA reads. The ORF sequences and predicted ORF coordinate table are produced as output.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
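
Six-reading-frame translation can be sketched in Python as below: the read is translated in three frames on the given strand and three on its reverse complement, and each stop-free stretch is reported as a candidate ORF. The sketch assumes the standard genetic code, treats an ORF as any stretch between stop codons (or read ends) of at least min_aa residues, and uses an illustrative min_aa default. The exact ORF definition and cutoffs used by the CAMERA workflow are not stated on this page.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_orfs(seq, min_aa=30):
    """Yield (frame, start, end, peptide) for each stop-free stretch.

    Frames +1..+3 are on the given strand, -1..-3 on the reverse complement.
    start and end are 0-based residue offsets within that frame's translation.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(s[offset:])
            start = 0
            for piece in protein.split("*"):
                if len(piece) >= min_aa:
                    yield (f"{strand}{offset + 1}", start, start + len(piece), piece)
                start += len(piece) + 1  # skip past the stop codon

for orf in six_frame_orfs("ATGAAATAGCCC", min_aa=2):
    print(orf)
```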

Functional Annotation

Metagenomic data annotation and clustering

This is the full RAMMCAP pipeline for analysis of metagenomic sequences. It accepts a FASTA file of raw reads. The pipeline identifies the tRNA, rRNA, and ORFs from the reads. It then performs clustering analysis on the reads and the ORFs. The ORFs are annotated against PFAM, TIGRFAM, and COG.

Function annotation by PFAM

This workflow uses the hmmer 3.0 program to assign functional annotation to protein sequences. It is based on the PFAM database.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Function annotation by COG

This workflow uses the rpsblast program to assign functional annotation to protein sequences. It is based on the COG database for prokaryotic proteins.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Function annotation by KOG

This workflow uses the rpsblast program to assign functional annotation to protein sequences. It is based on the KOG database for eukaryotic proteins.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Function annotation by TIGRFAM

This workflow uses the hmmer 3.0 program to assign functional annotation to protein sequences. It is based on the TIGRFAM database.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Function annotation by NCBI PRK

This workflow uses the rpsblast program to assign functional annotation to protein sequences. It is based on the PRK database for Reference Sequence proteins.

Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.

Diversity

Alpha Diversity (Rohwer)

This workflow employs PHACCS, developed by the Rohwer lab, to estimate the structure and diversity of a viral community from the contig spectrum calculated from the metagenomic sequences of that single community. It accepts one FASTA file of viral nucleotide sequences per community.

Gamma Diversity (Rohwer)

This workflow employs PHACCS, developed by the Rohwer lab, to estimate the overall structure and diversity of combined viral communities from the contig spectrum calculated from metagenomic sequences obtained from multiple viral communities. It accepts two to five FASTA files of viral nucleotide sequences.
