
Overview

In the old system, a dataset could not be released until all processing involving that dataset had completed. In many cases this delayed the release by several days while large jobs ran on the dataset. This document describes an approach that uses a directory naming convention together with symlinks to eliminate this problem and to provide versioning. The basic idea is that when a dataset is deployed, it is written to a folder with a version stamp in its name. The stamp is applied by appending .# to the dataset name, where # is the build number.

Symlink Example

Below is an example of a symlink (123) under blast-data/system that points to version 1 of dataset 123:


lrwxrwxrwx     1 jboss  webservers          68 Apr 19 14:04 123 -> /.../release/blast-data/system/123.1

The symlink target should be the full path to the versioned dataset.

Order of operations
  1. Create a new dataset folder blast-data/system/#.#, where #.# is the dataset name followed by the new build number.
  2. Fill the #.# folder with the new dataset release.
  3. Repoint the symlink # from the previously released dataset to the new #.# dataset.
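
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual deployment tooling: the function name `release_dataset`, its parameters, and the idea of copying from a pre-built staging folder (`staged_dir`) are all hypothetical. The sketch builds the new symlink under a temporary name and renames it over the old one, since a rename replaces the link in a single step on POSIX file systems.

```python
import os
import shutil


def release_dataset(system_dir, dataset, build, staged_dir):
    """Deploy staged_dir as <dataset>.<build> and repoint the <dataset> symlink.

    All names here are hypothetical; staged_dir is assumed to hold a
    complete, already-built release of the dataset.
    """
    system_dir = os.path.abspath(system_dir)
    versioned = os.path.join(system_dir, f"{dataset}.{build}")
    link = os.path.join(system_dir, str(dataset))

    # Steps 1 and 2: create the versioned folder and fill it with the release.
    # copytree creates the folder and raises if it already exists, so an
    # existing build is never overwritten.
    shutil.copytree(staged_dir, versioned)

    # Step 3: repoint the symlink. The new link is created under a temporary
    # name (with the full path as its target) and then renamed over the old
    # link, so the dataset name always resolves to some complete version.
    tmp_link = f"{link}.tmp.{os.getpid()}"
    os.symlink(versioned, tmp_link)
    os.replace(tmp_link, link)
    return versioned
```

Calling `release_dataset` once per build leaves earlier versioned folders in place; removing old builds is outside the scope of this sketch.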

readme file in #.# folders

Every dataset should have a readme file containing the properties shown below. The blast application uses this file to get information about a reference dataset without having to hit the database. This will be important for future user-uploaded blast databases.

Below is an example readme file for node 10120. The important fields are Sequence_length, Sequence_count, and isProtein.

Node_id=10120
Name="Bioreactor: All Metagenomic Sequence Reads (N)"
Build_id=22
Sequence_length=68588673
Partitions=1
Sequence_count=177294
isProtein=false
isReferenceSet=false
rebuild_time="Mon May 16 11:11:55 PDT 2011"
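
As an illustration of how an application might read these properties without a database call, here is a minimal Python sketch. The function names `parse_readme` and `typed_fields` are hypothetical, and the parsing rules (one key=value pair per line, optional double quotes around the value) are assumptions inferred from the example above rather than a documented format.

```python
def parse_readme(text):
    """Parse key=value lines from a dataset readme into a dict of strings."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first '=' only, so values may contain '=' themselves.
        key, value = line.split("=", 1)
        value = value.strip()
        # Assumption: quoted values use plain double quotes, as in the example.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        props[key.strip()] = value
    return props


def typed_fields(props):
    """Convert the three fields called out as important to Python types."""
    return {
        "sequence_length": int(props["Sequence_length"]),
        "sequence_count": int(props["Sequence_count"]),
        "is_protein": props["isProtein"].lower() == "true",
    }
```

Keeping the raw string dict separate from the typed view means unknown or future properties are preserved without the parser needing to know about them.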

testblast(p|n)_query.fasta file

To test the database, each node should contain a testblast(p|n)_query.fasta file holding the first sequence from each partition. If a partition's sequence is longer than 1,000 bases, it should be truncated. The n stands for nucleotide and the p for peptide. The jcviblast.sh command-line program can then use this file to run a blast against the node and verify that the database is valid.

Example testblastn_query.fasta for node 10120. For 10120 there is only one database file, p_0.fasta, so the file contains only one sequence. If there were multiple database files, this file would contain multiple sequences.

>gnl|BL_ORD_ID|0 CAM_READ_0071668677 /library_id="CAM_LIB_A00001" /sample_id="CAM_S_423" raw_id=GD7Y0W001CAXCY length=477 xy=0827_1744 region=1 run=R_2010_03_19_13_13_34_
AGTGAACTCCGAAAACTTTAATCGGATCGACCAATTAAAATTATTTATTGTCCATATAGACCAGCTATCCTTGGACGAAA
TATTTCCTTTAAAGACTCTCAACAAAAAGATAAAGAGTTTTTTGCTAGTATTTGCCTAAGATACAATATTGGTTTCATTG
ATCTGACCGAACGTTTTAATTATTTTTTTCTTAATACGATGAAGTTTCCACGAGGTTTTGCAAACTCATTTCCTGGGCGA
GGTCACCTTAATGCTGATGGTCATCGTCTTGTTGCTGAGGCCATATTTTCGAATGAAGTGCATCGTAATTCGCTAAAGCA
GTAAATCATAATGTCCTTATCTCAATTTGAATTTATCCTTTTTCTTATTCTGGTTGCATTTGCTCTTTTTGTTCAAAGAT
AACGCATCTCGGAAGTGGATTCTTTTGACAGCGGGTATCTGCTTTTACTGCTACTGGACTACCGATTCTGTCTGTTG
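
A query file like the one above could be generated along the following lines. This Python sketch is an assumption about how such a file might be built, not the existing build code: the function names are hypothetical, truncation is assumed to mean keeping the first 1,000 characters of the sequence, and the 80-column line width is an illustrative default.

```python
def first_fasta_record(path):
    """Return (header, sequence) for the first record in a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    break  # reached the second record, stop reading
                header = line
            elif header is not None:
                chunks.append(line.strip())
    if header is None:
        raise ValueError(f"no FASTA record found in {path}")
    return header, "".join(chunks)


def write_test_query(partition_paths, out_path, max_len=1000, width=80):
    """Write the first sequence of each partition, truncated to max_len."""
    with open(out_path, "w") as out:
        for path in partition_paths:
            header, seq = first_fasta_record(path)
            seq = seq[:max_len]  # assumption: keep only the leading max_len characters
            out.write(header + "\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

The header line is copied unchanged, so a truncated record may still carry a length annotation that describes the original sequence.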

Issues/Tasks
  • Some logic must be in place to prevent the classic blast application from examining the symlink at the exact moment it is being moved, although this is unlikely to occur because the move is very fast.
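
One possible mitigation on the reading side, offered here only as a sketch and not as a decided design, is for a job to resolve the symlink once when it starts and to use the resolved versioned path for the rest of its run. A repoint that happens mid-job then cannot switch the job to a different build. The helper name `pin_dataset` is hypothetical.

```python
import os


def pin_dataset(link_path):
    """Resolve the dataset symlink once and return the versioned directory."""
    resolved = os.path.realpath(link_path)
    if not os.path.isdir(resolved):
        raise FileNotFoundError(f"{link_path} does not resolve to a directory")
    return resolved
```

A job would call `pin_dataset` on the dataset symlink at startup and pass the returned path to every later file access, rather than going back through the symlink.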