blast_datatable {rCRUX}    R Documentation

Controls the iterative blast search implemented by run_blastdbcmd_blastn_and_aggregate_resuts, cleans the output, and adds taxonomy

Description

Given a data.frame with the same column names as the one returned by get_seeds_remote() or get_seeds_local(), blast_datatable uses a random stratified sample based on taxonomic rank to process the data iteratively. The randomly sampled entries are sent to run_blastdbcmd_blastn_and_aggregate_resuts(), which uses blastdbcmd to convert the entries into fasta files and passes them to blastn to query local blast-formatted databases with those sequences. It compiles the results of blastn into a data.frame that it cleans and returns with taxonomy added using get_taxonomizr_from_accession(). Additionally, it saves its state as text files in a specified directory with each iteration, allowing the user to restart an interrupted run of blast_seeds().

Usage

blast_datatable(
  blast_seeds,
  save_dir,
  blast_db_path,
  accession_taxa_sql_path,
  ncbi_bin = NULL,
  force_db = FALSE,
  sample_size = 1,
  wildcards = "NNNN",
  rank = "genus",
  max_to_blast = 1000,
  ...
)

Arguments

blast_seeds

a data.frame formatted like the output from get_seeds_remote or get_seeds_local

save_dir

a directory in which to create files representing the current state

blast_db_path

a directory containing one or more blast-formatted databases. For multiple blast databases, separate them with a space and add an extra set of quotes (e.g. blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"')

accession_taxa_sql_path

a path to the SQL database created by taxonomizr::prepareDatabase()

ncbi_bin

the directory that the blast+ suite is in. If NULL, the program will use your PATH environment variable to locate the blast+ programs

force_db

if TRUE, try to use blast databases that don't appear to be blast databases

sample_size

the number of entries to sample per rank before calling blastn - errors if not enough entries per rank (default = 1)

wildcards

a character vector representing the number of wildcards to discard (default = "NNNN")

rank

the column representing the taxonomic rank to randomly sample (default = genus)

max_to_blast

the maximum number of entries to accumulate into a fasta before calling blastn (default = 1000)

...

additional arguments passed to run_blastdbcmd_blastn_and_aggregate_resuts()
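
The quoting rule described for blast_db_path can be sketched as follows. This is a minimal illustration, assuming the two example database paths given above; the other paths in the commented call are hypothetical placeholders, and the call itself is left commented out because it needs rCRUX, the blast+ suite, and local databases.

```r
# Example database paths taken from the blast_db_path description above.
db_paths <- c(
  "/my/ncbi_nt/nt",
  "/my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"
)

# Join the paths with a single space, then wrap the result in an extra
# set of double quotes so the whole list travels as one string.
blast_db_path <- paste0('"', paste(db_paths, collapse = " "), '"')

cat(blast_db_path, "\n")

# Hypothetical call (not run here; "seeds" and both directory paths are
# placeholders, not values from the package documentation):
# results <- blast_datatable(
#   blast_seeds = seeds,
#   save_dir = "/my/save_dir",
#   blast_db_path = blast_db_path,
#   accession_taxa_sql_path = "/my/accessionTaxa.sql"
# )
```

For a single database no extra quoting is needed; the plain path string is passed as-is.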

Details

blast_datatable uses run_blastdbcmd_blastn_and_aggregate_resuts() to run blastdbcmd and blastn to find sequences. It randomly samples rows of the seed table based on the taxonomic rank supplied by the user. The user can specify how many sequences can be blasted simultaneously using max_to_blast; the random sample is split into subsets of at most that size for blasting. Once all of the seeds of the random sample are processed, they are removed from the data.frame, as are the seeds found as blast hits. blast_datatable repeats this process of stratified random sampling until there are fewer seeds remaining than max_to_blast, at which point it blasts all remaining seeds. The final aggregated results are cleaned for multiple blast taxids, hyphens, and wildcards.
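
The sampling loop described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the seed table, the helper stratified_sample(), and the counters are hypothetical, no blast is actually run, and the removal of seeds found as blast hits is omitted. Unlike the documented sample_size behaviour, the sketch caps the draw at the group size instead of raising an error.

```r
# Hypothetical seed table; only 'accession' and the rank column are used.
seeds <- data.frame(
  accession = paste0("ACC", 1:8),
  genus = c("Canis", "Canis", "Canis", "Felis", "Felis",
            "Ursus", "Ursus", "Ursus"),
  stringsAsFactors = FALSE
)

# Draw up to sample_size row positions from each group of the rank column.
stratified_sample <- function(df, rank, sample_size) {
  groups <- split(seq_len(nrow(df)), df[[rank]])
  unlist(lapply(groups, function(idx) {
    idx[sample.int(length(idx), min(sample_size, length(idx)))]
  }), use.names = FALSE)
}

set.seed(1)
max_to_blast <- 2
remaining <- seeds
rounds <- 0
total_blasted <- 0

while (nrow(remaining) > 0) {
  if (nrow(remaining) < max_to_blast) {
    # Fewer seeds left than max_to_blast: take everything that remains.
    picked <- seq_len(nrow(remaining))
  } else {
    picked <- stratified_sample(remaining, "genus", sample_size = 1)
  }

  # Split the sample into subsets of at most max_to_blast entries;
  # each subset stands in for one blastn call.
  chunks <- split(picked, ceiling(seq_along(picked) / max_to_blast))
  for (chunk in chunks) {
    total_blasted <- total_blasted + length(chunk)
  }

  # Remove the processed seeds before the next round.
  remaining <- remaining[-picked, , drop = FALSE]
  rounds <- rounds + 1
}
```

Because one entry per genus is drawn each round, well-represented genera are spread across several rounds rather than being blasted all at once.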

Note: The blast db downloaded from NCBI's FTP site has representative accessions, meaning identical sequences have been collapsed across multiple accessions even if they have different taxids. Here we identify representative accessions with multiple taxids, and unpack all of the accessions that were collapsed into that representative accession.

We are not identifying or unpacking the representative accessions that report a single taxid.
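
As a rough sketch of the first step in this note, the following flags hits whose taxid field lists more than one taxid. It assumes the semicolon-separated format that blastn uses for its staxids output field; the data.frame, its column names, and the accessions are made up for illustration and are not the package's internal representation.

```r
# Hypothetical blast hits; 'staxids' follows blastn's semicolon-separated
# format for subject taxids.
hits <- data.frame(
  accession = c("AB000001.1", "AB000002.1", "AB000003.1"),
  staxids = c("9606", "9606;9598", "10090"),
  stringsAsFactors = FALSE
)

# A representative accession with several taxids contains a semicolon.
has_multiple_taxids <- grepl(";", hits$staxids, fixed = TRUE)

# These accessions would be candidates for unpacking; single-taxid
# accessions are left untouched.
to_unpack <- hits$accession[has_multiple_taxids]
print(to_unpack)
```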

Saving data: blast_datatable relies on files generated in run_blastdbcmd_blastn_and_aggregate_resuts() that store intermediate results and metadata about the search as it goes. This allows the function to resume a partially completed blast, partially mitigating the consequences of encountering an error or experiencing other interruptions. Interruptions while blasting a subset of a random stratified sample will result in a loss of the remaining reads of the subsample, and may decrease overall blast returns. The local files are written to save_dir by save_state(). Manually changing these files is not suggested, as it can change the behavior of blast_datatable.

Restarting an interrupted blast_seeds() run: To restart from an incomplete blast_datatable, submit the previous command again. Do not modify the paths specified in the previous command; however, parameter arguments (e.g. rank, max_to_blast) can be modified. blast_datatable will automatically detect save files and resume from where it left off.

Warning: If you are resuming from an interrupted blast, make sure you supply the same data.frame for blast_seeds. If you intend to start a new blast, make sure that there is not existing blast save data in the directory you supply for save_dir.
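
Before starting a new blast, it may help to confirm that save_dir holds no leftover files. The check below is a minimal sketch: has_saved_state() is a hypothetical helper, not part of rCRUX, and it only tests whether the directory contains any files at all, since the names of the save files are not listed here.

```r
# Hypothetical helper: TRUE if the directory exists and is not empty.
has_saved_state <- function(dir) {
  dir.exists(dir) && length(list.files(dir)) > 0
}

# Demonstrate on a temporary directory standing in for save_dir.
save_dir <- tempfile("blast_save_")
dir.create(save_dir)
fresh_dir_has_state <- has_saved_state(save_dir)

# Simulate a leftover file from an earlier run.
writeLines("placeholder", file.path(save_dir, "leftover.txt"))
used_dir_has_state <- has_saved_state(save_dir)

if (used_dir_has_state) {
  message("save_dir is not empty: a new run would resume from these files.")
}

# Clean up the temporary directory.
unlink(save_dir, recursive = TRUE)
```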

Note: blast_datatable does not save intermediate data from blastdbcmd, so if it is interrupted while building the fasta to submit to blastn, it will need to repeat some work when resumed. The argument max_to_blast controls the frequency with which it calls blastn, so it can be used to make blast_datatable save more frequently.

Value

A data.frame representing the output of blastn


[Package rCRUX version 0.0.1.000 ]