| blast_datatable {rCRUX} | R Documentation |
Given a datatable with the column names of the datatable returned by
get_seeds_remote() or get_seeds_local(), uses a random
stratified sample based on taxonomic rank to iteratively process the data.
The randomly sampled entries are sent to
run_blastdbcmd_blastn_and_aggregate_resuts(), which uses blastdbcmd
to convert entries into fasta files and passes them to blastn to query local
blast-formatted databases with those sequences. It compiles the results
of blastn into a data.frame that it cleans and returns with taxonomy added
using get_taxonomizr_from_accession. Additionally, it saves its
state as text files in a specified directory with each iteration, allowing
the user to restart an interrupted run of blast_seeds().
blast_datatable(
blast_seeds,
save_dir,
blast_db_path,
accession_taxa_sql_path,
ncbi_bin = NULL,
force_db = FALSE,
sample_size = 1,
wildcards = "NNNN",
rank = "genus",
max_to_blast = 1000,
...
)
blast_seeds |
a data.frame formatted like the output from get_seeds_remote or get_seeds_local |
save_dir |
a directory in which to create files representing the current state |
blast_db_path |
a directory containing one or more blast-formatted databases. For multiple blast databases, separate them with a space and add an extra set of quotes (e.g. blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"') |
accession_taxa_sql_path |
a path to an sql created by
|
ncbi_bin |
the directory that the blast+ suite is in. If NULL, the program will use your PATH environmental variable to locate them |
force_db |
if TRUE, try to use blast databases that don't appear to be blast databases |
sample_size |
the number of entries to sample per rank before calling blastn - errors if not enough entries per rank (default = 1) |
wildcards |
a character vector representing the number of wildcards to discard (default = "NNNN") |
rank |
the column representing the taxonomic rank to randomly sample (default = genus) |
max_to_blast |
the maximum number of entries to accumulate into a fasta before calling blastn (default = 1000) |
... |
additional arguments passed to |
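The quoting required for multiple databases in blast_db_path is easy to get wrong by hand. The following is a minimal sketch, using only base R, of one way to build that string; the two paths are the placeholder examples from the argument description above, not real locations.

```r
# Placeholder database prefixes taken from the argument description;
# replace them with the paths to your own blast-formatted databases.
db_paths <- c("/my/ncbi_nt/nt",
              "/my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes")

# Join the paths with a single space, then wrap the whole string in literal
# double quotes so it carries the "extra set of quotes" described above.
blast_db_path <- paste0('"', paste(db_paths, collapse = " "), '"')

cat(blast_db_path, "\n")
```

With a single database no extra quoting is needed, and the plain path can be assigned directly.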
blast_datatable uses run_blastdbcmd_blastn_and_aggregate_resuts() to
run blastdbcmd and blastn to find sequences. It randomly samples rows of the
seed table based on the taxonomic rank supplied by the user. The user can
specify how many sequences can be blasted simultaneously using max_to_blast.
The random sample is split into subsets of at most max_to_blast entries for
blasting. Once all of the seeds of the random sample are processed, they are
removed from the data.frame, as are the seeds found as blast hits.
blast_datatable repeats this process of stratified random sampling until
there are fewer seeds remaining than max_to_blast, at which point it blasts
all remaining seeds. The final aggregated results are cleaned for multiple
blast taxids, hyphens, and wildcards.
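The sampling loop described above can be illustrated with a small base R sketch. This is not the package's implementation: the toy seed table, its column names, and the helper stratified_sample() are assumptions made for illustration, the blastn call is replaced by a stand-in, and the sketch caps each draw at the group size instead of raising an error.

```r
set.seed(1)  # only to make the illustration reproducible

# Toy seed table; the column names are assumptions, not the real schema.
seeds <- data.frame(
  accession = paste0("ACC", 1:8),
  genus = c("A", "A", "A", "B", "B", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw up to `sample_size` rows per value of `rank`.
stratified_sample <- function(df, rank, sample_size) {
  groups <- split(seq_len(nrow(df)), df[[rank]])
  picked <- lapply(groups, function(idx) {
    idx[sample.int(length(idx), min(sample_size, length(idx)))]
  })
  df[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

max_to_blast <- 2
processed <- character(0)
rounds <- 0

while (nrow(seeds) >= max_to_blast) {
  sampled <- stratified_sample(seeds, "genus", sample_size = 1)

  # Split the sample into chunks of at most max_to_blast entries.
  chunk_id <- ceiling(seq_along(sampled$accession) / max_to_blast)
  for (chunk in split(sampled$accession, chunk_id)) {
    processed <- c(processed, chunk)  # stand-in for one blastn call
  }

  # Drop the processed seeds; a real run would also drop seeds found as hits.
  seeds <- seeds[!seeds$accession %in% sampled$accession, , drop = FALSE]
  rounds <- rounds + 1
}

# Fewer seeds than max_to_blast remain: process whatever is left in one go.
processed <- c(processed, seeds$accession)
```

Because the sketch never removes seeds as blast hits, every seed is eventually sampled; in a real run, hits shrink the table faster and fewer rounds are needed.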
Note: The blast db downloaded from NCBI's FTP site has representative accessions, meaning identical sequences have been collapsed across multiple accessions even if they have different taxids. Here we identify representative accessions with multiple taxids, and unpack all of the accessions that were collapsed into that representative accession.
We are not identifying or unpacking the representative accessions that report a single taxid.
Saving data:
blast_datatable uses files generated in
run_blastdbcmd_blastn_and_aggregate_resuts() that store intermediate
results and metadata about the search to local files as it goes. This allows
the function to resume a partially completed blast, partially mitigating
the consequences of encountering an error or experiencing other interruptions.
Interruptions while blasting a subset of a random stratified sample will
result in a loss of the remaining reads of the subsample, and may decrease
overall blast returns. The local files are written to save_dir by
save_state(). Manually changing these files is not suggested as
it can change the behavior of blast_datatable.
Restarting an interrupted blast_seeds() run:
To restart from an incomplete blast_datatable, submit the previous command
again. Do not modify the paths specified in the previous command; parameter
arguments (e.g. rank, max_to_blast), however, can be modified. blast_datatable
will automatically detect save files and resume from where it left off.
Warning: If you are resuming from an interrupted blast, make sure you supply
the same data.frame for blast_seeds. If you intend to start a new blast,
make sure that there is no existing blast save data in the directory you
supply for save_dir.
Note: blast_datatable does not save intermediate data
from blastdbcmd, so if it is interrupted while building the fasta to
submit to blastn it will need to repeat some work when resumed. The argument
max_to_blast controls the frequency with which it calls blastn, so it can
be used to make blast_datatable save more frequently.
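The save-and-resume behavior can be pictured with a small base R sketch. It is only an illustration under stated assumptions: run_demo(), the file name demo_remaining.txt, and the stop_after argument (used to simulate an interruption) are all hypothetical and do not correspond to the files or functions the package actually uses.

```r
# Hypothetical save directory inside the session's temporary directory.
save_dir <- file.path(tempdir(), "blast_save_demo")
unlink(save_dir, recursive = TRUE)  # start the demo from a clean state
dir.create(save_dir, recursive = TRUE)

# Hypothetical stand-in for a resumable run. `stop_after` simulates an
# interruption after that many batches.
run_demo <- function(seeds, save_dir, max_to_blast, stop_after = Inf) {
  state_file <- file.path(save_dir, "demo_remaining.txt")

  # Resume from the saved state if it exists, otherwise use the full input.
  remaining <- if (file.exists(state_file)) readLines(state_file) else seeds

  batches <- 0
  while (length(remaining) > 0 && batches < stop_after) {
    n <- min(max_to_blast, length(remaining))
    remaining <- remaining[-seq_len(n)]  # stand-in for blasting one batch
    writeLines(remaining, state_file)    # checkpoint after every batch
    batches <- batches + 1
  }
  remaining
}

seeds <- paste0("ACC", 1:10)

# First call is "interrupted" after two batches of three seeds.
left <- run_demo(seeds, save_dir, max_to_blast = 3, stop_after = 2)

# Submitting the same command again picks up from the saved state.
left_after_resume <- run_demo(seeds, save_dir, max_to_blast = 3)
```

In this sketch a smaller max_to_blast means more batches and therefore more checkpoints, which mirrors the trade-off described above: less work is lost on interruption, at the cost of more frequent calls.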
A data.frame representing the output of blastn