| blast_seeds {rCRUX} | R Documentation |
blast_seeds uses blast_datatable() to search against a blast-formatted
database. It creates a permanent directory, blast_seeds_output, and
a temporary directory, blast_seeds_save, in output_directory_path. While the
run is in progress, it saves files to blast_seeds_save and passes them to
blast_datatable(). In its final steps, the function saves the final data,
recording the results of the blast, in blast_seeds_output.
blast_seeds(
seeds_output_path,
blast_db_path,
accession_taxa_sql_path,
output_directory_path,
metabarcode_name,
expand_vectors = TRUE,
minimum_length = 5,
maximum_length = 500,
warnings = 0,
...
)
seeds_output_path |
a path to a csv from get_seeds_local or get_seeds_remote (e.g. seeds_output_path <- '/my/rCRUX_output_directory/12S_V5F1_filtered_get_seeds_remote_output_with_taxonomy.csv') |
blast_db_path |
a directory containing one or more blast-formatted databases. For multiple blast databases, separate them with a space and add an extra set of quotes (e.g. blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"'). |
accession_taxa_sql_path |
a path to the accessionTaxa sql created by taxonomizr (e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql") |
output_directory_path |
a directory in which to save partial and complete output (e.g. "/path/to/output/12S_V5F1_local_111122_e300_111122"). |
metabarcode_name |
a prefix for the output fasta, taxonomy, and count of unique ranks (e.g. metabarcode_name <- "12S_V5F1"). |
expand_vectors |
logical, determines whether to expand too_many_Ns and not_in_db into real tables and write them in the output directory. The default is expand_vectors = TRUE. |
minimum_length |
removes each row that has a value less than minimum_length in the product_length column. The default is minimum_length = 5. |
maximum_length |
removes each row that has a value greater than maximum_length in the product_length column. The default is maximum_length = 500. |
warnings |
value to set the "warn" option to during the function call. On exit it returns to the previous value. Setting this argument to NULL will not change the option. |
... |
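additional arguments passed to blast_datatable() |
sample_size |
passed to blast_datatable() |
max_to_blast |
passed to blast_datatable() |
wildcards |
passed to blast_datatable() |
rank |
passed to blast_datatable() |
ncbi_bin |
passed to blast_datatable() |
evalue |
passed to blast_datatable() |
coverage |
passed to blast_datatable() |
perID |
passed to blast_datatable() |
align |
passed to blast_datatable() |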
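As a minimal sketch of two of the arguments described above, the base R code below builds the quoted, space-separated string that blast_db_path expects for multiple databases and mimics the minimum_length / maximum_length filter on a product_length column. The helpers make_blast_db_path() and filter_by_product_length() and the toy_hits table are hypothetical, written only for illustration; they are not part of rCRUX and may differ from how blast_seeds does this internally.

```r
# Hypothetical helper (not an rCRUX function): join several blast database
# paths into one string, space-separated and wrapped in an extra pair of
# double quotes, as described for blast_db_path.
make_blast_db_path <- function(db_paths) {
  if (length(db_paths) == 1) {
    return(db_paths)  # a single database needs no extra quoting
  }
  paste0('"', paste(db_paths, collapse = " "), '"')
}

blast_db_path <- make_blast_db_path(c(
  "/my/ncbi_nt/nt",
  "/my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"
))

# Hypothetical helper (not an rCRUX function): drop rows whose product_length
# is below minimum_length or above maximum_length.
filter_by_product_length <- function(hits,
                                     minimum_length = 5,
                                     maximum_length = 500) {
  keep <- hits$product_length >= minimum_length &
    hits$product_length <= maximum_length
  hits[keep, , drop = FALSE]
}

# Toy data invented for this illustration only
toy_hits <- data.frame(
  accession = c("A1", "A2", "A3", "A4"),
  product_length = c(3, 5, 250, 501),
  stringsAsFactors = FALSE
)
filtered_hits <- filter_by_product_length(toy_hits)
```

With the defaults, rows whose product_length equals exactly 5 or 500 are kept, since the description only removes values strictly less than the minimum or strictly greater than the maximum.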
The blast_datatable() call saves intermediate results and
metadata about the search as local files in the save directory generated by
blast_seeds. This allows the function to resume a partially
completed blast, mitigating the consequences of encountering an
error or experiencing other interruptions. To resume a partially completed
blast, supply the same seeds and working directory. See the documentation
of blast_datatable() for more information.
During blast_seeds, the following data are cached as files and passed into
blast_datatable(): output_table.txt (most recent updates from the
blast run), blast_seeds_passed_filter.txt (seed table that tracks the blast
status of seeds), unsampled_indices.txt (list of seed indices that need to
be blasted), too_many_ns.txt (tracks seeds that have been removed because
their sequence contains more consecutive Ns than are acceptable; see parameter
wildcards), blastdbcmd_failed.txt (tracks reads that are present in the seeds
database but not in the local blast database; this is relevant for the results
of get_seeds_remote()), and lastly num_rounds.txt (tracks the number of
completed blast rounds for a given seed file).
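Before resuming an interrupted run, it can help to see which cached files are already present. The sketch below uses only base R; the file names and the blast_seeds_save directory name come from the text above, while the helper check_blast_seeds_cache() and the demo directory are hypothetical and not part of rCRUX. It only tests for file existence and makes no assumption about the contents of the files.

```r
# Cached file names as listed in the documentation above
cache_files <- c(
  "output_table.txt",
  "blast_seeds_passed_filter.txt",
  "unsampled_indices.txt",
  "too_many_ns.txt",
  "blastdbcmd_failed.txt",
  "num_rounds.txt"
)

# Hypothetical helper (not an rCRUX function): report which cached files
# exist in the blast_seeds_save directory of a given output directory.
check_blast_seeds_cache <- function(output_directory_path) {
  save_dir <- file.path(output_directory_path, "blast_seeds_save")
  present <- file.exists(file.path(save_dir, cache_files))
  names(present) <- cache_files
  present
}

# Demonstration on a throwaway directory with two fake cache files
demo_dir <- file.path(tempdir(), "rcrux_cache_demo")
demo_save_dir <- file.path(demo_dir, "blast_seeds_save")
dir.create(demo_save_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_save_dir, c("output_table.txt", "num_rounds.txt")))

cache_status <- check_blast_seeds_cache(demo_dir)
print(cache_status)
```

On a real run, output_directory_path would be the same directory that was passed to blast_seeds.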
The final outputs of blast_seeds, stored in blast_seeds_output, are the
following: summary.csv (blast output with appended taxonomy),
metabarcode_name_.fasta (contains the sequences for all accessions recovered
during the blast search), metabarcode_name.taxonomy (contains the taxonomy
for all accessions recovered during the blast search),
metabarcode_name_blast_seeds_summary_unique_taxonomic_rank_counts.txt
(provides a count of all unique instances within a taxonomic rank),
too_many_ns.txt (tracks seeds that have been removed because their sequence
contains more consecutive Ns than are acceptable; see parameter wildcards),
and blastdbcmd_failed.txt (tracks reads that are present in the seeds
database but not in the local blast database; this is relevant for the results
of get_seeds_remote()).
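To illustrate what a count of unique instances within a taxonomic rank means, here is a minimal base R sketch on a toy table. The column names (accession, family, genus, species) and the rows are assumptions made for this example; the real summary.csv may use different columns, and rCRUX may compute its counts differently.

```r
# Toy stand-in for a blast summary with appended taxonomy
# (column names and rows are assumptions for illustration only)
summary_demo <- data.frame(
  accession = c("A1", "A2", "A3", "A4"),
  family = c("Salmonidae", "Salmonidae", "Cyprinidae", "Cyprinidae"),
  genus = c("Salmo", "Oncorhynchus", "Cyprinus", "Cyprinus"),
  species = c("Salmo salar", "Oncorhynchus mykiss",
              "Cyprinus carpio", "Cyprinus carpio"),
  stringsAsFactors = FALSE
)

rank_columns <- c("family", "genus", "species")

# Count the distinct non-missing values in each rank column
unique_rank_counts <- vapply(
  summary_demo[rank_columns],
  function(x) length(unique(x[!is.na(x)])),
  integer(1)
)
print(unique_rank_counts)
```

The same pattern could be applied to a table read with read.csv() from summary.csv, after replacing rank_columns with the rank columns actually present in that file.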
seeds_output_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv"
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
blast_db_path <- "/my/directory/ncbi_nt/nt"
blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name,
            rank = "species",
            max_to_blast = 750)
# using the rank of species will increase the total number of unique blast hits
# lowering max_to_blast submits fewer reads simultaneously, which reduces overall RAM use but extends the run time