blast_seeds {rCRUX}R Documentation

Generate a metabarcode database by blasting a stratified random sample of seed sequences

Description

blast_seeds uses blast_datatable() to search against a blast formatted database. It creates a permanent directory blast_seeds_output and a temporary directory 'blast_seeds_save' in the output_directory_path. It saves from and passes files to blast_datatable() while the run is in progress. During the final steps of the function the final data is saved in rblast_seeds_output recording the results of the blast.

Usage

blast_seeds(
  seeds_output_path,
  blast_db_path,
  accession_taxa_sql_path,
  output_directory_path,
  metabarcode_name,
  expand_vectors = TRUE,
  minimum_length = 5,
  maximum_length = 500,
  warnings = 0,
  ...
)

Arguments

seeds_output_path

a path to a csv from get_seeds_local or get_seeds_remote (e.g. seeds_output_path <- '/my/rCRUX_output_directory/12S_V5F1_filtered_get_seeds_remote_output_with_taxonomy.csv')

blast_db_path

a directory containing one or more blast-formatted database. For multiple blast databases, separate them with a space and add an extra set of quotes. (e.g blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"')

accession_taxa_sql_path

a path to the accessionTaxa sql created by taxonomizr (e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql")

output_directory_path

a directory in which to save partial and complete output (e.g. "/path/to/output/12S_V5F1_local_111122_e300_111122").

metabarcode_name

a prefix for the output fasta, taxonomy, and count of unique ranks.(e.g. metabarcode_name <- "12S_V5F1").

expand_vectors

logical, determines whether to expand too_many_Ns and not_in db into real tables and write them in the output directory. the default is expand_vectors = TRUE.

minimum_length

removes each row that has a value less than minimum_length in the product_length column. The default is minimum_length = 5

maximum_length

removes each row that has a value greater than maximum_length in the product_length column The default is maximum_length = 500

warnings

value to set the "warn" option to during the function call. On exit it returns to the previous value. Setting this argument to NULL will not change the option.

...

additional arguments passed to blast_datatable()

sample_size

passed to blast_datatable() is the the number of entries to sample per rank. The default sample_size = 1 - is recommended unless the user is sampling higher order taxonomy. If there are not enough seeds to sample per rank the run will end in an error.

max_to_blast

passed to blast_datatable() and is the maximum number of entries to accumulate into a fasta before calling blastn. The default is max_to_blast = 1000 - the optimal number of reads to blast will depend on the user's environment (available RAM) and the number of possible hits (determined by marker and parameters)

wildcards

passed to blast_datatable() us a character vector that represents the minimum number of consecutive Ns the user will tolerate in a given seed or hit sequence. The default is wildcards = "NNNN"

rank

passed to blast_datatable() is the data column representing the taxonomic rank to randomly sample. The default is rank = genus - sampling a lower rank (e.g. species) will generate more total hits and take more time, conversely sampling a higher rank (e.g. family) will generate fewer total hits and take less time.

ncbi_bin

passed to run_blastdbcmd() run_blastn() is the path to blast+ tools if not in the user's path. Specify only if blastn and blastdbcmd are not in your path. The default is ncbi_bin = NULL - if not specified in path do the following: ncbi_bin = "/my/local/ncbi-blast-2.10.1+/bin".

evalue

passed to run_blastn() is the number of expected hits with a similar quality score found by chance. The default is evalue = 1e-6.

coverage

passed to run_blastn() is the minimum percent of the query length recovered in the subject hits. The default is coverage = 50.

perID

passed to run_blastn() is the minimum percent identity of the query relative to the subject hits. The default is perID = 70.

align

passed to run_blastn() is the maximum number of subject hits to return per query blasted. The default is align = 50000.

Details

The blast_datatable() call saves intermediate results and metadata about the search as local files in the save directory generated by blast_seeds. This allows the function to resume a partially completed blast, mitigating the consequences of encountering an error or experiencing other interruptions. To resume a partially completed blast, supply the same seeds and working directory. See the documentation of blast_datatable() for more information.

During the blast_seeds the following data are cashed as files and passed into blast_datatable(): output_table.txt (most recent updates from the blast run), blast_seeds_passed_filter.txt (seed table that tracks the blast status of seeds), unsampled_indices.txt (list of seed indices that need to be blasted), too_many_ns.txt (tracks seeds that have been removed due to more consecutive Ns in a sequence than are acceptable (see parameter wildcards), blastdbcmd_failed.txt (tracks reads that are present in the seeds database, but not the local blast database. This is relevant for the results of get_seeds_remote()), and lastly num_rounds.txt (tracks the number of completed blast round for a given seed file).

The final output of blast_seeds, stored in blast_seeds_output, are the following: summary.csv (blast output with appended taxonomy), metabarcode_name_.fasta (contains the sequence for all accessions recovered during the blast search), metabarcode_name.taxonomy (contains the taxonomy for all accessions recovered during the blast search), metabarcode_name_blast_seeds_summary_unique_taxonomic_rank_counts.txt (provides a count of all unique instances within a taxonomic rank), too_many_ns.txt (tracks seeds that have been removed due to more consecutive Ns in a sequence than are acceptable (see parameter wildcards), blastdbcmd_failed.txt (tracks reads that are present in the seeds database, but not the local blast database. This is relevant for the results of get_seeds_remote()).

Examples


seeds_output_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv"
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
blast_db_path <- "/my/directory/ncbi_nt/nt"


blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name,
            rank = species,
            max_to_blast = 750)

# using the rank of species will increase the number of total unique blast hits
# modifying the max_to_blast submits fewer reads simultaneously and reduces overall RAM while extending the run



[Package rCRUX version 0.0.1.000 ]