R: Generate a metabarcode database by blasting a stratified...

blast_seeds {rCRUX}

R Documentation

Generate a metabarcode database by blasting a stratified random sample of seed sequences

Description

blast_seeds uses blast_datatable() to search against a blast formatted database. It creates a permanent directory blast_seeds_output and a temporary directory 'blast_seeds_save' in the output_directory_path. It saves from and passes files to blast_datatable() while the run is in progress. During the final steps of the function the final data is saved in rblast_seeds_output recording the results of the blast.

Usage

blast_seeds(
  seeds_output_path,
  blast_db_path,
  accession_taxa_sql_path,
  output_directory_path,
  metabarcode_name,
  expand_vectors = TRUE,
  minimum_length = 5,
  maximum_length = 500,
  warnings = 0,
  ...
)

Arguments

`seeds_output_path`	a path to a csv from get_seeds_local or get_seeds_remote (e.g. seeds_output_path <- '/my/rCRUX_output_directory/12S_V5F1_filtered_get_seeds_remote_output_with_taxonomy.csv')
`blast_db_path`	a directory containing one or more blast-formatted database. For multiple blast databases, separate them with a space and add an extra set of quotes. (e.g blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"')
`accession_taxa_sql_path`	a path to the accessionTaxa sql created by taxonomizr (e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql")
`output_directory_path`	a directory in which to save partial and complete output (e.g. "/path/to/output/12S_V5F1_local_111122_e300_111122").
`metabarcode_name`	a prefix for the output fasta, taxonomy, and count of unique ranks.(e.g. metabarcode_name <- "12S_V5F1").
`expand_vectors`	logical, determines whether to expand too_many_Ns and not_in db into real tables and write them in the output directory. the default is expand_vectors = TRUE.
`minimum_length`	removes each row that has a value less than minimum_length in the product_length column. The default is minimum_length = 5
`maximum_length`	removes each row that has a value greater than maximum_length in the product_length column The default is maximum_length = 500
`warnings`	value to set the "warn" option to during the function call. On exit it returns to the previous value. Setting this argument to NULL will not change the option.
`...`	additional arguments passed to `blast_datatable()`
`sample_size`	passed to `blast_datatable()` is the the number of entries to sample per rank. The default sample_size = 1 - is recommended unless the user is sampling higher order taxonomy. If there are not enough seeds to sample per rank the run will end in an error.
`max_to_blast`	passed to `blast_datatable()` and is the maximum number of entries to accumulate into a fasta before calling blastn. The default is max_to_blast = 1000 - the optimal number of reads to blast will depend on the user's environment (available RAM) and the number of possible hits (determined by marker and parameters)
`wildcards`	passed to `blast_datatable()` us a character vector that represents the minimum number of consecutive Ns the user will tolerate in a given seed or hit sequence. The default is wildcards = "NNNN"
`rank`	passed to `blast_datatable()` is the data column representing the taxonomic rank to randomly sample. The default is rank = genus - sampling a lower rank (e.g. species) will generate more total hits and take more time, conversely sampling a higher rank (e.g. family) will generate fewer total hits and take less time.
`ncbi_bin`	passed to `run_blastdbcmd()` `run_blastn()` is the path to blast+ tools if not in the user's path. Specify only if blastn and blastdbcmd are not in your path. The default is ncbi_bin = NULL - if not specified in path do the following: ncbi_bin = "/my/local/ncbi-blast-2.10.1+/bin".
`evalue`	passed to `run_blastn()` is the number of expected hits with a similar quality score found by chance. The default is evalue = 1e-6.
`coverage`	passed to `run_blastn()` is the minimum percent of the query length recovered in the subject hits. The default is coverage = 50.
`perID`	passed to `run_blastn()` is the minimum percent identity of the query relative to the subject hits. The default is perID = 70.
`align`	passed to `run_blastn()` is the maximum number of subject hits to return per query blasted. The default is align = 50000.

Details

The blast_datatable() call saves intermediate results and metadata about the search as local files in the save directory generated by blast_seeds. This allows the function to resume a partially completed blast, mitigating the consequences of encountering an error or experiencing other interruptions. To resume a partially completed blast, supply the same seeds and working directory. See the documentation of blast_datatable() for more information.

During the blast_seeds the following data are cashed as files and passed into blast_datatable(): output_table.txt (most recent updates from the blast run), blast_seeds_passed_filter.txt (seed table that tracks the blast status of seeds), unsampled_indices.txt (list of seed indices that need to be blasted), too_many_ns.txt (tracks seeds that have been removed due to more consecutive Ns in a sequence than are acceptable (see parameter wildcards), blastdbcmd_failed.txt (tracks reads that are present in the seeds database, but not the local blast database. This is relevant for the results of get_seeds_remote()), and lastly num_rounds.txt (tracks the number of completed blast round for a given seed file).

The final output of blast_seeds, stored in blast_seeds_output, are the following: summary.csv (blast output with appended taxonomy), metabarcode_name_.fasta (contains the sequence for all accessions recovered during the blast search), metabarcode_name.taxonomy (contains the taxonomy for all accessions recovered during the blast search), metabarcode_name_blast_seeds_summary_unique_taxonomic_rank_counts.txt (provides a count of all unique instances within a taxonomic rank), too_many_ns.txt (tracks seeds that have been removed due to more consecutive Ns in a sequence than are acceptable (see parameter wildcards), blastdbcmd_failed.txt (tracks reads that are present in the seeds database, but not the local blast database. This is relevant for the results of get_seeds_remote()).

Examples


seeds_output_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv"
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
blast_db_path <- "/my/directory/ncbi_nt/nt"


blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name,
            rank = species,
            max_to_blast = 750)

# using the rank of species will increase the number of total unique blast hits
# modifying the max_to_blast submits fewer reads simultaneously and reduces overall RAM while extending the run

[Package rCRUX version 0.0.1.000 ]