get_seeds_local {rCRUX}R Documentation

Local blastn interpretation of querying primer_blast to and generate a .csv to use for blast_seeds

Description

get_seeds_local is a local interpretation of get_seeds_remote() that avoids querying NCBI's primer BLAST tool. Although it is slower than remotely generating blast seeds, it is not subject to the arbitrary throttling of jobs that require significant memory. It creates a 'get_seeds_local' directory at output_directory_path if one doesn't yet exist, then creates a subdirectory inside output_directory_path named after metabarcode_name. It creates four permenant files inside that directory. Several temporary files are generated during the run for keeping track of primers being blasted and needing to be blasted, and raw blast output. One represents the unfiltered output and another represents the output after filtering with user modifiable parameters and with appended taxonomy. Also generated is a summary of unique taxonomic ranks after filtering, and a fasta of the primers used for blast.

Usage

get_seeds_local(
  forward_primer_seq,
  reverse_primer_seq,
  output_directory_path,
  metabarcode_name,
  accession_taxa_sql_path,
  blast_db_path,
  mismatch = 6,
  minimum_length = 5,
  maximum_length = 500,
  primer_specificity_database = "nt",
  num_fprimers_to_blast = 50,
  num_rprimers_to_blast = 50,
  max_to_blast = 2,
  align = "10000000",
  ...,
  return_table = TRUE
)

Arguments

forward_primer_seq

which which turns degenerate primers into into a list of all possible non degenerate primers and converts the primer(s) into to a fasta file to be past to run_primer_blastn. (e.g. forward_primer_seq <- "TAGAACAGGCTCCTCTAG" or forward_primer_seq <- c("TAGAACAGGCTCCTCTAG", "GGWACWGGWTGAACWGTWTAYCCYCC")

reverse_primer_seq

which which turns degenerate primers into into a list of all possible non degenerate primers and converts the primer(s) into to a fasta file to be past to run_primer_blastn. (e.g reverse_primer_seq <- "TTAGATACCCCACTATGC" or reverse_primer_seq <- c("TTAGATACCCCACTATGC", "TANACYTCNGGRTGNCCRAARAAYCA")

output_directory_path

the parent directory to place the data in (e.g. "/path/to/output/12S_V5F1_local_111122_e300_111122").

metabarcode_name

used to name the files. get_seeds_local appends metabarcode_name to the beginning of each of the files it generates (e.g. metabarcode_name <- "12S_V5F1").

accession_taxa_sql_path

the path to sql created by taxonomizr (e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql")

blast_db_path

a directory containing one or more blast-formatted database. For multiple blast databases, separate them with a space and add an extra set of quotes. (e.g blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"')

mismatch

the highest acceptable mismatch value per hit. get_seeds_local removes each row with a mismatch greater than the specified value. The default is mismatch = 6

minimum_length

get_seeds_local removes each row that has a value less than minimum_length in the product_length column. The default is minimum_length = 5

maximum_length

get_seeds_local removes each row that has a value greater than maximum_length in the product_length column The default is maximum_length = 500

num_fprimers_to_blast

is the maximum number of possible forward primers to blast. This is relevant for degenerate primers, all possible primers from a degenerate sequence are enumerated, and the user can choose a number to be randomly sampled and used for primer blast. The default is num_fprimers_to_blast = 50

num_rprimers_to_blast

is the maximum number of possible reverse primers to blast. This is relevant for degenerate primers, all possible primers from a degenerate sequence are enumerated, and the user can choose a number to be randomly sampled and used for primer blast. The default is num_rprimers_to_blast = 50

max_to_blast

is the number of primers to blast simultaneously. The default is max_to_blast = 2. - Increasing this number will decrease overall run time, but increase the amount of RAM required.

align

is the maximum number of subject hits to return per query blasted. The default is align = '10000000'. - to few alignments will result in no matching pairs of forward and reverse primers. To many alignments can result in an error due to RAM limitations.

task

(passed to run_primer_blastn()) the task for blastn to perform - default here is "blastn_short", which is optimized for searches with queries < 50 bp

word_size

(passed to run_primer_blastn()) is the fragment size used for blastn search - smaller word sizes increase sensitivity and time of the search. The default is word_size = 7

evalue

(passed to run_primer_blastn()) is the number of expected hits with a similar quality score found by chance. The default is evalue = '3e-7'.

coverage

(passed to run_primer_blastn()) is the minimum percent of the query length recovered in the subject hits. The default is coverage = 90.

perID

(passed to run_primer_blastn()) is the minimum percent identity of the query relative to the subject hits. The default is perID = 50.

reward

(passed to run_primer_blastn()) is the reward for nucleotide match. The default is reward = 2.

ncbi_bin

passed to run_primer_blastn()) is the path to blast+ tools if not in the user's path. Specify only if blastn and is not in your path. The default is ncbi_bin = NULL - if not specified in path do the following: ncbi_bin = "/my/local/ncbi-blast-2.10.1+/bin".

Details

get_seeds_local passes the forward and reverse primer sequence for a given PCR product to run_primer_blastn(). In the case of a non degenerate primer set only two primers will be passed to run_primer_blast. In the case of a degenerate primer set, get_seeds_local will get all possible versions of the degenerate primer(s) (using primerTree's enumerate_primers() function), randomly sample a user defined number of forward and reverse primers, and generate a fasta file. The selected primers are subset and passed to run_primer_blastn which queries each primer against a blast formatted database using the task "blastn_short".

Output is cashed after each sucessful run of run_primer_blastn, so if a run is interrupted the user can resubmit the command and pick up where they left off. The user can modify parameters for the run with the exception of num_fprimers_to_blast and num_rprimers_to_blast.

The returned blast hits for each sequence are matched and checked to see if they generate plausible amplicon (e.g. amplify the same accession and are in the correct orientation to produce a PCR product). These hits are written to a file with the suffix ⁠_unfiltered_get_seeds_local_output.csv⁠. These hits are further filtered for length and number of mismatches. Taxonomy is appended to these filtered hits using get_taxonomizr_from_accession(). The results are written to to a file with the suffix ⁠_filtered_get_seeds_local_output_with_taxonomy.csv⁠. The number of unique instances for each rank in the taxonomic path for the filtered hits are tallied (NAs are counted once per rank) and written to a file with the suffix ⁠_filtered_get_seeds_local_unique_taxonomic_rank_counts.txt⁠

Note: Information about the blastn parameters can be found in run_primer_blast, and by accessing blastn -help. Default parameters are optimized to provide results similar to that generated through remote blast via primer-blast as implemented in iterative_primer_search().

Examples


# Non degenerate primer example: 12S_V5F1 (Riaz et al. 2011)

forward_primer_seq = "TAGAACAGGCTCCTCTAG"
reverse_primer_seq =  "TTAGATACCCCACTATGC"
output_directory_path <- "/my/directory/12S_V5F1_local_111122_species_750"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
blast_db_path <- "/my/directory/ncbi_nt/nt"


get_seeds_local(forward_primer_seq,
                reverse_primer_seq,
                output_directory_path,
                metabarcode_name,
                accession_taxa_sql_path,
                blast_db_path,
                minimum_length = 80,
                maximum_length = 150)

# adjusting the minimum_length and maximum_length parameters reduces the number of total hits by removing reads that could result from off target amplification


# Degenerate primer example - mlCOIintF/jgHC02198 (Leray et al. 2013)
# Note: this will take considerable time and computational resources

forward_primer_seq <- "GGWACWGGWTGAACWGTWTAYCCYCC"
reverse_primer_seq <- "TANACYTCNGGRTGNCCRAARAAYCA"
output_directory_path <- "/my/directory/CO1_local"
metabarcode_name <- "CO1"


get_seeds_local(forward_primer_seq,
                reverse_primer_seq,
                output_directory_path,
                metabarcode_name,
                accession_taxa_sql_path,
                blast_db_path,
                minimum_length = 200,
                maximum_length = 400,
                aligns = '10000',
                num_rprimers_to_blast = 200,
                num_rprimers_to_blast = 2000,
                max_to_blast = 10)


# Non Degenerate but high return primer example - 18S (Amaral-Zettler et al. 2009)
# Note: this will take considerable time and computational resources

forward_primer_seq <- "GTACACACCGCCCGTC"
reverse_primer_seq <- "TGATCCTTCTGCAGGTTCACCTAC"
output_directory_path <- "/my/directory/18S_local"
metabarcode_name <- "18S"


get_seeds_local(forward_primer_seq,
                reverse_primer_seq,
                output_directory_path,
                metabarcode_name,
                accession_taxa_sql_path,
                blast_db_path,
                minimum_length = 250,
                maximum_length = 350,
                max_to_blast = 1)

# blasting two primers at a time can max out a system's RAM, however blasting one at a time is more feasable for personal computers with 16 GB RAM




[Package rCRUX version 0.0.1.000 ]