R: Query primer NCBI's Blast and generate a .csv to use for...

get_seeds_remote {rCRUX}

R Documentation

Query NCBI's primer BLAST and generate a .csv to use for blast_seeds

Description

get_seeds_remote combines modified versions of primerTree::primer_search() and primerTree's parse_primer into iterative_primer_search(), which it calls to query NCBI's primer BLAST tool. It then filters the results and aggregates them into a single data.frame. It creates a directory named get_seeds_remote in output_directory_path and writes three files inside that directory: the unfiltered output, the output after filtering with user-modifiable parameters and with taxonomy appended, and a summary of unique taxonomic ranks after filtering.

Usage

get_seeds_remote(
  forward_primer_seq,
  reverse_primer_seq,
  output_directory_path,
  metabarcode_name,
  accession_taxa_sql_path,
  organism,
  mismatch = 3,
  minimum_length = 5,
  maximum_length = 500,
  primer_specificity_database = "nt",
  ...,
  return_table = TRUE
)

Arguments

`forward_primer_seq`	passed to `primer_search()`, which turns it into a list of all possible non degenerate primers, then passes a user defined number of primer set combinations to NCBI. (e.g. forward_primer_seq <- "TAGAACAGGCTCCTCTAG")
`reverse_primer_seq`	passed to `primer_search()`, which turns it into a list of all possible non degenerate primers, then passes a user defined number of primer set combinations to NCBI. (e.g. reverse_primer_seq <- "TTAGATACCCCACTATGC")
`output_directory_path`	the parent directory to place the data in. (e.g. "/path/to/output/12S_V5F1_remote_111122")
`metabarcode_name`	used by `get_seeds_remote()` as a prefix for the name of each file it generates (e.g. metabarcode_name <- "12S_V5F1").
`accession_taxa_sql_path`	the path to the SQL database created by taxonomizr (e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql")
`organism`	a character vector of NCBI taxids. Each taxid is passed in turn to primer_search, which passes it to NCBI. get_seeds_remote aggregates all of the results into a single file. (e.g. organism = c("1476529", "7776")) - Note: increasing the taxonomic rank for this parameter (e.g. from order to class) can maximize primer hits, but can also lead to API run throttling due to memory limitations.
`mismatch`	the highest acceptable mismatch value. parse_primer_hits returns a table with a mismatch column. get_seeds_remote removes each row with a mismatch greater than the specified value. The default is mismatch = 3 - Note: this default is smaller than the one used by `get_seeds_local()`.
`minimum_length`	`parse_primer_hits()` returns a table with a product_length column. get_seeds_remote removes each row that has a value less than minimum_length in the product_length column. The default is minimum_length = 5
`maximum_length`	`parse_primer_hits()` returns a table with a product_length column. get_seeds_remote removes each row that has a value greater than maximum_length in the product_length column. The default is maximum_length = 500
`primer_specificity_database`	passed to `primer_search()`, which passes it to NCBI. The default is primer_specificity_database = 'nt'.
`...`	additional arguments passed to primer_search. See `primerTree::primer_search()` and the NCBI primer-blast tool (https://www.ncbi.nlm.nih.gov/tools/primer-blast/) for more information.
`num_permutations`	the number of primer permutations to search. If the degenerate bases produce more than this number of permutations, this number is sampled from all possible permutations. The default is num_permutations = 50 - Note: for very degenerate primers, searches may be empty due to poor mutual matches for a given forward and reverse primer combination.
`HITSIZE`	a primer BLAST search parameter set high to maximize the number of observations returned. The default is HITSIZE = 50000 - Note: increasing this parameter can maximize primer hits, but can also lead to API run throttling due to memory limitations.
`NUM_TARGETS_WITH_PRIMERS`	a primer BLAST search parameter set high to maximize the number of observations returned. The default is NUM_TARGETS_WITH_PRIMERS = 1000 - Note: increasing this parameter can maximize primer hits, but can also lead to API run throttling due to memory limitations.
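
The expansion of degenerate primers described for forward_primer_seq, reverse_primer_seq, and num_permutations can be illustrated with a minimal base R sketch. It is not the code used by primerTree or rCRUX: the helpers `expand_degenerate()` and `primer_pairs()` are hypothetical, and sampling whole forward/reverse pairs (rather than each primer separately) is an assumption made for illustration.

```r
# IUPAC nucleotide codes mapped to the bases they stand for.
iupac <- list(
  A = "A", C = "C", G = "G", T = "T",
  R = c("A", "G"), Y = c("C", "T"), S = c("G", "C"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper: list every non-degenerate version of one primer.
expand_degenerate <- function(primer) {
  bases <- strsplit(toupper(primer), "")[[1]]
  unknown <- setdiff(bases, names(iupac))
  if (length(unknown) > 0) {
    stop("Unknown IUPAC code: ", paste(unknown, collapse = ", "))
  }
  # One column per position, one row per combination of bases.
  grid <- expand.grid(unname(iupac[bases]), stringsAsFactors = FALSE)
  do.call(paste0, unname(as.list(grid)))
}

# Hypothetical helper: pair the expanded primers and keep at most
# num_permutations randomly sampled pairs.
primer_pairs <- function(forward, reverse, num_permutations = 50) {
  pairs <- expand.grid(
    forward = expand_degenerate(forward),
    reverse = expand_degenerate(reverse),
    stringsAsFactors = FALSE
  )
  if (nrow(pairs) > num_permutations) {
    pairs <- pairs[sample(nrow(pairs), num_permutations), , drop = FALSE]
  }
  pairs
}

# "TAGRACY" is a made-up degenerate primer with two ambiguous positions.
expand_degenerate("TAGRACY")
```

A primer without degenerate bases, such as the example forward primer "TAGAACAGGCTCCTCTAG", expands to itself, so only one pair would be sent.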

Details

get_seeds_remote passes the forward and reverse primer sequences for a given PCR product to iterative_primer_search(), along with the taxid(s) of the organism(s) to blast, the database to search, and many additional possible parameters for NCBI's primer blast tool (see Note below). Degenerate primers are converted into all possible non-degenerate sets, and a user-defined maximum number of primer combinations is passed to the API. Multiple taxids are searched independently, as are multiple databases (e.g. nt and refseq_representative_genomes). The data are parsed and stored in a dataframe, which is also written to a file with the suffix _unfiltered_get_seeds_remote_output.csv.
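
The "search independently, then aggregate" pattern described above can be sketched offline as follows. `fake_search()` is a hypothetical stand-in for one remote query and fabricates its rows; the real search is performed by iterative_primer_search() and its output columns may differ.

```r
# Hypothetical stand-in for one remote query. It fabricates a one-row
# result so that the aggregation logic can run without network access.
fake_search <- function(organism, database) {
  data.frame(
    accession = paste0("ACC_", organism, "_", database),  # made-up identifier
    organism = organism,
    database = database,
    stringsAsFactors = FALSE
  )
}

organism <- c("1476529", "7776")
databases <- c("nt", "refseq_representative_genomes")

# Every organism/database pair becomes its own query.
queries <- expand.grid(
  organism = organism,
  database = databases,
  stringsAsFactors = FALSE
)

# Run the queries one at a time.
results <- Map(fake_search, queries$organism, queries$database)

# Stack the per-query tables into a single data.frame.
all_hits <- do.call(rbind, unname(results))
rownames(all_hits) <- NULL
all_hits
```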

These hits are further filtered using filter_primer_hits(), which calculates amplicon size and appends it to the dataframe. Only hits whose length and number of mismatches fall within the default or user-modified limits are retained.
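
As a rough illustration of this filtering step, the sketch below applies the documented defaults (mismatch = 3, minimum_length = 5, maximum_length = 500) to a toy table. The helper `filter_hits()` and every value in `hits` are made up; the column names mismatch and product_length follow the argument descriptions, and the real filter_primer_hits() may work differently.

```r
# Toy hits table; every value is invented for illustration.
hits <- data.frame(
  accession = c("A1", "A2", "A3", "A4"),
  mismatch = c(0, 2, 4, 1),
  product_length = c(120, 3, 150, 650),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows within the mismatch and length limits.
filter_hits <- function(hits, mismatch = 3, minimum_length = 5,
                        maximum_length = 500) {
  keep <- hits$mismatch <= mismatch &
    hits$product_length >= minimum_length &
    hits$product_length <= maximum_length
  # which() drops rows where a comparison is NA instead of keeping them.
  hits[which(keep), , drop = FALSE]
}

filtered <- filter_hits(hits)
filtered
```

With these toy values only A1 survives: A2 is shorter than minimum_length, A3 exceeds the mismatch limit, and A4 is longer than maximum_length.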

Taxonomy is appended to these filtered hits using get_taxonomizr_from_accession(). The results are written to a file with the suffix _filtered_get_seeds_remote_output_with_taxonomy.csv. The number of unique instances for each rank in the taxonomic path of the filtered hits is tallied (NAs are counted once per rank) and written to a file with the suffix _filtered_get_seeds_local_remote_taxonomic_rank_counts.txt.
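
The rank tally can be pictured with a short base R sketch. The table and the choice of ranks below are toy assumptions, not rCRUX output; the point is only that `unique()` keeps a single NA, which matches "NAs are counted once per rank".

```r
# Toy taxonomy table; the rows are invented for illustration.
taxonomy <- data.frame(
  accession = c("A1", "A2", "A3", "A4"),
  family = c("Cyprinidae", "Cyprinidae", "Salmonidae", NA),
  genus = c("Danio", "Cyprinus", "Salmo", NA),
  species = c("Danio rerio", "Cyprinus carpio", NA, NA),
  stringsAsFactors = FALSE
)

# Ranks to tally (an assumption; the real function may use more ranks).
ranks <- c("family", "genus", "species")

# unique() keeps one NA, so missing values add exactly one to each count.
rank_counts <- vapply(
  taxonomy[ranks],
  function(x) length(unique(x)),
  integer(1)
)
rank_counts
```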

Note: get_seeds_remote passes many parameters to NCBI's primer blast tool. You can match the parameters to the fields available in the GUI (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). First, use your browser to view the page source. Then search for the title of the field you are interested in. It should be enclosed in a label tag. Inside the label tag, it says for = "<name_of_parameter>". Copy the string after for = and add it to get_seeds_remote as the name of a parameter, setting it to the value you want.

As of 2022-08-16, the primer blast GUI contains some options that are not implemented by primerTree::primer_search() and, by extension, iterative_primer_search(). primer_search doesn't include explicit documentation of allowed options, but it quickly reports when an option isn't allowed, so trial and error is not very time consuming.

Note: See iterative_primer_search() and modifiedPrimerTree_Functions for additional run parameters not included below.

Check NCBI's primer blast for additional search options

As of 2022-08-16, the primer blast GUI contains some options that are not implemented by primer_search. The table below documents some of the available options.

Name	Default
PRIMER_SPECIFICITY_DATABASE	nt
EXCLUDE_ENV	unchecked
ORGANISM	Homo sapiens
TOTAL_PRIMER_SPECIFICITY_MISMATCH	1
PRIMER_3END_SPECIFICITY_MISMATCH	1
TOTAL_MISMATCH_IGNORE	6
MAX_TARGET_SIZE	4000
HITSIZE	50000
EVALUE	30000
WORD_SIZE	7
NUM_TARGETS_WITH_PRIMERS	1000
MAX_TARGET_PER_TEMPLATE	100
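
One way to picture how named arguments supplied through `...` relate to these defaults is the base R sketch below. `blast_defaults` copies the table above, and `merge_blast_options()` is a hypothetical helper; neither is part of rCRUX or primerTree, and the real functions may validate or encode options differently.

```r
# Defaults copied from the table above, stored as they are written there.
blast_defaults <- list(
  PRIMER_SPECIFICITY_DATABASE = "nt",
  EXCLUDE_ENV = "unchecked",
  ORGANISM = "Homo sapiens",
  TOTAL_PRIMER_SPECIFICITY_MISMATCH = 1,
  PRIMER_3END_SPECIFICITY_MISMATCH = 1,
  TOTAL_MISMATCH_IGNORE = 6,
  MAX_TARGET_SIZE = 4000,
  HITSIZE = 50000,
  EVALUE = 30000,
  WORD_SIZE = 7,
  NUM_TARGETS_WITH_PRIMERS = 1000,
  MAX_TARGET_PER_TEMPLATE = 100
)

# Hypothetical helper: overlay user-supplied options on the defaults and
# reject names that do not appear in the table.
merge_blast_options <- function(...) {
  overrides <- list(...)
  unknown <- setdiff(names(overrides), names(blast_defaults))
  if (length(unknown) > 0) {
    stop("Option(s) not in the table: ", paste(unknown, collapse = ", "))
  }
  utils::modifyList(blast_defaults, overrides)
}

opts <- merge_blast_options(HITSIZE = "1000000", MAX_TARGET_SIZE = 200)
str(opts[c("HITSIZE", "MAX_TARGET_SIZE", "EVALUE")])
```

Options that are not overridden, such as EVALUE here, keep their table defaults.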

Value

a data.frame containing the same information as the .csv it generates
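
When return_table = FALSE, the same information can be read back from disk. The sketch below assumes the filtered file sits in the get_seeds_remote subdirectory and is named by joining metabarcode_name directly to the documented suffix; that exact naming is an assumption drawn from the descriptions on this page, so check the output directory if the file is not found.

```r
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"

# Assumed location: <output dir>/get_seeds_remote/<metabarcode name><suffix>
filtered_csv <- file.path(
  output_directory_path,
  "get_seeds_remote",
  paste0(metabarcode_name, "_filtered_get_seeds_remote_output_with_taxonomy.csv")
)

if (file.exists(filtered_csv)) {
  # Read the filtered seeds table written by an earlier run.
  seeds <- read.csv(filtered_csv, stringsAsFactors = FALSE)
  str(seeds)
} else {
  message("No output found at ", filtered_csv)
}
```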

Examples


forward_primer_seq <- "TAGAACAGGCTCCTCTAG"
reverse_primer_seq <- "TTAGATACCCCACTATGC"
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"


# Search two taxids with non-default primer BLAST parameters and
# write the results to disk without returning the table.
get_seeds_remote(forward_primer_seq,
                 reverse_primer_seq,
                 output_directory_path,
                 metabarcode_name,
                 accession_taxa_sql_path,
                 HITSIZE = '1000000',
                 evalue = '100000',
                 word_size = '6',
                 MAX_TARGET_PER_TEMPLATE = '5',
                 NUM_TARGETS_WITH_PRIMERS = '500000',
                 minimum_length = 50,
                 MAX_TARGET_SIZE = 200,
                 organism = c("1476529", "7776"),
                 return_table = FALSE)


# This returns approximately 111500 blast seeds (the count varies with database updates, etc.).
# The default parameters returned approximately 1047.
# Both figures assume the user is not throttled by memory limitations.

[Package rCRUX version 0.0.1.000 ]