derep_and_clean_db {rCRUX}	R Documentation

De-replicates identical sequences and finds the lowest taxonomic agreement for sequences with multiple taxids

Description

derep_and_clean_db takes the output from blast_seeds() and de-replicates the dataset to identify representative sequences. It creates an output directory named derep_and_clean_db at output_directory_path to store the output .csv files and the fasta and taxonomy files generated by the function.

Usage

derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)

Arguments

output_directory_path

the path to the output directory

summary_path

the path to the input file (the summary.csv written by blast_seeds())

metabarcode_name

used to name the subdirectory and the files.

Details

Before de-replicating the dataset, all sequences with NA taxonomy for phylum, class, order, family, and genus are removed because they typically represent environmental samples with low value for taxonomic classification. These sequences are stored in Sequences_with_mostly_NA_taxonomic_paths.csv.
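The filtering step can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy table and its column names (accession, phylum, class, order, family, genus) are hypothetical and may not match the columns rCRUX actually uses.

```r
# Hypothetical toy table; column names are assumptions, not rCRUX's schema.
hits <- data.frame(
  accession = c("A1", "A2", "A3"),
  phylum    = c("Chordata", NA, NA),
  class     = c("Actinopteri", NA, NA),
  order     = c(NA_character_, NA, NA),
  family    = c("Badidae", NA, NA),
  genus     = c("Badis", NA, "Badis"),
  stringsAsFactors = FALSE
)

ranks <- c("phylum", "class", "order", "family", "genus")

# TRUE for rows where every one of the five ranks is NA
all_na <- rowSums(!is.na(hits[, ranks])) == 0

mostly_na <- hits[all_na, ]   # rows that would be set aside
kept      <- hits[!all_na, ]  # rows that continue to de-replication
```

In this sketch, a row with even one non-NA rank (such as A3, which only has a genus) is kept.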

All sequences with the same length and composition are collapsed to a single database entry, where the accessions and taxids (if there is more than one) are concatenated. The sequences with a clean taxonomic path (i.e. no ranks with multiple entries) are written to Sequences_with_single_taxonomic_path.csv.
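A minimal base R sketch of collapsing identical sequences might look like this. The table, its column names, and the ", " separator are assumptions made for illustration; the package's internal implementation may differ.

```r
# Hypothetical input; column names and values are illustrative only.
seqs <- data.frame(
  accession = c("A1", "A2", "A3"),
  taxid     = c("101", "102", "101"),
  sequence  = c("ACGT", "ACGT", "GGCC"),
  stringsAsFactors = FALSE
)

# Join the distinct values in a group into one comma-separated string
collapse_unique <- function(x) paste(unique(x), collapse = ", ")

# One row per distinct sequence, with accessions and taxids concatenated
derep <- aggregate(
  seqs[, c("accession", "taxid")],
  by  = list(sequence = seqs$sequence),
  FUN = collapse_unique
)

# Number of accessions that share each sequence
derep$n_accessions <- lengths(strsplit(derep$accession, ", ", fixed = TRUE))
```

Identical strings necessarily have the same length and composition, so grouping on the sequence string itself is enough for this sketch.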

Sequences with multiple entries for a given taxonomic rank are written to Sequences_with_multiple_taxonomic_paths.csv. These sequences are processed further by removing NAs from any rank with more than one entry (e.g. "Chordata, NA" becomes "Chordata"). Any rank that still has more than one entry is reduced to NA (e.g. "Badis assamensis, Badis badis" becomes "NA"). These sequences, with taxonomic paths shortened to the lowest taxonomic agreement, are written to Sequences_with_lowest_common_taxonomic_path_agreement.csv.
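The per-rank rule described above can be sketched as a small helper. The function name resolve_rank is hypothetical, and the sketch assumes that multiple entries for a rank are stored as one comma-separated string.

```r
# Hypothetical helper: reduce one rank's entries to a single agreed value.
resolve_rank <- function(entry) {
  # Split "Chordata, NA" into its individual entries
  values <- trimws(strsplit(entry, ",", fixed = TRUE)[[1]])
  # Drop NA entries and duplicates
  values <- unique(values[!is.na(values) & values != "NA"])
  # Keep the value only if exactly one distinct entry remains
  if (length(values) == 1) values else NA_character_
}

resolve_rank("Chordata, NA")
resolve_rank("Badis assamensis, Badis badis")
```

The first call returns "Chordata" and the second returns NA, matching the two examples given in the text. Applied rank by rank, this leaves each path agreed only down to the lowest rank on which all entries concur.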

Lastly, the sequences from Sequences_with_single_taxonomic_path.csv and Sequences_with_lowest_common_taxonomic_path_agreement.csv are used to generate a fasta file and taxonomy file of representative NCBI accessions for each sequence. The number of accessions identical to the representative accession is given.
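As a rough illustration of this final step, the sketch below writes a fasta file and a tab-delimited taxonomy file from a table of representative sequences. The table, its column names, the semicolon-separated path format, and the use of temporary files are all assumptions; the actual file names and layout produced by derep_and_clean_db may differ.

```r
# Hypothetical representative sequences; all values are illustrative.
reps <- data.frame(
  accession      = c("A1", "A3"),
  sequence       = c("ACGT", "GGCC"),
  taxonomic_path = c("Chordata;Actinopteri;NA;Badidae;Badis;Badis badis",
                     "Chordata;Actinopteri;NA;Badidae;Badis;NA"),
  stringsAsFactors = FALSE
)

# Interleave header lines (">accession") with the sequence lines
fasta_lines <- as.vector(rbind(paste0(">", reps$accession), reps$sequence))
fasta_file <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, fasta_file)

# One line per accession: accession <tab> taxonomic path
taxonomy_file <- tempfile(fileext = ".txt")
write.table(reps[, c("accession", "taxonomic_path")], taxonomy_file,
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
```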

Examples


output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
summary_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv"
metabarcode_name <- "12S_V5F1"


derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)



[Package rCRUX version 0.0.1.000 ]