| derep_and_clean_db {rCRUX} | R Documentation |
derep_and_clean_db takes the output from blast_seeds() and
de-replicates the dataset to identify representative sequences. It creates
an output directory, derep_and_clean_db, at output_directory_path to store
the output .csv files and the fasta and taxonomy files generated by the function.
derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)
output_directory_path |
the path to the output directory |
summary_path |
the path to the input file, e.g. the summary.csv written by blast_seeds() |
metabarcode_name |
used to name the subdirectory and the files. |
Before de-replicating the data set, all sequences with NA taxonomy for phylum,
class, order, family, and genus are removed from the dataset because they
typically represent environmental samples with low value for taxonomic
classification. These sequences are stored in
Sequences_with_mostly_NA_taxonomic_paths.csv.
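The filtering step above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the toy table, its column names, and the reading that a row is removed only when all five ranks are NA are assumptions made for the example.

```r
# Hypothetical toy input; column names are assumed for illustration only.
seqs <- data.frame(
  accession = c("A1", "A2", "A3"),
  phylum    = c("Chordata", NA, "Chordata"),
  class     = c("Actinopteri", NA, NA),
  order     = c("Anabantiformes", NA, NA),
  family    = c("Badidae", NA, NA),
  genus     = c("Badis", NA, NA),
  stringsAsFactors = FALSE
)

ranks <- c("phylum", "class", "order", "family", "genus")

# TRUE for rows in which every one of the five ranks is NA (assumed criterion)
all_na <- rowSums(is.na(seqs[, ranks])) == length(ranks)

mostly_na <- seqs[all_na, ]   # rows that would be set aside
kept      <- seqs[!all_na, ]  # rows that continue to de-replication
```

Here only A2 is set aside; A3 is kept because its phylum is known.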
All sequences with the same length and composition are collapsed to a single
database entry, where the accessions and taxids (if there is more than one)
are concatenated. The sequences with a clean taxonomic path (i.e. no ranks
with multiple entries) are written to Sequences_with_single_taxonomic_path.csv.
Sequences with multiple entries for a given taxonomic rank are written to
Sequences_with_multiple_taxonomic_paths.csv. These sequences are processed
further by removing NAs from rank instances with more than one entry
(e.g. "Chordata, NA" becomes "Chordata"). Any remaining instances
of taxonomic ranks with more than one taxid are reduced to NA
(e.g. "Badis assamensis, Badis badis" becomes "NA"). These sequences,
with taxonomic paths shortened to the lowest taxonomic agreement, are written
to Sequences_with_lowest_common_taxonomic_path_agreement.csv.
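The collapse and rank-reduction rules above can be illustrated with a small base R sketch. The helper reduce_rank(), the toy table, and its column names are hypothetical and only mirror the two rules described here; the function's own internals may differ.

```r
# Hypothetical toy hits; A1 and A2 share an identical sequence.
hits <- data.frame(
  accession = c("A1", "A2", "A3"),
  sequence  = c("ACGT", "ACGT", "ACGA"),
  phylum    = c("Chordata", NA, "Chordata"),
  species   = c("Badis assamensis", "Badis badis", "Badis badis"),
  stringsAsFactors = FALSE
)

# Collapse identical sequences, concatenating the distinct entries per column.
# paste() turns a missing value into the string "NA", as in "Chordata, NA".
collapsed <- aggregate(
  x   = hits[, c("accession", "phylum", "species")],
  by  = list(sequence = hits$sequence),
  FUN = function(v) paste(unique(v), collapse = ", ")
)

# Hypothetical helper: drop NA entries when a rank has several values,
# then return NA if more than one distinct value still remains.
reduce_rank <- function(x) {
  entries <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  entries <- unique(entries[entries != "NA"])
  if (length(entries) == 1) entries else NA_character_
}

for (rank in c("phylum", "species")) {
  collapsed[[rank]] <- vapply(collapsed[[rank]], reduce_rank,
                              character(1), USE.NAMES = FALSE)
}
```

For the shared sequence ACGT, the phylum resolves to "Chordata" while the species is reduced to NA, so the path agrees only down to the ranks that all collapsed accessions share.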
Lastly, the sequences from Sequences_with_single_taxonomic_path.csv
and Sequences_with_lowest_common_taxonomic_path_agreement.csv are used to
generate a fasta file and taxonomy file of representative NCBI accessions for
each sequence. The number of accessions identical to the representative
accession is given.
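As a rough sketch of this last step, the snippet below picks one accession per collapsed entry, counts the accessions in each group, and writes a FASTA file. Several details are assumptions for illustration: the first listed accession is taken as the representative, the count includes the representative itself, and the layout of the output is a plain two-line FASTA record; the files written by derep_and_clean_db may be laid out differently.

```r
# Hypothetical collapsed table; accessions are concatenated with ", ".
collapsed <- data.frame(
  accession = c("A1, A2", "A3"),
  sequence  = c("ACGT", "ACGA"),
  stringsAsFactors = FALSE
)

acc_list <- strsplit(collapsed$accession, ", ", fixed = TRUE)

# Assumption: the first listed accession represents the group.
representative <- vapply(acc_list, function(a) a[1], character(1))

# Number of accessions sharing each sequence (representative included).
n_identical <- lengths(acc_list)

# Interleave header and sequence lines: ">A1", "ACGT", ">A3", "ACGA".
fasta_lines <- as.vector(rbind(paste0(">", representative), collapsed$sequence))

fasta_path <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, fasta_path)
```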
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
summary_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv"
metabarcode_name <- "12S_V5F1"
derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)