The Local Taxa Tool (LocaTT, pronounced “locate”) is an R package which performs taxonomic assignment of DNA metabarcoding sequence data while considering geographic location. Sequences are BLASTed against a global reference database, and the tool suggests locally occurring taxa which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Compared to other implementations of BLAST, this approach provides the following advantages. First, all best BLAST matches are considered, as opposed to either the first or a set number of matches. Second, all suggested taxa are local species which actually occur in the study area. And third, local species which are not present in the reference database can still be suggested if related species have reference sequences, compared to approaches which subset the reference database to just local species.
The tool may be used for taxonomic assignment following the use of any other bioinformatics pipeline given that either the sequences to classify can be read into R, or the sequences are in fasta format. There are three main steps to using the Local Taxa Tool. First, reference databases from MIDORI or UNITE are formatted for use with the tool, and users can apply any desired edits to the taxonomies or sequences in the reference databases. Second, local species lists with taxonomies are prepared from either a list of species binomials or species lists downloaded from the IUCN Red List. Third, the Local Taxa Tool is run to perform taxonomic assignment and suggest local species for each query sequence. This tutorial describes how to prepare reference databases, prepare local species lists, and run the Local Taxa Tool.
The LocaTT package also contains basic bioinformatics functions for DNA metabarcoding. A full bioinformatics pipeline using the LocaTT package is in development.
R and BLAST are required to run the Local Taxa Tool. R can be installed from https://www.r-project.org, and RStudio Desktop can optionally be installed from https://posit.co/download/rstudio-desktop/. BLAST can be installed from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (use the file ending in .dmg for Mac, use the file ending in -win64.exe for Windows). Throughout this tutorial, all code is run in R. No use of the command line is necessary, as BLAST programs are called from R.
# Check R version.
R.Version()$version.string
## [1] "R version 4.2.0 (2022-04-22)"
# Display platform.
R.Version()$platform
## [1] "x86_64-apple-darwin17.0"
Now let’s install the Local Taxa Tool package, which is named
“LocaTT”. To install the current CRAN release, use the
install.packages
function below.
# Install the current Local Taxa Tool (LocaTT) release from CRAN.
install.packages("LocaTT")
The development version of the package is available on GitHub. To
install the development version, install the devtools package first in
order to get the install_github
function.
# Install the devtools package.
install.packages("devtools")
# Load the devtools package.
library(devtools)
# Install the development version of the Local Taxa Tool (LocaTT) package from GitHub.
devtools::install_github("Urodelan/LocaTT")
Now, load the Local Taxa Tool package.
# Load the Local Taxa Tool package.
library(LocaTT)
## Welcome to the Local Taxa Tool (LocaTT)! This is version 1.1.1.
## ______ ______ ______ Geographically-conscious
## | | \/ | | \/ | | taxonomic assignment
## _|__|_/\_|__|_/\_|__|_ for DNA metabarcoding.
## For a detailed tutorial, see: https://urodelan.github.io/Local_Taxa_Tool_Tutorial/
## Command line BLAST is required, see tutorial for details.
## See citation('LocaTT') if used in publications.
Once the Local Taxa Tool is installed and loaded, let’s check that the BLAST installation was successful.
# Display the BLAST version.
blast_version()
## [1] "blastn: 2.12.0+ Package: blast 2.12.0, build Jun 4 2021 04:06:33"
If the BLAST version was displayed, then we’re good to go! If you’re
using a non-standard installation of BLAST, set the path to the blastn
program using the blast_command argument. For non-standard BLAST
installations, paths to BLAST programs can be set in the
format_reference_database
and local_taxa_tool
functions as well.
Some functions in the Local Taxa Tool package take file paths as
arguments which are used by both R and the command line. If using
Windows, please use file paths which can be understood by both R and the
command line. For example, by using forward slashes in file paths
instead of backslashes. Further, some functions do not allow spaces in
certain file paths which are read by both R and the command line.
Therefore, it is recommended that file paths provided as function
arguments lack spaces. If the working directory contains spaces, you can
set R’s working directory with the setwd
function, then
ensure that relative file paths used as function arguments contain no
spaces.
Using the Local Taxa Tool on a high performance computing cluster:
On some high performance computing clusters, BLAST is installed as a
module which must be loaded from the terminal (e.g.,
module load blast
) in order to be recognized by R. Further,
on some clusters R is set up as a container which does not recognize
modules loaded outside of the container. Under this condition, adding
the cluster’s module path for BLAST to R’s PATH environmental variable
may allow R to locate command line BLAST programs. The cluster’s module
path for BLAST can be found using module show blast
on the
terminal, and this can be added to R’s PATH environmental variable using
R’s Sys.setenv
function. Below is an example of how to do
this.
# Run in Unix terminal. Get module path for BLAST on high performance computing cluster.
module show blast 2>&1 | grep ^prepend_path\( | sed 's/^prepend_path("PATH",//' | sed 's/)$//'
## "/your/path/to/BLAST/module/bin/folder/here"
# In R, add the cluster's module path for BLAST to R's PATH environmental variable.
Sys.setenv(PATH=paste("/your/path/to/BLAST/module/bin/folder/here",Sys.getenv("PATH"),sep=":"))
# Load the Local Taxa Tool package.
library(LocaTT)
# Check that R can find BLAST programs by checking the BLAST version.
blast_version()
## [1] "blastn: 2.12.0+ Package: blast 2.12.0, build Jun 4 2021 04:06:33"
Alternatively, paths to BLAST programs can be provided as arguments
in the functions blast_version
,
format_reference_database
, and
local_taxa_tool
. If you continue to encounter issues
running command line BLAST programs from R on a cluster, please contact
your cluster’s help desk.
This tutorial uses example data files which can be downloaded from the GitHub repository: https://github.com/Urodelan/Local_Taxa_Tool_Tutorial_Data. To download the example data files, head to the GitHub repository and click on the green “Code” drop-down menu at the top right. Then click “Download ZIP”, and unzip the downloaded file. Some example data files include DNA sequences. These example sequences are either from GenBank, were derived from existing GenBank sequences, or were randomly generated.
For the Local Taxa Tool to provide local species suggestions, a local
species list must be provided with the same taxonomy system as the
reference database. In turn, the reference database must contain
taxonomy information for each reference sequence. Reference databases
must be in fasta format, and fasta header lines must contain seven-level
taxonomies with taxonomic units delimited by semi-colons and underscores
representing spaces (e.g., in species binomials). The
format_reference_database
function formats reference
databases from MIDORI or UNITE (general fasta release) for use with the
Local Taxa Tool.
MIDORI Reference 2 (http://www.reference-midori.info) hosts reference databases derived from GenBank sequences with NCBI taxonomies which can be formatted for use with the Local Taxa Tool. MIDORI databases are available for multiple barcode regions, including COI and srRNA. Here, we will download the MIDORI srRNA reference database, which encompasses the 12S vertebrate metabarcoding region, for an example of using the Local Taxa Tool to classify the sequences of vertebrate prey items from the fecal samples of mammalian carnivores (order Carnivora).
Let’s go to download section of MIDORI’s website (http://www.reference-midori.info/download.php#) and select the following drop-down menus. Select “Databases”, then the desired release (in our case, the most recent release is “GenBank252”). Then, select either “BLAST” or “BLAST_sp” to access databases which are ready to be formatted for BLAST. We selected “BLAST_sp” for databases that include both sequences which have species-level identification and sequences which lack a binomial species-level description (e.g., sp.; see the MIDORI README file for details). Now, select “uniq” for databases containing only unique sequences for each species, and select “fasta” to find the databases in fasta format. From here, we can see the MIDORI databases for various barcode regions. For this tutorial, download the srRNA database and unzip it.
When formatting reference databases for the Local Taxa Tool using the
format_reference_database
function, please consider the
taxonomy system of your reference databases and whether any adjustments
should be made to improve compatibility between the taxonomies of the
reference sequences and your local species lists. Within this tutorial,
we provide options for preparing local taxa lists using two different
taxonomy systems, the NCBI taxonomy database and the IUCN Red List. If
the reference database and local species list use different taxonomy
systems, we strongly recommend adjusting the taxonomies of either the
reference database, local species list, or both to improve compatibility
between the taxonomy systems. The functions included in the Local Taxa
Tool package for preparing reference databases and local taxa lists have
options for adjusting taxonomies, and we will cover adjusting taxonomies
in reference databases now.
MIDORI reference sequences use taxonomies from the NCBI taxonomy database, which can be browsed at https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi. In instances where NCBI does not provide a taxonomic level, then MIDORI uses the lower taxonomic level with the name of the missing taxonomic level as a prefix. NCBI’s superkingdom (e.g., Eukaryota) is used as the domain, and phylum, class, order, family, genus, and species are included to complete a 7-level taxonomy. Here is an example taxonomy of Ministeria vibrans, which lacks a phylum, order, and family in NCBI:
Full taxonomy: Eukaryota;phylum_Filasterea;Filasterea;order_family_Ministeria;family_Ministeria;Ministeria;Ministeria_vibrans
In this tutorial, we will adjust taxonomies in the downloaded MIDORI
srRNA reference database to better match other taxonomic sources,
including the IUCN Red List. Locate the example Taxonomy_Edits_NCBI.csv
file, and open the file in your preferred spreadsheet application. You
will notice three fields in the file, which are all required. The first
field, “Old_Taxonomy”, contains the original taxonomies which need to be
changed, and the second field, “New_Taxonomy”, contains the taxonomies
which the old ones will be changed to. Within both the old and new
taxonomy fields, taxonomies should begin with the highest taxonomic
level (e.g., the domain name) and proceed to progressively
finer taxonomic levels. Taxonomic levels should be delimited by
semi-colons. The “Notes” field states what each taxonomy edit
accomplishes. For example, let’s consider the third record in the
downloaded Taxonomy_Edits_NCBI.csv file. The old taxonomy is
“Eukaryota;Chordata;class_Testudines;”, and the new taxonomy is
“Eukaryota;Chordata;Reptilia;”. In the “Notes” field, we can see that
this record assigns members of order Testudines (the turtles) to class
Reptilia. NCBI taxonomy does not have Reptilia. Because there is no
class associated with Testudines in NCBI taxonomy, the class for
Testudines used by MIDORI is “class_Testudines”. Within the
format_reference_database
function, the old reference
taxonomies are replaced with the new taxonomies as listed in the
taxonomy edits file. The replacements are made sequentially in the order
records appear in the taxonomy edits file. In addition to editing
taxonomies, records in the taxonomy edits file can also be used to fix
typos in the reference database. For example, the final record in the
Taxonomy_Edits_NCBI.csv file corrects a typo in the species binomial
“Panthera_sp_laea” to “Panthera_spelaea”.
Finally, we will add and remove certain sequences from the reference
database. Locate the example Sequence_Edits_Region_srRNA.csv file, and
open the file in your preferred spreadsheet application. There are 11
fields, which are all required. The first field, “Action”, must contain
the values “Add” or “Remove”. The next field, “Common_Name”, is required
to be in the file, but it is optional to provide values to this field.
The next fields contain taxonomic information: “Domain”, “Phylum”,
“Class”, “Order”, “Family”, “Genus”, “Species”. Values are required in
all these fields, and the taxonomic information should follow the
taxonomy system of the reference database after taxonomy edits are
applied (e.g., if following this tutorial, use “Reptilia” in
the class field for turtles instead of “class Testudines”). The species
field should contain species binomials, and spaces should be used
instead of underscores in all of the taxonomy fields (e.g., use
“Canis lupus” in the species field instead of “Canis_lupus”). If the
reference database is from UNITE, then use kingdom names (e.g.,
“Fungi”) in the domain field. The next field, “Sequence”, contains the
DNA sequence to be either added or removed from the reference database,
depending on the value of the “Action” field. For example, the first two
records of the Sequence_Edits_Region_srRNA.csv file add sequences with
the provided taxonomies to the reference database, and the final two
records remove the listed reference sequences. If reference sequences to
be removed are not found in the reference database, the
format_reference_database
function will return warnings
stating which sequences were not found. We’ll pretend that the first two
records of the Sequence_Edits_Region_srRNA.csv file are additional
reference sequences which were sequenced by your lab. The final two
records are reference sequences which the lab believes were incorrectly
identified, and these will be removed from the reference database. To
reclassify a reference sequence, the sequence could be removed using one
record, and then added with a different taxonomy using another record
within the sequence edits file.
Now we’re ready to format the reference database for use with the Local Taxa Tool. Run the line below to format the srRNA MIDORI reference database and apply taxonomy and sequence edits.
# Format srRNA MIDORI reference database.
format_reference_database(
# Path to input reference database (with .fasta extension).
path_to_input_reference_database="MIDORI2_UNIQ_NUC_SP_GB252_srRNA_BLAST.fasta",
# Path to output BLAST database (with .fasta extension).
path_to_output_BLAST_database="BLAST_DB_srRNA.fasta",
# Input reference database source (must be "MIDORI" [the default] or "UNITE").
input_reference_database_source="MIDORI",
# Path to taxonomy edits file (with .csv extension).
# Set to NA (the default) if no taxonomy edits are desired.
path_to_taxonomy_edits="Taxonomy_Edits_NCBI.csv",
# Path to sequence edits file (with .csv extension).
# Set to NA (the default) if no sequence edits are desired.
path_to_sequence_edits="Sequence_Edits_Region_srRNA.csv")
We provide two approaches to creating lists of local species with
taxonomies. The first approach involves creating a list of species
binomials from any source of your choosing, then using the
get_taxonomies.species_binomials
function to fetch NCBI
taxonomies from the species binomials. The second approach involves
downloading regional species lists from the IUCN Red List, and then
using the get_taxonomies.IUCN
function to adjust the
taxonomies to be compatible with the reference databases.
Create a list of local species from a source of your choosing. Here, we use an example list of amphibians and reptiles of Oregon from the Oregon Department of Fish and Wildlife websites:
Locate the example ODFW_Herptiles_SpeciesBinomials.csv file, and open
the file in your preferred spreadsheet application. The list contains
two fields which are required by the
get_taxonomies.species_binomials
function: “Common_Name”
and “Scientific_Name”. Values are required in the scientific name field,
whereas values are optional in the common name field. The scientific
name field should contain species binomials with spaces, not
underscores, representing spaces.
The get_taxonomies.species_binomials
function will fetch
taxonomies from the NCBI taxonomy database using the taxize package, and
the NCBI taxonomies can be adjusted with a user-supplied taxonomy edit
file. The taxonomy edit file here works in the same way as the taxonomy
edit file used to adjust the reference database taxonomies, and we
recommend that the same taxonomy edit file is used here. Due to
differences in the NCBI and UNITE taxonomy systems, we do not recommend
using the get_taxonomies.species_binomials
function to
prepare local taxa lists if using UNITE reference databases.
Optionally, taxonomies can be fetched quicker if we get an Entrez API
key, which can be created and viewed on the settings page of your NCBI
account. Here, we will set our Entrez API key as an R environment
variable. More information on setting up and using an API key can be
found at ??"taxize-authentication"
. If the user does not
have an Entrez API key, that is okay. Just skip this line of code and
move on to the get_taxonomies.species_binomials
function;
it will run slower but will otherwise work the same.
# Optionally set up your Entrez API key as an R environment variable.
Sys.setenv(ENTREZ_KEY="your_API_key")
Now let’s run the get_taxonomies.species_binomials
function to fetch NCBI taxonomies for your local species list. By
default, species queries will print as the function runs, but printing
the queries can be disabled by setting the print_taxize_queries argument
to FALSE. You may experience a “Bad Request (HTTP 400)” error. If this
error occurs, then run the function again. If your species list is
large, you may experience the bad request error several times before the
function completes successfully. Just be patient, and consider running
this function multiple times on smaller subsets of your species list if
the bad request error is persistent.
# Fetch NCBI taxonomies from species binomials.
get_taxonomies.species_binomials(
# Path to input list of common and scientific names (with .csv extension).
path_to_input_species_binomials="Oregon_Herptiles_SpeciesBinomials.csv",
# Path to output local taxa list with added NCBI taxonomies (with .csv extension).
path_to_output_local_taxa_list="Oregon_Herptiles_Taxonomies.csv",
# Path to taxonomy edits file (with .csv extension).
# We recommend using the same taxonomy edits file used to adjust the reference database taxonomies.
# If no taxonomy edits are desired, then set this variable to NA (the default).
path_to_taxonomy_edits="Taxonomy_Edits_NCBI.csv",
# Do not print species queries.
print_taxize_queries=FALSE)
## Warning in get_taxonomies.species_binomials(path_to_input_species_binomials =
## "Oregon_Herptiles_SpeciesBinomials.csv", : NCBI taxonomies could not be found
## for 1 local species. Be sure to update unknown taxonomies in the local taxa list
## before using the Local Taxa Tool. Unknown taxonomies appear at the top of the
## local taxa list.
Let’s read in the output local taxa list with NCBI taxonomies and take a look at it.
# Read in file.
ncbi_taxonomies_local_taxa<-read.csv(file="Oregon_Herptiles_Taxonomies.csv")
# View file.
head(ncbi_taxonomies_local_taxa)
Common_Name | Domain | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
Klamath black salamander | Aneides klamathensis | ||||||
Western pond turtle | Eukaryota | Chordata | Reptilia | Testudines | Emydidae | Actinemys | Actinemys marmorata |
Northwestern salamander | Eukaryota | Chordata | Amphibia | Caudata | Ambystomatidae | Ambystoma | Ambystoma gracile |
Long-toed salamander | Eukaryota | Chordata | Amphibia | Caudata | Ambystomatidae | Ambystoma | Ambystoma macrodactylum |
Western tiger salamander | Eukaryota | Chordata | Amphibia | Caudata | Ambystomatidae | Ambystoma | Ambystoma mavortium |
Western toad | Eukaryota | Chordata | Amphibia | Anura | Bufonidae | Anaxyrus | Anaxyrus boreas |
Notice that NCBI taxonomies could not be found for a single taxon, the
Klamath black salamander. Species for which NCBI taxonomies could not be
found appear at the top of the output local taxa list, and NCBI
taxonomies should be filled in manually using taxonomies found on the
NCBI taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).
Be sure to apply appropriate taxonomy adjustments when inputting
taxonomies manually, and use NCBI superkingdom names in the domain field
(e.g., “Eukaryota”). For the Klamath black salamander, typing
its genus name Aneides into the NCBI taxonomy browser yields
the following taxonomy for this species: Eukaryota, Chordata, Amphibia,
Caudata, Plethodontidae, Aneides, Aneides
klamathensis.
After filling in missing NCBI taxonomies, remove duplicate species from the local taxa list before using the list with the Local Taxa Tool. Note that the taxize package may have changed some species binomials to synonyms which are present in the NCBI taxonomy database. For example, searching for Rana catesbeiana (American bullfrog) in the NCBI taxonomy browser yields the synonym Lithobates catesbeianus, which is the name NCBI and GenBank (and therefore MIDORI) apply to American bullfrog sequences.
Head to the IUCN Red List advanced search page (https://www.iucnredlist.org/search), and use the taxonomy and land regions filters to query your local species list. Here is an example filter for chordates (which includes vertebrates) in North America:
After setting your filters, click on the “Download” drop-down menu on
the right side of the screen and select “Search Summary”. You will be
prompted to sign-in or create an IUCN Red List account, answer questions
relating to your desired use of the data, and then you will be able to
download. The downloaded taxonomy.csv and common_names.csv files are of
interest here, and these can be used with the
get_taxonomies.IUCN
function to create a formatted local
taxa list.
We will use a taxonomy edits file to adjust the IUCN Red List taxonomies to better reflect the taxonomy system of our reference database. Locate the example Taxonomy_Edits_IUCN.csv file, and open the file in your preferred spreadsheet application. The taxonomy edits which we performed on our reference database earlier reduce the number of taxonomy edits we need to perform on the IUCN Red List taxonomies, and we apply two taxonomy adjustments using the Taxonomy_Edits_IUCN.csv file. This taxonomy edits file works the same way as the earlier Taxonomy_Edits_NCBI.csv file, but the taxonomy system is that of the IUCN Red List instead of NCBI. We encourage users to familiarize themselves with the IUCN Red List taxonomy system by browsing the taxonomy filter on the IUCN Red List advanced search page (https://www.iucnredlist.org/search) and adding any necessary taxonomy corrections of their own to the Taxonomy_Edits_NCBI.csv file. In our IUCN Red List taxonomy edits file, we assign members of class Cephalaspidomorphi to class Hyperoartia, and we assign members of order Cetartiodactyla to order Artiodactyla (NCBI taxonomy considers cetaceans to be artiodactylans).
Let’s run the get_taxonomies.IUCN
function to format the
downloaded IUCN Red List species list for the Local Taxa Tool.
# Get taxonomies from downloaded IUCN Red List files.
get_taxonomies.IUCN(
# Path to input taxonomy.csv file downloaded from the IUCN Red List.
path_to_IUCN_taxonomies="taxonomy.csv",
# Path to input common_names.csv file downloaded from the IUCN Red List.
path_to_IUCN_common_names="common_names.csv",
# Path to output local taxa list with formatted taxonomies (with .csv extension).
path_to_output_local_taxa_list="LocalTaxa_NorthAmerica_Vertebrates.csv",
# Domain name to use for all taxa. The default is "Eukaryota".
# The IUCN Red List files do not include domain information, so a domain name must be provided.
# If using a reference database from UNITE, provide a kingdom name here (e.g., "Fungi").
domain_name="Eukaryota",
# Path to taxonomy edits file (with .csv extension).
# If no taxonomy edits are desired, then set this variable to NA (the default).
path_to_taxonomy_edits="Taxonomy_Edits_IUCN.csv")
The Local Taxa Tool may be used for taxonomic assignment following
the use of any other bioinformatics pipeline given that either the
sequences to classify can be read into R, or the sequences are in fasta
format. If the sequences are in fasta format and all header lines are
unique, then the fasta file may be used directly with the Local Taxa
Tool. Ideally, the fasta file should contain only unique sequences to
classify, since this will save computation time and reduce memory (RAM)
usage. If any edits need to be made to the fasta file, the
read.fasta
function may be used to read the fasta file into
R to perform the edits. Then the write.fasta
function may
be used to write the data back out in fasta format.
# If your sequences to classify are already in fasta format, you can read the fasta file into
# R as a data frame to make any necessary edits. The data frame will contain two fields, one
# containing the text from the header lines (the "Name" field), and one containing the text
# from the sequence lines (the "Sequence" field). After making any necessary edits, use the
# write.fasta function (see below) to write the data back out in fasta format.
df<-read.fasta(file="your_file_path_here.fasta")
Here, we will assume the sequences to classify are contained in a sequence table. That is, a table in which each row represents a sample, and each column represents a DNA sequence. Each cell contains either the number of reads for the sequence in the sample, or in our case, a zero or a one representing whether the sequence was detected in the sample. To follow this tutorial, locate the example Mock_Data.csv file, and open the file in your preferred spreadsheet application. We’ll pretend that the mock data are from Oregon mammalian carnivore (order Carnivora) fecal samples, and the sequences are of the 12S vertebrate metabarcoding region. Let’s read the example sequence table into R and get the sequences to classify into fasta format.
# Read in sequence table.
sequence_table<-read.csv(file="Mock_Data.csv",row.names=1)
# Get sequences to classify from the sequence table.
sequences_to_classify<-colnames(sequence_table)
# Write out sequences to classify as a fasta file.
# All header lines should be unique. In this case, we use "Sequence_1", "Sequence_2", etc.
# The "names" argument takes a vector which will become the header lines,
# and the "sequences" argument takes a vector which will become the sequence lines.
write.fasta(names=paste0("Sequence_",1:length(sequences_to_classify)),
sequences=sequences_to_classify,
file="Sequences.fasta")
Now we’re ready to run the Local Taxa Tool. We will provide file paths to our sequences to classify, our reference database, and our local taxa list as arguments. Here, we use a list of Oregon vertebrates, the example LocalTaxa_Oregon_Vertebrates.csv file, as our local taxa list. The list was derived from species listed at the following sources:
We will also set some BLAST options (see blastn help documentation for details). First, we will set the maximum BLAST E-value to a low number (i.e., 1e-5) to ensure that only high-quality BLAST hits are returned. Next, we will set the maximum number of BLAST target sequences returned per query sequence. BLAST cannot return all best-matching hits out of the box; it will sort the hits by bit score and return a user-provided number of hits per query sequence. So we will need to ask BLAST to return a bunch of hits, and the Local Taxa Tool will isolate all of the best-matching hits (i.e., the hits with the highest bit score) for each query sequence. Therefore, enough target sequences should be returned to ensure that all maximum bit score matches are returned for each query sequence. Here, we set the maximum number of BLAST target sequences returned per query sequence to 2000, which we’ve found to generally be sufficient. A warning will be produced if this value is not high enough to ensure that all best BLAST matches are returned. Finally, we set the BLAST task specification. Here, we use “megablast” (the blastn default) to find very similar sequences (e.g., intraspecies or closely related species). “blastn-short” may be used for sequences shorter than 50 bases. Now let’s run the Local Taxa Tool!
# Run the Local Taxa Tool.
local_taxa_tool(
# Path to file containing sequences to classify (with .fasta extension).
path_to_sequences_to_classify="Sequences.fasta",
# Path to BLAST reference database (with .fasta extension).
path_to_BLAST_database="BLAST_DB_srRNA.fasta",
# Path to output file of classified sequences (with .csv extension).
path_to_output_file="Sequences_Classified.csv",
# Path to list of local taxa (with .csv extension).
# If local taxa suggestions are not desired, set this variable to NA (the default; all best BLAST matches will still be returned).
path_to_list_of_local_taxa="LocalTaxa_Oregon_Vertebrates.csv",
# Maximum E-value of returned BLAST hits (lower E-values are associated with more "significant" matches). See blastn help documentation for details.
blast_e_value=1e-5,
# Maximum number of BLAST target sequences returned per query sequence. Enough target sequences should be returned to ensure that all maximum bit score matches are returned for each query sequence. A warning will be produced if this value is not sufficient.
blast_max_target_seqs=2000,
# BLAST task specification. Use "megablast" (the default) to find very similar sequences (e.g., intraspecies or closely related species). Use "blastn-short" for sequences shorter than 50 bases. See blastn help documentation for additional options and details.
blast_task="megablast")
You may find your Local Taxa Tool output at your supplied path_to_output_file in the command above. Output field definitions are:
Let’s check out the example Local Taxa Tool output.
# Read in Local Taxa Tool output.
LTT_output<-read.csv(file="Sequences_Classified.csv")
# View Local Taxa Tool output.
print(LTT_output)
Sequence_name | Sequence | Best_match_references | Best_match_E_value | Best_match_bit_score | Best_match_query_cover.mean | Best_match_query_cover.SD | Best_match_PID.mean | Best_match_PID.SD | Local_taxa | Local_species |
---|---|---|---|---|---|---|---|---|---|---|
Sequence_1 | TTAGCCCTAAACCTAGATAGTTAACTCAAACAAAACTATCCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTACATCCCT | Acinonyx jubatus, Catopuma badia, Lynx rufus, Puma concolor, Puma yagouaroundi | 2.61e-47 | 187 | 100 | 0 | 100 | 0 | Lynx rufus, Puma concolor | Lynx rufus, Puma concolor |
Sequence_2 | TTAGCCCTAAACATAAATAGTTGTCCAACAAAACAATTCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTATATCCCT | Vulpes lagopus, Vulpes zerda | 7.12e-43 | 172 | 100 | 0 | 97.98 | 0 | Vulpes | Vulpes macrotis, Vulpes vulpes |
Sequence_3 | CCTTACGAATCCTGTTTGTCTTCTCCACCCGCGGATATTGCGCCCGGAGTCCCTGTAGTATCGTAACAACTAGGCTCGTGACGTATGTCAAGGTACTGTC | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found |
Let’s consider the output records to better understand how the Local
Taxa Tool functions.
In record 1, there were best BLAST matches which were local species. The best-matching local species are suggested in the Local_taxa and Local_species fields. Since the best-matching BLAST hits included local species, the Local_taxa and Local_species fields are the same. The standard deviation fields for both query cover and percent identity were zero, suggesting that all best-matching BLAST hits had the same query cover and percent identity. If a standard deviation of NA is observed, this would suggest that there was a single best BLAST match.
In record 2, best-matching BLAST hits belonged to two non-local species. Because these species belong to genus Vulpes, which includes local species, Vulpes appears in the Local_taxa field, and local species belonging to Vulpes are displayed in the Local_species field.
In record 3, there were no “significant” BLAST hits, resulting in “No significant similarity found” in the Local Taxa Tool output for this record.
We also have several options in the local_taxa_tool
function to control how taxa names are formatted in the output file. If
we want to return full taxonomies in the output file, we can set the
full_names argument to TRUE. We can also tell the tool to replace spaces
in taxa names in the output file with underscores by setting the
underscores argument to TRUE, and we can change the separator between
taxa names using the separator argument. In the below example, we return
full taxonomies in the output file in which spaces are replaced by
underscores, and we change the separator between taxa names.
# Run the Local Taxa Tool.
local_taxa_tool(
path_to_sequences_to_classify="Sequences.fasta",
path_to_BLAST_database="BLAST_DB_srRNA.fasta",
path_to_output_file="Sequences_Classified_Taxonomies.csv",
path_to_list_of_local_taxa="LocalTaxa_Oregon_Vertebrates.csv",
blast_e_value=1e-5,
blast_max_target_seqs=2000,
blast_task="megablast",
# Set to TRUE to return full taxonomies in the output file.
full_names=TRUE,
# Replace spaces in taxa names in the output file with underscores.
underscores=TRUE,
# Set the separator between taxa names in the output file.
separator=" | ")
# Read in Local Taxa Tool output.
LTT_output_taxonomies<-read.csv(file="Sequences_Classified_Taxonomies.csv")
# View Local Taxa Tool output.
print(LTT_output_taxonomies)
Sequence_name | Sequence | Best_match_references | Best_match_E_value | Best_match_bit_score | Best_match_query_cover.mean | Best_match_query_cover.SD | Best_match_PID.mean | Best_match_PID.SD | Local_taxa | Local_species |
---|---|---|---|---|---|---|---|---|---|---|
Sequence_1 | TTAGCCCTAAACCTAGATAGTTAACTCAAACAAAACTATCCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTACATCCCT | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Acinonyx;Acinonyx_jubatus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Catopuma;Catopuma_badia | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Lynx;Lynx_rufus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_concolor | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_yagouaroundi | 2.61e-47 | 187 | 100 | 0 | 100 | 0 | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Lynx;Lynx_rufus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_concolor | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Lynx;Lynx_rufus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_concolor |
Sequence_2 | TTAGCCCTAAACATAAATAGTTGTCCAACAAAACAATTCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTATATCCCT | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_lagopus | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_zerda | 7.12e-43 | 172 | 100 | 0 | 97.98 | 0 | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_macrotis | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_vulpes |
Sequence_3 | CCTTACGAATCCTGTTTGTCTTCTCCACCCGCGGATATTGCGCCCGGAGTCCCTGTAGTATCGTAACAACTAGGCTCGTGACGTATGTCAAGGTACTGTC | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found |
What if we wanted to take the traditional approach of subsetting the reference database to the sequences of local species? Then could we get all best BLAST matches of local species? Yes, we have the option of doing this with the Local Taxa Tool!
We can use our prepared local taxa lists to subset reference
databases to just local species using an optional argument in the
format_reference_database
function. Let’s use the same
command we used earlier to format the srRNA reference database, but this
time we’ll provide our Oregon vertebrate local taxa list to the function
to subset the reference database to sequences of local species.
# Format srRNA MIDORI reference database.
# This time, we'll provide a our Oregon vertebrate list to the
# optional path_to_list_of_local_taxa_to_subset argument.
format_reference_database(
path_to_input_reference_database="MIDORI2_UNIQ_NUC_SP_GB252_srRNA_BLAST.fasta",
path_to_output_BLAST_database="BLAST_DB_srRNA_OR_Vertebrates.fasta",
input_reference_database_source="MIDORI",
path_to_taxonomy_edits="Taxonomy_Edits_NCBI.csv",
path_to_sequence_edits="Sequence_Edits_Region_srRNA.csv",
# Path to list of local taxa to subset the reference database to (with .csv extension).
path_to_list_of_local_taxa_to_subset="LocalTaxa_Oregon_Vertebrates.csv")
Now we can run the Local Taxa Tool using the subsetted reference database without providing a local taxa list. This will cause the tool to return all best BLAST matches without providing local taxa suggestions.
# Run the Local Taxa Tool.
local_taxa_tool(path_to_sequences_to_classify="Sequences.fasta",
path_to_BLAST_database="BLAST_DB_srRNA_OR_Vertebrates.fasta",
path_to_output_file="Sequences_Classified_Reference_Subset.csv",
# Not providing a local taxa list disables local taxa suggestions.
path_to_list_of_local_taxa=NA,
blast_e_value=1e-5,
blast_max_target_seqs=2000,
blast_task="megablast")
Let’s check out the example Local Taxa Tool output in which the reference database was subsetted to the sequences of local species.
# Read in Local Taxa Tool output.
LTT_output_reference_subset<-read.csv(file="Sequences_Classified_Reference_Subset.csv")
# View Local Taxa Tool output.
print(LTT_output_reference_subset)
Sequence_name | Sequence | Best_match_references | Best_match_E_value | Best_match_bit_score | Best_match_query_cover.mean | Best_match_query_cover.SD | Best_match_PID.mean | Best_match_PID.SD |
---|---|---|---|---|---|---|---|---|
Sequence_1 | TTAGCCCTAAACCTAGATAGTTAACTCAAACAAAACTATCCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTACATCCCT | Lynx rufus, Puma concolor | 1.07e-48 | 187 | 100 | 0 | 100 | 0 |
Sequence_2 | TTAGCCCTAAACATAAATAGTTGTCCAACAAAACAATTCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTATATCCCT | Vulpes vulpes | 1.36e-42 | 167 | 100 | 0 | 96.97 | 0 |
Sequence_3 | CCTTACGAATCCTGTTTGTCTTCTCCACCCGCGGATATTGCGCCCGGAGTCCCTGTAGTATCGTAACAACTAGGCTCGTGACGTATGTCAAGGTACTGTC | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found | No significant similarity found |
Why does the Local Taxa Tool use bit score to select the best BLAST matches? Why not E-value?:
Selecting matches with the highest bit score is the same as selecting matches with the lowest E-value. If we consider the formula below for calculating E-values, we can see that the matches with the highest bit score will also have the lowest E-value.
\(E\text-value=\frac{query~sequence~length~*~total~database~length}{2^{bit~score}}\)
Because all matches with the same bit score will also have the same E-value, no mean or standard deviation is provided for E-value because all the E-values are the same. By default, BLAST sorts hits by decreasing bit score, which is the same as sorting hits by increasing E-value. The hiccup is that sometimes E-values are so small that BLAST returns them as zero, in which case hits with the same lowest E-value (zero) can have different bit scores, which should not occur. When E-values underflow to zero, BLAST still sorts hits by bit score, which does not experience underflow issues. This makes selecting best BLAST matches by bit score more reliable than selecting best BLAST matches by E-value, and considering hits with the highest bit score to be the best-matching hits follows BLAST’s default behavior.
Why are means and standard deviations presented for query cover and percent identity?:
BLAST hits with the highest bit score can have different query covers and percent identities, which is why the query cover and percent identity for all best-matching BLAST hits are summarized with their means and standard deviations. A standard deviation of zero means that there was no variation in the query cover or percent identity of the best-matching BLAST hits. A standard deviation of NA can occur if there is only a single best-matching BLAST hit, in which case standard deviation is undefined.
How do local taxa suggestions work?:
Query sequences are BLASTed against a global reference database, and the tool suggests locally occurring taxa which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Let’s consider some examples. Imagine a query sequence with best-matching BLAST hits belonging to two species of frog and one species of caecilian: Pelophylax lessonae, Pelophylax ridibundus, and Typhlonectes natans. None of these species are local, so the tool works up the taxonomic levels of both the best-matching reference species and the local species until it finds a shared taxonomic unit. The tool works up to genus and finds no Pelophylax or Typhlonectes species in the local taxa list, so the tool works up to family and checks whether any local species belong to Ranidae or Typhlonectidae. Our local taxa list contains members of Ranidae, so the tool stops working up taxonomic levels and lists Ranidae in the Local_taxa field. Local species belonging to Ranidae are listed in the Local_species field. If our local taxa list included members of both Ranidae and Typhlonectidae (which did not belong to Pelophylax or Typhlonectes), then the tool would list both Ranidae and Typhlonectidae in the Local_taxa field, and local species belonging to both of these families would be listed in the Local_species field.
How does the Local Taxa Tool handle hemihomonyms?:
In order to handle hemihomonyms, the Local Taxa Tool considers the full taxonomy when checking for matches between the names of best-matching BLAST hits and local taxa at any taxonomic level, including species. For example, the species binomial Agathis montana is shared between a wasp and a conifer. The Local Taxa Tool can tell the difference between these two species because their full taxonomies do not match. That is, “Eukaryota;Arthropoda;Insecta;Hymenoptera;Braconidae;Agathis;Agathis_montana” differs from “Eukaryota;Streptophyta;Pinopsida;Araucariales;Araucariaceae;Agathis;Agathis_montana”. Similarly, the genus name Lawsonia is shared by beetles, plants, and bacteria. Again, the Local Taxa Tool can tell the difference between these genera because their taxonomies differ: “Eukaryota;Arthropoda;Insecta;Coleoptera;Anthribidae;Lawsonia” vs “Eukaryota;Streptophyta;Magnoliopsida;Myrtales;Lythraceae;Lawsonia” vs “Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae;Lawsonia”.
Why are local species suggested which are most closely related to any best-matching BLAST hit? Why not suggest local species which are most closely related to all best-matching BLAST hits?:
The Local Taxa Tool was designed to automate what we observed researchers in our lab performing manually. Let’s consider a query sequence with best-matching BLAST hits belonging to the species Vulpes vulpes and Nyctereutes procyonoides. Vulpes vulpes is local while Nyctereutes procyonoides is not. In suggesting local species which are most closely related to any best-matching BLAST hit, the tool will suggest Vulpes vulpes in the Local_taxa and Local_species fields. If the tool suggested local species which were most closely related to all best-matching BLAST hits, then all local members of Canidae would be suggested (which includes both best-matching references Vulpes vulpes and Nyctereutes procyonoides), even if other canid species were present in the reference database and were not best BLAST matches. In such a situation, a researcher would likely prefer Vulpes vulpes to be suggested instead of all local canids.
Should the local taxa suggestions be taken as definitive classifications?:
No. The local taxa suggestions are just that, suggestions. The Local Taxa Tool was made to automate what was previously a manual task, and we encourage users to vet local taxa suggestions for reasonableness within the context of their studies. Local taxa lists which are not comprehensive or which use a different taxonomy system than the reference database may cause the tool to fail to suggest local species. Users should also consider the query cover and percent identity statistics to determine the quality of the BLAST hits. Online BLAST searches (https://blast.ncbi.nlm.nih.gov/Blast.cgi) are also a good way for users to explore interesting sequences.
How can I report bugs?:
Please report bugs to https://github.com/Urodelan/LocaTT/issues
How should I cite the Local Taxa Tool?:
# Display citation for the Local Taxa Tool (LocaTT).
citation("LocaTT")
##
## To cite the Local Taxa Tool (LocaTT) in publications use:
##
## Kenen B Goodwin and Taal Levi (2023). Local Taxa Tool (LocaTT):
## Geographically-conscious taxonomic assignment for DNA metabarcoding.
## R package version 1.1.1, https://github.com/Urodelan/LocaTT.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {Local Taxa Tool (LocaTT): Geographically-conscious taxonomic assignment for DNA metabarcoding},
## author = {Kenen B Goodwin and Taal Levi},
## year = {2023},
## note = {R package version 1.1.1},
## url = {https://github.com/Urodelan/LocaTT},
## }