Background

The Local Taxa Tool (LocaTT, pronounced “locate”) is an R package which performs taxonomic assignment of DNA metabarcoding sequence data while considering geographic location. Sequences are BLASTed against a global reference database, and the tool suggests locally occurring taxa which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Compared to other implementations of BLAST, this approach provides the following advantages. First, all best BLAST matches are considered, as opposed to either the first or a set number of matches. Second, all suggested taxa are local species which actually occur in the study area. And third, local species which are not present in the reference database can still be suggested if related species have reference sequences, compared to approaches which subset the reference database to just local species.

The tool may be used for taxonomic assignment following the use of any other bioinformatics pipeline given that either the sequences to classify can be read into R, or the sequences are in fasta format. There are three main steps to using the Local Taxa Tool. First, reference databases from MIDORI or UNITE are formatted for use with the tool, and users can apply any desired edits to the taxonomies or sequences in the reference databases. Second, local species lists with taxonomies are prepared from either a list of species binomials or species lists downloaded from the IUCN Red List. Third, the Local Taxa Tool is run to perform taxonomic assignment and suggest local species for each query sequence. This tutorial describes how to prepare reference databases, prepare local species lists, and run the Local Taxa Tool.

The LocaTT package also contains basic bioinformatics functions for DNA metabarcoding. A full bioinformatics pipeline using the LocaTT package is in development.

Installing Software

R and BLAST are required to run the Local Taxa Tool. R can be installed from https://www.r-project.org, and RStudio Desktop can optionally be installed from https://posit.co/download/rstudio-desktop/. BLAST can be installed from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (use the file ending in .dmg for Mac, use the file ending in -win64.exe for Windows). Throughout this tutorial, all code is run in R. No use of the command line is necessary, as BLAST programs are called from R.

# Check R version.
R.Version()$version.string
## [1] "R version 4.2.0 (2022-04-22)"
# Display platform.
R.Version()$platform
## [1] "x86_64-apple-darwin17.0"

Now let’s install the Local Taxa Tool package, which is named “LocaTT”. To install the current CRAN release, use the install.packages function below.

# Install the current Local Taxa Tool (LocaTT) release from CRAN.
install.packages("LocaTT")

The development version of the package is available on GitHub. To install the development version, install the devtools package first in order to get the install_github function.

# Install the devtools package.
install.packages("devtools")

# Load the devtools package.
library(devtools)

# Install the development version of the Local Taxa Tool (LocaTT) package from GitHub.
devtools::install_github("Urodelan/LocaTT")

Now, load the Local Taxa Tool package.

# Load the Local Taxa Tool package.
library(LocaTT)
## Welcome to the Local Taxa Tool (LocaTT)! This is version 1.1.1.
##    ______  ______  ______   Geographically-conscious
##     |  | \/ |  | \/ |  |    taxonomic assignment
##    _|__|_/\_|__|_/\_|__|_   for DNA metabarcoding.
## For a detailed tutorial, see: https://urodelan.github.io/Local_Taxa_Tool_Tutorial/
## Command line BLAST is required, see tutorial for details.
## See citation('LocaTT') if used in publications.

Once the Local Taxa Tool is installed and loaded, let’s check that the BLAST installation was successful.

# Display the BLAST version.
blast_version()
## [1] "blastn: 2.12.0+ Package: blast 2.12.0, build Jun  4 2021 04:06:33"

If the BLAST version was displayed, then we’re good to go! If you’re using a non-standard installation of BLAST, set the path to the blastn program using the blast_command argument. For non-standard BLAST installations, paths to BLAST programs can be set in the format_reference_database and local_taxa_tool functions as well.

Some functions in the Local Taxa Tool package take file paths as arguments which are used by both R and the command line. If using Windows, please use file paths which can be understood by both R and the command line. For example, by using forward slashes in file paths instead of backslashes. Further, some functions do not allow spaces in certain file paths which are read by both R and the command line. Therefore, it is recommended that file paths provided as function arguments lack spaces. If the working directory contains spaces, you can set R’s working directory with the setwd function, then ensure that relative file paths used as function arguments contain no spaces.

Using the Local Taxa Tool on a high performance computing cluster:

On some high performance computing clusters, BLAST is installed as a module which must be loaded from the terminal (e.g., module load blast) in order to be recognized by R. Further, on some clusters R is set up as a container which does not recognize modules loaded outside of the container. Under this condition, adding the cluster’s module path for BLAST to R’s PATH environmental variable may allow R to locate command line BLAST programs. The cluster’s module path for BLAST can be found using module show blast on the terminal, and this can be added to R’s PATH environmental variable using R’s Sys.setenv function. Below is an example of how to do this.

# Run in Unix terminal. Get module path for BLAST on high performance computing cluster.
module show blast 2>&1 | grep ^prepend_path\( | sed 's/^prepend_path("PATH",//' | sed 's/)$//'
## "/your/path/to/BLAST/module/bin/folder/here"
# In R, add the cluster's module path for BLAST to R's PATH environmental variable.
Sys.setenv(PATH=paste("/your/path/to/BLAST/module/bin/folder/here",Sys.getenv("PATH"),sep=":"))

# Load the Local Taxa Tool package.
library(LocaTT)

# Check that R can find BLAST programs by checking the BLAST version.
blast_version()
## [1] "blastn: 2.12.0+ Package: blast 2.12.0, build Jun  4 2021 04:06:33"

Alternatively, paths to BLAST programs can be provided as arguments in the functions blast_version, format_reference_database, and local_taxa_tool. If you continue to encounter issues running command line BLAST programs from R on a cluster, please contact your cluster’s help desk.

Example Data

This tutorial uses example data files which can be downloaded from the GitHub repository: https://github.com/Urodelan/Local_Taxa_Tool_Tutorial_Data. To download the example data files, head to the GitHub repository and click on the green “Code” drop-down menu at the top right. Then click “Download ZIP”, and unzip the downloaded file. Some example data files include DNA sequences. These example sequences are either from GenBank, were derived from existing GenBank sequences, or were randomly generated.

Preparing Reference Databases

For the Local Taxa Tool to provide local species suggestions, a local species list must be provided with the same taxonomy system as the reference database. In turn, the reference database must contain taxonomy information for each reference sequence. Reference databases must be in fasta format, and fasta header lines must contain seven-level taxonomies with taxonomic units delimited by semi-colons and underscores representing spaces (e.g., in species binomials). The format_reference_database function formats reference databases from MIDORI or UNITE (general fasta release) for use with the Local Taxa Tool.

MIDORI Reference 2 (http://www.reference-midori.info) hosts reference databases derived from GenBank sequences with NCBI taxonomies which can be formatted for use with the Local Taxa Tool. MIDORI databases are available for multiple barcode regions, including COI and srRNA. Here, we will download the MIDORI srRNA reference database, which encompasses the 12S vertebrate metabarcoding region, for an example of using the Local Taxa Tool to classify the sequences of vertebrate prey items from the fecal samples of mammalian carnivores (order Carnivora).

Let’s go to download section of MIDORI’s website (http://www.reference-midori.info/download.php#) and select the following drop-down menus. Select “Databases”, then the desired release (in our case, the most recent release is “GenBank252”). Then, select either “BLAST” or “BLAST_sp” to access databases which are ready to be formatted for BLAST. We selected “BLAST_sp” for databases that include both sequences which have species-level identification and sequences which lack a binomial species-level description (e.g., sp.; see the MIDORI README file for details). Now, select “uniq” for databases containing only unique sequences for each species, and select “fasta” to find the databases in fasta format. From here, we can see the MIDORI databases for various barcode regions. For this tutorial, download the srRNA database and unzip it.

When formatting reference databases for the Local Taxa Tool using the format_reference_database function, please consider the taxonomy system of your reference databases and whether any adjustments should be made to improve compatibility between the taxonomies of the reference sequences and your local species lists. Within this tutorial, we provide options for preparing local taxa lists using two different taxonomy systems, the NCBI taxonomy database and the IUCN Red List. If the reference database and local species list use different taxonomy systems, we strongly recommend adjusting the taxonomies of either the reference database, local species list, or both to improve compatibility between the taxonomy systems. The functions included in the Local Taxa Tool package for preparing reference databases and local taxa lists have options for adjusting taxonomies, and we will cover adjusting taxonomies in reference databases now.

MIDORI reference sequences use taxonomies from the NCBI taxonomy database, which can be browsed at https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi. In instances where NCBI does not provide a taxonomic level, then MIDORI uses the lower taxonomic level with the name of the missing taxonomic level as a prefix. NCBI’s superkingdom (e.g., Eukaryota) is used as the domain, and phylum, class, order, family, genus, and species are included to complete a 7-level taxonomy. Here is an example taxonomy of Ministeria vibrans, which lacks a phylum, order, and family in NCBI:

  • Domain: Eukaryota
  • Phylum: phylum_Filasterea
  • Class: Filasterea
  • Order: order_family_Ministeria
  • Family: family_Ministeria
  • Genus: Ministeria
  • Species: Ministeria_vibrans

Full taxonomy: Eukaryota;phylum_Filasterea;Filasterea;order_family_Ministeria;family_Ministeria;Ministeria;Ministeria_vibrans

In this tutorial, we will adjust taxonomies in the downloaded MIDORI srRNA reference database to better match other taxonomic sources, including the IUCN Red List. Locate the example Taxonomy_Edits_NCBI.csv file, and open the file in your preferred spreadsheet application. You will notice three fields in the file, which are all required. The first field, “Old_Taxonomy”, contains the original taxonomies which need to be changed, and the second field, “New_Taxonomy”, contains the taxonomies which the old ones will be changed to. Within both the old and new taxonomy fields, taxonomies should begin with the highest taxonomic level (e.g., the domain name) and proceed to progressively finer taxonomic levels. Taxonomic levels should be delimited by semi-colons. The “Notes” field states what each taxonomy edit accomplishes. For example, let’s consider the third record in the downloaded Taxonomy_Edits_NCBI.csv file. The old taxonomy is “Eukaryota;Chordata;class_Testudines;”, and the new taxonomy is “Eukaryota;Chordata;Reptilia;”. In the “Notes” field, we can see that this record assigns members of order Testudines (the turtles) to class Reptilia. NCBI taxonomy does not have Reptilia. Because there is no class associated with Testudines in NCBI taxonomy, the class for Testudines used by MIDORI is “class_Testudines”. Within the format_reference_database function, the old reference taxonomies are replaced with the new taxonomies as listed in the taxonomy edits file. The replacements are made sequentially in the order records appear in the taxonomy edits file. In addition to editing taxonomies, records in the taxonomy edits file can also be used to fix typos in the reference database. For example, the final record in the Taxonomy_Edits_NCBI.csv file corrects a typo in the species binomial “Panthera_sp_laea” to “Panthera_spelaea”.

Finally, we will add and remove certain sequences from the reference database. Locate the example Sequence_Edits_Region_srRNA.csv file, and open the file in your preferred spreadsheet application. There are 11 fields, which are all required. The first field, “Action”, must contain the values “Add” or “Remove”. The next field, “Common_Name”, is required to be in the file, but it is optional to provide values to this field. The next fields contain taxonomic information: “Domain”, “Phylum”, “Class”, “Order”, “Family”, “Genus”, “Species”. Values are required in all these fields, and the taxonomic information should follow the taxonomy system of the reference database after taxonomy edits are applied (e.g., if following this tutorial, use “Reptilia” in the class field for turtles instead of “class Testudines”). The species field should contain species binomials, and spaces should be used instead of underscores in all of the taxonomy fields (e.g., use “Canis lupus” in the species field instead of “Canis_lupus”). If the reference database is from UNITE, then use kingdom names (e.g., “Fungi”) in the domain field. The next field, “Sequence”, contains the DNA sequence to be either added or removed from the reference database, depending on the value of the “Action” field. For example, the first two records of the Sequence_Edits_Region_srRNA.csv file add sequences with the provided taxonomies to the reference database, and the final two records remove the listed reference sequences. If reference sequences to be removed are not found in the reference database, the format_reference_database function will return warnings stating which sequences were not found. We’ll pretend that the first two records of the Sequence_Edits_Region_srRNA.csv file are additional reference sequences which were sequenced by your lab. The final two records are reference sequences which the lab believes were incorrectly identified, and these will be removed from the reference database. To reclassify a reference sequence, the sequence could be removed using one record, and then added with a different taxonomy using another record within the sequence edits file.

Now we’re ready to format the reference database for use with the Local Taxa Tool. Run the line below to format the srRNA MIDORI reference database and apply taxonomy and sequence edits.

# Format srRNA MIDORI reference database.
format_reference_database(
  # Path to input reference database (with .fasta extension).
  path_to_input_reference_database="MIDORI2_UNIQ_NUC_SP_GB252_srRNA_BLAST.fasta",
  # Path to output BLAST database (with .fasta extension).
  path_to_output_BLAST_database="BLAST_DB_srRNA.fasta",
  # Input reference database source (must be "MIDORI" [the default] or "UNITE").
  input_reference_database_source="MIDORI",
  # Path to taxonomy edits file (with .csv extension).
  # Set to NA (the default) if no taxonomy edits are desired.
  path_to_taxonomy_edits="Taxonomy_Edits_NCBI.csv",
  # Path to sequence edits file (with .csv extension).
  # Set to NA (the default) if no sequence edits are desired.
  path_to_sequence_edits="Sequence_Edits_Region_srRNA.csv")

Preparing Local Species Lists

We provide two approaches to creating lists of local species with taxonomies. The first approach involves creating a list of species binomials from any source of your choosing, then using the get_taxonomies.species_binomials function to fetch NCBI taxonomies from the species binomials. The second approach involves downloading regional species lists from the IUCN Red List, and then using the get_taxonomies.IUCN function to adjust the taxonomies to be compatible with the reference databases.

Approach 1: Manual entry of local species and semi-automated fetching of taxonomies

Create a list of local species from a source of your choosing. Here, we use an example list of amphibians and reptiles of Oregon from the Oregon Department of Fish and Wildlife websites:

Locate the example ODFW_Herptiles_SpeciesBinomials.csv file, and open the file in your preferred spreadsheet application. The list contains two fields which are required by the get_taxonomies.species_binomials function: “Common_Name” and “Scientific_Name”. Values are required in the scientific name field, whereas values are optional in the common name field. The scientific name field should contain species binomials with spaces, not underscores, representing spaces.

The get_taxonomies.species_binomials function will fetch taxonomies from the NCBI taxonomy database using the taxize package, and the NCBI taxonomies can be adjusted with a user-supplied taxonomy edit file. The taxonomy edit file here works in the same way as the taxonomy edit file used to adjust the reference database taxonomies, and we recommend that the same taxonomy edit file is used here. Due to differences in the NCBI and UNITE taxonomy systems, we do not recommend using the get_taxonomies.species_binomials function to prepare local taxa lists if using UNITE reference databases.

Optionally, taxonomies can be fetched quicker if we get an Entrez API key, which can be created and viewed on the settings page of your NCBI account. Here, we will set our Entrez API key as an R environment variable. More information on setting up and using an API key can be found at ??"taxize-authentication". If the user does not have an Entrez API key, that is okay. Just skip this line of code and move on to the get_taxonomies.species_binomials function; it will run slower but will otherwise work the same.

# Optionally set up your Entrez API key as an R environment variable.
Sys.setenv(ENTREZ_KEY="your_API_key")

Now let’s run the get_taxonomies.species_binomials function to fetch NCBI taxonomies for your local species list. By default, species queries will print as the function runs, but printing the queries can be disabled by setting the print_taxize_queries argument to FALSE. You may experience a “Bad Request (HTTP 400)” error. If this error occurs, then run the function again. If your species list is large, you may experience the bad request error several times before the function completes successfully. Just be patient, and consider running this function multiple times on smaller subsets of your species list if the bad request error is persistent.

# Fetch NCBI taxonomies from species binomials.
get_taxonomies.species_binomials(
  # Path to input list of common and scientific names (with .csv extension).
  path_to_input_species_binomials="Oregon_Herptiles_SpeciesBinomials.csv",
  # Path to output local taxa list with added NCBI taxonomies (with .csv extension).
  path_to_output_local_taxa_list="Oregon_Herptiles_Taxonomies.csv",
  # Path to taxonomy edits file (with .csv extension).
  # We recommend using the same taxonomy edits file used to adjust the reference database taxonomies.
  # If no taxonomy edits are desired, then set this variable to NA (the default).
  path_to_taxonomy_edits="Taxonomy_Edits_NCBI.csv",
  # Do not print species queries.
  print_taxize_queries=FALSE)
## Warning in get_taxonomies.species_binomials(path_to_input_species_binomials =
## "Oregon_Herptiles_SpeciesBinomials.csv", : NCBI taxonomies could not be found
## for 1 local species. Be sure to update unknown taxonomies in the local taxa list
## before using the Local Taxa Tool. Unknown taxonomies appear at the top of the
## local taxa list.

Let’s read in the output local taxa list with NCBI taxonomies and take a look at it.

# Read in file.
ncbi_taxonomies_local_taxa<-read.csv(file="Oregon_Herptiles_Taxonomies.csv")

# View file.
head(ncbi_taxonomies_local_taxa)
Common_Name Domain Phylum Class Order Family Genus Species
Klamath black salamander Aneides klamathensis
Western pond turtle Eukaryota Chordata Reptilia Testudines Emydidae Actinemys Actinemys marmorata
Northwestern salamander Eukaryota Chordata Amphibia Caudata Ambystomatidae Ambystoma Ambystoma gracile
Long-toed salamander Eukaryota Chordata Amphibia Caudata Ambystomatidae Ambystoma Ambystoma macrodactylum
Western tiger salamander Eukaryota Chordata Amphibia Caudata Ambystomatidae Ambystoma Ambystoma mavortium
Western toad Eukaryota Chordata Amphibia Anura Bufonidae Anaxyrus Anaxyrus boreas


Notice that NCBI taxonomies could not be found for a single taxon, the Klamath black salamander. Species for which NCBI taxonomies could not be found appear at the top of the output local taxa list, and NCBI taxonomies should be filled in manually using taxonomies found on the NCBI taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi). Be sure to apply appropriate taxonomy adjustments when inputting taxonomies manually, and use NCBI superkingdom names in the domain field (e.g., “Eukaryota”). For the Klamath black salamander, typing its genus name Aneides into the NCBI taxonomy browser yields the following taxonomy for this species: Eukaryota, Chordata, Amphibia, Caudata, Plethodontidae, Aneides, Aneides klamathensis.

After filling in missing NCBI taxonomies, remove duplicate species from the local taxa list before using the list with the Local Taxa Tool. Note that the taxize package may have changed some species binomials to synonyms which are present in the NCBI taxonomy database. For example, searching for Rana catesbeiana (American bullfrog) in the NCBI taxonomy browser yields the synonym Lithobates catesbeianus, which is the name NCBI and GenBank (and therefore MIDORI) apply to American bullfrog sequences.

Approach 2: Download regional species list from the IUCN Red List and update taxonomies to better reflect the taxonomy system of the reference database

Head to the IUCN Red List advanced search page (https://www.iucnredlist.org/search), and use the taxonomy and land regions filters to query your local species list. Here is an example filter for chordates (which includes vertebrates) in North America:

  • Taxonomy: Chordata – Phylum
  • Geographic scope (default): Global
  • Red List category: All but extinct and extinct in the wild
  • Land regions: North America
  • Include (default): Species

After setting your filters, click on the “Download” drop-down menu on the right side of the screen and select “Search Summary”. You will be prompted to sign-in or create an IUCN Red List account, answer questions relating to your desired use of the data, and then you will be able to download. The downloaded taxonomy.csv and common_names.csv files are of interest here, and these can be used with the get_taxonomies.IUCN function to create a formatted local taxa list.

We will use a taxonomy edits file to adjust the IUCN Red List taxonomies to better reflect the taxonomy system of our reference database. Locate the example Taxonomy_Edits_IUCN.csv file, and open the file in your preferred spreadsheet application. The taxonomy edits which we performed on our reference database earlier reduce the number of taxonomy edits we need to perform on the IUCN Red List taxonomies, and we apply two taxonomy adjustments using the Taxonomy_Edits_IUCN.csv file. This taxonomy edits file works the same way as the earlier Taxonomy_Edits_NCBI.csv file, but the taxonomy system is that of the IUCN Red List instead of NCBI. We encourage users to familiarize themselves with the IUCN Red List taxonomy system by browsing the taxonomy filter on the IUCN Red List advanced search page (https://www.iucnredlist.org/search) and adding any necessary taxonomy corrections of their own to the Taxonomy_Edits_NCBI.csv file. In our IUCN Red List taxonomy edits file, we assign members of class Cephalaspidomorphi to class Hyperoartia, and we assign members of order Cetartiodactyla to order Artiodactyla (NCBI taxonomy considers cetaceans to be artiodactylans).

Let’s run the get_taxonomies.IUCN function to format the downloaded IUCN Red List species list for the Local Taxa Tool.

# Get taxonomies from downloaded IUCN Red List files.
get_taxonomies.IUCN(
  # Path to input taxonomy.csv file downloaded from the IUCN Red List.
  path_to_IUCN_taxonomies="taxonomy.csv",
  # Path to input common_names.csv file downloaded from the IUCN Red List.
  path_to_IUCN_common_names="common_names.csv",
  # Path to output local taxa list with formatted taxonomies (with .csv extension).
  path_to_output_local_taxa_list="LocalTaxa_NorthAmerica_Vertebrates.csv",
  # Domain name to use for all taxa. The default is "Eukaryota".
  # The IUCN Red List files do not include domain information, so a domain name must be provided.
  # If using a reference database from UNITE, provide a kingdom name here (e.g., "Fungi").
  domain_name="Eukaryota",
  # Path to taxonomy edits file (with .csv extension).
  # If no taxonomy edits are desired, then set this variable to NA (the default).
  path_to_taxonomy_edits="Taxonomy_Edits_IUCN.csv")

Running the Local Taxa Tool

The Local Taxa Tool may be used for taxonomic assignment following the use of any other bioinformatics pipeline given that either the sequences to classify can be read into R, or the sequences are in fasta format. If the sequences are in fasta format and all header lines are unique, then the fasta file may be used directly with the Local Taxa Tool. Ideally, the fasta file should contain only unique sequences to classify, since this will save computation time and reduce memory (RAM) usage. If any edits need to be made to the fasta file, the read.fasta function may be used to read the fasta file into R to perform the edits. Then the write.fasta function may be used to write the data back out in fasta format.

# If your sequences to classify are already in fasta format, you can read the fasta file into
# R as a data frame to make any necessary edits. The data frame will contain two fields, one
# containing the text from the header lines (the "Name" field), and one containing the text
# from the sequence lines (the "Sequence" field). After making any necessary edits, use the
# write.fasta function (see below) to write the data back out in fasta format.
df<-read.fasta(file="your_file_path_here.fasta")

Here, we will assume the sequences to classify are contained in a sequence table. That is, a table in which each row represents a sample, and each column represents a DNA sequence. Each cell contains either the number of reads for the sequence in the sample, or in our case, a zero or a one representing whether the sequence was detected in the sample. To follow this tutorial, locate the example Mock_Data.csv file, and open the file in your preferred spreadsheet application. We’ll pretend that the mock data are from Oregon mammalian carnivore (order Carnivora) fecal samples, and the sequences are of the 12S vertebrate metabarcoding region. Let’s read the example sequence table into R and get the sequences to classify into fasta format.

# Read in sequence table.
sequence_table<-read.csv(file="Mock_Data.csv",row.names=1)

# Get sequences to classify from the sequence table.
sequences_to_classify<-colnames(sequence_table)

# Write out sequences to classify as a fasta file.
# All header lines should be unique. In this case, we use "Sequence_1", "Sequence_2", etc.
# The "names" argument takes a vector which will become the header lines,
# and the "sequences" argument takes a vector which will become the sequence lines.
write.fasta(names=paste0("Sequence_",1:length(sequences_to_classify)),
            sequences=sequences_to_classify,
            file="Sequences.fasta")

Now we’re ready to run the Local Taxa Tool. We will provide file paths to our sequences to classify, our reference database, and our local taxa list as arguments. Here, we use a list of Oregon vertebrates, the example LocalTaxa_Oregon_Vertebrates.csv file, as our local taxa list. The list was derived from species listed at the following sources:

We will also set some BLAST options (see blastn help documentation for details). First, we will set the maximum BLAST E-value to a low number (i.e., 1e-5) to ensure that only high-quality BLAST hits are returned. Next, we will set the maximum number of BLAST target sequences returned per query sequence. BLAST cannot return all best-matching hits out of the box; it will sort the hits by bit score and return a user-provided number of hits per query sequence. So we will need to ask BLAST to return a bunch of hits, and the Local Taxa Tool will isolate all of the best-matching hits (i.e., the hits with the highest bit score) for each query sequence. Therefore, enough target sequences should be returned to ensure that all maximum bit score matches are returned for each query sequence. Here, we set the maximum number of BLAST target sequences returned per query sequence to 2000, which we’ve found to generally be sufficient. A warning will be produced if this value is not high enough to ensure that all best BLAST matches are returned. Finally, we set the BLAST task specification. Here, we use “megablast” (the blastn default) to find very similar sequences (e.g., intraspecies or closely related species). “blastn-short” may be used for sequences shorter than 50 bases. Now let’s run the Local Taxa Tool!

# Run the Local Taxa Tool.
local_taxa_tool(
  # Path to file containing sequences to classify (with .fasta extension).
  path_to_sequences_to_classify="Sequences.fasta",
  # Path to BLAST reference database (with .fasta extension).
  path_to_BLAST_database="BLAST_DB_srRNA.fasta",
  # Path to output file of classified sequences (with .csv extension).
  path_to_output_file="Sequences_Classified.csv",
  # Path to list of local taxa (with .csv extension).
  # If local taxa suggestions are not desired, set this variable to NA (the default; all best BLAST matches will still be returned).
  path_to_list_of_local_taxa="LocalTaxa_Oregon_Vertebrates.csv",
  # Maximum E-value of returned BLAST hits (lower E-values are associated with more "significant" matches). See blastn help documentation for details.
  blast_e_value=1e-5,
  # Maximum number of BLAST target sequences returned per query sequence. Enough target sequences should be returned to ensure that all maximum bit score matches are returned for each query sequence. A warning will be produced if this value is not sufficient.
  blast_max_target_seqs=2000,
  # BLAST task specification. Use "megablast" (the default) to find very similar sequences (e.g., intraspecies or closely related species). Use "blastn-short" for sequences shorter than 50 bases. See blastn help documentation for additional options and details.
  blast_task="megablast")

You may find your Local Taxa Tool output at your supplied path_to_output_file in the command above. Output field definitions are:

  • Sequence_name: The query sequence name.
  • Sequence: The query sequence.
  • Best_match_references: Species binomials of all best-matching BLAST hits (by bit score) from the reference database.
  • Best_match_E_value: The E-value associated with the best-matching BLAST hits.
  • Best_match_bit_score: The bit score associated with the best-matching BLAST hits.
  • Best_match_query_cover.mean: The mean query cover of all best-matching BLAST hits.
  • Best_match_query_cover.SD: The standard deviation of query cover of all best-matching BLAST hits.
  • Best_match_PID.mean: The mean percent identity of all best-matching BLAST hits.
  • Best_match_PID.SD: The standard deviation of percent identity of all best-matching BLAST hits.
  • Local_taxa (Field only present if a path to a local taxa list is provided): The finest taxonomic unit(s) which include both any species of the best-matching BLAST hits and any local species. If the species of any of the best-matching BLAST hits are local, then the finest taxonomic unit(s) are at the species level.
  • Local_species (Field only present if a path to a local taxa list is provided): Species binomials of all local species which belong to the taxonomic unit(s) in the Local_taxa field.

Let’s check out the example Local Taxa Tool output.

# Read in Local Taxa Tool output.
LTT_output<-read.csv(file="Sequences_Classified.csv")

# View Local Taxa Tool output.
print(LTT_output)
Sequence_name Sequence Best_match_references Best_match_E_value Best_match_bit_score Best_match_query_cover.mean Best_match_query_cover.SD Best_match_PID.mean Best_match_PID.SD Local_taxa Local_species
Sequence_1 TTAGCCCTAAACCTAGATAGTTAACTCAAACAAAACTATCCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTACATCCCT Acinonyx jubatus, Catopuma badia, Lynx rufus, Puma concolor, Puma yagouaroundi 2.61e-47 187 100 0 100 0 Lynx rufus, Puma concolor Lynx rufus, Puma concolor
Sequence_2 TTAGCCCTAAACATAAATAGTTGTCCAACAAAACAATTCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTATATCCCT Vulpes lagopus, Vulpes zerda 7.12e-43 172 100 0 97.98 0 Vulpes Vulpes macrotis, Vulpes vulpes
Sequence_3 CCTTACGAATCCTGTTTGTCTTCTCCACCCGCGGATATTGCGCCCGGAGTCCCTGTAGTATCGTAACAACTAGGCTCGTGACGTATGTCAAGGTACTGTC No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found


Let’s consider the output records to better understand how the Local Taxa Tool functions.

In record 1, there were best BLAST matches which were local species. The best-matching local species are suggested in the Local_taxa and Local_species fields. Since the best-matching BLAST hits included local species, the Local_taxa and Local_species fields are the same. The standard deviation fields for both query cover and percent identity were zero, suggesting that all best-matching BLAST hits had the same query cover and percent identity. If a standard deviation of NA is observed, this would suggest that there was a single best BLAST match.

In record 2, best-matching BLAST hits belonged to two non-local species. Because these species belong to genus Vulpes, which includes local species, Vulpes appears in the Local_taxa field, and local species belonging to Vulpes are displayed in the Local_species field.

In record 3, there were no “significant” BLAST hits, resulting in “No significant similarity found” in the Local Taxa Tool output for this record.

We also have several options in the local_taxa_tool function to control how taxa names are formatted in the output file. If we want to return full taxonomies in the output file, we can set the full_names argument to TRUE. We can also tell the tool to replace spaces in taxa names in the output file with underscores by setting the underscores argument to TRUE, and we can change the separator between taxa names using the separator argument. In the below example, we return full taxonomies in the output file in which spaces are replaced by underscores, and we change the separator between taxa names.

# Run the Local Taxa Tool.
local_taxa_tool(
  path_to_sequences_to_classify="Sequences.fasta",
  path_to_BLAST_database="BLAST_DB_srRNA.fasta",
  path_to_output_file="Sequences_Classified_Taxonomies.csv",
  path_to_list_of_local_taxa="LocalTaxa_Oregon_Vertebrates.csv",
  blast_e_value=1e-5,
  blast_max_target_seqs=2000,
  blast_task="megablast",
  # Set to TRUE to return full taxonomies in the output file.
  full_names=TRUE,
  # Replace spaces in taxa names in the output file with underscores.
  underscores=TRUE,
  # Set the separator between taxa names in the output file.
  separator=" | ")

# Read in Local Taxa Tool output.
LTT_output_taxonomies<-read.csv(file="Sequences_Classified_Taxonomies.csv")

# View Local Taxa Tool output.
print(LTT_output_taxonomies)
Sequence_name Sequence Best_match_references Best_match_E_value Best_match_bit_score Best_match_query_cover.mean Best_match_query_cover.SD Best_match_PID.mean Best_match_PID.SD Local_taxa Local_species
Sequence_1 TTAGCCCTAAACCTAGATAGTTAACTCAAACAAAACTATCCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTACATCCCT Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Acinonyx;Acinonyx_jubatus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Catopuma;Catopuma_badia | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Lynx;Lynx_rufus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_concolor | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_yagouaroundi 2.61e-47 187 100 0 100 0 Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Lynx;Lynx_rufus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_concolor Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Lynx;Lynx_rufus | Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Puma;Puma_concolor
Sequence_2 TTAGCCCTAAACATAAATAGTTGTCCAACAAAACAATTCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTATATCCCT Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_lagopus | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_zerda 7.12e-43 172 100 0 97.98 0 Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_macrotis | Eukaryota;Chordata;Mammalia;Carnivora;Canidae;Vulpes;Vulpes_vulpes
Sequence_3 CCTTACGAATCCTGTTTGTCTTCTCCACCCGCGGATATTGCGCCCGGAGTCCCTGTAGTATCGTAACAACTAGGCTCGTGACGTATGTCAAGGTACTGTC No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found

Extra: Subsetting Reference Databases to Local Species

What if we wanted to take the traditional approach of subsetting the reference database to the sequences of local species? Then could we get all best BLAST matches of local species? Yes, we have the option of doing this with the Local Taxa Tool!

We can use our prepared local taxa lists to subset reference databases to just local species using an optional argument in the format_reference_database function. Let’s use the same command we used earlier to format the srRNA reference database, but this time we’ll provide our Oregon vertebrate local taxa list to the function to subset the reference database to sequences of local species.

# Format srRNA MIDORI reference database.
# This time, we'll provide a our Oregon vertebrate list to the
# optional path_to_list_of_local_taxa_to_subset argument.
format_reference_database(
  path_to_input_reference_database="MIDORI2_UNIQ_NUC_SP_GB252_srRNA_BLAST.fasta",
  path_to_output_BLAST_database="BLAST_DB_srRNA_OR_Vertebrates.fasta",
  input_reference_database_source="MIDORI",
  path_to_taxonomy_edits="Taxonomy_Edits_NCBI.csv",
  path_to_sequence_edits="Sequence_Edits_Region_srRNA.csv",
  # Path to list of local taxa to subset the reference database to (with .csv extension).
  path_to_list_of_local_taxa_to_subset="LocalTaxa_Oregon_Vertebrates.csv")

Now we can run the Local Taxa Tool using the subsetted reference database without providing a local taxa list. This will cause the tool to return all best BLAST matches without providing local taxa suggestions.

# Run the Local Taxa Tool.
local_taxa_tool(path_to_sequences_to_classify="Sequences.fasta",
                path_to_BLAST_database="BLAST_DB_srRNA_OR_Vertebrates.fasta",
                path_to_output_file="Sequences_Classified_Reference_Subset.csv",
                # Not providing a local taxa list disables local taxa suggestions.
                path_to_list_of_local_taxa=NA,
                blast_e_value=1e-5,
                blast_max_target_seqs=2000,
                blast_task="megablast")

Let’s check out the example Local Taxa Tool output in which the reference database was subsetted to the sequences of local species.

# Read in Local Taxa Tool output.
LTT_output_reference_subset<-read.csv(file="Sequences_Classified_Reference_Subset.csv")

# View Local Taxa Tool output.
print(LTT_output_reference_subset)
Sequence_name Sequence Best_match_references Best_match_E_value Best_match_bit_score Best_match_query_cover.mean Best_match_query_cover.SD Best_match_PID.mean Best_match_PID.SD
Sequence_1 TTAGCCCTAAACCTAGATAGTTAACTCAAACAAAACTATCCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTACATCCCT Lynx rufus, Puma concolor 1.07e-48 187 100 0 100 0
Sequence_2 TTAGCCCTAAACATAAATAGTTGTCCAACAAAACAATTCGCCAGAGAACTACTAGCAACAGCTTAAAACTCAAAGGACTTGGCGGTGCTTTATATCCCT Vulpes vulpes 1.36e-42 167 100 0 96.97 0
Sequence_3 CCTTACGAATCCTGTTTGTCTTCTCCACCCGCGGATATTGCGCCCGGAGTCCCTGTAGTATCGTAACAACTAGGCTCGTGACGTATGTCAAGGTACTGTC No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found No significant similarity found

FAQ

Why does the Local Taxa Tool use bit score to select the best BLAST matches? Why not E-value?:

Selecting matches with the highest bit score is the same as selecting matches with the lowest E-value. If we consider the formula below for calculating E-values, we can see that the matches with the highest bit score will also have the lowest E-value.

\(E\text-value=\frac{query~sequence~length~*~total~database~length}{2^{bit~score}}\)

Because all matches with the same bit score will also have the same E-value, no mean or standard deviation is provided for E-value because all the E-values are the same. By default, BLAST sorts hits by decreasing bit score, which is the same as sorting hits by increasing E-value. The hiccup is that sometimes E-values are so small that BLAST returns them as zero, in which case hits with the same lowest E-value (zero) can have different bit scores, which should not occur. When E-values underflow to zero, BLAST still sorts hits by bit score, which does not experience underflow issues. This makes selecting best BLAST matches by bit score more reliable than selecting best BLAST matches by E-value, and considering hits with the highest bit score to be the best-matching hits follows BLAST’s default behavior.

Why are means and standard deviations presented for query cover and percent identity?:

BLAST hits with the highest bit score can have different query covers and percent identities, which is why the query cover and percent identity for all best-matching BLAST hits are summarized with their means and standard deviations. A standard deviation of zero means that there was no variation in the query cover or percent identity of the best-matching BLAST hits. A standard deviation of NA can occur if there is only a single best-matching BLAST hit, in which case standard deviation is undefined.

How do local taxa suggestions work?:

Query sequences are BLASTed against a global reference database, and the tool suggests locally occurring taxa which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Let’s consider some examples. Imagine a query sequence with best-matching BLAST hits belonging to two species of frog and one species of caecilian: Pelophylax lessonae, Pelophylax ridibundus, and Typhlonectes natans. None of these species are local, so the tool works up the taxonomic levels of both the best-matching reference species and the local species until it finds a shared taxonomic unit. The tool works up to genus and finds no Pelophylax or Typhlonectes species in the local taxa list, so the tool works up to family and checks whether any local species belong to Ranidae or Typhlonectidae. Our local taxa list contains members of Ranidae, so the tool stops working up taxonomic levels and lists Ranidae in the Local_taxa field. Local species belonging to Ranidae are listed in the Local_species field. If our local taxa list included members of both Ranidae and Typhlonectidae (which did not belong to Pelophylax or Typhlonectes), then the tool would list both Ranidae and Typhlonectidae in the Local_taxa field, and local species belonging to both of these families would be listed in the Local_species field.

How does the Local Taxa Tool handle hemihomonyms?:

In order to handle hemihomonyms, the Local Taxa Tool considers the full taxonomy when checking for matches between the names of best-matching BLAST hits and local taxa at any taxonomic level, including species. For example, the species binomial Agathis montana is shared between a wasp and a conifer. The Local Taxa Tool can tell the difference between these two species because their full taxonomies do not match. That is, “Eukaryota;Arthropoda;Insecta;Hymenoptera;Braconidae;Agathis;Agathis_montana” differs from “Eukaryota;Streptophyta;Pinopsida;Araucariales;Araucariaceae;Agathis;Agathis_montana”. Similarly, the genus name Lawsonia is shared by beetles, plants, and bacteria. Again, the Local Taxa Tool can tell the difference between these genera because their taxonomies differ: “Eukaryota;Arthropoda;Insecta;Coleoptera;Anthribidae;Lawsonia” vs “Eukaryota;Streptophyta;Magnoliopsida;Myrtales;Lythraceae;Lawsonia” vs “Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae;Lawsonia”.

Why are local species suggested which are most closely related to any best-matching BLAST hit? Why not suggest local species which are most closely related to all best-matching BLAST hits?:

The Local Taxa Tool was designed to automate what we observed researchers in our lab performing manually. Let’s consider a query sequence with best-matching BLAST hits belonging to the species Vulpes vulpes and Nyctereutes procyonoides. Vulpes vulpes is local while Nyctereutes procyonoides is not. In suggesting local species which are most closely related to any best-matching BLAST hit, the tool will suggest Vulpes vulpes in the Local_taxa and Local_species fields. If the tool suggested local species which were most closely related to all best-matching BLAST hits, then all local members of Canidae would be suggested (which includes both best-matching references Vulpes vulpes and Nyctereutes procyonoides), even if other canid species were present in the reference database and were not best BLAST matches. In such a situation, a researcher would likely prefer Vulpes vulpes to be suggested instead of all local canids.

Should the local taxa suggestions be taken as definitive classifications?:

No. The local taxa suggestions are just that, suggestions. The Local Taxa Tool was made to automate what was previously a manual task, and we encourage users to vet local taxa suggestions for reasonableness within the context of their studies. Local taxa lists which are not comprehensive or which use a different taxonomy system than the reference database may cause the tool to fail to suggest local species. Users should also consider the query cover and percent identity statistics to determine the quality of the BLAST hits. Online BLAST searches (https://blast.ncbi.nlm.nih.gov/Blast.cgi) are also a good way for users to explore interesting sequences.

How can I report bugs?:

Please report bugs to https://github.com/Urodelan/LocaTT/issues

How should I cite the Local Taxa Tool?:

# Display citation for the Local Taxa Tool (LocaTT).
citation("LocaTT")
## 
## To cite the Local Taxa Tool (LocaTT) in publications use:
## 
##   Kenen B Goodwin and Taal Levi (2023). Local Taxa Tool (LocaTT):
##   Geographically-conscious taxonomic assignment for DNA metabarcoding.
##   R package version 1.1.1, https://github.com/Urodelan/LocaTT.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {Local Taxa Tool (LocaTT): Geographically-conscious taxonomic assignment for DNA metabarcoding},
##     author = {Kenen B Goodwin and Taal Levi},
##     year = {2023},
##     note = {R package version 1.1.1},
##     url = {https://github.com/Urodelan/LocaTT},
##   }

Tutorial written and maintained by Kenen Goodwin ().
Tutorial license: CC BY-SA 4.0