Spaced seed


In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence, and also a method of approximate string matching that allows for substitutions. Spaced seeds are a straightforward modification to the earliest heuristic-based alignment efforts that allows for minor differences between the sequences of interest. They have been used in homology search, alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant positions and dashes or asterisks for irrelevant positions.

Principle

Due to a number of functional and evolutionary constraints, nucleic acid sequences between individuals tend to be highly conserved, with the typical difference between two human genomes estimated to be on the order of 0.6%. Identification of highly similar regions in the genome may indicate functional importance, as mutations in these areas that would result in cessation of function or loss of regulatory ability would be evolutionarily unfavorable. Additional observed differences between two sequences may arise as a result of stochastic sequencing errors. Similarly, when performing assembly of a previously characterized genome, an attempt is made to align the newly sequenced DNA fragments to the existing genome sequence.
In both cases, it is useful to be able to directly compare nucleic acid sequences. Since the sequences are not expected to be exactly identical, however, it is beneficial to focus on smaller subsequences that are more likely to be locally identical. Spaced seeds make local matching even more permissive by letting certain positions mismatch without penalty, so that algorithms using the general "hit-extend" strategy of alignment can explore additional potential matches that would otherwise be ignored.

Example

As a simple example, consider the following DNA sequences:

CTAAGTCACG

CTAACACACG

1111001111


Upon visual inspection, it is easy to see that the two sequences mismatch at the fifth and sixth base positions. However, the sequences still share 80% sequence similarity. The mismatches may be due to a real change or a sequencing error. In a non-spaced model, this putative match would be ignored if a seed size greater than 4 is specified. A spaced seed of 1111001111, however, could be used to effectively zero-weight the mismatch sites, treating the sequences as the same for the purposes of hit identification. In reality, of course, we don't know the relative positioning of the "true" mismatches, so there can be different spaced seed patterns depending on where the mismatches are anticipated.
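The comparison in this example can be sketched in a few lines of Python. This is only an illustration; the function name seed_matches is a hypothetical helper and is not taken from any published tool.

```python
def seed_matches(seed: str, a: str, b: str) -> bool:
    """Return True if a and b agree at every position where seed has a '1'.

    Positions marked '0' are irrelevant and are never compared.
    """
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and both sequences must have the same length")
    return all(x == y for s, x, y in zip(seed, a, b) if s == "1")


seq1 = "CTAAGTCACG"
seq2 = "CTAACACACG"

# Spaced seed from the example: the fifth and sixth positions are ignored.
spaced_hit = seed_matches("1111001111", seq1, seq2)

# A contiguous seed of the same length requires all ten positions to match.
contiguous_hit = seed_matches("1111111111", seq1, seq2)

print(spaced_hit, contiguous_hit)
```

Under the spaced seed the two sequences count as a hit, while the contiguous seed of the same length rejects them.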

History

The concept of spaced seeds has been explored in the literature under different names. One of the early uses was in sequence homology, where the FLASH algorithm from 1993 referred to it as "non-contiguous sub-sequences of tokens" that were generated from all combinations of positions within a sequence window. In 1995, a similar concept was used in approximate string matching, where "gapped tuples" of positions in a sequence were explored to identify common substrings between a large text and a query. The term "shape" was used in a 2001 paper on gapped q-grams to refer to a set of relevant positions in a substring. Soon after, in 2002, PatternHunter introduced the "spaced model", which was proposed as an improvement upon the consecutive seeds used in BLAST and was ultimately adopted by newer versions of BLAST. Finally, in 2003, PatternHunter II settled on the term "spaced seed" to refer to the approach used in PatternHunter.

Spaced versus Non-Spaced Models

Popular alignment algorithms such as BLAST and MegaBLAST use a non-spaced model, where the entire length of the seed is made of exact matches. Thus, any mismatching base pair along the length of the seed will result in the program ignoring the potential hit. In a spaced model, the matches are not necessarily consecutive.
More formally, the difference between a spaced seed model and a non-spaced model is the relative positioning of the matched bases. In a non-spaced model, the length and the weight of the seed are the same, as the match positions must be consecutive. In a spaced model, the weight is not necessarily equal to the length of the seed model, since match positions may be non-consecutive. Therefore, a spaced seed model may be longer than a non-spaced seed model but have the same weight. For example, a non-spaced seed has the same weight as a spaced model, but their lengths differ.
The predicted number of hits can be calculated using the following lemma from PatternHunter:
$(L - M + 1) \times p^{W}$
where $L$ is the length of the sequence the model is compared to, $M$ is the length of the seed model, $p$ is the probability of a match and $W$ is the weight of the seed used.
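As a worked illustration of length, weight and the predicted number of hits, the sketch below assumes the lemma has the form (L - M + 1) × p^W, with L the length of the compared sequence, M the seed length, p the match probability and W the seed weight. The two seeds, the region length of 64 and the match probability of 0.7 are illustrative choices only, and the function names are hypothetical.

```python
def seed_length(seed: str) -> int:
    """Length M of the seed model: total number of positions."""
    return len(seed)


def seed_weight(seed: str) -> int:
    """Weight W of the seed model: number of relevant ('1') positions."""
    return seed.count("1")


def expected_hits(region_length: int, seed: str, p: float) -> float:
    """Predicted number of hits, assuming the form (L - M + 1) * p**W."""
    m = seed_length(seed)
    w = seed_weight(seed)
    if region_length < m:
        return 0.0  # the seed does not fit into the region at all
    return (region_length - m + 1) * p ** w


# Illustrative seeds (assumptions): same weight, different length.
contiguous = "11111111"
spaced = "1111001111"

for seed in (contiguous, spaced):
    print(seed, seed_length(seed), seed_weight(seed),
          expected_hits(64, seed, 0.7))
```

Both seeds have weight 8, so they differ in the predicted number of hits only through the (L - M + 1) factor, which is slightly smaller for the longer spaced seed.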

Applications

Homology Search

The type of seed model used for sequence alignment can affect the processing time and memory usage of large-scale homology searches, two considerations that have been central in the development of modern homology search algorithms. It may also affect the sensitivity of the search. Spaced seed models have been demonstrated to allow for faster homology searches, as seen with PatternHunter, where homology searches were twenty times faster and used less memory than BLASTn.

Sequence Alignment

Most aligners first find candidate locations in the target sequence and then inspect those more closely to verify the alignment. Ideally, this first step would find all relevant locations in the target, so sensitivity is prioritized. Because an exhaustive search is computationally intensive, however, many popular algorithms use heuristics to "short-cut" exploring all locations, ultimately missing many but running relatively quickly. One possible way to increase sensitivity, as done in the SHRiMP2 algorithm, is to use spaced seeds to allow for small differences between the query and the candidate locations, so that somewhat more locations are identified as candidates. SHRiMP2 specifically uses multiple spaced seeds for this and requires multiple matches, which increases sensitivity by allowing different possible combinations of differences while maintaining speed comparable to the original methods.
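The candidate-finding step can be sketched as an index lookup on masked windows. This is a minimal illustration of the general idea with a single seed, not SHRiMP2's actual implementation; the function names and the toy target and query sequences are assumptions made for the example.

```python
from collections import defaultdict


def masked_key(window: str, seed: str) -> str:
    """Keep only the characters at positions where the seed has a '1'."""
    return "".join(c for c, s in zip(window, seed) if s == "1")


def build_index(target: str, seed: str) -> dict:
    """Map each masked window of the target to its start positions."""
    index = defaultdict(list)
    n = len(seed)
    for i in range(len(target) - n + 1):
        index[masked_key(target[i:i + n], seed)].append(i)
    return index


def candidate_locations(query: str, target: str, seed: str) -> list:
    """Target offsets at which the query start would lie for each seed hit."""
    index = build_index(target, seed)
    n = len(seed)
    candidates = set()
    for j in range(len(query) - n + 1):
        key = masked_key(query[j:j + n], seed)
        for i in index.get(key, []):
            candidates.add(i - j)  # implied start of the query in the target
    return sorted(candidates)


# Toy data (assumption): the target embeds a two-mismatch copy of the query.
target = "GGGCTAAGTCACGTTT"
query = "CTAACACACG"

print(candidate_locations(query, target, "1111001111"))
print(candidate_locations(query, target, "11111"))
```

In this toy case the spaced seed reports a candidate location even though the query differs from the target at two positions, whereas the contiguous seed of length 5 finds none. A real aligner would then verify each candidate with a full alignment.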

Sequence Assembly

A variation of spaced seeds with a single contiguous gap has been used in de novo sequence assembly. In this instance, the design has an equal number of ones at either end of the sequence with a run of zeroes in between. The reasoning behind this design is that in assemblers that utilize De Bruijn graphs, increasing k-mer size inflates memory usage, as k-mers are more likely to be unique. However, the most important parts of a k-mer are its ends, since these are used to extend sequences in a graph. Thus, to circumvent the problem with memory usage, the less important middle part is ignored. This approach has the additional advantage, as in other uses of spaced seeds, of tolerating any sequencing errors that may have occurred in the gap area. It has been noted that increasing the length of the gap also increases the uniqueness of k-mers in both E. coli and H. sapiens genomes.
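The single-gap design can be illustrated with a short sketch that builds such a seed and extracts the two flanks of every window in a sequence. The function names and the parameters flank and gap are hypothetical, and no particular assembler is being reproduced here.

```python
def gapped_seed(flank: int, gap: int) -> str:
    """Seed with `flank` ones at each end and `gap` zeroes in between."""
    if flank < 1 or gap < 0:
        raise ValueError("flank must be >= 1 and gap must be >= 0")
    return "1" * flank + "0" * gap + "1" * flank


def gapped_kmers(sequence: str, flank: int, gap: int) -> list:
    """Return (left flank, right flank) for every window of the sequence.

    The middle `gap` bases of each window are ignored entirely.
    """
    span = len(gapped_seed(flank, gap))
    pairs = []
    for i in range(len(sequence) - span + 1):
        window = sequence[i:i + span]
        pairs.append((window[:flank], window[-flank:]))
    return pairs


print(gapped_seed(3, 4))
print(gapped_kmers("CTAAGTCACG", 4, 2))
print(gapped_kmers("CTAACACACG", 4, 2))
```

The two sequences from the earlier example differ only inside the gap, so they yield the same pair of flanks and would be treated as the same k-mer under this seed.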

Metagenomics

A metagenomics study will commonly start with the high-throughput sequencing of a mixture of distinct species, yielding a set of sequences of unknown origin. As such, one common goal is to identify which genome each sequence is phylogenetically most similar to. One approach is to take k-mers from each sequence and determine which genome, out of a set of genomes, they share the most sequence similarity with. Spaced seeds have been successfully utilized for this purpose by finding how many k-mers are found in each genome and the total number of positions these k-mers cover.
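The two quantities mentioned above, the number of matching k-mers and the number of positions they cover, can be sketched as follows. The scoring and tie-breaking shown here, the choice to count only the relevant ('1') positions as covered, and the toy genomes and read are all assumptions made for illustration, not the method of any specific classifier.

```python
def masked_key(window: str, seed: str) -> str:
    """Keep only the characters at positions where the seed has a '1'."""
    return "".join(c for c, s in zip(window, seed) if s == "1")


def spaced_kmer_set(genome: str, seed: str) -> set:
    """All masked windows (spaced k-mers) occurring in a genome."""
    n = len(seed)
    return {masked_key(genome[i:i + n], seed)
            for i in range(len(genome) - n + 1)}


def score_read(read: str, genome_kmers: set, seed: str) -> tuple:
    """Return (number of matching spaced k-mers, read positions covered)."""
    n = len(seed)
    hits = 0
    covered = set()
    for j in range(len(read) - n + 1):
        if masked_key(read[j:j + n], seed) in genome_kmers:
            hits += 1
            # Assumption: only the relevant positions count as covered.
            covered.update(j + k for k, s in enumerate(seed) if s == "1")
    return hits, len(covered)


def classify(read: str, genomes: dict, seed: str) -> tuple:
    """Pick the genome with the highest (hits, covered positions) score."""
    scores = {name: score_read(read, spaced_kmer_set(g, seed), seed)
              for name, g in genomes.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores


# Toy data (assumption): two tiny "genomes" and one read.
genomes = {"genome_A": "ACGTACGTTGCA", "genome_B": "TTTTGGGGCCCC"}
best, scores = classify("ACGAACGT", genomes, "1101")
print(best, scores)
```

Scores are compared as tuples, so the number of matching k-mers takes priority and the number of covered positions breaks ties.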

Multiple Spaced Seeds

An improvement to spaced seeds was first demonstrated by PatternHunter in 2002, where a set of spaced seeds was used and a "hit" was called whenever one of the set matched. PatternHunter II, in 2003, demonstrated that this approach could offer higher sensitivity than BLAST while maintaining similar speed. Identifying an optimal set of spaced seeds is an NP-hard problem, and even finding a "good" set of spaced seeds remains difficult, although several attempts have been made to computationally identify them. Since the speed of the algorithm must decrease with an increasing number of spaced seeds, it makes sense to only consider multiple seeds when all offer some useful contribution. There is ongoing research on how to quickly calculate good multiple spaced seeds, as previous homology search software calculated and hard-coded its seeds; it would be advantageous to be able to calculate purpose-driven multiple spaced seeds instead.
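The rule that a hit is called whenever any seed in the set matches can be sketched as below. The two seeds are arbitrary illustrative choices, not an optimized set, and the function names and the second sequence pair are hypothetical.

```python
def seed_hits(seed: str, a: str, b: str) -> list:
    """Offsets at which a and b agree on all '1' positions of the seed.

    The sequences are compared position by position without gaps.
    """
    ones = [k for k, s in enumerate(seed) if s == "1"]
    limit = min(len(a), len(b)) - len(seed) + 1
    return [i for i in range(max(limit, 0))
            if all(a[i + k] == b[i + k] for k in ones)]


def multi_seed_hit(seeds: list, a: str, b: str) -> bool:
    """A hit is called as soon as any seed in the set matches somewhere."""
    return any(seed_hits(seed, a, b) for seed in seeds)


# Illustrative seed set (assumption, not an optimized design).
seeds = ["1110111", "1111001111"]

# Pair with two adjacent mismatches (fifth and sixth positions).
pair_1 = ("CTAAGTCACG", "CTAACACACG")
# Pair with a single mismatch at the fourth position.
pair_2 = ("ACGTACGTAC", "ACGAACGTAC")

for a, b in (pair_1, pair_2):
    per_seed = {seed: seed_hits(seed, a, b) for seed in seeds}
    print(per_seed, multi_seed_hit(seeds, a, b))
```

In this toy case each pair is caught by a different seed, so the set detects both pairs while either seed alone would miss one of them.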