MUMmer

From Wikipedia, the free encyclopedia

MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure and is one of the fastest and most efficient systems available for this task, enabling it to be applied to very long sequences. It has been widely used for comparing different genomes to one another. In recent years it has become a popular algorithm for comparing genome assemblies to one another, which allows scientists to determine how a genome has changed after adding more DNA sequence or after running a different genome assembly program. The acronym "MUMmer" comes from "Maximal Unique Matches", or MUMs. The original algorithms in the MUMMER software package were designed by Art Delcher, Simon Kasif and Steven Salzberg. Mummer was the first whole genome comparison system developed in Bioinformatics. It was originally applied to comparison of two related strains of bacteria.

The MUMmer software is open source and can be found at the MUMmer home page. The home page also has links to technical papers describing the system. The system is maintained primarily by Steven Salzberg and Arthur Delcher at Center for Computational Biology at Johns Hopkins University.

MUMmer is a highly cited bioinformatics system in the scientific literature. According to Google Scholar, as of early 2013 the original MUMmer paper (Delcher et al., 1999)[1] has been cited 691 times; the MUMmer 2 paper (Delcher et al., 2002)[2] has been cited 455 times; and the MUMmer 3.0 article (Kurtz et al., 2004)[3] has been cited 903 times.

Overview[]

What is MUMmer?[]

Have you wonder, how are you related to a monkey? If the answer is yes, I can tell you that there are algorithms in which you can find that answer. If you had algorithms classes “Edit distance” (by Wunch and Needleman) might come to mind; which can be used for this same topic. However, the “Edit distance” algorithm is used to compare small sequences, it would take a long time to compute genome alignments.

Genomes are a large sequence of genetic instructions/information about an organism (in a chromosome); now, imagine “comparing a 4 Mb sequence such as M. Tuberculosis to another 4 Mb sequence, many algorithms either runs out of memory or takes too long to complete” (Arthur et al., 1999).[4] Back then 4Mb of memory was a huge deal; so, they had to do something about that, like creating this type of algorithms. Researchers have been creating algorithms to compare genome sequences. Mummer is a fast algorithm mostly used for the rapid alignment of entire genomes.

MUMmer, as mentioned before, is the system/process for efficient alignment of big-size sequences, this algorithm has been used to make discoveries about genome structure. The algorithm is really well designed and it is capable of comparing alignments within seconds. Since this algorithm is relatively new, this algorithm has four versions, one better than the previous one.

Versions of MUMmers[]

MUMmer1[]

MUMmer1 or just MUMmer consists of three parts, the first part consists of the creation of suffix trees (to get MUMs), the second part in the longest increasing subsequence or longest common subsequences (to order MUMs), lastly any alignment to close gaps.

To start the process, as a reminder, we need to get two Genomes as input. Once we receive them, we are going to search for the maximal unique matches (MUMs). MUMs’ algorithm has its own logic since it can be done in several ways. For instance, the naïve algorithm goes through all the sequences from one genome and compares it with the other sequences, which takes O(m*m2) which m and m2 are the two genomes. However, since MUMmer has to be fast, it uses the data structure called suffix tree (Suffix tree) which takes O(m+m2). From this tree, we extract the MUMs (which are the subsequences that are represented by one internal node by both sequences).

Once we identify the MUMs, we need to choose a genome and sort (and possibly remove) the MUMs based on the position of the other genome. This step can be done easily by choosing a genome and enumerate the nodes (MUMs) in ascending order, and the other one we enumerate each node according to the node from the first sequence. In other words, we enumerate a pair of MUMs (with the exact match from both genomes) with the same value regardless of the position. Then we sort. To make this happen, we can choose any of the methods available. For example, this can be done with Long Common Subsequence (Longest common subsequence problem) or Long Increasing subsequence(LIS). At the end of this process, the MUMs on both sequences are aligned/ordered equally (some MUMs are ignored for now, since we cannot have a value enumerated with a big number before one enumerated with a small number). The best time complexity for this step is O(nlogn).

Finally, we have to deal with the interruption between MUMs-alignment, which may be known as gaps. We employ other alignment algorithms that can fill these gaps. Arthur and his team (Arthur et al., 1999),[4] mention that the gaps fall in the following four classes:

  • An SNPinterruption – when comparing two sequences, one character will differ. In other words, characters before and after that character will be equal.
  • An insertion – when comparing two sequences, there is a subsequence in which only appears in one of the sequences. It would be an empty gap in the other sequence at the moment of comparison of the two sequences.
  • A highly polymorphic region – when comparing two sequences, there can be found a subsequence in which every single character differs.
  • A repeat – it’s the repetition of a sequence. Since MUMs can only take unique sequences, that gap can be one repetition of one of the MUMs.

MUMmer 2[]

This algorithm was redesigned to require less memory, the process runs faster and is more accurate than the first one; It also allows for bigger genomes alignment.

The improvement was the amount stored in the suffix trees by employing the one created by Kurtz. For this tree, the insertion is different since it is just a plugin of just one sequence, the other one is more to compare (is like adding it, but we are not, we are comparing the characters within the tree). It reduces space complexity since it is storing one sequence.

Finally, the first MUMmer algorithm only aligned the two sequences and jumped to another step. However, to achieve better coverage, the MUMs sequences are becoming clusters (meaning that is a group of MUMs that are separated with a smaller number of gaps).

MUMmer 3[]

According to Stefan Kurtz and his teammates, “the most significant technical improvement in MUMmer 3.0, is a complete rewrite of the suffix-tree code, based on the compact suffix- tree representation of” [5] the tree described in the article “Reducing the space requirement of suffix trees”.[6]

Another improvement was the relaxation of the MUMs. This means that now the user has the option of finding all non-unique maximal matches, all the matches that are unique only in a chosen sequence, or original MUMs (which are unique for both sequences). This was added to avoid some kind of missing subsequences that were copies of MUMs.

MUMmer 4[]

According to Guillaume and his team, there are some extra improvements in the implementation and also innovation with Query parallelism. The first, and biggest, improvement is the raise of the size limit for the sequence is being tested. Finally, “MUMmer4 now includes options to save and load the suffix array for a given reference”(Marcais et al., 2018).[7] Thanks to this, the suffix tree can be built once and constructed again after running it from the saved suffix tree.

Software - Open Source[]

MUMmer has open-source software. The best way to begin playing with this package is to go to it website is given below.

This page talks about the open-source, requirements and steps to the installation, the process to run the package, and some examples of running sequences. There is more information on this website. If you get stuck, the website has information on who to contact in those cases. This page has valuable information to work with this algorithm.

Related Topics[]

There are other types of sequence alignments, some of the related topics are shown below:

  • Edit distance
  • BLAST
  • Bowtie
  • BWA
  • Blat
  • Mauve
  • LASTZ
  • BLAST

References[]

  1. ^ Delcher, A. L.; Kasif, S.; Fleischmann, R. D.; Peterson, J.; White, O.; Salzberg, S. L. (1999). "Alignment of whole genomes". Nucleic Acids Research. 27 (11): 2369–2376. doi:10.1093/nar/27.11.2369. PMC 148804. PMID 10325427.
  2. ^ Delcher, A. L.; Phillippy, A.; Carlton, J.; Salzberg, S. L. (2002). "Fast algorithms for large-scale genome alignment and comparison". Nucleic Acids Research. 30 (11): 2478–2483. doi:10.1093/nar/30.11.2478. PMC 117189. PMID 12034836.
  3. ^ Delcher, A.; Harmon, D.; Kasif, S.; White, O.; Salzberg, S. (1999). "Improved microbial gene identification with GLIMMER". Nucleic Acids Research. 27 (23): 4636–4641. doi:10.1093/nar/27.23.4636. PMC 148753. PMID 10556321.
  4. ^ a b Delcher, A.; Kasif, S.; Fleischmann, R.; Peterson, J.; White, O.; Salzberg, S. (1999). "Alignment of Whole Genomes". Nucleic Acids Research. 27 (11): 2369–2376. doi:10.1093/nar/27.23.4636. PMC 148804. PMID 10325427.
  5. ^ Kurtz, S.; Phillippy, A.; Delcher, A.; Smoot, M.; Shumway, M.; Antonescu, C.; Salzberg, S. (2004). "Versatile and open software for comparing large genomes" (PDF). Genome Biology. 5 (2): R12. doi:10.1186/gb-2004-5-2-r12. PMC 395750. PMID 14759262.
  6. ^ Kurtz, S. (1999). "Reducing the Space Requirement of Suffix Trees". Software: Practice and Experience. 29 (13): 1149–1171. doi:10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O.
  7. ^ Marçais, Guillaume.; Pillippy, A.; Delcher, A.; Coston, R.; Salzberg, S.; Zimin, A. (2018). "MUMmer4: A fast and versatile genome alignment system". PLOS Computational Biology. 14 (1): e1005944. Bibcode:2018PLSCB..14E5944M. doi:10.1371/journal.pcbi.1005944. PMC 5802927. PMID 29373581.

External links[]

Retrieved from ""