Comparative genomics of microsatellite abundance: a critical analysis of methods and definitions
This PhD dissertation focuses on microsatellites: short, tandemly repeated nucleotide patterns that occur extremely often across DNA sequences. The main characteristic of microsatellites, and probably the reason why they are so abundant across genomes, is the extremely high frequency of specific replication errors occurring within their sequences, which usually cause the addition or deletion of one or more complete tandem repeat units. Due to these errors, frequent fluctuations in the number of repeat units can be observed among cellular and organismal generations. The molecular mechanisms and the consequences of these microsatellite mutations, on both generational and evolutionary scales, have sparked debate and controversy within the scientific community. Furthermore, the bioinformatic approaches used to study microsatellites and the ways microsatellites are referred to in the general literature are often not rigorous, leading to misinterpretations and inconsistencies among studies. As an introduction to this complex topic, in Chapter I, I present a review of the knowledge accumulated on microsatellites during the past two decades. A major part of this chapter has been published in the Encyclopedia of Life Sciences in a chapter on microsatellite evolution (see Publication 1 in Appendix II). The ongoing controversy about the rates and patterns of microsatellite mutation was evident to me even before I started this PhD thesis. However, the subtler problems inherent to the computational analysis of microsatellites within genomes only became apparent when I retrieved information on microsatellite distribution and abundance for the design of comparative genomic analyses.
Numerous publications analyze the microsatellite content of genomes but, in most cases, the results presented can be neither reliably compared nor reproduced. This is mainly due to the lack of detail on the microsatellite search process (particularly the program's algorithm and the search parameters used) and because the results are expressed in terms that are relative to the search process (i.e. measures based on the absolute number of microsatellites). Therefore, in Chapter II, I present a critical review of all available software tools designed to scan DNA sequences for microsatellites. My aim in undertaking this review was to assess the comparability of search results among microsatellite programs, and to identify the programs most suitable for generating microsatellite datasets for a thorough and reproducible comparative analysis of microsatellite content among genomic sequences. Using sequence data in which the number and types of microsatellites were empirically known, I compared the ability of 19 programs to accurately identify and report microsatellites. I then chose the two programs which, based on the algorithm and its parameters as well as the informativeness of the output, offered the information most suitable for biological interpretation, while also reflecting as closely as possible the microsatellite content of the test files. From the analysis of the microsatellite search results generated by the various programs available, it became apparent that the program's search parameters, which the user specifies in order to define the microsatellite characteristics to the program, dramatically influence the resulting datasets. This is especially true for programs designed to allow imperfections within tandem repeats, because imperfect repeats cannot be defined as accurately as perfect ones, and because several different algorithms have been proposed to address this problem.
The detection of approximate microsatellites is, however, essential for the study of microsatellite evolution and for comparative analyses based on microsatellites. It is now well accepted that small deviations from a perfect tandem repeat structure are common within microsatellites and larger repeats, and a number of different algorithms have been developed to confront the challenge of finding and registering microsatellites with all expected kinds of imperfection. However, biologists have yet to apply these tools to their full potential. In biological analyses, single tandem repeat hits are consistently interpreted as isolated and independent repeats. This interpretation also depends on the search strategy used to report the microsatellites in DNA sequences and, therefore, I was particularly interested in the capacity of repeat-finding programs to report imperfect microsatellites in a way that allows biologically useful interpretations. After analyzing a series of tandem repeat-finding programs, I optimized my microsatellite searches to yield the best possible datasets for assessing and comparing the degree of imperfection of microsatellites among different genomes (Chapter III).
The program comparisons performed in Chapter II show that the most critical search parameter influencing microsatellite search results is the minimum length threshold. Biologically speaking, there is no consensus with respect to the minimum length beyond which a short tandem repeat is expected to become prone to microsatellite-like mutations. Usually, a single absolute value of ~12 nucleotides is assigned irrespective of motif length. In other cases, thresholds are assigned in terms of the number of repeat units (i.e. 3 to 5 repeats or more), which are better applied individually for each motif. The variation in these thresholds is considerable and not always justifiable.
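The effect of the threshold definition can be illustrated with a minimal Python sketch. It is not one of the programs reviewed in Chapter II: the function name `find_perfect_microsatellites`, its parameters and the toy sequence are assumptions made purely for illustration, and only perfect repeats with motifs of 1 to 6 nucleotides are considered. The same sequence is scanned once with an absolute threshold in nucleotides and once with a threshold in repeat units.

```python
import re


def find_perfect_microsatellites(seq, max_motif=6, min_length=12, min_repeats=None):
    """Report perfect tandem repeats as (start, end, motif, n_units) tuples.

    Illustrative sketch only. If min_repeats is given, it is used as the
    threshold for every motif length; otherwise min_length (in nucleotides)
    is converted to the smallest number of whole units that reaches it.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        if min_repeats is not None:
            need = min_repeats
        else:
            need = -(-min_length // k)  # ceiling division
        need = max(need, 2)  # a tandem repeat needs at least two units
        # One captured unit followed by at least (need - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, need - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "ACAC"), since the shorter motif already reports them.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


# Toy sequence: a 12-nt (AC)6 repeat and a 9-nt (AAT)3 repeat.
toy = "GT" + "AC" * 6 + "TCG" + "AAT" * 3 + "GC"
by_nucleotides = find_perfect_microsatellites(toy, min_length=12)
by_units = find_perfect_microsatellites(toy, min_repeats=3)
print(len(by_nucleotides), len(by_units))
```

On this toy sequence the 12-nucleotide threshold reports only the (AC)6 repeat, whereas the three-unit threshold also reports (AAT)3, so the absolute number of microsatellites differs although the sequence is identical. A real search tool would additionally have to handle imperfect repeats and overlapping hits, which this sketch ignores.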
In addition, current minimum length measures are likely naïve because it is clear that different microsatellite motifs undergo replication slippage at different length thresholds. Therefore, in Chapter III, I apply two probabilistic models to predict the minimum length at which microsatellites of varying motif types become overrepresented in different genomes, based on the individual oligonucleotide frequency data of these genomes. Finally, after a range of optimizations and critical analyses, I performed a preliminary analysis of microsatellite abundance among 24 high-quality complete eukaryotic genomes, with 8 prokaryotic and 5 archaeal genomes included for contrast. The availability of the methodologies and the microsatellite datasets generated in this project will allow the informed formulation of questions for more specific genome research, either about microsatellites or about other genomic features that microsatellites could influence. These datasets are what I would have needed at the beginning of my PhD to support my experimental design, and they are essential for the adequate interpretation of microsatellite data in the context of the major evolutionary units: chromosomes and genomes.