The distribution of human endogenous retrovirus solo-LTRs in an internationally diverse cohort of students
Exploring the potential for a novel genetic fingerprinting method
Sebastian Carvello
December 2024
Imperial College London
Abstract
Microsatellites are the prevailing target of modern fingerprinting methods due to their high polymorphism and widespread coverage. However, as evidence mounts of human endogenous retroviruses (HERVs) being more recently active than originally thought, their potential utility in novel fingerprinting methods becomes increasingly plausible. Here, we examine whether five members of the HERV-K family exhibit acceptable distributions across global haplotype sub-populations to make their use in fingerprinting a viable target for future research.

Introduction
Transposable elements (TEs) constitute around 45% of the human genome (van de Lagemaat et al. 2003). Human endogenous retroviruses (HERVs) are the remains of exogenous retroviruses that integrated into the germline of an ancestral host and have since been vertically transmitted to successive generations (Nelson et al. 2003); they represent a fraction of that share, accounting for approximately 5-8% of the genome overall (Lander et al. 2001). One family, HERV-K, shows evidence of relatively recent activity, having replicated since the divergence of humans and chimpanzees (Belshaw and Tristem 2009). Over time, however, nonsense and frameshift mutations have accumulated, resulting in a loss of replication competence in almost all HERVs (Belshaw et al. 2004). In addition, recombination between the two long terminal repeats (LTRs) that flank an integrated virus can delete the internal viral coding region, leaving behind just a solo-LTR (Stoye 2001).
Traditionally, human genetic fingerprinting has taken advantage of microsatellites, which benefit from being highly polymorphic (Nwawuba Stanley et al. 2020) and widely distributed across the genome (Subramanian et al. 2003). However, analysis of the human genome has revealed thousands of HERV sequences (Vargiu et al. 2016) which, should they too exhibit high polymorphism and widespread genome distribution, could represent a promising future alternative for fingerprinting. In this study, the suitability of HERV solo-LTRs as genetic markers for profiling was investigated by examining the distribution of five such loci across ancestral geographic haplotypes in a sample of 115 students at Imperial College London.
Methods
A simple swab kit was used to collect cheek cell samples from students. The samples were submerged in 70% ethanol and centrifuged for 30s, after which the ethanol was pipetted out and the pellet left to dry. Cells were resuspended in 200µL distilled water and vortexed to disperse the pellet, then placed in a boiling water bath for 10mins before being briefly placed on ice. The sample was then centrifuged for 1min and the supernatant transferred to a new tube. At this stage, DNA concentration was estimated using a NanoDrop spectrophotometer. The PCR mixture was prepared as follows: 10µL Promega 5X PCR reaction buffer, 5µL primer mix, 5µL 2mM nucleotide mix, 0.25µL Promega 5U/mL Taq polymerase, and 200ng DNA sample, made up to 50µL with distilled water if necessary. Five different primer pairs were used in total, each at a final concentration of 1µM; some targeted Neanderthal HERV solo-LTRs and others Denisovan HERV solo-LTRs:
Primer pair 1
2024NE1FOR 5’ TCCCCTGTGTAGCTATTGTCT 3’
2024NE1REV 5’ CCACCTGTTTTCCTACCAATG 3’
Primer pair 2
2024DE5FOR 5’ GTAAATGATGAGTTGATGGGTGC 3’
2024DE5REV 5’ GAGGTGGGGTATTTAAGAGGTG 3’
Primer pair 3
2024DE7FOR 5’ CTGCTGACACCTTGATCTTG 3’
2024DE7REV 5’ CTTAGATCCCATCTCCTGTCTG 3’
Primer pair 4
2024NE2FOR 5’ GTCTCTTCTTCTGAATCAAG 3’
2024NE2REV 5’ GAGACCCCATCTATTACAAA 3’
Primer pair 5
2024DE2FOR 5’ GAGAAAACGTGACAGTGAAC 3’
2024DE2REV 5’ GGGCTATCTGTTCTCTAGGT 3’
NE=Neanderthal virus primers
DE=Denisovan virus primers
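The reaction volumes given above can be checked with a short calculation. The following is a minimal sketch only: the function names and the example NanoDrop reading of 40 ng/µL are illustrative assumptions, not values from this study.

```python
# Sketch of the 50 µL reaction set-up described in the Methods.
# Function names and the example DNA concentration are hypothetical.

TOTAL_UL = 50.0   # final reaction volume
DNA_NG = 200.0    # mass of template DNA per reaction

# Fixed-volume components listed in the Methods (µL)
FIXED_UL = {
    "5X PCR reaction buffer": 10.0,
    "primer mix": 5.0,
    "2 mM nucleotide mix": 5.0,
    "Taq polymerase": 0.25,
}


def reaction_volumes(dna_ng_per_ul):
    """Return (DNA volume, water volume) in µL for one 50 µL reaction."""
    dna_ul = DNA_NG / dna_ng_per_ul
    water_ul = TOTAL_UL - sum(FIXED_UL.values()) - dna_ul
    if water_ul < 0:
        # The sample is too dilute for 200 ng to fit in the reaction
        raise ValueError("DNA sample too dilute for a 50 µL reaction")
    return dna_ul, water_ul


def final_concentration(stock, added_ul, total_ul=TOTAL_UL):
    """Dilution: final concentration in the same units as the stock."""
    return stock * added_ul / total_ul


# Hypothetical NanoDrop reading of 40 ng/µL
dna_ul, water_ul = reaction_volumes(40.0)
print(f"DNA: {dna_ul} µL, water: {water_ul} µL")
print("Buffer (X):", final_concentration(5.0, 10.0))
print("Nucleotides (mM):", final_concentration(2.0, 5.0))
```

Under these figures the buffer is diluted to 1X and the nucleotide mix to 0.2mM. By the same arithmetic, a final primer concentration of 1µM from 5µL of primer mix would imply a 10µM stock, although the stock concentration is not stated in the protocol.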
Primer pair 3 was used for the negative control. PCR was performed using a SimpliAmp thermal cycler run at the following parameters: 35 cycles of denaturation (95°C for 20s), annealing (57°C for 30s) and polymerisation (72°C for 120s), then 72°C for 10 minutes, followed by a 4°C soak. The PCR products were run at 100V for 45 minutes on a 10cm 1.4% agarose gel containing Invitrogen SYBR Gold in 1xTAE buffer, with a 1Kb Plus DNA ladder (50ng/mL) for reference. The final gels were photographed using an imager to visualise the bands, and the data for all students were pooled.
Results
Gel bands were identified according to the expected sizes of the pre-integration site (PRE) and solo-LTR amplified PCR products, as calculated by running in silico PCRs for all five primer pairs (Table 1), allowing each individual to be categorised as either homozygous PRE, heterozygous, or homozygous solo-LTR with respect to the five HERV primers used (Table 2).
Table 1 Expected PCR product sizes for primer pairs targeting sequences flanking the integration site of five different Neanderthal and Denisovan HERVs, calculated using the UCSC Genome Browser's in silico PCR tool (Perez et al. 2024). Product size depends on whether the target region contains only the pre-integration site (PRE) or a solo-LTR (968 bp) plus the PRE and one direct repeat (6 bp).
Because the gels were run for only 45 minutes, apparent band sizes were usually imprecise, so bands were often identified by their size relative to one another rather than by absolute size.
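The relationship between the two expected product sizes, and the way band presence maps onto a genotype, can be sketched as follows. The 968 bp solo-LTR and 6 bp direct repeat come from the Table 1 caption; the function names and the example PRE size of 150 bp are illustrative assumptions rather than values from the table.

```python
# Sketch: expected product sizes and genotype calls from band presence.
# The PRE size used in the example is hypothetical.

SOLO_LTR_BP = 968      # solo-LTR length (Table 1 caption)
DIRECT_REPEAT_BP = 6   # one direct repeat (Table 1 caption)


def expected_solo_ltr_product(pre_bp):
    """Solo-LTR product = PRE product + solo-LTR + one direct repeat."""
    return pre_bp + SOLO_LTR_BP + DIRECT_REPEAT_BP


def call_genotype(pre_band, solo_ltr_band):
    """Categorise one primer lane from the presence of each band."""
    if pre_band and solo_ltr_band:
        return "heterozygous"
    if pre_band:
        return "homozygous PRE"
    if solo_ltr_band:
        return "homozygous solo-LTR"
    return "PCR failure"  # neither band visible


# Hypothetical PRE product of 150 bp
print(expected_solo_ltr_product(150))
print(call_genotype(pre_band=True, solo_ltr_band=True))
```

Because the two products differ by 974 bp, they remain easy to tell apart even when absolute sizes cannot be read precisely from the gel.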
Fig. 1 An example of an analysed gel image, here from an individual of European descent (the author). 2µL and 10µL repeats for each primer were performed to increase the likelihood of visibility and for redundancy – this gel exemplifies the importance of this, since it has a tear between the wells of the 2µL primer 3 and 10µL primer 1 lanes. A faint band of between 1000 and 1500 bp in size can be seen for primer 2, indicating the presence of a solo-LTR fragment. Smaller bands, in the region of 100 to 200 bp, can be seen for all 3 primers, signalling the presence of the PRE fragment for all primers. As such, this gel was categorised as homozygous PRE for primers 1 and 3, and heterozygous for primer 2.
Table 2 Genotype counts for the sample of students analysed for each of the five HERV primer pairs. Individuals were categorised as homozygous PRE, heterozygous, or homozygous solo-LTR according to the size of the PCR fragments present on their gel. Unfortunately, a substantial number of primer lanes showed neither a PRE nor a solo-LTR band, leading to them being classified as a PCR failure and excluded from further analysis.
To assess the stability of the five HERV solo-LTR sites examined in this study, allele frequencies were calculated (Table 3), allowing the conformity of the five solo-LTRs to Hardy-Weinberg equilibrium (HWE) to be investigated (Table 4). Due to small sample sizes, this analysis was performed with all sub-populations combined to maximise statistical power. A statistically significant Chi-squared value was found for primer pair 5 (5.97, p=0.015, df=1), indicating a deviation from Hardy-Weinberg proportions and suggesting that this locus may be under selective pressure.
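The allele frequency and Hardy-Weinberg calculations can be sketched as below. This is a minimal illustration rather than the exact procedure used: the function names are hypothetical, the genotype counts in the example are invented, and the p-value uses the closed form for a Chi-squared distribution with one degree of freedom.

```python
import math

# Sketch: allele frequencies and a Chi-squared test against HWE.
# Example counts are hypothetical, not taken from Table 2.


def p_value_df1(chi2):
    """Upper-tail probability of a Chi-squared statistic with df=1."""
    return math.erfc(math.sqrt(chi2 / 2))


def hwe_chi_squared(n_pre, n_het, n_ltr):
    """Return (p, q, chi2, p_value) for one locus.

    n_pre, n_het, n_ltr are counts of homozygous PRE, heterozygous and
    homozygous solo-LTR individuals with successful PCRs.
    """
    n = n_pre + n_het + n_ltr
    p = (2 * n_pre + n_het) / (2 * n)  # frequency of the PRE allele
    q = 1 - p                          # frequency of the solo-LTR allele
    observed = (n_pre, n_het, n_ltr)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip genotype classes with zero expectation (monomorphic loci)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, q, chi2, p_value_df1(chi2)


# Hypothetical counts: 30 homozygous PRE, 10 heterozygous, 5 solo-LTR
p, q, chi2, p_value = hwe_chi_squared(30, 10, 5)
print(f"p={p:.3f} q={q:.3f} chi2={chi2:.2f} p-value={p_value:.4f}")

# The statistic reported for primer pair 5
print(round(p_value_df1(5.97), 3))
```

One degree of freedom applies because there are three genotype classes, minus one for the fixed total and one for the allele frequency estimated from the data.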
Table 3 Allele frequencies across all sub-populations. Allele p represents a PRE/the absence of a solo-LTR, while q represents a solo-LTR.
Table 4 The observed genotype counts for individuals with successful PCRs and those predicted under Hardy-Weinberg equilibrium proportions. No significant deviation from HWE was found for primers 1-4.
To assess any potential skew among sub-populations, the fixation index (FST) was calculated per primer pair (Table 5) from the expected heterozygosity of the overall population (HT) and the mean expected heterozygosity across subpopulations (HS) as follows:
FST = (HT − HS) / HT
Sub-populations were defined according to the following ancestral haplogroups, which participants self-reported in the pooled data collection: blue = Asian, yellow = European or South American, red = African, and green = Middle Eastern, North American or Pacifican.
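The fixation index calculation can be sketched as follows, assuming the standard definition in which FST is the proportional reduction of HS relative to HT. Two further assumptions are made because the source does not specify them: HT is computed from the pooled allele frequency, and HS is weighted by the number of individuals in each sub-population. The function names and example counts are hypothetical.

```python
# Sketch: FST for one locus from genotype counts per sub-population.
# Weighting HS by sample size is an assumption, not stated in the text.


def expected_heterozygosity(p):
    """Expected heterozygosity 2pq for a biallelic locus."""
    return 2 * p * (1 - p)


def pre_frequency(n_pre, n_het, n_ltr):
    """Frequency of the PRE allele from genotype counts."""
    n = n_pre + n_het + n_ltr
    return (2 * n_pre + n_het) / (2 * n)


def fixation_index(counts_by_group):
    """Return (HT, HS, FST) where FST = (HT - HS) / HT.

    counts_by_group maps a haplogroup label to a tuple of
    (homozygous PRE, heterozygous, homozygous solo-LTR) counts.
    """
    # Ignore sub-populations with no successful PCRs
    groups = [c for c in counts_by_group.values() if sum(c) > 0]
    total = sum(sum(c) for c in groups)
    pooled = tuple(sum(c[i] for c in groups) for i in range(3))
    h_t = expected_heterozygosity(pre_frequency(*pooled))
    h_s = sum(
        sum(c) * expected_heterozygosity(pre_frequency(*c)) for c in groups
    ) / total
    # FST is undefined when the locus is monomorphic overall
    f_st = (h_t - h_s) / h_t if h_t > 0 else float("nan")
    return h_t, h_s, f_st


# Hypothetical genotype counts per haplogroup colour
example = {"blue": (12, 6, 2), "yellow": (8, 10, 4), "red": (6, 0, 0)}
h_t, h_s, f_st = fixation_index(example)
print(f"HT={h_t:.3f} HS={h_s:.3f} FST={f_st:.3f}")
```

A sub-population in which one allele is fixed contributes zero to HS, which is why fixation in even a small haplogroup can raise FST noticeably.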
Table 5 The expected heterozygosity of the overall population (HT), mean expected heterozygosity across subpopulations/haplotypes (HS), and the resulting fixation index (FST) values for primer pairs 1-5. Primers 1, 3 and 4 showed very little genetic differentiation, while 5 exhibited great differentiation, suggesting that it is unevenly distributed across geographical haplotypes.
Primers 1, 3 and 4 all showed very little genetic differentiation among sub-populations, with FST values all below 0.05. Primer 2, with an FST value of 0.062, exhibits moderate genetic differentiation, while primer 5 displays great genetic differentiation, with an FST value of over 0.15. The allele frequencies within each sub-population for these two primers are shown in Fig. 2; the PRE allele is fixed in the African haplogroup for both primers 2 and 5.
Fig. 2 The allele frequencies for primer pairs 2 and 5: the two HERV solo-LTR loci of the five investigated that showed more than very little genetic differentiation among sub-populations, and hence a moderately to greatly uneven distribution across the four geographical haplotype categories used. Fixation of PRE can be seen for the African (‘red’) haplogroup for both primers, while no viable data were collected for ‘green’ for primer 5.
Discussion
The finding that all the solo-LTRs, with the exception of the one targeted by primer 5, were present in Hardy-Weinberg equilibrium proportions was promising. While this might at first seem expected, considering that HERVs are the remnants of ancient viruses and were previously thought to constitute “junk DNA” (Weiss 2006), more recent research has uncovered a significant regulatory role that solo-LTRs can play, with pathogenic promoter activity observed (Chen et al. 2024). It therefore cannot be taken for granted that solo-LTRs are free from selective pressure. The loci targeted by 2024NE1, 2024DE5, 2024DE7 and 2024NE2, which showed no significant deviation from HWE, may thus represent promising neutral markers for fingerprinting. The significant deviation from HWE seen for primer 5 (2024DE2) may indicate that this solo-LTR has some regulatory activity and is therefore under selection, which would make it unsuitable for use in fingerprinting.
The FST analysis led to a similar conclusion regarding the suitability of primer 5 for genetic fingerprinting: a value of 0.153 indicated that its allele frequencies varied greatly between geographic haplotypes. In contrast, microsatellites typically exhibit fairly stable and uniform allele frequency distributions across populations, in spite of their high mutability (Xu et al. 2000).
Microsatellite loci have been found to remain stable (while retaining variability) for hundreds of millions of years (McComish et al. 2024), making them ideal for genetic fingerprinting because their conserved positions in the genome ensure consistent and reliable amplification across individuals and populations. HERVs seem to also exhibit promising levels of stability, with many found to be at identical locations in both apes and Old World monkeys (the divergence of which occurred approximately 25 mya) (Hughes and Coffin 2004).
A good fingerprinting marker should also be highly polymorphic, providing the variability necessary to distinguish between individuals. Microsatellites are extremely useful in this regard due to their highly variable number of simple sequence repeats (Selkoe and Toonen 2006), a product of replication slippage (Ellegren 2004). They also display a mutation rate many orders of magnitude greater than the genome as a whole (Li et al. 2002), further contributing to variation. Among HERVs, the HERV-K family (five members of which were targeted in this study) shows the most promise: while most HERVs appear to be present at the same locus in the entire population (Moyes et al. 2007), eight insertionally polymorphic HERV-K(HML2) elements have already been identified (Belshaw et al. 2005). These introduce some allelic diversity (PRE/no ancestral HERV integration versus solo-LTR/evidence of ancestral HERV integration), but on a far smaller scale than that observed with microsatellites. As such, HERV solo-LTR fingerprinting would perhaps be better placed to complement existing microsatellite fingerprinting techniques than to replace them.
The main limitation of this study was undoubtedly the small sample size, exacerbated by high rates of PCR amplification failure. Additionally, the cohort was not a representative sample of the global population, with a strong European and Asian bias. Together, these factors led to some haplotypes having very small sub-population counts or, in some cases, not being represented at all (for example, the green haplotype for primer 5). Such small samples are highly susceptible to skew, so it is questionable whether our sub-population samples (and the allele frequencies calculated from them) truly reflect global haplotypes, and the assumptions of the HWE and FST analyses are likely to have been violated. As such, further study involving a much larger and more strategically selected cohort would be needed to confirm the allele distribution of these solo-LTRs before further consideration is given to their utility as markers in genetic fingerprinting.
References
Belshaw R, Dawson A, Woolven-Allen J, et al (2005) Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): implications for present-day activity. Journal of Virology 79:. https://doi.org/10.1128/JVI.79.19.12507-12514.2005
Belshaw R, Pereira V, Katzourakis A, et al (2004) Long-term reinfection of the human genome by endogenous retroviruses. Proc Natl Acad Sci U S A 101:4894–4899. https://doi.org/10.1073/pnas.0307800101
Belshaw R, Tristem M (2009) Do humans have replication-competent endogenous retroviruses? Retrovirology 6:1–2. https://doi.org/10.1186/1742-4690-6-S2-P10
Chen M, Huang X, Wang C, et al (2024) Endogenous retroviral solo-LTRs in human genome. Frontiers in Genetics 15:1358078. https://doi.org/10.3389/fgene.2024.1358078
Ellegren H (2004) Microsatellites: simple sequences with complex evolution. Nat Rev Genet 5:435–445. https://doi.org/10.1038/nrg1348
Hughes JF, Coffin JM (2004) Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: Implications for human and viral evolution. Proceedings of the National Academy of Sciences of the United States of America 101:1668. https://doi.org/10.1073/pnas.0307885100
Lander ES, Linton LM, Birren B, et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921. https://doi.org/10.1038/35057062
Li Y-C, Korol AB, Fahima T, et al (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Molecular Ecology 11:2453–2465. https://doi.org/10.1046/j.1365-294X.2002.01643.x
McComish BJ, Charleston MA, Parks M, et al (2024) Ancient and Modern Genomes Reveal Microsatellites Maintain a Dynamic Equilibrium Through Deep Time. Genome Biology and Evolution 16:evae017. https://doi.org/10.1093/gbe/evae017
Moyes D, Griffiths DJ, Venables PJ (2007) Insertional polymorphisms: a new lease of life for endogenous retroviruses in human disease. Trends in Genetics 23:326–333. https://doi.org/10.1016/j.tig.2007.05.004
Nelson PN, Carnegie PR, Martin J, et al (2003) Demystified . . . Human endogenous retroviruses. Mol Pathol 56:11–18
Nwawuba Stanley U, Mohammed Khadija A, Bukola AT, et al (2020) Forensic DNA Profiling: Autosomal Short Tandem Repeat as a Prominent Marker in Crime Investigation. Malays J Med Sci 27:22–35. https://doi.org/10.21315/mjms2020.27.4.3
Perez G, Barber GP, Benet-Pages A, et al (2024) The UCSC Genome Browser database: 2025 update. Nucleic Acids Res gkae974. https://doi.org/10.1093/nar/gkae974
Selkoe KA, Toonen RJ (2006) Microsatellites for ecologists: a practical guide to using and evaluating microsatellite markers. Ecology Letters 9:615–629. https://doi.org/10.1111/j.1461-0248.2006.00889.x
Stoye JP (2001) Endogenous retroviruses: Still active after all these years? Current Biology 11:R914–R916. https://doi.org/10.1016/S0960-9822(01)00553-X
Subramanian S, Mishra RK, Singh L (2003) Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol 4:1–10. https://doi.org/10.1186/gb-2003-4-2-r13
van de Lagemaat LN, Landry J-R, Mager DL, Medstrand P (2003) Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends in Genetics 19:530–536. https://doi.org/10.1016/j.tig.2003.08.004
Vargiu L, Rodriguez-Tomé P, Sperber GO, et al (2016) Classification and characterization of human endogenous retroviruses; mosaic forms are common. Retrovirology 13:7. https://doi.org/10.1186/s12977-015-0232-y
Weiss RA (2006) The discovery of endogenous retroviruses. Retrovirology 3:1–11. https://doi.org/10.1186/1742-4690-3-67
Xu X, Peng M, Fang Z, Xu X (2000) The direction of microsatellite mutations is dependent upon allele length. Nat Genet 24:396–399. https://doi.org/10.1038/74238