Activated T cells as a DNA resource for long-read sequencing analyses
To capitalize on the advantage of long-read sequence technologies in SV analyses, high-molecular-weight DNA should be used for the library preparation step because the read length depends on the DNA size of DNA fragments in a library. Similarly, the data yield for a given cost varies depending on the input. For instance, it has been implied that the fragmentation status affects the sequencing yield per run, indicating that the length of the input libraries can affect the sequencing depth per cost35. As the read length and sequencing depth are important factors in constructing variation panels, these factors should be carefully controlled. In this regard, the TMM Biobank has been establishing proliferating cell resources, which avoid the rapid depletion of biospecimens14. In this study, we used high-molecular-weight DNA specimens derived from activated T lymphocytes for long-read sequence analyses.
The activated T cells were established from CD19-negative cells from the PBMC fraction in the blood of participants (Fig. 1a). As of March 2021, the TMM Biobank had established 4527 LCLs and 4808 activated T lymphocytes (Fig. 1b). The reason for our selection of genomic DNA specimens derived from activated T cells versus LCLs for the long-read sequence analysis is that the former can be established in a much shorter time span with a higher success rate than the latter, ensuring future expandability. We stimulated the cells with the human T-cell activators CD3 and CD28 and expanded them for three to ten days in culture medium supplemented with recombinant IL-2 cytokine14. We successfully recovered more than 99% of the cells stored in liquid nitrogen. Most of these cell resources are accompanied WGS information determined by short-read sequencing. Almost all established cells were positive for CD3, a T-cell marker (Fig. 1c), indicating that T cells dominantly proliferated under cytokine stimulation.
To assess the quality of the genomic DNA samples extracted from these T cells, we measured the optical density (OD) at 260/280 and 260/230 and obtained average OD ratios of 1.89 ± 0.15 and 1.81 ± 0.44 (mean ± SD), respectively. We also conducted pulsed-field gel electrophoresis of 5 random samples (#1–#5). The lengths of the DNA specimens ranged from 20 to 145 kb (Fig. 1d and Supplementary Fig. 1a), demonstrating that the DNA samples used were appropriate for long-read sequencing analyses.
Next, we conducted a long-read WGS analysis with a nanopore sequencer utilizing T-cell genomic DNA specimens. To optimize the DNA fragmentation step to balance the sequencing yield and read length, we designed a step to yield DNA fragments with lengths ranging from 20 to 80 kb (Fig. 1d and Supplementary Fig. 1a). We obtained 85.0 ± 5.4 Gb of yield and 25.8 ± 1.8 kb of N50 length per flowcell (n = 5), indicating that half of the sequence base pairs were derived from reads longer than or equal to 25.8 kb (Fig. 1e and Supplementary Fig. 1b, c). Taken together, these results support our contention that activated T cells constitute a useful genomic DNA resource for a long-read WGS analysis.
To address whether the utilization of activated T cells is a good method for genome analyses, we designed a benchmark analysis using three independent sets of genomic DNA samples obtained from activated T cells and LCLs (Supplementary Fig. 2a). Notably, there are donor-matched and high-quality de novo assemblies available for all three genomes40. In this benchmark analysis, we obtained standard SV-call sets from the assemblies and compared them to SV-call sets from nanopore sequencing data to calculate the precision and recall scores of SV detection (Supplementary Fig. 2a). To select the SV-call pipeline, we experimentally compared the efficiency and accuracy of the CuteSV41 and Sniffles algorithms42, both of which are widely used algorithms available for the nanopore sequencing data. As shown in Supplementary Fig. 2b, CuteSV reproducibly showed higher recall and precision scores than Sniffles in the detection of DELs. This result is concordant with previous benchmark studies41,43, and we therefore decided to utilize the CuteSV algorithm. Utilizing this algorithm, we next compared the precision and recall scores of activated T cells and LCLs. As shown in Supplementary Fig. 2c, we observed that activated T cells and LCLs showed very similar recall and precision scores. Therefore, we concluded that activated T cells were an acceptable resource for a genome analysis similar to LCLs.
Trio-based structural variation analysis using long-read sequencing technology
To clarify the variation spectra, frequencies, and functional impact of SVs in the Japanese population, we constructed an allele-frequency panel. Here, it should be noted that despite the continuous improvements in computational tools, many challenges in read alignment-based SV calling algorithms remain34,41. Therefore, to apply quality assessments based on Mendelian inheritance error profiling, we designed WGS analyses of 333 BirThree cohort participants comprising 111 parent–offspring trios through the long-read sequence procedures established in this study (Fig. 2a).
Using 411 flowcells, we conducted 430 runs in total (Supplementary Data 1). As shown in Supplementary Fig. 3a,b, the sequencing yields increased as the active pore count increased, but the N50 lengths did not. These observations indicate that the quality of the flowcell is a determinant of the sequencing yield. While it has been known that the sequencing yield per flowcell decreases when longer libraries are subjected to sequencing35, we did not observe such a negative correlation between the two variables (Supplementary Fig. 3c). We surmise that this occurred because optimization in the fragmentation step resulted in low diversity of the N50 length, suggesting that it is important to optimize the DNA fragmentation step to balance the sequencing yield and read length for high-quality deep sequencing.
The sequencing data resulted in 69.7 Gb per individual after filtering the sequence reads with low-quality values (lower than a mean quality score44,45 of 6 as shown by the dotted line in Fig. 1e). Our strategy led to relatively long (read N50 of 25.8 ± 3.9 kb) sequence reads (Fig. 2b) compared to previous works35,36. When aligned to the human reference genome (GRCh38), the sequence reads resulted in 22.2 ± 4.4-fold coverage (Fig. 2c), and the median sequencing error rate was 7.9% (2.2% for insertions, 3.5% for deletions, and 2.2% for mismatches) (Supplementary Fig. 3d). Taken together, the results support the integrity of our approach, including the following two important improvements: the use of T-cell resources to stably provide high-quality DNA suitable for SV analyses at the population scale and the use of BirThree Cohort participants for Mendelian error profiling.
Structural variations detected in the Japanese population
Next, we evaluated the structural variation spectra in the Japanese population. We detected two classes of canonical SVs, deletions (DELs) and insertions (INSs), both of which were more than 50 bp in length, identifying 23,056 ± 454 SVs per individual on autosomes composed of 10,923 DELs and 12,133 INSs per individual (Fig. 2d). The numbers of detected SVs in this study are comparable to those of previous estimations by means of long-read sequencing32,35 and other technologies21,33.
Then, we merged these SVs of the 333 individuals into a nonredundant set of SVs to produce a variant repository composed of 37,981 DELs and 36,220 INSs, showing a balanced number of DELs and INSs (Fig. 2e). In this regard, several studies have identified more INSs than DELs35,46. A plausible explanation for this discrepancy may be the lower recall scores in INS detection than DEL detection in our study (Supplementary Fig. 2b). We surmise that biases in SV calling remain and expect the development of elaborate bioinformatics algorithms to address this issue. One additional hypothesis is that a few thousand loci that are underrepresented in GRCh38 may affect the ratio between insertions and deletions. In good agreement with this hypothesis, while we identified more insertions than deletions in the individual-level analysis (Fig. 2d), the ratio between insertions and deletions became closer to 50:50 in the population-scale analysis that included 333 individuals (Fig. 2e).
The number of SVs strongly correlates with the length and rapidly decreases; three peaks at sizes of ~300 bp, 3 kb, and 6 kb are notable. We consider that these peaks are due to retrotransposon elements, especially Alu, SINE/VNTR/Alu, and LINE-1 elements (Fig. 2e), based on their sizes35,36.
To ascertain the benefit of using high-quality DNA in SV analyses, we examined the correlation between the read length and SV detection ability. As shown in Fig. 2f, we observed a strong correlation between the read N50 and the mean size of the INSs and a moderate correlation between the read N50 and the mean size of the DELs (Pearson correlation coefficient [cor] = 0.54 and 0.80 for DEL and INS, respectively). These results suggest that an SV analysis using longer reads has an advantage in the detection of large SVs compared with that using shorter reads. To further verify this finding, we also evaluated the correlation between the read length and SV detection ability by detecting large SVs ranging from 5.9 to 6.1 kb and small SVs ranging from 280 to 350 bp (Supplementary Fig. 4a, b, respectively). The SVs belonging to the former fraction contain LINE-1-related SVs, and those belonging to the latter fraction contain Alu-related SVs. We found that the number of large INSs was correlated with the read N50 (cor = 0.54), whereas that of the DELs (cor = 0.18 and 0.07 for large and small DELs, respectively) and small INSs (cor = 0.25) was not correlated with the read N50. These results indicate that an SV analysis using longer reads ranging from 10 to 35 kb utilizing activated T cells is beneficial for the comprehensive identification of large SVs, especially in the case of INSs.
Next, we evaluated the minor allele frequencies (MAFs) of individual SVs in the Japanese population. To avoid double counting the SVs shared between parents and offspring and, thus, prevent the overestimation of the allele frequencies of the SVs in the general population, we extracted SVs observed in 222 unrelated individuals (i.e., fathers and mothers) from the repository to evaluate MAF. We found that the number of SVs decreased as the allele frequency increased (Fig. 3a). Then, we categorized these SVs into the following four categories: singleton (minor allele count [MAC] = 1); rare (MAC > 1 and MAF < 0.01); low (MAF ≥ 0.01 and MAF < 0.05); and common (MAF ≥ 0.05). Across all SV classes, 12,782 SVs (representing 17.6% of all SVs identified in 222 unrelated individuals) were singletons, 9600 (13.2%) SVs were rare, 12,660 (17.5%) SVs were low, and 37,428 (51.6%) SVs were common. Overall, most SVs are shared among unrelated individuals.
An intriguing observation is that the sizes of the SVs vary among the SV categories (Fig. 3b). For example, large SVs were most frequently found in the singleton category, but the median size of the SVs decreased as the MAF increased (singleton, 274 bp; rare, 201 bp; low, 168 bp; and common, 133 bp). This result is consistent with the previous observation47 in which the allele frequency of SVs in size range of 100 kb to 1 Mb decreased with size. The size of the SVs appeared to be the smallest in the common category. These observations suggest that the size of SVs or the amount of rearranged DNA may be a key determinant in the selection of SVs.
Ethnic diversity of SVs
To assess ethnic differences or diversity in the occurrence of SVs, we compared the DELs in our dataset with those in the recently published Iceland deCODE study35 (see Methods, “Comparison of SVs to the deCODE dataset”). The deCODE dataset contains data derived from a population-based analysis of SVs using a long-read sequencing platform. As shown in Fig. 3c, of all SVs in our dataset (INS and DEL; shown as “TMM”), 38,304 (53.5%) were also found in the deCODE dataset, while 33,279 were unique to the TMM dataset. Next, we compared the MAFs of the unique DELs and INSs to those of overlapping ones. The results revealed that the SVs in the common category were shared preferentially with those in the deCODE dataset; in contrast, those with a low MAF and those in the rare and singleton categories tended to be unique in the TMM dataset (Fig. 3d). Thus, the comparison of the deCODE and TMM datasets revealed significant differences in the ethnic distribution of SVs, even though high-MAF SVs are shared relatively widely across ethnicities.
For further analysis focusing on the ethnic diversity of SVs, we extracted common SVs (MAF ≥ 0.05) and compared the allele frequencies (AFs) between the deCODE dataset and the TMM dataset (Fig. 3e). We observed only modest correlations between these two datasets (Pearson correlation coefficient [cor] = 0.50 and 0.40 for DEL and INS, respectively), indicating that the common SVs showed substantial differences in AFs between the ethnicities.
Next, we compared the TMM dataset to that of gnomAD-SV25, which was based on short-read WGS of more than 14,000 samples. Notably, the samples are mainly derived from European and African/African-American samples, while less than 10% of the samples are derived from East Asian samples (EAS). In the comparison between the TMM dataset and the whole gnomAD-SV, only modest correlations (cor = 0.61 and 0.57 for DELs and INSs, respectively) were observed (Supplementary Fig. 5), showing very good agreement with the observation in the comparison to the deCODE dataset. In contrast, we observed much higher correlations (cor = 0.74 and 0.78 for DELs and INSs, respectively) in the comparison of the TMM dataset and the East Asian subset in the gnomAD-SV (Fig. 3f), demonstrating that the AFs in the TMM dataset reflect ethnicities in AFs.
Quality assessment of SVs based on Mendelian inheritance
To assess the reliability of our SV analysis and dataset, we examined the Mendelian inheritance error (MIE) rates by taking advantage of a trio analysis. In general, MIEs arise from two major events in this type of analysis. One event is germline or nongermline de novo mutations (biological errors), and the other event is variant calling errors or incorrect pedigree information (technical errors). As true de novo mutations occur at an extremely low rate (~1 × 10−8 per base pair per generation)48,49, the former biological errors seem less influential in this case, and we surmise that most MIEs are the consequence of the latter event. Therefore, the MIEs of SVs can serve as an indicator of the reliability of SV analyses. In fact, the MIE rate was calculated in the deCODE long-read study35.
We found that the transmission of 3.5 ± 0.1% and 4.3 ± 0.2% of the DELs and INSs per trio, respectively, did not follow Mendelian inheritance (Fig. 4a). We also observed a lower concordance in trios with lower coverage (Fig. 4b). In particular, the genotyping accuracy of the INSs was more sensitive to the sequencing coverage, indicating that a higher sequencing coverage is desirable for higher genotyping accuracy. A previous analysis of five trios (15 members) using long-read sequencing technology estimated the MIE to be 6.4–15.2%50. Thus, although the MIE rates were still higher than expected, there was a substantial improvement in concordance with Mendelian inheritance, perhaps due to technical improvements in long-read sequencing, including stable data production in terms of the sequencing depth and read length, and in bioinformatics pipelines.
We attempted to examine the characteristics and genomic locations of the SVs that showed MIEs. Taking advantage of the trio analysis, we evaluated the incidence of MIEs in each SV by calculating the error family ratio (number of trios with MIE/total number of trios analyzed). For the SV types that showed MIEs, we observed concordant values in the DELs and INSs (DEL, 3.2 ± 5.3%; INS, 3.7 ± 5.2% of trios showed MIEs; Supplementary Fig. 6a). To address the genomic distribution of SVs with MIEs, we precisely plotted the error family ratios on genomic locations (Supplementary Fig. 6b). We identified SVs located near gaps and chromosome ends that frequently accompany MIEs. We also found that the SVs in the regions near gaps and chromosome ends were often called based on low coverage sequencing reads (Supplementary Fig. 7a, b). Therefore, we surmise that the high incidences of MIEs were derived, at least partly, from erroneous SV calls due to the difficulty in read mapping.
We also evaluated the accuracy of the SV in each MAF category. To this end, we employed the error family ratio. As shown in Supplementary Fig. 6c, we observed that the error family ratios were lower for singleton SVs than for common or highly frequent SVs. One plausible explanation for this unexpected observation is that for high-frequency SVs, the accuracy is affected by systematic errors during the identification of SVs. In low complexity regions in the genome, some SVs with a low accuracy may be called with high frequencies, which results in MIEs. In contrast, singletons and low-frequency SVs may be less affected by such systematic errors.
We evaluated the distribution of the SV genotype calls in all parent–offspring trios to assess whether there are specific combinations of genotypes among trios that lead to erroneous calls (Fig. 4c). While the distributions of DELs and INSs in our dataset were closer to the expected probabilities, for the most part, several minor discrepancies in the genotype distribution were observed. For instance, in the case of the genotype of offspring derived from parents with the genotypes “0/0” and “1/1”, high-rate MIE accumulation was observed (Fig. 4c). This observation was reproducible in the deCODE study35. Thus, these results demonstrate that our long-read sequence analysis achieved reasonable genotype calls, considering that substantial challenges need to be addressed to accomplish fully reliable and accurate genotype calls of SVs using long-read sequence technology.
Functional annotation of SVs
To evaluate the potential function of the SVs, next, we annotated the SVs to genomic features. We first examined how these SVs are distributed on chromosomes. Although the number of SVs correlated with the chromosome length (Fig. 5a), the SVs were differentially distributed on each chromosome (Fig. 5b; P < 2.2 × 10−16 for DELs and P = 6.87 × 10−16 for INSs; Kruskal–Wallis rank-sum test). We also observed several peaks, indicating the high density of SVs on chromosome ends and sites adjacent to the gaps remaining in the reference genome, and this observation is concordant with the observation in a previous study32 (Fig. 5c). To evaluate the nonrandom distribution of the SVs in detail, we plotted the number of sequencing reads supporting the variant call of each SV (Supplementary Fig. 7a,b). The sequencing depths were decreased in the genomic regions adjacent to gap and chromosome ends, suggesting that the difficulty in read mapping might result in the erroneous detection of SVs in these regions. Nonetheless, we found five peaks of SVs located in positions far from the gaps and chromosome ends (Fig. 5c, green arrows). The green arrow position on chromosome 6 involves human leukocyte antigen (HLA) loci. Regarding the position, a closer analysis of our long-read sequence data revealed that these five peaks are located in regions that harbor high-level segmental duplications (Fig. 5c). The regions with SD accumulations are intractable with the current long-read sequencing technology. We surmise that the difficulty in SV detection in these regions might result in the overestimation of SVs.
Next, we examined the localization of these SVs within intergenic regions, introns, exons, and protein-coding sequences (CDSs). Of 74,201 SVs, 34,053 (45.9%) were in intergenic regions, while 38,749 (52.2%), 3099 (4.2%), and 828 (1.1%) SVs overlapped with introns, exons, and CDSs, respectively (P < 0.002, bootstrap test), which is concordant with a previous study35. Thus, SVs located in intergenic regions were overrepresented, and SVs located in introns, exons, and CDSs were underrepresented (Fig. 5d). We also observed elevated rates of rare alleles in exons and CDSs (Fig. 5e). These differences in the distribution of SVs within genomic regions suggest that these SVs influence gene structure, which provides information regarding the strength of negative selection during molecular evolution.
SVs associated with clinical phenotypes
As SVs affect gene structures more drastically than smaller variants, including SNVs and indels (insertions/deletions less than 50 bp), SVs located in CDSs may exert more potent effects on gene functions and downstream phenotypes than SNVs and indels. To gain functional insight into these SVs, we searched for and identified SVs overlapping with CDSs belonging to 461 protein-coding genes (Supplementary Data 2). Of these SVs, we selected four previously shown or suggested to be associated with clinical phenotypes and examined them closely in our set of analyses.
We identified DELs of 4.9 kb in hemoglobin subunit gamma 1 and 2 (HBG1 and HBG2) loci (Fig. 6a), which were present in two parent-offspring pairs out of 111 trios examined in this study (Fig. 6f), and in both cases, the alleles transmitted from parent to offspring (from the father to offspring in the family shown in Fig. 6a). While the HBG1 and HBG2 genes, which encode γ-globin chains consisting of HbF, are expressed predominantly during the fetal stage from the β-globin gene cluster, their expression is progressively silenced during the postnatal period due to the interplay of transcription factors interacting with the locus control region (LCR) and the HBG1 and HBG2 promoters51. The breakpoints of the DELs are located in second introns of the HBG1 and HBG2 genes, resulting in an HBG1-HBG2 fusion gene52, as illustrated in Fig. 6e. Intriguingly, this DEL has been shown to cause an increased expression of γ-globin and elicit hereditary persistence of fetal globin (HPFH), with increased expression of γ-globin in the adult stage52. Another East Asian case of HPFH with this DEL has also been reported53. The allele frequency (AF) of this DEL was 0.45% (2 in 444 alleles) in this study, which appears to be concordant with the estimation obtained using a short-read whole-genome sequence database (Fig. 6f; 0.52% in East Asia from gnomAD25).
We also found a 32-kb DEL in genes encoding late cornified envelope 3B and 3C (LCE3B and LCE3C) proteins related to skin barrier functions, as has been described in the previous studies35,54. This DEL includes whole LCE3B and LCE3C genes (Fig. 6b), and complete loss of these genes is reported to be associated with susceptibility to psoriasis35,55,56. The AF of this DEL was 49.3% in our study, showing very good agreement with the high frequency in East Asia (57.7%; Fig. 6f) from gnomAD25. To evaluate the accuracy of the AF based on the bioinformatics pipeline used in this study, we counted alternative alleles and estimated AF via visual inspections of read alignment. We found 247 ACs and 55.6% AFs, comparable to the AF estimated by a variant caller (CuteSV41; 49.3%).
We also detected a DEL in the gene encoding the drug-metabolizing enzyme cytochrome P450 family 2 subfamily member 6 (CYP2A6) (Fig. 6c). The DEL in the CYP2A6 gene is known to be associated with poor nicotine metabolism57,58,59. The DEL in CYP2A6 results in a fusion between the 3’ UTRs of CYP2A6 and CYP2A760, and the alternative allele is referred to as CYP2A6*4. This DEL in the CYP2A6 gene has been reported to be common (15.1–19.0%58,59, Fig. 6f), but the bioinformatic algorithm used in this study estimated the AF to be only 0.45%. We surmised that this discrepancy is due to an underestimation of AF by the algorithm used because our visual inspection of the CYP2A6 locus read alignment resulted in an AF concordant with previous estimations (15.5%, Fig. 6f). Nonetheless, we used CuteSV in this study as this variant caller is assumed to be the most accurate available to date.
Focusing on INS, we detected the expansion of triplet repeats in the coding sequence of the Ataxin 3 (ATXN3) gene (Fig. 6d). CAG repeat expansion in exon 10 of the ATXN3 gene is known to cause spinocerebellar ataxia type 3 (SCA3) by resulting in an abnormally long polyQ tract in the encoded protein. Affected individuals are usually heterozygous for the expansion and carry 52–86 CAG trinucleotide repeats in the expanded allele, whereas wild-type (WT) alleles have 12–44 CAG repeats61. The numbers of repeats are highly polymorphic62, and our analysis was not optimized to accurately estimate repeat length; nonetheless, we identified the variation in the repeat as INS with a length of 56 bp (Fig. 6d and Supplementary Fig. 8).
In summary, we constructed an allele-frequency panel focusing on SVs by utilizing the activated T-cell resource in our biobank and nanopore sequencing technology. This strategy was successful in terms of supplying a sufficient amount of high-quality genomic DNA suitable for long-read sequencing analysis and high-throughput long-read sequencing at the population scale. We also validated the reliability of the SV panel utilizing trio samples recruited in the TMM biobank to validate the panel by means of Mendelian inheritance error profiling.