Algorithms for structural variation discovery using multiple sequence signatures

Söylev, Arda

Date

2018

Author

Söylev, Arda

Abstract

Genetic variations such as single nucleotide polymorphisms (SNPs), base-pair insertions/deletions (indels), and structural variations (SVs) have significant phenotypic effects on living organisms. Among these, SVs, which affect more than 50 base pairs, are also the underlying cause of various hereditary diseases such as Crohn's disease, schizophrenia, and autism. Moreover, SVs affect many more base pairs than SNPs do (3.5 Mbp for SNPs, 15-20 Mbp for SVs). Today, using next-generation sequencing (NGS) technology, we can perform whole genome sequencing (WGS) and discover these types of variation much faster, more cheaply, and with high accuracy. However, as also seen in the 1000 Genomes Project, NGS technology has some shortcomings. The most important problem is that the short read lengths (<250 bp) produced by the NGS platforms currently in use, together with the highly repetitive regions that genomes contain, make it difficult to align these short reads with high accuracy. This in turn affects the accuracy of the genomic variations discovered. For this reason, although the algorithms developed to date can characterize relatively simple SVs such as insertions, deletions, and short inversions, they have overlooked the more complex variations associated with many genetic diseases. Observing the effects of such SVs on the human genome requires new, highly accurate algorithms that take different approaches.

In this thesis, we introduce the TARDIS algorithm, which finds SVs in an organism's genome using short reads from NGS technology. TARDIS can characterize many SV types, including deletions, novel sequence insertions, inversions, transposon insertions, mitochondrial insertions, tandem duplications, and inverted/direct interspersed duplications. To discover these variations with high accuracy, it combines different signals such as read pairs, read depth, and split reads. In addition, TARDIS can use all mapping locations of a read, taking into account the errors that arise when, because of the repetitive structure of the genome, the same read aligns to multiple places with similar accuracy. Recently, new library preparation protocols have been developed because of the limitations of short reads, and 10x Genomics is one of them. This technique provides long-range contiguity information at low cost and is an alternative to high-cost long reads. To get around the limitations caused by short reads, TARDIS can also use the linked reads of 10x Genomics.

We evaluated the accuracy of the algorithms we developed using simulated and real data. In simulations, TARDIS achieved 97.67% sensitivity and a 1.12% false prediction rate. For the real-data experiments, we used two haploid (CHM1 and CHM13) and one diploid (NA12878) human genome. When we compared the results with PacBio data sets, we found that TARDIS had higher accuracy than the most successful methods in the literature. We also showed that, for the CHM1 genome, TARDIS has a very low error rate for tandem and interspersed duplication variations (below 5% for its top 50 predictions). Finally, we note that the algorithms introduced here are the first that can characterize interspersed structural variations using NGS technology.

Genomic variations, including single nucleotide polymorphisms (SNPs), small indels, and structural variations (SVs), are known to have significant phenotypic effects on individuals. Among them, SVs, which alter more than 50 nucleotides of DNA, are the major source of complex genetic diseases such as Crohn's disease, schizophrenia, and autism. Additionally, the total number of nucleotides affected by SVs is substantially higher than the number affected by SNPs (3.5 Mbp for SNPs versus 15-20 Mbp for SVs). Today, high-throughput sequencing (HTS) technology makes it possible to perform whole genome sequencing (WGS) and to discover these variants far faster, more cheaply, and more accurately than before. However, as demonstrated in the 1000 Genomes Project, HTS technology still has significant limitations. The major problem is that the short reads (<250 bp) produced by current sequencing platforms, combined with the large amount of repeats in most genomes, make it very challenging to map reads unambiguously and to characterize genomic variants accurately. Thus, most existing SV discovery tools focus on detecting relatively simple types of SVs such as insertions, deletions, and short inversions. Yet other types of SVs, including complex ones, are of crucial importance, and several have been associated with genomic disorders. To better understand the contribution of these SVs to the human genome, we need new approaches that accurately discover and genotype such variants. There is therefore still a need for accurate algorithms that characterize a broader spectrum of SVs and thereby also improve the calling accuracy of simpler variants.

Here we introduce TARDIS, which harbors novel algorithms to accurately characterize various types of SVs, including deletions, novel sequence insertions, inversions, transposon insertions, nuclear mitochondrial insertions, tandem duplications, and interspersed segmental duplications in direct or inverted orientation, using short-read whole genome sequencing datasets. Within our framework, we combine multiple sequence signatures, namely read pair, read depth, and split read, to increase SV prediction accuracy. Additionally, we are able to analyze more than one possible mapping location for each read to overcome the problems associated with the repetitive nature of genomes. Recently, in response to the limitations of short-read sequencing, newer library preparation techniques have emerged, and 10x Genomics is one of them. This technique obtains long-range contiguity information and is regarded as a cost-effective alternative to long-read sequencing. We extended TARDIS to use the Linked-Read information of 10x Genomics to overcome some of the constraints of short-read sequencing technology.

We evaluated the prediction performance of our algorithms through several experiments using both simulated and real data sets. In the simulation experiments, TARDIS achieved 97.67% sensitivity with only a 1.12% false discovery rate. For the experiments involving real data, we used two haploid genomes (CHM1 and CHM13) and one diploid human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than for state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate for our approach in predicting tandem, direct, and inverted interspersed segmental duplications on CHM1 (less than 5% for the top 50 predictions). The algorithms we describe here are the first to predict insertion location and the various types of new segmental duplications using HTS data.
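
As a rough illustration of the read-pair signature named in the abstract, the sketch below applies the textbook heuristic for paired-end libraries whose mates are expected to face each other (forward/reverse). It is not taken from TARDIS: the `ReadPair` type, the `classify_pair` function, and the `min_span`/`max_span` thresholds are hypothetical names chosen for this example, and a real caller would cluster many such pairs and combine them with read depth and split reads before predicting an SV.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical type for this sketch)."""
    chrom1: str
    pos1: int      # leftmost mapped position of the first mate
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # leftmost mapped position of the second mate
    strand2: str   # '+' or '-'


def classify_pair(pair: ReadPair, min_span: int, max_span: int) -> str:
    """Label a pair with the SV type its mapping pattern hints at.

    min_span and max_span bound the mate distance expected for the
    library (assumed values, normally estimated from the data).
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the two mates by genomic position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    span = right_pos - left_pos

    if left_strand == right_strand:
        # Both mates on the same strand: inversion-like signature.
        return "inversion"
    if left_strand == "-" and right_strand == "+":
        # Mates point away from each other: tandem-duplication-like.
        return "tandem_duplication"

    # Expected orientation ('+' on the left, '-' on the right).
    if span > max_span:
        return "deletion"
    if span < min_span:
        return "insertion"
    return "concordant"
```

For example, with an assumed expected span of 300 to 500 bp, a correctly oriented pair whose mates map 5,000 bp apart is labelled `deletion`, because the sequenced fragment is shorter than the reference distance between its ends.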

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/33457

Collections

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess