Ulusal Tez Merkezi (National Thesis Center)

Thesis No	Download	Thesis Record	Status
180820		Computational representation of protein sequences for homology detection and classification / Protein dizilimlerinin homoloji sezimi ve sınıflandırma amaçlı bilişimsel gösterimi Author: HASAN OĞUL Advisor: Assist. Prof. Dr. ERKAN MUMCUOĞLU Location: Middle East Technical University / Informatics Institute / Department of Information Systems Subject: Computer Engineering and Computer Science and Control Index:	Approved Doctorate English 2006 112 p.

ÖZ

PROTEİN DİZİLİMLERİNİN HOMOLOJİ SEZİMİ VE SINIFLANDIRMA AMAÇLI BİLİŞİMSEL GÖSTERİMİ

Oğul, Hasan
Ph.D., Department of Information Systems
Supervisor: Assist. Prof. Dr. Erkan Ü. MUMCUOĞLU
January 2006, 102 pages

Machine learning methods are frequently used for classification problems in computational biology. The inputs to these methods must consist of fixed-length feature vectors. Because proteins can have different lengths, methods are needed that represent protein sequences with a fixed number of features. This thesis presents three different methods for this purpose. The first is n-peptide composition with reduced alphabets, the second is pairwise similarity scores based on maximal unique matches, and the third is pairwise similarity scores based on probabilistic suffix trees.

The new sequence representation methods described in the thesis are applied, with problem-specific modifications, to three important problems of computational biology: remote homology detection, subcellular localization prediction, and solvent accessibility prediction. For each problem, comparative analyses of the existing and the new methods are presented, based on experiments on common benchmark sets.

In the remote homology detection tests, all three new methods obtain accuracy values comparable to those of the best existing methods while running much more efficiently. A combination of the new methods is used to develop a system named PredLOC, which predicts the subcellular localization of proteins, and this system is tested on two different eukaryotic protein sets. On both datasets PredLOC reaches the best accuracy value obtained so far. The use of maximal unique matches provides a small improvement in solvent accessibility prediction.

Keywords: n-peptide composition, maximal unique match, probabilistic suffix tree, remote homology, subcellular localization.

ABSTRACT

COMPUTATIONAL REPRESENTATION OF PROTEIN SEQUENCES FOR HOMOLOGY DETECTION AND CLASSIFICATION

Oğul, Hasan
Ph.D., Department of Information Systems
Supervisor: Assist. Prof. Dr. Erkan Ü. MUMCUOĞLU
January 2006, 102 pages

Machine learning techniques have been widely used for classification problems in computational biology. They require the input to be a collection of fixed-length feature vectors. Since proteins are of varying lengths, there is a need for a means of representing protein sequences by a fixed number of features. This thesis introduces three novel methods for this purpose: n-peptide compositions with reduced alphabets, pairwise similarity scores by maximal unique matches, and pairwise similarity scores by probabilistic suffix trees.

The new sequence representations described in the thesis are applied, with some problem-specific modifications, to three challenging problems of computational biology: remote homology detection, subcellular localization prediction, and solvent accessibility prediction. Rigorous experiments are conducted on common benchmarking datasets, and a comparative analysis is performed between the new methods and the existing ones for each problem.

On remote homology detection tests, all three methods achieve accuracies competitive with the state-of-the-art methods, while being much more efficient. A combination of the new representations is used to devise a hybrid system, called PredLOC, for predicting the subcellular localization of proteins, and it is tested on two distinct eukaryotic datasets. To the best of the author's knowledge, the accuracy achieved by PredLOC is the highest reported on those datasets. The maximal unique match method yields only a slight improvement in solvent accessibility predictions.

Keywords: n-peptide composition, maximal unique match, probabilistic suffix tree, remote homology, subcellular localization.
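As an illustration of the first representation named in the abstract, the following is a minimal sketch of an n-peptide composition over a reduced alphabet. It is not the thesis's implementation: the four-group amino acid reduction in `GROUPS`, the function name `npeptide_composition`, and the choice to skip unknown residues are all assumptions made for this example. The point it shows is that sequences of any length map to a vector with the same number of entries (here 4^n), which is what a fixed-length machine learning input requires.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-group reduction of the 20 amino acids (illustrative only;
# the thesis may use different groupings).
GROUPS = {
    "h": "AVLIMCFWY",  # hydrophobic
    "p": "STNQGP",     # polar or small
    "b": "KRH",        # basic
    "a": "DE",         # acidic
}
REDUCE = {aa: label for label, members in GROUPS.items() for aa in members}


def npeptide_composition(sequence, n=2):
    """Return n-peptide frequencies over the reduced alphabet.

    The result always has len(GROUPS) ** n entries, regardless of the
    length of the input sequence.
    """
    # Assumption: residues outside the 20 standard amino acids are skipped.
    reduced = [REDUCE[aa] for aa in sequence.upper() if aa in REDUCE]
    counts = Counter(
        "".join(reduced[i:i + n]) for i in range(len(reduced) - n + 1)
    )
    total = sum(counts.values())
    # Fixed ordering of all possible n-peptides over the reduced alphabet.
    keys = ["".join(p) for p in product(sorted(GROUPS), repeat=n)]
    return [counts[k] / total if total else 0.0 for k in keys]


if __name__ == "__main__":
    short = npeptide_composition("MKT")
    longer = npeptide_composition("MKVLAAGIVDEKRST")
    print(len(short), len(longer))  # both vectors have the same length
```

With n = 2 and four groups, every protein is described by 16 numbers, compared with 400 for dipeptides over the full 20-letter alphabet. A smaller alphabet therefore makes larger values of n practical.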