Ulusal Tez Merkezi

Thesis No	Download	Thesis Record	Status
597456		Dynamic data replication and distribution in database systems / Veri tabanı sistemlerinde dinamik veri kopyalama ve dağıtımı Author: SAADI HAMAD THALIJ ALLUHAIBI Advisor: Assoc. Prof. Dr. VELİ HAKKOYMAZ Location: Yıldız Teknik Üniversitesi / Fen Bilimleri Enstitüsü / Bilgisayar Mühendisliği Ana Bilim Dalı / Bilgisayar Mühendisliği Bilim Dalı Subject: Computer Engineering and Computer Science and Control Index:	Approved Doctorate English 2019 142 p.

With the advancement of database systems technology, new theoretical foundations have emerged and a large number of applications have come into use. Similarly, the development of computer networks has allowed many computers to connect to one another and exchange data and resources. Because of the centralized Data Management System and the ability of many users to connect to it at the same time, it has become impossible for data beneficiaries to concentrate data in a single large central system. Increased network traffic and reduced efficiency have made it necessary to split data across many sites, with each location acquiring its own storage and local processing capabilities. Distributed Databases (DDBs) emerged as a result. These databases play a very important role today, when reliable and accurate data is a necessity. As a result of innovative developments, hardware, software, protocols, storage and networks have become commercial requirements. Along with this, the use of DDBs has become a feasible and operational decision. The strength of distributed databases is their ability to deliver interconnected data from any physically separate site to any other site. A Distributed Database Management System (DDBMS) belongs to the class of application software that manages a distributed database and offers transparent access to many users at many sites by integrating parallelism and modularity. Although DDB design is efficient, it also has many practical limitations. These limitations concern the selection of efficient methods for the fragmentation, allocation and replication of data. This research thesis focuses on developing efficient solutions to DDB design problems. 
The main aim of the thesis is to present powerful methods for data fragmentation, allocation and replication in order to strengthen query processing in DDBs and achieve better performance. The first approach focuses on the limitations of the observed data about queries. The aim here is to reach a decision on the fragmentation issue, which is ineffective in preliminary distributed database design. At this stage, efficiency is estimated only through proper design and the network communication costs between sites. To solve this problem, an improved hierarchical agglomerative clustering (IHAC) algorithm model is used to derive the semantic fragmentation of distributed databases. IHAC builds the data representation matrix by considering all data objects rather than data counts. The traditional hierarchical agglomerative clustering algorithm, in contrast, considers the data count or frequency when building the data representation matrix in order to select and compute similarity measures. In this way the clustering of data objects becomes stronger, and as a result data fragmentation is carried out more efficiently. The second approach focuses on the DDB performance degradation caused by the communication costs of remote query access and data retrieval. An efficient data allocation approach can be used to optimize this. In this approach, queries are retrieved flexibly through sites that can be accessed at low cost. The Chicken Swarm Optimization (CSO) algorithm is used for this purpose. This algorithm turns the Data Allocation Problem (DAP) into an optimization problem of selecting the appropriate sites with minimal communication cost. The CSO algorithm then selects the most suitable site for each data fragment. It does so without creating unnecessary overhead or causing data route diversions. 
In this way distributed database design improves overall, and quality replication follows. The third approach addresses optimal replica selection and placement. First, the snapshot replication and merge replication processes for suitable databases are illustrated. The MGSO approach is used to select the location and number of replicas to be placed in the network. This approach uses the random patterns of read-write requests for the dynamic window mechanism of replication, while also modelling the replication problem as a multi-objective optimization problem solved with MGSO. The proposed techniques were evaluated in a Hadoop cluster environment using dedicated "master-slave" machines. The evaluations were carried out on a large dataset from three main sources. These sources are Twitter, Facebook and YouTube, and they contain text, audio and video data of varying sizes. The evaluation and comparison results show that the techniques proposed in this research thesis perform better than the fragmentation, allocation and replication techniques they were compared with. It can therefore be said that this work greatly strengthens DDB design by solving the problems of data fragmentation, data allocation and data replication.
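
The abstract contrasts IHAC with traditional hierarchical agglomerative clustering, which builds its data representation matrix from counts or frequencies. The abstract gives no formulas, so the following is only a minimal sketch of that frequency-based baseline and not the thesis's IHAC algorithm. It assumes a hypothetical attribute-usage matrix `usage`, hypothetical query frequencies `freq`, an affinity measure based on co-access frequency, and average linkage; all of these are illustrative choices rather than details taken from the thesis.

```python
def affinity_matrix(usage, freq):
    """aff[i][j] = total frequency of queries that touch both attribute i and attribute j.

    Assumed similarity measure for illustration; the thesis may use a different one.
    """
    n = len(usage[0])
    aff = [[0] * n for _ in range(n)]
    for row, f in zip(usage, freq):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aff[i][j] += f
    return aff


def average_link(a, b, aff):
    """Average affinity between every attribute in cluster a and every attribute in cluster b."""
    return sum(aff[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerate(aff, k):
    """Merge the two most similar clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(aff))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_link(clusters[x], clusters[y], aff)
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical toy workload: 3 queries over 4 attributes.
usage = [
    [1, 1, 0, 0],  # query 0 reads attributes 0 and 1
    [0, 0, 1, 1],  # query 1 reads attributes 2 and 3
    [1, 1, 1, 0],  # query 2 reads attributes 0, 1 and 2
]
freq = [30, 20, 5]  # how often each query runs

fragments = agglomerate(affinity_matrix(usage, freq), k=2)
print(fragments)
```

Each resulting cluster is a candidate fragment: attributes that are frequently queried together end up stored together.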

The development of database systems technology has created its own theoretical foundations and enabled a large number of applications. Similarly, the growth of computer networks has made it possible to connect multiple computers for the exchange of data and resources. The role of the centralized Data Management System and its concurrent accessibility to multiple users has made it impossible for data beneficiaries to concentrate data at one large mainframe site. Increased network traffic and reduced efficiency have forced data to be split across many sites, with each location having its own storage and local processing abilities. This led to the development of Distributed Databases (DDBs), which play a significant role today, when dependence on reliable and accurate data has become a necessity. Innovations in hardware, software, protocols, storage and networks have changed business requirements by making the handling of DDBs a feasible and operational decision. The strength of distributed databases lies in their ability to deliver interconnected data from any physically separated site to any other site. A Distributed Database Management System (DDBMS) belongs to the class of system software that manages a distributed database and offers transparent access to multiple users across multiple sites by integrating parallelism and modularity. Though efficient, DDB design has many practical limitations in selecting efficient methods for the fragmentation, allocation and replication of data. This research thesis focuses on developing efficient solutions for these DDB design issues. The main aim of this thesis is to propose powerful schemes for data fragmentation, allocation and replication that enhance query processing in DDBs for better performance. 
The first approach concentrates on the limitations of using observed query data to decide fragmentation, which is ineffective at the preliminary distributed database design stage, where efficiency is estimated only through proper design and the network communication cost between sites. To resolve this issue, we present an improved hierarchical agglomerative clustering (IHAC) algorithm to derive a semantic fragmentation of distributed databases. IHAC constructs the data representation matrix by considering all data objects instead of data counts, whereas the traditional hierarchical agglomerative clustering algorithm constructs the data representation matrix from the data count or frequency in order to select and compute similarity measures. This improves the clustering of data objects, and hence data fragmentation can be achieved efficiently. The second approach focuses on the performance degradation in DDBs caused by the communication cost of remote query access and data retrieval. This can be optimized through an efficient data allocation approach that allows a query to be served flexibly by sites that are accessible at low cost. For this process, the Chicken Swarm Optimization (CSO) algorithm is used; it casts the Data Allocation Problem (DAP) as an optimization problem of choosing appropriate sites with minimal communication cost for the data fragments. The CSO algorithm then chooses the sites for each of the data fragments without creating much overhead or data route diversions. This enhances the overall distributed database design and subsequently ensures quality replication. The third approach considers the issue of optimal replica selection and placement. First, the snapshot replication and merge replication processes for suitable databases are illustrated. Second, the MGSO approach is employed to select the location and number of replicas for placement in the network. 
This approach uses the random patterns of read-write requests for the dynamic window mechanism for replication, while also modelling the replication problem as a multi-objective optimization problem that is solved using MGSO. The proposed techniques are evaluated in a Hadoop cluster environment using dedicated master-slave machines. The evaluation study was performed over a large dataset from three major sources, namely Twitter, Facebook and YouTube, containing various types of data such as text, audio and video files of varying sizes. The evaluation and comparison show that the techniques proposed in this research thesis perform better than the compared fragmentation, allocation and replication techniques. The results therefore indicate that this work significantly enhances the design performance of DDBs by solving the problems of data fragmentation, allocation and replication.
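
The Data Allocation Problem described in the second approach can be illustrated with a small cost model. The sketch below is not the thesis's CSO algorithm: it replaces CSO with an exhaustive search, which is practical only for toy instances, and it assumes a hypothetical cost function (access frequency times link cost times fragment size) together with a per-site storage capacity. The names `access_freq`, `link_cost`, `frag_size` and `capacity` are illustrative assumptions.

```python
from itertools import product


def allocation_cost(assignment, access_freq, link_cost, frag_size):
    """Total communication cost when fragment f is stored at site assignment[f].

    Assumed cost model: each query from a site pays link_cost to the fragment's
    home site, scaled by the fragment size.
    """
    total = 0
    for f, home in enumerate(assignment):
        for site, freq in enumerate(access_freq[f]):
            total += freq * link_cost[site][home] * frag_size[f]
    return total


def best_allocation(access_freq, link_cost, frag_size, capacity):
    """Exhaustive search (a stand-in for CSO) over capacity-feasible assignments."""
    n_sites = len(link_cost)
    n_frags = len(frag_size)
    best, best_cost = None, float("inf")
    for assignment in product(range(n_sites), repeat=n_frags):
        used = [0] * n_sites
        for f, home in enumerate(assignment):
            used[home] += frag_size[f]
        if any(u > c for u, c in zip(used, capacity)):
            continue  # skip assignments that overfill a site
        cost = allocation_cost(assignment, access_freq, link_cost, frag_size)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost


# Hypothetical toy instance: 3 fragments, 3 sites.
access_freq = [
    [10, 1, 0],  # fragment 0: queries issued per site
    [0, 8, 2],   # fragment 1
    [9, 0, 4],   # fragment 2
]
link_cost = [
    [0, 4, 6],
    [4, 0, 3],
    [6, 3, 0],
]
frag_size = [2, 1, 2]
capacity = [2, 2, 2]

assignment, cost = best_allocation(access_freq, link_cost, frag_size, capacity)
print(assignment, cost)
```

In this toy instance the capacity limit prevents fragments 0 and 2 from sharing site 0, even though that placement would be cheapest without the limit. The search space grows as the number of sites raised to the number of fragments, which is why a metaheuristic such as CSO is attractive for realistic sizes.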