Ulusal Tez Merkezi (National Thesis Center)

Thesis No: 774007
Title: Precise event sampling: In-depth analysis and sampling-based profiling tools for data locality / Kesin olay örnekleme: Derinlemesine analiz ve örneklemeye dayalı veri konumu için profil oluşturma araçları
Author: MUHAMMAD ADITYA SASONGKO
Advisor: Asst. Prof. Dr. DİDEM UNAT ERTEN
Location: Koç University / Graduate School of Sciences / Department of Computer Science and Engineering
Subject: Computer Engineering and Computer Science and Control
Status: Approved
Type: Doctoral thesis, in English, 2022, 156 pp.

Precise event sampling is a profiling feature of current commodity processors that samples hardware events and identifies the instructions that trigger them. With this feature, performance bottlenecks can be detected with low overhead and located in the source code. Several profiling tools have been developed to identify various performance bottlenecks. However, none of these tools can detect data movement between threads or measure data locality in multithreaded applications. In addition, although this hardware feature has been used in many profiling tools, very few studies have analyzed its accuracy and overhead. All of these studies target only the Intel architecture, and none of them evaluates the memory overhead, stability, or functionality of this hardware feature. This thesis makes three major contributions. First, an in-depth qualitative and quantitative analysis is performed on PEBS and IBS, the precise event sampling facilities of Intel and AMD, respectively. Second, ComDetective is proposed, a profiling tool that detects communication between threads, classifies it as true sharing or false sharing, and records it in communication matrices. Third, ReuseTracker is proposed, a profiling tool that measures data locality in the private and shared caches of multithreaded applications. Using precise event sampling, ComDetective and ReuseTracker profile multithreaded applications with high accuracy and low overhead. To analyze the key differences between Intel PEBS and AMD IBS, a series of carefully designed microbenchmarks was first developed.
The quantitative analysis and qualitative study carried out with these microbenchmarks show that Intel PEBS samples hardware events with higher accuracy and stability in terms of the number of samples, whereas AMD IBS records more comprehensive information per sample. In addition, both PEBS and IBS were observed to be adversely affected when sampling the same hardware event across different instructions. Furthermore, the experimental results are shown to be relevant to a full-fledged profiling tool that runs on Intel and AMD machines. Following the study of Intel PEBS and AMD IBS, ComDetective is proposed, a profiling tool that detects inter-thread communication with high accuracy, low overhead, and low running time. ComDetective detects inter-thread communication by sampling memory accesses with precise event sampling and by using hardware debug registers. In addition to detecting communication, ComDetective classifies it as true sharing or false sharing. When run on 18 different applications with a sampling interval of 500K, ComDetective's time and memory overheads were measured at only 1.30× and 1.27×, respectively. Using ComDetective, communication matrices were generated for several microbenchmarks, the PARSEC benchmark suite, and some CORAL applications, and these matrices were compared with those of their MPI counterparts. Communication bottlenecks were thereby discovered in some applications, and fixing them yielded speedups of up to 13%. In addition, ReuseTracker is proposed, which measures reuse distance, a widely used metric of data locality. Reuse distance is the number of distinct addresses accessed between two consecutive accesses (a use and a reuse) to a given memory address, and is therefore a measure of data locality.
ReuseTracker measures reuse distance by using precise event sampling together with hardware debug registers. In addition, ReuseTracker accounts for cache-coherence effects and measures reuse distance in multithreaded applications with lower overhead than other existing tools. The proposed tool incurs only 2.9× time and 2.8× memory overhead. Measured against a microbenchmark written specifically to produce user-specified reuse distances, ReuseTracker achieves 92% accuracy on average. Two scenarios show how ReuseTracker can guide code refactoring: detecting spatial reuses in shared caches that are also false sharing, and predicting whether certain applications can benefit from the adjacent cache line prefetch optimization. The tools and analyses proposed in this thesis are expected to be useful to hardware architects developing new precise event sampling features and to performance engineers tuning software performance, and to open new paths for future research on performance analysis and hardware-based profiling tools.

Precise event sampling is a profiling feature in current commodity CPUs that samples hardware events and identifies the instructions that trigger the sampled events. It makes it possible to detect performance bottlenecks with low overhead and to locate them in source code. A number of profiling tools built on this feature detect various sources of performance bottlenecks. However, none of these tools detects inter-thread data movement or measures data locality in multithreaded applications, which have become widely used due to the ubiquity of multicore architectures. Furthermore, although this hardware facility has been used in multiple profiling tools, only a few works have analyzed its accuracy and overhead. All of these works target only the facility in the Intel architecture, and none evaluates other aspects of precise event sampling such as memory overhead, stability, and functionality. In this dissertation, we make three major contributions. First, we perform the most comprehensive and in-depth qualitative and quantitative analyses to date on PEBS and IBS, the precise event sampling facilities of two major vendors, Intel and AMD, respectively. Next, we show the potential for imaginative use of precise event sampling in developing low-overhead yet accurate profiling tools for multicore systems, and we design two diagnostic tools with a particular focus on data movement, as it constitutes the main source of inefficiencies. The first of these tools is ComDetective, which detects inter-thread communications, classifies them as true sharing or false sharing, and records them in the form of communication matrices. The second is ReuseTracker, which measures data locality in the private and shared caches of multithreaded applications.
ComDetective and ReuseTracker leverage precise event sampling to profile multithreaded applications accurately and with low overheads compared to their state-of-the-art alternatives. To analyze key differences between Intel PEBS and AMD IBS, we first developed a series of carefully designed microbenchmarks. Through our qualitative analysis and quantitative study using the microbenchmarks, we found that Intel PEBS samples hardware events more accurately and with higher stability in terms of the number of samples that it captures, while AMD IBS records a richer set of information with each sample. We also discovered that both PEBS and IBS suffer from bias when sampling the same event across multiple different instructions in a code. Moreover, we show how our findings from the quantitative experiments using the microbenchmarks are relevant for a full-fledged profiling tool that runs on Intel and AMD machines. We develop ComDetective, a profiling tool that captures inter-thread communications accurately and with low runtime and memory overheads. ComDetective employs precise event sampling to sample memory accesses and utilizes hardware debug registers to detect inter-thread communications. In addition to detecting communications, ComDetective can also classify them as true or false sharing. Its time and memory overheads are only 1.30× and 1.27×, respectively, for the 18 applications studied under a 500K sampling interval. Using ComDetective, we generate insightful communication matrices from several microbenchmarks, the PARSEC benchmark suite, and some CORAL applications, and compare the produced matrices against the matrices of their MPI counterparts. We also identify communication bottlenecks in a few codes and achieve up to 13% speedup by refactoring those codes. We also design ReuseTracker, a profiling technique that measures reuse distance, a widely used metric of data locality.
Reuse distance measures data locality: it is the number of unique memory locations accessed between two consecutive accesses to a particular memory location (a use and a reuse). To measure reuse distance, ReuseTracker leverages precise event sampling to capture uses and debug registers to detect reuses. ReuseTracker can measure reuse distance in multithreaded applications, taking cache-coherence effects into account, with much lower overheads than existing tools. It introduces only 2.9× time and 2.8× memory overheads. It achieves 92% accuracy when verified against a carefully crafted configurable microbenchmark that can generate user-specified reuse distance patterns. We demonstrate two use cases: how ReuseTracker can guide code refactoring by detecting spatial reuses in shared caches that are also false sharing, and how it can predict whether certain applications can benefit from the adjacent cache line prefetch optimization. We expect that the analysis, algorithms, and tools presented in this dissertation will benefit hardware architects in designing new precise event sampling features and performance engineers in tuning the performance of their software, while also paving the way for a new generation of low-overhead profiling tools. Moreover, the outcomes of the dissertation can be used by end users (e.g., data analysts, engineers, compiler developers) to identify performance issues and improve the data locality of their software.
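
The definition of reuse distance given in the abstract can be illustrated with a small sketch. It computes exact reuse distances over a complete, single-threaded access trace. This is only an illustration of the metric and not ReuseTracker's method: the tool described above relies on sampling and debug registers and accounts for cache-coherence effects. The function name `reuse_distances` and the example trace are hypothetical.

```python
from typing import Hashable, Iterable, List, Optional


def reuse_distances(trace: Iterable[Hashable]) -> List[Optional[int]]:
    """Return the reuse distance of every access in the trace.

    The reuse distance of an access is the number of distinct locations
    touched since the previous access to the same location. The first
    access to a location has no previous use, so it is reported as None.
    """
    accesses = list(trace)
    last_seen = {}  # location -> index of its most recent access
    result: List[Optional[int]] = []
    for i, addr in enumerate(accesses):
        if addr in last_seen:
            # Accesses strictly between the use and the reuse.
            between = accesses[last_seen[addr] + 1 : i]
            result.append(len(set(between)))
        else:
            result.append(None)
        last_seen[addr] = i
    return result


if __name__ == "__main__":
    # "b" is reused after touching only "c"; "a" after touching "b" and "c".
    print(reuse_distances(["a", "b", "c", "b", "a"]))  # [None, None, None, 1, 2]
```

Scanning the window between each use and reuse is quadratic in the worst case, which is one reason exhaustive measurement is costly on real traces and sampling-based approaches are attractive.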