During the past few decades, we have seen several major innovations in the field of distributed
computing systems, which have resulted in significant advances in the capabilities
of such systems. Initially, around the late 1970s, the increasing complexity of
workloads led to the invention of clusters, which comprise multiple machines connected
over a local area network. Later, in the 1990s, grid computing was introduced to give users
on-demand access to large amounts of resources from different administrative domains,
similar to public utilities, and since then various grids have been deployed all around
the world. More recently, cloud computing has emerged as a new large-scale distributed
computing paradigm in which service providers rent their infrastructures, services, and platforms
to their clients on demand.
With the increasing adoption of distributed systems in both academia and industry, and
with the increasing computational and storage requirements of distributed applications,
users inevitably demand more from these systems. Moreover, users also depend on these
systems for latency- and throughput-sensitive applications, such as interactive perception
applications and MapReduce applications, which makes the performance of these systems
even more important. Therefore, it is very important for users that distributed systems
provide consistent performance, that is, that the system provides a similar level of performance
at all times.
In this thesis we address the problem of understanding and improving the performance
consistency of state-of-the-art distributed computing systems. To this end, we take
an empirical approach: we investigate various resource management, scheduling, and
statistical modeling techniques through real-system experiments in diverse distributed systems,
such as clusters, multi-cluster grids, and clouds, using various types of workloads,
such as Bags-of-Tasks (BoTs), interactive perception applications, and scientific workloads.
In addition, as failures are known to be an important source of significant performance
inconsistency, we also provide fundamental insights into the characteristics of
failures in distributed systems, which is required to design systems that can mitigate the
impact of failures on performance consistency.
In Chapter 1 of this thesis we present the performance consistency problem in distributed
computing systems, and we describe why this problem is challenging in such systems.
In Chapter 2, we assess the benefit of overprovisioning for the performance consistency
of BoTs in multi-cluster grids. Overprovisioning increases the capacity of a system by a
factor, which we define as the overprovisioning factor, to better handle fluctuations in the
workload and to provide consistent performance even under unexpected user demands.
Through simulations with realistic workload models, we explore various overprovisioning
strategies with different overprovisioning factors and different scheduling policies. We
find that beyond a certain value of the overprovisioning factor, performance consistency
improves only slightly while costs increase significantly. We also find that by dynamically
tuning the overprovisioning factor, we can significantly increase the number of BoTs that
have a makespan within a user-specified range, thus improving the performance consistency.
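As an illustration of the idea only, the following Python sketch shows how an overprovisioning factor could scale system capacity and how the fraction of BoTs whose makespan falls within a user-specified range could be measured; the linear capacity and makespan model, the example BoTs, and the chosen range are simplifying assumptions, not the simulator or workload model used in this chapter.

```python
# Minimal sketch (not the Chapter 2 simulator): how an overprovisioning
# factor could scale capacity and how performance consistency of BoTs
# might be measured. The linear capacity/makespan model is an assumption.

def overprovisioned_capacity(base_processors: int, factor: float) -> int:
    """Scale the number of processors by the overprovisioning factor."""
    return int(base_processors * factor)

def estimated_makespan(bot_task_runtimes: list[float], processors: int) -> float:
    """Crude estimate: total work spread evenly over all processors."""
    return sum(bot_task_runtimes) / processors

def fraction_within_range(makespans: list[float], lo: float, hi: float) -> float:
    """Fraction of BoTs whose makespan lies within the user-specified range."""
    hits = sum(1 for m in makespans if lo <= m <= hi)
    return hits / len(makespans) if makespans else 0.0

# Example: the same (hypothetical) workload on a base system and on systems
# overprovisioned by factors of 1.5 and 2.0.
bots = [[10.0] * 100, [5.0] * 400, [20.0] * 50]
base = 32
for factor in (1.0, 1.5, 2.0):
    procs = overprovisioned_capacity(base, factor)
    makespans = [estimated_makespan(b, procs) for b in bots]
    print(factor, fraction_within_range(makespans, 0.0, 40.0))
```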
In Chapter 3, we evaluate the performance of throttling-based overload control techniques
in multi-cluster grids, motivated by our DAS-3 multi-cluster grid, where running
hundreds of tasks concurrently leads to overloads on the cluster head-nodes. We find that
throttling decreases (in most cases) or at least preserves the makespan
of bursty workloads while significantly improving the tail behavior of the application performance,
which leads to better performance consistency and reduces the overload of the
head-nodes. Our results also show that our adaptive throttling technique significantly improves
the application performance and the system responsiveness compared with
the hand-tuned multi-cluster system without throttling.
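The sketch below illustrates, in simplified form, the kind of throttling-based overload control discussed in this chapter: a concurrency limit caps the number of tasks submitted to a head-node at once, and an adaptive variant adjusts that limit based on an observed load signal. The thresholds, the step sizes, and the load metric are illustrative assumptions, not the exact policy evaluated in Chapter 3.

```python
# Simplified sketch of throttling-based overload control (not the exact
# Chapter 3 policy): cap concurrent submissions to a head-node and adapt
# the cap to an observed load signal. All thresholds are illustrative.
import threading

class AdaptiveThrottle:
    def __init__(self, initial_limit: int = 50, min_limit: int = 10, max_limit: int = 500):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0
        self.cond = threading.Condition()

    def acquire(self):
        """Block until a submission slot is available."""
        with self.cond:
            while self.in_flight >= self.limit:
                self.cond.wait()
            self.in_flight += 1

    def release(self):
        """Free a submission slot after a task has been handed to the head-node."""
        with self.cond:
            self.in_flight -= 1
            self.cond.notify_all()

    def adapt(self, head_node_load: float, high: float = 0.8, low: float = 0.3):
        """Shrink the limit when the head-node is overloaded, grow it when idle."""
        with self.cond:
            if head_node_load > high:
                self.limit = max(self.min_limit, self.limit // 2)
            elif head_node_load < low:
                self.limit = min(self.max_limit, self.limit + 10)
            self.cond.notify_all()
```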
In Chapter 4, we address the problem of incrementally placing interactive perception
applications on clusters of machines to provide a responsive user experience.
These applications require low latency and, ideally, no latency spikes at all; frequent
migrations of the application components can introduce such spikes, which degrade
the quality of the user experience. We design and evaluate four incremental placement
heuristics that cover a broad range of trade-offs between computational complexity, churn in the
placement, and the resulting improvement in latency. Through simulations and real-system
experiments on the Open Cirrus testbed, we find that it is worth adjusting the schedule
using our heuristics after a perturbation to the system or the workload, and that our heuristics
can approach the improvements achieved by completely rerunning a static placement
algorithm, but with significantly less churn.
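To make the trade-off between placement quality and churn concrete, the sketch below gives one possible greedy incremental placement heuristic: it migrates only the components whose estimated latency exceeds a threshold, which limits churn compared with recomputing the entire placement. It is not one of the four heuristics of Chapter 4; the latency model and the threshold are assumptions made for illustration.

```python
# Illustrative greedy incremental placement (not one of the four Chapter 4
# heuristics): migrate only components whose latency on their current
# machine exceeds a threshold, limiting churn in the placement.

def incremental_place(placement: dict[str, str],
                      latency: dict[tuple[str, str], float],
                      machines: list[str],
                      threshold: float) -> tuple[dict[str, str], int]:
    """Return an adjusted placement and the number of migrations (churn)."""
    new_placement = dict(placement)
    migrations = 0
    for component, machine in placement.items():
        if latency[(component, machine)] <= threshold:
            continue  # good enough; avoid a migration (and a latency spike)
        # Move the component to the machine with the lowest estimated latency.
        best = min(machines, key=lambda m: latency[(component, m)])
        if best != machine:
            new_placement[component] = best
            migrations += 1
    return new_placement, migrations
```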
In Chapter 5, using various well-known benchmarks, such as LMbench, Bonnie,
CacheBench, and the HPC Challenge Benchmark, we conduct a comprehensive performance
study of four public clouds, including Amazon EC2, one of the largest
production clouds. Notably, we find that the compute performance of the tested clouds is
low. Furthermore, we perform a preliminary assessment of the performance consistency
of these clouds, and we find noticeable performance variability for
some of the cloud resource types we have explored, which motivates us to explore the performance variability of clouds in depth in the next chapter. Finally, we compare the
performance and cost of clouds with those of scientific computing alternatives, such as
grids and parallel production infrastructures. Our results show that while current cloud
computing services are insufficient for scientific computing at scale, they may still be a
good alternative for scientists who need resources instantly and temporarily.
In Chapter 6, we explore the performance variability of production cloud services
using year-long traces that comprise performance data for two popular cloud services:
Amazon Web Services and Google App Engine. We find that the performance of the
investigated cloud services exhibits, on the one hand, yearly and daily patterns, and, on
the other hand, periods of stable performance. We also find that many of these services
exhibit high variation in their monthly median values, which indicates large performance
variability over time. In addition, through trace-based simulations of different large-scale
distributed applications, we find that the impact of this performance variability differs significantly
across different application types.
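A trace-based simulation of this kind can be sketched as follows: the task service times of an application are scaled by a time-varying performance factor derived from monthly medians in a performance trace. The trace values, the workload, and the scaling model below are placeholders for illustration, not the traces or applications used in Chapter 6.

```python
# Sketch of a trace-based simulation (placeholder data, not the Chapter 6
# setup): scale task service times by a time-varying performance factor
# derived from monthly medians of a cloud-service performance trace.

# Hypothetical monthly median response times of a cloud service, normalized
# so that 1.0 is the long-term median; higher means slower.
monthly_factor = [1.0, 1.1, 0.9, 1.4, 1.0, 1.2, 0.95, 1.0, 1.3, 1.1, 1.0, 1.05]

def simulate_total_runtime(task_runtimes: list[float], start_month: int) -> float:
    """Replay tasks sequentially, slowing each one down by the factor
    of the month in which it (notionally) runs."""
    total = 0.0
    for i, base in enumerate(task_runtimes):
        factor = monthly_factor[(start_month + i) % len(monthly_factor)]
        total += base * factor
    return total

# The same workload started in different months finishes at noticeably
# different times, which is one way the impact of variability shows up.
workload = [3600.0] * 12  # twelve one-hour tasks (hypothetical)
print([round(simulate_total_runtime(workload, m) / 3600, 2) for m in range(12)])
```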
In Chapter 7, we develop a statistical model for space-correlated failures, that is,
failures that occur within a short time period across different system components, using
fifteen data sets from the Failure Trace Archive, an online public repository of
availability traces taken from diverse parallel and distributed systems. In our failure model
we consider three aspects of failure events: the group arrival process, the group size, and
the downtime caused by the group of failures. We find that the lognormal distribution
provides a good fit for these parameters. Notably, we also find that for seven out of the
fifteen traces we investigate, space-correlated failures are a major cause of system
downtime. Therefore, these seven traces are better represented by our model than by
traditional models, which assume that the failures of the individual components of the
system are independent and identically distributed.
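The chapter reports that the lognormal distribution fits the group inter-arrival times, group sizes, and group downtimes well; a minimal sketch of such a fit, using SciPy and synthetic data in place of the Failure Trace Archive traces, is shown below. The distribution parameters used to generate the synthetic samples are arbitrary.

```python
# Minimal sketch of fitting a lognormal distribution to failure-model
# parameters (group inter-arrival time, group size, group downtime).
# Synthetic data stands in for the Failure Trace Archive traces.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
samples = {
    "group_interarrival_s": rng.lognormal(mean=7.0, sigma=1.2, size=1000),
    "group_size":           rng.lognormal(mean=1.0, sigma=0.8, size=1000),
    "group_downtime_s":     rng.lognormal(mean=8.0, sigma=1.5, size=1000),
}

for name, data in samples.items():
    # Fit a lognormal with the location parameter fixed at zero.
    shape, loc, scale = stats.lognorm.fit(data, floc=0)
    # Goodness of fit via a Kolmogorov-Smirnov test against the fitted model.
    ks_stat, p_value = stats.kstest(data, "lognorm", args=(shape, loc, scale))
    print(f"{name}: sigma={shape:.2f}, scale={scale:.1f}, KS p-value={p_value:.3f}")
```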
In Chapter 8, we investigate the time-varying behavior of failure events in diverse
large-scale distributed systems using nineteen data sets in the Failure Trace Archive. We
find that for most of the studied systems the failure rates are highly variable, and that
failures exhibit strong periodic behavior and time correlations. Moreover, to characterize
the peaks in the failure rate we develop a model that considers four parameters: the peak
duration, the failure inter-arrival time during peaks, the time between peaks, and the failure
duration during peaks. We find that the peak failure periods explained by our model
are responsible for a significant portion of the system downtime, suggesting that failure
peaks deserve special attention when designing fault-tolerant distributed systems. We
believe that our failure models can be used for predictive scheduling and resource management
decisions, which can help to mitigate the impact of failures on the performance
variability in distributed systems.
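One simple way to extract peak-model parameters from a failure trace is sketched below: hourly failure counts are thresholded to find peak periods, from which the peak durations and the time between peaks are derived. The threshold rule and the synthetic trace are assumptions for illustration, not the exact methodology of Chapter 8.

```python
# Sketch of extracting peak-model parameters from a failure trace:
# hourly failure counts above a threshold mark peak periods. The
# threshold rule and synthetic timestamps are illustrative assumptions.
import numpy as np

def peak_parameters(failure_times_s: np.ndarray, bin_s: int = 3600, k: float = 2.0):
    """Return per-peak durations and the gaps between consecutive peaks."""
    bins = np.arange(0, failure_times_s.max() + bin_s, bin_s)
    counts, _ = np.histogram(failure_times_s, bins=bins)
    threshold = counts.mean() + k * counts.std()  # assumed peak threshold
    in_peak = counts > threshold

    durations, gaps = [], []
    start, last_end = None, None
    for i, flag in enumerate(in_peak):
        if flag and start is None:
            start = i
            if last_end is not None:
                gaps.append((i - last_end) * bin_s)  # time between peaks
        elif not flag and start is not None:
            durations.append((i - start) * bin_s)    # peak duration
            last_end, start = i, None
    return durations, gaps

# Synthetic trace: background failures plus two bursts of correlated failures.
rng = np.random.default_rng(0)
background = rng.uniform(0, 7 * 24 * 3600, size=200)
bursts = np.concatenate([2e5 + rng.uniform(0, 7200, 80), 5e5 + rng.uniform(0, 3600, 60)])
durations, gaps = peak_parameters(np.sort(np.concatenate([background, bursts])))
print("peak durations (h):", [d / 3600 for d in durations])
print("time between peaks (h):", [g / 3600 for g in gaps])
```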
Finally, in Chapter 9, we present the conclusions of this thesis and outline
several interesting directions for future research. Using various workloads and distributed computing systems, we show empirically how the performance consistency
of such systems can be improved. Moreover, this thesis also provides a fundamental understanding
of the characteristics of failures in distributed systems, which is required to design systems
that can mitigate the impact of failures on performance consistency. A particularly
important extension of our work is to investigate how to improve the performance
consistency of commercial cloud computing infrastructures. We believe that the research
presented in this thesis has already taken initial steps in this direction.