During the past few decades, we have seen several major innovations in the field of distributed
computing systems, which have resulted in significant advances in the capabilities
of such systems. Initially, around the late 1970s, the increasing complexity of
workloads led to the invention of clusters, which comprise multiple machines connected
over a local area network. Later, in the 1990s, grid computing was introduced to give users
on-demand access to large amounts of resources from different administrative domains,
similar to public utilities, and since then various grids have been deployed all around
the world. More recently, cloud computing has emerged as a new large-scale distributed
computing paradigm in which service providers rent their infrastructures, services, and platforms
to their clients on demand.
With the increasing adoption of distributed systems in both academia and industry, and
with the increasing computational and storage requirements of distributed applications,
users inevitably demand more from these systems. Moreover, users also depend on these
systems for latency- and throughput-sensitive applications, such as interactive perception
applications and MapReduce applications, which makes the performance of these systems
even more important. Therefore, it is very important for users that distributed systems
provide consistent performance, that is, that the system provides a similar level of performance
at all times.
In this thesis we address the problem of understanding and improving the performance
consistency of state-of-the-art distributed computing systems. To this end, we take
an empirical approach: we investigate various resource management, scheduling, and
statistical modeling techniques through real-system experiments in diverse distributed systems,
such as clusters, multi-cluster grids, and clouds, using various types of workloads,
such as Bags-of-Tasks (BoTs), interactive perception applications, and scientific workloads.
In addition, as failures are known to be an important source of significant performance
inconsistency, we also provide fundamental insights into the characteristics of
failures in distributed systems, which is required to design systems that can mitigate the
impact of failures on performance consistency.
In Chapter 1 of this thesis we present the performance consistency problem in distributed
computing systems, and we describe why this problem is challenging in such systems.
In Chapter 2, we assess the benefit of overprovisioning for the performance consistency
of BoTs in multi-cluster grids. Overprovisioning increases the capacity of a system by a
factor, which we define as the overprovisioning factor, to better handle fluctuations in the
workload and to provide consistent performance even under unexpected user demands.
Through simulations with realistic workload models, we explore various overprovisioning
strategies with different overprovisioning factors and different scheduling policies. We
find that beyond a certain value of the overprovisioning factor, performance consistency
improves only slightly while costs increase significantly. We also find that by dynamically
tuning the overprovisioning factor, we can significantly increase the number of BoTs that
have a makespan within a user-specified range, thus improving the performance consistency.
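As an illustration of the idea only, the following Python sketch shows how an overprovisioning factor could scale system capacity and how the fraction of BoTs whose makespan falls within a user-specified range could be measured; the linear capacity and makespan model, the example BoTs, and the chosen range are simplifying assumptions, not the simulator or workload model used in this chapter.

```python
# Minimal sketch (not the Chapter 2 simulator): how an overprovisioning
# factor could scale capacity and how performance consistency of BoTs
# might be measured. The linear capacity/makespan model is an assumption.

def overprovisioned_capacity(base_processors: int, factor: float) -> int:
    """Scale the number of processors by the overprovisioning factor."""
    return int(base_processors * factor)

def estimated_makespan(bot_task_runtimes: list[float], processors: int) -> float:
    """Crude estimate: total work spread evenly over all processors."""
    return sum(bot_task_runtimes) / processors

def fraction_within_range(makespans: list[float], lo: float, hi: float) -> float:
    """Fraction of BoTs whose makespan lies within the user-specified range."""
    hits = sum(1 for m in makespans if lo <= m <= hi)
    return hits / len(makespans) if makespans else 0.0

# Example: the same (hypothetical) workload on a base system and on systems
# overprovisioned by factors of 1.5 and 2.0.
bots = [[10.0] * 100, [5.0] * 400, [20.0] * 50]
base = 32
for factor in (1.0, 1.5, 2.0):
    procs = overprovisioned_capacity(base, factor)
    makespans = [estimated_makespan(b, procs) for b in bots]
    print(factor, fraction_within_range(makespans, 0.0, 40.0))
```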
In Chapter 3, we evaluate the performance of throttling-based overload control techniques
in multi-cluster grids, motivated by our DAS-3 multi-cluster grid, where running
hundreds of tasks concurrently leads to overloads on the cluster head-nodes. We find that
throttling decreases (in most cases) or at least preserves the makespan
of bursty workloads while significantly improving the tail behavior of the application performance,
which leads to better performance consistency and reduces the overload of the
head-nodes. Our results also show that our adaptive throttling technique significantly improves
the application performance and the system responsiveness compared with
the hand-tuned multi-cluster system without throttling.
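The sketch below illustrates, in simplified form, the kind of throttling-based overload control discussed in this chapter: a concurrency limit caps the number of tasks submitted to a head-node at once, and an adaptive variant adjusts that limit based on an observed load signal. The thresholds, the step sizes, and the load metric are illustrative assumptions, not the exact policy evaluated in Chapter 3.

```python
# Simplified sketch of throttling-based overload control (not the exact
# Chapter 3 policy): cap concurrent submissions to a head-node and adapt
# the cap to an observed load signal. All thresholds are illustrative.
import threading

class AdaptiveThrottle:
    def __init__(self, initial_limit: int = 50, min_limit: int = 10, max_limit: int = 500):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0
        self.cond = threading.Condition()

    def acquire(self):
        """Block until a submission slot is available."""
        with self.cond:
            while self.in_flight >= self.limit:
                self.cond.wait()
            self.in_flight += 1

    def release(self):
        """Free a submission slot after a task has been handed to the head-node."""
        with self.cond:
            self.in_flight -= 1
            self.cond.notify_all()

    def adapt(self, head_node_load: float, high: float = 0.8, low: float = 0.3):
        """Shrink the limit when the head-node is overloaded, grow it when idle."""
        with self.cond:
            if head_node_load > high:
                self.limit = max(self.min_limit, self.limit // 2)
            elif head_node_load < low:
                self.limit = min(self.max_limit, self.limit + 10)
            self.cond.notify_all()
```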
In Chapter 4, we address the problem of incrementally placing interactive perception
applications on clusters of machines to provide a responsive user experience.
These applications require low latency and, ideally, no latency spikes at all; frequent
migrations of the application components can introduce such spikes, which degrade
the quality of the user experience. We design and evaluate four incremental placement
heuristics that cover a broad range of trade-offs between computational complexity, churn in the
placement, and the resulting improvement in latency. Through simulations and real-system
experiments on the Open Cirrus testbed, we find that it is worth adjusting the schedule
using our heuristics after a perturbation to the system or the workload, and that our heuristics
can approach the improvements achieved by completely rerunning a static placement
algorithm, but with significantly less churn.
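To make the trade-off between placement quality and churn concrete, the sketch below gives one possible greedy incremental placement heuristic: it migrates only the components whose estimated latency exceeds a threshold, which limits churn compared with recomputing the entire placement. It is not one of the four heuristics of Chapter 4; the latency model and the threshold are assumptions made for illustration.

```python
# Illustrative greedy incremental placement (not one of the four Chapter 4
# heuristics): migrate only components whose latency on their current
# machine exceeds a threshold, limiting churn in the placement.

def incremental_place(placement: dict[str, str],
                      latency: dict[tuple[str, str], float],
                      machines: list[str],
                      threshold: float) -> tuple[dict[str, str], int]:
    """Return an adjusted placement and the number of migrations (churn)."""
    new_placement = dict(placement)
    migrations = 0
    for component, machine in placement.items():
        if latency[(component, machine)] <= threshold:
            continue  # good enough; avoid a migration (and a latency spike)
        # Move the component to the machine with the lowest estimated latency.
        best = min(machines, key=lambda m: latency[(component, m)])
        if best != machine:
            new_placement[component] = best
            migrations += 1
    return new_placement, migrations
```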
In Chapter 5, using various well-known benchmarks, such as LMbench, Bonnie,
CacheBench, and the HPC Challenge Benchmark, we conduct a comprehensive performance
study of four public clouds, including Amazon EC2, one of the largest
production clouds. Notably, we find that the compute performance of the tested clouds is
low. Furthermore, we perform a preliminary assessment of the performance consistency
of these clouds, and we find noticeable performance variability for
some of the cloud resource types we have explored, which motivates us to explore the performance variability of clouds in depth in the next chapter. Finally, we compare the
performance and cost of clouds with those of scientific computing alternatives, such as
grids and parallel production infrastructures. Our results show that while current cloud
computing services are insufficient for scientific computing at scale, they may still be a
good alternative for scientists who need resources instantly and temporarily.
In Chapter 6, we explore the performance variability of production cloud services
using year-long traces that comprise performance data for two popular cloud services:
Amazon Web Services and Google App Engine. We find that the performance of the
investigated cloud services exhibits, on the one hand, yearly and daily patterns, and, on
the other hand, periods of stable performance. We also find that many of these services
exhibit high variation in their monthly median values, which indicates large performance
variability over time. In addition, through trace-based simulations of different large-scale
distributed applications, we find that the impact of this performance variability differs significantly
across different application types.
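A trace-based simulation of this kind can be sketched as follows: the task service times of an application are scaled by a time-varying performance factor derived from monthly medians in a performance trace. The trace values, the workload, and the scaling model below are placeholders for illustration, not the traces or applications used in Chapter 6.

```python
# Sketch of a trace-based simulation (placeholder data, not the Chapter 6
# setup): scale task service times by a time-varying performance factor
# derived from monthly medians of a cloud-service performance trace.

# Hypothetical monthly median response times of a cloud service, normalized
# so that 1.0 is the long-term median; higher means slower.
monthly_factor = [1.0, 1.1, 0.9, 1.4, 1.0, 1.2, 0.95, 1.0, 1.3, 1.1, 1.0, 1.05]

def simulate_total_runtime(task_runtimes: list[float], start_month: int) -> float:
    """Replay tasks sequentially, slowing each one down by the factor
    of the month in which it (notionally) runs."""
    total = 0.0
    for i, base in enumerate(task_runtimes):
        factor = monthly_factor[(start_month + i) % len(monthly_factor)]
        total += base * factor
    return total

# The same workload started in different months finishes at noticeably
# different times, which is one way the impact of variability shows up.
workload = [3600.0] * 12  # twelve one-hour tasks (hypothetical)
print([round(simulate_total_runtime(workload, m) / 3600, 2) for m in range(12)])
```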
In Chapter 7, we develop a statistical model for space-correlated failures, that is,
failures that occur within a short time period across different system components, using
fifteen data sets from the Failure Trace Archive, an online public repository of
availability traces taken from diverse parallel and distributed systems. In our failure model
we consider three aspects of failure events: the group arrival process, the group size, and
the downtime caused by the group of failures. We find that the lognormal distribution
provides a good fit for these parameters. Notably, we also find that for seven out of the
fifteen traces we investigate, space-correlated failures are a major cause of system
downtime. Therefore, these seven traces are better represented by our model than by
traditional models, which assume that the failures of the individual components of the
system are independent and identically distributed.
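The chapter reports that the lognormal distribution fits the group inter-arrival times, group sizes, and group downtimes well; a minimal sketch of such a fit, using SciPy and synthetic data in place of the Failure Trace Archive traces, is shown below. The distribution parameters used to generate the synthetic samples are arbitrary.

```python
# Minimal sketch of fitting a lognormal distribution to failure-model
# parameters (group inter-arrival time, group size, group downtime).
# Synthetic data stands in for the Failure Trace Archive traces.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
samples = {
    "group_interarrival_s": rng.lognormal(mean=7.0, sigma=1.2, size=1000),
    "group_size":           rng.lognormal(mean=1.0, sigma=0.8, size=1000),
    "group_downtime_s":     rng.lognormal(mean=8.0, sigma=1.5, size=1000),
}

for name, data in samples.items():
    # Fit a lognormal with the location parameter fixed at zero.
    shape, loc, scale = stats.lognorm.fit(data, floc=0)
    # Goodness of fit via a Kolmogorov-Smirnov test against the fitted model.
    ks_stat, p_value = stats.kstest(data, "lognorm", args=(shape, loc, scale))
    print(f"{name}: sigma={shape:.2f}, scale={scale:.1f}, KS p-value={p_value:.3f}")
```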
In Chapter 8, we investigate the time-varying behavior of failure events in diverse
large-scale distributed systems using nineteen data sets in the Failure Trace Archive. We
find that for most of the studied systems the failure rates are highly variable, and that
failures exhibit strong periodic behavior and time correlations. Moreover, to characterize
the peaks in the failure rate we develop a model that considers four parameters: the peak
duration, the failure inter-arrival time during peaks, the time between peaks, and the failure
duration during peaks. We find that the peak failure periods explained by our model
are responsible for a significant portion of the system downtime, suggesting that failure
peaks deserve special attention when designing fault-tolerant distributed systems. We
believe that our failure models can be used for predictive scheduling and resource management
decisions, which can help to mitigate the impact of failures on the performance
variability in distributed systems.
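One simple way to extract peak-model parameters from a failure trace is sketched below: hourly failure counts are thresholded to find peak periods, from which the peak durations and the time between peaks are derived. The threshold rule and the synthetic trace are assumptions for illustration, not the exact methodology of Chapter 8.

```python
# Sketch of extracting peak-model parameters from a failure trace:
# hourly failure counts above a threshold mark peak periods. The
# threshold rule and synthetic timestamps are illustrative assumptions.
import numpy as np

def peak_parameters(failure_times_s: np.ndarray, bin_s: int = 3600, k: float = 2.0):
    """Return per-peak durations and the gaps between consecutive peaks."""
    bins = np.arange(0, failure_times_s.max() + bin_s, bin_s)
    counts, _ = np.histogram(failure_times_s, bins=bins)
    threshold = counts.mean() + k * counts.std()  # assumed peak threshold
    in_peak = counts > threshold

    durations, gaps = [], []
    start, last_end = None, None
    for i, flag in enumerate(in_peak):
        if flag and start is None:
            start = i
            if last_end is not None:
                gaps.append((i - last_end) * bin_s)  # time between peaks
        elif not flag and start is not None:
            durations.append((i - start) * bin_s)    # peak duration
            last_end, start = i, None
    return durations, gaps

# Synthetic trace: background failures plus two bursts of correlated failures.
rng = np.random.default_rng(0)
background = rng.uniform(0, 7 * 24 * 3600, size=200)
bursts = np.concatenate([2e5 + rng.uniform(0, 7200, 80), 5e5 + rng.uniform(0, 3600, 60)])
durations, gaps = peak_parameters(np.sort(np.concatenate([background, bursts])))
print("peak durations (h):", [d / 3600 for d in durations])
print("time between peaks (h):", [g / 3600 for g in gaps])
```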
Finally, in Chapter 9, we present the conclusions of this thesis and outline
several interesting directions for future research. Using various workloads and distributed computing systems, we show empirically how the performance consistency
of such systems can be improved. Moreover, this thesis also provides a fundamental understanding
of the characteristics of failures in distributed systems, which is required to design systems
that can mitigate the impact of failures on performance consistency. A particularly
important extension of our work is to investigate how to improve the performance
consistency of commercial cloud computing infrastructures. We believe that the research
presented in this thesis has already taken initial steps in this direction.