eCommons

 

EFFICIENT RESOURCE MANAGEMENT OF CLOUD NATIVE SYSTEMS

Abstract

Cloud native architecture has become a prevailing trend and is widely adopted by major online service providers including Netflix, Uber, and WeChat. It enables applications to be structured as loosely coupled distributed systems whose components can be developed and managed independently, and it provides different programming models, namely microservices and serverless, to accommodate different user requirements. Specifically, microservices are a group of small services that collectively perform as a complete application. Each microservice implements a web server that handles specific business logic, and is usually packaged in a container that encapsulates its own runtime and dependencies. Microservice containers typically live for a long time and scale up or down to cope with load fluctuations according to user-specified policies. Serverless provides a further simplified approach to application development and deployment: it allows users to upload their application code as functions through an event-driven interface, without explicitly provisioning or managing containers. Serverless containers are typically short-lived 'one-off' containers that handle a single request at a time. Serverless billing is fine-grained, and users pay only for the resources consumed by actual function execution. Despite the popularity of cloud native systems, managing their resources efficiently is challenging. Cloud native applications consist of many component services with diverse resource requirements, posing a greater challenge than traditional monolithic applications. Furthermore, the backpressure effect caused by inter-service connections complicates resource management. Lastly, although cloud native relieves users of the burden of infrastructure management, cloud providers still need to provision and pay for the infrastructure that hosts cloud native applications, which incurs high cost.
This dissertation tackles the challenge of efficient resource management for cloud native systems and proposes three resource managers. First, we present \textbf{Sinan}, a machine learning (ML)-driven and service level agreement (SLA)-aware resource manager for microservices. Sinan uses a set of validated ML models to learn per-service resource requirements, taking into account the effects of inter-service dependencies. Sinan's ML models predict the end-to-end latency of a given resource allocation, and the resource manager then chooses, based on these predictions, the optimal allocation that preserves the SLAs. Sinan highlights the importance of a balanced training dataset, with comparable shares of SLA violations and satisfactions, for the effectiveness of the ML models: the system performs poorly if the training dataset is dominated by either SLA satisfactions or violations. To obtain a balanced training dataset, Sinan explores different resource allocations with an algorithm inspired by the multi-armed bandit (MAB) problem. Although Sinan outperforms traditional approaches such as autoscaling, it requires a lengthy exploration process and triggers a large number of SLA violations, hindering its practicality. Furthermore, the ML models are on the critical path of resource management decisions, limiting the speed and scalability of the system. To address these limitations, we further propose \textbf{Ursa}, a lightweight and scalable resource management framework for microservices. By investigating backpressure-free conditions, Ursa allocates resources within the space in which each service can be treated as independent for the purpose of resource allocation. Ursa then uses an analytical model that decomposes the end-to-end latency into per-service latencies, and maps each per-service latency to an individually checkable resource allocation threshold.
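The decomposition idea above can be illustrated with a minimal sketch: split the end-to-end SLA into per-service latency budgets along a request path, then size each service against its own budget in isolation. The budget split, service names, and latency models below are illustrative assumptions, not Ursa's actual analytical model.

```python
def split_sla(sla_ms, services):
    """Divide the end-to-end SLA equally among services on the path (illustrative split)."""
    per_service = sla_ms / len(services)
    return {s: per_service for s in services}

def min_cores(latency_fn, budget_ms, max_cores=64):
    """Smallest core count whose modeled latency fits within the per-service budget."""
    for cores in range(1, max_cores + 1):
        if latency_fn(cores) <= budget_ms:
            return cores
    raise ValueError("budget unsatisfiable within max_cores")

# Hypothetical per-service latency models: latency shrinks as cores increase.
models = {
    "frontend": lambda c: 20.0 / c,
    "cart":     lambda c: 40.0 / c,
    "payment":  lambda c: 10.0 / c,
}

# 30 ms end-to-end SLA -> 10 ms budget per service; each service is then
# sized independently against its own threshold.
budgets = split_sla(30.0, list(models))
allocation = {s: min_cores(models[s], budgets[s]) for s in models}
```

Because each service is checked only against its own threshold, allocations can be evaluated per service rather than by re-measuring end-to-end latency for every candidate configuration.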
To speed up the exploration process, Ursa explores as many independent microservices as possible across different request paths, and swiftly stops exploration upon SLA violations. Finally, to reduce the infrastructure provisioning cost of cloud native systems, we propose leveraging harvested resources in datacenters, which cloud providers offer at a massive discount. Orthogonal to the first two parts of the thesis, which reduce operating cost by providing the minimum amount of resources that does not compromise performance, this part achieves cost reduction by using cheaper but less reliable resources. We target serverless workloads and propose running serverless platforms on low-priority Harvest VMs, which grow and shrink to harvest all the unallocated CPU cores in their host servers. We quantify the challenges of running serverless on Harvest VMs by characterizing serverless workloads and Harvest VMs in production. We propose a series of policies that use a mix of Harvest and regular VMs with different tradeoffs between reliability and efficiency, and we design a serverless load balancer that is aware of VM evictions and of resource variations in Harvest VMs. Our results show that adopting harvested resources improves efficiency and reduces cost significantly, while the request failure rate caused by Harvest VM evictions is marginal.
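A minimal sketch of the eviction- and variation-aware routing idea: skip VMs that have received an eviction notice, and weight the remaining Harvest VMs by the CPU cores they currently hold, so larger VMs absorb proportionally more requests. The class, fields, and weighting scheme are illustrative assumptions, not the dissertation's actual load-balancing policy.

```python
import random

class HarvestVM:
    def __init__(self, name, cores, eviction_notice=False):
        self.name = name
        self.cores = cores                      # cores currently harvested (varies over time)
        self.eviction_notice = eviction_notice  # host is about to reclaim this VM

def pick_vm(vms, rng=random):
    """Route a request to a healthy Harvest VM, weighted by its current cores."""
    # Only consider VMs that are not about to be evicted and have capacity.
    candidates = [vm for vm in vms if not vm.eviction_notice and vm.cores > 0]
    if not candidates:
        raise RuntimeError("no healthy VM available")
    # Sample proportionally to available cores.
    total = sum(vm.cores for vm in candidates)
    r = rng.uniform(0, total)
    for vm in candidates:
        r -= vm.cores
        if r <= 0:
            return vm
    return candidates[-1]
```

Re-reading each VM's `cores` on every pick lets the balancer track resource variations as the host grows or shrinks a Harvest VM, while the eviction filter drains traffic away from VMs that are about to disappear.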

Date Issued

2023-05

Committee Chair

Delimitrou, Christina

Committee Member

Alvisi, Lorenzo
Suh, Gookwon Edward

Degree Discipline

Electrical and Computer Engineering

Degree Name

Ph.D., Electrical and Computer Engineering

Degree Level

Doctor of Philosophy

Types

dissertation or thesis
