Intelligent Resource Scheduling at Scale: A Machine Learning Perspective

Title: Intelligent Resource Scheduling at Scale: A Machine Learning Perspective
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Yang, R., Ouyang, X., Chen, Y., Townend, P., Xu, J.
Conference Name: 2018 IEEE Symposium on Service-Oriented System Engineering (SOSE)
Date Published: March 2018
ISBN Number: 978-1-5386-5207-7
Keywords: ad-hoc heuristics, cloud computing, cloud-scale, Collaboration, composability, data centers, exhibited heterogeneity, Human Behavior, human factors, intelligent resource scheduling, Internet-scale Computing Security, Internet-scale systems, large-scale resource scheduling, Large-scale systems, learning (artificial intelligence), machine learning, Metrics, ML, multidimensional resource requirements, nonfunctional constraints, performance-centric node classification, Policy Based Governance, Processor scheduling, pubcrawl, quality of service, resilience, Resiliency, resource allocation, Resource management, Resource Scheduling, Scalability, scheduling, server characteristics, Servers, straggler, straggler mitigation, Task Analysis, workload

Resource scheduling in a computing system addresses the problem of packing tasks with multi-dimensional resource requirements and non-functional constraints. The exhibited heterogeneity of workload and server characteristics in Cloud-scale or Internet-scale systems adds further complexity and new challenges to the problem. Compared with existing solutions based on ad-hoc heuristics, Machine Learning (ML) has the potential to further improve the efficiency of resource management in large-scale systems. In this paper we describe and discuss how ML could be used to automatically understand both workloads and environments, and to help cope with scheduling-related challenges such as consolidating co-located workloads, handling resource requests, guaranteeing application QoS, and mitigating tailed stragglers. We introduce a generalized ML-based solution to large-scale resource scheduling and demonstrate its effectiveness through a case study that deals with performance-centric node classification and straggler mitigation. We believe that an ML-based method will help to achieve architectural optimization and efficiency improvement.
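
To make the packing formulation in the abstract concrete, the following is a minimal sketch of a first-fit placement over multi-dimensional resource demands. It is not the paper's ML-based method. It is the kind of simple heuristic baseline that the abstract contrasts ML against. The resource names (`cpu`, `mem_gb`) and the function names `fits` and `first_fit` are assumptions introduced only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical resource vector, e.g. {"cpu": 2.0, "mem_gb": 4.0}
Resources = Dict[str, float]


def fits(demand: Resources, free: Resources) -> bool:
    """Return True if every requested dimension fits in the free capacity."""
    return all(free.get(dim, 0.0) >= amount for dim, amount in demand.items())


def first_fit(tasks: List[Resources], nodes: List[Resources]) -> List[Optional[int]]:
    """Assign each task to the first node with room in all dimensions.

    Returns one node index per task, or None when no node can host the task.
    """
    # Copy capacities so the caller's node descriptions are left untouched.
    free = [dict(node) for node in nodes]
    placement: List[Optional[int]] = []
    for demand in tasks:
        chosen = None
        for index, capacity in enumerate(free):
            if fits(demand, capacity):
                # Reserve the requested amount in every dimension.
                for dim, amount in demand.items():
                    capacity[dim] -= amount
                chosen = index
                break
        placement.append(chosen)
    return placement
```

A call such as `first_fit(tasks, nodes)` returns a list of node indices in task order. A task is rejected as soon as any single dimension is exhausted on every node, which is what makes the multi-dimensional case harder than packing on one resource alone.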

Citation Key: yang_intelligent_2018