Sigiri: A Light-Weight Job Management System for Large Scale Systems

Eran Chinthaka, Suresh Marru, Beth Plale

e-Science applications are often compute- and data-intensive and require large-scale compute systems for execution. Large-scale systems, however, expose a variety of resource management interfaces. Grid middleware solutions abstract these heterogeneous resource managers and offer a single unified job management interface. However, Grid middleware tends to be highly complex and requires technically sophisticated system administration skills to deploy and maintain. Further, many clusters in academic settings are not part of a larger grid and must be accessed directly through non-uniform, vendor-specific resource managers. With the goal of providing simple, reliable and highly scalable uniform job management, we introduce Sigiri, a light-weight job management and abstraction service. Sigiri supports existing popular job specification languages such as JSDL and RSL. A Web Service interface is provided for easy integration with various scientific workflow systems, and each step in job submission and management is decoupled to increase scalability.

Architecture

Sigiri, designed on the principles of publish-subscribe systems, employs a decoupled architecture to improve the robustness and efficiency of the system. Job management is divided into functional blocks that are decoupled in space and time so that each operates independently of the others. Figure 1 depicts the interactions between the following key components of the Sigiri job management system.

Figure 1: Sigiri Architecture Diagram

Sigiri Web Service: A Web Services interface to the Sigiri system lets platform-independent workflow clients submit jobs to, and manage jobs on, multiple large-scale systems. Job submission is asynchronous: each request is queued and persisted, and the service responds with a unique internal job identifier. The service persists both the client handles and the resource manager handles and correlates them with each other.
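The accept-and-queue behaviour described above can be sketched as follows. This is a minimal illustration, not Sigiri's implementation: the class name `JobAcceptor`, its method names, and the in-memory queue and dictionary (standing in for the persisted queue and handle store) are all assumptions made for the example.

```python
import queue
import uuid


class JobAcceptor:
    """Hypothetical stand-in for the Web Service front end (not Sigiri's API)."""

    def __init__(self):
        # In-memory stand-in for the persisted job request queue.
        self.pending = queue.Queue()
        # Internal job id -> client handle and resource manager handle.
        self.handles = {}

    def submit(self, client_handle, job_spec):
        """Queue the request and return immediately with an internal job id."""
        job_id = str(uuid.uuid4())
        self.handles[job_id] = {"client": client_handle, "resource_manager": None}
        self.pending.put((job_id, job_spec))
        return job_id

    def record_resource_manager_handle(self, job_id, rm_handle):
        """Correlate the internal id with the handle the resource manager issued."""
        self.handles[job_id]["resource_manager"] = rm_handle
```

The caller receives the internal identifier before any resource manager has seen the job, which is what keeps the response time independent of the state of the underlying cluster.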

Job Submission and Management: Each managed compute resource has a light-weight daemon that periodically checks the job request queue, translates each pending job specification into a resource-manager-specific script and submits it. When updating job states, Sigiri also takes into consideration quality of service metrics such as estimated queue wait time and maximum wall time, and scales back accordingly.
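To make the translation step concrete, the sketch below turns a simplified job description into a PBS-style batch script. Both the choice of PBS as the target resource manager and the dictionary fields (`name`, `nodes`, `max_wall_time_s`, `executable`, `arguments`) are assumptions made for illustration; the dictionary is a stand-in for a parsed JSDL or RSL document, and the text does not specify how Sigiri performs this mapping.

```python
def to_pbs_script(job):
    """Render a simplified job description as a PBS-style batch script.

    `job` is a hypothetical, already-parsed job specification.
    """
    # Convert the wall time limit from seconds to HH:MM:SS.
    hours, remainder = divmod(job["max_wall_time_s"], 3600)
    minutes, seconds = divmod(remainder, 60)
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job['name']}",
        f"#PBS -l nodes={job['nodes']}",
        f"#PBS -l walltime={hours:02d}:{minutes:02d}:{seconds:02d}",
        " ".join([job["executable"], *job["arguments"]]),
    ]
    return "\n".join(lines) + "\n"
```

A daemon for a different resource manager would supply its own translation function while consuming the same queued job description.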

Asynchronous Job Status Notification: When a status change is reported by the resource manager or detected by Sigiri through polling, the monitoring daemon checks for registered callback clients, notifies them, and updates the persisted data for clients that poll.
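The two delivery paths, callbacks for registered clients and stored state for polling clients, can be sketched as below. The class `StatusMonitor`, its method names, and the state strings are hypothetical. The sketch also stores the new state before invoking callbacks, which is a choice made here so that a failing callback cannot hide the update from polling clients.

```python
class StatusMonitor:
    """Hypothetical monitoring component (not Sigiri's API)."""

    def __init__(self):
        self.status = {}     # job id -> last known state, read by polling clients
        self.callbacks = {}  # job id -> list of registered callables

    def register(self, job_id, callback):
        """Register a callable to be invoked on every state change of job_id."""
        self.callbacks.setdefault(job_id, []).append(callback)

    def on_status(self, job_id, new_state):
        """Handle a state reported by the resource manager or found by polling.

        Returns True if the state changed, False if it was already known.
        """
        if self.status.get(job_id) == new_state:
            return False
        self.status[job_id] = new_state
        for callback in self.callbacks.get(job_id, []):
            callback(job_id, new_state)
        return True

    def poll(self, job_id):
        """Return the last recorded state, or None if the job is unknown."""
        return self.status.get(job_id)
```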

Job acceptance and job submission are decoupled to keep the initial response time steady and to protect the resource manager from surges in the job submission rate. The sustained acceptance rate greatly increases the scalability of the job management system, enabling support for large-scale workflow systems. This decoupling also helps the system withstand communication failures of the underlying resources and retry and recover once they return to a healthy state. Asynchronous job acceptance introduces latency, but the robustness and consistent performance outweigh this minimal delay.
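One way to picture the recovery behaviour is a submission pass that leaves a job in the queue whenever the resource cannot be reached, so that a later pass retries it. The function `submit_pass`, the use of `ConnectionError` to signal an unreachable resource, and the injected `submit` callable are assumptions made for this sketch; the text does not describe Sigiri's actual retry policy.

```python
from collections import deque


def submit_pass(pending, submit):
    """Try each queued job once; keep jobs whose submission failed.

    `pending` is a deque of (job_id, script) pairs and `submit` is a
    hypothetical callable that hands a script to the resource manager
    and returns the resource manager's handle.
    """
    retry_later = deque()
    submitted = {}
    while pending:
        job_id, script = pending.popleft()
        try:
            submitted[job_id] = submit(script)
        except ConnectionError:
            # Resource unreachable: keep the job for the next pass.
            retry_later.append((job_id, script))
    pending.extend(retry_later)
    return submitted
```

Because clients already hold an internal job identifier, a failed pass is invisible to them apart from the added delay.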