Lark: An effective approach for software-defined networking in high throughput computing clusters


Abstract

High throughput computing (HTC) systems are widely adopted in scientific discovery and engineering research. They are responsible for scheduling submitted batch jobs to utilize cluster resources. Current systems mostly focus on managing computing resources such as CPU and memory; however, they lack flexible and fine-grained management mechanisms for network resources. This need has become increasingly urgent, as batch systems may be distributed among dozens of sites around the globe, as in the Open Science Grid. The Lark project was motivated by this need to re-examine how the HTC layer interacts with the network layer. In this paper, we present the system architecture of Lark and its implementation as a plugin for HTCondor, a popular HTC software project. Lark achieves lightweight network virtualization at per-job granularity for HTCondor by utilizing Linux containers and virtual Ethernet devices; this provides each batch job with a unique network address in a private network namespace. We extended HTCondor's job description language, ClassAds, so that users can specify networking requirements in the job submission script. HTCondor can then perform matchmaking to ensure that user-specified network requirements and resource-specific policies are fulfilled. We also extended the job agent, condor_starter, so that it can manage and configure the job's network environment. With this building block as the core, we implement bandwidth management functionality at both the host and network levels utilizing software-defined networking (SDN). In addition to managing HTCondor jobs, wide-area network bandwidth management for GridFTP traffic is designed and implemented. Our experiments and evaluations show that Lark can effectively manage network resources for both applications simultaneously within the cluster environment. By avoiding heavyweight virtual machines, we keep startup overhead minimal compared to "regular" batch jobs.
This mechanism gives users better predictability of their job execution and gives administrators more flexibility in allocating network resources.
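The abstract describes Lark's core isolation technique: a private network namespace per job, connected to the host by a virtual Ethernet (veth) pair. The commands below are an illustrative sketch of those Linux primitives only, not Lark's actual code; the namespace, device, and address names are hypothetical, and the commands require root privileges.

```
# Create a private network namespace for a job (names are illustrative)
ip netns add lark_job42

# Create a veth pair: one end stays on the host, the other goes to the job
ip link add veth_host type veth peer name veth_job
ip link set veth_job netns lark_job42

# Give the job's end a unique address inside its namespace
ip -n lark_job42 addr add 10.10.0.42/24 dev veth_job
ip -n lark_job42 link set veth_job up
ip link set veth_host up

# Host-level bandwidth management can then be applied to the job's veth
# device, e.g. a token-bucket rate limit with tc:
tc qdisc add dev veth_host root tbf rate 100mbit burst 32kbit latency 400ms
```

Because each job's traffic traverses its own veth device, per-job bandwidth policy reduces to ordinary traffic control on that device, which is what makes the host-level bandwidth management the abstract mentions possible without heavyweight VMs.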

Author-supplied keywords

  • Bandwidth management
  • HTCondor
  • High throughput computing
  • Network-aware scheduling
  • Software-defined networking
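The abstract notes that ClassAds was extended so users can state networking requirements at submission time and HTCondor matchmaking enforces them. A hypothetical submit-file fragment sketching that idea is below; the Lark-specific attribute names are invented for illustration and are not the project's actual syntax.

```
# HTCondor submit description file (sketch).
# The +Lark* attributes below are hypothetical illustrations of
# Lark-style network requirements, not documented Lark syntax.
executable   = analysis.sh
request_cpus = 1

+LarkNetworkType      = "NAT"   # e.g. a private namespace behind NAT
+LarkRequestBandwidth = 100     # desired bandwidth, matched against policy
requirements = (HasLarkNetworking == True)

queue
```

Matchmaking would then place the job only on machines whose advertised ClassAd attributes satisfy these requirements, mirroring how CPU and memory requests are matched today.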


Authors

  • Zhe Zhang
  • Brian Bockelman
  • Dale W. Carder
  • Todd Tannenbaum
