Join us

Slurm

Slurm is an open-source workload manager and job scheduler for Linux clusters, providing resource allocation, job execution, and queue management for large-scale high-performance computing environments. Highly scalable and plugin-driven, it powers many of the world’s largest supercomputers.

Featured Course

Cloud Native CI/CD with GitLab

From Commit to Production Ready

Content

Updates and recent posts about Slurm..

Posts
Description

Link

@varbear shared a link, 1 week, 5 days ago

FAUN.dev()

Year in Review: Lessons From 12 Projects Patreon Shipped in 2025

Patreon engineers made massive bets in 2025, shipping code across all areas of the system and enabling impactful features like Autopilot's growth tools suite. Expanding Autopilot's scope, reach, and effectiveness was a challenge, especially guaranteeing recipient redemption after email delivery in a.. read more

Link

@varbear shared a link, 1 week, 5 days ago

FAUN.dev()

Distinguishing yourself early in your career as a developer

A seasoned dev maps the job market into three tiers:local/public companies,VC-backed/startups, andBig Tech/finance. Each step up brings more money, more competition, and a steeper climb. Category 3(Big Tech/finance): Highest salaries. Broadest interview access. Brutal prep required. Category 2(start.. read more

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

BadPods Series: Everything Allowed on AWS EKS

A security researcher ran a full-blown container escape on EKS usingBadPods- a tool that spins up dangerously overprivileged pods. The pod broke out of its container, poked around the host node, moved laterally, and swiped AWS IAM creds. All of it slipped past EKS’s defaultPod Security Admission (PS.. read more

BadPods Series: Everything Allowed on AWS EKS

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode

GitLab Runners now work withAmazon EKS Auto Mode. That means hands-off infra, smarter scaling, and built-in AWS security. Runners spin up onEC2 Spot Instances, so teams can cut CI/CD compute costs by as much as90%- without hacking together flaky pipelines... read more

Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

Kubernetes GPU Management Just Got a Major Upgrade

Kubernetes 1.34 droppedDynamic Resource Allocation (DRA)- think persistent volumes, but for GPUs and custom hardware. Vendors can now plug in drivers and schedulers for their devices, and workloads can pick exactly what they need. Coming in 1.35: a newworkload abstractionthat speaks the language of .. read more

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

From Deterministic to Agentic: Creating Durable AI Workflows with Dapr

Dapr droppedDurable Agents- a mashup of classic workflows and LLM-driven agents that can actually get things done and survive rough edges. They track reasoning steps, tool calls, and chat states like a champ. If things crash, no problem: Dapr Workflows and Diagrid Catalyst bring it all back... read more

From Deterministic to Agentic: Creating Durable AI Workflows with Dapr

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

Implementing assurance pipeline for Amazon EKS Platform

AWS released a full-stack CI/CD validation pipeline forAmazon EKS. It pulls in six layers of testing,Terraform,Helm,Locustload testing, and evenAWS Fault Injectionfor pushing resilience to the edge. The goal: bake policy checks, functional tests, and brutal load tests right into pre-deployment. Fewe.. read more

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager

Kubernetes v1.35 sneaks in an alphafeature gatethat flips the CCM route controller from "check every X minutes" to "watch and react." It now usesinformersto trigger syncs when nodes change - plus a light periodic check every 12–24 hours... read more

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

1.35: Enhanced Debugging with Versioned z-pages APIs

Kubernetes 1.35 makes a quiet-but-crucial upgrade: z-pages debugging endpoints now returnstructured, machine-readable JSON. That means tools- not just tired humans - can parse control plane state directly. The responses areversioned, backward-compatible, and tucked behind feature flags for now... read more

Link

@kaptain shared a link, 1 week, 5 days ago

FAUN.dev()

v1.35: New level of efficiency with in-place Pod restart

Kubernetes 1.35, as you may know, introducedin-place Pod restarts(alpha). It's a real reset: all containers, init and sidecars included - without killing the Pod or kicking off a reschedule. Think restart without the cloud drama. Big win for workloads with heavy inter-container dependencies or massi.. read more

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.