Updates and recent posts about Slurm.

Posts
Description

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

SLI Evolution Stages

A new SLI evolution model lays out a maturity roadmap, from rebranded latency/error metrics to ones that actually track business impact. It replaces shallow signals and pulls in the stuff that matters: how service failures hit user goals, tasks, and bottom lines...

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

%CPU Utilization Is A Lie

Stress tests on the Ryzen 9 5900X uncovered a big gap between **reported CPU utilization** and what the chip actually pushes. Around 50% on paper? Could mean close to full throttle in reality, thanks to sneaky behaviors from **SMT resource sharing** and **Turbo frequency scaling**. **Takeaway:** Raw..

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

Introducing Budget Controls for AWS: Automatically Manage Your Cloud Costs

**Budget Controls for AWS** just got better. The open-source tool now reins in more than just EC2. It wrangles **RDS Aurora**, **SageMaker**, and **OpenSearch** too. Under the hood, it taps **AWS Budgets**, **AWS Config**, and **custom tags** to watch spend like a hawk. Hit a budget threshold? It c..

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

Fast, Secure Kubernetes with AKS Automatic

Azure dropped **AKS Automatic**, a new managed Kubernetes tier that tries to do it all, so you don’t have to. It comes with baked-in best practices: autoscaling via HPA, VPA, KEDA, and Karpenter. Automated patching. Node repair. Monitoring. All wired up by default. You still get full access to the ..

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

v1.34: Pods Report DRA Resource Health

Kubernetes v1.34 lands with an alpha upgrade to **[KEP-4680](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4680-add-resource-health-to-pod-status)**, pushing **Dynamic Resource Allocation (DRA)** into smarter territory: health-aware Pods. DRA drivers can now stream device heal..

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

Kubernetes Security: Best Practices to Protect Your Cluster

A new JetBrains IDE plugin throws Kubernetes security best practices straight into your deployment manifests, right where they belong. Think: checks for `runAsRoot`, privileged mode, `hostPath`, host ports, and sketchy sysctls. No hand-waving. It enforces stuff like: - Default `runAsNonRoot` - Drop ..

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

v1.34: DRA Consumable Capacity

Kubernetes 1.34 rolls in **consumable capacity** for Dynamic Resource Allocation (DRA). That means DRA drivers can now carve up resources (GPU memory, NIC bandwidth, etc.) into precise slices for Pods, ResourceClaims, and namespaces. The scheduler tracks it all, so nothing spills over...

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

v1.34: Recovery From Volume Expansion Failure (GA)

Kubernetes v1.34 bumps **automated recovery from botched PVC expansions** to GA. Users can now fix bad volume size requests: no admin, no drama. It cleans up unused quota, slows down retry spam, and surfaces progress with new PVC status fields...

Link

@faun shared a link, 2 months, 1 week ago

FAUN.dev()

v1.34: Decoupled Taint Manager Is Now Stable

Kubernetes 1.34 graduates the taint eviction controller to GA. Now, the node lifecycle controller only applies taints, while a dedicated taint eviction controller manages pod eviction. First split in 1.29, now stable in 1.34...

Story

@laura_garcia shared a post, 2 months, 1 week ago

Software Developer, RELIANOID

Secure Boot Advanced Targeting (SBAT): Scaling Boot Security 🔐

Discover how SBAT enhances Secure Boot by introducing a smarter way to handle vulnerabilities, reducing overhead, and ensuring your system's boot process stays secure. Learn how it works, how it addresses scalability, and why it's a game-changer for modern boot security across Linux and Windows envi..

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and arbitrating contention for resources through a queue of pending work.

At its core, Slurm uses a centralized controller (`slurmctld`) to track cluster state and assign work, while lightweight daemons (`slurmd`) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like `slurmdbd` and `slurmrestd` extend Slurm with accounting and REST APIs. A rich set of commands, such as `srun`, `squeue`, `scancel`, and `sinfo`, gives users and administrators full visibility and control.
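As a rough illustration of how the `squeue` command can feed a script, here is a minimal Python sketch that parses pipe-delimited queue output and counts jobs per state. The format string (`%i` job ID, `%T` state, `%P` partition), the helper names, and the sample data are assumptions chosen for this example; `query_squeue` only works on a host where Slurm is installed.

```python
import subprocess
from collections import Counter

# Assumed output format: job ID, extended job state, partition, pipe-delimited.
SQUEUE_FORMAT = "%i|%T|%P"


def parse_squeue(text):
    """Parse pipe-delimited squeue output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, state, partition = line.split("|")
        jobs.append({"job_id": job_id, "state": state, "partition": partition})
    return jobs


def count_states(jobs):
    """Count how many jobs are in each state."""
    return Counter(job["state"] for job in jobs)


def query_squeue():
    """Run squeue without a header line and parse the result.

    Requires a working Slurm installation; not exercised in this sketch.
    """
    result = subprocess.run(
        ["squeue", "-h", "-o", SQUEUE_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


# Hypothetical sample output, used here instead of a live cluster.
SAMPLE = """\
1001|RUNNING|gpu
1002|PENDING|gpu
1003|RUNNING|batch
"""

if __name__ == "__main__":
    for state, count in sorted(count_states(parse_squeue(SAMPLE)).items()):
        print(state, count)
```

Keeping the parser separate from the subprocess call means the logic can be checked against captured output without access to a cluster.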

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
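To make partitions and GRES more concrete, the following Python sketch renders a small batch script that targets a partition and optionally requests GPUs per node through `--gres`. The function name, the partition names (`gpu`, `batch`), and the default values are hypothetical; real clusters define their own partitions and GRES types in their configuration.

```python
def render_batch_script(job_name, partition, command,
                        nodes=1, gpus_per_node=0, time_limit="00:10:00"):
    """Build the text of a batch script for submission with sbatch.

    Partition names and GPU counts are illustrative assumptions.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node > 0:
        # GRES request: this many GPUs on each allocated node.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append("")
    # srun launches the command inside the allocation.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_batch_script("demo", "gpu", "hostname", gpus_per_node=2))
```

The resulting text could be saved to a file and submitted with `sbatch`; whether the request is accepted depends on the limits configured for the chosen partition.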

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.