Updates and recent posts about Slurm.
@faun shared a link, 6 months, 1 week ago
FAUN.dev()

Introducing Gateway API Inference Extension

Gateway API Inference Extension takes AI workload routing on Kubernetes and infuses it with model-savvy powers. It slices latency on GPU clusters like a samurai. Meanwhile, the Endpoint Selection Extension acts like a traffic cop on caffeine, using live metrics to steer pods and trim those nagging tail...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

Improving Cost Efficiency with Karpenter 1.0: An Upgrade Guide

Karpenter 1.0 is the speedy barista of Kubernetes. It whips up nodes on demand, slashing AWS EC2 costs by 30-50%. Why? Real-time scaling magic, Spot instance wizardry, and APIs that won't stab you in the back. Sure, Cluster Autoscaler has an extensive resume of compatibility and control, but it's like ...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

We saved 30% on Kubernetes by switching to 70% more expensive VMs

Omio swapped Spot VMs for standard ones in a single region and unearthed a shocker. Costs didn't skyrocket; they actually dropped. Network glitches? Gone. They braced for a 70% budget implosion but emerged with a grin. Standardizing on 16-core, 1:4 RAM machines cranked up performance and dialed down c...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

Tracing Syscalls with eBPF in Docker: A Practical Example

This post walks through an example of combining a FastAPI service with an eBPF tracer to monitor syscalls. It covers common pitfalls encountered during development on macOS, the shift to containerizing the environment, and how the author ultimately succeeded in capturing the desired syscalls, a hands...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

Deep Dive: Amazon EKS Dashboard for Visibility into Multi-Cluster Operations and Governance

Amazon EKS Dashboard tames the Kubernetes chaos with finesse. It brings all your clusters into one sharp, centralized view on AWS. Sprawl, security snags, ballooning support costs: gone in a flash. Assess upgrade needs, peek into cost forecasts, and manage add-ons without breaking a sweat. Wave farewe...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

Not Every Problem Needs Kubernetes

Most projects don’t need Kubernetes; for 90% of teams, it adds unnecessary complexity and operational burden compared to simpler alternatives like managed cloud services, VMs, or actor-model frameworks. Unless you’re running at hyperscale, need true hybrid environments, or have a massive, mature plat...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

Securing Kubernetes: Integrating AKS with Tetragon for eBPF-Powered Observability

Tetragon taps into the kernel using eBPF, giving containers an all-access pass without the agent baggage. When you pair Tetragon with AKS, you unlock crystal-clear views of process executions and system calls. Security teams revel in this treasure trove, primed for spotting and squashing threats swift...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

How to Use AI to Detect PPE Compliance in Edge Environments

Meet the motley crew that is the YOLOv8-based AI team. These guys get serious about detecting hard hats across countless video streams and they do it in real time. Their secret weapon? The metallic trio of ZEDEDA, Rancher, and Terraform. ZEDEDA tames edge management. Rancher wrangles Kubernetes. Terraform? I...

@faun shared a link, 6 months, 1 week ago
FAUN.dev()

How We Migrated 30+ Kubernetes Clusters to Terraform

Terraform isn't just making waves at SCHIP; it's rewriting the rulebook. Watching CI plan times dive from a sluggish 10 minutes to a snappy 30 seconds feels like magic, thanks to its knack for spitting out import statements like they're hotcakes. While flashy automation dazzles, it's actually the grit...

@adyrcz shared a link, 6 months, 1 week ago
Head of Security & Compliance, Linkfire

Agentic AI Manifest – A Schema to Describe What Agents Do

Just launched agent-manifest.org — a schema for describing what AI agents do, what they need, and how they work.

Agents are the new APIs

Agents are becoming the next layer of software abstraction—autonomous tools that act on our behalf, perform tasks, make decisions, and interact with APIs, data, and humans.

But as agents proliferate, we face a growing challenge:

How do we understand what an agent does, what it needs, and what it can be trusted with?

See the proposed standard to solve this problem

Robot Librarian
Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.
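
As a rough illustration of how these commands can be scripted, the sketch below parses a job listing in the style of `squeue -h -o "%i %T %j"` (job ID, state, job name, no header line). The sample text and the helper name `parse_squeue` are assumptions made for illustration; the commented-out `subprocess` call shows where real output might come from on a machine with Slurm installed.

```python
from typing import Dict, List

# Format string for `squeue -o`: %i = job ID, %T = job state, %j = job name.
SQUEUE_FORMAT = "%i %T %j"


def parse_squeue(text: str) -> List[Dict[str, str]]:
    """Parse header-less squeue output produced with SQUEUE_FORMAT."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split into at most three fields so job names may contain spaces.
        job_id, state, name = line.split(maxsplit=2)
        jobs.append({"id": job_id, "state": state, "name": name})
    return jobs


# Hypothetical sample output; on a real cluster you might instead run:
#   import subprocess
#   text = subprocess.run(["squeue", "-h", "-o", SQUEUE_FORMAT],
#                         capture_output=True, text=True, check=True).stdout
SAMPLE = """\
1001 RUNNING train model
1002 PENDING preprocess
"""

jobs = parse_squeue(SAMPLE)
pending = [job["id"] for job in jobs if job["state"] == "PENDING"]
print(pending)
```

Keeping the parsing separate from the command invocation means the logic can be exercised without access to a cluster.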

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
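
To make partitions and GRES more concrete, here is a minimal sketch that renders a batch script requesting GPUs from a given partition. The helper `render_batch_script`, the partition name `gpu`, and the `python train.py` command are hypothetical; the `#SBATCH` options used (`--job-name`, `--partition`, `--gres`, `--time`) are standard `sbatch` directives, but the partitions and GRES types actually available depend on how a site has configured its cluster.

```python
from typing import Optional


def render_batch_script(job_name: str, partition: str, command: str,
                        gpus: int = 0, time_limit: Optional[str] = None) -> str:
    """Build the text of a batch script with #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
    ]
    if gpus > 0:
        # GPUs are requested as a Generic Resource (GRES).
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    if time_limit is not None:
        lines.append(f"#SBATCH --time={time_limit}")
    # Launch the (hypothetical) workload as a job step.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


script = render_batch_script("demo", "gpu", "python train.py",
                             gpus=2, time_limit="01:00:00")
print(script)
```

The resulting text could be saved to a file and submitted with `sbatch`; whether the job is accepted depends on the limits and policies of the target partition.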

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.