
Updates and recent posts about Slurm.
Link
@varbear shared a link, 1 week, 1 day ago
FAUN.dev()

Build and Deploy a Remote MCP Server to GKE in 30 Minutes

Google walks you through shipping a remote MCP server on GKE Autopilot using FastMCP and streamable-http, swapping local stdio for shared HTTP endpoints. The clever bit: the Gateway API handles managed SSL plus CLIENT_IP session affinity, so one centralized server beats everyone running redundant local copies... read more

Link
@varbear shared a link, 1 week, 1 day ago
FAUN.dev()

How building an HTML-first site doubled our users overnight

Building HTML-first forms using Astro instead of React dramatically increased completion rates and sustainability, highlighting the effectiveness of lightweight, accessible web components for all users, regardless of browser or connectivity... read more  

Link
@varbear shared a link, 1 week, 1 day ago
FAUN.dev()

The unwritten laws of software engineering

- Always related - first rollback, then debug. - Backups aren’t real until restored. - You’ll hate yourself for bad logs. - ALWAYS have a rollback plan. - Every external dependency will fail. - If there's risk, use the “4 eyes” rule. - Nothing lasts like a temporary fix... read more  

Link
@varbear shared a link, 1 week, 1 day ago
FAUN.dev()

Google hits 50% IPv6

The 50% IPv6 milestone is real, but adoption differs by country. Analysts who report lower figures use population-weighted sampling, while their per-country adoption rates match the higher estimate... read more  

Link
@varbear shared a link, 1 week, 1 day ago
FAUN.dev()

Building in the Age of Collaborative Coding

The speed of innovation is crucial for teams, and AI tools have enabled faster work. A collaborative coding model where teams build, review, and ship alongside AI agents is key to staying ahead in workflows. Three shifts have reshaped how teams build, leading to the adoption of a new collaborative c.. read more  

Link
@kaptain shared a link, 1 week, 1 day ago
FAUN.dev()

Tigera introduces unified control plane for Kubernetes-based AI agent security

Tigera launched Lynx for general availability, a Kubernetes-native control plane that operators place in the path of AI agent calls so teams can enforce identity and policy... read more  

Link
@kaptain shared a link, 1 week, 1 day ago
FAUN.dev()

How Netflix Simplified Batch Compute with Kueue

Netflix migrated millions of batch jobs from their custom queuing system to Kueue, a cloud-native job queueing system, as part of transitioning to a more Kubernetes-native infrastructure. Kueue offers features such as preemption, fair sharing, and hierarchical tenants that were missing in their homegro.. read more  

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.
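
As a rough illustration of how the user-facing commands expose the controller's view of the queue, here is a minimal Python sketch that parses `squeue` output. It assumes `squeue` is on the PATH and is called with `-h` (no header) and a `-o` format string using the specifiers `%i`, `%P`, `%j`, `%u`, `%T`, and `%D`. The `Job` class, the function names, and the sample data are hypothetical and exist only for this example.

```python
import subprocess
from dataclasses import dataclass

# squeue format specifiers: %i job ID, %P partition, %j job name,
# %u user, %T job state (long form), %D number of nodes.
SQUEUE_FORMAT = "%i|%P|%j|%u|%T|%D"


@dataclass
class Job:
    """Hypothetical container for one squeue row."""
    job_id: str      # kept as a string: array jobs look like "123_4"
    partition: str
    name: str
    user: str
    state: str
    nodes: int


def parse_squeue(text: str) -> list[Job]:
    """Parse pipe-delimited squeue output into Job records.

    Assumes no field contains a '|' character; a job name that does
    would need a different delimiter.
    """
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, partition, name, user, state, nodes = line.split("|")
        jobs.append(Job(job_id, partition, name, user, state, int(nodes)))
    return jobs


def list_jobs() -> list[Job]:
    """Query the live queue; only works on a host with Slurm installed."""
    result = subprocess.run(
        ["squeue", "-h", "-o", SQUEUE_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(result.stdout)


# Made-up sample output, so the parser can be tried without a cluster.
SAMPLE = "1001|debug|train|alice|RUNNING|2\n1002|debug|eval|bob|PENDING|1\n"

if __name__ == "__main__":
    for job in parse_squeue(SAMPLE):
        print(job.job_id, job.state, job.nodes)
```

Keeping the parser separate from the `subprocess` call means the parsing logic can be tested on any machine, while `list_jobs()` is only needed on a login or compute node.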

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
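
To make partitions and GRES more concrete, the following Python sketch renders a batch script that requests a partition, a node count, a time limit, and GPUs through `--gres`. The `#SBATCH` options used (`--job-name`, `--partition`, `--nodes`, `--time`, `--gres`) are standard `sbatch` options. The function name, the partition name `gpu`, and the command are hypothetical; the partitions and GRES types that actually exist depend on how a given cluster is configured.

```python
def render_batch_script(
    job_name: str,
    partition: str,
    nodes: int,
    gpus_per_node: int,
    time_limit: str,
    command: str,
) -> str:
    """Build the text of a batch script for submission with sbatch.

    This is an illustrative helper, not part of Slurm itself.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node > 0:
        # --gres counts are per node, so 2 nodes with gpu:4 means 8 GPUs.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    # srun launches the command as a job step inside the allocation.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical partition name and command, for illustration only.
    script = render_batch_script(
        job_name="train",
        partition="gpu",
        nodes=2,
        gpus_per_node=4,
        time_limit="01:00:00",
        command="python train.py",
    )
    print(script, end="")
```

The resulting text could be written to a file and submitted with `sbatch`. Whether the request is accepted depends on the limits configured for the chosen partition.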

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.