Join us

ContentUpdates and recent posts about Slurm..
Story
@laura_garcia shared a post, 8 months, 1 week ago
Software Developer, RELIANOID

RELIANOID Load Balancer Community Edition v7 on AWS using Terraform

🚀 New Guide Available! Learn how to quickly deploy RELIANOID Load Balancer Community Edition v7 on AWS using Terraform. Our step-by-step article shows you how to provision everything automatically — from VPCs and subnets to EC2 and key pairs — in just minutes. 👉 https://www.relianoid.com/resources/k..

Knowledge base Deploy RELIANOID Load Balancer Community Edition v7 with Terraform on AWS
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Deploy a containerized application with Kamal and Terraform

A Docker-first workflow combinesTerraformandKamalinto a lean, Elastic Beanstalk-ish alternative—without the bloat. Terraform spins up a three-tier VPC and wires it toECR. Kamal takes it from there, booting containers on a raw EC2 box: app, proxy, monitor. One script. Done... read more  

Deploy a containerized application with Kamal and Terraform
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

AWS, Microsoft and Google unite behind Linux Foundation DocumentDB database to cut enterprise costs and limit vendor lock-in

Document databases are crucial for AI apps in the gen AI era. Microsoft's open-source DocumentDB project, based on PostgreSQL, is moving to the Linux Foundation, offering a vendor-neutral, open-source alternative to MongoDB. DocumentDB's compatibility with MongoDB drivers and open source governance .. read more  

Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Sandboxed to Compromised: New Research Exposes Credential Exfiltration Paths in AWS Code Interpreters

Researchers poked holes insandboxed Bedrock AgentCore code interpreters—and found a way to leak execution role credentials through theMicroVM Metadata Service (MMDS). No outside network? Doesn’t matter. The exploit dodges basic string filters in requests and lets non-agentic code swipe AWS creds to .. read more  

Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Measuring Developer Productivity with Amazon Q Developer and Jellyfish

Amazon Q Developer now plugs into Jellyfish. Teams get a clearer view of how AI fits into the real flow of work—prompt usage, code adoption, PR throughput. Not just surface stats. The setup pipes data from AWS S3 straight into Jellyfish’s analytics engine. It tags AI users, tracks velocity gains, an.. read more  

Measuring Developer Productivity with Amazon Q Developer and Jellyfish
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Being on the Same Page During an Incident: Not Actually Telepathy

Collaboration in incident response is crucial for effective resolution, starting with establishing a basic compact among responders. Grounding is a process that ensures alignment and common ground is maintained throughout an incident, encompassing initial common ground, public events so far, and the.. read more  

Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Which LLM writes the best analytical SQL?

Tinybird threw 19 top LLMs at a 200M-row GitHub dataset, testing how well they could turn plain English into solid SQL. Most models kept their syntax clean—but when it came to writing SQL that actually ran well and returned the right results, they lagged behind human pros. Messy schemas or tricky pr.. read more  

Which LLM writes the best analytical SQL?
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Building a Scalable, Flexible, Cloud-Native GenAI Platform with Open Source Solutions

A fresh reference architecture built withEnvoy AI GatewayandKServebrings order to the GenAI chaos. One clean interface to route requests across internal and external LLMs—locked down with policies. It’s called aTwo-Tier Gateway Architecture. Think of it like a split-brain: external API traffic goes.. read more  

Building a Scalable, Flexible, Cloud-Native GenAI Platform with Open Source Solutions
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Container Logs in Kubernetes: How to View and Collect Them

This guide shows how to wrangle container logs in Kubernetes—usingkubectl, shell tools, structured logging, and the Kubernetes Dashboard. It covers the basics and dives into how to scale up log collection and make observability less painful across clusters... read more  

Container Logs in Kubernetes: How to View and Collect Them
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

v1.34: DRA has graduated to GA

Kubernetes 1.34 turnsDynamic Resource Allocation (DRA)loose into General Availability—enabled by default. That cements native support for high-maintenance gear like GPUs, FPGAs, and any other quirky hardware your workloads need. The release also packs a fresh mix of alpha/beta features: tighter admi.. read more  

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.