Join us

ContentUpdates and recent posts about Pelagia..
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Easy will always trump simple

Rich Hickey’s classic “Simple Made Easy” talk is making the rounds again—as a mirror held up to dev culture under pressure. The punchline: we keep picking solutions that areeasy but tangled, instead ofsimple and sane. The essay draws a sharp line between that habit and a concept from biology: exapt.. read more  

Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

The Hidden AWS Cost Traps No One Warns You About (and How I Avoid Them)

Calling outfive sneaky AWS cost traps—the kind that creep in through overlooked defaults and quiet misconfigs, then blow up your bill while no one's watching... read more  

The Hidden AWS Cost Traps No One Warns You About (and How I Avoid Them)
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Why "What Happened First?" Is One of the Hardest Questions in Large-Scale Systems

Logical clocks trackevent orderin distributed systems—no need for synced wall clocks. Each node keeps a counter. On every event: tick it. On every message: tack on your counter. When you receive one? Merge and bump. This flips the script. Instead of chasing global time, distributed systems lean int.. read more  

Why "What Happened First?" Is One of the Hardest Questions in Large-Scale Systems
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Paused Kubernetes project finds path forward

TheExternal Secrets Operator (ESO)is moving again. After hitting pause from maintainer burnout, it’s back under CNCF incubation—with a rebooted structure in place. New governance, clear contributor paths, and support tracks for CI, core dev, and testing are all in. But don’t expect fresh releases ju.. read more  

Paused Kubernetes project finds path forward
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Pooling Connections with RDS Proxy at Klaviyo

Klaviyo replaced ProxySQL on EC2 and moved toAWS RDS Proxy. Why? Less overhead. Simpler failovers. Smarter pooling. RDS Proxy handlesmultiplexing, packing thousands of client queries into way fewer DB connections. IAM access and built-in failover routing sweeten the deal... read more  

Pooling Connections with RDS Proxy at Klaviyo
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Kubernetes right-sizing with metrics-driven GitOps automation

AWS just dropped a GitOps-native pattern for tuning EKS resources—built to runoutsidethe cluster. It’s wired up withAmazon Managed Service for Prometheus,Argo CD, andBedrockto automate resource recommendations straight into Git. Here’s the play: it maps usage metrics to templated manifests, then sp.. read more  

Kubernetes right-sizing with metrics-driven GitOps automation
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Dynamic Kubernetes request right sizing with Kubecost

Kubecost’s Amazon EKS add-on now handlesautomated container request right-sizing. That means teams can tweak CPU and memory requests based on actual usage—once or on a recurring schedule. Optimization profiles are customizable, and resizing can be baked into cluster setup using Helm. Yes, that mean.. read more  

Dynamic Kubernetes request right sizing with Kubecost
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Kubernetes Primer: Dynamic Resource Allocation (DRA) for GPU Workloads

Kubernetes 1.34 brings serious heat for anyone juggling GPUs or accelerators. MeetDynamic Resource Allocation (DRA)—a new way to schedule hardware like you mean it. DRA addsResourceClaims,DeviceClasses, andResourceSlices, slicing device management away from pod specs. It replaces the old device plu.. read more  

Kubernetes Primer: Dynamic Resource Allocation (DRA) for GPU Workloads
Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Why I Ditched Docker for Podman (And You Should Too)

Older container technologies like Docker have been prone to security vulnerabilities, such as CVE-2019-5736 and CVE-2022-0847, which allowed for potential host system compromise. Podman changes the game by eliminating the need for a persistent background service like the Docker daemon, enhancing sec.. read more  

Link
@faun shared a link, 8 months, 1 week ago
FAUN.dev()

Rethinking Efficiency for Cloud-Native AI Workloads

AI isn’t just burning compute—it's torching old-school FinOps. Reserved Instances? Idle detection? Cute, but not built for GPU bottlenecks and model-heavy pipelines. What’s actually happening:Infra teams are ditching cost-first playbooks for something smarter—business-aligned orchestrationthat chas.. read more  

Rethinking Efficiency for Cloud-Native AI Workloads
Pelagia is a Kubernetes controller that provides all-in-one management for Ceph clusters installed by Rook. It delivers two main features:

Aggregates all Rook Custom Resources (CRs) into a single CephDeployment resource, simplifying the management of Ceph clusters.
Provides automated lifecycle management (LCM) of Rook Ceph OSD nodes for bare-metal clusters. Automated LCM is managed by the special CephOsdRemoveTask resource.

It is designed to simplify the management of Ceph clusters in Kubernetes installed by Rook.

Being solid Rook users, we had dozens of Rook CRs to manage. Thus, one day we decided to create a single resource that would aggregate all Rook CRs and deliver a smoother LCM experience. This is how Pelagia was born.

It supports almost all Rook CRs API, including CephCluster, CephBlockPool, CephFilesystem, CephObjectStore, and others, aggregating them into a single specification. We continuously work on improving Pelagia's API, adding new features, and enhancing existing ones.

Pelagia collects Ceph cluster state and all Rook CRs statuses into single CephDeploymentHealth CR. This resource highlights of Ceph cluster and Rook APIs issues, if any.

Another important thing we implemented in Pelagia is the automated lifecycle management of Rook Ceph OSD nodes for bare-metal clusters. This feature is delivered by the CephOsdRemoveTask resource, which automates the process of removing OSD disks and nodes from the cluster. We are using this feature in our everyday day-2 operations routine.