Hey HN.
I'm an engineering student at Waterloo building stateful AI agents, and I kept hitting the same wall: whenever my Python scripts crashed or dropped a connection, the underlying Puppeteer or Ollama processes would sit there orphaned, eating RAM until the kernel's OOM killer took the node down. Standard load balancers break sticky sessions, and passive HTTP timeouts are too slow for cleanup.
I couldn't find a good local process pool that actually cleaned up dead stateful sessions reliably, so I built Herd in Go.
It uses a persistent stream (gRPC/Unix sockets) strictly as a dead-man's switch. If your client script dies, the stream breaks; Herd sees the EOF and immediately SIGKILLs the worker process (with Pdeathsig on Linux as a backstop in case Herd itself dies). The heavy data never touches that stream: you blast HTTP traffic through Herd's internal proxy, which routes it straight to the active worker's port.
My actual goal is to turn this into a multi-node distributed mesh with a Redis registry, where a client can drop off and an edge gateway routes them back to the exact pod holding their stateful memory.
But I know building a distributed mesh on top of a leaky local engine is a death sentence. The single-node cleanup has to be flawless first.
I'd love for you to roast the architecture. Specifically: is relying on Pdeathsig actually robust enough for a local dead-man's switch in production, or am I being naive and should just bite the bullet and wrap everything in cgroups and microVMs right now?
Repo link: https://github.com/herd-core/herd
Comments URL: https://news.ycombinator.com/item?id=47511866
Points: 6
# Comments: 0