Skip to content

Durability Guarantees

glide-mq is durable to the extent that your Valkey deployment is durable.

The queue stores jobs in native Valkey data structures:

  • job hashes for payload, metadata, attempts, progress, and return values
  • streams + consumer groups for waiting and active delivery state
  • sorted sets for delayed / prioritized jobs, completed jobs, and failed jobs
  • one Valkey function library for queue operations

Those structures survive restarts only if Valkey is configured to persist them.

The Short Version

  • appendonly yes is the recommended baseline for production queues.
  • appendfsync everysec is a good default when you want low latency with a small worst-case loss window.
  • appendfsync always minimizes the persistence window further, but costs throughput and latency.
  • RDB-only snapshots are usually not enough for queues unless you can tolerate losing all writes since the last snapshot.
  • glide-mq provides at-least-once delivery semantics, not exactly-once.
  • If a worker crashes after claiming a job but before acking it, the job is not silently lost; another worker can reclaim it from the pending entries list.

What Persists

When Valkey persistence is enabled, these glide-mq state transitions are persisted with the rest of the dataset:

  • enqueued jobs
  • delayed / prioritized jobs waiting in the scheduled sorted set
  • completed / failed job state
  • stream entries and consumer groups
  • pending entries list (PEL) state for jobs that were claimed but not yet acked
  • the glidemq Valkey function library itself

This matters because glide-mq uses Valkey Streams and consumer groups instead of a BRPOP-style list pop. A claimed-but-unacked job stays recoverable through the PEL until it is acked or reclaimed.

Persistence Modes

RDB snapshots

RDB snapshots write point-in-time dumps on a schedule such as save 60 10000.

Pros:

  • low write overhead
  • simple to operate

Trade-offs:

  • A crash can lose every write since the last successful snapshot
  • Not ideal if you need queue state to survive process or host failure with a small loss window

For queues, RDB-only is usually a recovery/snapshot tool, not the primary durability strategy.

AOF

AOF appends write operations to a log and is the usual recommendation for queue workloads.

Common appendfsync choices:

  • everysec: better performance, but a crash can lose roughly the last second of acknowledged writes
  • always: strongest persistence, but highest latency / lowest throughput
  • no: lowest durability; usually not appropriate for production queues

Hybrid AOF + RDB preamble

If your Valkey version supports it, keeping AOF enabled with the RDB preamble gives faster restarts while preserving the operational model of AOF.

For most production queue workloads, start here and tune from measurement:

conf
appendonly yes
appendfsync everysec
# If supported by your Valkey version:
aof-use-rdb-preamble yes

Then choose RDB snapshots based on your backup / restore needs, for example:

conf
save 60 10000

If AOF is already enabled, treat RDB snapshots as optional and tune them against your write volume and save overhead. Frequent background saves can add fork / persistence pressure on busy queue workloads.

If your queue cannot tolerate the everysec loss window, move to appendfsync always and benchmark the throughput cost before rollout.

Restart Behavior

With persistence enabled:

  • waiting jobs remain in the stream
  • delayed and prioritized jobs remain in the scheduled sorted set
  • completed / failed jobs remain in their sorted sets and hashes
  • consumer groups and pending entries survive with the stream data
  • the Valkey function library survives in the dataset and is also re-ensured by clients on startup

glide-mq additionally re-ensures the function library when clients connect or reconnect, including after topology changes and cluster failover.

Worker Crash Behavior

If a worker crashes mid-processing:

  1. the job has already been claimed into the consumer group's pending entries list
  2. it is not removed from Valkey just because the worker died
  3. another worker's scheduler can reclaim the stalled entry via XAUTOCLAIM
  4. after enough stalled-recovery cycles, glide-mq moves the job to failed instead of leaving it in limbo

This is an important difference from older queue designs that rely on destructive pops and app-side lock bookkeeping.

Cluster Failover Behavior

glide-mq works with Valkey cluster replication and re-ensures its function library on reconnect/failover, but durability during failover still depends on the underlying replication and persistence settings.

Important caveat:

  • Valkey replication is asynchronous
  • A primary failure can still lose recently acknowledged writes that were not yet replicated or fsynced

That means failover improves availability, but it does not by itself create exactly-once or zero-loss guarantees.

Delivery Semantics

glide-mq is designed for at-least-once processing:

  • A job may be redelivered after worker crash, connection loss, or stalled reclaim
  • processors should therefore be idempotent whenever side effects matter

glide-mq does not guarantee:

  • exactly-once execution
  • atomic coordination with external side effects
  • zero-loss behavior independent of Valkey persistence / replication settings

Comparison at a Glance

SystemTypical durability modelDelivery semanticsMain trade-off
glide-mq + Valkey AOF (everysec)AOF-backed queue state with a small crash windowAt-least-onceLower write cost than always, in exchange for a small possible loss window on hard crash
glide-mq + Valkey AOF (always)Strongest Valkey persistence modeAt-least-onceSmallest persistence window, in exchange for higher write cost
pg-boss / PostgreSQLBacked by PostgreSQL WAL / transaction durabilityAt-least-onceStrong database durability, in exchange for PostgreSQL operational and latency tradeoffs
RabbitMQ with durable queues + persistent messagesBroker durability when queues/messages are durable and producers use publisher confirms appropriatelyAt-least-onceMature broker durability model, in exchange for more broker-specific configuration surface

Compared with BullMQ-style Redis queues, glide-mq still depends on the same underlying Valkey durability settings, but its in-flight recovery model is built on Streams consumer groups, the PEL, and XAUTOCLAIM rather than a separate lock-renew / stall-detection cycle.

Feature-Specific Durability

The core persistence guarantees above apply to all queue state. Several features have additional durability characteristics worth noting.

Schedulers

Scheduler configurations persist in the glide:{queueName}:schedulers hash. Each field stores the scheduler name and its next-run timestamp. These entries survive restart with any persistence mode that captures hash writes. When a worker reconnects after a restart, it resumes its promotion loop and picks up any schedulers whose next-run time has already passed.

Metrics

Time-series metrics data is stored in glide:{queueName}:metrics:* keys. Data is retained for 24 hours and auto-trimmed. This data survives restart if AOF is enabled; with RDB-only, recent metrics may be lost depending on snapshot frequency.

Group State

Token bucket state, rate limit windows, and ordering sequence counters are stored in glide:{queueName}:group:{key} hash keys. All fields (active count, maxConcurrency, rateMax, rateDuration, window start, token count, refill rate, etc.) are persisted as part of the hash and survive restart with the rest of the dataset.

Deduplication

Dedup entries are stored in the glide:{queueName}:dedup hash, with field-level TTL semantics managed by the Lua functions. These entries persist with AOF. After a restart, dedup windows that have not expired continue to prevent duplicate job submission.

Step Jobs

When a processor calls moveToDelayed, the job is moved to the scheduled ZSet with a new timestamp score and data.step is updated atomically within the same Lua function call. The delayed job survives restart like any other entry in the scheduled ZSet, and the promotion loop re-queues it once the timestamp is reached.

Broadcast

Stream entries in broadcast queues are intentionally not deleted after processing (no XDEL). Entries are retained according to the maxMessages XTRIM policy. Each consumer group independently tracks its own read offset. After a restart, consumer groups resume from their last acknowledged position, so no subscriber misses messages that were added while it was down.

Practical Guidance

  • If your priority is throughput with strong-enough durability, use AOF with appendfsync everysec.
  • If your priority is minimizing acknowledged-write loss on crash, use AOF with appendfsync always and benchmark first.
  • If your jobs trigger irreversible external side effects, make handlers idempotent and store your own deduplication / business keys.
  • Treat failover as an availability feature, not proof of zero-loss durability.

Released under the Apache-2.0 License.