# Durability Guarantees
glide-mq is durable to the extent that your Valkey deployment is durable.
The queue stores jobs in native Valkey data structures:
- job hashes for payload, metadata, attempts, progress, and return values
- streams + consumer groups for waiting and active delivery state
- sorted sets for delayed / prioritized jobs, completed jobs, and failed jobs
- one Valkey function library for queue operations
Those structures survive restarts only if Valkey is configured to persist them.
## The Short Version

- `appendonly yes` is the recommended baseline for production queues.
- `appendfsync everysec` is a good default when you want low latency with a small worst-case loss window.
- `appendfsync always` shrinks the persistence window further, but costs throughput and latency.
- RDB-only snapshots are usually not enough for queues unless you can tolerate losing all writes since the last snapshot.
- glide-mq provides at-least-once delivery semantics, not exactly-once.
- If a worker crashes after claiming a job but before acking it, the job is not silently lost; another worker can reclaim it from the pending entries list.
## What Persists
When Valkey persistence is enabled, these glide-mq state transitions are persisted with the rest of the dataset:
- enqueued jobs
- delayed / prioritized jobs waiting in the scheduled sorted set
- completed / failed job state
- stream entries and consumer groups
- pending entries list (PEL) state for jobs that were claimed but not yet acked
- the `glidemq` Valkey function library itself
This matters because glide-mq uses Valkey Streams and consumer groups instead of a `BRPOP`-style list pop. A claimed-but-unacked job stays recoverable through the PEL until it is acked or reclaimed.
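You can watch the PEL directly to see this in practice. A minimal sketch using `@valkey/valkey-glide`; the stream and group names here are illustrative, since the exact key layout is an internal detail of glide-mq:

```ts
import { GlideClient } from "@valkey/valkey-glide";

// Hypothetical names for illustration; glide-mq's real key layout may differ.
const STREAM_KEY = "glide:{orders}:stream";
const GROUP = "workers";

const client = await GlideClient.createClient({
  addresses: [{ host: "localhost", port: 6379 }],
});

// XPENDING summary: count of claimed-but-unacked entries, min/max IDs,
// and per-consumer counts. These entries persist with the stream data.
const pending = await client.customCommand(["XPENDING", STREAM_KEY, GROUP]);
console.log(pending);

client.close();
```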
## Persistence Modes

### RDB snapshots

RDB snapshots write point-in-time dumps on a schedule such as `save 60 10000`.
Pros:
- low write overhead
- simple to operate
Trade-offs:
- A crash can lose every write since the last successful snapshot
- Not ideal if you need queue state to survive process or host failure with a small loss window
For queues, RDB-only is usually a recovery/snapshot tool, not the primary durability strategy.
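For contrast, an RDB-only configuration looks like the following; the snapshot schedule values are examples to tune, not recommendations:

```
# valkey.conf — RDB-only persistence (schedule values are examples)
appendonly no
save 900 1        # snapshot if >= 1 write in 15 minutes
save 300 100      # snapshot if >= 100 writes in 5 minutes
save 60 10000     # snapshot if >= 10000 writes in 1 minute
dir /var/lib/valkey
dbfilename dump.rdb
```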
### AOF
AOF appends write operations to a log and is the usual recommendation for queue workloads.
Common `appendfsync` choices:

- `everysec`: better performance, but a crash can lose roughly the last second of acknowledged writes
- `always`: strongest persistence, but highest latency / lowest throughput
- `no`: lowest durability; usually not appropriate for production queues
### Hybrid AOF + RDB preamble
If your Valkey version supports it, keeping AOF enabled with the RDB preamble gives faster restarts while preserving the operational model of AOF.
## Recommended Valkey Config
For most production queue workloads, start here and tune from measurement:
```
appendonly yes
appendfsync everysec

# If supported by your Valkey version:
aof-use-rdb-preamble yes
```

Then choose RDB snapshots based on your backup / restore needs, for example:

```
save 60 10000
```

If AOF is already enabled, treat RDB snapshots as optional and tune them against your write volume and save overhead. Frequent background saves can add fork / persistence pressure on busy queue workloads.
If your queue cannot tolerate the `everysec` loss window, move to `appendfsync always` and benchmark the throughput cost before rollout.
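If you want workers to fail fast when persistence is misconfigured, a startup check along these lines can help. This is a sketch using `CONFIG GET` through `@valkey/valkey-glide`, not a built-in glide-mq feature:

```ts
import { GlideClient } from "@valkey/valkey-glide";

const client = await GlideClient.createClient({
  addresses: [{ host: "localhost", port: 6379 }],
});

// CONFIG GET replies as a flat [name, value] array on RESP2 or a map on
// RESP3; normalize before checking.
const reply = await client.customCommand(["CONFIG", "GET", "appendonly"]);
const appendonly = Array.isArray(reply)
  ? String(reply[1])
  : String((reply as Record<string, unknown>)["appendonly"]);

if (appendonly !== "yes") {
  // Refuse to start workers against a non-durable Valkey.
  throw new Error("appendonly=no: queue state will not survive a crash");
}

client.close();
```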
## Restart Behavior
With persistence enabled:
- waiting jobs remain in the stream
- delayed and prioritized jobs remain in the scheduled sorted set
- completed / failed jobs remain in their sorted sets and hashes
- consumer groups and pending entries survive with the stream data
- the Valkey function library survives in the dataset and is also re-ensured by clients on startup
glide-mq additionally re-ensures the function library when clients connect or reconnect, including after topology changes and cluster failover.
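To confirm the library is present after a restore, `FUNCTION LIST` filters by library name; a quick sketch via `customCommand`:

```ts
import { GlideClient } from "@valkey/valkey-glide";

const client = await GlideClient.createClient({
  addresses: [{ host: "localhost", port: 6379 }],
});

// Lists only the glidemq library; an empty reply means the library is
// missing and will be re-ensured by the next glide-mq client to connect.
const libs = await client.customCommand([
  "FUNCTION", "LIST", "LIBRARYNAME", "glidemq",
]);
console.log(libs);

client.close();
```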
## Worker Crash Behavior
If a worker crashes mid-processing:
- the job has already been claimed into the consumer group's pending entries list
- it is not removed from Valkey just because the worker died
- another worker's scheduler can reclaim the stalled entry via `XAUTOCLAIM`
- after enough stalled-recovery cycles, glide-mq moves the job to `failed` instead of leaving it in limbo
This is an important difference from older queue designs that rely on destructive pops and app-side lock bookkeeping.
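To see the reclaim primitive in isolation, the sketch below issues a raw `XAUTOCLAIM`; the key, group, and consumer names are illustrative, and glide-mq's stalled-recovery loop performs the equivalent internally:

```ts
import { GlideClient } from "@valkey/valkey-glide";

const client = await GlideClient.createClient({
  addresses: [{ host: "localhost", port: 6379 }],
});

// Claim entries that have been pending for >= 30s, scanning from ID 0-0.
const reclaimed = await client.customCommand([
  "XAUTOCLAIM", "glide:{orders}:stream", "workers", "rescuer",
  "30000", "0-0", "COUNT", "10",
]);
console.log(reclaimed);

client.close();
```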
## Cluster Failover Behavior
glide-mq works with Valkey cluster replication and re-ensures its function library on reconnect/failover, but durability during failover still depends on the underlying replication and persistence settings.
Important caveat:
- Valkey replication is asynchronous
- A primary failure can still lose recently acknowledged writes that were not yet replicated or fsynced
That means failover improves availability, but it does not by itself create exactly-once or zero-loss guarantees.
## Delivery Semantics
glide-mq is designed for at-least-once processing:
- A job may be redelivered after worker crash, connection loss, or stalled reclaim
- processors should therefore be idempotent whenever side effects matter
glide-mq does not guarantee:
- exactly-once execution
- atomic coordination with external side effects
- zero-loss behavior independent of Valkey persistence / replication settings
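Redelivery after a crash or reclaim means handlers with external side effects should be idempotent. A minimal sketch, assuming a BullMQ-style processor signature and a hypothetical `chargeCard` side effect; glide-mq's actual processor API may differ:

```ts
// Idempotent processor sketch. The Job shape and processor signature are
// assumptions, not glide-mq's actual API.
interface Job {
  id: string;
  data: { orderId: string; amount: number };
}

// Stand-in for a durable dedup store (a database table, a SET NX key, ...).
const processed = new Set<string>();

async function processPayment(job: Job): Promise<void> {
  // Key the side effect on a business identifier, not the delivery attempt,
  // so a redelivered job becomes a no-op instead of a double charge.
  const key = `payment:${job.data.orderId}`;
  if (processed.has(key)) return;

  await chargeCard(job.data.orderId, job.data.amount); // hypothetical external call
  processed.add(key);
}

async function chargeCard(orderId: string, amount: number): Promise<void> {
  // placeholder for the real, non-idempotent side effect
}
```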
## Comparison at a Glance
| System | Typical durability model | Delivery semantics | Main trade-off |
|---|---|---|---|
| glide-mq + Valkey AOF (`everysec`) | AOF-backed queue state with a small crash window | At-least-once | Lower write cost than `always`, in exchange for a small possible loss window on hard crash |
| glide-mq + Valkey AOF (`always`) | Strongest Valkey persistence mode | At-least-once | Smallest persistence window, in exchange for higher write cost |
| pg-boss / PostgreSQL | Backed by PostgreSQL WAL / transaction durability | At-least-once | Strong database durability, in exchange for PostgreSQL operational and latency tradeoffs |
| RabbitMQ with durable queues + persistent messages | Broker durability when queues/messages are durable and producers use publisher confirms appropriately | At-least-once | Mature broker durability model, in exchange for more broker-specific configuration surface |
Compared with BullMQ-style Redis queues, glide-mq still depends on the same underlying Valkey durability settings, but its in-flight recovery model is built on Streams consumer groups, the PEL, and `XAUTOCLAIM` rather than a separate lock-renew / stall-detection cycle.
## Feature-Specific Durability
The core persistence guarantees above apply to all queue state. Several features have additional durability characteristics worth noting.
### Schedulers

Scheduler configurations persist in the `glide:{queueName}:schedulers` hash. Each field maps a scheduler name to its next-run timestamp. These entries survive restart with any persistence mode that captures hash writes. When a worker reconnects after a restart, it resumes its promotion loop and picks up any schedulers whose next-run time has already passed.
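Because the hash key pattern is documented, scheduler state is easy to inspect after a restart; a quick sketch (the queue name is an example):

```ts
import { GlideClient } from "@valkey/valkey-glide";

const client = await GlideClient.createClient({
  addresses: [{ host: "localhost", port: 6379 }],
});

// Each field maps a scheduler name to its next-run timestamp.
const schedulers = await client.customCommand([
  "HGETALL", "glide:{orders}:schedulers",
]);
console.log(schedulers);

client.close();
```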
### Metrics

Time-series metrics data is stored in `glide:{queueName}:metrics:*` keys. Data is retained for 24 hours and auto-trimmed. This data survives restart if AOF is enabled; with RDB-only, recent metrics may be lost depending on snapshot frequency.
### Group State

Token bucket state, rate limit windows, and ordering sequence counters are stored in `glide:{queueName}:group:{key}` hash keys. All fields (active count, `maxConcurrency`, `rateMax`, `rateDuration`, window start, token count, refill rate, etc.) are persisted as part of the hash and survive restart with the rest of the dataset.
### Deduplication

Dedup entries are stored in the `glide:{queueName}:dedup` hash, with field-level TTL semantics managed by the Lua functions. These entries persist with AOF. After a restart, dedup windows that have not expired continue to prevent duplicate job submission.
### Step Jobs

When a processor calls `moveToDelayed`, the job is moved to the scheduled ZSet with a new timestamp score and `data.step` is updated atomically within the same Lua function call. The delayed job survives restart like any other entry in the scheduled ZSet, and the promotion loop re-queues it once the timestamp is reached.
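A sketch of the pattern; `moveToDelayed` and `data.step` are taken from the description above, while the exact signature and surrounding types are assumptions:

```ts
// Multi-step processor sketch. The Job shape is hypothetical.
interface StepJob {
  data: { step: string; orderId: string };
  // Assumed signature: delay until a timestamp and record the next step,
  // which the Lua function applies atomically per the description above.
  moveToDelayed(runAtMs: number, nextStep: string): Promise<void>;
}

async function processOrder(job: StepJob): Promise<void> {
  switch (job.data.step) {
    case "reserve":
      await reserveStock(job.data.orderId); // hypothetical first step
      // The job lands in the scheduled ZSet with the new timestamp and the
      // updated step in one atomic call, so a crash or restart cannot
      // observe a half-applied transition.
      await job.moveToDelayed(Date.now() + 60_000, "charge");
      return;
    case "charge":
      await chargeOrder(job.data.orderId); // hypothetical final step
      return;
  }
}

async function reserveStock(orderId: string): Promise<void> {}
async function chargeOrder(orderId: string): Promise<void> {}
```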
### Broadcast

Stream entries in broadcast queues are intentionally not deleted after processing (no `XDEL`). Entries are retained according to the `maxMessages` `XTRIM` policy. Each consumer group independently tracks its own read offset. After a restart, consumer groups resume from their last acknowledged position, so no subscriber misses messages that were added while it was down.
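After a restart you can confirm each group's position with `XINFO GROUPS`; a quick sketch (the stream key name is illustrative):

```ts
import { GlideClient } from "@valkey/valkey-glide";

const client = await GlideClient.createClient({
  addresses: [{ host: "localhost", port: 6379 }],
});

// Each group reports its own last-delivered-id, so you can verify every
// subscriber group resumes from where it left off.
const groups = await client.customCommand([
  "XINFO", "GROUPS", "glide:{events}:stream",
]);
console.log(groups);

client.close();
```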
## Practical Guidance

- If your priority is throughput with strong-enough durability, use AOF with `appendfsync everysec`.
- If your priority is minimizing acknowledged-write loss on crash, use AOF with `appendfsync always` and benchmark first.
- If your jobs trigger irreversible external side effects, make handlers idempotent and store your own deduplication / business keys.
- Treat failover as an availability feature, not proof of zero-loss durability.