Celery Tasks¶

Queues¶

We have two named queues that tasks are routed to: fast and slow. These names refer to the expected execution duration of the task (or the subtasks it dispatches). If a task is expected to run for longer than a minute, or is scheduled at an interval of several minutes or more, it should be routed to the slow queue. Otherwise, tasks default to the fast queue.
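The routing rule above can be sketched as a small helper. This is an illustration only: the function name `choose_queue` and the 300-second interval threshold are assumptions (the text says only "several minutes"), not values taken from the codebase.

```python
FAST_QUEUE = "fast"
SLOW_QUEUE = "slow"

# The 60 s runtime limit comes from the rule above. The 300 s interval is an
# assumed stand-in for "several minutes" and should be adjusted to taste.
MAX_FAST_RUNTIME_S = 60
MIN_SLOW_INTERVAL_S = 300


def choose_queue(expected_runtime_s, schedule_interval_s=None):
    """Return the name of the queue a task should be routed to."""
    # Long-running tasks always go to the slow queue.
    if expected_runtime_s > MAX_FAST_RUNTIME_S:
        return SLOW_QUEUE
    # Infrequently scheduled tasks also go to the slow queue.
    if schedule_interval_s is not None and schedule_interval_s >= MIN_SLOW_INTERVAL_S:
        return SLOW_QUEUE
    # Everything else defaults to the fast queue.
    return FAST_QUEUE
```

A task expected to take 5 seconds and scheduled every 30 seconds would land on fast; the same task scheduled every 10 minutes would land on slow.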

Workers¶

CeleryWorkers: services the slow queue only
CeleryWorkers2: services the fast and slow queues (in that priority order)
CeleryWorkers3: services the fast queue only

The queue(s) a worker services are defined by the CELERY_QUEUES environment variable as a comma-delimited list, for example fast,slow. If not set, the default is fast,slow.

The concurrency of a worker (how many task slots it has) is defined by the CELERY_CONCURRENCY environment variable and is currently set to 12 in production (default is 8).
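As a rough sketch of how these two variables and their defaults could be resolved at worker start-up: the helper name `worker_settings` and the choice to treat an empty value as unset are assumptions, and the real start-up code may differ.

```python
import os

DEFAULT_QUEUES = "fast,slow"   # documented default for CELERY_QUEUES
DEFAULT_CONCURRENCY = 8        # documented default for CELERY_CONCURRENCY


def worker_settings(env=None):
    """Return (queues, concurrency) from the environment, applying defaults."""
    env = os.environ if env is None else env
    # Assumption: an empty string is treated the same as an unset variable.
    raw_queues = env.get("CELERY_QUEUES") or DEFAULT_QUEUES
    queues = [name.strip() for name in raw_queues.split(",") if name.strip()]
    concurrency = int(env.get("CELERY_CONCURRENCY") or DEFAULT_CONCURRENCY)
    return queues, concurrency
```

The resulting values correspond to the `-Q` and `--concurrency` options of the `celery worker` command.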

Broker¶

The Celery infrastructure includes a dedicated Redis instance used as the broker, called CeleryBroker. This instance does not persist data between restarts: it does not write an append-only file (AOF), so anything held in it is lost when it restarts.

Accessing the Broker¶

The broker is reachable at crossover.proxy.rlwy.net:28721; the credentials are in 1Password.

CLI:

redis-cli -u "redis://<username>:<password>@crossover.proxy.rlwy.net:28721"

TablePlus: Create a new Redis connection and use the details above.
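If the password contains characters such as `@`, `:` or `/`, the connection URL needs them percent-encoded. A minimal sketch of building the URL safely follows; the helper name `broker_url` is hypothetical, and the username and password are placeholders for the values stored in 1Password.

```python
from urllib.parse import quote

BROKER_HOST = "crossover.proxy.rlwy.net"
BROKER_PORT = 28721


def broker_url(username, password):
    """Build a redis:// URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/".
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"redis://{user}:{secret}@{BROKER_HOST}:{BROKER_PORT}"
```

The returned string can be passed to `redis-cli -u` or to a client library such as redis-py via `redis.Redis.from_url(...)`.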

Scheduler¶

The scheduler service (redbeat) is called CeleryBeat. It executes no tasks; it is solely responsible for scheduling tasks and enqueuing them at the appropriate time. CeleryBeat is configured to restart automatically on failure (up to 10 times), so stopping it at the process level is not sufficient; stop the Railway service itself to prevent it from restarting.

Task Deduplication¶

All tasks use QueueOnce (via celery-once) to prevent duplicate executions. If a task is already running or queued when a new execution is triggered, the new execution is silently dropped (when configured with graceful=True). The deduplication lock is stored in the same Redis broker instance.
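The drop-on-duplicate behaviour can be illustrated with a toy in-memory model. This is not celery-once's implementation: the class `OnceLocks`, the function `enqueue_once` and the one-hour timeout are stand-ins chosen for the example, and the real lock lives in Redis.

```python
import time


class AlreadyQueued(Exception):
    """Raised when a lock for the same task key is still held."""


class OnceLocks:
    """Toy in-memory stand-in for the Redis-backed deduplication lock."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}      # lock key -> time at which the lock expires
        self._clock = clock

    def acquire(self, key, timeout_s):
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            raise AlreadyQueued(key)
        self._expiry[key] = now + timeout_s

    def release(self, key):
        # Called when the task finishes, so the next run can be enqueued.
        self._expiry.pop(key, None)


def enqueue_once(locks, key, timeout_s=3600, graceful=True):
    """Return True if the task was enqueued, False if dropped as a duplicate."""
    try:
        locks.acquire(key, timeout_s)
    except AlreadyQueued:
        if graceful:
            return False   # silently dropped, as with graceful=True
        raise
    return True
```

The model also shows why a lock that is never released blocks later runs until it expires or is deleted by hand.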

Observability¶

The primary source of information regarding the performance and status of the Celery infrastructure is a Flower instance (service name: CeleryFlower) accessible at https://celeryflower-production-8da8.up.railway.app/ (credentials are in 1Password).

The primary screen lists the status of the three workers. In normal operation all three should show a status of Online. The number of tasks each worker is currently executing is also shown and updates in near real time.

Clicking on the name of a worker and selecting the Tasks tab displays a list of tasks the worker has executed and a total count for each. It also shows the tasks currently being executed and those reserved to be executed next.

Clicking on Broker in the main toolbar shows the queues (detailed above). The Messages column shows the queue depth for each queue. Ideally, both queues should be at or near 0. However, during busy periods (many tasks processing large amounts of data) it is normal for queue depth to be non-zero. In these circumstances, we would expect the queue depth to spike as tasks are enqueued and then decrease gradually as they are processed. If the queue depth steadily increases without returning to 0, the workers may be saturated and intervention may be required.
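The difference between a healthy spike and a growing backlog can be expressed as a simple check over a series of queue depth readings. This is a sketch under assumptions: the function name `backlog_is_growing` and the five-sample window are arbitrary, and the readings would have to be collected by hand or from a monitoring source.

```python
def backlog_is_growing(depths, min_samples=5):
    """Return True if recent depth samples (oldest first) rise without draining."""
    if len(depths) < min_samples:
        return False   # not enough data to judge a trend
    recent = depths[-min_samples:]
    # A healthy queue returns to 0 at some point in the window.
    never_drained = min(recent) > 0
    # "Steadily increasing": no sample lower than the one before, and a net rise.
    non_decreasing = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    rising = non_decreasing and recent[-1] > recent[0]
    return never_drained and rising
```

A series such as 0, 40, 25, 10, 0 is a normal spike, while 5, 20, 60, 140, 300 is the pattern that warrants intervention.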

Logfire Dashboard¶

A Logfire dashboard is available at https://logfire-us.pydantic.dev/paperrun/api/dashboards/celery and provides a richer, time-series view of the Celery infrastructure. It includes:

Task runtime graph — per-task runtime over time, colour-coded by task name, useful for spotting tasks that are running slower than usual.
Total Task Event Rate — ops/sec broken down by event type (task-received, task-started, task-succeeded, task-failed, task-revoked, task-sent). A rising task-failed line warrants investigation.
Average Task Runtime — overall average runtime across all tasks, useful for detecting system-wide slowdowns.
Queue Depths — current depth of the fast, slow, and default Celery queues.
Broker — number of workers online, broker memory usage (MB), and connected client count.

Potential Issues and Resolution Steps¶

Queue depth is high and not draining¶

Check Flower first to confirm workers are online and actively processing. If they are, the system may just be under load — monitor and allow it to drain. If the queue depth is growing steadily, consider:

Increase worker concurrency — raise CELERY_CONCURRENCY in the Railway env vars for the relevant worker(s) and redeploy. This gives each worker more task slots without requiring a new service.
Scale up worker replicas — add another instance of the relevant worker service in Railway (e.g. a second CeleryWorkers3 for fast queue pressure).
Stop CeleryBeat — halts new task enqueuing at the Railway service level, allowing the existing queue to drain without new work arriving. Note: CeleryBeat auto-restarts on process failure, so it must be stopped at the service level in Railway, not just the process.
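To judge how much extra concurrency or how many replicas are needed, a back-of-the-envelope drain estimate can help. The sketch below assumes every task slot is busy on the queue in question and that tasks take a uniform average runtime; the function name and parameters are illustrative, and it ignores CPU and memory limits.

```python
def estimated_drain_seconds(queue_depth, task_slots, avg_runtime_s, arrival_rate_per_s=0.0):
    """Estimate seconds until the queue is empty, or None if it never drains."""
    if task_slots <= 0 or avg_runtime_s <= 0:
        raise ValueError("task_slots and avg_runtime_s must be positive")
    # Tasks completed per second when every slot is busy.
    service_rate = task_slots / avg_runtime_s
    # Net rate at which the backlog shrinks once new arrivals are accounted for.
    net_rate = service_rate - arrival_rate_per_s
    if net_rate <= 0:
        return None   # arrivals match or exceed capacity
    return queue_depth / net_rate
```

For example, 600 queued tasks on 12 slots at 2 seconds each drain in about 100 seconds with no new arrivals. If tasks arrive at 6 per second, the same worker never catches up, which is the case where stopping CeleryBeat or adding replicas matters.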

Workers are consuming excessive resources¶

Reduce worker concurrency — lower CELERY_CONCURRENCY in Railway env vars and redeploy the affected worker(s).
Stop workers for the affected queue — e.g. stop CeleryWorkers3 to halt all fast queue processing, or CeleryWorkers to halt slow queue processing.
Stop CeleryBeat to prevent further task enqueuing while you investigate.

A task is stuck or hanging¶

If a task appears to be running for an abnormally long time and is occupying a worker slot, it can be revoked individually via Flower:

In Flower, click on the name of the worker that is executing the task.
Select the Tasks tab — active tasks are listed in the Active section.
Click the UUID of the stuck task to open the task detail page.
Press the Revoke button in the top right of the page.

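Note: revoking a task tells the worker to discard it. Celery interrupts a task that is already executing only when termination is requested, so use Flower's Terminate action if one is offered for the task. Because tasks use QueueOnce deduplication, the lock may not be released immediately. If the task fails to re-run at its next scheduled interval, the lock can be cleared manually via redis-cli: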

# Lock keys follow the pattern: qo_<task_name>
# For example, to clear the lock for the sync_orders.enqueue_order_sync task:
DEL qo_sync_orders.enqueue_order_sync

# For tasks that use the 'keys' option (scoped per-argument), kwargs are appended:
# e.g. DEL qo_sync_orders.sync_orders_for_org_organization_id_<id>

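To list all current lock keys: KEYS qo_*. List the keys first and copy the exact name, because the suffix depends on the task's arguments. KEYS blocks Redis while it walks the whole keyspace; on a busy broker, redis-cli --scan --pattern 'qo_*' (with the same -u URL) iterates incrementally instead.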
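The same lookup can be scripted. The sketch below accepts any client object that exposes `scan_iter(match=...)`, as redis-py's `redis.Redis` does; the helper name `find_lock_keys` is hypothetical. It only lists keys, leaving deletion as a deliberate second step.

```python
def find_lock_keys(client, task_name):
    """List celery-once lock keys for one task using SCAN rather than KEYS."""
    # Prefix match: this also returns keys of other tasks whose names start
    # with the same string, so review the result before deleting anything.
    pattern = f"qo_{task_name}*"
    keys = []
    for key in client.scan_iter(match=pattern):
        # redis-py returns bytes unless decode_responses=True was set.
        keys.append(key.decode() if isinstance(key, bytes) else key)
    return sorted(keys)


# Usage with redis-py (assumed to be installed; the URL is a placeholder):
#   import redis
#   client = redis.Redis.from_url("redis://<username>:<password>@<host>:<port>")
#   keys = find_lock_keys(client, "sync_orders.enqueue_order_sync")
#   client.delete(*keys)   # only after checking the list
```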

Clearing a queue¶

Connect to CeleryBroker via redis-cli (see Accessing the Broker) and delete the queue:

# Clear the fast queue:
DEL fast

# Clear the slow queue:
DEL slow

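Note: this permanently discards all enqueued tasks. If CeleryBeat was stopped, it will resume scheduling new tasks when restarted. QueueOnce locks held by the discarded tasks are not removed with the queue, so those tasks may be skipped until their locks expire or are cleared manually (see A task is stuck or hanging).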
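When clearing a queue from a script, it is useful to record how many messages were thrown away. The sketch below assumes each queue is stored as a Redis list under its own name, which is what the DEL commands above imply; `clear_queue` is a hypothetical helper and `client` is any object with `llen` and `delete` methods, such as a redis-py client.

```python
KNOWN_QUEUES = ("fast", "slow")


def clear_queue(client, queue_name):
    """Delete a queue key and return how many messages were discarded."""
    # Guard against deleting an unrelated key by mistake.
    if queue_name not in KNOWN_QUEUES:
        raise ValueError(f"unknown queue: {queue_name!r}")
    # The count is approximate: tasks may be added or consumed between
    # reading the length and deleting the key.
    discarded = client.llen(queue_name)
    client.delete(queue_name)
    return discarded
```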

De-scheduling a single task without stopping the scheduler¶

To prevent a specific task from being scheduled again without stopping CeleryBeat or any workers, delete its redbeat key directly from the broker:

# Keys follow the pattern: redbeat:<task_name>
# For interval-based tasks, the interval is appended as a suffix, e.g.:
#   redbeat:sync_orders.enqueue_order_sync-300.0s
# For crontab-based tasks, there is no interval suffix:
#   redbeat:automator.send_campaigns

# To find the exact key:
KEYS redbeat:*

# Then delete it:
DEL redbeat:<task_name>

This removes the task from redbeat's schedule so it will no longer be enqueued. Any tasks that are already queued or currently running are unaffected. The task will not be rescheduled unless CeleryBeat is restarted (which re-reads the schedule from the application configuration).
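Picking the right key out of the KEYS output can be automated with a small matcher that follows the naming pattern described above. This is a sketch: `matching_redbeat_keys` is a hypothetical helper, and the suffix format is inferred from the single interval example shown (`-300.0s`).

```python
import re


def matching_redbeat_keys(all_keys, task_name, prefix="redbeat:"):
    """Return the keys that belong to task_name, with or without an interval suffix."""
    # Optional suffix such as "-300.0s" for interval-based tasks.
    pattern = re.compile(re.escape(prefix + task_name) + r"(?:-\d+(?:\.\d+)?s)?")
    # fullmatch avoids catching other tasks that merely share a prefix.
    return [key for key in all_keys if pattern.fullmatch(key)]
```

Given the two example keys above, looking up `sync_orders.enqueue_order_sync` returns only the `-300.0s` key, and `automator.send_campaigns` returns only the unsuffixed key.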

Restarting the infrastructure¶

Restart workers before the scheduler to prevent back pressure: bring up CeleryWorkers, CeleryWorkers2, and CeleryWorkers3 first, then restart CeleryBeat.