We have an Ayon server set up for 300+ artists with a sync process to Kitsu, and we are experiencing timeout errors during periods of heavy load.
The Ayon server is running on a machine with 40 CPUs, but looking at `top` on the server I can see it's only using a single process, since the number of configured Gunicorn workers is the default of 1 (see here), and that one CPU is very regularly maxed out at 100% usage.
We are in the process of testing the AYON_SERVER_WORKERS environment variable to scale up the number of workers, but we saw a comment on Discord from @martin.wacker here suggesting not to use this setting, and instead to scale at the Docker Compose level (with multiple server images and a load balancer, I'm guessing? Not sure). I can also see that the granian startup command has no option for setting a number of workers.
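For context, this is roughly how we are testing it, via a Compose override (a sketch only; the image name/tag and the worker count here are my assumptions, not a recommendation):

```yaml
# docker-compose.override.yml (sketch; image name and tag are assumptions)
services:
  server:
    image: ynput/ayon:latest
    environment:
      # Default is 1 worker; experimenting with higher values
      # to make use of more of the 40 available CPUs.
      - "AYON_SERVER_WORKERS=8"
```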
@martin.wacker I’ve had a look at the following PR and the separation of maintenance processes totally makes sense, but I don’t quite understand why we shouldn’t be running multiple workers for the server through Gunicorn - could you please clarify that a bit?
Any information from others about their experience scaling their Ayon server to handle heavy load would also be very much appreciated.
On this topic, I have some notes from several discussions on Discord. (I only saved the options to my notes and didn't delve into implementing any of them, but I hope this helps.)
Approach one: use one Docker stack with replicas and a reverse proxy/load balancer (one AYON instance, one IP and port, but multiple processes under the hood).
Approach two: run the DB separately, with multiple Docker stacks connected to the same DB (in this case each AYON instance can have a different IP or port, but they all work against the same data).
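As a rough sketch of approach one (untested; the image name, internal port, and labels are assumptions based on a standard Traefik + Compose setup, not from Ayon docs):

```yaml
services:
  proxy:
    image: traefik:v2.10
    command:
      - "--providers.docker=true"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"   # single public entry point for the whole stack
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
  server:
    image: ynput/ayon:latest        # assumed image name
    deploy:
      replicas: 4                   # several server processes under the hood
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ayon.rule=PathPrefix(`/`)"
      - "traefik.http.services.ayon.loadbalancer.server.port=5000"  # assumed internal port
```

Note the `server` service publishes no host ports itself, so the replicas don't collide; Traefik is the only thing listening publicly and it round-robins across the replicas.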
Do you know why the approach of spinning up multiple instances of the server with a reverse proxy is preferable to running multiple Gunicorn workers for the sake of concurrency?
It has the high-availability advantage of being able to spin down a server instance and re-route traffic to another instance, of course. But purely in terms of adding more workers to deal with the CPU bottleneck of a single core, I would have imagined that using Gunicorn workers would be the most straightforward solution, since that's exactly what it's designed for. But I might be missing something particular to the setup of Ayon.
This is far from being production proven or ready…
But for testing, this uses Traefik as a load balancer, and when you scale the server instances this should in theory improve performance (though we are just getting started with Ayon here, so I have no pressure on the system whatsoever). Workers can also be scaled in this setup.
It does not have SSL, so it is just an extension of the default from ayon-docker.
And I am not an expert in any of this, just trying to figure it out at the moment.
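If it helps anyone reproduce this, scaling the instances in such a Traefik setup is just a Compose flag (assuming the service is named `server`; Traefik picks up the new containers via its Docker provider):

```shell
# Start the stack and run four server replicas behind the load balancer
docker compose up -d --scale server=4

# Scale back down later
docker compose up -d --scale server=1
```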
I am more interested in migration and maintenance. I just saw that the PR Server: Decoupled maintenance procedure by martastain · Pull Request #447 · ynput/ayon-backend · GitHub has been merged, and I would now go ahead and implement that. As I understand it, the maintenance part would "block" or "hinder" the scaling benefits if it is not detached as a separate container. Or does it just mean that during scaling the application stays available thanks to the maintenance instance… maybe someone can shed some light on this.
This could be a separate post; I'm not sure there's much to do, as AYON should have a default maintenance procedure that performs cleanup, as mentioned in this post: server storage management - uploaded reviewables.
The linked PR above is for larger environments that need to separate maintenance into its own container or trigger it via cron jobs.