We have an Ayon server set up for 300+ artists with a sync process to Kitsu, and we are experiencing timeout errors during periods of heavy load.
The Ayon server is running on a machine with 40 CPUs, but looking at `top` on the server I can see it's only using a single process, since the number of configured Gunicorn workers is the default of 1 (see here), and that one CPU is very regularly maxed out at 100% usage.
We are in the process of testing the AYON_SERVER_WORKERS environment variable to scale up the number of workers, but we saw a comment on Discord from @martin.wacker here suggesting not to use this setting, and instead to scale at the Docker Compose level (with multiple server images and a load balancer, I'm guessing? Not sure). I can also see that the granian startup command has no option for setting a number of workers.
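For context, this is roughly how we are testing it, via a Compose override (a sketch only; the image name/tag and the worker count here are my assumptions, not a recommendation):

```yaml
# docker-compose.override.yml (sketch; image name and tag are assumptions)
services:
  server:
    image: ynput/ayon:latest
    environment:
      # Default is 1 worker; experimenting with higher values
      # to make use of more of the 40 available CPUs.
      - "AYON_SERVER_WORKERS=8"
```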
@martin.wacker I’ve had a look at the following PR and the separation of maintenance processes totally makes sense, but I don’t quite understand why we shouldn’t be running multiple workers for the server through Gunicorn - could you please clarify that a bit?
Any information from others about their experience scaling their Ayon server to handle heavy load would also be very much appreciated.
On this topic, I have some notes from several discussions on Discord. (I only saved the options to my notes and didn't delve into implementing any of them, but I hope this helps.)
Approach one: use one Docker stack with replicas and a reverse proxy/load balancer (one AYON instance, one IP and port, but multiple processes under the hood).
Approach two: run the DB separately, with multiple Docker stacks connected to the same DB (in this case each AYON instance can have a different IP or port, but they all work against the same data).
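As a rough sketch of approach one (untested; the image name, internal port, and labels are assumptions based on a standard Traefik + Compose setup, not from Ayon docs):

```yaml
services:
  proxy:
    image: traefik:v2.10
    command:
      - "--providers.docker=true"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"   # single public entry point for the whole stack
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
  server:
    image: ynput/ayon:latest        # assumed image name
    deploy:
      replicas: 4                   # several server processes under the hood
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ayon.rule=PathPrefix(`/`)"
      - "traefik.http.services.ayon.loadbalancer.server.port=5000"  # assumed internal port
```

Note the `server` service publishes no host ports itself, so the replicas don't collide; Traefik is the only thing listening publicly and it round-robins across the replicas.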
Do you know why the approach of spinning up multiple instances of the server with a reverse proxy is preferable to running multiple Gunicorn workers for the sake of concurrency?
It has the high-availability advantage of being able to spin down a server instance and re-route traffic to another instance, of course. But purely in terms of adding more workers to deal with the CPU bottleneck of a single core, I would have imagined that using Gunicorn workers would be the most straightforward solution, since that's exactly what it's designed for. But I might be missing something particular to the setup of Ayon.
This is far from being production proven or ready…
But for testing, this uses Traefik as a load balancer, and when you scale the server instances this should in theory improve performance (though we are just getting started with Ayon here, so I have no pressure on the system whatsoever). Workers can also be scaled in this setup.
It does not have SSL, so it is just an extension of the default from ayon-docker.
And I am not an expert in any of this, just trying to figure it out at the moment.
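If it helps anyone reproduce this, scaling the instances in such a Traefik setup is just a Compose flag (assuming the service is named `server`; Traefik picks up the new containers via its Docker provider):

```shell
# Start the stack and run four server replicas behind the load balancer
docker compose up -d --scale server=4

# Scale back down later
docker compose up -d --scale server=1
```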
I am more interested in migration and maintenance. I just saw that the PR Server: Decoupled maintenance procedure by martastain · Pull Request #447 · ynput/ayon-backend · GitHub has been merged, and I would now go ahead and implement that. As I understand it, the maintenance part would "block" or "hinder" the scaling benefits if it is not detached as a separate container. Or does it just mean that during scaling the application stays available thanks to the maintenance instance… maybe someone can shed some light on this.
This could be a separate post; I'm not sure there's much to do, as AYON should have a default maintenance procedure that performs cleanup, as mentioned in this post: server storage management - uploaded reviewables.
The linked PR above is for larger environments that need to separate maintenance into its own container or trigger it via cron jobs.