On 13 July 2022, between 15:30 and 16:50 UTC (all times below in UTC), all our WebRTC clusters were unable to scale. This affected exam takers trying to start a system check and those starting an exam session. All clusters were affected, but only eu4 was scaling out at that time, so only exam takers on eu4 were actually affected. On all our other clusters, new exam takers were accommodated by existing nodes that did not have the issue.
At 15:30, new infrastructure was deployed to our production environments to help process the video recordings. This infrastructure had previously been deployed to our staging environment and tested there with no issues. It was also designed to be independent of the existing infrastructure and does not modify it.
At 15:40, customers started reporting long waits for exam takers entering the system check or the exam session. No other alarms were raised.
The engineering team responded immediately and escalated the issue to a P1 at 15:45. It was clear that the eu4 WebRTC cluster was not allowing new sessions to start correctly, but no cause could be identified immediately.
At 16:20, the root cause of the problem was identified: new nodes in our WebRTC clusters were unable to pull the Docker images they needed to run, failing with an I/O timeout. We then tested our other WebRTC clusters (eu1, ca2 and us2) and confirmed that the issue was present on all of them, prompting us to update our customer-facing status page accordingly.
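A lightweight connectivity probe on newly provisioned nodes could have surfaced this failure mode before exam takers did, since no other alarms were raised. Below is a minimal sketch of such a probe; the registry hostname is a placeholder, not our actual registry:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result on a node that is about to pull Docker images indicates
    a networking problem (e.g. DNS or routing broken by a VPC change)
    before the pull itself times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage at node startup; substitute the real registry host.
# if not can_reach("registry.example.com", 443):
#     raise RuntimeError("cannot reach Docker registry; check VPC endpoints/DNS")
```

Running this check during node bootstrap, and alerting on failure, turns a silent scaling stall into an explicit, attributable alarm.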
We then investigated whether the deployment of the new infrastructure could be related and found that it had added endpoints to the VPCs (Virtual Private Clouds) in which the WebRTC clusters are located. For reasons still unknown, these new endpoints prevented our nodes from pulling Docker images from the registry, something that does not happen in our staging VPC.
As soon as we removed the endpoints, the nodes started working again. The eu4 cluster was fixed at 16:45, and all other clusters were working again by 16:50.
As a follow-up: the endpoints added by the new infrastructure have been removed from all production VPCs, so there is no risk of this specific issue happening again.