From Monday 7 November, we began receiving reports from customers that live streams were being lost (in the live proctoring view or the support view) and, on one occasion, that test takers had to wait a long time before they could access the exam setup.
Our investigation showed that the nodes in our WebRTC clusters were using a lot of CPU, up to 100%, which caused all processes on these nodes to stall. This included the WebRTC media servers (“kms pods”), which are responsible for handling and recording the live streams of the candidates.
Impact:
We first suspected that this was caused by a Kubernetes update we had performed on 31 October, so we allocated more CPU to each kms pod. This change was deployed on Tuesday 8 November, after 18:00 UTC.
On 9 November, the issues were still occurring. We deployed additional metrics to our clusters and found that CPU usage was much higher than we would expect. We ruled out the Kubernetes update as the cause, since there is a lot of experience within the company with this version of Kubernetes and no one had reported CPU issues.
We have now identified the new version of the Proctorexam mobile app as the root cause of the issues. With the new version, test takers stream in higher quality (higher resolution), which causes CPU usage to be much higher than before.
Remediation: