Incident affecting Proctoring and Taking Exam functionalities
Incident Report for ProctorExam
Postmortem

From Monday 7 November, customers reported that live streams were being lost (in the live proctoring view or the support view) and, on one occasion, that test takers had to wait a long time before they could access the exam setup.

Our investigation showed that the nodes in our WebRTC clusters were running at very high CPU usage, up to 100%, which caused all processes on these nodes to stall. This included the WebRTC media servers (“kms pods”), which are responsible for handling and recording the live streams of the candidates.

Impact:

  • Loss of live streams: proctors and support agents were unable to subscribe to the streams to view them.
  • Loss of recordings: streams were interrupted mid-exam and nothing was recorded after that point.
  • In most cases test takers would not notice any issue, and could continue taking their exams without interruption.
  • Although these issues were most prominent on 7-9 November, we believe they may already have affected exams before that, since the launch of the new mobile app on 31 October. The issue only occurs when most of the exams (at least 75%) being taken at a given moment also include the mobile stream.
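
As an illustration only, the 75% condition in the last point can be written as a simple check. This is a minimal sketch: `ActiveExam`, `mobile_share` and `at_risk` are hypothetical names chosen for this example and do not describe our actual monitoring code.

```python
from dataclasses import dataclass

# "at least 75%" is the threshold stated in this report.
MOBILE_SHARE_THRESHOLD = 0.75


@dataclass
class ActiveExam:
    """Hypothetical record for an exam in progress at a given moment."""
    exam_id: str
    has_mobile_stream: bool


def mobile_share(exams):
    """Fraction of the concurrently running exams that include a mobile stream."""
    if not exams:
        return 0.0
    return sum(e.has_mobile_stream for e in exams) / len(exams)


def at_risk(exams):
    """True when the share of exams with a mobile stream reaches the threshold."""
    return mobile_share(exams) >= MOBILE_SHARE_THRESHOLD


# Three out of four concurrent exams use the mobile stream: the condition is met.
snapshot = [
    ActiveExam("a", True),
    ActiveExam("b", True),
    ActiveExam("c", True),
    ActiveExam("d", False),
]
print(mobile_share(snapshot), at_risk(snapshot))
```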

We first suspected that a Kubernetes update we performed on 31 October was the cause, and allocated more CPU to each kms pod. This change was deployed on Tuesday 8 November, after 18:00 UTC.

On 9 November, the issues were still happening. We deployed additional metrics to our clusters and found that CPU usage was much higher than we would expect. We ruled out the Kubernetes update as the cause, since there is a lot of experience within the company with this version of Kubernetes and no one had reported CPU issues.

We then identified the new version of the ProctorExam mobile app as the root cause of the issues. With the new version, test takers stream in higher quality (higher resolution), which makes CPU usage much higher than before.
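
To give a feel for why a higher resolution costs more CPU, the sketch below compares the number of pixels per second a media server has to process for two stream profiles. It rests on two assumptions: the resolutions and frame rate are hypothetical placeholders (the actual values used by the mobile app are not stated in this report), and pixel throughput is only a rough proxy for media processing cost, not a measurement of our kms pods.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for one video stream."""
    return width * height * fps


def relative_load(old_profile, new_profile):
    """Ratio of pixel throughput between two (width, height, fps) profiles."""
    return pixel_rate(*new_profile) / pixel_rate(*old_profile)


# Hypothetical profiles, for illustration only.
old_profile = (640, 480, 15)
new_profile = (1280, 720, 15)

# With these example values the new profile carries 3x the pixels per second.
print(relative_load(old_profile, new_profile))
```

Under this rough model, the extra load is multiplied by the number of concurrent mobile streams, which fits the observation that the problem appears when most exams include the mobile stream.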

Remediation:

  • On 9 November, after 18:00 UTC, we deployed a further change to increase the CPU allocation of the kms pods even more, which will prevent these issues in the future.
  • We are developing an update of the ProctorExam mobile app so that the kms pods consume less CPU. It will be released over the next few days.
  • Once we confirm that CPU usage has indeed dropped, we will revert the allocation changes.
  • Learning: We will include specific CPU usage tests in all future updates of the mobile app.
Posted Nov 10, 2022 - 11:26 UTC

Resolved
We have deployed a fix overnight that resolved the incident.
Posted Nov 09, 2022 - 08:13 UTC
Identified
We are experiencing some degraded performance affecting Proctoring and Taking Exam functionalities at the moment. The issue has been identified. Test takers can still start their exams, but recordings and live proctoring might be impacted. We are releasing a fix later today to address this issue.
Posted Nov 08, 2022 - 15:05 UTC
This incident affected: European Cluster EU2 (EU2 - Taking Exams, EU2 - Proctoring), US Cluster US1 (US1 - Taking Exams, US1 - Proctoring), Canadian Cluster CA1 (CA1 - Taking Exams, CA1 - Proctoring), and European Cluster EU3 (EU3 - Taking Exams, EU3 - Proctoring).