Issue with Release on the EU environments
Incident Report for ProctorExam
Postmortem
  • On 30 May 2023, a minor release was deployed to all EU environments (clusters eu2 and eu3). Such a release should not cause any downtime, because our servers are deployed in rollout mode, i.e. a new server is started before the server running the previous version is terminated.
  • At 18:41 CEST, our internal alerting system notified us that 3 environments were no longer reachable.
  • We quickly identified that the affected servers had been evicted due to a lack of resources on our Kubernetes cluster, so we scaled the cluster manually to add resources.
  • The 3 environments were up and running again at 18:46 CEST.
  • Impact: 3 customers were affected, as their environments were unreachable for these 5 minutes. This would have affected test takers trying to open the system check of exam setup, as well as administrators, proctors and reviewers. Note: Test takers in the middle of an exam were not affected, and no video recordings were lost or interrupted.
  • Lead-up: Deployment of release 4.5.4.
  • Resolution: We scaled the cluster manually to provide the extra resources needed to host the servers.
  • Prevention measures: A deprecation warning caused our deployment script to update more environments than normal, which led to the resource shortage. Immediate remedy: Remove the cause of the deprecation warning. Longer-term remedy: We will update the script to take warnings into account.
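
As an illustration of the longer-term remedy above, here is a minimal Python sketch of a deployment step that takes warnings into account. The report does not describe how the deprecation warning influenced the script, so everything here is an assumption: the idea of a command that prints one environment name per line, the function names `list_changed_environments` and `select_targets`, and the `max_expected` limit are all hypothetical. The sketch keeps warning text (stderr) separate from the environment list (stdout) and refuses to proceed if warnings are present or the list is larger than expected.

```python
import subprocess
import sys


def list_changed_environments(cmd):
    """Run a (hypothetical) command that prints one environment name per line.

    stdout and stderr are captured separately, so warning text written to
    stderr is never mistaken for an environment name.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    envs = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    warnings = [line.strip() for line in result.stderr.splitlines() if line.strip()]
    return envs, warnings


def select_targets(envs, warnings, max_expected):
    """Return the environments to update, or raise if something looks off."""
    if warnings:
        # Treat any warning as a reason to stop and let a human look.
        raise RuntimeError(f"tool emitted warnings, aborting: {warnings}")
    if len(envs) > max_expected:
        # Guard against updating more environments than the cluster can absorb.
        raise RuntimeError(
            f"{len(envs)} environments selected, expected at most {max_expected}"
        )
    return envs


if __name__ == "__main__":
    # Stand-in for the real tool: prints two names and one warning.
    fake_tool = [
        sys.executable,
        "-c",
        "import sys; print('env-a'); print('env-b'); "
        "print('DEPRECATION WARNING: example', file=sys.stderr)",
    ]
    envs, warnings = list_changed_environments(fake_tool)
    try:
        select_targets(envs, warnings, max_expected=5)
    except RuntimeError as exc:
        print(f"Deployment stopped: {exc}")
```

Stopping on any warning is deliberately strict; a real script might instead log known, harmless warnings and abort only on unexpected ones.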
Posted Jun 05, 2023 - 11:52 UTC

Resolved
This incident has been resolved.
Posted May 30, 2023 - 17:11 UTC
Identified
Between 16:41 and 16:48 UTC, some environments may have been temporarily unavailable, responding with a 503 or 502 error page. This is now resolved.
Posted May 30, 2023 - 17:10 UTC
This incident affected: European Cluster EU2 (EU2 - User Sign-in, EU2 - Exam Administration) and Contacting Support Team.