- Posted by John on October 10, 2019
2019-10-10 Post Mortem for Outage and Incident Affecting Access Control
Last updated: 2019-10-14 17:00 GMT
What happened?
On 10 October, at 02:40 GMT, an outage at one of our infrastructure providers caused a site-wide outage at Overleaf, which lasted for approximately 40 minutes. During this time, most users were not able to access their projects on Overleaf.
When the provider recovered, Overleaf came back online. However, during this restart, a component of one of our services did not restart correctly. Unfortunately, this led to an incident affecting access control, as summarized below:
- The issue affected a small percentage of users who were using Overleaf between 02:47 GMT and 02:52 GMT or 03:18 GMT and 03:48 GMT.
- The issue caused some users to see the project dashboard for other users and a small number of projects to appear in the project dashboard of other users. Only projects with Link Sharing turned on were potentially accessible, not other projects.
- At 03:48 GMT we took Overleaf down for maintenance to fix the incorrectly restarted service and to ensure all users were logged out. This maintenance took 42 minutes, and the site was brought back online at 04:30 GMT.
- As a precaution, we disabled Link Sharing for the 141 projects we identified as potentially affected, and we have contacted the owners of those projects.
Was I affected?
We have sent an email to the owners of the potentially affected projects which had Link Sharing on at the time of the incident.
The incident only affected users who accessed Overleaf during the periods 02:47 GMT to 02:52 GMT and 03:18 GMT to 03:48 GMT on 10 October and was limited to 141 projects with Link Sharing on.
Importantly, there is no indication that any projects with Link Sharing turned off were affected by this incident.
The incident did not result in the exposure of any password data or, for the avoidance of doubt, credit card or payment information.
Has the issue been corrected?
We believe the measures we have taken so far have corrected the issue.
We have also been in touch with the infrastructure provider that had the initial outage to discuss the matter.
What should I do / where can I get more information?
If you have any concerns or would like to report a problem, please get in touch.
We will keep this blog post updated with any further information.
To everyone affected, we are sorry
We take this type of issue very seriously and shall continue doing everything we can to protect your content and learn from this incident. In the meantime, we apologise for any inconvenience or concern caused, and we thank everyone for their patience during the initial outage and the maintenance window that followed whilst we resolved this.
Updates
Monday 14 October: Since our initial investigation on Thursday, we have identified a shorter second window, 02:47 GMT to 02:52 GMT, during which the site was at least partially responsive and the access control problem was also present, in addition to the original window of 03:18 GMT to 03:48 GMT. We have updated this article to mention both windows. For projects with Link Sharing turned on that might have been affected in the new window, we have again turned off Link Sharing as a precaution and have emailed the owners of those projects today.
We have also since had time to incorporate more data into our analysis and were able to show that some projects we previously thought might have been affected were not. Taking into account these reductions and also an increase from the new window, we now believe that 141 projects were affected, which is down from Thursday’s upper bound of 203.