Saturday, September 16, 2017

September 16, 2017 - Hosting downtime incident

This is the writeup for the temporary downtime of a remote domain pointing to an App Engine instance, from August 20 to September 16, 2017 (170820 to 170916).

I transferred a domain from G Suite to Google Domains last month, but encountered a few hiccups. It is possible that the DNS records were not selected for migration during the transfer, which left the domain inaccessible for at least a few weeks.

The issue was noticed on September 15 during infrequent site maintenance: the domain and its subdomains were inaccessible. Investigation showed that the DNS records were missing. Restoring the records was simple and resolved the issue.

Satisfied with the results, I closed out the G Suite trial account and attempted to add the domain as an alias domain to LDI. Verification of the MX records was taking over 24 hours, but I was prepared to wait for the full 72 hours.

Later in the day, I happened to notice that the naked domain on the recovered site was failing to resolve. The naked custom domain had somehow been removed from the App Engine console. This should have been an easy fix, but attempting to add it back kept throwing errors in the UI: "There is an operation pending for this application. Please wait and try again." Another item of note, which was not blocking: Google recently added automatic SSL management to App Engine, which alleviates the need for solutions such as Let's Encrypt, but the certificate generation for this domain was not completing in any reasonable time. I chose to fall back to my Let's Encrypt certs for the subdomain during the investigation. There were a few other, even less descriptive warnings, as well as small bugs where subdomains (but not the naked domain) would be added automatically even if I did not complete the prompts.

These are very vague issues, but due to personal urgency, I considered a few options to resolve them.
  • Create a synthetic record to forward from the naked domain to the subdomain.
  • Create a new duplicate application, and associate the naked domain and subdomain with it as custom domains.
  • Cancel the MX record verification, considering an official comment about incompatibility between hosting and processing email on the same domain.
  • Wait out this error, and hope that some batch process or similar resolves it.
  • Set up the site at another host.
I tried the first three options with no results. The synthetic record mapping had the most potential, but it would not properly allow SSL certificates to be forwarded from the subdomain. Since a few hours had passed since I started investigating, and I had exhausted my other options, I had to fall back to the last option.

As a temporary measure, I migrated the static portions of the website to GitHub Pages and set up custom domains there. GitHub Pages currently does not seem to support HTTPS on custom domains, which led web browsers to show unsightly warnings along the lines of "This connection is not secure" and "This website is using insecure practices". Adding Cloudflare's free DNS and SSL plan resolved the insecure warnings. Interestingly, verification of the domain, MX, and TXT records from Google against Cloudflare was significantly faster than verification from Google against its own DNS servers.

This failure came at a critical time, as I have been starting to apply to other opportunities. Having others follow prominent links to a nonexistent webpage hurts my outreach efforts. In over six years, LDI has never encountered a known outage of this scope and duration, and it is deeply regrettable. In the future, this may be mitigated through a few simple monitoring scripts.
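
As a rough idea of what such a monitoring script could look like, here is a minimal Python sketch using only the standard library. It is not what I run today. The host names in HOSTS are placeholders, and the function names (resolves, serves_https, check) are made up for illustration. It checks the two things that broke in this incident: whether each name still resolves in DNS, and whether the site answers over HTTPS with a certificate the client accepts.

```python
import socket
import urllib.request

# Placeholder host names (assumption): replace with the real naked domain
# and subdomains that should be watched.
HOSTS = ["example.com", "www.example.com"]


def resolves(host, resolver=socket.getaddrinfo):
    """Return True if the host name resolves to at least one address."""
    try:
        return len(resolver(host, 443)) > 0
    except socket.gaierror:
        return False


def serves_https(host, opener=urllib.request.urlopen, timeout=10):
    """Return True if https://host/ answers with a status below 400.

    Certificate problems, HTTP errors, and timeouts all surface as OSError
    subclasses from urlopen, so each of them counts as a failure here.
    """
    try:
        with opener(f"https://{host}/", timeout=timeout) as response:
            return response.status < 400
    except OSError:
        return False


def check(hosts, resolves_fn=resolves, https_fn=serves_https):
    """Return a list of (host, problem) pairs; an empty list means all is well."""
    problems = []
    for host in hosts:
        if not resolves_fn(host):
            problems.append((host, "DNS lookup failed"))
        elif not https_fn(host):
            problems.append((host, "HTTPS request failed"))
    return problems
```

Calling check(HOSTS) from a scheduled job and sending an email whenever the returned list is not empty would likely have surfaced the missing DNS records within a day rather than a few weeks. The resolver and opener parameters exist only so the checks can be exercised without touching the network.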