Last week Green Mountain Access experienced a global email outage across all email platforms and webmail. That outage started around midnight on April 22 and was resolved by late afternoon or early evening.
Another outage occurred Monday morning, April 26, around 8:45 a.m. and was resolved by midmorning. During that outage webmail was available.
Green Mountain Access and Waitsfield Champlain Valley Telecom spokesperson Kurt Gruendling said he and his colleagues understood people’s frustration with the two outages and said they affected not just local users, but 50 million email inboxes.
“First of all, I truly apologize for these email issues. This is certainly not the level of service availability that we strive for or want to provide to our customers. We recognize outages have an impact on everyone and we don’t take these lightly. We have escalated these issues to the top executives of the partners we rely on to provide these services,” Gruendling said.
THIRD-PARTY VENDOR
“In this case, we pick our partners very carefully and fully vet them on their technical and operational capabilities. As you know we receive our email services from a third-party vendor. The National Rural Telecommunications Cooperative (NRTC) is a member-owned cooperative which is made up of other companies like ours and allows us to increase the scope and scale of the services we provide (such as email). NRTC has the scale to be able to partner directly with Synacor (Zimbra). Despite this, they have obviously had some challenges over the course of the past few days with their cloud infrastructure,” he explained.
Gruendling explained that NRTC members had maintained their own cloud infrastructure for email and after spending two years in a deep dive with Synacor (Zimbra) researching its cloud-hosted platform, NRTC moved to the larger company.
Gruendling said that after Monday’s outage, the technical staff at NRTC provided a detailed analysis of what happened over the past few days. He said there would be another review of the issues with Synacor this week as well.
SOME HUMAN ERROR
“As with many things it wasn’t just one thing that impacted services, but rather some human error mixed in to create a series of cascading events that led to the ultimate service interruption. This is a very complex, highly technical environment that is built to the highest levels of redundancy with three fully redundant data centers to support approximately 50 million email boxes. These outages impacted not only all of NRTC’s users (including us) but all of Synacor’s hosted Zimbra email users globally. This was a big outage and it’s being taken very seriously by all involved to make sure it doesn’t happen again,” Gruendling explained.
“Essentially it all started with planned maintenance. Synacor maintains three fully redundant data centers. They brought one data center down for the planned maintenance and when they did, services didn’t properly fail over to the other two data centers as they should have. This was the start of the interruption of email services. After extensive troubleshooting with the firewall vendor, they found a misconfiguration. The firewall vendor executed a command that then locked them out of all of the virtual firewalls, knocking the other two data centers offline. After a lot of analysis and multi-vendor troubleshooting, the decision was made that the virtual firewalls needed to be manually rebuilt. This all took a lot of time to get services functional again. Temporarily, only one data center was brought back online. The brief outage yesterday was the result of bringing the entire fully redundant system back online and a configuration change that resulted in a failover from the active data center and the need to restart some processes during this period of time,” he detailed.
He added that currently, all services are fully functional in high availability mode. “Additional bandwidth and server resources have been allocated to increase performance of all services. Synacor is operating around the clock to ensure overall performance of the entire system,” Gruendling said.
Regarding concerns people voiced after reading a Green Mountain Access Facebook post suggesting that the backlogged emails from the first outage might or might not be lost, Gruendling said that information needed more context.
“When an email is sent, the outgoing server will attempt to deliver it to the inbound server that is associated with the domain of the email address. If the inbound server is unavailable (which was the case during the outage), the sending server will queue the message and try to deliver it at a later time. If it fails again, it will wait for a longer period of time and retry. Each server has its own policy on how long it waits to retry and how many times it retries (we don’t control the other servers in the ecosystem or their policies on resending). If the sending server gives up, it will send a delivery failure message back to the sender and discard the original message. Otherwise, it will keep trying until it is delivered. There is no way for us to know the various policies that are set on sending email servers in this regard. Most emails should be redelivered. I agree that statement is a bit misleading and needed more context to explain how the email ecosystem works,” he said.
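For readers who want to see the queue-and-retry behavior Gruendling describes, the short Python sketch below simulates a single sending server. It is an illustration only, not the configuration of any real mail server: the RetryPolicy class, the simulate_delivery function and every number in them (a 15-minute first wait, doubling up to four hours, giving up after five days) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical sender policy; real servers each choose their own values."""
    first_wait_min: int = 15              # wait before the first retry
    backoff: int = 2                      # each wait is this many times longer
    max_wait_min: int = 240               # waits stop growing at this cap
    give_up_after_min: int = 5 * 24 * 60  # bounce the message after this long


def simulate_delivery(outage_min, policy=None):
    """Return (status, minutes_elapsed, attempts) for one queued message.

    The inbound server is treated as unreachable until `outage_min`
    minutes after the first delivery attempt.
    """
    policy = policy or RetryPolicy()
    elapsed, attempts, wait = 0, 1, policy.first_wait_min
    while True:
        if elapsed >= outage_min:
            # Inbound server is back: this attempt succeeds.
            return ("delivered", elapsed, attempts)
        if elapsed + wait > policy.give_up_after_min:
            # Next retry would fall past the deadline: notify the sender
            # and discard the original message.
            return ("bounced", elapsed, attempts)
        elapsed += wait
        attempts += 1
        wait = min(wait * policy.backoff, policy.max_wait_min)


if __name__ == "__main__":
    eighteen_hours = 18 * 60
    print(simulate_delivery(eighteen_hours))
    print(simulate_delivery(eighteen_hours, RetryPolicy(give_up_after_min=60)))
```

Under these assumed settings, a message sent at the start of an 18-hour outage is delivered on a later retry, while a sender configured to give up after one hour bounces it. That difference in sender policies is why most, but not necessarily all, backlogged emails would be expected to arrive.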