From: Mosso
Sent: Thursday, February 05, 2009 7:28 PM
To: Mark Harbeke
Subject: Mosso - Recap of Incidents 2/2 - 2/4

Mark,

It is with deep apologies and regret that I write this post. First and foremost, I want to assure you that everyone at Mosso is focused on your uptime.

Over the past 48 hours, we have experienced two extended periods of intermittent outages due to two separate technology issues within the Mosso cloud infrastructure in San Antonio. Sites hosted in our Dallas facility have been unaffected. These issues have caused individual sites hosted in our San Antonio facility to be unavailable and, in some cases, to respond slowly.

This one-two punch of consecutive issues is very alarming to us. We have all hands on deck to remedy the situation and get you back to the stability you expect from the Mosso Cloud. We have identified the root cause of each of these two issues, and have been stable for the past 24 hrs. Let me explain the issues:

Issue #1: Network Storage Array

We experienced a connectivity failure in one of our storage arrays. This array was exhibiting intermittent connectivity issues, and ultimately went offline. Data Center staff determined that a drive array cable had loosened or failed during unrelated maintenance on an adjacent device. This drive array contained shared configuration files for multiple Windows clusters and PHP nodes, and this situation contributed to two outage incidents of 10 minutes each and a period of noticeable degradation lasting 47 minutes.

Concurrently, and totally unrelated, a systematic change initiated by Mosso staff caused a distributed configuration file to grow unusually large. This file is replicated across all nodes and validated every two minutes against replication masters for size and timestamp changes. If a change is detected, the file is refreshed on the individual node.
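
As an illustration of the validation step described above, here is a minimal Python sketch. It is not Mosso's actual replication code: the function names `needs_refresh` and `refresh`, and the idea of comparing two local file paths, are assumptions made purely for the example.

```python
import os
import shutil
import tempfile


def needs_refresh(master_path, node_path):
    """Return True when the node copy is missing or differs from the
    master copy in size or modification time (hypothetical check)."""
    if not os.path.exists(node_path):
        return True
    master = os.stat(master_path)
    node = os.stat(node_path)
    # Compare whole seconds to avoid false positives from filesystems
    # that store timestamps at different resolutions.
    return (master.st_size != node.st_size
            or int(master.st_mtime) != int(node.st_mtime))


def refresh(master_path, node_path):
    """Copy the master file to the node, preserving its timestamp."""
    shutil.copy2(master_path, node_path)
```

In a setup like this, each node would call `needs_refresh` on a timer (every two minutes in the description above) and call `refresh` only when it returns True. The larger the file, the longer each `refresh` takes.
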
Due to the change in file size, combined with increasing network load, the refreshes could not complete within the trigger threshold. This ultimately created a cascading backlog of refresh requests in the queues. Once the connectivity issues were resolved and, following a restart, the shared configuration file had been restructured and compressed, the issues stopped, and we have been operating within normal performance since then. We apologize that it caused you downtime, and that it took us a few tries to get everything stabilized.

Issue #2: Load Balancer

Unrelated to the storage array problems above, we experienced a saturation failure in our primary load balancer cluster in the San Antonio facility. Our load balancers in this facility are currently operating in an n+2 configuration. They are sized with spare capacity, allowing for automated failover to the passive nodes in the event of failure. On Tuesday, we experienced four cascading failures in our active pool, in rapid succession, leaving only a limited number of nodes online for a period of several minutes. This resulted in widespread service interruptions, which presented as slow response, dropped sessions (no suitable nodes), and SSL service failures.

Ultimately, it was determined that the root cause of these failures was a physical configuration error that routed a portion of the traffic across an onboard network interface rather than the high-performance NIC installed in each node. This situation, coupled with spikes in network traffic, resulted in kernel panics as the device drivers for the low-capacity interfaces saturated and failed. The configuration error was corrected as soon as it was identified, and we have not experienced additional failures since that time.
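
To make the n+2 sizing mentioned above concrete, here is a small Python sketch of the arithmetic. The function name `headroom` and the pool size of six required nodes are hypothetical; the actual size of the San Antonio pool is not stated here.

```python
def headroom(required_active, spares, failures):
    """Return how many nodes remain beyond those needed to carry the load.

    A negative result means the pool is short by that many nodes.
    """
    online = required_active + spares - failures
    return online - required_active


# Hypothetical pool: 6 nodes needed for peak load, plus 2 spares (n+2).
one_failure = headroom(required_active=6, spares=2, failures=1)
four_failures = headroom(required_active=6, spares=2, failures=4)
```

Under these assumed numbers, a single failure still leaves one spare node, while four failures in rapid succession leave the pool two nodes short, whatever the value of n.
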
Over a period of approximately 24 hours, this sequence of events resulted in 17 total minutes of service unavailability (downtime) and 117 minutes of severe traffic congestion (measured as average response times of greater than 5 seconds).

Issue #3: Stats

Additionally, all website statistics from the evening of 28 JAN to the morning of 05 FEB were either lost or will show an unusually small amount of traffic. As of the morning of 05 FEB, we have corrected the issue and all stats ARE being captured and processed. A replacement for our current stats infrastructure is high on our priority list and is actively being worked on.

We apologize for these issues, and would like to assure you that our entire company is focused on your site's uptime. We will spare no resource at our disposal to ensure that you get the uptime that you expect from Mosso.

Robert Autenrieth
Director - Special Operations
Mosso