TrueCommerce Root Cause Analysis (RCA) for May 26th, 2021 Service Disruption
TrueCommerce places great emphasis on the reliability and security of our products and services, and this remains a top priority. On May 26th, 2021, at approximately 1:31 p.m. ET, TrueCommerce’s private hosting and infrastructure provider experienced a production-down event caused by two self-encrypting drive (SED) data stores losing storage connectivity. The loss of connectivity to the SED data stores impacted TrueCommerce’s production applications and resulted in a disruption in service. Once the hosting and infrastructure provider resolved the issue, our Technical Operations and Engineering teams restored the TrueCommerce production environment through a controlled restart, followed by procedures to resume all production services and transactional processing.
In response to the May 26th disruption to your TrueCommerce application and services, TrueCommerce has prepared this Root Cause Analysis (RCA) describing the events that occurred and the preventative actions implemented by TrueCommerce and our hosting and infrastructure partner to minimize potential future disruptions to your service.
Incident Timeline
Incident Discovery: The unplanned disruption occurred at approximately 1:31 p.m. ET and impacted all customers utilizing the TrueCommerce Trading Network (TC.net) and the Transaction Manager, Pack & Ship, and Product Manager applications. The disruption was caused by a loss of storage connectivity within TrueCommerce’s hosted environment.
Issue Identification: The source of the issue was identified at approximately 1:47 p.m. ET, and our Technical Operations team worked in partnership with our hosting and infrastructure provider to remedy it.
Issue Resolution: At approximately 3:31 p.m. ET, TrueCommerce’s hosting and infrastructure partner confirmed that the SED data stores were back online. TrueCommerce immediately initiated the process of recycling and restoring all services. In accordance with best practices, the virtual machines hosting TrueCommerce applications were brought online in a scripted and controlled fashion.
Restoration of Service: All services and access to applications were fully restored at approximately 6:16 p.m. ET. TrueCommerce Technical Operations and Engineering resources executed full cycle testing scripts to validate full restoration of service and continued to monitor production operations.
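For readers who want the elapsed time between each milestone, the timeline above can be worked through with a short calculation. The following is a minimal sketch only: the timestamps are the approximate times reported in this timeline, and the label names (for example, disruption_began) are illustrative assumptions rather than official TrueCommerce terminology.

```python
from datetime import datetime

# Approximate milestone times from the Incident Timeline (ET, May 26, 2021).
# The keys are illustrative labels chosen for this sketch.
TIMELINE = {
    "disruption_began": "2021-05-26 13:31",
    "issue_identified": "2021-05-26 13:47",
    "storage_restored": "2021-05-26 15:31",
    "service_restored": "2021-05-26 18:16",
}

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_between(start_key, end_key):
    """Return whole minutes elapsed between two timeline milestones."""
    start = datetime.strptime(TIMELINE[start_key], TIME_FORMAT)
    end = datetime.strptime(TIMELINE[end_key], TIME_FORMAT)
    return int((end - start).total_seconds() // 60)


def format_duration(minutes):
    """Format a minute count as hours and minutes, e.g. 285 -> '4h 45m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    spans = [
        ("Time to identify the issue", "disruption_began", "issue_identified"),
        ("Time until SED data stores were back online", "disruption_began", "storage_restored"),
        ("Controlled restart and validation", "storage_restored", "service_restored"),
        ("Total disruption", "disruption_began", "service_restored"),
    ]
    for label, start_key, end_key in spans:
        print(f"{label}: {format_duration(minutes_between(start_key, end_key))}")
```

Because every time in the timeline is reported as approximate, the resulting durations should be read as approximate as well.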
Root Cause Analysis
The issue originated within TrueCommerce’s hosting and infrastructure provider and caused certain TrueCommerce applications and services to lose connectivity with the storage area network (SAN). The underlying encrypted storage nodes dedicated to TrueCommerce’s environment experienced a failover due to a power distribution unit (PDU) fault. The SAN failed over and recovered as expected; however, the two SED data stores failed to recover in the hypervisor and could not be manually re-scanned. TrueCommerce’s hosting and infrastructure partner engaged VMware support and rebooted each host in the cluster to force storage reconnects for the SED data stores. Connectivity to the SAN was then fully restored.
Preventative Measures
TrueCommerce’s hosting and infrastructure partner remains engaged with VMware support to determine why the SED data stores failed to recover after the SAN failover event. All relevant logs have been provided for review. Per VMware’s recommendation, TrueCommerce’s hosting and infrastructure partner will coordinate updates to the hosts’ BIOS, firmware, and drivers, as well as the installation of missing operating system and virtual center updates.
Separately, it was observed during the reboots that a single host has a faulty dual in-line memory module (DIMM), which TrueCommerce’s hosting and infrastructure partner will address in a separate case.