Foundry Connectivity Issue
Incident Report for TrueCommerce
Postmortem

TrueCommerce Root Cause Analysis (RCA) for May 26th, 2021 Service Disruption

TrueCommerce places great emphasis on the reliability and security of our products and services, and this remains a top priority. On May 26th, 2021, at approximately 1:31 p.m. ET, TrueCommerce’s private hosting and infrastructure provider experienced a production-down event caused by two self-encrypting drive (SED) data stores losing storage connectivity. The loss of connectivity to the SED data stores impacted TrueCommerce’s production applications and resulted in a disruption in service. Once the provider resolved the issue, our Technical Operations and Engineering teams restored the TrueCommerce production environment through a controlled restart, followed by procedures to resume all production services and transactional processing.

In response to the May 26th disruption to your TrueCommerce application and services, TrueCommerce has prepared this Root Cause Analysis (RCA) describing the events that occurred and the preventative actions implemented by TrueCommerce and our hosting and infrastructure partner to minimize potential future disruptions to your service.

Incident Timeline

Incident Discovery: The unplanned disruption began at approximately 1:31 p.m. ET and impacted all customers using the TrueCommerce Trading Network (TC.net) and the Transaction Manager, Pack & Ship, and Product Manager applications. It was caused by an unexpected storage connectivity issue within TrueCommerce’s hosted environment.

Issue Identification: The source of the issue was identified at approximately 1:47 p.m. ET and our Technical Operations team worked in partnership with our hosting and infrastructure provider to remedy the issue.

Issue Resolution: At approximately 3:31 p.m. ET, TrueCommerce’s hosting and infrastructure partner confirmed that the SED data stores were back online. TrueCommerce immediately began recycling and restoring all services. In accordance with best practices, the virtual machines hosting TrueCommerce applications were brought online in a scripted and controlled fashion.

Restoration of Service: All services and access to applications were fully restored at approximately 6:16 p.m. ET. TrueCommerce Technical Operations and Engineering resources executed full-cycle test scripts to validate that service was fully restored and continued to monitor production operations.
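For readers who want the length of each phase, the approximate times above can be turned into durations. The following is a minimal Python sketch, not part of the original report. The milestone labels and the function names `phase_durations` and `total_duration` are illustrative assumptions. The times are the approximate ET values stated in the timeline.

```python
from datetime import datetime, timedelta

# Approximate milestone times (ET) on May 26, 2021, copied from the timeline.
# The labels are illustrative, not official phase names.
MILESTONES = [
    ("disruption began", datetime(2021, 5, 26, 13, 31)),
    ("issue identified", datetime(2021, 5, 26, 13, 47)),
    ("SED data stores back online", datetime(2021, 5, 26, 15, 31)),
    ("all services restored", datetime(2021, 5, 26, 18, 16)),
]


def phase_durations(milestones):
    """Return (from_label, to_label, timedelta) for each consecutive pair."""
    return [
        (a_label, b_label, b_time - a_time)
        for (a_label, a_time), (b_label, b_time) in zip(milestones, milestones[1:])
    ]


def total_duration(milestones):
    """Return the elapsed time between the first and last milestone."""
    return milestones[-1][1] - milestones[0][1]


if __name__ == "__main__":
    for start, end, delta in phase_durations(MILESTONES):
        print(f"{start} -> {end}: {delta}")
    print(f"total: {total_duration(MILESTONES)}")
```

Based on these approximate times, identification took about 16 minutes, the storage recovery about 1 hour 44 minutes, and the controlled restart about 2 hours 45 minutes, for roughly 4 hours 45 minutes in total.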

Root Cause Analysis

The disruption originated from an issue within TrueCommerce’s hosting and infrastructure provider that caused certain TrueCommerce applications and services to lose connectivity with the storage area network (SAN). The underlying encrypted storage nodes dedicated to TrueCommerce’s environment experienced a failover due to a power distribution unit (PDU) fault. The SAN failed over and recovered as expected; however, the two SED data stores failed to recover in the hypervisor and could not be manually re-scanned. TrueCommerce’s hosting and infrastructure partner engaged VMware support and rebooted each host in the cluster to force the SED data stores to reconnect. Connectivity to the SAN was then fully restored.

Preventative Measures

TrueCommerce’s hosting and infrastructure partner remains engaged with VMware support to determine what caused the SED data stores to fail to recover after the SAN failover event.  All relevant logs have been provided for review. Per VMware’s recommendation, TrueCommerce’s hosting and infrastructure partner will be coordinating updates to the hosts’ BIOS, firmware, and drivers as well as the installation of missing operating system and virtual center updates.

Separately, the reboots revealed that a single host has a faulty dual in-line memory module (DIMM), which TrueCommerce’s hosting and infrastructure partner will address in a separate case.

Posted Jun 02, 2021 - 13:27 EDT

Resolved
This incident has been resolved.
Posted May 26, 2021 - 19:24 EDT
Monitoring
All services have been restored at this time. We're currently monitoring the network activity.
Posted May 26, 2021 - 19:01 EDT
Update
We're still working to stabilize the environment with our hosting data center. Please note that you may be able to get to the main Foundry page but you'll receive errors at this time. We will update again once we have more information.
Posted May 26, 2021 - 17:25 EDT
Update
We're in the process of bringing the TrueCommerce services online. We will update again once all of the services are fully available.
Posted May 26, 2021 - 15:25 EDT
Update
We are continuing to work with our Data Center to resolve the issue.
Posted May 26, 2021 - 14:37 EDT
Update
We are continuing to work on a fix for this issue.
Posted May 26, 2021 - 14:03 EDT
Identified
We have identified the issue with our Data Center and are currently working with them to resolve it.
Posted May 26, 2021 - 13:47 EDT
Investigating
We are currently experiencing an outage affecting the TrueCommerce Foundry applications.
Posted May 26, 2021 - 13:40 EDT
This incident affected: Foundry Applications (Transaction Manager, Pack & Ship, Data Hub, Unified Commerce Hub, Pulse), Channel Integrations (Amazon Marketplace, Shopify, Magento, WooCommerce, Online Marketplaces), Trading Network (TrueCommerce AS2, TrueCommerce Internal FTP), and Platform (Integration Mapping Tool, Scheduler, Customer Center).