
The following updates relate to the outage from Neverfail Cloud Solutions on 9/20/18. Our engineers were in constant communication with the team at Neverfail throughout the outage. Neverfail has informed us that the outage has been resolved. We are sorry for any inconvenience this outage has caused. Our management team has also been in contact with Neverfail to address these issues at a higher level.

Dear Valued Partner,

Please review this RFO regarding the recent outage in the Austin datacenter on September 20, 2018. Please let us know if you have any further questions.

2018-09-20 – AU Partial Network Outage for Compute Workloads – RFO

Incident Analysis

Two of our compute workload racks became disconnected from the network. This was determined to be caused by an interoperability caveat of the Spanning Tree Protocol. The troubleshooting process took much longer than anticipated because the symptoms did not point to an immediate cause, which required a systematic north-to-south review of the network to ensure nothing was missed. Even with all hands on deck and our switch vendors engaged in the investigation, the resolution took a great deal of analysis and collaborative effort between the vendors and Neverfail engineers.

 

Date/Time – Notable Event
CDT 20/Sep/2018 10:30 – Our network monitoring systems indicate two racks running Cisco Nexus as their leafset data switches are experiencing network connectivity issues. A general health check on the network is started.
CDT 20/Sep/2018 11:10 – Power supply checks and physical link checks indicate no physical layer problems in the network. The data center staff confirms there are no power outages or events within their infrastructure. Neverfail engineers are dispatched to the data center for additional insight and coverage for our investigation.
CDT 20/Sep/2018 12:03 – We suspect that there might be a problem with the Arista core switches both racks are uplinked into. We migrate the links for one of the racks to another set of core switches in an attempt to solve the connectivity issue. This procedure results in no improvement of the issue.
CDT 20/Sep/2018 12:47 – Arista switch support engineers log into the systems that the two racks are connected to and start the examination of these switches.
CDT 20/Sep/2018 13:35 – After the health check on the Arista switches is cleared, Cisco support engineers log into the Cisco Nexus switches and start the examination of the switch health.
CDT 20/Sep/2018 15:52 – Cisco identifies that a rogue PVST packet is being flooded on a VLAN. This packet has a PVID embedded in it that is different from the local VLAN used on the switch uplinks. This is causing the Cisco leafset switches to place their uplinks into a Blocking state.
CDT 20/Sep/2018 16:20 – The VLAN that is flooding the rogue packet is suspended. The Cisco uplinks are bounced, causing the racks to come back online. Customer workloads are verified to have returned to service. A search is initiated for the source of the STP packet.
CDT 20/Sep/2018 19:25 – A packet capture from a core switch reveals the source MAC address of the STP packet. This allows us to track down the source to a private customer cross-connect. Since this is an edge port on the network, strict switchport security settings have been applied, but are not dropping the STP BPDUs as designed.
CDT 20/Sep/2018 19:47 – Arista identifies that their switching platform does not recognize PVST as control plane packets and instead treats them as data plane packets. This is causing these packets to bypass the BPDU filters at the edge of the network.
CDT 20/Sep/2018 19:55 – A Layer2 access-list is placed on the edge port to filter out the offending packets. The VLAN they were being flooded into is brought out of a suspended state. Connectivity is fully restored.
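For illustration only, the detection step described in the timeline above can be reproduced with a short packet-sniffing sketch. This is not part of Neverfail's tooling; it is a minimal example that assumes the Python scapy library is available and keys on the Cisco PVST+/SSTP multicast destination address 01:00:0C:CC:CC:CD discussed later in this RFO.

    # Minimal sketch (not Neverfail tooling; assumes scapy and capture privileges):
    # watch an uplink for PVST+ BPDUs, which are sent to the Cisco SSTP multicast
    # MAC 01:00:0c:cc:cc:cd and typically carry an 802.1Q tag.
    from scapy.all import sniff, Ether, Dot1Q

    PVST_DST = "01:00:0c:cc:cc:cd"  # Cisco PVST+/SSTP destination MAC

    def report(pkt):
        """Print the source MAC and VLAN tag of any PVST+ BPDU seen."""
        if pkt.haslayer(Ether) and pkt[Ether].dst == PVST_DST:
            vlan = pkt[Dot1Q].vlan if pkt.haslayer(Dot1Q) else None
            print(f"PVST+ BPDU from {pkt[Ether].src} on VLAN {vlan}")

    # "eth0" is a placeholder interface name for the monitored uplink.
    sniff(iface="eth0", filter=f"ether dst {PVST_DST}", prn=report, store=False)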

 

Network monitoring systems began alerting us that certain switches were down in our network and that the availability of some customer workloads was likely affected. A ticket was opened with the data center to verify power circuits, and no problems were indicated. An investigation was started and originally focused on the core switch set that both racks are uplinked to. An issue with the rack switches themselves was not initially suspected due to the failures occurring simultaneously. A critical priority ticket was opened with Arista and they were brought into our incident bridge. After technical support data was supplied to Arista, they began to examine the data, but could find no problems with the health of the core switches.

At this point, a critical priority ticket was opened with Cisco and a Nexus specialist was brought into our incident bridge. They began examining the health of the Nexus rack leafset switches. Cisco identified that the uplinks were being placed into a blocking state due to a spanning tree error, which was being triggered by the receipt of an STP packet carrying an 802.1Q VLAN tag. Neverfail runs Multiple Spanning Tree Protocol (MSTP) exclusively in our network, and the presence of the VLAN tag indicated that this was a Per-VLAN Spanning Tree (PVST) packet.
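As a point of reference, the two BPDU types can be told apart in a capture by their destination address alone: IEEE STP/RSTP/MSTP BPDUs are sent untagged to 01:80:C2:00:00:00, while Cisco PVST+ BPDUs are sent to 01:00:0C:CC:CC:CD and typically carry an 802.1Q tag. The following is a minimal sketch of that classification, again assuming the Python scapy library.

    # Minimal sketch (assumes scapy and an Ethernet frame as input): classify a
    # captured frame as an IEEE BPDU, a Cisco PVST+ BPDU, or neither.
    from scapy.all import Ether, Dot1Q

    IEEE_STP_DST = "01:80:c2:00:00:00"  # untagged IEEE STP/RSTP/MSTP BPDUs
    PVST_DST = "01:00:0c:cc:cc:cd"      # Cisco PVST+/SSTP BPDUs, usually 802.1Q-tagged

    def classify_bpdu(frame):
        if not frame.haslayer(Ether):
            return "not an Ethernet frame"
        dst = frame[Ether].dst
        if dst == IEEE_STP_DST:
            return "IEEE STP/RSTP/MSTP BPDU"
        if dst == PVST_DST:
            vlan = frame[Dot1Q].vlan if frame.haslayer(Dot1Q) else "untagged"
            return f"Cisco PVST+ BPDU (VLAN {vlan})"
        return "not a BPDU"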

Once the cause of the disconnection event was identified as a Spanning Tree packet tagged on a particular customer VLAN, that VLAN was placed into a suspended state, causing all packets in it to be dropped rather than learned and flooded as in normal operation. It was also determined that the rogue packet was being ignored by our Arista switches and was only affecting our Nexus switches. With the offending Spanning Tree data no longer reaching the Nexus uplinks, the Nexus switches and their workloads came back online, and verification of customer workloads was commenced by the Neverfail compute and storage teams.

The debugging output of the Cisco switches only indicated the presence of the offending Spanning Tree data, not the source Media Access Control (MAC) address of the packets. An investigation was initiated to track down the source using packet captures on our core switches, and a capture on one of our core switches revealed the source MAC address of the rogue packets.
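The same source-hunting step can be illustrated offline. The sketch below, again assuming the Python scapy library and using a hypothetical capture file name, lists the source MAC addresses of all PVST+ BPDUs found in a capture taken from a core switch.

    # Minimal sketch (assumes scapy; "core-uplink.pcap" is a hypothetical file name):
    # list the source MACs of all PVST+ BPDUs found in a core-switch capture.
    from collections import Counter
    from scapy.all import rdpcap, Ether

    PVST_DST = "01:00:0c:cc:cc:cd"

    sources = Counter()
    for pkt in rdpcap("core-uplink.pcap"):
        if pkt.haslayer(Ether) and pkt[Ether].dst == PVST_DST:
            sources[pkt[Ether].src] += 1

    for mac, count in sources.most_common():
        print(f"{mac}: {count} PVST+ BPDUs")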

Identifying the source MAC address of this data allowed us to quickly track down the source. The port it was being flooded from had already been configured to filter out any external spanning tree data, so Arista was engaged to determine why the Spanning Tree filter was not working. Arista's investigation revealed that since their operating system does not recognize the PVST spanning tree mode (only MSTP), any PVST packets received are not treated as control-plane Bridge Protocol Data Units (BPDUs) but as data-plane packets. This caused these packets to bypass the BPDU filters and instead be flooded into their associated VLAN like any other multicast traffic. This difference in packet handling between our two vendors meant that the Arista switches ignored the spanning-tree PDUs and flooded them as unknown traffic, whereas the Cisco Nexus switches identified them as legitimate STP information and attempted to process them. The Port VLAN ID (PVID) embedded inside the packets did not match the PVID configured on the Nexus uplinks, so the Nexus switches placed their uplinks into a Blocking state due to the inconsistent spanning tree state they were seeing. This blocking state caused the packet data for workloads in these racks to be discarded instead of forwarded on to the rest of the network.
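The PVID consistency check the Nexus switches perform can be sketched as follows. In reality the PVID is carried in a TLV inside the PVST+ BPDU; for simplicity this sketch, again assuming scapy, uses the outer 802.1Q tag as a stand-in, and the expected uplink VLAN shown is a hypothetical value rather than the VLAN involved in this incident.

    # Minimal sketch (assumes scapy and an Ethernet frame as input). The real PVID
    # lives in a TLV inside the PVST+ BPDU; the outer 802.1Q tag is used here as a
    # simplified stand-in. EXPECTED_UPLINK_VLAN is a hypothetical value.
    from scapy.all import Ether, Dot1Q

    PVST_DST = "01:00:0c:cc:cc:cd"
    EXPECTED_UPLINK_VLAN = 100  # hypothetical local VLAN configured on the uplink

    def pvid_inconsistent(frame):
        """Return True if a PVST+ BPDU arrives tagged with an unexpected VLAN."""
        if not frame.haslayer(Ether) or frame[Ether].dst != PVST_DST:
            return False
        if not frame.haslayer(Dot1Q):
            return False
        return frame[Dot1Q].vlan != EXPECTED_UPLINK_VLAN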

Since Arista was not recognizing the PVST PDUs as BPDUs, it was not filtering them at the edge. A MAC access list was created to drop any packets destined to the multicast address 01:00:0C:CC:CC:CD, which effectively filtered the packets at the edge. At this point, the affected VLAN was placed back into the active state and the network was normalized. All customer workloads were again verified to be stable.
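The matching logic of that edge filter amounts to a deny rule on the PVST+ destination address. The sketch below only models that behavior in Python for illustration; the actual fix was a Layer 2 access-list applied on the edge port itself.

    # Minimal sketch modelling the edge filter's behavior (illustration only; the
    # actual fix was a Layer 2 access-list on the switch): drop frames destined to
    # the PVST+ multicast address, forward everything else.
    from scapy.all import Ether

    PVST_DST = "01:00:0c:cc:cc:cd"

    def permitted(frame):
        """Deny frames to the PVST+ multicast address; permit all other frames."""
        return not frame.haslayer(Ether) or frame[Ether].dst != PVST_DST

    def apply_edge_filter(frames):
        return [f for f in frames if permitted(f)]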

Resolution

Once the cause of the disconnection event was identified as a Spanning Tree packet tagged on a particular customer VLAN, this VLAN was placed into a suspended state. At this time affected compute workloads were brought back online. After a search for the exact source and nature of the unauthorized STP packet, the source was found and filtered.

A review of all edge packet and protocol filters has been initiated. All network edge ports will be configured to explicitly filter the MAC address that PVST uses, along with any other potentially disruptive protocol types.

Respectfully,

Aaron Mayfield, Lead Network Engineer

Dear Valued Partner,
 
The Cisco engineers managed to fix the problem with their switches, so our network is back online and should now be stable. Please check that your VMs are able to provide full service. If they are not, please contact our Support team by email at support@neverfail.com or by phone at 512-600-4300. We are now expecting full RFOs from Cisco and Arista on this issue and will keep you updated as we receive more information from them.
 
Thank you,
Neverfail Support

The above message is a direct response from Neverfail. Our engineers have verified that each server that was affected during this outage is back online. If you find that there is an issue, please contact WheelHouse IT support at support@wheelhouseit.com, and our team will address it accordingly.

Dear Valued Partner,

We identified an incompatibility issue between Cisco and Arista physical switches that causes port blocking, and we are still working with Arista and Cisco engineers on the issue. In the meantime, we have put a backup plan in motion in which the Cisco physical switches will be replaced in order to resolve this issue.

Thank you,
Neverfail Support

Dear Valued Partner,

We are still working on the networking issue affecting our Austin data center hosts and have support technicians from both of our vendors, Cisco and Arista, troubleshooting some inconsistent MAC table issues we identified on our core physical switches.

Thank you,
Neverfail Support

Dear Valued Partner,

We have narrowed down the issue to a specific set of physical switches, and our Cloud engineers are troubleshooting these with vendor support. We hope to have this resolved as soon as possible.

Thank you,
Neverfail Support

Dear Valued Partner,

We identified the networking issue affecting our Austin data center and are working with our vendor to resolve it. All hosts and VMs are up and running, but the network links are not staying up for more than a few milliseconds. We will provide more details as soon as they become available.

Thank you,
Neverfail Support

Dear Valued Partner,

Further to our issue in the Austin data center: we continue to see intermittent connectivity issues, and our Cloud Engineering team is working continuously to fix them.

We will provide more information as it becomes available.

Thank you,
Neverfail Support

Dear Valued Partner,

We are experiencing a network service interruption. Our engineers are investigating and will provide updates as soon as more information is available.

Neverfail Support