When we get to the point in an investigation where we are about to break out Wireshark, the complexity of the packet analysis can seem quite daunting. And yet covering a few key points can dramatically cut the time needed to analyze any diagnostic data.
In my previous post I covered the selection of a single symptom for investigation. In this post we'll look at why we need to understand more than just the network connectivity.
I remember visiting a third-party data center and chatting to a network engineer who had been leading the investigation into a Citrix performance problem. He had spent months looking at this issue, and I was shocked to discover how little he understood about the system he was analyzing. I asked him to draw a rough diagram showing the main components of the system and how they talked to each other. He couldn't, and he didn't see the need. As far as he was concerned, packets went into one switch port and they came out of another. "I don't need to know what connected to those ports", he informed me.
This may be an extreme example, but I have attended many meetings with teams that have been investigating a performance problem and nobody is able to draw the system on a whiteboard.
Modern systems are very complex, and so we need to sketch out the system with enough detail to provide everyone with an understanding of how it works, but not so much that it's overwhelming. Advance7 has found that a simple block diagram with lines showing connectivity works well. Typical components to include are:
- Servers, including their role in the system (database, file server, app server, etc.)
- WAN and Internet elements
- Firewalls and load balancers
- User PCs (one or two examples) and their locations
When it comes to setting up packet captures we will need a bit more detail, but initially the above is enough for us to formulate a capture plan. The diagram above was produced for an investigation into an IP Telephony problem, and we can clearly see the main components of the system.
We then need to be able to show how each component communicates with other components. From a network point of view, we can simply overlay the diagram with colored lines showing the main application protocols flowing between components. Structuring our investigation with RPR enables us to accommodate a level of incorrectness in the description of these flows. If you are not using RPR, you may want to do some discovery work first to make sure you have an accurate list of flows.
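If it helps to keep the sketch alongside the capture plan, the same block diagram can be written down as a small data structure. The following is only a minimal sketch of that idea: the component names, roles and protocols are hypothetical placeholders, not taken from any real investigation, and the flow list is expected to be approximate at first.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    """One line on the diagram: an application protocol between two components."""
    src: str
    dst: str
    protocol: str


@dataclass
class SystemSketch:
    """A rough block diagram: named components plus the flows between them."""
    components: dict[str, str] = field(default_factory=dict)  # name -> role
    flows: list[Flow] = field(default_factory=list)

    def add_component(self, name: str, role: str) -> None:
        self.components[name] = role

    def add_flow(self, src: str, dst: str, protocol: str) -> None:
        # Refuse flows that reference a component missing from the sketch.
        for end in (src, dst):
            if end not in self.components:
                raise KeyError(f"unknown component: {end}")
        self.flows.append(Flow(src, dst, protocol))

    def flows_touching(self, name: str) -> list[Flow]:
        """Flows we would expect to see in a capture taken at this component."""
        return [f for f in self.flows if name in (f.src, f.dst)]


# Hypothetical example system; every name below is a placeholder.
sketch = SystemSketch()
sketch.add_component("user-pc", "user PC, branch office")
sketch.add_component("load-balancer", "load balancer")
sketch.add_component("app-server", "app server")
sketch.add_component("db-server", "database")

sketch.add_flow("user-pc", "load-balancer", "HTTPS")
sketch.add_flow("load-balancer", "app-server", "HTTP")
sketch.add_flow("app-server", "db-server", "SQL")

for flow in sketch.flows_touching("app-server"):
    print(f"{flow.src} -> {flow.dst}: {flow.protocol}")
```

Listing the flows that touch a component is a quick way to check what a capture at that point should contain, and to spot a flow that is missing from the sketch.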
Finally, mark on the diagram the tools and data sources available. This may be trace or log data; anything that enables us to gain an understanding of the cause of the problem. From a network engineering point of view, the value in web or system logs may not be obvious, but there is one very important way in which they can help. If the problem experienced by the user is matched by an error in a log, the error will be timestamped and this can be used to take us to the correct point in a network trace.
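As a rough illustration of using a log timestamp to find the right place in a trace, the sketch below turns an error time into a small time window and picks out the packets that fall inside it. It assumes the packet timestamps have already been exported from the trace as a sorted list, and that any clock difference between the log source and the capture host is known or estimated. The function names, the 5-second margin and the sample times are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta, timezone


def trace_window(log_time, clock_offset=timedelta(0), margin=timedelta(seconds=5)):
    """Return (start, end) of the trace window around a logged error.

    clock_offset is how far the capture clock runs ahead of the log clock
    (an assumption the investigator has to supply or estimate).
    """
    centre = log_time + clock_offset
    return centre - margin, centre + margin


def packets_in_window(packet_times, start, end):
    """Return the index range of packets whose timestamps lie in [start, end].

    packet_times must be sorted in ascending order, as in a normal capture.
    """
    lo = bisect_left(packet_times, start)
    hi = bisect_right(packet_times, end)
    return range(lo, hi)


# Hypothetical error time taken from a web or system log.
error_time = datetime(2024, 3, 1, 10, 15, 30, tzinfo=timezone.utc)

# Hypothetical packet timestamps, expressed as offsets from the error time.
packet_times = [error_time + timedelta(seconds=s) for s in (-60, -4, -1, 0, 3, 20)]

start, end = trace_window(error_time)
hits = packets_in_window(packet_times, start, end)
print(f"Look at packets {hits.start} to {hits.stop - 1} of the trace")
```

Widening the margin trades precision for tolerance of clock drift; if the two clocks are not synchronized, a larger margin is the safer starting point.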
Without a basic understanding of the system that's having problems, you are going to have a hard time. If the problem is affecting multiple systems, focus on one, in the same way that we focus on a single symptom.
PS: You can get instant access to the RPR manual via the Network Trace Analysis Guide section of the TribeLab site. It includes further tips on gaining an understanding of the system being investigated.
Paul is currently leading the TribeLab project to explore new ways to help IT support people troubleshoot performance and stability problems.