When Is A Lost Packet Not Lost? (by Tony Fortunato)

I was working with a customer and started explaining the concept of what I affectionately refer to as “application baselining”.  I asked my client to pick an application to test, and he chose his favorite network monitoring application.

I then explained the concept of an “application baseline”. For those of you who may not be familiar with the term, an application baseline (or snapshot) is a process where you document the behavior of an application. Examples of items that would be of interest are:

  • IP addresses the application communicates with
  • Function of those IP addresses (e.g. DNS, DHCP, Web server)
  • Amount of data transferred
  • Protocols used
  • Other observations such as clear text data, proprietary application notes
  • Application behavior when errors or timeouts are encountered.
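One way to keep these items consistent from one snapshot to the next is to record them in a small structure. The following is only a sketch: the class names, field names, and sample values are my own assumptions for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Endpoint:
    """One host the application talks to (hypothetical structure)."""
    ip: str
    role: str                      # e.g. "DNS", "DHCP", "Web server"
    bytes_transferred: int = 0
    protocols: Set[str] = field(default_factory=set)


@dataclass
class BaselineSnapshot:
    """A single, narrowly scoped snapshot of application behavior."""
    name: str                      # what was captured, e.g. "login"
    how_captured: str              # the steps taken, so the test can be repeated
    endpoints: List[Endpoint] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)    # clear text data, notes
    error_behavior: List[str] = field(default_factory=list)  # errors and timeouts

    def total_bytes(self) -> int:
        """Amount of data transferred across all endpoints."""
        return sum(e.bytes_transferred for e in self.endpoints)

    def protocols_seen(self) -> Set[str]:
        """Union of the protocols used with every endpoint."""
        seen: Set[str] = set()
        for e in self.endpoints:
            seen |= e.protocols
        return seen


# Illustrative values only (192.0.2.0/24 is a documentation address range).
snapshot = BaselineSnapshot(
    name="account query",
    how_captured="Captured on the client while running one account query",
    endpoints=[
        Endpoint("192.0.2.10", "DNS", 300, {"UDP"}),
        Endpoint("192.0.2.20", "Web server", 12000, {"TCP"}),
    ],
    observations=["credentials sent in clear text"],
)
print(snapshot.total_bytes(), sorted(snapshot.protocols_seen()))
```

Keeping the "how it was captured" field next to the findings is the important part, since that is what makes a later comparison meaningful.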

I asked him to put my laptop’s IP address into the monitoring application.  We then captured the pings from the management server and noticed some cool things. For example, all the pings had a 99-byte payload with a specific string in the payload. We also noticed that it was using IPv6 and IPX as well as IPv4.  Yes folks, I said IPX. I suggested we disable the unnecessary protocols (IPX and IPv6) as an exercise in reducing extra multicasts and broadcasts, and we verified that the protocols were actually disabled by taking another trace. I also noticed that RIP was enabled, and we turned that off as well.

My client then commented, “When you get the hang of it, this really isn’t a big deal.” I love it when I believe that someone really gets it. He went on to explain that he thought a baseline took days, if not weeks, to complete and that at the end you would end up with binders of information that no one reads. Man, did he ever hit that myth on the head.  Not true!!

As I mentioned earlier, a baseline is just another word for a snapshot. That snapshot can be basically anything: a login, a boot up, a query. As long as you document how you did it, it is correct. The snapshot can be a simple paragraph, a page or several pages of observations and findings.  I prefer to take many simple, specific snapshots since they can be gathered quickly.  If they are related, then it is easy to combine them. The bonus is that they are easy to compare when you have an upgrade or a reported problem. For example, if you have a snapshot of a user performing an account query that took 10 seconds, you can easily compare it to the same query when it is reported that it now takes 60 seconds.  The only thing you have to be careful about is server handoffs or multi-tiered applications behind the scenes. In those cases you need to capture from the backend as well to get a complete picture.
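The timing comparison above is simple arithmetic, but writing it down makes the before and after explicit. This is a minimal sketch; the function name and the 25% tolerance are assumptions chosen for illustration.

```python
def compare_timing(baseline_seconds, current_seconds, tolerance=0.25):
    """Compare a new measurement against the baseline snapshot.

    tolerance is the fraction of drift treated as normal variation
    (0.25 is an arbitrary illustrative value).
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline must be a positive duration")
    ratio = current_seconds / baseline_seconds
    if ratio > 1 + tolerance:
        return f"slower: {ratio:.1f}x the baseline"
    if ratio < 1 - tolerance:
        return f"faster: {ratio:.1f}x the baseline"
    return "within tolerance of the baseline"


# The account query from the example: 10 seconds then, 60 seconds now.
print(compare_timing(10, 60))  # slower: 6.0x the baseline
```

The comparison is only valid if both measurements were captured the same way, which is why documenting the method matters.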

Let’s get back to our network monitoring application baseline.  I simply powered off my laptop to see how long it takes for the application to report that I’m down, how often it checks and what exactly the outage looks like in the log. Here’s where the fun starts. Background info: the network monitoring server is configured with a threshold to determine if a host is really down, or simply experiencing packet loss. The default it was set for is 50%. In other words, if you lose more than 50% of your packets, the application considers you down. It then checks every 5 minutes to see if you come back online.
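The threshold decision can be sketched in a few lines. This is not the monitoring product's actual code; the function name is hypothetical, and I am assuming "down" means strictly more loss than the configured threshold.

```python
def host_status(sent, received, down_threshold_pct=50.0):
    """Classify a host from one polling cycle of pings.

    Assumption: the host is "down" only when the loss percentage is
    strictly greater than the threshold; anything else with some loss
    is reported as packet loss.
    """
    if sent <= 0:
        raise ValueError("at least one ping must be sent")
    loss_pct = 100.0 * (sent - received) / sent
    if loss_pct > down_threshold_pct:
        return "down"
    if loss_pct > 0:
        return "packet loss"
    return "up"


print(host_status(4, 4))  # up
print(host_status(4, 3))  # packet loss
print(host_status(4, 0))  # down
```

Note that the result depends entirely on what the application counts as "received", which turns out to matter a great deal.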

After we powered off my laptop, the application did not say my computer was down, just that it was experiencing packet loss.  After 15 minutes, I became suspicious (and impatient) and asked that we open a command prompt on the server and try pinging the laptop’s IP address manually. We found something interesting. Check out the results:

C:\ >ping 10.99.10.129

Pinging 10.99.10.129 with 32 bytes of data:

Reply from 10.44.10.37: Destination host unreachable.

Request timed out.

Reply from 10.44.10.37: Destination host unreachable.

Request timed out.

Ping statistics for 10.99.10.129:

    Packets: Sent = 4, Received = 2, Lost = 2 (50% loss)

It seems that a router along the way technically provides a reply, even though that reply is “Destination host unreachable”. The monitoring application took this as a generic ICMP reply.  The analyst I was working with was initially upset because he thought that the monitoring system had some kind of bug.  After seeing the ping messages, he realized that his network architecture was affecting the way the management system behaves.
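To see how the same four pings can be read as either 50% loss or 100% loss, here is a small sketch that parses the output above two ways. It assumes English-language Windows ping output in the format shown; the function name and regular expression are my own.

```python
import re

# The per-packet lines from the ping output above.
PING_OUTPUT = """\
Pinging 10.99.10.129 with 32 bytes of data:
Reply from 10.44.10.37: Destination host unreachable.
Request timed out.
Reply from 10.44.10.37: Destination host unreachable.
Request timed out.
"""

REPLY_RE = re.compile(r"^Reply from (\S+): (.*)$")


def count_replies(output, target):
    """Return (naive, genuine) reply counts.

    naive   counts every "Reply from" line, whoever sent it.
    genuine counts only replies sent by the target itself that are not
            an unreachable notice.
    """
    naive = 0
    genuine = 0
    for line in output.splitlines():
        match = REPLY_RE.match(line.strip())
        if not match:
            continue
        naive += 1
        source, detail = match.groups()
        if source == target and "unreachable" not in detail.lower():
            genuine += 1
    return naive, genuine


naive, genuine = count_replies(PING_OUTPUT, "10.99.10.129")
print(naive, genuine)  # 2 0
```

Counted naively, two of four pings were answered, which is the 50% loss that ping itself reports. Counted by who actually answered, the laptop never replied at all.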

After reviewing the trace, we found something else.

  [Image: packet capture]

It seems like his default gateway is sending an ICMP redirect, which is then followed by the Destination host unreachable message.

We identified something that he can now take away and address. I cautioned him that this may have been the case from day one, or it may have started after a network change. I also advised him to do some homework: Why is router 10.44.10.37 responding? Is there a cache timeout that will eventually address it? Why is 10.44.10.1 sending the redirect? Is there a way to configure the network monitoring application to disregard the ICMP Destination Unreachable messages?
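If the monitoring application can filter on ICMP message type, the rule we are after is short. The sketch below uses the standard ICMP type numbers (0 for Echo Reply, 3 for Destination Unreachable, 5 for Redirect); the function name and the event tuples are hypothetical and only mirror what we saw in the trace.

```python
# Standard ICMPv4 message type numbers.
ICMP_ECHO_REPLY = 0
ICMP_DEST_UNREACHABLE = 3
ICMP_REDIRECT = 5


def proves_host_alive(icmp_type, source_ip, target_ip):
    """Only an Echo Reply sent by the target itself shows the host is up."""
    return icmp_type == ICMP_ECHO_REPLY and source_ip == target_ip


TARGET = "10.99.10.129"

# (ICMP type, source address) pairs like those seen in the trace.
events = [
    (ICMP_REDIRECT, "10.44.10.1"),
    (ICMP_DEST_UNREACHABLE, "10.44.10.37"),
]

alive = any(proves_host_alive(t, src, TARGET) for t, src in events)
print(alive)  # False
```

Whether a given product exposes this kind of setting is something to check in its documentation; the point is simply that a redirect or an unreachable notice from a router should not count as a reply from the host.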

I also reminded him of what I said at the start of this exercise: if you do a baseline correctly, you will probably end up with some work to do ;b

Some analysts will avoid baselining for this very reason since they have enough to do, but I think of it as having two huge benefits:

  1. You may uncover network issues that are constantly recovering in the background until one day when they don’t recover.
  2. You get a real understanding of how your applications, clients, servers and network devices behave in a real environment.  This information cannot be taught anywhere but in your own backyard, and I call it ‘network tribal knowledge’.

I plan to post some articles of sample application baselines in the future and look forward to your feedback.

Enjoy.
