The cloud is a growing trend in outsourced computing power. On the upside, services can scale horizontally to handle load, but the cost increases with the number of servers. On the downside, there’s less visibility into application and network internals. The paradox: the more servers you run, the more you pay, yet the harder it becomes to get the data needed to troubleshoot and optimize your way out of needing more servers.
While there are tools to allow packet capture in virtual environments, they’re generally not available in a cloud. Cloud providers won’t give you access to virtual taps, because public cloud multitenancy will expose data from multiple cloud customers. Fortunately, there are situations where cloud packet-level analysis is still possible, by focusing directly on the endpoints.
Packet-level analysis in the cloud generally focuses on two things:
- End-User Experience
- Application efficiency of cloud-hosted services
The primary measurement of an end-user facing cloud service is going to be the end-user experience: is the service “fast” enough to ensure satisfaction?
A network-centric approach uses test clients running packet capture agents. In this low-effort method, start a local packet capture, connect to the service, perform some typical operations, stop the packet capture, and measure the results. If the service feels “fast”, the test is done, and the packet capture provides a baseline. If the service feels “slow”, the packet capture provides hard data for analysis.
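The manual capture session described above is easy to script. Here is a minimal sketch in Python, assuming tcpdump is installed on the test client; the interface name, capture file, and service hostname are placeholders, not values from the original article:

```python
import subprocess

def capture_cmd(interface, outfile, host):
    # Build a standard tcpdump command: capture on the given interface,
    # write raw packets to a pcap file, filter to traffic to/from the
    # cloud service under test.
    return ["tcpdump", "-i", interface, "-w", outfile, "host", host]

# Typical manual session (hypothetical host and interface):
# proc = subprocess.Popen(capture_cmd("eth0", "baseline.pcap", "svc.example.com"))
# ... connect to the service and perform some typical operations ...
# proc.terminate()  # stop the capture; baseline.pcap is your result
```

If the service felt fast, the resulting pcap becomes your baseline; if it felt slow, the same file is the raw material for the analysis below.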
The most useful technique is separating application latency from network latency. TCP signaling provides valuable insight: a fast server will include the ACK and the payload in a single packet, but a slow or overloaded server will send a bare ACK relatively quickly, followed by the payload later. Since TCP operates at the kernel level, it runs at higher priority than the application, so the server will send an ACK right away regardless of application load. The same split appears on a lightly loaded server if the application needs to gather information from an external source like a database. In most cases there will be a combination of events: the ACK may arrive slowly, indicating a slow network, followed after a further delay by the payload, indicating a slow application.
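The split above reduces to simple arithmetic on three timestamps from the capture. A sketch, with hypothetical timestamp values for illustration:

```python
def split_latency(t_request, t_ack, t_payload):
    """Split round-trip delay into network and application components.

    t_request: when the client's request packet left
    t_ack:     when the server's bare ACK arrived
    t_payload: when the first payload-bearing packet arrived

    Because the kernel ACKs promptly, request-to-ACK approximates the
    network round trip; ACK-to-payload is time the application spent.
    If ACK and payload share one packet, t_payload == t_ack and the
    application delay is zero.
    """
    network_delay = t_ack - t_request
    app_delay = t_payload - t_ack
    return network_delay, app_delay

# Hypothetical timestamps (seconds) pulled from a capture:
net, app = split_latency(0.000, 0.045, 0.820)
# net = 0.045 s of network delay, app = 0.775 s of application delay
```

Here the application, not the network, would be the place to dig further.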
Assuming there are network-level issues (slow ACKs), the way to determine whether they originate at the cloud host or elsewhere on the Internet is to test from multiple locations. If you’re not part of a multi-national organization, the cloud itself can help you: many cloud providers offer a choice of regions, and inexpensive (or free) hosting for small instances. If the network issues only show up from certain locations, the problem is in the local client network; otherwise it’s likely your cloud service provider.
Once you’ve done some manual end-user experience testing, you may be able to scale up by automating the tests. Most cloud services provide an API, which will let you automate the client’s interaction with the service. Combine that with some OS-level scripting to start and stop the packet capture, and you’ve got a low-maintenance test system with global deployment. To make analysis easier still, have the wrapper script record basic timing information for each step, then automatically upload the results to a simple web service that parses the timing data and tells you whether further (packet) analysis is necessary.
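The wrapper described above might look like the following sketch. The step names, the latency budget, and the upload step are assumptions for illustration; in practice the wrapper would also start and stop tcpdump around the run:

```python
import json
import time

SLOW_THRESHOLD = 2.0  # seconds; an assumed per-step budget, tune per service

def timed_step(name, fn, results):
    """Run one scripted operation and record its wall-clock duration."""
    start = time.monotonic()
    fn()
    results[name] = time.monotonic() - start

def needs_packet_analysis(results, threshold=SLOW_THRESHOLD):
    """List the steps that blew their budget. Only when this is
    non-empty is it worth opening the pcap captured alongside the run."""
    return [name for name, elapsed in results.items() if elapsed > threshold]

# Hypothetical run: each step would call the service's API.
results = {}
timed_step("login", lambda: time.sleep(0.01), results)
report = {"slow_steps": needs_packet_analysis(results), "timings": results}
print(json.dumps(report))  # upload this JSON next to the pcap
```

The web service then only needs to read `slow_steps` to decide whether a human (or deeper tooling) should look at the capture.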
Organizations that use an external cloud provider, e.g. public cloud or hosted private cloud, can still perform packet-level analysis if they use an IaaS model. Given that you won’t be able to capture from the network, you’ll need to capture directly on your VM image. Once again, this is becoming easier, as low-impact agents are available that won’t use excessive memory or CPU.
The analysis for server-based packet capture is very similar to end-user packet capture.
Look at the TCP ACK versus payload timing. The difference here is that a delay between when the server sends data and when the client’s ACK arrives is evidence of a slow network. If all client flows demonstrate the same network delays, then the problem is within the cloud provider. However, if the server seems slow but the system stats don’t show much load, the app is probably waiting for external data.
Most cloud-hosted apps are distributed: multiple front-end servers provide horizontal scalability, coordinated by a central back end such as a database. Packet capture on a front-end server will show you the client request, followed by a transaction between the front end and the database, followed by the response back to the client. The timing here is also very informative: if the front end waits before transacting with the database, the front end is slow; if the database takes a long time to respond, it’s the bottleneck.
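That three-way attribution also comes down to timestamp arithmetic on the front-end capture. A sketch, with hypothetical timestamps for one transaction:

```python
def tier_delays(t_client_req, t_db_query, t_db_reply, t_client_resp):
    """Attribute one request's latency across a two-tier app, using four
    timestamps visible in a capture taken on the front-end server."""
    return {
        "front_end_pre": t_db_query - t_client_req,    # front end before querying DB
        "database": t_db_reply - t_db_query,           # back-end (DB) turnaround
        "front_end_post": t_client_resp - t_db_reply,  # front end building the response
    }

# Hypothetical timestamps (seconds) from one transaction:
d = tier_delays(0.00, 0.02, 1.45, 1.50)
# database delay (1.43 s) dominates: capacity belongs in the back end,
# not the front end
```

The breakdown makes the scaling decision concrete: whichever component owns the largest share of the delay is the one worth spending on.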
Slow back-end services are a frustrating source of latency if the server team isn’t monitoring all parts of the distributed app. The instinct in the cloud is to add more servers, but that won’t help if they’re added at the wrong tier; it will only raise the cost without improving the end-user experience.
Using the Information
Classic packet capture still has a role in the cloud: pinpointing demonstrably slow responses can help you save money by improving your application efficiency. Maybe you can use smaller front-end instances and larger back-end instances, or, if constant data consistency isn’t necessary, improve app performance by adding local caching of common data. Either way, packet analysis lets you figure out how to save money and realize the economic benefits of moving to the cloud.
Author Profile - Jim MacLeod is a Product Manager at WildPackets. He has been in the networking industry since 1994, and started doing protocol analysis in 1996. His experience includes positions in firewall and VPN setup and policy analysis, log management, Internet filtering, anti-spam, intrusion detection, network monitoring and control, and of course packet sniffing.