How long should I store packet captures? How much storage should I provision to monitor a 10Gbps link? When is NetFlow enough, and when do I need to capture at the packet level?
These are questions network operations managers everywhere are asking, because unfortunately best practices for network data retention policies are hard to find. Whereas CIOs now generally have retention policies for customer data, internal emails and other kinds of files, and DBAs generally know how to implement those policies, the right retention policy for network capture data is less obvious.
The good news is that there are IT shops out there that are ahead of the curve and have figured a lot of this out.
To begin with, it’s important to clarify for your own organization what the goals are for network history. Some common answers include:
- Respond faster to difficult network issues
- Establish root cause and long-term resolution
- Contain cyber-security breaches
- Optimize network configuration
- Plan network upgrades
You may notice that the objectives listed above vary in who might use them: stakeholders could include network operations, security operations, risk management and compliance groups, among others. While these different teams often operate as silos in large IT shops, in best-practice organizations these groups cooperate to create a common network-history retention policy that cuts across the silos (and in the most advanced cases, they have even begun to share network-history infrastructure assets).
Some of your objectives may be met by keeping summary information – events, statistics or flow records, for example – and others commonly require keeping partial or full packet data, as well. A good retention policy should address the different types of network history data, including:
- Flow records – sampled
- Flow records – 100 percent
- Enhanced flow records or metadata
- Full packet data – control plane
- Full packet data – select servers, clients, or applications
- “Sliced” packet headers – all traffic
- Full packet data – all traffic
Generally speaking, the items at the top of the list are smaller and therefore cheaper to keep for long periods of time, while the items at the bottom are larger and more expensive to keep, but much more general. If you have the full packet data available, you can re-create any of the other items on the list as needed; without it, you can answer only a subset of questions. That leads to the first principle: keep the largest objects (like full packet captures) for as long as you can afford (which is generally not very long, because the data volumes are so large), and keep summarized data for longer.
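To make the size trade-off concrete, here is a minimal sketch that estimates per-day storage for each tier relative to full packet capture. The per-tier size ratios are purely hypothetical illustrations (real ratios depend heavily on traffic mix and on the recording product), but they show why summary data can be kept far longer than full captures.

```python
# Sketch: per-day storage for each history tier, relative to full packet data.
# The size ratios below are hypothetical illustrations, not vendor figures.
TIER_FRACTION_OF_FULL_PCAP = {
    "flow records (sampled)":   0.0001,
    "flow records (100%)":      0.002,
    "enhanced flow / metadata": 0.01,
    "sliced packet headers":    0.1,
    "full packet data":         1.0,
}

def full_pcap_bytes_per_day(avg_bits_per_sec: float) -> float:
    """Bytes of raw capture per day at a given average link rate."""
    return avg_bits_per_sec / 8 * 86400  # bits -> bytes, seconds per day

def tier_bytes_per_day(avg_bits_per_sec: float) -> dict:
    base = full_pcap_bytes_per_day(avg_bits_per_sec)
    return {tier: base * frac for tier, frac in TIER_FRACTION_OF_FULL_PCAP.items()}

# Example: a link averaging 2 Gbps in total
for tier, size in tier_bytes_per_day(2e9).items():
    print(f"{tier:28s} {size / 1e12:8.3f} TB/day")
```

At 2 Gbps average, full capture accumulates roughly 21.6 TB per day, while even generous flow-record retention stays in the tens of gigabytes, which is why the retention durations below diverge so sharply.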
Next, you should always take guidance from your legal advisor. There may be legal requirements arising from regulation (PCI, Rule 404, IEC 61850, etc.), e-discovery or other sources; this article is not meant to be legal advice.
That said, in the absence of specific legal requirements that supersede them, here are the best practices we're seeing in the industry, working the list from bottom to top:
Packet data for all traffic: 72 hours
- Full packet data or “sliced” packet headers? The choice here will depend on how tightly controlled your network is and on what level of privacy protection your users are entitled to. For highly controlled networks with a low privacy requirement, such as banking, government or public utilities, full packet capture is the norm. For consumer ISPs in countries with high privacy expectations, packet header capture may be more appropriate. General enterprise networks fall somewhere in between.
- Whichever type of packet data is being recorded, the goal consistently stated by best-practice organizations is a minimum of 72 hours retention, to cover a 3-day weekend.
- For the most tightly controlled networks, retention requirements may be 30 days, 90 days, or longer.
Packet data for control plane & for select traffic: 30+ days
Control plane traffic can be extremely useful in troubleshooting a wide variety of issues. It’s also a type of traffic that is owned by the network operator, not the customer, so even networks that don’t record all traffic should keep history here.
Traffic types of interest include, for example:
- Routing protocols (OSPF, IS-IS, EIGRP, BGP; plus protocols like RSVP, LDP, BFD, etc. in carriers)
- L2 control plane (ARP, spanning tree, etc.)
- LDAP, RADIUS, Active Directory
- Signaling protocols like SIP, H.225.0, SCCP, etc.
- GTP-C in mobile networks
In addition to control plane traffic, in every network there are particular servers, clients, subnets, or applications that are considered particularly important or particularly problematic. For both control-plane and network-specific traffic of interest, organizations are storing a minimum of 30 days of packet data. Some organizations store this kind of data for up to a year.
Flow records @ 100%: 120+ days
- Best-practice organizations record either enhanced metadata, or at least basic NetFlow v5/v9/IPFIX.
- This flow data is useful for a wide variety of diagnosis and trending purposes. Although a few router models can generate flow records on 100% of traffic, best practice is to move this function to a dedicated probe appliance connected to the network via tap, SPAN or matrix switch. The probe both offloads the router/switch and enriches the flow data with DPI / application-identification information.
- Best practice here is to store at least 120 days of flow data (we have seen organizations keep 100 percent flow records for as long as seven years).
Samples and summaries: 2 years or more
- sFlow or sampled NetFlow, using 1:100 or 1:1000 packet samples, can be useful for some kinds of trending and for detecting large-scale Denial of Service attacks. There are significant known problems with sampled NetFlow, so it’s not a replacement for 100 percent flow, but it does have usefulness for some purposes.
- Summary traffic statistics – taken hourly or daily, by link and by application – can also be helpful in understanding past trends to help predict future trends.
- Because this data takes relatively little space, and because it is mostly useful for trending purposes, organizations typically plan to keep it for a minimum of two years.
- One point to remember in maintaining history over periods of a year or longer is that network configurations may change, creating discontinuities. It’s important to record every major network topology change or configuration change alongside your traffic history data, so you don’t compare incomparable data and draw the wrong conclusions.
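One lightweight way to guard against such discontinuities (a sketch with invented field names, not any product's schema) is to record configuration-change timestamps alongside the traffic history and compare only samples that fall within the same configuration "epoch":

```python
import bisect
from datetime import datetime

# Sketch: tag traffic samples with a "configuration epoch" so trend
# comparisons never straddle a major topology or configuration change.
# The timestamps and the epoch concept are illustrative assumptions.
config_changes = sorted([
    datetime(2023, 3, 1),   # e.g. core router upgrade
    datetime(2023, 9, 15),  # e.g. new WAN link added
])

def epoch_of(ts: datetime) -> int:
    """Epoch = number of recorded config changes preceding this timestamp."""
    return bisect.bisect_right(config_changes, ts)

def comparable(ts_a: datetime, ts_b: datetime) -> bool:
    """Two samples are comparable only if no config change lies between them."""
    return epoch_of(ts_a) == epoch_of(ts_b)

print(comparable(datetime(2023, 4, 1), datetime(2023, 8, 1)))   # same epoch -> True
print(comparable(datetime(2023, 2, 1), datetime(2023, 4, 1)))   # change between -> False
```

However the change log is kept, the point is the same: the change record must be retained at least as long as the longest-lived summary data it annotates.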
Average vs. Peak vs. Worst-case?
One challenge faced in sizing network-history storage capacity is the fact that well-designed networks run well below 100 percent capacity most of the time, but in times of stress (which is when network history is most valuable) they may run much hotter. Should you size for 72 hours of typical traffic, or 72 hours of worst-case?
The best practice we’ve seen here is to make sure your network-history system can capture at the worst-case rate but has enough storage provisioned for the typical rate. The reasoning is that when the network gets very highly loaded, someone will be dragged out of bed to fix it much sooner than 72 hours, so a long duration of history is not needed. However, that person will want to be able to rewind to the onset of the event and see a full record of what was happening immediately before and after, so having a system that records all traffic with zero drops is crucial.
Here’s an example to make it concrete:
Suppose you have a bi-directional 10Gbps link that averages 1Gbps in each direction over a 24-hour period, and 3Gbps in each direction over the busiest hour of the day.
Then 72 hours of full packet storage at typical load would require (2Gbit/sec x 72 hours x 3600 sec/hour / 8 bits/byte) = 64,800 Gbytes, or about 65 terabytes.
Under worst-case load, when recording is most important, the link could run at the full 10Gbps in each direction (20Gbps total), which would fill storage ten times as fast. The good news is that best practice does not require provisioning 10x the storage capacity, but you should be using a capture system that can record at the full rate with zero loss. That means that in a worst-case scenario your storage duration would be closer to seven hours than 72. But in that kind of scenario, someone will be on the case in much less than seven hours and will have taken action to preserve data from the onset of the event.
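The arithmetic in this example can be checked with a short sketch (the 20 Gbps peak assumes the link saturates in both directions):

```python
# Sketch: the storage-sizing arithmetic from the worked example above.
HOURS_RETENTION = 72

def storage_bytes(avg_bits_per_sec: float, hours: float = HOURS_RETENTION) -> float:
    """Storage needed to retain `hours` of capture at a given average rate."""
    return avg_bits_per_sec * hours * 3600 / 8

def worst_case_hours(storage: float, peak_bits_per_sec: float) -> float:
    """How long that same storage lasts if the link runs at peak rate."""
    return storage * 8 / peak_bits_per_sec / 3600

typical = 2e9    # 1 Gbps in each direction, bi-directional
peak = 20e9      # full 10 Gbps in each direction under stress

cap = storage_bytes(typical)
print(f"{cap / 1e12:.1f} TB provisioned")                       # ~64.8 TB
print(f"{worst_case_hours(cap, peak):.1f} h at worst case")     # ~7.2 h
```

So storage sized for 72 hours of typical traffic yields roughly 7 hours of retention at full load, which matches the reasoning above about how quickly an operator will be engaged during an event.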
Of course, the same considerations apply for other types of network history: Systems need to be able to process and record at the worst-case data rate, but with reduced retention duration.
The above discussion slightly oversimplifies the case; there are actually two more important considerations to keep in mind in sizing storage for network history.
First, most recording systems will store some metadata along with packet captures, and this adds some overhead to the storage needed – typically around 20 percent, though it may vary depending on the traffic mix and on the recording product you use.
Second, while we say above you should provision storage for typical load, most organizations actually use projected typical load, extrapolating the traffic trend out to 18-36 months from design time. How far ahead you look depends on how often you are willing to upgrade the disks in your network recording systems. A three-year upgrade cycle is typical, but with disk capacity and costs improving rapidly there are situations where it can be more cost-effective to provision less storage up front and plan to upgrade every 24 months.
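Putting both adjustments together, a sizing sketch might look like the following. The 20 percent metadata overhead comes from the discussion above; the 30 percent annual growth rate and three-year horizon are illustrative assumptions you would replace with your own traffic trend.

```python
# Sketch: provisioned storage including metadata overhead and projected growth.
# The 30%/yr growth figure and 3-year horizon are illustrative assumptions;
# the ~20% metadata overhead is a typical figure, varying by product and mix.
def provisioned_bytes(avg_bits_per_sec: float,
                      hours: float = 72,
                      metadata_overhead: float = 0.20,
                      annual_growth: float = 0.30,
                      horizon_years: float = 3.0) -> float:
    raw = avg_bits_per_sec * hours * 3600 / 8          # today's typical load
    projected = raw * (1 + annual_growth) ** horizon_years  # load at end of cycle
    return projected * (1 + metadata_overhead)         # plus recording metadata

# 2 Gbps typical today, sized for a 3-year disk-upgrade cycle:
print(f"{provisioned_bytes(2e9) / 1e12:.0f} TB")
```

Under these assumptions the 65 TB from the earlier example grows to roughly 170 TB, which illustrates why the projection horizon and upgrade cycle are worth deciding explicitly rather than by default.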
Implementing the policy
When organizations first take on the challenge of standardizing network-history retention policy, they nearly always discover that their current retention regime is far away from where they think it needs to be.
Typically we have seen that implementing a best-practice retention policy happens in six phases:
- Create the “idealized” policy describing where you want to be, without regard to current state
- Inventory the current state and identify how far off it is from the ideal
- Set targets for 3-6 months, 12 months and 24 months
- Over the 3-6 month horizon, take low-hanging fruit by reconfiguring existing systems to optimize for the new policy, and identify what new technologies will be needed to achieve the chosen retention policy
- Over the 12-month horizon, pilot any new technologies that may be required to achieve the long-term policy
- Over the 24-month horizon, roll out these technologies network-wide
In summary:
- Bring together stakeholders to develop a common network-history retention policy
- Understand everyone’s objectives
- Check with legal advisor
- Choose what types of data will be kept for what purposes
- Set idealized retention goals for each
- Inventory current state and gaps
- Close the gaps over 24 months
Spencer Greene joined Endace in October 2011 from Juniper Networks to head up Product Management and Marketing. He has more than 20 years’ experience in the computer networking industry and is widely regarded as both an expert in his field and a visionary. Spencer was originally the founder of Layer 5 Networks, which was sold to Juniper Networks in 1999. Between 1999 and 2011 he held a range of senior posts within Juniper, including VP Product Management, VP Corporate Development and VP Junos OOPS. He is widely quoted in the industry, has a significant number of technology patents to his name, and is based in Silicon Valley, California.