Network Utilization, KPIs and the Truth

Recently, I came across a question posed on our website asking how to effectively measure network utilization. At a high level the answer seems easy, but in reality there is more to it. One should certainly measure bandwidth, and for the sake of this conversation we are talking about WAN bandwidth, in both the inbound and outbound directions. The quick way to do that is to check the interface counters at a given moment. While that snapshot can help with immediate remediation, it is only a snapshot. If you didn’t run the command at exactly the right moment, you might have missed something. Actually, you probably already did.
To really get a good perspective on bandwidth utilization, we want to look at a number of variables. First and foremost, we want some trending data on this environment. Being able to trend a WAN interface over time allows us to make informed, proactive decisions about upgrades or optimization strategies, and potentially to head off the need for remediation altogether. This is obviously going to entail more up-front work and some type of monitoring solution. There are a lot of them out there.
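To make the trending idea concrete, here is a minimal sketch of how a monitoring tool might turn periodic interface counter readings into utilization percentages. It assumes 64-bit octet counters polled at regular intervals (in the style of SNMP high-capacity counters); the `CounterSample` type, the function names and the link speed are all hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass

COUNTER_MAX = 2 ** 64  # assumption: 64-bit counters that wrap back to zero


@dataclass
class CounterSample:
    timestamp: float  # seconds since some fixed reference point
    in_octets: int    # cumulative bytes received on the WAN interface
    out_octets: int   # cumulative bytes sent on the WAN interface


def utilization(prev, curr, link_bps):
    """Return (inbound %, outbound %) of link capacity between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    def pct(old, new):
        delta = (new - old) % COUNTER_MAX  # modulo handles a counter wrap
        return delta * 8 / elapsed / link_bps * 100

    return pct(prev.in_octets, curr.in_octets), pct(prev.out_octets, curr.out_octets)


def trend(samples, link_bps):
    """Utilization for each consecutive pair of samples, oldest first."""
    return [utilization(a, b, link_bps) for a, b in zip(samples, samples[1:])]
```

Feeding `trend` a day or a month of five-minute samples gives the series you would chart; a single call to `utilization` is the snapshot.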
Trending data alone is not enough, though. The most important piece of this analysis is understanding not only bandwidth usage patterns but also which applications are using what share of that bandwidth. Having that application-level granularity allows us to make more informed decisions about our infrastructure, the applications running on it and the future direction of all of it. Multiple vendors support protocols on their WAN routers that can assist us with this: NetFlow, J-Flow, sFlow, etc. All require some type of collection device and a related head-end application to parse this information and present it graphically. One thing that some don’t consider, and it is common to all of these, is the storage needed to keep this data. Some of these solutions are appliance based, while others require a back-end database and the storage to hold the raw data. How long you want to keep it, what the sampling rate is and what the roll-ups look like will dictate what you need from a storage capacity perspective.
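As a rough illustration of both points, the sketch below totals bytes per application from already-parsed flow records and does a back-of-the-envelope storage estimate for raw flow data. The record shape (an application label plus a byte count), the function names and any figures you plug in are assumptions for illustration; real collectors export far richer records, and roll-ups would reduce the raw number considerably.

```python
from collections import Counter


def bandwidth_share_by_app(flows):
    """Percentage of total bytes per application, largest consumer first.

    `flows` is an iterable of (application, byte_count) pairs, a simplified
    stand-in for what a flow collector would hand back after parsing.
    """
    totals = Counter()
    for app, byte_count in flows:
        totals[app] += byte_count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {app: 100 * b / grand_total for app, b in totals.most_common()}


def raw_storage_bytes(flows_per_second, bytes_per_record, retention_days):
    """Bytes needed to keep every raw flow record for the retention period."""
    seconds_per_day = 86_400
    return flows_per_second * bytes_per_record * seconds_per_day * retention_days
```

For example, under the assumption of 1,000 flows per second at 50 bytes per stored record, 30 days of raw retention works out to roughly 130 GB before any indexing overhead or roll-ups.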
Lastly, we should be looking not only at bandwidth utilization (with application-level granularity) but also at some of the other valuable Key Performance Indicators. Attributes such as errors and uptime are also extremely valuable in determining immediate and future needs.
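For the two KPIs mentioned here, a minimal sketch of the arithmetic might look like the following. The function names and inputs are hypothetical; in practice the packet and error counts would come from interface counters and the downtime from your monitoring system's outage records.

```python
def error_rate_pct(error_packets, total_packets):
    """Errored packets as a percentage of all packets seen in the period."""
    if total_packets == 0:
        return 0.0  # no traffic means nothing could have errored
    return 100 * error_packets / total_packets


def availability_pct(downtime_seconds, period_seconds):
    """Uptime as a percentage of the measurement period."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    return 100 * (period_seconds - downtime_seconds) / period_seconds
```

Tracked over time alongside utilization, these numbers help separate a link that is simply full from one that is unhealthy.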