Network monitoring in the era of high availability

A single point of failure in a network infrastructure could end up causing potentially catastrophic results for the business. A proper testing and troubleshooting strategy can help.

Tags: CommScope, Fluke Networks, Riverbed Technology, United Arab Emirates
Development in automation of network monitoring tools will be a relief to network managers everywhere.
By David Ndichu | Published July 20, 2017

There’s no doubt the enterprise network has become increasingly complex and cumbersome in the past few years as more demands are placed on it. How to monitor this complicated sprawl has emerged as a new challenge for network managers.

A typical reaction has been to increase the number of infrastructure monitoring tools, which only adds layers of complexity to the network.

Charbel Khneisser, regional presales director, METNA at Riverbed, says research has shown that organisations use five or more tools to monitor their infrastructure.

Even with all those tools, network managers are still unable to catch all the problems they have in their infrastructure, says Khneisser.

And unfortunately for network operations teams, the latest network technologies such as SD-WAN, the software-defined data centre, virtualisation and cloud are only increasing the pressure to identify the root causes of network problems. “The IT infrastructure monitoring tools organisations have in place are creating more challenges in dynamic infrastructure environments, rather than helping organisations proactively identify performance bottlenecks and troubleshoot faster,” he adds.

The root cause of the weaknesses is that traditional IT infrastructure monitoring systems have typically relied on static thresholds. The network manager would set a parameter and the system would generate an alert if and when that threshold was exceeded.

That is why ongoing development in automation of network monitoring tools will be a relief to network managers everywhere.

Such a solution would use an intelligent anomaly detection technique that automatically sets dynamic performance baselines and identifies performance deviations in real time.

“A network behaviour analysis engine with intelligence can be used to create baselines for every device’s performance. This baseline adjusts with time depending on the load on the network, the evolution of the network and changes in users’ behaviour,” explains Khneisser.
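The dynamic-baselining idea Khneisser describes can be sketched with an exponentially weighted moving average: the baseline tracks each device metric and adapts over time, while readings far outside the moving band raise an alert. This is a minimal illustration of the general technique, not Riverbed's actual engine; all parameter values here are assumptions.

```python
class DynamicBaseline:
    """Toy dynamic baseline: EWMA of a metric plus an EWMA variance,
    flagging samples that fall outside a moving tolerance band."""

    def __init__(self, alpha=0.1, threshold_sigma=3.0, warmup=5):
        self.alpha = alpha                  # how quickly the baseline adapts
        self.threshold_sigma = threshold_sigma
        self.warmup = warmup                # samples seen before alerting starts
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, value):
        """Feed one metric sample; return True if it looks anomalous."""
        self.n += 1
        if self.mean is None:               # first sample seeds the baseline
            self.mean = value
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        anomalous = (self.n > self.warmup and std > 0
                     and abs(deviation) > self.threshold_sigma * std)
        # The baseline keeps adapting, so normal drift (growing load,
        # evolving user behaviour) is absorbed rather than alerted on.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


baseline = DynamicBaseline()
# Normal link-utilisation samples (hypothetical) establish the baseline...
for sample in [50, 52, 48, 51, 49, 50]:
    baseline.update(sample)
# ...and a sudden spike is flagged before users complain.
print(baseline.update(80))
```

In contrast with the static thresholds described earlier, nothing here is configured per device: the tolerance band is learned from the traffic itself and moves with it.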

With inbuilt capacity to detect even slight changes, such a system helps the IT department identify a problem before end users are impacted. “Network managers receive alerts automatically sent by the system before their end users are affected, giving them the ability to be proactive and fix the problem beforehand,” Khneisser says. He cites recent Riverbed research which revealed that 40% of IT managers say their network problems are first reported by end users, before network operations teams are even aware of them.

A key part of effective network monitoring is building resilience in the physical infrastructure from the very beginning.

Many network failures are due to the infrastructure being static and unsuitable for emerging technologies, says David Hughes, director, field application engineering, MEA, CommScope. Automated infrastructure management (AIM) applications help with documenting, troubleshooting and managing the physical layer.

“Automation can significantly impact operational expenditure (OPEX) and sustain business continuity and compliance by providing a comprehensive connectivity map of the infrastructure and its assets,” explains Hughes. “Automation also reduces human error and the mean time to repair (MTTR), a basic measure representing the average time required to repair a failed component or device,” he adds.
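MTTR, the metric Hughes refers to, is straightforward to compute from incident records: it is simply the average elapsed time between a failure being detected and service being restored. The incident timestamps below are hypothetical, purely to show the arithmetic.

```python
from datetime import datetime

# Hypothetical incident log: (failure detected, service restored)
incidents = [
    (datetime(2017, 7, 1, 9, 0),  datetime(2017, 7, 1, 11, 30)),   # 2.50 h
    (datetime(2017, 7, 8, 14, 0), datetime(2017, 7, 8, 14, 45)),   # 0.75 h
    (datetime(2017, 7, 15, 2, 0), datetime(2017, 7, 15, 6, 15)),   # 4.25 h
]

# Repair time per incident, in hours
repair_hours = [(end - start).total_seconds() / 3600 for start, end in incidents]

# MTTR = total repair time / number of incidents
mttr = sum(repair_hours) / len(repair_hours)
print(f"MTTR: {mttr:.2f} hours")  # prints "MTTR: 2.50 hours"
```

The point of AIM in this context is to shrink the "finding the fault" portion of each repair interval, which is often the dominant term.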

Causes of failure in the enterprise physical infrastructure system are many. The evolution of the data centre and its subsequent architecture due to the proliferation of fibre is one of these, says Hughes. Latency is also a key issue in high performance networks so building a design that is future-ready is critical, he adds.

Automating some of the testing functions at the physical layer to avoid errors is an important technique to make the testing process more efficient, says Werner Heeren, regional sales director for the Middle East at Fluke Networks.

Half of more than 800 installers worldwide surveyed by Fluke Networks reported having to retest links because they were tested to the wrong limits. Of these, 37% reported dealing with negative loss fibre measurements, a clear indication of failure in the system.
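Both failure modes in the survey, wrong test limits and negative loss readings, are the kind of thing a small sanity check can catch automatically. The sketch below is illustrative only; the limit values are hypothetical, not figures from any specific cabling standard. Negative insertion loss is physically impossible for a passive link, so it indicates a bad test reference rather than a good cable.

```python
def check_loss(measured_db, limit_db):
    """Classify one fibre insertion-loss reading against its test limit."""
    if measured_db < 0:
        # A passive link cannot amplify: negative loss means the
        # reference setting is wrong (e.g. dirty or swapped reference
        # cords), so the result is invalid, not a pass.
        return "INVALID: negative loss, re-reference and retest"
    if measured_db > limit_db:
        return "FAIL"
    return "PASS"


# Hypothetical readings against a 1.5 dB link budget
print(check_loss(0.8, 1.5))   # prints "PASS"
print(check_loss(2.0, 1.5))   # prints "FAIL"
print(check_loss(-0.2, 1.5))  # flagged as invalid, not passed
```

Building this kind of check into the tester itself, rather than leaving it to the technician, is exactly the automation Heeren is describing.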

Intelligence is also increasingly being built into the testing and monitoring equipment itself.

During a test on a cabling system, the testing equipment can automatically notice an unrelated fault on the system and carry out an additional test, identifying a potential future point of failure, explains Heeren. “We could do a cable test and even if the cable passes the test, the testing device could offer additional commands and recommend an improvement in the cable at a certain distance because of a bad splice.”

Catching anomalies at the testing stage is a far more effective proposition than troubleshooting later on.

Fluke Networks recently surveyed system integrators about the time they spend performing various test-related functions. Effectively dealing with problems and inefficiencies during the testing process improves profit margins, with some companies reporting as much as 10% added to their bottom line, says Heeren. “This additional profit means more revenues for system integrators, or if they choose to pass along their savings to prospective customers, it means more competitive bids.”

As businesses expand, migrating to faster network infrastructure is imperative. Organisations with legacy networks can’t scale their operations easily. “The need for ultra-low loss links and multiple interconnects often causes legacy systems to slow down or fail completely as the architecture is re-configured,” says Hughes.

Infrastructure migration to high performing networks such as fibre is now part of the organisational strategy for many enterprises, says Hughes. “It’s crucial for businesses to understand this strategy and partner with technology companies like CommScope to assess their needs and make sure they design the infrastructure to suit the future applications and strategy.”

A network that’s not properly tested, diagnosed and configured can put the future growth of a business at risk, or indeed its security posture, says Hughes. “This has direct consequences including loss of revenue, downtime and increased capital expenditure (CAPEX) due to increased spending on the physical layer,” he adds.

Other consequences include increased OPEX due to the extra time spent troubleshooting a network as valuable resources are consumed, while the network is left more error-prone due to inadequate documentation. The business will also suffer from loss of reputation, Hughes adds.

“Organisations can mitigate against these shortcomings by using qualified partners to help plan their greenfield journey and strategy, and subsequently deploy it in a standards compliant fashion that can scale to suit the business needs moving forward,” Hughes says.

Improper network configuration is a major cause of network outages. In recent Riverbed research, 38% of IT professionals cited network configuration issues as the root cause of major outages, while only 27% attributed their problems to server issues, says Khneisser.

Human error is also still seen as a major cause of failure. “But so is understanding the limitations on what you have and where you want to go, especially as networks become more complex,” Hughes says.

Fluke Networks provides certification, troubleshooting, and installation tools for technicians who install and maintain critical network cable infrastructure.

Improved planning and setup is an essential first step to reduce time and costs in testing and troubleshooting, says Heeren. “A test system that offers a single job management function to track all job requirements at the outset can pay big dividends during the course of the job. Such a system can efficiently set the job requirements and progress from setup to systems acceptance, making sure that all tests are completed accurately.”

Test parameters set up at one central point and shared via the cloud with both technicians and administrative staff save time and reduce errors. “Advanced testing systems eliminate travel to and from the office for test reporting, resulting in more accurate reporting and faster project closeout,” says Heeren.

In an age where applications drive the entire business, ensuring applications delivery and availability is paramount. Khneisser of Riverbed cites infrastructure-related problems as one of the major reasons applications fail. Something as simple as an underperforming device can cause significant downstream performance issues with the app, he adds.

That’s where Riverbed’s SteelCentral NetIM infrastructure monitoring software can help, says Khneisser. “Riverbed’s approach is to help customers by first building the whole infrastructure topology, with a diagram, to detect performance and configuration issues. Second is to map application network paths and to help organisations visualise the entire path that an app would take on the network to reach end users,” he adds.

The modern IT network is under tremendous stress, driven by customer demand for 24/7 availability. This makes failure in the infrastructure potentially catastrophic.

“We are in a constant period of change,” observes Hughes. “Application standards continue to evolve to meet the demands of mobility, IoT, big data etc. In simple terms, you need an infrastructure that is agile, scalable and secure.”

Automating the physical layer, with AIM solutions such as CommScope’s imVision, provides the ability to manage critical links. “As networks continue to develop, simplifying complexity is key for data centre operators. Inadequate documentation and human error are prime causes of network issues and downtime. When problems occur, the time should be spent on fixing the problem and not finding the cause,” Hughes asserts.

Data centres are driving the economy and are the life-blood of a business and this needs to be understood and recognised at the highest levels of an organisation, says Hughes.

In an increasingly applications-centric enterprise environment, monitoring network performance without understanding how applications themselves behave is simply not enough.

Mapping the application path makes for more efficient troubleshooting, says Riverbed’s Khneisser.

“Customers need to understand how application data traverses the network, making measuring performance faster and easier. By monitoring individual components on the network, network managers can identify the root cause of an individual device failure and its impact on application delivery over the network,” says Khneisser.

An end-to-end platform that can bridge the gap between application performance and network performance is required.

“Networks are built to deliver applications; the intersection between network and application performance is that they reinforce each other. Network performance management is very important, but we simply cannot monitor the network without having an understanding of the application level,” Khneisser adds.
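The path-mapping idea can be reduced to a very small data model: if you know which devices each application's traffic traverses, a single device failure can immediately be tied to the applications it affects. This is a toy sketch of the concept, not SteelCentral's data model; the application and device names are invented for illustration.

```python
# Hypothetical application-to-path map: each application's traffic
# traverses an ordered list of network devices to reach end users.
app_paths = {
    "crm":  ["edge-fw-1", "core-sw-2", "dc-sw-5", "srv-14"],
    "mail": ["edge-fw-1", "core-sw-3", "dc-sw-6", "srv-22"],
}

def apps_affected_by(device):
    """Return the applications whose network path includes this device."""
    return sorted(app for app, path in app_paths.items() if device in path)


# A shared edge firewall failing hits every application behind it...
print(apps_affected_by("edge-fw-1"))  # prints "['crm', 'mail']"
# ...while a leaf switch failure is scoped to one application.
print(apps_affected_by("dc-sw-5"))    # prints "['crm']"
```

Inverting the map this way is what turns a device alert ("dc-sw-5 is down") into a business-level statement ("CRM users are affected"), which is the bridge between network and application monitoring the article describes.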

A fundamental part of having a resilient infrastructure platform is a team of properly trained engineers with the correct test equipment. This should be approved vendor equipment with up to date calibration.

The IT personnel must also understand the test parameters and methodology, follow the application guidelines and the link-loss limits for the system. “Educating and working with the end-user is also a key factor, especially on the maintenance of fibre links such as effective cleaning and handling of connectors and cable,” says Hughes.
