Blog | Round-the-clock monitoring

In the next installment of my blog series on DevOps at EMIS [now Optum], I want to talk about the practice of proactive monitoring. This process is all about preventing an incident before it occurs by monitoring the software and hardware that together make up a live service such as EMIS Web®.

Monitoring is often a hot topic for discussion, because most people only think about monitoring when it fails to prevent an incident. However, on an average day, the EMIS technical teams catch and fix potential incidents before you would ever have known about them.

This is why it’s vital to provide round-the-clock monitoring. EMIS has dedicated IT operations teams who work 24 hours a day, 7 days a week, 365 days of the year, monitoring our live services to make sure they’re running for our users.

As well as the dedicated 24/7 operations teams, our engineering teams in the software development space also play a crucial role in monitoring and responding to issues that arise on our services. We have dedicated technical specialists on call to support the operational teams, and we have other engineers working in the DevOps space, like myself, who take part in monitoring our services.

So you may be wondering, how do we monitor our live services?

Well, that’s done in lots of different ways. First, we must decide on a metric that we think is important for a given service. This may be how much disk space the server has, or how much memory it’s using.

Once you’ve defined the metric, you then need to establish what the value for that metric is under normal conditions. The answer may not be straightforward, depending on the service. Take the online interface for Patient Access and the NHS App, for example: usage always peaks on Monday mornings, when new appointments are made available, and is far lower at 4am, when everybody is in bed and virtually nobody is using the service.
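As a rough illustration of what "normal" can mean when usage depends on the time of week, here is a minimal Python sketch that averages a metric separately for each weekday-and-hour slot. The function name, the sample readings and the choice of a simple mean are all assumptions made for the example, not a description of how EMIS builds its baselines.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_baseline(samples):
    """Average a metric per (weekday, hour) slot.

    samples: iterable of (datetime, value) pairs.
    Returns a dict mapping (weekday, hour) to the mean value,
    where weekday 0 is Monday.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[(timestamp.weekday(), timestamp.hour)].append(value)
    return {slot: mean(values) for slot, values in buckets.items()}


# Hypothetical memory-usage readings (percent), invented for illustration.
samples = [
    (datetime(2024, 1, 1, 9), 62.0),  # Monday 09:00
    (datetime(2024, 1, 8, 9), 66.0),  # the following Monday 09:00
    (datetime(2024, 1, 2, 4), 18.0),  # Tuesday 04:00
    (datetime(2024, 1, 9, 4), 22.0),  # the following Tuesday 04:00
]

baseline = build_baseline(samples)
print(baseline[(0, 9)])  # expected value for Monday 09:00
print(baseline[(1, 4)])  # expected value for Tuesday 04:00
```

Keeping a separate expected value per slot means a busy Monday morning is compared with other busy Monday mornings rather than with the quiet overnight period.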

Once you’ve established what the metric is, and which values you should be worried about, you then need to tell somebody when it goes wrong. That is done by sending an alert to the technical teams through something like an instant messaging tool. The threshold that triggers an alert has to be carefully considered. It would be pointless, for example, to alert the technical teams when memory usage rises above 60% if this happens every Monday morning during the busy period. Engineers would become used to seeing alerts every Monday morning and could fail to spot a genuine alert mixed in with the ones that fire all the time and don’t indicate anything actually going wrong. This is what we call alert fatigue.
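One way to picture the difference between a fixed threshold and one that takes the busy period into account is the short Python sketch below. The 60% figure comes from the example above; the expected values, the 15-point margin and the function names are hypothetical and chosen only to show the idea.

```python
def fixed_threshold_alert(value, threshold=60.0):
    """Alert whenever the reading is above one fixed threshold."""
    return value > threshold


def baseline_aware_alert(value, expected, margin=15.0):
    """Alert only when the reading exceeds what is normal for this
    time slot by more than the margin (hypothetical default)."""
    return value - expected > margin


# Hypothetical expected memory usage (percent) for two time slots.
expected_monday_morning = 64.0
expected_overnight = 20.0

reading = 65.0  # the same reading, seen at two different times

# A fixed 60% threshold fires every busy Monday morning.
noisy = fixed_threshold_alert(reading)

# Compared with what is normal for the slot, only the 4am reading stands out.
busy_period = baseline_aware_alert(reading, expected_monday_morning)
overnight = baseline_aware_alert(reading, expected_overnight)

print(noisy, busy_period, overnight)
```

The fixed rule alerts on a perfectly normal Monday morning, which is the pattern that leads to alert fatigue, while the second rule stays quiet then and still flags the same reading at 4am.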

Equally, you don’t want to set the threshold so that it triggers too late; it should give you enough time to react and prevent an incident. With disk space, for example, there is no point setting the threshold at the point you run out of disk space. That would ensure you never get a false positive, but by then it would be too late to prevent the failure. You want to set the alert at a level that allows you to respond proactively and resolve the issue before the space runs out and the service suffers a fault.
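To make the idea of "enough time to react" concrete, the sketch below estimates how many hours remain before a disk fills up, based on two usage readings, and alerts when that estimate drops below a chosen lead time. It assumes growth is roughly linear, and the function names, figures and lead times are invented for the example rather than taken from any real EMIS configuration.

```python
def hours_until_full(used_then, used_now, hours_between, capacity):
    """Estimate hours until the disk is full, assuming linear growth.

    Returns None when usage is flat or shrinking.
    """
    growth_per_hour = (used_now - used_then) / hours_between
    if growth_per_hour <= 0:
        return None
    return (capacity - used_now) / growth_per_hour


def needs_attention(hours_left, lead_time_hours):
    """Alert while there is still lead_time_hours or less left to react."""
    return hours_left is not None and hours_left <= lead_time_hours


# Hypothetical readings in GB: 700 GB used yesterday, 760 GB used now,
# on a 1000 GB disk.
hours_left = hours_until_full(
    used_then=700, used_now=760, hours_between=24, capacity=1000
)
print(hours_left)

# With a 48-hour lead time there is no alert yet; with a 120-hour lead
# time the team is told well before the space runs out.
print(needs_attention(hours_left, lead_time_hours=48))
print(needs_attention(hours_left, lead_time_hours=120))
```

The lead time plays the role of the threshold here: the longer it is, the earlier the alert fires, so it has to be balanced against the risk of alerting on growth that would never have caused a problem.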

This process is an iterative one. Teams are constantly reviewing and evolving their monitoring to strike the best balance between creating too many alerts and not enough. Every time a service is added or changed, its monitoring has to be reviewed. In addition, we’re always reviewing the tools we use for monitoring to see where we can improve.

About the author

Robbie Frodsham

Senior site reliability engineer

Robbie has worked in Healthcare IT with Optum for 15 years in a number of different roles and has extensive knowledge of IT infrastructure. In his current position, he works with our engineering teams as a DevOps engineer on both new and existing products. He's passionate about DevOps and sharing his operational expertise with others to improve the way Optum delivers solutions to its customers.