In the first article of this series dedicated to monitoring, you may have noticed that I talk about monitoring solutions rather than monitoring tools. The first mistake to avoid when setting up monitoring is to look for a tool before thinking about your needs. It is essential to understand the different types of monitoring that exist, because not all tools on the market cover them all.
Types of monitoring
Application monitoring can be classified into four categories:
- Log management:
A tool for centralizing application logs and offering advanced search functionality. This is the foundation of monitoring, as it is necessary to be able to search the logs of all services through a single interface, without having to connect to different servers. For example, in order to trace a user's activity, the logging tool must allow you to filter on the user's ID and thus follow their path through the different services.
As an example of an existing tool, the Elastic suite (Elasticsearch, Logstash, Kibana) is popular.
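As a sketch of what makes that kind of filtering possible, the snippet below emits JSON-structured logs with a `user_id` field that a tool like Kibana can then filter on. The field names, logger name, and log message are illustrative assumptions, not a prescribed schema:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, keeping extra fields searchable."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields become top-level, filterable keys.
            "user_id": getattr(record, "user_id", None),
            "service": getattr(record, "service", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# If every service logs the same user_id key, a single query
# (user_id:"42") in the central tool reconstructs the user's path.
logger.info("payment accepted", extra={"user_id": "42", "service": "billing"})
```

The point of the design is that the filterable key is structured data, not text buried in the message, so the search works identically across all services.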
- Collection of metrics:
It's about being able to collect the many metrics that the different layers of the system can report, from hardware to business metrics. For example: CPU load, memory used, free disk space, HTTP or database connection pools, number of requests per second, number of calls to a business feature, etc.
Tools such as Prometheus are specialized in collecting and serving this data quickly, stored as time series of (numerical value, timestamp) pairs. Grafana can then be used for visualization and dashboard creation.
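To make the (value, timestamp) pair concrete, here is a minimal sketch of the kind of text lines a scrape endpoint returns for Prometheus to collect. The metric names and labels are invented for the example; a real application would use an official client library rather than formatting lines by hand:

```python
import time


def render_metric(name, value, labels=None, ts_ms=None):
    """Format one sample as Prometheus-style exposition text:
    a metric name, an optional label set, a numeric value, a timestamp."""
    label_str = ""
    if labels:
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + inner + "}"
    ts = ts_ms if ts_ms is not None else int(time.time() * 1000)
    return f"{name}{label_str} {value} {ts}"


# A scrape endpoint would return lines like these on every collection cycle:
print(render_metric("process_cpu_load", 0.42))
print(render_metric("http_requests_total", 1027, labels={"method": "GET"}))
```

Each scrape appends one more (value, time) point to the series, which is exactly what Grafana later plots.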
- APM:
APM stands for Application Performance Monitoring. This advanced monitoring tool is embedded in the application runtime. As a result, it can observe everything that happens during execution: how much time is spent in a method, memory allocations, SQL queries, etc.
Many tools exist, with a wide range of pricing. An APM is specific to the technology of your application: for example, if your application is written in Java, you will need an APM designed for Java. By their nature, APMs are able to provide other types of monitoring, such as tracing and metrics.
- Tracing:
An essential tool for microservices architectures, trace monitoring tracks a processing flow across applications. A bit like a sidecar that follows the processing wherever it goes, it collects data along the way and can reconstruct the complete trace of a request even after it has passed through a dozen different applications, unlike an APM, which is limited to the perimeter of a single application (note that some advanced APMs also support tracing). For example, we can collect the trace of a user's request from the click on the GUI button down to the last microservice called.
The goal is to keep the trace even when the application technology changes, by propagating a context. This type of monitoring is fairly recent, and standards such as OpenTracing/OpenTelemetry are emerging to ensure cross-technology compatibility.
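Context propagation can be sketched in a few lines. The header name, service names, and in-memory span store below are illustrative assumptions, standing in for what an OpenTelemetry SDK and a backend such as Jaeger do for you:

```python
import uuid

TRACE_HEADER = "x-trace-id"  # illustrative header name, not a standard
collected_spans = []         # stand-in for a tracing backend like Jaeger


def handle(service_name, headers, downstream=None):
    """Each service reuses the incoming trace id (or starts a new one),
    records its own span, then forwards the same id to the next service."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    collected_spans.append({"trace": trace_id, "service": service_name})
    if downstream:
        # Context propagation: the id travels with the request.
        downstream({TRACE_HEADER: trace_id})
    return trace_id


# GUI click -> gateway -> billing: both spans share one trace id,
# so the backend can reassemble the full path of the request.
handle("gateway", {}, downstream=lambda h: handle("billing", h))
```

Because the id rides in the request metadata rather than in any language-specific structure, the same scheme works even when the next service is written in a different technology.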
Leaders in the monitoring market are investing heavily in artificial intelligence to help identify the source of a problem. This is referred to as AIOps, characterized by the use of Artificial Intelligence (AI) to solve IT Ops problems. Gartner's reports on this new development are promising.
Monitoring tools such as our partner Dynatrace are able to cover all these different types of monitoring. Another way is to build your own stack of monitoring tools based on open source solutions such as Prometheus, Grafana, ELK, Jaeger (Tracing). This approach will require you to evolve your code to expose the metrics and of course to configure and install these new tools. In short, monitoring applications will require either a financial investment for the acquisition of a turnkey tool or time to set up a custom tool.
Now that you have a global view on the different types of monitoring, the question is what to do with all this collected information?
Organizing around monitoring
Here you are with your monitored applications, collecting several thousand data points per minute: logs, metrics, traces, stack traces, etc.
The first step is to design dashboards to centralize this information. This step is not as simple as it sounds, and it is crucial for investigating quickly when problems occur. Many of the collected metrics will be obscure or misunderstood, so it is essential to make dashboards as clear as possible, with understandable chart names and descriptions that are as complete as possible. We often have to transform metrics to make them easier to assimilate and understand. A metric such as an HTTP request counter will be transformed into a rate to display the number of requests per second. It is sometimes useful to combine several metrics: for example, I have combined several connection pool metrics to display the state of the pool in a synthetic, visual way, where a negative value shows how many connections are waiting.
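Both transformations can be sketched in a few lines. The function names and metric shapes are illustrative; in practice a query language such as PromQL would compute the rate for you:

```python
def requests_per_second(counter_samples):
    """Turn a monotonically increasing request counter into a rate,
    given (timestamp_seconds, cumulative_count) samples."""
    (t0, c0), (t1, c1) = counter_samples[0], counter_samples[-1]
    return (c1 - c0) / (t1 - t0)


def pool_headroom(max_size, active, waiting):
    """Combine pool metrics into one signed value for a single chart:
    positive = free connections, negative = threads waiting for one."""
    return (max_size - active) if waiting == 0 else -waiting


print(requests_per_second([(0, 0), (10, 500)]))       # 50 requests per second
print(pool_headroom(max_size=20, active=20, waiting=3))  # negative: 3 waiting
```

The signed-value trick is what makes the pool chart readable at a glance: the line dipping below zero is the warning sign, with no mental arithmetic required.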
Several approaches are possible to define the content and perimeter of a dashboard. For metrics, I opted for an approach with two types of dashboard: component-specific ones and global ones.
Specific dashboards group together metrics from the same component. For example, we will have one dashboard to display the metrics for the HTTP protocol and another one for the database connection pool metrics. It is also possible to have dashboards for business metrics, such as the number of form cancellations and validations.
Global dashboards are important to get a synthetic view of production status. This is the first dashboard you will look at when you get a phone call saying production has become slow. The choice of metrics to display must be thought through to keep the overall view as light and ergonomic as possible. A few metrics, such as CPU load, are universal, but more often than not they will have to be adapted to the technology and the way the application works. Web applications with many active users tend to saturate on the number of HTTP requests, so it is important to monitor how many are active. Conversely, a management application with few users but heavy processing is more likely to saturate its number of database connections.
The metrics used in global dashboards will be the first to be plugged into alerts so you don’t have to monitor them all the time.
The second step is to set up an alerting system to be notified in case of problems. Most monitoring tools integrate an alert mechanism, at least by email. If possible, I recommend using the alerting of the tool that hosts your dashboards (e.g. Grafana rather than Prometheus), because linking alerts to their dashboards simplifies their management.
Setting up an alert is delicate and takes time to make reliable. Indeed, it is very disruptive to receive false alerts because the trigger threshold was configured too strictly. I advise using alert thresholds based on sliding time windows. For example, Java applications expose a metric measuring the execution time of the garbage collector (a process that stops all execution to free memory). If it stays high for a long time, it often indicates that memory can no longer be freed and the system will crash with an OutOfMemory error. In this case, it is useful to alert only when the metric fails to drop back below a minimum for a sustained period of time.
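A minimal sketch of such a sustained-threshold alert, applied here to a hypothetical GC pause metric. The threshold and window values are arbitrary; in a real setup this logic would be expressed as an alert rule in Grafana or Prometheus rather than application code:

```python
from collections import deque


class SustainedAlert:
    """Fire only when every sample in the last `window` samples breaches
    the threshold, so a single short spike never triggers a false alert."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value):
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.threshold for v in self.samples)


# GC pause time in ms: the isolated spike at 900 does not alert,
# but the sustained plateau above the threshold does.
alert = SustainedAlert(threshold=200, window=3)
print([alert.observe(v) for v in [50, 900, 60, 500, 600, 700]])
```

The window length is the reliability knob: the longer it is, the fewer false alerts you get, at the cost of being notified later.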
Alerting should not be limited to simply sending an email. Alerts must be taken care of and processed as soon as possible, and it is preferable to formalize the process for handling them and to define the actions to be taken.
The last step is training. The monitoring tools must be easily accessible, at least in read-only mode, to as many people as possible. Devs and ops should then be trained on these tools so that they are autonomous in searching for information and able to interpret it. In this way, they will know how their application behaves and how it interacts with other applications, and they will be able to speed up incident resolution.
Monitoring covers several complex problems under a single term. It is not a buzzword but a real necessity in this era of ever-increasing digitalization. Applications are becoming more and more numerous, a phenomenon accentuated by the move away from monolithic architectures toward distributed services. Without an adequate monitoring solution, the complexity of the system makes it impossible to understand what is going on.
Depending on the degree of maturity of the information system, the implementation of application monitoring solutions can turn out to be a project in its own right. This deployment requires the implementation of an organization around these tools, which, depending on their complexity, will require the training of qualified people. Indeed, monitoring requires a good level of expertise in the monitored technology in order to understand the collected metrics, to be able to select them and to be able to interpret them correctly.
Once the monitoring solution is in place, the person analyzing the dashboards during an incident will always have to ask themselves this question: is this metric in the red the source of the problem, or a consequence of it?
In the third and last article of this series dedicated to application monitoring, we will present a concrete case study along with feedback on the use of these tools.