Analysis and feedback in the implementation of application monitoring

The first two articles in our series on application monitoring covered the challenges of monitoring and then the different solutions available.

In this last part, we’re delighted to share our experience of implementing a monitoring solution at a renowned asset management company. The monitoring was deployed on a set of application modules covering everything from portfolio management to the placing of stock market orders.

We will go through the different monitoring needs we encountered and how they were addressed, focusing mainly on the tools deployed. These were chosen for their popularity, their cost and their ability to find the root cause of a problem in as little time as possible.

Context

Migration from monolith to micro-services

The starting point is a monolith written in Java that gathers all the application modules and is replicated on 6 WebLogic servers. Today, this huge monolith is being broken down into functional modules, each deployed on its own Tomcat server hosted in its own virtual machine.
The number of Tomcat servers grows with each production release, so it is essential to have the right tools to control and monitor this fleet of applications.

Context in numbers:

4 agile teams, i.e. approx. 40 people
First production release (monolith): > 15 years ago
Number of WAR files deployed: > 100
Number of virtual machines: > 60
Number of functional modules: > 40

Existing monitoring

For these new Tomcat servers, only basic monitoring needs had been identified: centralization of the application logs with Kibana and Elasticsearch, and infrastructure monitoring of the VMs (CPU, memory, network, disk) with Grafana and Elasticsearch.

As we will see later, this setup gave us a very limited view of application behavior, one that was insufficient for analyzing malfunctions.

What is the health status of my application?

The existing monitoring (i.e. application logs and Virtual Machine metrics) was not sufficient to answer these questions:

Is my server set up correctly in terms of resources?
Does it have enough memory (Xmx)?
Is the sizing of my cache, database connection pool and other thread pools adapted to the needs of the application?
Is any processing on hold due to a lack of resources? etc.

Metrics

We needed to know the health status of the application: was it in great shape, or was it showing signs of fatigue and weakness? To measure that health status, it was essential to collect different metrics. The combination of Prometheus and Grafana was the perfect answer to this need: Prometheus for retrieving and storing the metrics from each application, and Grafana for visualizing them and building dashboards.

In the application code, we added probes to expose these application metrics, for example:

The pause time of the garbage collector,
The number of elements in the Hazelcast near cache,
The number of threads waiting for a database connection,
The number of threads used,
etc.

Most are supplied by Micrometer (micrometer.io). Note that with Spring Boot it is easy to activate the Actuator module to expose most of these essential metrics to Prometheus.
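To make the idea of a probe concrete, here is a minimal sketch that reads two JVM values through the standard `java.lang.management` API and renders them in the Prometheus text exposition format. It is not the Micrometer setup used in the project: the class name `JvmProbes` and the metric names `app_gc_time_seconds_total` and `app_threads_live` are assumptions made for illustration, and the first value is the cumulated collection time reported by the JVM rather than individual pause durations.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hand-rolled probes for illustration; Micrometer supplies equivalents out of the box. */
final class JvmProbes {

    /** Approximate cumulated time spent in garbage collection since JVM start, in seconds. */
    static double gcTimeSeconds() {
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                millis += t;
            }
        }
        return millis / 1000.0;
    }

    /** Number of live threads in the JVM. */
    static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    /** Renders one sample in the Prometheus text exposition format. */
    static String sample(String name, String help, String type, double value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " " + type + "\n"
             + name + " " + value + "\n";
    }

    /** Text that a scrape endpoint could return (metric names are hypothetical). */
    static String scrape() {
        return sample("app_gc_time_seconds_total", "Cumulated GC time.", "counter", gcTimeSeconds())
             + sample("app_threads_live", "Live JVM threads.", "gauge", liveThreads());
    }

    public static void main(String[] args) {
        System.out.print(scrape());
    }
}
```

In practice this text would be served over HTTP on the path that Prometheus scrapes; with Spring Boot and Micrometer that plumbing is provided by the framework.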

Dashboards

The next step was to create dashboards in Grafana to visualize these metrics. A dashboard is a web page that gathers a set of graphs displaying data and information. Grafana offers a set of ready-made dashboards for download. This is a good starting point, but make sure you understand the meaning of each metric displayed, so that analysis goes faster during a production problem. The number of metrics displayed can be impressive and confusing.

Our dashboards are regularly updated to better meet our needs and always with the goal of optimizing analysis time in the event of a problem. Dashboard creation is an iterative process that is specific to each type of application.

Several dashboard content strategies can be adopted. It is impossible to display everything on a single dashboard, so you have to choose how to categorize the information: by application, by type of metric, by server, etc. We opted for one dashboard per category of metrics, for example JVM metrics, with a filter by application and server and one graph per server instance.

That said, it can be useful to have a global dashboard that groups all the important metrics, with one graph per metric covering all the servers, in order to provide a cross-cutting view of all applications. That’s what we did to create an alert dashboard, which we’ll talk about later.

A final recommendation is to avoid gauge-type graphs, which display only the current value of a metric. They look nice but are not very useful. In the vast majority of cases, analyzing a metric requires knowing its history. With a gauge graph it is impossible to answer these questions:

How long has this metric been in the red?
Is this the first time we have had such a high value? Is this normal? etc…

Past values are essential to understanding the current value of a metric. For example, is a peak of 100 HTTP requests per second a malfunction or normal behavior? The history helps answer this question.
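The two questions above can be answered mechanically once the history is available, which is exactly what a gauge hides. The following sketch works on an in-memory list of samples; the class `MetricHistory`, its methods and the sample data are invented for illustration and are not part of Grafana or Prometheus.

```java
import java.util.List;

final class MetricHistory {

    /** One measurement: epoch seconds and the value observed at that instant. */
    record Sample(long epochSecond, double value) {}

    /**
     * Seconds for which the metric has been continuously above the threshold,
     * measured up to the most recent sample. Returns 0 if the latest sample is
     * not above it. Samples must be ordered from oldest to newest.
     */
    static long secondsInRed(List<Sample> history, double threshold) {
        if (history.isEmpty()) {
            return 0;
        }
        long latest = history.get(history.size() - 1).epochSecond();
        long since = -1;
        for (int i = history.size() - 1; i >= 0; i--) {
            Sample s = history.get(i);
            if (s.value() <= threshold) {
                break; // the current streak started after this sample
            }
            since = s.epochSecond();
        }
        return since < 0 ? 0 : latest - since;
    }

    /** True when the latest value is strictly higher than every earlier value. */
    static boolean isNewRecord(List<Sample> history) {
        if (history.size() < 2) {
            return false;
        }
        double latest = history.get(history.size() - 1).value();
        for (int i = 0; i < history.size() - 1; i++) {
            if (history.get(i).value() >= latest) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up HTTP requests per second, one sample per minute.
        List<Sample> requestsPerSecond = List.of(
                new Sample(0, 40), new Sample(60, 95), new Sample(120, 30),
                new Sample(180, 85), new Sample(240, 100));
        System.out.println("seconds in red: " + secondsInRed(requestsPerSecond, 80));
        System.out.println("new record: " + isNewRecord(requestsPerSecond));
    }
}
```

A time-series graph lets a person read both answers at a glance; the code only spells out what the eye does with the curve.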

At this stage, we were able to measure the state of health of our servers and adjust resources as required. For example, we could easily see that a server was slowed down by long garbage collection pauses caused by a heap that was too small. On the other hand, we had no precise information on what was consuming the memory. For this, we needed another tool closer to the code: an APM (Application Performance Monitoring) tool.

Need a scanner to understand the problems

In medicine, measuring pulse, blood pressure or temperature is often not enough to know what a patient is suffering from. Doctors therefore use other tools such as CT, MRI or X-rays to look inside the body and diagnose a disease, fracture or abnormality. As you will have understood, the approach is exactly the same for our applications. We need to know what’s going on inside when something goes wrong.

An APM (Application Performance Monitoring) tool allows us to profile the application code during execution. Concretely, it instruments the application code to monitor it, so that we know the time spent in each method. We chose the open source APM Glowroot. It holds its own against the major players in the market: it is simple to use, reports relevant information and has a very low impact on performance. A set of pre-installed plugins, such as those for Hibernate and Spring, makes it possible to locate the source of a problem very quickly. Unlike tools such as JProfiler, which are activated on demand, APMs are intended to be permanently active. It is therefore easy to consult them after a problem has occurred.

Most APMs, such as Glowroot, allow instrumentation to be added to the code dynamically, without restarting the server. We can therefore specify the areas of code to be monitored and enrich the extracted information with method parameters, return values, durations, number of calls, etc.
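To give an intuition of what "instrumenting a method" means, the sketch below wraps an interface in a standard `java.lang.reflect.Proxy` that records the number of calls and the cumulated duration per method. This is a deliberately simplified stand-in: APM agents generally work on the bytecode and need no change to the application, whereas this example must be wired in by hand. The names `TimingProxy`, `Stats` and `Valuation` are hypothetical.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class TimingProxy {

    /** Call count and cumulated duration for one method. */
    static final class Stats {
        final LongAdder calls = new LongAdder();
        final LongAdder nanos = new LongAdder();
    }

    /** Hypothetical business interface, used only for the demonstration. */
    interface Valuation {
        double value(String portfolioId);
    }

    /** Wraps target so that every call made through the interface is counted and timed. */
    @SuppressWarnings("unchecked")
    static <T> T instrument(Class<T> type, T target, Map<String, Stats> registry) {
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type},
                (proxy, method, methodArgs) -> {
                    long start = System.nanoTime();
                    try {
                        return method.invoke(target, methodArgs);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the original exception unchanged
                    } finally {
                        // Recorded whether the call succeeded or failed.
                        Stats s = registry.computeIfAbsent(method.getName(), k -> new Stats());
                        s.calls.increment();
                        s.nanos.add(System.nanoTime() - start);
                    }
                });
    }

    public static void main(String[] args) {
        Map<String, Stats> registry = new ConcurrentHashMap<>();
        Valuation real = id -> id.length() * 100.0; // stand-in for real work
        Valuation monitored = instrument(Valuation.class, real, registry);
        monitored.value("PF-1");
        monitored.value("PF-22");
        Stats s = registry.get("value");
        System.out.println("value: " + s.calls.sum() + " calls, " + s.nanos.sum() + " ns");
    }
}
```

The same hook is where an APM would also capture method parameters or return values when asked to.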

Once this step was done, however, we still had a problem to solve: Glowroot did not offer a cross-cutting view of all applications, only a view centered on one application. It was therefore difficult to follow a process involving several microservices and to determine easily in which microservice the time was spent.

Microservices: how do we get a cross-cutting view?

As explained at the beginning of this article, we were in the process of migrating from a monolith to a multitude of microservices. A user-triggered operation cascades through many microservices, in some cases in parallel to speed up processing. To trace these calls, we use “tracing” type monitoring tools.

This type of monitoring is still in its infancy, and standards are being established with OpenTelemetry. The objective is to trace processing whatever the technology (Java, JavaScript, .NET, …) and the protocol used (HTTP, AMQP, …). We chose Jaeger because of its popularity. In our case, the code had to be adapted to ensure that the trace context is propagated. Moreover, for critical processing such as portfolio valuation, we enriched the traces to know the time spent in each business step.
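Context propagation comes down to this: the caller writes the trace identity into the outgoing message, and the callee reads it back so that its own work is attached to the same trace. The sketch below shows the principle with a plain map standing in for HTTP or AMQP headers. It assumes the layout of the W3C Trace Context `traceparent` header; the article does not say which header format the project used, and in a real service the tracing client library does this work. The class `TracePropagation` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class TracePropagation {

    /** Header name assumed here: the W3C Trace Context one. */
    static final String HEADER = "traceparent";

    /** Identity of one span: its trace, its own id and the id of the span that caused it. */
    record SpanContext(String traceId, String spanId, String parentSpanId) {

        /** Starts a new trace, as the first service handling a user request would. */
        static SpanContext newTrace() {
            return new SpanContext(randomHex(32), randomHex(16), null);
        }
    }

    /** Caller side: writes the context into the outgoing headers. */
    static void inject(SpanContext ctx, Map<String, String> headers) {
        // version "00", then trace id, span id, and flags "01" (sampled)
        headers.put(HEADER, "00-" + ctx.traceId() + "-" + ctx.spanId() + "-01");
    }

    /** Callee side: continues the caller's trace, or starts a new one if no valid header was sent. */
    static SpanContext extract(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) {
            return SpanContext.newTrace();
        }
        String[] parts = value.split("-");
        if (parts.length != 4) {
            return SpanContext.newTrace();
        }
        return new SpanContext(parts[1], randomHex(16), parts[2]);
    }

    /** Random lowercase hexadecimal string of the given length. */
    static String randomHex(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(Character.forDigit(ThreadLocalRandom.current().nextInt(16), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SpanContext upstream = SpanContext.newTrace();
        Map<String, String> headers = new HashMap<>();
        inject(upstream, headers);                  // service A, just before the call
        SpanContext downstream = extract(headers);  // service B, on receiving the message
        System.out.println("same trace: " + upstream.traceId().equals(downstream.traceId()));
    }
}
```

Adapting the code, as mentioned above, typically means making sure every outgoing call and every message handler goes through an inject or extract step of this kind, including across thread pools.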

Jaeger allowed us to know precisely the time spent in each service, to visualize it and extract the critical path, as well as to count the number of calls to another service.
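The kind of summary described here can be illustrated with a few lines of code. The sketch below takes the finished spans of one trace and totals, per service, the number of calls and the cumulated duration. It is not Jaeger's API: the types `Span` and `ServiceTotal`, the service names and the timings are all invented, and the cumulated duration of a service includes the time it spent waiting for the services it called.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TraceSummary {

    /** A finished span: which service did the work, and when (milliseconds). */
    record Span(String service, long startMillis, long endMillis) {
        long durationMillis() {
            return endMillis - startMillis;
        }
    }

    /** Number of calls and cumulated time for one service. */
    record ServiceTotal(long calls, long totalMillis) {
        ServiceTotal plus(Span span) {
            return new ServiceTotal(calls + 1, totalMillis + span.durationMillis());
        }
    }

    /** Groups the spans of a trace by service, sorted by service name. */
    static Map<String, ServiceTotal> byService(List<Span> spans) {
        Map<String, ServiceTotal> totals = new TreeMap<>();
        for (Span span : spans) {
            totals.merge(span.service(), new ServiceTotal(1, span.durationMillis()),
                    (existing, ignored) -> existing.plus(span));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical trace of one portfolio valuation request.
        List<Span> trace = List.of(
                new Span("valuation", 0, 500),
                new Span("pricing", 10, 40),
                new Span("pricing", 50, 90),
                new Span("referential", 100, 400));
        byService(trace).forEach((service, total) ->
                System.out.println(service + ": " + total.calls() + " call(s), "
                        + total.totalMillis() + " ms"));
    }
}
```

A service that appears with an unexpectedly high call count in such a summary is usually the first place to look.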

This type of tool is decisive when migrating from a monolith to microservices. In a monolith, processing that triggers 1000 calls to a calculation method goes unnoticed. If this calculation method is migrated to a microservice, however, we will have 1000 HTTP calls, and we will have to review the code to maintain the initial performance and avoid overloading the microservice.

At this stage, we had a nice panel of monitoring tools ready to investigate the slightest incident. But we still had to know when an incident occurred. One solution is to scan the dashboards regularly, for example by putting a screen in the open-plan office to keep an eye on them at all times. Another, less restrictive solution is to configure alerts so that we are notified of a malfunction.

Alerts

Setting up alerts is not as easy as it seems.

Avoid false positives

An alert is triggered by a rule. We will not go into detail on best practices for writing this rule, but we must keep in mind that false alerts should always be avoided: otherwise people quickly lose confidence in the alert’s relevance and come to perceive it as spam. The alert rule must be adjusted regularly to avoid false positives.
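One common way to reduce false positives, given here as an illustration and not as the rule used in the project, is to fire only when the threshold has been exceeded for several consecutive evaluations, so that a single spike does not page anyone. Prometheus alerting rules offer a comparable mechanism through their `for` clause. The class `SustainedThresholdRule` and the values below are hypothetical.

```java
final class SustainedThresholdRule {

    private final double threshold;
    private final int requiredConsecutive;
    private int consecutive; // evaluations in a row above the threshold

    SustainedThresholdRule(double threshold, int requiredConsecutive) {
        if (requiredConsecutive < 1) {
            throw new IllegalArgumentException("requiredConsecutive must be at least 1");
        }
        this.threshold = threshold;
        this.requiredConsecutive = requiredConsecutive;
    }

    /** Feeds one evaluation of the metric; returns true while the alert should be firing. */
    boolean evaluate(double value) {
        consecutive = value > threshold ? consecutive + 1 : 0;
        return consecutive >= requiredConsecutive;
    }

    public static void main(String[] args) {
        // Made-up heap usage ratios, one evaluation per minute; alert above 90% for 3 minutes.
        SustainedThresholdRule rule = new SustainedThresholdRule(0.90, 3);
        double[] heapUsage = {0.95, 0.50, 0.92, 0.93, 0.97};
        for (double value : heapUsage) {
            System.out.println(value + " -> firing: " + rule.evaluate(value));
        }
    }
}
```

The trade-off is a delayed notification: the longer the required streak, the fewer false alerts, but the later a real incident is reported.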

Who should receive the alert?

Those who need to receive the alert are first and foremost those who are able to take action to restore the situation. It is also relevant to add the people who wish to be informed of the status of the system.

What should we do?

Triggering an alert is only useful if you know how it should be handled. A clear process must be defined and formalized. At the very least, I advise stating in the alert message the consequences of the alert and the possible actions to take.

Tools

Alerts are mainly based on metrics. Both Prometheus and Grafana are able to trigger alerts. However, I prefer Grafana because its notifications can natively include a link to the dashboard for the metric that is in the red.
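As a concrete picture of an actionable alert, the sketch below assembles a notification that carries the consequence, the actions to take and a dashboard link. It only illustrates the content recommended in this section: it does not reproduce Grafana's notification format, and the class `AlertMessage`, the example texts and the URL are invented.

```java
import java.util.List;

final class AlertMessage {

    /** What the recipient needs to know in order to act. */
    record Alert(String summary, String consequence, List<String> actions, String dashboardUrl) {}

    /** Renders the alert as plain text suitable for an e-mail or chat notification. */
    static String render(Alert alert) {
        StringBuilder sb = new StringBuilder();
        sb.append("ALERT: ").append(alert.summary()).append('\n');
        sb.append("Consequence: ").append(alert.consequence()).append('\n');
        sb.append("Actions:\n");
        for (int i = 0; i < alert.actions().size(); i++) {
            sb.append("  ").append(i + 1).append(". ").append(alert.actions().get(i)).append('\n');
        }
        sb.append("Dashboard: ").append(alert.dashboardUrl()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // All values below are hypothetical examples.
        Alert alert = new Alert(
                "Threads waiting for a database connection on the valuation server",
                "Portfolio valuation requests are delayed",
                List.of("Check the connection pool graph", "Look for long-running queries"),
                "https://grafana.example.com/d/db-pool");
        System.out.print(render(alert));
    }
}
```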

Training

These monitoring tools are intended for use by devs and ops. They must have access to these tools and of course understand them. To do so, it is essential that they become familiar with them and train on them. We have set up short training sessions of about 30 min once a week for 8 weeks.

Conclusion

In this project, we were able to get to grips with a set of tools from the open source world. They proved to be particularly reliable and robust. Only a few very minor bugs were encountered in Grafana.

Today, they are a precious help in understanding the behavior of our applications, anticipating problems and, of course, analyzing incidents. Unfortunately, many companies have not yet taken the step of implementing application monitoring and limit themselves to infrastructure monitoring. Without application monitoring, problem resolution proves complicated and risky. In the worst case, it is sometimes impossible.

The choice of tool is also strongly constrained by the budget. Some tools can quickly become very expensive depending on their use. Another criterion is whether you want, or are able, to export these metrics and logs to the cloud. Some monitoring solutions, such as Datadog, are purely cloud-based, so no on-premises deployment is possible.

It is important to know that the world of monitoring tools is expanding rapidly. Major players are redoubling their investment in their tools to respond to this promising market. One of the battlefields is the introduction of Artificial Intelligence to facilitate the search for the cause of an incident or to trigger an alert. To be continued…

On September 24, 2019, Datadog successfully completed its IPO on the NASDAQ. The share was priced at $27 and ended its first day at $37.55, allowing the company to raise $648 million. By knock-on effect, on the same day, Elastic’s share price increased by 2.4%, Dynatrace’s by 2.5%, New Relic’s by 4.5% and Splunk’s by 5.6%.

Author

Jean-Philippe Laurent
