Nudge APM includes an alerting engine that proactively notifies you of failures.
Alerts can be set up from the Alerts main menu.
An alert is configured with the following information:
Nudge APM provides a wizard that lets you set all of this information.
This screen lets you specify the kind of notification and the metrics to monitor.
Two kinds of notifications are available: e-mail and webhook.
For e-mail notifications, you simply provide the list of recipient addresses.
Webhooks are HTTP messages that the alerting engine can send to third-party services such as Jira, Slack, or an SMS broker.
Webhooks are configured with two or three parameters depending on the case:
Messages are sent with an application/json content type.
Information about the alert, or other related information, can be placed inside the URL or the message. Parameters placed inside the URL are URL-encoded.
The following parameters are available:
- ${notif_desc}: description of the notification
- ${notif_type}: type of the notification (for example, start of alert or end of alert)
- ${notif_start}: start timestamp of the notification
- ${notif_end}: end timestamp of the notification
- ${alert_metric}: monitored metric (response time, number of errors)
- ${alert_threshold}: threshold that triggers the alert
- ${transaction_name}: name of the transaction (depending on the configuration, this information may not be available)
- ${server_name}: name of the service where the threshold has been exceeded (depending on the configuration, this information may not be available)
- ${app_name}: name of the application

Example:
Here are some examples of configurations for specific tools:
POST
https://hooks.slack.com/services/slack-keys
{"channel":"#nudge","username":"webhookbot","text":"Nudge alert: ${notif_desc}"}
POST
https://jirauser:jirapassword@yourdomain.jira.com/rest/api/latest/outcome
{"fields":{"project":{"key":"PROJECT-KEY"},"issuetype":{"name":"Issue"},"description":"${notif_desc}","summary":"${alert_metric} not matching threshold ${alert_threshold}"}}
GET
https://api.primotexto.com/v2/notification/messages/send?ApiKey=YOUR_API_KEY&identifier=YOUR_NUMBER&sender=YOUR_SENDER&message=${notif_desc}
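To illustrate how the placeholders might be expanded, here is a minimal Python sketch. It is not Nudge APM's implementation: the `render` function, the `example.invalid` URL, and the sample values are hypothetical. It only shows the behavior described above, where values placed in the URL are URL-encoded and values placed in the message body are inserted as-is.

```python
import re
from urllib.parse import quote

# Matches placeholders of the form ${name}
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render(template, values, url_encode=False):
    """Replace each ${name} with values[name]; unknown names are left untouched."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)
        value = values[name]
        # Percent-encode every reserved character when the value goes into a URL
        return quote(value, safe="") if url_encode else value

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical alert values, for illustration only
values = {"notif_desc": "Mean resp. time above 2 s", "app_name": "shop"}

url = render("https://example.invalid/send?message=${notif_desc}", values, url_encode=True)
body = render('{"text":"Nudge alert: ${notif_desc}"}', values)

print(url)
print(body)
```

Note that this sketch does not JSON-escape values inserted into the body, so a description containing a double quote would produce invalid JSON. How the alerting engine handles that case is not specified here.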
Choose the metric to monitor:
Entity | Metric | Description |
---|---|---|
Application | Mean resp. time | Mean response time |
Application | Errors | Number or rate of errors |
Application | Satisfaction | Satisfaction index |
Service | Outage | The service has stopped sending data |
Service | JMX | JMX MBean attributes (Java only) |
The Apdex is available in two contexts:
The content of the threshold screen depends on the metric. The reference values used to evaluate the threshold, and the units available, depend on the kind of metric being monitored. For instance, a response time is compared to a duration in seconds, and an error rate to a percentage.
Depending on the nature of the measures, it is sometimes possible to use a threshold based on the evaluation of a trend.
This kind of control lets you detect a degradation of the measures compared with what was measured previously.
The relevance of a trend depends on the seasonality of the measures: if the historical measures are very volatile, this kind of control is likely to raise a lot of alerts.
To compare a measure to its trend, the alerting engine first estimates the trend. It does this by collecting the measures from a comparable earlier period, chosen according to the seasonality (daily, weekly, and so on) defined in the alert configuration.
The trend is then multiplied by a deterioration tolerance coefficient to obtain the threshold of the control.
Example:
Let’s imagine a transaction whose performance follows a weekly seasonality. In that case, the trend will depend on the measures observed around one week earlier.
We set the following parameters:
The threshold for an event observed at 11:52 am on 13/01/2017 is computed from the measures observed on 06/01/2017 between 11:37 am and 12:07 pm, multiplied by 1.5.
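The trend-based threshold can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's actual algorithm: the function names are hypothetical, the trend is assumed to be the mean of the measures in the reference window, the 15-minute half-width is inferred from the example above, and the history values are made up.

```python
from datetime import datetime, timedelta
from statistics import mean


def reference_window(event_time, seasonality, half_width):
    """Return the (start, end) of the comparable period one season earlier."""
    center = event_time - seasonality
    return center - half_width, center + half_width


def trend_threshold(history, event_time, seasonality, half_width, tolerance):
    """Estimate the trend from past measures and scale it by the tolerance.

    `history` is a list of (timestamp, value) pairs. The trend is assumed
    to be the mean of the values inside the reference window.
    """
    start, end = reference_window(event_time, seasonality, half_width)
    samples = [value for timestamp, value in history if start <= timestamp <= end]
    if not samples:
        return None  # no comparable measures, so no threshold can be derived
    return mean(samples) * tolerance


event = datetime(2017, 1, 13, 11, 52)
history = [  # made-up response times in seconds
    (datetime(2017, 1, 6, 11, 40), 1.0),
    (datetime(2017, 1, 6, 11, 55), 2.0),
    (datetime(2017, 1, 6, 12, 5), 3.0),
    (datetime(2017, 1, 6, 13, 0), 9.0),  # outside the window, ignored
]

threshold = trend_threshold(
    history, event, timedelta(weeks=1), timedelta(minutes=15), 1.5
)
print(threshold)  # 3.0: the mean of 1.0, 2.0 and 3.0 is 2.0, times 1.5
```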
This screen lets you choose which element has to be monitored.
As an example, here are some combinations for the response time control:
Define here the period during which the control is active.
Instead of considering each measure in isolation (which could raise many alerts), the alerting engine considers the measures observed in a time frame preceding the event being analysed.
By default, the wizard proposes 5 minutes and 80 %. This means that when the alerting engine checks the measure of an event, it considers all the measures from the previous 5 minutes and raises an alert only if the threshold is exceeded on at least 4 of those 5 minutes.
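The time-frame rule can be sketched in Python as follows. This is a minimal illustration, not the engine's implementation: `should_alert` is a hypothetical name, one measure per minute is assumed, "exceeded" is assumed to mean strictly greater than the threshold, and the sample sequences are made up.

```python
def should_alert(per_minute_values, threshold, window=5, ratio=0.8):
    """Return True if the threshold is exceeded on at least `ratio`
    of the last `window` per-minute measures."""
    recent = per_minute_values[-window:]
    if len(recent) < window:
        return False  # not enough measures yet to fill the time frame
    exceeded = sum(1 for value in recent if value > threshold)
    return exceeded / len(recent) >= ratio


# Threshold of 2, default time frame of 5 minutes and 80 %
print(should_alert([3, 3, 3, 3, 1], threshold=2))  # True: 4 of 5 exceed
print(should_alert([3, 1, 3, 1, 3], threshold=2))  # False: only 3 of 5 exceed
```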
As an example, let’s consider a threshold of 2 on a given metric. This is what will happen for several sequences of measures: