Category: DevOps

Introduction to Alerta: Open Source Aggregated Alerts

There are a number of platforms available these days to help operations teams deal with alerts, namely PagerDuty, VictorOps and OpsGenie. Unfortunately, these are all paid-for tools.

These tools are known as monitoring (or alert) aggregation platforms.

I was looking through the integrations for Elastalert and found that there is one for alerta.io, so I checked out the website and it seemed to tick all the boxes of monitoring aggregation.
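
Under the hood, integrations like that just push events into Alerta's REST API. Here is a rough sketch of what that looks like with Python and requests – the endpoint, API key and field values are placeholders for illustration, not from a real setup:

import requests

# Placeholders: point these at your own Alerta instance and API key
ALERTA_API = "http://localhost:8080/api"
API_KEY = "changeme"

alert = {
    "resource": "web01",
    "event": "HighMemoryUsage",
    "environment": "Production",
    "severity": "major",
    "service": ["Web"],
    "text": "Memory usage above 90% for 5 minutes",
}

resp = requests.post(
    f"{ALERTA_API}/alert",
    json=alert,
    headers={"Authorization": f"Key {API_KEY}"},
)
resp.raise_for_status()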

I used the docker-compose way of setting it up quickly, but if you want to set it up properly then follow the alerta.io deployment guide.

Update some config:


docker exec -u root -it alerta_web_1 /bin/bash
apt update
apt install vim
# Edit the config in /app/alertad.conf
# Restart the container

Add the housekeeping cron job:


echo "* * * * * root /venv/bin/alerta housekeeping" >/etc/cron.daily/alerta

The default timeout period for an alert is 86400 seconds, or one day.
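
If you want to change that, it is set in the server config (alertad.conf is just a Python settings file). A minimal example, assuming the setting name ALERT_TIMEOUT from the Alerta server docs:

# /app/alertad.conf - plain Python settings
ALERT_TIMEOUT = 3600  # expire alerts after one hour instead of the default 86400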

Check out the alerta plugins

Which popular alerting and monitoring tools does alerta.io integrate with? Quite a few – among others Prometheus, Zabbix, Nagios, Sensu, Riemann and CloudWatch, plus webhook-style integrations such as Elastalert.

Reducing and Learning from Monitoring Alerts in Business Environments

How often do monitoring alerts and notifications get out of hand in an organisation? The alerts become too many, arrive via a single channel, treat minor and major severities in the same manner, and generally rob engineers of the time they could spend improving and fixing things, because they constantly have to check these alerts.

Ideally we want alerts to be relevant: things that need to be fixed in a short time frame. Other (non-critical) alerts may still be useful, but should be reviewed in aggregate over a longer period. That is how I see it at least…

The key things to get right in my opinion:

  • Relevancy: disregard alerts that are not relevant right now
  • Channels: send critical alerts to instant messaging / phone calls and non-critical ones to email / an analytics platform
  • Structured data – alerts should be as structured as possible so that you can build specific criteria and rules on top of them. If the data you receive is unstructured text (like an email body) you won’t have a good way of classifying alerts or remediating from them (see the sketch after this list).
  • Let the various departments own their rules / criteria – the people running these systems are the ones who should receive the alerts and manage the channel, severity etc.
  • Machine learning?
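
To make the structured data point concrete, here is the difference between an alert you can build rules on and one you cannot (the field names are made up for illustration):

# An unstructured alert: all you can realistically do is string-match on it
unstructured = "WARNING!! something wrong with db01, please check ASAP"

# A structured alert: every attribute can drive routing rules, severity
# mappings, deduplication and later analytics
structured = {
    "resource": "db01",
    "event": "DiskUsageHigh",
    "severity": "major",        # drives the channel: page vs email
    "service": ["Billing"],     # drives ownership: which team gets it
    "value": "93%",
    "text": "Disk usage above 90% on /var/lib/mysql",
}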

A Note on Machine Learning

Naturally we want this all done for us at the click of a button, but it is not that easy. Some people will just shout that machine learning or AI will handle it, without the slightest idea of what that entails.

Leveraging machine learning (specifically supervised learning) is, I think, the way to go. This way you train the machine to identify critical / relevant messages – with a human in the loop. Much like how Google uses reCAPTCHA to have humans label bus stops and shop entrances, or transcribe books: making structured data from unstructured…

My thought was that having a user in each relevant department assign a severity level (1 – 5) to each alert for a while would give a machine learning algorithm the labelled data it needs to separate important from unimportant messages.
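
A very rough sketch of that idea, assuming scikit-learn and a handful of human-labelled alerts (the example alerts and labels below are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled history: alert text -> human-assigned severity (1-5)
alerts = [
    "disk usage 95% on db01",
    "login page latency above 2s",
    "weekly backup completed",
]
severities = [5, 4, 1]

# Turn the text into features and fit a simple classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(alerts, severities)

# Score a new alert - with real volumes of labelled data this starts to
# resemble the "train it with a human" approach described above
print(model.predict(["disk usage 97% on web03"]))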

I did a bit of research, but there is no off-the-shelf solution for this yet; it will probably need some more time or a custom solution…

What can you do now to control your alerts?

You need to keep all your alerts first, store them so that you can run analytics on them later for more insight.

Store all the things…

elastalert-pipeline

Picture taken from: https://engineeringblog.yelp.com/2016/03/elastalert-part-two.html

You also want the ability to add rules / criteria easily to the alerts coming in, and you want this to be easy enough for non-developers to create and manage them.

If you look at the image above, you want to collect all the data you can (so prefer to get it directly from the source system, rather than only what the monitoring system controlling the alerts gives you).

So, bottom line: use Elasticsearch. That is the ELK stack (you could also try the TICK-L stack). We just need to figure out what we want to use to interact with it in terms of creating rules, criteria and possibly machine learning.
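
As a rough sketch of the "store all the things" part, here is what pushing a raw alert document into Elasticsearch with the official Python client could look like (the index name and fields are made up; adjust doc_type to suit your cluster version):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Keep the raw alert with as much structure as possible so it can be
# queried, graphed and fed to rules (or machine learning) later
alert_doc = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "source": "metricbeat",
    "host": "st2.fixes.co.za",
    "metric": "system.memory.used.pct",
    "value": 0.94,
    "severity": "major",
}
es.index(index="alerts-raw", doc_type="_doc", body=alert_doc)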


The Proof however is in the Tasting

Let's try out the various options… first ensure you have an ELK stack instance (you can check this DigitalOcean tutorial). Ensure you are getting data in; try one of the various Beats to monitor your system.

I set it up and now I have data:

metricbeat-dashboard-not-great
As you can see Metricbeat probably isn’t as robust and reliable as something like newrelic

I then set up elastalert…

Elastalert

Elastalert supports only Python 2.7 which, as we know, goes out of support in 2020.

It is a bit tricky to set up as well – not super tricky, but tricky. Creating rules is also not trivial: you need to know the different rule types and the parameters they accept, and the rules need to be tested. All these parameters are configured in YAML, which developers seem to think non-technical people, or even relatively technical people, can use. The truth is YAML is tough, and an HTML form with dropdowns and validation is usually better.

You pretty much have to look at the examples to try and create a rule; being able to query the Elasticsearch index directly also helps.

I created a test rule after 30 minutes: rules/elasticsearch_memory_high.yaml


name: Metricbeat Elasticsearch Memory High Rule
type: metric_aggregation

es_host: localhost
es_port: 9200

index: metricbeat-*

# how far back each query looks
buffer_time:
  hours: 1

# the field to aggregate and how to aggregate it
metric_agg_key: system.memory.used.pct
metric_agg_type: avg
# aggregate per host, so each host is evaluated (and alerted on) separately
query_key: beat.hostname
doc_type: doc

# the size of each aggregation bucket within the query window
bucket_interval:
  minutes: 5

sync_bucket_interval: true
allow_buffer_time_overlap: true
# query using the run_every interval rather than buffer_time
use_run_every_query_size: true

# alert when the average falls outside this range
min_threshold: 0.1
max_threshold: 0.9

filter:
- term:
    metricset.name: memory

alert:
- "debug"

Then I test the rule with:


elastalert-test-rule rules/elasticsearch_memory_high.yaml

We can see the matches in the stdout:


INFO:elastalert:Alert for Metricbeat Elasticsearch Memory High Rule, st2.fixes.co.za at 2019-05-28T06:40:00Z:
INFO:elastalert:Metricbeat Elasticsearch Memory High Rule

Threshold violation, avg:system.memory.used.pct 0.942444444444 (min: 0.1 max : 0.9) 

@timestamp: 2019-05-28T06:40:00Z
beat.hostname: st2.fixes.co.za
metric_system.memory.used.pct_avg: 0.942444444444
num_hits: 1296
num_matches: 39

INFO:elastalert:Ignoring match for silenced rule Metricbeat Elasticsearch Memory High Rule.st2.fixes.co.za
INFO:elastalert:Ignoring match for silenced rule Metricbeat Elasticsearch Memory High Rule.st2.fixes.co.za

Working

And bang! I got it working with Telegram.

I just updated my config and ran it with:

python -m elastalert.elastalert --config config.yaml --verbose --rule rules/elasticsearch_memory_high.yaml

 

telegram-elastalert

The only problem was it was sending this alert every minute.

From a Stack Overflow question it seemed the answer was the realert option, which suppresses repeat alerts for the same rule (and query_key) for a set period – e.g. realert: minutes: 30. We don’t want the alerts to become spam – that is why we are doing all of this in the first place.

It is very important to understand the following terms:

bucket_interval, buffer_time, use_run_every_query_size and realert.

The next thing you need is to run it as a service via systemd or supervisord, but I will skip this part.

I want to try the other options.

Elastalert Kibana Plugin

After setting up Elastalert I realised that creating rules via YAML will be impossible for non-technical people who struggle to read and apply docs. It took me about an hour, plus debugging, to figure out a single rule.

So we need a frontend that makes it easy for people to figure out and set rules for systems that they manage.

For that purpose, while still using Elastalert, we can use the two frontends available – the Elastalert Kibana Plugin and Praeco. Both are in active development, but Praeco is still in a pre-release phase.

To make use of these frontends you need an API, which vanilla Elastalert from Yelp apparently does not have. So to use these frontends we need the BitSensor Elastalert fork.

BitSensor Elastalert is set up with Docker according to their documentation.

I’m no Docker expert, but I managed to sort it out using the following steps.

Install Docker

The instructions on the docker docs site are good.

Install Bitsensor Elastalert API

The instructions on the BitSensor Elastalert site did not work perfectly for me; here is what I did:


# Ran the recommended way
docker run -d -p 3030:3030 -p 3333:3333 \
    -v `pwd`/config/elastalert.yaml:/opt/elastalert/config.yaml \
    -v `pwd`/config/elastalert-test.yaml:/opt/elastalert/config-test.yaml \
    -v `pwd`/config/config.json:/opt/elastalert-server/config/config.json \
    -v `pwd`/rules:/opt/elastalert/rules \
    -v `pwd`/rule_templates:/opt/elastalert/rule_templates \
    --net="host" \
    --name elastalert bitsensor/elastalert:latest

# That created the container but it exited prematurely; this did the same thing
docker run -d -p 3030:3030 -p 3333:3333 bitsensor/elastalert:latest

# It kept exiting, so I checked the logs and saw that it could not reach
# elasticsearch running on the host (not in the container).
# The container needs access to the host's network, which is done with
# https://docs.docker.com/network/host/
# (note that docker options such as --network must come before the image name)

docker run -d -p 3030:3030 -p 3333:3333 --network host bitsensor/elastalert:latest

It still could not connect to the host using 127.0.0.1:9200, which I assume means that IP points to the container and not to the host. Debugging this is difficult though – damn, I don’t want this to turn into a Docker post. The solution looks to be connecting to the host using host.docker.internal – but there is a caveat: it only works on Mac and Windows, not on Linux. In other words, not in production. Whoops.

Ah, I messed it up – just run the command they give and you will get a relevant error. I have Elasticsearch version 6.8.0, so the latest Elastalert will not work as it uses the elasticsearch Python package for 7.0.0.

This was the error and this is the issue on github:


09:15:04.176Z ERROR elastalert-server:
    ProcessController:      return func(*args, params=params, **kwargs)
    TypeError: search() got an unexpected keyword argument 'doc_type'

To fix that you need to build the image yourself with:

make build v=v0.1.39

But that fails with:


step 24/29 : COPY rule_templates/ /opt/elastalert/rule_templates
failed to export image: failed to create image: failed to get layer sha256:66d9b1e58ace9286d78c56116c50f7195e40bfe4603ca82d473543c7fc9b901a: layer does not exist

This was fixed by simply running the build again. Alas, another issue: the Yelp requirements file for that version did not pin the elasticsearch package version, so I had to force it with this:


RUN sed -i 's/jira>=1.0.10/jira>=1.0.10,<1.0.15/g' setup.py && \
    python setup.py install && \
    pip install elasticsearch==6.3.1 && \
    pip install -r requirements.txt

Boom, so first step done. Next step is getting the elastalert kibana frontend plugin working:

To install it, go to the Kibana directory: cd /usr/share/kibana/

and then:


sudo ./bin/kibana-plugin install https://github.com/bitsensor/elastalert-kibana-plugin/releases/download/1.0.3/elastalert-kibana-plugin-1.0.3-6.7.2.zip

The only problem was that the version I was using, 6.8.0, was not supported, so I am going to try Praeco.

Praeco

This damn thing doesn’t use plain docker, it uses docker-compose – a different thing, an orchestrator for Docker containers – which can be installed and used by following the docker-compose install docs.

Pull the repo then do:


# set PRAECO_ELASTICSEARCH to the IP/hostname of your Elasticsearch instance
export PRAECO_ELASTICSEARCH=
docker-compose up

The thing about Praeco is that it includes both the BitSensor Elastalert API and the Praeco frontend… damn, I wasted so much time setting that up manually. Also, it runs on port 8080, so make sure that port isn’t already in use on the host.

I wasn’t able to fix the issue of the Docker containers not being able to connect to the localhost Elasticsearch – read the troubleshooting guide for more info. I did get it working against a remote Elasticsearch.

Wow this thing is amazing…

Praeco is awesome, interactive and can be very powerful. It is still in development and has one or two bugs, but overall it is excellent.

praeco-alert-rules-engine

The rules are limited compared to raw YAML-based Elastalert, but other than that it is an excellent and useful frontend.

Sentinl

Sentinl is more native to Kibana, in that it plugs right in, much like the existing X-Pack plugin.

Check your Elasticsearch (or Kibana) version:


http :9200

or:


sudo /usr/share/kibana/bin/kibana --version

Again the issue of compatibility rears its ugly head: my version, 6.8.0, does not have a corresponding Sentinl release – only 6.6.0 is there. Perhaps a tactic by Elastic?

Damn, it also looks like Sentinl will only be supporting Siren going forward:

Dear all, with the launch of Open Distro we feel the needs of the Kibana community are sufficiently served and as such we are focusing Sentinl on the needs of the Siren platform only.

Siren.io is an investigative intelligence platform

Even attempting to install it fails:


Attempting to transfer from https://github.com/sirensolutions/sentinl/releases/download/tag-6.6.0-0/sentinl-v6.6.1.zip
Transferring 28084765 bytes....................
Transfer complete
Retrieving metadata from plugin archive
Extracting plugin archive
Extraction complete
Plugin installation was unsuccessful due to error "Plugin sentinl [6.6.1] is incompatible with Kibana [6.8.0]"

So that is the end of that…

411

Ah the PHP and apache crew…looks ok but not something I want to look at right now. Elastalert is my guy.

 

Is Telegram good for ChatOps and DevOps?

I’ve been using Telegram for some ChatOps-related activities and started integrating it with things like Hubot and StackStorm ChatOps to figure out whether it works seamlessly. This may also end up being a Telegram vs Slack vs <insert chat client here> comparison if I get time.

telegram-app-for-chatops

I’ve picked up on certain things that make life difficult with Telegram:

  • HTML is shown raw rather than rendered
  • Showing graphs and charts is awkward
  • Messages from a bot are not formatted any differently from those of a real person

Unicode and UTF-8

Importantly, the Telegram Bot API expects UTF-8 encoded text on the wire, while a normal Python 3 str is Unicode (python-telegram-bot handles the encoding for you, but it is easy to get confused here)… nothing about this feels clear at first.

There is a webpage listing the Telegram icons/emoji, so I used Python to try to send one in a message, and it still was not working as expected… encodings…


# These produce the actual rocket emoji:
chop = b'\xF0\x9F\x9A\x80'.decode(encoding='utf-8')  # decode the UTF-8 bytes
escape_icon = '\U0001F680'                           # \U escape for code point U+1F680
unicode_icon = '🚀'                                  # pasting the emoji literally also works
# These do not: 'U+1F680' is just a plain 7-character string, and encoding a str
# of \xF0\x9F\x9A\x80 characters round-trips mojibake, not the emoji
unicode_decoded = 'U+1F680'
utf8_str_icon = bytes('\xF0\x9F\x9A\x80', 'utf-8').decode()
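
Once you have the emoji as a normal Python 3 str, sending it is straightforward. A small sketch assuming the pre-v20 (synchronous) python-telegram-bot API – the token and chat id are placeholders:

import telegram

bot = telegram.Bot(token='123456:REPLACE-ME')           # placeholder bot token
bot.send_message(chat_id='@my_alerts_channel',          # placeholder chat/channel
                 text='\U0001F680 Deployment finished')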

So I think I will try to use a ChatOps framework and let it integrate with Telegram somehow, as I don’t want to be dealing with this stuff myself.

Errbot and Telegram

I’ve been using StackStorm and asked the community what options there were for integrating with Telegram, and I got a response from the err-stackstorm maintainer.

Installing errbot:


cd ~
python3 -m venv errbot-env
source errbot-env/bin/activate
pip install errbot

Ok, now we need a bit extra for the telegram backend configuration:

pip install python-telegram-bot

Then a bit on configuration, in config.py:


BACKEND = 'Telegram'

BOT_IDENTITY = {
    'token': '8043xxxx:xxxxxxxxxxxxxxxxxxxxx4',
}

CHATROOM_PRESENCE = ()

BOT_PREFIX = '/'

Boom… everything is working well. Run your bot and send some commands to it via Telegram. The only thing that is not great is the formatting, which you can test with the command:

/render test

It isn’t beautifully formatted:

errbot-telegram-render-test

Errbot and Slack

I was disappointed with the formatting, so I decided to try it with Slack.

Check out the slack configuration

Be careful though, the latest slackclient has major breaking changes, so make sure to pin it:

pip install slackclient==1.3.1

Running the same thing on Slack was much better in terms of markdown formatting – bold, italics, tables, showing images, displaying code and links, etc.

slack-errbot-render-test-1
slack-errbot-render-test-2

There are other clients I want to try out in the future – Rocket.Chat, Microsoft Teams, Cisco Spark and Mattermost – but I’ll try them when they cross my path.

Connecting Stackstorm to Errbot

So now let us connect Errbot to StackStorm ChatOps.

I am trying the err-stackstorm plugin. It needs to be configured, and then Errbot’s webhook support needs to be enabled.

This is done by sending the following command to the bot:

/plugin config Webserver {'HOST': '0.0.0.0', 'PORT': 3141, 'SSL': {'enabled': False, 'host': '0.0.0.0', 'port': 3142, 'certificate': '', 'key': ''}}

By speaking with the maintainer I managed to get it working, but it wasn’t easy. Errbot also logged a lot of stuff and seemed like it was constantly restarting. Furthermore, it had bugs – for example, it would repeat the output from a command 100 times.

It also wasn’t stable; it would just error out:

errbot-stackstorm-errors

Just to show you it did work:

stackstorm-chatops-telegram

Using Hubot with Telegram

from the docs:

If you installed StackStorm following the install docs, the st2chatops package will take care of almost everything for you. Hubot with the necessary adapters is already installed

It seems that stackstorm prefers hubot.

The first step is to use the hubot-telegram adapter. Since the StackStorm Hubot package doesn’t support Telegram by default, follow the docs on how to add an external adapter:


cd /opt/stackstorm/chatops
sudo npm install --save hubot-telegram

Then modify the chatops configuration in /opt/stackstorm/chatops/st2chatops.env:


export HUBOT_ADAPTER=telegram
export HUBOT_TELEGRAM_TOKEN="xxx"
export HUBOT_TELEGRAM_WEBHOOK=""
export HUBOT_TELEGRAM_INTERVAL=5000

Reload the config and restart chatops:


sudo systemctl restart st2chatops
sudo st2ctl reload

Make sure in the output that chatops is running, not like this:


st2chatops is not running.

So check what is going on with:


sudo journalctl --unit=st2chatops

# The error
Error: The environment variable "TELEGRAM_TOKEN" is required.

So I added the environment variable without the HUBOT prefix (export TELEGRAM_TOKEN="xxx") to st2chatops.env.

That worked, but I still had a weird error in the logs:


ERROR Error: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

This error is related to the polling interval, which needs to be adjusted – I set HUBOT_TELEGRAM_INTERVAL to 3500.

I still wasn’t getting a response on telegram so I did a check:


[cent@st2 ~]$ bash self-check.sh 

Starting the Hubot Self-Check Program
===============================================

Step 1: Hubot is running.
Step 2: Hubot-stackstorm is installed (0.9.3).
Step 3: StackStorm has aliases that are registered and enabled.
Step 4: Chatops.notify rule is present.
Step 5: Chatops.notify rule is enabled.
Step 6: Hubot responds to the "help" command.
Step 7: Hubot loads commands from StackStorm.
Step 8 failed: chatops.post_message doesn't work.

I manually did a post_message and that worked. I ran the script again and:


[cent@st2 ~]$ bash self-check.sh 

Starting the Hubot Self-Check Program
===============================================

Step 1: Hubot is running.
Step 2: Hubot-stackstorm is installed (0.9.3).
Step 3: StackStorm has aliases that are registered and enabled.
Step 4: Chatops.notify rule is present.
Step 5: Chatops.notify rule is enabled.
Step 6: Hubot responds to the "help" command.
Step 7: Hubot loads commands from StackStorm.
Step 8: chatops.post_message execution succeeded.
Step 9: The hubot adapter token is ok
Step 10: chatops.post_message has been received.
End to end test failed: Hubot not responding to "st2 list" command.

    Try reinstalling the st2chatops package. This error shouldn't
    happen unless the Hubot installation wasn't successful.
    It's also possible you changed the bot's name; this script
    assumes that "hubot" is something the bot will respond to.

I fixed this by doing:


cd /opt/stackstorm/chatops/
sudo npm install --save hubot-telegram
sudo systemctl restart st2chatops
st2ctl reload

That fixed the checker, but the commands were still not showing up.

So I looked at the logs with journalctl -u st2chatops and saw another error. That one is addressed by a fix in the following issue on the hubot-stackstorm package, although the fix has not been published to the npm package yet.

There was another issue, with this error:


ERROR Error: Bad Request: can't parse entities: Can't find end of the entity starting at byte offset

This seems to be coming from the CoffeeScript-based Telegram adapter.

FFS!

It is discussed in these github issues: Can’t parse message text bad entities

This comes down to how the Telegram API parses Markdown/HTML entities in the message text, so the text needs to be escaped (or the parse mode dropped) before it is sent.
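
The adapter itself is CoffeeScript, but the idea is simple enough to sketch in Python: escape the characters Telegram's legacy Markdown parser treats as markup before sending, so that arbitrary command output cannot produce broken entities:

import re

def escape_telegram_markdown(text):
    # Escape the characters Telegram's legacy Markdown mode treats as markup
    # so arbitrary command output cannot break entity parsing
    return re.sub(r'([_*`\[])', r'\\\1', text)

print(escape_telegram_markdown("st2 run core.local cmd='ls *_test [ok]'"))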

Using Slack by Default

Getting a slack token

You then just add the api key to: /opt/stackstorm/chatops/st2chatops.env

and enable the slack adapter.

Then restart chatops:

sudo service st2chatops restart

I had an issue where the st2chatops service was inactive.


st2chatops is not running.

I installed with the ansible playbook so I had to remove st2chatops and then reinstall it with yum.

Then ran: journalctl --unit=st2chatops

You have to update the token in /opt/stackstorm/chatops/st2chatops.env

The bot should now be running (online on Slack) and you can send it a command:

 

slack-stackstorm-hubot-chatops

But none of the st2 commands are showing. The reason for that (damn, always issues) is that the API key is wrong.

So create a new key:

st2 apikey create -k -m '{"used_by": "dotty"}'

# StackStorm API key
export ST2_API_KEY=XXX

Restart the service and you will now see the st2 commands:

sudo service st2chatops restart

Everything should work now, i.e. messages should be sent to Slack.

If you need to troubleshoot, check the chatops troubleshooting guide

Ensure the stream is accessible by doing: http https://123.123.123.123/stream/v1/stream --verify=False

You should get a response quickly; if it 504s, then you should restart the st2stream service.

Also ensure ST2_HOSTNAME is set correctly, and remember to restart the services: sudo st2ctl reload --register-all and sudo service st2chatops restart

In your channel do: @dotty help

Advanced Troubleshooting

If you are getting a 502 or 308 when accessing the api, you need to change the nginx config.

If you are getting a 401 or 403, then you need to create a new user, password and token for chatops and add that to st2chatops.env. Also ensure your hostname points to your actual host – especially when you have set up server blocks / virtual hosts.