Some exceptions were still not caught - and I suspect that the
memory leaks we see in production are caused by the reconnects
(if you google for python memory leaks you end up with pika and lxml
examples - we use both here, so a restart every couple of hours can't
hurt).
This way we don't miss anything - we just have to make sure we're
done within the heartbeat interval, or the server will close the
connection. But that's 60 seconds, so we're safe for this bot (and if
we fail once, we have to reconnect).
This is ugly at first glance, but has several advantages:
- we can more easily support a cold start
- as such we don't need to have a persistent queue and
can directly bind the routing keys we want
- we do the same on all openqa events, simplifying the code
- we can support short names for the checks
The last is the most significant benefit (not yet implemented though).
We can name the openqa jobs RAID1 and gnome and only have to append
the machine name (or other settings) if they conflict.
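
A minimal sketch of how the short names could be derived. This is not
implemented yet; the function name short_check_names, the (test, machine)
tuples and the "@" separator are assumptions made for illustration only.
A job keeps its plain test name unless another job uses the same test
name, in which case the machine name is appended.

```python
from collections import Counter


def short_check_names(jobs):
    """Map each (test, machine) pair to the shortest unambiguous check name.

    Hypothetical helper: `jobs` is an iterable of (test, machine) tuples.
    """
    # Drop exact duplicates while keeping the original order.
    jobs = list(dict.fromkeys(jobs))
    # Count how many distinct jobs share each test name.
    counts = Counter(test for test, _machine in jobs)
    names = {}
    for test, machine in jobs:
        if counts[test] == 1:
            # No conflict: the short name is enough.
            names[(test, machine)] = test
        else:
            # Conflict: append the machine name to disambiguate.
            names[(test, machine)] = f"{test}@{machine}"
    return names


jobs = [("RAID1", "64bit"), ("gnome", "64bit"), ("gnome", "uefi")]
names = short_check_names(jobs)
```

Here RAID1 stays short because only one job uses that name, while the
two gnome jobs get the machine name appended.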