Fixing Mattermost mobile client reconnection issues over HAProxy

As I already have a reverse proxy, I simply skipped the chapter when the Mattermost installation documentation told me to set up a separate Nginx instance as a proxy in front of the server. After all, I know how to proxy a TLS connection from an inbound port to a backend service.

Unfortunately, it had the strange side effect of clients attempting to reconnect over and over in rapid succession. Very irritating, especially in the mobile client. Then I read in the documentation that Mattermost uses WebSockets for its client communication. Usually this shouldn't matter to HAProxy – it should handle the upgrade just fine – but I've sometimes had strange side effects with backends, and this was obviously such a case.

The solution was simple: tell HAProxy to tag WebSocket traffic and route it to a separate but otherwise identical backend for this specific use case. The net result looks something like this in the config file:

frontend web
    acl host_ws hdr_beg(Host) -i ws.
    acl hdr_connection_upgrade hdr(Connection) -i upgrade
    acl hdr_upgrade_websocket hdr(Upgrade) -i websocket
    use_backend bk_ws_mattermost if host_ws { hdr(Host) -i mattermost.mydomain.tld }
    use_backend bk_ws_mattermost if hdr_connection_upgrade hdr_upgrade_websocket { hdr(Host) -i mattermost.mydomain.tld }
    use_backend bk_mattermost if { hdr(Host) -i mattermost.mydomain.tld }

backend bk_mattermost
    server mattermost mattermostsrv.mydomain.tld:8065 check

backend bk_ws_mattermost
    server mattermost mattermostsrv.mydomain.tld:8065 check

We look for the characteristics of a protocol upgrade and tell our reverse proxy to handle that data flow separately. This was enough to solve the issue.
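
If you want to verify the tagging without firing up a client, you can fake the first half of a WebSocket handshake with curl and watch which backend answers. A rough sketch – the API path is what I believe Mattermost exposes for its WebSocket, and the key is just a throwaway value, so treat this as a routing check rather than a full handshake test:

curl -i -N --http1.1 https://mattermost.mydomain.tld/api/v4/websocket \
    -H "Connection: Upgrade" \
    -H "Upgrade: websocket" \
    -H "Sec-WebSocket-Version: 13" \
    -H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw=="

A 101 Switching Protocols response means the upgrade made it through HAProxy to the WebSocket backend; Ctrl-C out of it once you've seen the headers.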

Simple DNS over HTTPS setup

I read that Mozilla had been named an Internet villain by a number of British ISPs for supporting encrypted DNS queries using DNS over HTTPS. I guess the problem is that an ISP by default knows which sites you browse – even though the traffic itself is usually encrypted nowadays – since the traditional way of looking up the IP address of a named service is performed in plaintext.

The basic fact is that knowledge of what you do on the Internet can be monetized – but the official story naturally is a combination of “Terrorists!” and “Think about the children!”. As usual.

Well, I got a sudden urge to become an Internet villain too, so I put a DoH resolver in front of my Bind server at home. Cloudflare – whom I happen to trust when they say they don’t sell my data – provide a couple of tools to help here. I chose to go with Cloudflared. The process for installing the daemon is pretty well documented on their download page, but for the sake of posterity looks a bit like this:

First we’ll download the installation package. My DNS server is a Debian Stretch machine, so I chose the correct package for this:

wget https://bin.equinox.io/c/VdrWdbjqyF/cloudflared-stable-linux-amd64.deb
dpkg -i cloudflared-stable-linux-amd64.deb
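
To confirm the package landed where it should before going further, the binary should be able to report its version (this assumes the standard --version flag, which these Go binaries generally support):

cloudflared --version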

Next we need to configure the service. It doesn't come with a config file out of the box, but their distribution page documents what it needs to contain, and I added a couple of things beyond the bare minimum. The file is stored as /etc/cloudflared/config.yml.

---
logfile: /var/log/cloudflared.log
proxy-dns: true
proxy-dns-address: 127.0.0.1
proxy-dns-port: 5353
proxy-dns-upstream:
  - https://1.1.1.1/dns-query
  - https://1.0.0.1/dns-query
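
Before installing it as a service, it can be worth running the DNS proxy once in the foreground against the same upstreams, just to confirm they're reachable from this machine. A sketch, assuming the proxy-dns subcommand and flag names are as I remember them – the port here is deliberately a different throwaway one:

cloudflared proxy-dns --port 5054 --upstream https://1.1.1.1/dns-query --upstream https://1.0.0.1/dns-query

Stop it with Ctrl-C once you've seen it answer a test query.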

After this we make sure the service is active, and that it'll restart if we restart our server:

cloudflared service install
service cloudflared start
systemctl enable cloudflared.service
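
A quick sanity check that systemd agrees with us – these are plain systemctl commands, nothing cloudflared-specific:

systemctl status cloudflared.service
systemctl is-enabled cloudflared.service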

Next let’s try it out:

dig @127.0.0.1 -p 5353 slashdot.org

If we get an answer, it works.

The next step is to make Bind use our cloudflared instance as a DNS forwarder. We’ll edit /etc/bind/named.conf.options. The new forwarder section should look like this:

(...)
options {
(...)
        forwarders {
                127.0.0.1 port 5353;
        };
(...)
};

Restart Bind (service bind9 restart), and try it out by running dig @127.0.0.1 against a name you don't usually visit, so the answer isn't already sitting in the cache. Note the absence of a port number in the latter command: we're now querying Bind itself rather than cloudflared directly. If it keeps working, the whole chain is up and running.
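
A related knob, not shown in the config above: adding forward only; to the same options block stops Bind from falling back to its own plaintext recursion if cloudflared is unreachable – better for privacy, but it also means name resolution dies completely if the forwarder goes down:

        forward only;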

Restoring a really old domain controller from backups

I had an interesting experience this week, where I was faced with the need to restore an entire Active Directory environment from backups that were more than a year old.

The company whose servers I was restoring had been using an older version of Veeam Backup & Replication, which always simplifies matters a lot: the entire thing was delivered to me over sneakernet, on a 2.5″ USB drive containing several restore points for each machine.

The restore was uneventful, as expected, and most machines simply started up in their new home. Unfortunately, one of the Active Directory controllers would bluescreen on boot, with a C00002E2 error message.

After some reading up on things, I realized the machine had passed the Active Directory tombstone period: as I wrote, the backups were taken over a year ago. Since I had one good domain controller, I figured I would simply cheat with the local time on the failing DC. It would boot successfully into Directory Services Restore Mode, so I could set the local clock there, but anybody with a bit of experience with the VMware line of virtualization products knows that by default, VMware ESXi synchronizes the guest system clock in a few situations – among them on reboot.
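
As an aside, if you want to know what the tombstone lifetime actually is in a given forest – the default is 60 or 180 days depending on when the forest was first created – it can be read from the Directory Service object on a healthy DC. Something like this, with the domain components being placeholders for your own:

dsquery * "cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,dc=mydomain,dc=tld" -scope base -attr tombstoneLifetime

An empty value means the attribute was never set, which translates to the old 60-day default.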

Fortunately VMware has a knowledgebase article covering how to disable all synchronization of time between guests and hosts. A total of eight advanced settings must be set to False, with the guest turned off:

tools.syncTime
time.synchronize.continue
time.synchronize.restore
time.synchronize.resume.disk
time.synchronize.shrink
time.synchronize.tools.startup
time.synchronize.tools.enable
time.synchronize.resume.host

The procedure is documented in KB1189.
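
If you'd rather edit the .vmx file directly than click through the advanced settings dialog, my understanding is that the resulting entries should look roughly like this – double-check against the KB article before trusting it with a domain controller:

tools.syncTime = "FALSE"
time.synchronize.continue = "FALSE"
time.synchronize.restore = "FALSE"
time.synchronize.resume.disk = "FALSE"
time.synchronize.shrink = "FALSE"
time.synchronize.tools.startup = "FALSE"
time.synchronize.tools.enable = "FALSE"
time.synchronize.resume.host = "FALSE"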

After setting these properties on the machine, I started it back up with the system time set well before the tombstone cutoff date, and let it rest for a while so that all services could realize everything was alright. Then I set the time forward to the current date, waited a bit longer, and restarted the VM. After this, the system started working as intended.
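
Rather than just going on gut feeling about when things have settled, the usual tools can confirm that replication between the two domain controllers is healthy again – these are standard commands on any supported Windows Server:

repadmin /replsummary
dcdiag /q

repadmin gives a summary of replication status per partner, and dcdiag with /q only prints errors, so silence is good news.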