Fix for slow TFTP transfers from tftpd-hpa

After setting up a TFTP server based on tftpd-hpa, I was disappointed in the transfer speeds I was seeing. A 15 MB file was enough to make the request time out before the transfer completed.

The recommendation I found was to increase the maximum block size in the server configuration. However, I also found a warning that some network equipment is unable to deal with fragmented packets when loading files over TFTP. The compromise I chose was a maximum block size that keeps each packet within my network MTU: 1468 bytes of data plus 4 bytes of TFTP header, 8 bytes of UDP header, and 20 bytes of IP header add up to exactly 1500 bytes, the standard Ethernet MTU.

...
TFTP_OPTIONS="--secure -B 1468"
...
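On Debian-style systems the option lives in /etc/default/tftpd-hpa. To confirm the improvement, restart the service and time a test transfer – a quick sketch, where the host and file names are made up, using curl’s ability to request a TFTP block size:

sudo systemctl restart tftpd-hpa
# Request 1468-byte blocks and time the download
time curl --tftp-blksize 1468 -o image.bin tftp://192.0.2.10/image.bin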

Et voilà: TFTP with good enough performance for regular use.

Fixing Mattermost mobile client reconnection issues over HAProxy

As I already have a reverse proxy, when the Mattermost installation documentation told me to set up a separate Nginx instance as a proxy in front of the server I simply skipped the chapter. I know how to proxy a TLS connection from an inbound port to a backend service.

Unfortunately this had the strange side effect of clients constantly attempting to reconnect – very irritating, especially in the mobile client. Then I read in the documentation that Mattermost uses WebSockets for its client communication. Usually this shouldn’t matter to HAProxy – it handles WebSocket traffic just fine – but I’ve seen strange side effects with some backends before, and this was obviously such a case.

The solution was simple: tell HAProxy to tag WebSocket traffic, and set up a separate but otherwise identical backend for this specific use case. The net result looks something like this in the config file:

frontend web
    acl host_ws hdr_beg(Host) -i ws.
    acl hdr_connection_upgrade hdr(Connection) -i upgrade
    acl hdr_upgrade_websocket hdr(Upgrade) -i websocket
    use_backend bk_ws_mattermost if host_ws { hdr(Host) -i mattermost.mydomain.tld }
    use_backend bk_ws_mattermost if hdr_connection_upgrade hdr_upgrade_websocket { hdr(Host) -i mattermost.mydomain.tld }
    use_backend bk_mattermost if { hdr(Host) -i mattermost.mydomain.tld }

backend bk_mattermost
    server mattermost mattermostsrv.mydomain.tld:8065 check

backend bk_ws_mattermost
    server mattermost mattermostsrv.mydomain.tld:8065 check

We look for the characteristics of a protocol upgrade and tell our reverse proxy to handle that data flow separately. This was enough to solve the issue.
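Not part of the fix above, but worth knowing when running WebSockets behind HAProxy: once a connection has been upgraded, HAProxy applies timeout tunnel to it instead of the usual client/server timeouts, so long-lived sockets may need something like the following in the defaults section (the one-hour value is just an example):

defaults
    # Upgraded (WebSocket) connections use 'timeout tunnel'
    # instead of 'timeout client'/'timeout server'
    timeout tunnel 1h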

Simple DNS over HTTPS setup

I read that Mozilla had been named an Internet villain by a number of British ISPs for supporting encrypted DNS queries using DNS over HTTPS. I guess the problem is that even though the traffic itself is usually encrypted nowadays, an ISP by default still knows which sites you browse, since the traditional way of looking up the IP address of a named service is performed in plaintext.

The basic fact is that knowledge of what you do on the Internet can be monetized – but the official story is, naturally, a combination of “Terrorists!” and “Think of the children!”. As usual.

Well, I got a sudden urge to become an Internet villain too, so I put a DoH resolver in front of my Bind server at home. Cloudflare – whom I happen to trust when they say they don’t sell my data – provides a couple of tools to help here. I chose to go with Cloudflared. The process for installing the daemon is pretty well documented on their download page, but for the sake of posterity it looks a bit like this:

First we’ll download the installation package. My DNS server is a Debian Stretch machine, so I chose the correct package for this:

wget https://bin.equinox.io/c/VdrWdbjqyF/cloudflared-stable-linux-amd64.deb
sudo dpkg -i cloudflared-stable-linux-amd64.deb

Next we need to configure the service. It doesn’t come with a config file out of the box, but their distribution page makes it easy enough to figure out what one needs to contain. I added a couple of things beyond the bare minimum. The file is stored as /etc/cloudflared/config.yml.

---
logfile: /var/log/cloudflared.log
proxy-dns: true
proxy-dns-address: 127.0.0.1
proxy-dns-port: 5353
proxy-dns-upstream:
  - https://1.1.1.1/dns-query
  - https://1.0.0.1/dns-query

After this we make sure the service is active, and that it’ll restart if we restart our server:

cloudflared service install
service cloudflared start
systemctl enable cloudflared.service

Next let’s try it out:

dig @127.0.0.1 -p 5353 slashdot.org

If we get an answer, it works.

The next step is to make Bind use our cloudflared instance as a DNS forwarder. We’ll edit /etc/bind/named.conf.options. The new forwarder section should look like this:

(...)
options {
        (...)
        forwarders {
                127.0.0.1 port 5353;
        };
        (...)
};

Restart Bind (service bind9 restart), and try it out by running dig @127.0.0.1 against a service you don’t usually visit. Note the absence of a port number in the latter command: if it keeps working, the whole chain is up and running.
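Putting it all together, the final check looks like this:

sudo service bind9 restart
# No port number this time: we query Bind itself on port 53,
# which forwards to cloudflared, which resolves over DoH
dig @127.0.0.1 example.org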

Restoring a really old domain controller from backups

I had an interesting experience this week, where I was faced with the need to restore an entire Active Directory environment from backups that were more than a year old.

The company whose servers I was restoring had been using an older version of Veeam Backup & Replication, which always simplifies matters a lot: the entire thing was delivered to me over sneakernet, on a 2.5″ USB drive containing several restore points for each machine.

The restore was uneventful, as expected, and most machines simply started up in their new home. Unfortunately, one of the Active Directory controllers would bluescreen on boot, with a C00002E2 error message.

After some reading up on things, I realized the machine had passed the Active Directory tombstone lifetime: as I wrote, the backups were taken over a year ago. Since I had one good domain controller, I figured I would simply cheat with the local time on the failing DC. It would boot successfully into Directory Services Restore Mode, so I could set the local clock there. But anybody with a bit of experience with the VMware line of virtualization products knows that by default, VMware ESXi synchronizes the guest system clock in a few situations – among them on reboot.

Fortunately VMware has a knowledgebase article covering how to disable all synchronization of time between guests and hosts. A total of eight advanced settings must be set to FALSE while the guest is powered off – in .vmx form they look like this:

tools.syncTime = "FALSE"
time.synchronize.continue = "FALSE"
time.synchronize.restore = "FALSE"
time.synchronize.resume.disk = "FALSE"
time.synchronize.shrink = "FALSE"
time.synchronize.tools.startup = "FALSE"
time.synchronize.tools.enable = "FALSE"
time.synchronize.resume.host = "FALSE"

The procedure is documented in KB1189.
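If you would rather script the change than click through the UI, a minimal PowerCLI sketch could look like the following – assuming the VMware.PowerCLI module, an existing Connect-VIServer session, and a hypothetical VM name:

# The VM must be powered off before changing these settings
$vm = Get-VM -Name 'dc02'
$settings = @(
    'tools.syncTime',
    'time.synchronize.continue',
    'time.synchronize.restore',
    'time.synchronize.resume.disk',
    'time.synchronize.shrink',
    'time.synchronize.tools.startup',
    'time.synchronize.tools.enable',
    'time.synchronize.resume.host'
)
foreach ($name in $settings) {
    # -Force creates the setting if it is missing, overwrites it otherwise
    New-AdvancedSetting -Entity $vm -Name $name -Value 'FALSE' -Force -Confirm:$false
}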

After setting these properties, I booted the machine back up with its system time set well before the tombstone cutoff date, let it rest for a while so all services could realize everything was all right, then set the time forward to the current date, waited a bit longer, and restarted the VM. After this, the system started working as intended.

Managing Windows servers with Ansible

Although I get to play with the fun stuff at work to a large degree, much of our environment still consists of Windows servers, and that will not change for a long time. As I’ve mentioned in earlier posts, I try to script my way around individual Windows servers using PowerShell whenever it makes sense, but when a set of changes needs to be performed across groups of servers – especially if it’s something recurring – my tool of choice really is Ansible.

The Ansible management server (which has to run a Unix-like system) needs to be able to communicate securely with the Windows hosts. WinRM, the framework used under the hood, allows a number of protocols for user authentication and transfer of commands. I personally like to have my communications TLS-secured, so I’ve opted for CredSSP, which defaults to an HTTPS-based communications channel.

A huge gotcha: I tried running the tasks below from an Ubuntu 16.04 LTS server, and there was nothing I could do to get the Python 2.7-based Ansible to correctly verify a TLS certificate from our internal CA. When I switched to running Ansible through Python 3, the exact same config worked flawlessly. The code below has been updated to reflect this state of things.

Enable CredSSP WinRM communications in Windows

Our production domain has a local Certificate Authority, which simplifies some operations. All domain members request their computer certificates from this CA, and the resulting certs have subject lines matching their hostname. The following PowerShell script will allow us to utilize the existing certificates to secure WinRM communications, along with enabling the necessary listener and firewall rules.

$hostname=hostname
# Get the thumbprint of the latest valid machine certificate
$cert=Get-ChildItem -Path cert:\LocalMachine\My -Recurse | ? { ($_.Subject -match $hostname) -and ($_.NotAfter -gt (Get-Date)) } | sort { $_.NotAfter } | select -Last 1
# Enable Windows Remote Management over CredSSP
Enable-WSManCredSSP -Role Server -Force
# Set up an HTTPS listener with the machine certificate’s thumbprint
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $cert.Thumbprint -Force
# Allow WinRM HTTPS traffic through the firewall
New-NetFirewallRule -DisplayName 'Windows Remote Management (HTTPS-In)' -Name 'Windows Remote Management (HTTPS-In)' -Direction Inbound -Protocol TCP -LocalPort 5986 -RemoteAddress LocalSubnet

Depending on your desired security level you may want to change the RemoteAddress property of the firewall rule to only allow management traffic from a single host or similar. It is a bad idea to allow remote management from untrusted networks!
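If you want to tighten the rule further after the fact, the same cmdlet family does the job – for example, restricting management traffic to a single admin host (the address here is hypothetical):

Set-NetFirewallRule -Name 'Windows Remote Management (HTTPS-In)' -RemoteAddress 10.0.0.10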

Enable CredSSP WinRM communications from Ansible

To enable Ansible to use CredSSP on an Ubuntu server, we’ll install a couple of packages:

sudo apt install libssl-dev
pip3 install pyOpenSSL
pip3 install pywinrm[credssp]

We then need to ensure that the Ansible server trusts the certificates of any Windows servers:

sudo chown root our-ca.crt
sudo chmod 744 our-ca.crt
sudo mv our-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates

And finally we’ll tell Ansible how to connect to our Windows servers – including where to find the CA-file – by adding the following to the group_vars for the server group:

ansible_user: "username@domain.tld"
ansible_password: "YourExcellentPasswordHere"
ansible_connection: winrm
ansible_port: 5986
ansible_winrm_transport: credssp
ansible_winrm_ca_trust_path: /etc/ssl/certs

Naturally, if we’re storing credentials in a file, it should be protected as an Ansible vault.
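For example, with the variables above moved into a hypothetical vault file:

# Encrypts the file in place; Ansible asks for the password when it needs it
ansible-vault encrypt group_vars/windows/vault.yml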

Finally we can try our config out. Note, as mentioned at the beginning of this article, that I had to resort to running Ansible through Python 3 to correctly validate my CA cert. It’s time to get with the times, folks… 🙂

python3 $(which ansible) windowsserver.domain.tld --ask-vault-pass -m win_ping
Vault password: 
windowsserver.domain.tld | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

To ensure that playbooks targeting Windows servers run using Python3, add the following to the Windows server group_vars:

ansible_python_interpreter: /usr/bin/python3  

Happy server management!

Simple DMARC report parsing and visualizing toolkit

Just a short post to recommend techsneeze’s tools for downloading, parsing, and displaying DMARC reports. I’m not exactly a Perl expert, so it took me a few minutes to install the necessary modules to get the scripts working, but after that I am a happy camper.

On that note: “I was today years old when I realized the usefulness of apt-file in Debian-based distros.”
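If you haven’t used it: apt-file maps file names to the packages that ship them, which is perfect for hunting down missing Perl modules. A quick sketch – the module is just an example:

sudo apt install apt-file
sudo apt-file update
# Which package ships Mail::IMAPClient?
apt-file search Mail/IMAPClient.pm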

The web reporting tool should not be exposed outside of a secured network, but at first glance it seems to do exactly what it sets out to do in visualizing SPF and DKIM failures.

File system rights on mounted drives in Windows

As I repeatedly state, the same object-oriented design that makes PowerShell potentially powerful for complex tasks also makes it require ridiculous verbosity for simple ones. Today’s post is a perfect example.

Consider a volume mounted to an NTFS mountpoint in a directory. Since this is an obvious afterthought in the file system design, setting access rights on the mountpoint directory won’t do you any good if you expect these rights to propagate down through the mounted file system. While the reason may be obvious once you think about the limitations in the design, it certainly breaks the principle of least astonishment. The correct way to set permissions on such a volume is to configure the proper ACL on the partition object itself.

In the legacy Computer Management MMC-based interface, this was simply a matter of right-clicking in the Disk Management module to change the drive properties, and then setting the correct values in the Security tab. In PowerShell, however, this isn’t a simple command, but a script with three main components:

  • Populate an ACL object with the partition object’s current security settings
  • Modify the properties of the ACL object
  • Commit the contents of the ACL object back into the partition object

Here’s how it’s done:

First we need to find the volume identifier. For this we can use Get-Partition | fl, optionally narrowed down with a Where-Object (?) filter if we know additional details. What we’re looking for is something like the following example in the DiskPath property:

\\?\Volume{f0e7b028-8f53-42fa-952b-dc3e01c161d8}
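For example, if we know the volume is currently mounted as G:, a filter like this narrows the search:

Get-Partition | ? { $_.DriveLetter -eq 'G' } | fl DriveLetter, DiskPath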

Armed with that we can now fill an object with the ACL for our volume:

$acl = [io.directory]::GetAccessControl("\\?\Volume{f0e7b028-8f53-42fa-952b-dc3e01c161d8}\")

We then create a new access control entry (ACE):

$newace = New-Object -TypeName System.Security.AccessControl.FileSystemAccessRule -ArgumentList `
    "DOMAIN\testuser", "ReadAndExecute, Traverse", "ContainerInherit, ObjectInherit", "None", "Allow"

The reason we must enter data in this exact order is the definition of the constructor for the access control entry object. There’s really no way of figuring this out from within the interactive scripting environment; you just have to have a bunch of patience and read dry documentation, or learn from code snippets found through searching the web.
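For reference, the constructor overload used above maps its arguments like this:

# FileSystemAccessRule(
#     identity,          # "DOMAIN\testuser" (string or IdentityReference)
#     fileSystemRights,  # "ReadAndExecute, Traverse"
#     inheritanceFlags,  # "ContainerInherit, ObjectInherit"
#     propagationFlags,  # "None"
#     type               # "Allow" (AccessControlType)
# )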

The next step is to load our new ACE into the ACL object:

$acl.SetAccessRule($newace)

What if we want to remove rights – for example the usually present Everyone entry? In that case we need to find every ACE referencing that user or group in our ACL, and remove it:

$acl.access | ?{$_.IdentityReference.Value -eq "Everyone"} | ForEach-Object { $acl.RemoveAccessRule($_)}

If we’ve done this job interactively, we can take a final look at our ACL to confirm it still looks sane by running $acl | fl.

Finally we’ll commit the ACL into the file system again:

[io.directory]::SetAccessControl("\\?\Volume{f0e7b028-8f53-42fa-952b-dc3e01c161d8}\",$acl)

And there we go: we’ve basically had to write an entire little program to get there, and the poor inventors of the KISS principle and of the principle of least astonishment are slowly rotating like rotisserie chickens in their graves – but we’ve managed to set permissions on a mounted NTFS volume through PowerShell.

NTFS mount points via PowerShell

As I mentioned in an earlier post, it’s sometimes useful to mount an additional drive in a directory on an existing drive, Unix-style, rather than presenting it with its own traditional Windows-style drive letter.

Here’s how we do it in PowerShell:

If the volume is already mounted to a drive letter, we need to find the disk number and partition number behind that letter:

Get-Partition | select DriveLetter, DiskNumber, PartitionNumber | ft

DriveLetter DiskNumber PartitionNumber
----------- ---------- ---------------
                     0               1
          C          0               2
                     1               1
          E          1               2
                     2               1
          F          2               2
                     3               1
          G          3               2

In this example, we see that volume G corresponds to DiskNumber 3, PartitionNumber 2.

Let’s say we want to mount that disk under E:\SharedFiles\Mountpoint. First we need to make sure the directory exists. Then we’ll run the following commands:

Add-PartitionAccessPath -DiskNumber 3 -PartitionNumber 2 -AccessPath 'E:\SharedFiles\Mountpoint\'
Remove-PartitionAccessPath -DiskNumber 3 -PartitionNumber 2 -AccessPath 'G:\'
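To double-check the result, the partition’s AccessPaths property should now include the directory:

Get-Partition -DiskNumber 3 -PartitionNumber 2 | Select-Object -ExpandProperty AccessPaths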

Summary

As usual, PowerShell is kind of “wordy”, but we do get our things done.

Creating a working Ubuntu 18.04 VMware template

Long story short: I use VMware and I use Ubuntu. With Ubuntu 16.04 everything worked nicely out of the box. With Ubuntu 18.04 it doesn’t. I finally got tired of manually setting my hostname and network settings every time I need a new server, and decided to fix my template once and for all.

Networking

The first thing that doesn’t work – as mentioned in an earlier post – is deploy-time configuration of the network based on vCenter templates.

For some weird reason, Ubuntu has chosen to entirely replace the old ifupdown system for configuring the network with a combination of Cloud-init and Netplan. If we choose to download the installation image with the traditional installer, at least we don’t get cloud-init, but Netplan remains.

False start

According to the Netplan FAQ, we can install Ubuntu Server without using Netplan by pressing F6 followed by ‘e’ in the installer boot menu, and adding netcfg/do_not_use_netplan=true to the preseed command line.

Unfortunately this leaves us with a disconnected machine after first boot: it turns out Ubuntu isn’t smart enough to actually install ifupdown when Netplan is deselected – at least not using the current installer, 18.04.1.

The working way

The solution to the problem above is still (in February 2019) to perform a clean install with Netplan, then manually remove open-vm-tools and replace it with VMware’s official tools, since open-vm-tools does not yet support Ubuntu’s weirdness even ten months after 18.04 was released.

…However…

The default DHCP behavior in Ubuntu 18.04 is nothing short of idiotic for use in VMware templates: despite newly deployed machines naturally getting new MAC addresses, Netplan’s default DHCP client identifier is derived from /etc/machine-id – which clones inherit from the template – so every clone asks to be handed the same IP address as its template, doesn’t understand that the lease is already taken, and keeps stealing the address from its siblings.

Fortunately, according to this post over at superuser.com, there’s a way to fix this. Edit /etc/netplan/01-netcfg.yaml, and tell Netplan to use the MAC address as the DHCP identifier, like this:

      dhcp4: yes
      dhcp-identifier: mac
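For context, a complete minimal file might look like this – the interface name ens192 is an assumption, albeit a typical one for VMware guests:

network:
  version: 2
  ethernets:
    ens192:
      dhcp4: yes
      dhcp-identifier: mac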

After this, new machines deployed from the template should behave slightly more sanely.

Painfully long Grub menu timeout

Grub’s boot menu has a default timeout of 30 seconds in Ubuntu 18.04. The relevant setting should be modifiable in /etc/default/grub. Only it isn’t: GRUB_TIMEOUT is already set to 2 seconds there, and Grub doesn’t adhere to it at all – apparently because Ubuntu’s “recordfail” logic overrides the normal timeout whenever the previous boot wasn’t recorded as successful. Logically (no, not at all), the “fix” is to add the following line to /etc/default/grub:

GRUB_RECORDFAIL_TIMEOUT=2

Re-run update-grub with superuser rights, and reboot the computer to confirm it worked as intended.

End result

With the changes detailed above, and after installing Python to allow Ansible to perform its magic on VMs deployed from this template, I finally have reached feature parity with my Ubuntu 16.04 template.

Rescuing vVol-based virtual machines

Background

As mentioned in a previous post, I had a really bad experience with vVols presented from IBM storage. Anyhow, the machines had to be migrated to other storage, and having read up on how vVols work, I found that a scary prospect.

The good thing: Thanks to Veeam, I have excellent backups.

The bad thing: Since they’re dependent on the system’s ability to make snapshots, I only have backups up until the point my vVols failed. Troubleshooting, identifying the underlying issue, having VMware look at the systems and point at IBM, and finally realizing IBM wouldn’t touch my issue unless I signed a year’s worth of software support agreements took several days, during which I had no new backups for the affected VMs.

Fortunately, most of the systems I had hosted on the failed storage volumes were either more or less static, or stored data on machines on regular LUNs or vSAN.

The three methods

Veeam restore

Templates and powered-off machines were marked as Inaccessible in the vCenter console. Since they had definitely seen no changes since the vVol storage broke down, I simply restored them to other datastores from the latest available backup.

VMware Converter

I attempted to use a standalone VMware Converter to migrate an Ubuntu VM, but for some reason it kept kernel panicking at boot. I suspect it has something to do with the fact that Converter demands that the paravirtual SCSI controller be replaced with the emulated LSI one. I have yet to try with a Windows server, but my initial tests made me decide to keep Converter only as an extra fallback.

Cold migration

This is the one method I was surprised worked, and it simplified things a lot. It turns out that – at least with the specific malfunction I experienced – turning off a VM that has been running doesn’t actually make it inaccessible to vCenter. And since a powered-off VM doesn’t require the creation of snapshots to allow migration, moving it to accessible storage was a breeze. This is what I ended up doing with most of the machines.

Summary

It turns out that, at least for my purposes, the vVols system decided to “fail safe”, relatively speaking, allowing cold migration of all machines that had been running when the management layer failed. I had a bit of a scare when the cold migration of a huge server failed due to a corrupt snapshot, but a subsequent retry, where I moved the machine to a faster datastore, succeeded – meaning I did not have to worry about restoring data from other copies of the machine.