Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## Spam statistics are live!

I've blogged about spam a few times before, and as you might have guessed defending against it and analysing the statistics thereof is a bit of a hobby of mine. Since I first installed the comment key system (and then later upgraded) in 2015, I've been keeping a log of all the attempts to post spam comments on my blog. Currently it amounts to ~27K spam attempts, which is about ~14 comments per day overall(!) - so far too many to sort out manually!

This tracking system is based on mistakes. I have a number of defences in place, and each time that defence is tripped it logs it. For example, here are some of the mistake codes for some of my defences:

Code Meaning
website A web address was entered (you'll notice you can't see a website address field in the comment form below - it's hidden to regular users)
shortcomment The comemnt was too short
invalidkey The comment key)
http10notsupported The request was made over HTTP 1.0 instead of HTTP 1.1+
invalidemail The email address entered was invalid

These are the 5 leading causes of comment posting failures over the past month or so. Until recently, the system would only log the first defence that was tripped - leaving other defences that might have been tripped untouched. This saves on computational resources, but doesn't help the statistics I've been steadily gathering.

With the new system I implemented on the 12th June 2020, a comment is checked against all current defences - even if one of them has been tripped already, leading to some interesting new statistics. I've also implemented a quick little statistics calculation script, which is set to run every day via cron. The output thereof is public too, so you can view it here:

Failed comment statistics

Some particularly interesting things to note are the differences in the mistake histograms. There are 2 sets thereof: 1 pair that tracks all the data, and another that only tracks the data that was recorded after 12th June 2020 (when I implemented the new mistake recording system).

From this, we can see that if we look at only the first mistake made, invalidkey catches more spammers out by a landslide. However, if we look at all the mistakes made, the website check wins out instead - this is because the invalidkey check happens before the website check, so it was skewing the results because the invalidkey defence is the first line of defence.

Also interesting is how comment spam numbers have grown over time in the spam-by-month histogram. Although it's a bit early to tell by that graph, there's a very clear peak around May / June, which I suspect are malicious actors attempting to gain an advantage from people who may not be able to moderate their content as closely due to recent happenings in the world.

I also notice that the overall amount of spam I receive has an upwards trend. I suspect this is due to more people knowing about my website since it's been around for longer.

Finally, I notice that in the average number of mistakes (after 2020-06-12) histogram, most spammers make at least 2 mistakes. Unfortunately there's also a significant percentage of spammers who make only a single mistake, so I can't yet relax the rules such that you need to make 2 or more mistakes to be considered a spammer.

Incidentally, it would appear that the most common pair of mistakes to make are shortcomment and website - perhaps this is an artefact of some specific scraping / spamming software? If I knew more in this area I suspect that it might be possible to identify the spammer given the mistakes they've made and perhaps their user agent.

This is, of course, a very rudimentary analysis of the data at hand. If you're interested, get in touch and I'm happy to consider sharing my dataset with you.

## Automatically organising & optimising photos and videos with Bash

As I promised recently, this post is about a script I implemented a while back that automatically organises and optimises the photos and videos that I take for me. Since I've been using it a while now and it seems stable, I thought I'd share it here in the hopes that it might be useful to someone else too.

I take quite a few photos and the odd video or two with my phone. These are automatically uploaded to a Raspberry Pi 3B+ that's acting as a file server on my home network with FolderSync (yes, it has ads, but it's the best I could find that does the job). Once uploaded to a folder, I then wanted a script that would automatically sort the uploaded images and videos into folders by year and month according to their date taken.

To do this, I implemented a script that uses exiftool (sudo apt install libimage-exiftool-perl I believe) to pull out the date taken from JPEGs and sort based on that. For other formats that don't support EXIF data, I take the last modified time with the date command and use that instead.

Before I do this though, I run my images through a few preprocessing tools:

• PNGs are optimised with optipng (sudo apt install optipng)
• JPEGs are optimised with jpegoptim (sudo apt install jpegoptim)
• JPEGs are additionally automatically reoriented with mogrify -auto-orient from ImageMagick, as many cameras will set an EXIF tag for the rotation of an image without bothering to physically rotate the image itself

It's worth noting here that these preprocessing optimisation steps are lossless. In other words, no quality lost by performing these actions - it simply encodes the images more efficiently such that they use less disk space.

Once all these steps are complete, images and videos are sorted according to their date taken / last modified time as described above. That should end up looking a bit like this:

images
+ 2019
+ 07-July
+ image1.jpeg
+ 2020
+ 05-May
+ image2.png
+ image3.jpeg
+ 06-June
+ video1.mp4

Now that I've explained how it works, I can show you the script itself:

(Can't see the above? Check out the script directly on GitLab here: organise-photos)

The script operates on the current working directory. All images directly in the working directory will be sorted as described above. Once you've put it in a directory that is in your PATH, simply call it like this:

organise-photos

The script can be divided up into 3 distinct sections:

1. The setup and initialisation
2. The function that sorts individual files themselves into the right directory (handle_file - it's about half-way down)
3. The preprocessing steps and the driver code that calls the above function.

So far, I've found that it's been working really rather well. During development and testing I did encounter a number of issues with the sorting system in handle_file that caused it to sort files into the wrong directory - which took me a while finally squash.

I'm always tweaking and improving it though. To that end, I have several plans to improve it.

Firstly, I want to optimise videos too. I'd like to store them in a standard format if possible. It's not that simple though, because some videos don't take well to being transcoded into a different format - indeed they can even take up more space than they did previously! In those cases it's probably worth discarding the attempt at transcoding the video to a more efficient format if it's larger than the original file.

I'd also like to reverse-geocode (see also the usage policy) the (latitude, longitude) geotags in my images to the name of the place that I took them, and append this data to the comment EXIF tag. This will make it easier to search for images based on location, rather than having to remember when I took them.

Finally, I'd also like to experiment with some form of AI object recognition with a similar goal as reverse-geocoding. By detecting the objects in my images and appending them to the comment EXIF tag, I can do things like search for "cat", and return all the images of cats I've taken so far.

I haven't started to look into AI much yet, but initial search results indicate that I might have an interesting time locating an AI that can identify a large number of different objects.

Anyway, my organise-photos script is available on GitLab in my personal bin folder that I commit to git if you'd like to take a closer look - suggestions and merge requests are welcome if you've got an idea that would make it even better :D

## Avoiding accidental array mutation when iterating arrays in PHP

Pepperminty Wiki is written in PHP, and I've posted before about the search engine I've implemented for it that's powered by an inverted index. In this post, I want to talk about an anti-feature of PHP that doesn't behave the way you'd expect, and how to avoid running into the same problem I did.

To do this, let's introduce a simple example of the problem at work:

<?php
$arr = []; for($i = 0; $i < 3;$i++) {
$key = random_int(0, 2000);$arr[$key] =$i;
echo("[init] key: $key, i:$i\n");
}

foreach($arr as$key => &$value) { // noop } echo("structure before: "); var_dump($arr);

foreach($arr as$key => $value) { echo("key:$key, i was $value\n"); } echo("structure after: "); var_dump($arr);
?>

The above code initialises an associative array with 3 elements. The contents might look like this:

Key Value
469 0
1777 1
1685 2

Pretty simple so far. It then iterates over it twice: once referring to the values by reference (that's what the & there is for), and the second time referring to the items by value.

You'd expect the array to be identical before and after the second foreach loop, but you'd be wrong:

Key Value
469 0
1777 1
1685 1

Wait, what? That's very odd. What's going on here? How can a foreach loop that's iterating an array by value mutate an array? To understand why, let's take a step back for a moment. Here's another snippet:

<?php

$arr = [ 1, 2, 3 ]; foreach($arr as $key =>$value) {
echo("$key:$value\n");
}

echo("The last value was $key:$value\n");
?>

What do you expect to happen here? While in Javascript with a for..of loop with a let declaration both $key and $value would have fallen out of scope by now, in PHP foreach statements don't create a new scope for variables. Instead, they inherit the scope from their parent - e.g. the global scope in the above or their containing function if defined inside a function.

To this end, we can still access the values of both $key and $value in the above example even after the foreach loop has exited! Unexpected.

It gets better. Try prefixing $value with an ampersand & in the above example and re-running it - note that both $key and $value are both still defined. This leads us to why the unexpected behaviour occurs. For some reason because of the way that PHP's foreach loop is implemented, if we re-use the same variable name for $value here in a subsequent loop it replaces the value of the last item in the array.

Shockingly enough this is actually documented behaviour (see also this bug report), though I'm somewhat confused as to how it happens on the last element in the array instead of the first.

With this in mind, to avoid this problem in future if you iterate an array by reference with a foreach loop, always remember to unset() the $value, like so: <?php$arr = [];
for($i = 0;$i < 3; $i++) {$key = random_int(0, 2000);
$arr[$key] = $i; echo("[init] key:$key, i: $i\n"); } foreach($arr as $key => &$value) {
// noop
}
unset($key); unset($value);

echo("structure before: "); var_dump($arr); foreach($arr as $key =>$value) {
echo("key: $key, i was$value\n");
}

echo("structure after: "); var_dump($arr); ?> By doing this, you can ensure that you don't accidentally mutate your arrays and spend weeks searching for the bug like I did. It's language features like these that catch developers out: and being aware of the hows and whys of their occurrence can help you to avoid them next time (if anyone can explain why it's the last element in the array that's affected instead of the first, I'd love to know!). Regardless, although I'm aware of how challenging implementing a programming language is, programming language designers should take care to avoid unexpected behaviour like this that developers don't expect. Found this interesting? Comment below! ### Sources and further reading ## Website change detection with headless Firefox and ImageMagick This wasn't the script I had in mind in the previous blog post (so you can look forward to another blog post about it), but have you ever wanted to know when a web page changes? If it does change, it's almost impossible to tell where on the page it's changed. Recently, I was thinking about the problem, and realised a few things: • Firefox can be operated headlessly (with --headless) to take screenshots • ImageMagick must be advanced enough to diff images With this in mind, I set about implementing a script. Before we continue, here's an example diff image: It's rather tall because of the webpage I chose, but the bits that have changed appear in red. The script I've written also generates an animated PNG showing the difference too: Again, it's very tall because of the page I tested with, but I think it's pretty cool! If you'd like to check the script out for yourself, you find it in the following git repository: sbrl/url-diff For the curious, the script in question is written in Bash. It uses apcalc (available in Debian / Ubuntu based Linux distributions with sudo apt install apcalc) to crunch the numbers, and headless Firefox + Imagemagick as described above to take the screenshots and do the image processing. It should in theory work on Windows, but you'll need to jump through a number of hoops: • Install call url-diff.sh from [git bash]() • Install [ImageMagick]() and make sure the binaries are in your PATH • Install Firefox and make sure firefox is in your PATH • Explicitly set the URLDIFF_STORAGE_DIR environment variable when calling the script (do this by prefixing the command at the bottom of this post with URLDIFF_STORAGE_DIR=path/to/directory) With my fancy new embed system, I can show you the code behind it: (Can't see the above? Check it out in the git repository.) I'm working on line numbers (sadly the author of highlight.js doesn't like them, so an alternative solution is required). Anyway, the basic layout of the script is as follows: 1. First, the settings are read in and the default values set 2. Then, I define some utility functions. • The calculate_percentage_colour function is integral to the image change detection algorithm. It counts percentage of an image that is a given colour. 3. Next, the help text is displayed if necessary 4. The case statement that follows allows multiple subcommands to be implemented. Currently I only have a check subcommand, but you never know! 5. Inside this case statement, the screenshots are taken and compared. • A new screenshot is taken with headless Firefox • If we don't have a screenshot stored away already, we stash the new screenshot and exit • If we do have a pre-existing screenshot, we continue with the comparison, starting by generating a diff image where pixels that have changed are given 1 colour, and pixels that haven't changed another • It's at this point that calculate_percentage_colour is called to calculate how much of the image has changed - the diff image is passed in and the changed pixels are counted • If more than 2% (by default) has changed, then we continue on to generate the output images • The first output image consists of the new screenshot with the diff image overlaid - this is generated with some ImageMagick wizardry: -compose over -composite • The second is an animated PNG comprised of the old and new screenshots. This is generated with ffmpeg - which supports animated PNGs • Finally, the old screenshot that we have stored away is replaced with the new one It sounds more complicated than it is - hopefully my above explanation makes sense (post a comment below if you're confused about something!). You can call the script like so: git clone https://git.starbeamrainbowlabs.com/sbrl/url-diff.git cd url-diff; ./url-diff.sh check URL_HERE path/to/output_diff.png path/to/output.apng ....replacing URL_HERE with the URL to check, and the paths with the places you'd like to write the output images to. ## EmbedBox: Lightweight syntax-highlighted embeds I was planning posting about something else yesterday, but I wanted to show some GitLab code in a syntax-highlighted embed. When I wasn't able to figure out how to do that, I ended up writing EmbedBox. The whole thing is best explained with an example. Have an embed: (Can't see the above? Check out the original file here) Pretty cool, right? The above is the default settings file for EmbedBox. Given any URL (e.g. https://raw.githubusercontent.com/sbrl/EmbedBox/master/src/settings.default.toml), it will generate a syntax-highlighted embed for it. It does so using highlight.php to do the syntax-highlighting server-side, Stash PHP for the cache, and without any Javascript in the embed itself. It comes with a web interface that generates the embed code given the input URL and a few other settings and shows a preview of what it'll look like. EmbedBox is open-source too (under the Mozilla Public Licence 2.0), so you're welcome to setup your own instance! To do so, check out the code here: https://github.com/sbrl/EmbedBox/ The installation instructions should be pretty straightforward in theory, but if you get stuck please open an issue. Now that I've implemented EmbedBox, you can expect to see it appear in future blog posts. I'm planning to write about my organise-photos script in the near future, so expect a blog post about it soon. Found this interesting? Got a suggestion? Want to say hi? Comment below! ## Cluster, Part 8: The Shoulders of Giants | NFS, Nomad, Docker Registry Welcome back! It's been a bit of a while, but now I'm back with the next part of my cluster series. As a refresher, here's a list of all the parts in the series so far: In this one, we're going to look at running our first job on our Nomad cluster! If you haven't read the previous posts in this series, you'll probably want to go back and read them now, as we're going to be building on the infrastructure we've setup and the groundwork we've laid in the previous posts in this series. Before we get to that though, we need to sort out shared storage - as we don't know which node in the cluster tasks will be running on. In my case, I'll be setting up NFS. This is hardly the only solution to the issue though - other options include: If you're going to choose NFS like me though, you should be warned that it's neither encrypted not authenticated. You should ensure that NFS is only run on a trusted network. If you don't have a trusted network, use the WireGuard Mesh VPN trick in part 4 of this series. ### NFS: Server Setting up a server is relatively easy. Simply install the relevant package: sudo apt install nfs-kernel-server ....edit /etc/exports to look something like this: /mnt/somedrive/subdirectory 10.1.2.0/24(rw,async,no_subtree_check) /mnt/somedrive/subdirectory is the directory you'd like clients to be able to access, and 10.1.2.0/24 is the IP range that should be allowed to talk to your NFS server. Next, open up the relevant ports in your firewall (I use UFW): sudo ufw allow nfs ....and you're done! Pretty easy, right? Don't worry, it'll get harder later on :P ### NFS: Client The client, in theory, is relatively straightforward too. This must be done on all nodes in the cluster - except the node that's acting as the NFS server (although having the NFS server as a regular node in the cluster is probably a bad idea). First, install the relevant package: sudo apt install nfs-common Then, update /etc/fstab and add the following line: 10.1.2.10:/mnt/somedrive/subdirectory /mnt/shared nfs auto,nofail,noatime,intr,tcp,bg,_netdev 0 0 Again, 10.1.2.10 is the IP of the NFS server, and /mnt/somedrive/subdirectory must match the directory exported by the server. Finally, /mnt/shared is the location that we're going to mount the directory from the NFS server to. Speaking of, we should create that directory: sudo mkdir /mnt/shared I have yet to properly tune the options there on both the client and the server. If I find that I have to change anything here, I'll both come back and edit this and mention it in a future post that I did. From here, you should be able to mount the NFS share like so: sudo mount /mnt/shared You should see the files from the NFS server located in /mnt/shared. You should check to make sure that this auto-mounts it on boot too (that's what the auto and _netdev are supposed to do). If you experience issues on boot (like me), you might see something like this buried in /var/log/syslog: mount[586]: mount.nfs: Network is unreachable ....then we can quickly hack this by creating a script in the directory /etc/network/if-up.d. It should read something like this should fix the issue: #!/usr/bin/env bash mount /mnt/shared Save this to /etc/network/if-up.d/cluster-shared-nfs for example, not forgetting to mark it as executable: sudo chmod +x /etc/network/if-up.d/cluster-shared-nfs Alternatively, there's autofs that can do this more intelligently if you prefer. ### First Nomad Job: Docker Registry Now that we've got shared storage online, it's time for the big moment. We're finally going to start our very first job on our Nomad cluster! It's going to be a Docker registry, and in my very specific case I'm going to be marking it as insecure (gasp!) because it's only going to be accessible from the WireGuard VPN - which I figure provides the encryption and authentication for us to get started reasonably simply without jumping through too many hoops. In the future, I'll probably revisit this in a later post to tighten things up. Tasks on a Nomad cluster take the form of a Nomad job file. These can written in JSON or HCL (Hashicorp Configuration Language). I'll be using HCL here, because it's easier to read and we're not after machine legibility yet at this stage. Nomad job files work a little bit like Nginx config files, in that they have nested sequences of blocks in a hierarchical structure. They loosely follow the following pattern: job > group > task The job is the top-level block that contains everything else. tasks are the items that actually run on the cluster - e.g. a Docker container. groups are a way to logically group tasks in a job, and are not required as far as I can tell (but we'll use one here anyway just for illustrative purposes). Let's start with the job spec: job "registry" { datacenters = ["dc1"] # The Docker registry *is* pretty important.... priority = 80 # If this task was a regular task, we'd use a constraint here instead & set the weight to -100 affinity { attribute = "${attr.class}"
value       = "controller"
weight      = 100
}

# .....

}

This defines a new job called registry, and it should be pretty straight forward. We don't need to worry about the datacenters definition there, because we've only got the 1 (so far?). We set a priority of 80, and get the job to prefer running on nodes with the controller class (though I observe that this hasn't actually made much of a difference to Nomad's scheduling algorithm at all).

Let's move on to the real meat of the job file: the task definition!

group "main" {
driver = "docker"

config {
image = "registry:2"
labels { group = "registry" }

volumes = [
"/mnt/shared/registry:/var/lib/registry"
]

port_map {
registry = 5000
}
}

resources {
network {
port "registry" {
static = 5000
}
}
}

# .......
}
}

There's quite a bit to unpack here. The task itself uses the Docker driver, which tells Nomad to run a Docker container.

In the config block, we define the Docker driver-specific settings. The docker image we're going to run is registry:2 where registry is the image name, and 2 is the tag. This will to automatically pulled from the Docker hub. Future tasks will pull docker images from our very own private Docker registry, which we're in the process of setting up :D

We also mount a directory into the Docker container to allow it to persist the images that we push to it. This is done through a volume, which is the Docker word for bind-mounting a specific directory on the host system into a given location inside the guest container. For me I'm (currently) going to store the Docker registry data at /mnt/shared/registry - you should update this if you want to store it elsewhere. Remember this this needs to be a location on your shared storage, as we don't know which node in the cluster the Docker registry is going to run on in advance.

The port_map allows us to tell Nomad the port(s) that our service inside the Docker container listens on, and attach a logical name to them. We can then expose them in the resources block. In this specific case, I'm forcing Nomad to statically allocate port 5000 on the host system to point to port 5000 inside the container, for reasons that will become apparent later. This is done with the static keyword there. If we didn't do this, Nomad would allocate a random port number (which is normally what we'd want, because then we can run lots of copies of the same thing at the same time on the same host).

The last block we need to add to complete the job spec file is the service block. with a service block, Nomad will inform Consul that a new service is running, which will then in turn allow us to query it via DNS.

service {
name = "${TASK}" tags = [ "infrastructure" ] address_mode = "host" port = "registry" check { type = "tcp" port = "registry" interval = "10s" timeout = "3s" } } The service name here is pulled from the name of the task. We tell Consul about the port number by specifying the logical name we assigned to it earlier. Finally, we add a health check, to allow Consul to keep an eye on the health of our Docker registry for us. This will appear as a green tick if all is well in the web interface, which we'll be getting to in a future post. The health check in question simply ensures that the Docker registry is listening via TCP on the port it should be. Here's the completed job file: job "registry" { datacenters = ["dc1"] # The Docker registry *is* pretty important.... priority = 80 # If this task was a regular task, we'd use a constraint here instead & set the weight to -100 affinity { attribute = "${attr.class}"
value       = "controller"
weight      = 100
}

group "main" {

driver = "docker"

config {
image = "registry:2"
labels { group = "registry" }

volumes = [
"/mnt/shared/registry:/var/lib/registry"
]

port_map {
registry = 5000
}
}

resources {
network {
port "registry" {
static = 5000
}
}
}

service {
tags = [ "infrastructure" ]

port = "registry"
check {
type        = "tcp"
port        = "registry"
interval    = "10s"
timeout     = "3s"
}

}
}

//  driver = "docker"
//
//  config {
//      // We're going to have to build our own - the Docker image on the Docker Hub is amd64 only :-/
//      // See https://github.com/Joxit/docker-registry-ui
//      image = ""
//  }
// }
}
}

Save this to a file, and then run it on the cluster like so:

nomad job run path/to/job/file.nomad

I'm as of yet unsure as to whether Nomad needs the file to persist on disk to avoid it getting confused - so it's probably best to keep your job files in a permanent place on disk to avoid issues.

Give Nomad to start the job, and then you can check on it's status like so:

nomad job status

This will print a summary of the status of all jobs on the cluster. To get detailed information about our new job, do this:

nomad job status registry

It should show that 1 task is running, like this:

ID            = registry
Name          = registry
Submit Date   = 2020-04-26T01:23:37+01:00
Type          = service
Priority      = 80
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
main        0       0         1        5       6         1

Latest Deployment
ID          = ZZZZZZZZ
Status      = successful
Description = Deployment completed successfully

Deployed
main        1        1       1        0          2020-06-17T22:03:58+01:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
XXXXXXXX  YYYYYYYY  main        4        run      running  6d2h ago  2d23h ago

Ignore the Failed, Complete, and Lost there in my output - I ran into some snags while learning the system and setting mine up :P

You should also be able to resolve the IP of your Docker registry via DNS:

dig +short registry.service.mooncarrot.space

mooncarrot.space is the root domain I've bought for my cluster. I highly recommend you do the same if you haven't already. Consul exposes all services under the service subdomain, so in the future you should be able to resolve the IP of all your services in the same way: service_name.service.DOMAIN_ROOT.

Take care to ensure that it's showing the right IP address here. In my case, it should be the IP address of the wgoverlay network interface. If it's showing the wrong IP address, you may need to carefully check the configuration of both Nomad and Consul. Specifically, start by checking the network_interface setting in the client block of your Nomad worker nodes from part 7 of this series.

### Conclusion

We're getting there, slowly but surely. Today we've setup shared storage with NFS, and started our first Nomad job. In doing so, we've started to kick the tyres of everything we've installed so far:

• wesher, our WireGuard Mesh VPN
• Unbound, our DNS server
• Consul, our service discovery superglue

Truly, we are standing on the shoulders of giants: a whole host of open-source software that thousands of people from across the globe have collaborated together to produce which makes this all possible.

Moving forwards, we're going to be putting that Docker registry to good use. More immediately, we're going to be setting up Fabio (who's documentation is only marginally better than Traefik's, but just good enough that I could figure out how to use it....) in order to take a peek at those cool web interfaces for Nomad and Consul that I keep talking about.

We're also going to be looking at setting up Vault for secret (and certificate, if all goes well) management.

Until then, happy cluster configuration! If you're confused about anything so far, please leave a comment below. If you've got a suggestion to make it even better, please comment also! I'd love to know.

## Ensuring a Linux machine's network connection stays up with Bash

Recently, I had the unpleasant experience of my Lab machine at University dropping offline. It has a tendency to do this randomly - and normally I'd just reboot it myself, but since I'm working from home at the moment it meant that I couldn't go in to fix it. This unfortunately meant that I was stuck waiting for a generous technician to go in and reboot it for me.

With access now restored I decided that I really didn't want this to happen again, so I've written a simple Bash script to resolve the issue.

It works by checking for an Internet connection every hour by pinging starbeamrainbowlabs.com - and if it doesn't manage to do so successfully, then it will reboot. A simple concept, but I discovered a number of things that needed considering while writing it:

1. To avoid detecting transient network issues, we should make multiple attempts before giving up and rebooting
2. Those multiple attempts need to be delayed to be effective
3. We mustn't reboot more than once an hour to avoid getting into a 'reboot loop'
4. If we're running an experiment, we need a way to temporarily delay it from doing it's checks that will resume automatically
5. We could try and diagnose the network error or turn the networking of and on again, but if it gets stuck halfway through then we're locked out (very undesirable) - so it's easier / safer to just reboot

With these considerations in mind, I came up with this: ensure-network.sh (link to part of a GitHub Gist, as it's quite long)

This script requires Bash version 4+ and has a number of environment variables that can configure its behaviour:

Environment Variable Description
CHECK_EXTERNAL_HOST The domain name or IP address to ping to check the connection
CHECK_INTERVAL The interval to check the connection in seconds
CHECK_TIMEOUT Wait at most this long for a reply to our ping
CHECK_RETRIES Retry this many times before giving up and rebooting
CHECK_RETRY_DELAY Delay this many seconds in between retries
CHECK_DRY_RUN If true, then don't actually reboot (useful for testing)
CHECK_REBOOT_DELAY Leave at least this many minutes in between reboots
CHECK_POSTPONE_FILE If this file exists and has a recent last-modified time (mtime), don't actually reboot
CHECK_POSTPONE_MAXAGE The maximum age in minutes of the CHECK_POSTPONE_FILE to consider it fresh and avoid rebooting

With these environment variables, it covers all 4 points in the above list. To expand on CHECK_POSTPONE_FILE, if I'm running an experiment for example and I don't want it to reboot in the middle of said experiment, then I can simply run touch /path/to/postpone_file to delay network connection-related reboots for 7 days (by default). After this time, it will automatically start rebooting again if it drops off the network. This ensures that it will always restart monitoring eventually - as if I had a more manual system I'd forget to re-enable it and then loose access.

Another consideration is that the /var/cache directory must exist. This is because an empty tracking file is created there to keep track of when the last network connection-related reboot occurred.

With the script written, then next step is to have it run automatically on boot. For systemd-based systems such as my lab machine, a systemd service is the order of the day. This is relatively simple:

[Unit]
Description=Reboot if the network connection is down
After=network.target

[Service]
Type=simple
# Because it needs to be able to reboot
User=root
Group=root
EnvironmentFile=-/etc/default/ensure-network
ExecStartPre=/bin/sleep 60
ExecStart=/bin/bash "/usr/local/lib/ensure-network/ensure-network.sh"
SyslogIdentifier=ensure-access
StandardError=syslog
StandardOutput=syslog

[Install]
WantedBy=multi-user.target

This assumes that the ensure-network.sh script is located at /usr/local/lib/ensure-network/ensure-network.sh. It also allows for an environment file to optionally be created at /etc/default/ensure-network, so that you can customise the parameters. Here's an example environment file:

CHECK_EXTERNAL_HOST=example.com
CHECK_INTERVAL=60

The above example environment file checks against example.com every minute instead of the default starbeamrainbowlabs.com every hour. You can, of course, specify any (or all) of the environment variables detailed above in the environment file if you wish.

That completes my setup - so hopefully I don't encounter any more network-related issues that lock me out of accessing my lab machine remotely! To install it yourself, you can do this:

# Create the directory for the script to live in
sudo mkdir /usr/local/lib/ensure-network
sudo curl -L -O /usr/local/lib/ensure-network/ensure-network.sh https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/cc5ab4226472c08b09e448a257256936cc749193/ensure-network.sh
sudo curl -L -O /etc/systemd/system/ensure-network.service https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/adf5ed4009b3e1a09f857936fceb3581897072f4/ensure-network.service
# Start the service & enable it on boot
sudo systemctl start ensure-network.service
sudo systemctl enable ensure-network.service

You might need to replace the URLs there with the latest ones that download the raw content from the GitHub Gist.

Did you find this useful? Got a suggestion to make it better? Running into issues? Comment below!

## Analysing logs with lnav

Before I forget about it, I want to make a note on here about lnav. It's available in the default Ubuntu repositories, and I discovered it a while back.

(Above: a screenshot of lnav. The pixellated bits are the IPs, which I've hidden for privacy.)

Essentially, it's a tool to make reading and analysing log files much easier. It highlights the interesting bits, and also allows you to filter log lines in or out with regular expressions. It even allows you to query your logs with SQLite if they are in any of the well-known formats that it can parse - and you can write your own log line parser definitions too with a JSON configuration file!

I find it a great tool to us every now and then to get an overview of my various devices that I manage to see if there are any issues I need to take care of. The error and warning message highlighting (while not perfect) is also rather useful to help in spotting the things that require my attention.

If you're on a Debian-based distribution of Linux, you should be able to install it like so:

sudo apt install lnav

Then, to analyse some log files:

lnav path/to/log/files

You can also use Bash's glob-star feature to specify multiple log files. it can also automatically unpack gzipped logfiles too:

lnav /var/log/syslog*

Of course, don't forget to prefix with sudo if you require it to read a given logfile.

## PhD Update 4: Ginormous Data

Hello again! In the last PhD update blog post, I talked about patching HAIL-CAESAR to improve performance and implementing a Temporal Convolutional Neural Net (Temporal CNN).

Since making that post, I've had my PhD Panel 1 (very useful, thanks to everyone who was on that panel!). I've also got an initial - albeit untested - implementation of a Temporal CNN. I've also been wrangling lots of data in more ways than one. I'm definitely seeing the Big Data aspect of my project title now.

### HAIL-CAESAR

I ran HAIL-CAESAR initially at 50m per pixel. This went ok, and generated lots of data out, but in 2 weeks of real time it barely hit 43 days worth of simulation time! The other issue I discovered due to the way I compressed the output of HAIL-CAESAR, for some reason it compressed the output files before HAIL-CAESAR had finished writing to them. This resulted in the data being cut off randomly in the output files for each time step.

Big problem - clearly another approach is in order.

To tackle these issues, I've done several things. Firstly, I patched HAIL-CAESAR again to support writing the output water depth files to the standard output. As a refresher, they are actually identical in format to the heightmap, which looks a bit like this:

ncols 4
nrows 3
xllcorner 400000
yllcorner 300000
cellsize 1000
1 2 3 4
1 1 2 3
0 1 1 2

The above is a 4x3 grid of points, with the bottom-left corner being at (400000, 300000) on the Ordnance Survey National Grid (I know, latitude / longitude would be so much better, but all the data I'm working with is on the OS national grid :-/). Each point represents a 1km square area.

To this end, I realised that it doesn't actually matter if I concatenate multiple files in this format together - I can guarantee that I can tell them apart. As soon as I detect a metadata line that the current file has already declared, then I know that the next file is starting and we're starting to read the next file along. To this end, I implemented a new Terrain50.ParseStream() function that is an async generator that will take a stream, and then iteratively yield the Terrain50 instances it parses out of the stream. In this way, I can split 1 big continuous stream back up again into the individual parts.

By patching HAIL-CAESAR such that it outputs the data in 1 continuous stream, it also means that I can pipe it to a single compression program. This has 2 benefits:

• It avoids the "compressing the individual files before HAIL-CAESAR is ready" problem (the observant might note that inotifywait would solve this issue neatly too, but it isn't installed on Viper)
• It allows for more efficient compression, as the compression program can use data from other time step files as context

Finding a compression tool was next. I wanted something CPU efficient, because I wanted to ensure that the maximum number of CPU cycles were dedicated to HAIL-CAESAR for running the simulation, rather than compressing the output it generates - since it is the bottleneck after all.

I ended up using lz4 in the end, an extremely fast compression algorithm. It compiles easily too, which is nice as I needed to compile it from source automatically on Viper.

With all this in place, I ran HAIL-CAESAR again 2 more times. The first run was at the same resolution as before, and generated 303 GiB (!) of data.

The second run was at 500m per pixel (10 times lower resolution), which generated 159 GiB (!) of data and, by my calculations, managed to run through ~4.3 years in simulation time in 5 days of real time. Some quick calculations suggest that to get through all 13 years of rainfall radar data I have it would take just over 11 days, so since I've got everything setup already, I'm going to be contacting the Viper administrators to ask about running a longer job to allow it to complete this process if possible.

### Temporal CNN Preprocessing

The other major thing I've been working on since the last post is the Temporal CNN. I've already got an initial implementation setup, and I'm currently in the process of ironing out all the bugs in it.

I ran into a number of interesting bugs. One of these was to do with incorrectly specifying the batch size (due to a typo), which resulted in the null values you may have noticed in the model summary in the last post. With those fixed, it looks much more sensible:

_________________________________________________________________
Layer (type)                 Output shape              Param #
=================================================================
conv3d_1 (Conv3D)            [32,2096,3476,124,64]     16064
_________________________________________________________________
conv3d_2 (Conv3D)            [32,1046,1736,60,64]      512064
_________________________________________________________________
conv3d_3 (Conv3D)            [32,521,866,28,64]        512064
_________________________________________________________________
pooling (AveragePooling3D)   [32,521,866,1,64]         0
_________________________________________________________________
reshape (Reshape)            [32,521,866,64]           0
_________________________________________________________________
conv2d_output (Conv2D)       [32,517,862,1]            1601
_________________________________________________________________
reshape_end (Reshape)        [32,517,862]              0
=================================================================
Total params: 1041793
Trainable params: 1041793
Non-trainable params: 0
_________________________________________________________________

This model is comprised of the following:

• 3 x 3D convolutional layers
• 1 x pooling layer to average out the temporal dimension
• 1 x reshaping layer to remove the redundant dimension
• 1 x 2D convolutional layer that will produce the output
• 1 x reshaping layer to remove another redundant dimension

I'll talk about this model in more detail in a future post in this series once I've managed to get it running and I've played around with it a bit.

Another significant one I ran into was to do with stacking tensors like an image. I ended up asking on Stack Overflow: How do I reorder the dimensions of a rank 3 tensor in Tensorflow.js?

The input to the above model is comprised of a sliding window that moves along the rainfall radar time steps. Each time step contains a 2D array, representing the amount of rain that has fallen in a given area. This needs to be combined with the heightmap, so that the AI model knows what the terrain that the rain is falling on looks like.

The heightmap doesn't change, but I'm including a copy of it with every rainfall radar time step because of the way the 3D convolutional layer works in Tensorflow.js. 2D convolutional layers in Tensorflow.js, for example, take in a 2D array of data as a tensor. They can also take in multiple channels though, much like pixels in an image. The pixels in an image might look something like this:

R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3 .....

As you might have seen in the Stack Overflow answer I linked to above, Tensorflow.js does support stacking multiple 2D tensors in this fashion. It is unfortunately extremely slow however. It is for this reason that I've been implementing a multi-process program to preprocess the data to do this stacking in advance.

As I'm writing this though, I've finally understood what the dataFormat option is for in the conv3d and conv2d layers is for, and I think I might have been barking up the wrong tree......

### What's next

From here, I'm going to investigate that dataFormat option for the TemporalCNN - it would hugely simplify my setup and remove the need for me to preprocess the data beforehand, since stacking tensors directly 1 after another is very quick - it's just stacking them along a different dimension that's slow.

I'm also hoping to do a longer run of that 500m per pixel HAIL-CAESAR simulation. More data is always good, right? :P

After I've looked into the dataFormat option, I'd really like to get the Temporal CNN set off training and all the bugs ironed out. I'm so close I can almost taste it!

Finally, if I have time, I want to go looking for a baseline model of sorts. By this, I mean an existing model that's the closest thing to the task I'm doing - even though they might not be as performant or designed for my specific task.

Found this interesting? Got a suggestion of something I could do better? Confused about something I've talked about? Comment below!

## PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes

Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those will be back next week I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find a solution elsewhere on the web.

For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....

The solution, of course, is to do all the preprocessing up front such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I did the items 1 by 1. The thing is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:

In Javascript everything runs in parallel... except your code

In other words, when Node.js is reading or writing to and from the network, disk, or other places it can do lots of things at the same time because it does them asynchronously. The Javascript that gets executed though is only done on a single thread though.

This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also the reason why Nginx is more efficient than Apache (because Nginx is event based too like Javascript, unlike Apache which is thread based).

It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.

Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:

// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
workers.push(
child_process.fork("worker.mjs")
);
}
// worker.js
console.log(Hello, world from a child process!);

Very useful! The next much more sticky problem though is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing I have yet to implement this).

Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.

I've actually encountered an issue like this one before. I initially tried reading it in on the master process, and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck as it couldn't read from the standard input and send stuff to the workers fast enough.

With this in mind, I came up with a new plan. In Node.js, when you're forking to create a worker process, you can supply it with some custom file descriptors upon initialisation. So long as it has at least IPC (inter-process communication) channel for passing messages back and forth with the .send() and .on("message", (message) => ....) method and listeners, it doesn't actually care what you do with the others.

Cue file descriptor cloning:


// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i 

I've highlighted the key line here (line 10 for those who can't see it). Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.

With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.

With this sorted, it gives rise to the next issue: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer would be local to each worker process, so when the next worker along reads, it will skip a random number of bytes and start reading from the next bit along.

This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:

import fs from 'fs';

//  .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);
let i = 0;
while(true) {
if(bytes_read !== 1 || buffer[i] == 0x0A) {
if(i == 0 && bytes_read == null) return null;
return buffer.toString("utf-8", 0, i); // This is not inclusive, so we can abuse it to trim the \n off the end
}

i++;
if(i == buffer.length) {
let new_buffer = new Buffer(Math.ceil(buffer.length * 1.5));
buffer.copy(new_buffer);
buffer = new_buffer;
}
}
}

I read from the given file descriptor character by character directly into a buffer. As soon as it detects a new line character (\n, or character code 0x0A), it returns the new line. If we run out of space in the buffer, then we create a new larger one, copy the old buffer's contents into it, and keep going.

I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in a rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new pretty large buffer every time.

This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.

Finally, we can also guarantee that it won't be a problem if we call this multiple times, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen 1 after another.

With this chain of Node.js features, we can read a large amount of data from and efficiently process the content of a pipe. The trick from here is to implement a proper messaging and locking system to avoid reading from the stream at the same time, and avoid write to the standard output at the same time.

Taking this further, I ended up with this:

(Licence: Mozilla Public Licence 2.0)

This correctly ensures that only 1 worker process reads from the stream at the same time. It doesn't do anything with the result though except log a message to the console, but when I implement that I'll implement a similar messaging system to ensure that only 1 process writes to the output at once.

On that note, my data is also ordered, so I'll have to implement a complicated cache system // ordering system to ensure that I write them to the standard output in the same order I read them in. When I do implement that, I'll probably blog about that too....

The main problem I still have with this solution is that I'm reading from the input stream. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I not sure I can avoid this though and read a few KiBs at a time, because I don't currently know of any way to put the extra characters back into the input stream.

If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!

Art by Mythdael