Monday, April 29, 2019

HA PostgreSQL on Tribblix with Patroni

When it comes to managing PostgreSQL replication, there are a number of options available.

I looked at stolon, but it's not the only game in town. In terms of a fully managed system, there's also patroni.

In terms of overall functionality, stolon and patroni are pretty similar. They both rely on etcd (or something similar) for storing state; they both take care of running the postgres server with the right options, and reconfiguring it as necessary; they'll both promote a replica to continue service if the master fails.

So, here's how to set up a HA PostgreSQL cluster using patroni.

Before starting on anything like this with Tribblix, it's always a good idea to

zap refresh

so that you're up to date in terms of packages and overlays.

First create 3 zones, just like before:

zap create-zone -z node1 -t whole \
  -o base -O patroni -x

zap create-zone -z node2 -t whole \

  -o base -O patroni -x

zap create-zone -z node3 -t whole \

  -o base -O patroni -x

Building the zones like this, with the patroni overlay, will ensure that all the required packages are installed in the zones so you don't need to mess around with installing packages later.

Then zlogin to each node and run the etcd commands as before, to create the user and start etcd.

Now create a user to run postgres on each node

zlogin node1 (and 2 and 3)
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

Now you need to create yaml files containing the configuration for each node. See for the sample files I've used here

Log in to each node in turn

pfexec zlogin -l pguser node1

/usr/versions/python-2.7/bin/patroni ${HOME}/node1.yaml

And it initializes a cluster, with just the one node as of yet, and that node will start off as the master.

Now the 2nd node

pfexec zlogin -l pguser node2

/usr/versions/python-2.7/bin/patroni ${HOME}/node2.yaml

And it sets it up as a secondary, replicating from node1.

What do things look like right now? You can check that with:

/usr/versions/python-2.7/bin/patronictl \
  -d etcd:// \
  list my-ha-cluster

Now the third node:

pfexec zlogin -l pguser node3

/usr/versions/python-2.7/bin/patroni ${HOME}/node3.yaml

You can force a failover by killing (or ^C) the patroni process on the master, which should be node1. You'll see one of the replicas coming up as master, and replication on the other replica change to use the new master. One thing I did notice is that patroni initiates the failover process pretty much instantly, whereas stolon waits a few seconds to be sure.

You can initiate a planned failover too:

/usr/versions/python-2.7/bin/patronictl \
  -d etcd:// \
  failover my-ha-cluster

It will ask you for the new master node, and for confirmation, and then you'll have a new master.

But you're not done yet. There's nothing to connect to. For that, patroni doesn't supply its own component (like stolon does with its proxy) but depends on a haproxy instance. The overlay install we used when creating the zone will have made sure that haproxy is installed in each zone, all we have to do is configure and start it.

zlogin to each node, as root, and

wget -O /etc/haproxy.cfg
svcadm enable haproxy

You don't have to set up the haproxy stats page, but it's a convenient way to see what's going on. If you go to the stats page

Then you can see that it's got the active backend up and the secondaries marked as down - haproxy is checking the patroni REST api which is only showing the active postgres instance as up, so haproxy will route all connections through to the master. And, if you migrate the master, haproxy will follow it.

Which to choose? That's always a matter of opinion, and to be honest while there are a few differences, they're pretty much even.
  • stolon is in go, and comes as a tiny number of standalone binaries, which makes it easier to define how it's packaged up
  • patroni is in python, so needs python and a whole bunch of modules as dependencies, which makes deployment harder (which is why I created an overlay - there are 32 packages in there, over 2 dozen python modules)
  • stolon has its own proxy, rather than relying on a 3rd-party component like haproxy
As a distro maintainer, it doesn't make much difference - dealing with those differences and dependencies is part and parcel of daily life. For standalone use, I think I would probably tend towards stolon, simply because of the much smaller packaging effort.

(It's not that stolon necessarily has fewer dependencies, but remember that in go these are all resolved at build time rather than runtime.)

Thursday, April 25, 2019

HA PostgreSQL on Tribblix with stolon

I wrote about setting up postgres replication, and noted there that while it did what it said it did - ensured that your data was safely sent off to another system - it wasn't a complete HA solution, requiring additional steps to actually make any use of the hot standby.

What I'm going to describe here is one way to create a fully-automatic HA configuration, using stolon. There's a longer article about stolon, roughly explaining the motivations behind the project.

Stolon uses etcd (or similar) as a reliable, distributed configuration store. So this article follows on directly from setting up an etcd cluster - I'm going to use the same zones, the same names, the same IP addresses, so you will need to have got the etcd cluster running as described there first.

We start off by logging in to each zone using zlogin (with pfexec if you set your account up as the zone administrator when creating the zone):

pfexec zlogin node1 (and node2 and node3)

Followed by installing stolon and postgres on each node, and creating an account for them to use:

zap refresh
zap install TRIBblix-postgres11 TRIBblix-stolon TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

In all the following commands I'm assuming you have set your PATH correctly so it contains the postgres and stolon executables. Either add /opt/tribblix/postgres11/bin and /opt/tribblix/stolon/bin to the PATH, or prefix the commands with

env PATH=/opt/tribblix/postgres11/bin:/opt/tribblix/stolon/bin:$PATH

Log in to the first node as pguser.

pfexec zlogin -l pguser node1

Configure the cluster (do this just the once):

stolonctl --cluster-name stolon-cluster \
  --store-backend=etcdv3 init

It's saving the metadata to etcd. Although it's just a single key to mark the stolon cluster as existing at this point.

Now we need a sentinel.

stolon-sentinel --cluster-name stolon-cluster \

It complains that there are no keepers, so zlogin to node1 in another window and start one of those up too:

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres0 --data-dir data/postgres0 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \

after a little while, a postgres instance appears. Cool!

Note that you have to explicitly specify the listen address. That's also the address that other parts of the cluster use, so you can't use "localhost" or '*', you have to use the actual address.

You also specify the postgres superuser password, and the account for replication and its password. Obviously these ought to be the same for all the nodes in the cluster, so they can all talk to each other successfully.

And now we can add a proxy, after another zlogin to node1:

stolon-proxy --cluster-name stolon-cluster \
  --store-backend=etcdv3 --port 25432

If now you point your client (such as psql) at port 25432 you can talk to the database through the proxy.

Just having one node doesn't meet our desire to build a HA cluster, so let's add some more nodes.

Right, go to the second node,

pfexec zlogin -l pguser node2

and add a sentinel and keeper there:

stolon-sentinel --cluster-name stolon-cluster \

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres1 --data-dir data/postgres1 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \

What you'll then see happening on the second node is that stolon will automatically set the new postgres instance up as a replica of the first one (it assumes the first one you run is the master).

Then set up the third node:

pfexec zlogin -l pguser node3

with another sentinel and keeper:

stolon-sentinel --cluster-name stolon-cluster \

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres2 --data-dir data/postgres2 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \

You can also run a proxy on the second and third nodes (or on any other node you might wish to use, come to that). Stolon will configure the proxy for you so that it's always connecting to the master.

At this point you can play around, create a table, insert some data.

And you can test failover. This is the real meat of the problem.

Kill the master (^C its keeper). It takes a while, because it wants to be sure there's actually a problem before taking action, but what you'll see is one of the slaves being promoted to master. And if you run psql against the proxies, they'll send your queries off to the new master. Everything works as it should.

Even better, if you restart the old failed master (as in, restart its keeper), then it successfully sets the old master up as a slave. No split-brain, you get your redundancy back.

I tried this a few more times, killing the new master aka the original slave, and it fails across again.

I'm actually mighty impressed with stolon.

Setting up an etcd cluster on Tribblix

Using etcd to store configuration data is a common pattern, so how might you set up an etcd cluster on Tribblix?

I'll start by creating 3 zones to create a 3-node cluster. For testing these could all be on the same physical system, for production you would obviously want them on separate machines.

As root:

zap refresh

zap create-zone -z node1 -t whole -o base -x

zap create-zone -z node2 -t whole -o base -x

zap create-zone -z node3 -t whole -o base -x

If you add the -U flag with your own username then you'll be able to use zlogin via pfexec from your own account, rather than always running it as root (in other words, subsequent invocations of zlogin could be pfexec zlogin.)

Then zlogin to node1 (and node2 and node3) to install etcd, and create
a user to run the service.

zlogin node1

zap install TRIBblix-etcd
useradd -u 11798 -g staff -s /bin/bash -d /export/home/etcd etcd
passwd -N etcd
mkdir -p /export/home/etcd
chown -hR etcd /export/home/etcd

I'm going to use static initialization to create the cluster. See the
clustering documentation.

You need to give each node a name (I'm going to use the zone name) and the cluster a name, here I'll use pg-cluster-1 as I'm going to use it for some PostgreSQL clustering tests. Then you need to specify the URLs that will be used by this node, and the list of URLs used by the cluster as a whole - which means all 3 machines. For this testing I'm going to use unencrypted connections between the nodes, in practice you would want to run everything over ssl.

zlogin -l etcd node1

/opt/tribblix/etcd/bin/etcd \
  --name node1 \
  --initial-advertise-peer-urls \
  --listen-peer-urls \
  --listen-client-urls, \
  --advertise-client-urls \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=,node2=,node3= \
  --initial-cluster-state new

The same again for node2, with the same cluster list, but its own

zlogin -l etcd node2

/opt/tribblix/etcd/bin/etcd \
  --name node2 \
  --initial-advertise-peer-urls \
  --listen-peer-urls \
  --listen-client-urls, \
  --advertise-client-urls \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=,node2=,node3= \
  --initial-cluster-state new

And for node3:

zlogin -l etcd node3

/opt/tribblix/etcd/bin/etcd \
  --name node3 \
  --initial-advertise-peer-urls \
  --listen-peer-urls \
  --listen-client-urls, \
  --advertise-client-urls \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=,node2=,node3= \
  --initial-cluster-state new

OK, that gives you a 3-node cluster. Initially you'll see complaints about being unable to connect to the other nodes, but it will settle down once they've all started.

And that's basically it. I think in an ideal world this would be an SMF service, with svccfg properties defining the cluster. Something I ought to implement for Tribblix at some point.

One useful tip, while discussing etcd. How do you see what's been stored in etcd? Obviously if you know what the keys in use are, you can just look them up, but if you just want to poke around you don't know what to look for. Also, etcdctl ls has been removed, which is how we used to do it. So to simply list all the keys:

etcdctl get "" --prefix --keys-only

There you have it.

Tuesday, April 23, 2019

Setting up replicated PostgreSQL on Tribblix

When you're building systems, it's nice to build in some level of resilience. After all, failures will happen.

So, we use PostgreSQL quite a bit. We actually use a fairly traditional replication setup - the whole of the data is pushed using zfs send and receive to a second system. Problem at the source? We just turn on the DR site, and we're done.

One of the reasons for that fairly traditional approach is that PostgreSQL has, historically, not had much built in support for replication. Or, at least, not in a simple and straightforward manner. But it's getting a lot better.

Many of the guides you'll find are rather dated, and show old, rather clunky, and quite laborious ways to set up replication. With current versions of PostgreSQL it's actually pretty trivial to get streaming replication running, so here's how to demo it if you're using Tribblix.

First set up a couple of zones. The idea is that pg1 is the master, pg2 the replica. Run, as root:

zap create-zone -z pg1 -t whole -o base -x -U ptribble

zap create-zone -z pg2 -t whole -o base -x -U ptribble

This simply creates a fairly basic zone, without much in the way of extraneous software installed. Adjust the IP addresses to suit, of course. And I've set them up so that I can use zlogin from my own account.

Then login to each zone, install postgres, and create a user.

zlogin pg1
zap install TRIBblix-postgres11 TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir /export/home/pguser
chown -hR pguser /export/home/pguser

And the same for pg2.

Then log in to the master as pguser.

zlogin -l pguser pg1

Now initialise a database, and start it up:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/initdb -E UTF8 -D ~/db
env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

The next thing to do is create a PostgreSQL user that will run the streaming replication.

/opt/tribblix/postgres11/bin/psql -d postgres
set password_encryption = 'scram-sha-256';
\password replicate
Enter new password: my_secret_password

Then you need to edit postgresql.conf (in the db directory) with the following settings:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 3 # or whatever
wal_keep_segments = 64 # or whatever
hot_standby = on

And set up authentication so that the user you just created can actually access the database remotely, by adding the following line to pg_hba.conf

host   replication   replicate    scram-sha-256

ideally we would use hostssl so the connection is encrypted, but that's out of scope for this example.

Then restart the master.

Now log in to the slave.

zlogin -l pguser pg2

And all you have to do to replicate the data is run pg_basebackup:

/opt/tribblix/postgres11/bin/pg_basebackup \
  -h -c fast -D ~/db -R -P \
  -U replicate --wal-method=stream

It will prompt you for the super secret passord you entered earlier. Once that's completed you can start the slave:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

And that's it. Data you insert on the master will be written to the slave. In this mode, you can connect to the replica and issue queries to read the data, but you can't change anything.

Note that this doesn't do anything like manage failover or send client connections to the right place. It just makes sure the data is available. You need an extra layer of management to manage the master and slave(s) to actually get HA.

There are a whole variety of ways to do that, but the simplest way to test it is to stop the database on the master, which will make the slave unhappy. But if you know the master isn't coming back, you can promote the replica:

/opt/tribblix/postgres11/bin/pg_ctl promote -D ~/db

and it will spring into action, and you can point all your clients to it.. The old master is now useless to you; the only course of action you have at this point is to wipe it, and then create a new replica based on the new master.

Tuesday, April 09, 2019

A teeny bug in jkstat char handling

While messing about with illuminate, I noticed an interesting oddity in the disk display:

See on the end of the product string is that "Revision"? That shouldn't be there, and iostat -En doesn't show it. This comes from my JKstat code, so where have I gone wrong?

This comes from the sderr kstat, which is a named kstat of the device_error class.

A named kstat is just a map of keys and values. The key is a string, the value is a union so you need to know what type the data is in order to be able to interpret the bits (and the data type of each entry is stored in the kstat, so that's fine).

For that Product field, it's initialized like this:

kstat_named_init(&stp->sd_pid, "Product", KSTAT_DATA_CHAR);

OK, so it's of type KSTAT_DATA_CHAR. The relevant entry in the union here is value.c, which is actually defined as a char c[16] - the field is 16 characters in size, long enough to hold up to 128-bit ints - most of the numerical data doesn't take that much space.

(For longer, arbitrary length strings, you can stick a pointer to the string in that union instead.)

Back to iostat data. For a SCSI device (something using the sd driver), the device properties are set up in the sd_set_errstats() function in the sd driver. This does a SCSI enquiry, and then copies the Product ID straight out of the right part of the SCSI enquiry string:

strncpy(stp->sd_pid.value.c, un->un_sd->sd_inq->inq_pid, 16);

(If you're interested, you can see the structure of the SCSI enquiry string in the /usr/include/sys/scsi/generic/inquiry.h header file. The inq_pid comes from bytes 16-31, and is 16 bytes long.)

You can see the problem. The strncpy() just copies 16 bytes into a character array that's 16 bytes long. It fits nicely, but there's a snag - because it fits exactly, there's no trailing null!

The problem with JKstat here is that it is (or was, anyway) using NewStringUTF() to convert the C string into a java String, And that doesn't have any concept of length associated with it. So it starts from the pointer to the beginning of the c[] array, and keeps going until it finds the null to terminate the string.

And if you look at the sd driver, the Revision entry comes straight after the product entry in memory, so what JNI is doing here is reading past the end of the Product value, and keeps going until it find the null at the end of the next name "Revision", and takes the whole lot. It is, I suppose, fortunate that there is something vaguely sensible for it to find.

There doesn't appear to be a way of doing the right thing in JNI itself, the fix has to be to copy the correct amount of the value into a temporary string that does have the trailing null added.

(And all the system tools written in C are fine, because they do have a way to just limit the read to 16 characters.)

Monday, April 01, 2019

Notes on web servers and client certificates

With https, web servers have digital certificates to encrypt and authenticate traffic.

Web servers can also require clients to present a valid certificate, which could be used for authentication and identity.

I've recently had the misfortune to end up delving into this, so here are some notes on diagnosing and testing this from a client perspective. One of the problems here is that this all takes place before anything http-related happens, so normal diagnostic techniques are useless - and there won't be anything logged on either the server or client side to work from.

So, rather than trying to work out the exact commands next time, this is an aide-memoire. And hopefully might be useful to others too.

The first thing is to work out whether a server expects a client certificate or not. (If you can get in without, then obviously it's not requiring one, but there are other ways connections can fail.)

Fortunately, openssl can initiate the connection, allowing you to see exactly what's going on:

openssl s_client -showcerts \
  -servername \

Note that you'll usually need the -servername flag here, to stick on the SNI header, otherwise the server or load balancer or proxy won't know what to do with it.

In addition to printing out all the server certificates and some other diagnostics, this will tell you what, if any, client certificates are required. If none are expected, there will be a section that looks like:

No client certificate CA names sent
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits

If the server wants a client certificate, then it will tell you what certificates it wants:

Acceptable client certificate CA names
/CN=My Client CA
Client Certificate Types: RSA sign, DSA sign, ECDSA sign
Shared Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits

The important thing here is the "/CN=My Client CA" - the server is telling you that it wants a certificate signed by that Certificate Authority.

Assuming you have such a certificate, how do you send it? Browsers have ways to import it, but it's often easier to diagnose it from the CLI, using curl or wget. You'll need both the certificate and the key, and the syntax is:

wget --certificate=mycert.crt --private-key=mycert.key \


curl --cert mycert.crt --key mycert.key \

Assuming the certificate you have is signed by the relevant CA, this will allow you to retrieve the page. If there's a problem, you might get some meaningful diagnostics.

The other thing is how to load those into a browser. Normally you have something like a .p12 file with both parts bundled. You can create one of those like so:

openssl pkcs12 -export -out mycert.p12 \
  -in mycert.crt -inkey mycert.key

You'll get asked for a passphrase to protect the .p12 file - it contains the key, so needs to be protected in transit.

You can also extract the key and certificate from a .p12 file:

openssl pkcs12 -in mycert.p12 -nokeys -out mycert.crt
openssl pkcs12 -in mycert.p12 -nocerts -out mycert.key