Saturday, November 02, 2019

Testing hardware

My relationship with Sun wasn't just about testing Solaris. One of the other things we were involved with was beta testing new hardware.

Doing beta testing of software contributed to us being offered the chance to test new hardware, of course. Sun could be reasonably sure they would get lots of honest feedback from us.

And we had also given a huge amount of feedback via the sales organization, usually of the form "we don't want systems like that, why can't you make something like this instead?".

Some examples: at one point we ended up buying Dell 6350s (running Solaris on Intel, of course). They were a lot cheaper, for one thing. Paradoxically, we often had more memory in those 32-bit Intel machines than in the 64-bit SPARC servers - mostly down to cost, but sometimes due to configuration limits. But a lot of it was that the 4U rack-mount Dell would fit in the datacenter, whereas we wouldn't have been able to fit a pile of E450s in even if we could have afforded them.

Another example was that we bought a load of Ultra-5 workstations, some cheap shelving, and a KVM switch. Tied together with Grid Engine and its distributed make, they could get through some parallel tasks for a lot less than any server on the price list at the time. We tried (and failed) to persuade Sun to go even further with a cut-down system - we didn't need a CD drive or anything like that, it was just waste.

We also kept complaining about little things like chassis design, cabling, configuration, access and repair.

Eventually, we were asked to trial some new products.

One of the most interesting, and long running, was the B1600 blade system. As in blade servers, not Blade workstations. Aka Stiletto.

This had a variety of blades - simple cheap single-processor SPARCs (B100s - like a V120 or flapjack, but smaller), a single-processor AMD (B100x), and a twin-processor Xeon (B200x). There was also a plan for a variety of appliance blades; the only one that came out, I think, was a load-balancing device, but we didn't test those. If you think about the design, the whole chassis full of SPARC blades was not much bigger than an Ultra-5, but far more powerful and easier to wire.

I had an unfortunate accident while on holiday just before we started the testing, where I broke my arm. I wasn't able to go to the "training" session and learn about the system, but then I'm a great believer that systems should actually be obvious to use. As the only UK customer, though, we had one of the engineering team come and make a video of us unpacking, racking, and configuring the system.

So there was me just supervising, Andy with the video camera, and Geoff and Terry doing the lifting. All the way through from opening the box to having things ready to roll. So the project did cover things like the way it was packed in the carton, how you were able to lift it out safely, and how useful the bits of paper that came with it were.

We vastly preferred Sun rack cabinets - they're just so much stronger and more stable - but we decided to try putting the B1600 chassis into a third-party rack as an additional test. This was a nightmare! The B1600 chassis didn't use traditional rackmount rails; it had some very thin wheels that slotted into side-rails, and the tolerances were tiny - you had to line the thing up to within a millimetre, and getting anything with cage nuts aligned to under a millimetre by eye was always going to be difficult. It took us the best part of an hour (part of that was us giving a running commentary, to be fair) and multiple attempts, including some where we thought it was nicely aligned but it actually wasn't, so it could have fallen.

The video was widely shared inside Sun, as I understand it, and the fix was to supply a simple metal measuring bar that you could offer up to the rack to ensure everything was square and at the correct spacing. We ended up trying several designs as they optimised it. If you've ever wondered how those spacer bars originated, now you know!

Of course, we tried sticking it in a proper Sun rack, and it was racked perfectly there in 5 seconds flat before Andy could even get over there with the camera.

Sun didn't really help themselves at times. When we first got the x86 blades, we couldn't run Solaris on them and had to run Linux for a bit. Things like drivers and management interfaces took a while to be completed.

Another project we did was the original V40z, the first generation of 64-bit Opterons (this was just the Newisys Opteron reference design with a different bit of plastic tacked on the front). This was less about the actual hardware than about bringing in a 64-bit operating system and the overall ecosystem associated with it. The first thing we noticed was that as soon as you applied power, the fans screamed at full tilt - it was incredibly loud. A nice feature of this generation of systems was that they had 2 management network interfaces, so you could daisy-chain the ILOMs in a rack, saving a huge number of external switch ports.

We also tested the V250, a tower server. One of the things we had complained about was that you never got much choice in disk configurations - you either had too few (most of the Sun rack-mount range), or you had to go up to an E450 and buy a big metal box of compute power and mostly unused space. There was never a way to size a Sun box properly. We liked the V250 because it took a sensible number of disks, so for standalone databases it was great.

Monday, October 28, 2019

A brief history with Solaris

I first encountered Solaris (as in Solaris 2.x, as opposed to SunOS 4, which was retrospectively branded Solaris 1.x) when we got a SPARCclassic workstation. Initially, that hardware didn't support SunOS 4, which made the shiny workstations useless doorstops - nothing worked, and building stuff from source didn't work either.

Besides, Solaris 2.1 was utter garbage. It took decades to rid it of some of the more erratic design stupidities inherent in System V. (Cough. Printing. SAF.)

I just missed any serious association with Solaris 2.2, as the SS1000s I got to look after had been upgraded to 2.3 just before my arrival.

So, as a sysadmin, Solaris 2.3 was my first exposure to Solaris at scale. On the SS1000 you didn't have a choice, that was a completely new architecture that was never going to run SunOS 4, and we had several of them as the core of the service.

We built out NISplus. This had a bunch of, shall we call them, quirks, and the early releases were pretty grim. But once the more irritating bugs got fixed, it served as a solid workhorse for years. As a network nameservice it was years ahead of its time, with proper administrative tooling and permissions, and a hierarchical structure. It was orders of magnitude better than the older NIS, and far better than anything available today. The SS5 running as our NISplus master did so on Solaris 2.3 for far longer than it probably should have.

(We were also one of the few places to use X/Open federated naming, another game-changing, state-of-the-art technology that Sun introduced and that is now lost without trace.)

There's a common rule that odd releases are bad and even releases are good. That didn't work with early releases of Solaris - they were all bad. But Solaris 2.4 was getting better: more stable, more performant, generally a better feel.

As you might expect, there was a pattern, and we found Solaris 2.5 to be pretty dreadful. We installed it on a couple of systems, but it was so poor we gave up. And then Solaris 2.5.1 was pretty decent, so the alternation pattern was starting to become established.

For me, Solaris 2.6 was a watershed. It was atrocious. We weren't exclusively a Solaris shop - we had RS/6000s running AIX, a decent SGI presence, some Linux, odd bits of other Unices, and still had SunOS 4 (an old ELC, salvaged from the skip, running our multicast router to connect us to the mbone). But we were starting to like Solaris, as it was so much easier to manage than anything else out there, so I started to report the bugs I was hitting.

I was reporting bug after bug after bug. We had given feedback previously, of course, but at nothing like the scale we were doing here. And, unlike other vendors who slammed the door in our faces and told us to go away, the Sun engineers actually wanted the feedback and the bugs, and fixed things for us.

So when they were planning Solaris 2.7, they got me to test it before it was released, rather than letting all those bugs get out into the wild and have to deal with my irate bug reports afterwards.

This ended up with an odd anecdote. As a beta tester, I was sent the Solaris media just before the official release. And the CDs said "Solaris 7" on them. Sun didn't communicate the renumbering (dropping the leading 2.) very well internally, although clearly whoever pressed the CDs needed to know. So I was able to confirm to the rather sceptical Sun salesforce and reseller community in the UK that the renumbering wasn't a joke.

We tested Solaris 8 and all its updates, and then Solaris 9 and its updates. We found Solaris 8 to be a bit dull, to be honest, and shifted to Solaris 9. At this point we were tracking every release, and each one was better than the last. It was rather annoying that the industry seemed to settle on Solaris 8, as that meant some new hardware was only supported at launch on the old Solaris 8 rather than the current Solaris 9.

With Solaris 10, we got invited onto the Platinum Beta program. This basically meant that you ran the latest build, in production. As Sun Service hadn't even seen the release, any bugs or problems we had went straight back to Solaris engineering, and every customer in the program had a dedicated engineer to deal with.

I also got to go out to Menlo Park a couple of times, at the start and end of the program. We got the inside scoop on all the new features from the people who wrote them.

Also with the Platinum Beta, a select few of us got hold of ZFS. You know how you build a prototype and throw it away, and then do it properly? Well, the version we had was that prototype. And yes, it was thrown away and ZFS was rewritten pretty much from scratch. That was why ZFS wasn't in Solaris 10 at launch, by the way. And the version we tested was a bit different to the way that it ended up working - for example, initially the pool didn't have an associated top-level mountpoint, so pools and datasets were quite distinct. But the attitude of that ZFS testing was quite simple - they just sent us the zfs and zpool binaries, the kernel driver, and a 3-line crib sheet, and everything was supposed to be intuitive and obvious. If you couldn't work out how to do something, that was considered to be a bug.
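
For flavour, the sort of thing that crib sheet boiled down to - not the actual text, obviously, and this is the syntax as it ended up shipping rather than that early prototype - was roughly:

# illustrative pool and disk names, not the original crib sheet
zpool create tank c0t1d0
zfs create tank/home
zfs set compression=on tank/home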

Immediately after Solaris 10 (in fact, starting just before the release) we kicked off OpenSolaris, initially as a closed pilot - nobody really knew how it was going to work, or indeed whether some lawyer would find a speck of dust to jam up the works and prevent the whole thing going live. But OpenSolaris launched, and its descendants - yes, I'm talking about illumos - are still making a difference.

Monday, April 29, 2019

HA PostgreSQL on Tribblix with Patroni

When it comes to managing PostgreSQL replication, there are a number of options available.

I looked at stolon, but it's not the only game in town. In terms of a fully managed system, there's also patroni.

In terms of overall functionality, stolon and patroni are pretty similar. They both rely on etcd (or something similar) for storing state; they both take care of running the postgres server with the right options, and reconfiguring it as necessary; they'll both promote a replica to continue service if the master fails.

So, here's how to set up a HA PostgreSQL cluster using patroni.

Before starting on anything like this with Tribblix, it's always a good idea to

zap refresh

so that you're up to date in terms of packages and overlays.

First create 3 zones, just like before:

zap create-zone -z node1 -t whole \
  -o base -O patroni -x 192.168.0.231

zap create-zone -z node2 -t whole \
  -o base -O patroni -x 192.168.0.232

zap create-zone -z node3 -t whole \
  -o base -O patroni -x 192.168.0.233

Building the zones like this, with the patroni overlay, will ensure that all the required packages are installed in the zones so you don't need to mess around with installing packages later.

Then zlogin to each node and run the etcd commands as before, to create the user and start etcd.

Now create a user to run postgres on each node

zlogin node1 (and 2 and 3)
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

Now you need to create yaml files containing the configuration for each node. See http://petertribble.co.uk/Solaris/patroni/ for the sample files I've used here.
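
If you'd rather see the shape of them before downloading, a node's yaml looks roughly like this - the top-level keys are standard patroni configuration, but treat the specific ports, paths, and passwords as illustrative placeholders rather than what my sample files actually contain (a full config will normally also carry a bootstrap section with DCS and initdb settings):

# illustrative values only - see the sample files for the real configuration
scope: my-ha-cluster
name: node1
restapi:
  listen: 192.168.0.231:8008
  connect_address: 192.168.0.231:8008
etcd:
  host: 192.168.0.231:2379
postgresql:
  listen: 192.168.0.231:5432
  connect_address: 192.168.0.231:5432
  data_dir: /export/home/pguser/data/node1
  bin_dir: /opt/tribblix/postgres11/bin
  authentication:
    superuser:
      username: postgres
      password: my_secret_password
    replication:
      username: replicate
      password: my_secret_password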

Log in to each node in turn

pfexec zlogin -l pguser node1

wget http://petertribble.co.uk/Solaris/patroni/node1.yaml
/usr/versions/python-2.7/bin/patroni ${HOME}/node1.yaml

And it initializes a cluster, with just the one node for now, and that node will start off as the master.

Now the 2nd node

pfexec zlogin -l pguser node2

wget http://petertribble.co.uk/Solaris/patroni/node2.yaml
/usr/versions/python-2.7/bin/patroni ${HOME}/node2.yaml

And it sets it up as a secondary, replicating from node1.

What do things look like right now? You can check that with:

/usr/versions/python-2.7/bin/patronictl \
  -d etcd://192.168.0.231:2379 \
  list my-ha-cluster

Now the third node:

pfexec zlogin -l pguser node3

wget http://petertribble.co.uk/Solaris/patroni/node3.yaml
/usr/versions/python-2.7/bin/patroni ${HOME}/node3.yaml


You can force a failover by killing (or ^C-ing) the patroni process on the master, which should be node1. You'll see one of the replicas coming up as master, and replication on the other replica changing to use the new master. One thing I did notice is that patroni initiates the failover process pretty much instantly, whereas stolon waits a few seconds to be sure.

You can initiate a planned failover too:

/usr/versions/python-2.7/bin/patronictl \
  -d etcd://192.168.0.231:2379 \
  failover my-ha-cluster

It will ask you for the new master node, and for confirmation, and then you'll have a new master.

But you're not done yet - there's nothing to connect to. For that, patroni doesn't supply its own component (the way stolon does with its proxy) but depends on a haproxy instance. The overlay install we used when creating the zones will have made sure that haproxy is installed in each zone; all we have to do is configure and start it.

zlogin to each node, as root, and

wget http://petertribble.co.uk/Solaris/patroni/haproxy.cfg -O /etc/haproxy.cfg
svcadm enable haproxy

You don't have to set up the haproxy stats page, but it's a convenient way to see what's going on. If you go to the stats page

http://192.168.0.231:7000

then you can see that it's got the active backend up and the secondaries marked as down - haproxy is checking the patroni REST API, which only shows the active postgres instance as up, so haproxy will route all connections through to the master. And if you migrate the master, haproxy will follow it.
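
If you want to see it from the database end rather than the stats page, point psql at haproxy. The frontend port is whatever haproxy.cfg defines - I'm assuming 5000 here purely for illustration - and the superuser name and password are whatever your patroni yaml set up:

/opt/tribblix/postgres11/bin/psql -h 192.168.0.231 -p 5000 -U postgres -d postgres \
  -c 'SELECT inet_server_addr()'

The inet_server_addr() function reports the address the backend accepted the connection on, so you can watch it change after a failover.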

Which to choose? That's always a matter of opinion, and to be honest while there are a few differences, they're pretty much even.
  • stolon is in go, and comes as a tiny number of standalone binaries, which makes it easier to define how it's packaged up
  • patroni is in python, so needs python and a whole bunch of modules as dependencies, which makes deployment harder (which is why I created an overlay - there are 32 packages in there, over 2 dozen python modules)
  • stolon has its own proxy, rather than relying on a 3rd-party component like haproxy
As a distro maintainer, I find it doesn't make much difference - dealing with those differences and dependencies is part and parcel of daily life. For standalone use, I think I would probably tend towards stolon, simply because of the much smaller packaging effort.

(It's not that stolon necessarily has fewer dependencies, but remember that in go these are all resolved at build time rather than runtime.)

Thursday, April 25, 2019

HA PostgreSQL on Tribblix with stolon

I wrote about setting up postgres replication, and noted there that while it did what it said it did - ensured that your data was safely sent off to another system - it wasn't a complete HA solution, requiring additional steps to actually make any use of the hot standby.

What I'm going to describe here is one way to create a fully-automatic HA configuration, using stolon. There's a longer article about stolon, roughly explaining the motivations behind the project.

Stolon uses etcd (or similar) as a reliable, distributed configuration store. So this article follows on directly from setting up an etcd cluster - I'm going to use the same zones, the same names, the same IP addresses, so you will need to have got the etcd cluster running as described there first.

We start off by logging in to each zone using zlogin (with pfexec if you set your account up as the zone administrator when creating the zone):

pfexec zlogin node1 (and node2 and node3)

Followed by installing stolon and postgres on each node, and creating an account for them to use:

zap refresh
zap install TRIBblix-postgres11 TRIBblix-stolon TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

In all the following commands I'm assuming you have set your PATH correctly so it contains the postgres and stolon executables. Either add /opt/tribblix/postgres11/bin and /opt/tribblix/stolon/bin to the PATH, or prefix the commands with

env PATH=/opt/tribblix/postgres11/bin:/opt/tribblix/stolon/bin:$PATH

Log in to the first node as pguser.

pfexec zlogin -l pguser node1

Configure the cluster (do this just the once):

stolonctl --cluster-name stolon-cluster \
  --store-backend=etcdv3 init

This saves the cluster metadata to etcd, although at this point it's just a single key marking the stolon cluster as existing.

Now we need a sentinel.

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3

It complains that there are no keepers, so zlogin to node1 in another window and start one of those up too:

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres0 --data-dir data/postgres0 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.231'

After a little while, a postgres instance appears. Cool!

Note that you have to explicitly specify the listen address. That's also the address that other parts of the cluster use, so you can't use "localhost" or '*'; you have to use the actual address.


You also specify the postgres superuser password, and the account for replication and its password. Obviously these ought to be the same for all the nodes in the cluster, so they can all talk to each other successfully.

And now we can add a proxy, after another zlogin to node1:

stolon-proxy --cluster-name stolon-cluster \
  --store-backend=etcdv3 --port 25432

If now you point your client (such as psql) at port 25432 you can talk to the database through the proxy.
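
For example, from within the zone (as far as I recall the proxy listens on localhost unless told otherwise, the superuser name defaults to postgres, and the password is the --pg-su-password given to the keeper above):

/opt/tribblix/postgres11/bin/psql -h 127.0.0.1 -p 25432 -U postgres -d postgres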

Just having one node doesn't meet our desire to build a HA cluster, so let's add some more nodes.

Right, go to the second node,

pfexec zlogin -l pguser node2

and add a sentinel and keeper there:

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3


stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres1 --data-dir data/postgres1 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.232'

What you'll then see happening on the second node is that stolon will automatically set the new postgres instance up as a replica of the first one (it assumes the first one you run is the master).

Then set up the third node:

pfexec zlogin -l pguser node3

with another sentinel and keeper:

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres2 --data-dir data/postgres2 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.233'

You can also run a proxy on the second and third nodes (or on any other node you might wish to use, come to that). Stolon will configure the proxy for you so that it's always connecting to the master.

At this point you can play around, create a table, insert some data.
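
Something along these lines, run through the proxy, gives you data to watch during failover (again assuming the default postgres superuser):

/opt/tribblix/postgres11/bin/psql -h 127.0.0.1 -p 25432 -U postgres -d postgres \
  -c 'CREATE TABLE ha_test (id integer, note text)' \
  -c "INSERT INTO ha_test VALUES (1, 'hello from stolon')" \
  -c 'SELECT * FROM ha_test'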

And you can test failover. This is the real meat of the problem.

Kill the master (^C its keeper). It takes a while, because it wants to be sure there's actually a problem before taking action, but what you'll see is one of the slaves being promoted to master. And if you run psql against the proxies, they'll send your queries off to the new master. Everything works as it should.

Even better, if you restart the old failed master (as in, restart its keeper), then it successfully sets the old master up as a slave. No split-brain, you get your redundancy back.

I tried this a few more times, killing the new master (aka the original slave), and it failed over again each time.

I'm actually mighty impressed with stolon.

Setting up an etcd cluster on Tribblix

Using etcd to store configuration data is a common pattern, so how might you set up an etcd cluster on Tribblix?

I'll start by creating 3 zones to build a 3-node cluster. For testing, these could all be on the same physical system; for production you would obviously want them on separate machines.

As root:

zap refresh

zap create-zone -z node1 -t whole -o base -x 192.168.0.231

zap create-zone -z node2 -t whole -o base -x 192.168.0.232

zap create-zone -z node3 -t whole -o base -x 192.168.0.233


If you add the -U flag with your own username then you'll be able to use zlogin via pfexec from your own account, rather than always running it as root (in other words, subsequent invocations of zlogin could be pfexec zlogin.)

Then zlogin to node1 (and node2 and node3) to install etcd, and create
a user to run the service.

zlogin node1

zap install TRIBblix-etcd
useradd -u 11798 -g staff -s /bin/bash -d /export/home/etcd etcd
passwd -N etcd
mkdir -p /export/home/etcd
chown -hR etcd /export/home/etcd

I'm going to use static initialization to create the cluster. See the
clustering documentation.

You need to give each node a name (I'm going to use the zone name) and the cluster a name; here I'll use pg-cluster-1, as I'm going to use it for some PostgreSQL clustering tests. Then you need to specify the URLs that will be used by this node, and the list of URLs used by the cluster as a whole - which means all 3 machines. For this testing I'm going to use unencrypted connections between the nodes; in practice you would want to run everything over ssl.

zlogin -l etcd node1

/opt/tribblix/etcd/bin/etcd \
  --name node1 \
  --initial-advertise-peer-urls http://192.168.0.231:2380 \
  --listen-peer-urls http://192.168.0.231:2380 \
  --listen-client-urls http://192.168.0.231:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.231:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new

The same again for node2, with the same cluster list, but its own
URLs.

zlogin -l etcd node2

/opt/tribblix/etcd/bin/etcd \
  --name node2 \
  --initial-advertise-peer-urls http://192.168.0.232:2380 \
  --listen-peer-urls http://192.168.0.232:2380 \
  --listen-client-urls http://192.168.0.232:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.232:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new

And for node3:

zlogin -l etcd node3

/opt/tribblix/etcd/bin/etcd \
  --name node3 \
  --initial-advertise-peer-urls http://192.168.0.233:2380 \
  --listen-peer-urls http://192.168.0.233:2380 \
  --listen-client-urls http://192.168.0.233:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.233:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new

OK, that gives you a 3-node cluster. Initially you'll see complaints about being unable to connect to the other nodes, but it will settle down once they've all started.
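
As a quick sanity check that all three members have joined, ask etcdctl (assuming it's installed alongside etcd in /opt/tribblix/etcd/bin, and that it's a version speaking the v3 API by default - otherwise set ETCDCTL_API=3 in the environment):

/opt/tribblix/etcd/bin/etcdctl \
  --endpoints=http://192.168.0.231:2379,http://192.168.0.232:2379,http://192.168.0.233:2379 \
  member list

/opt/tribblix/etcd/bin/etcdctl \
  --endpoints=http://192.168.0.231:2379 endpoint health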

And that's basically it. I think in an ideal world this would be an SMF service, with svccfg properties defining the cluster. Something I ought to implement for Tribblix at some point.

One useful tip while we're discussing etcd: how do you see what's been stored in it? Obviously if you know what keys are in use you can just look them up, but if you just want to poke around you don't know what to look for. And etcdctl ls, which is how we used to do it, has been removed. So to simply list all the keys:

etcdctl get "" --prefix --keys-only

There you have it.


Tuesday, April 23, 2019

Setting up replicated PostgreSQL on Tribblix

When you're building systems, it's nice to build in some level of resilience. After all, failures will happen.

So, we use PostgreSQL quite a bit. We actually use a fairly traditional replication setup - the whole of the data is pushed using zfs send and receive to a second system. Problem at the source? We just turn on the DR site, and we're done.

One of the reasons for that fairly traditional approach is that PostgreSQL has, historically, not had much built-in support for replication. Or, at least, not in a simple and straightforward manner. But it's getting a lot better.

Many of the guides you'll find are rather dated, and show old, rather clunky, and quite laborious ways to set up replication. With current versions of PostgreSQL it's actually pretty trivial to get streaming replication running, so here's how to demo it if you're using Tribblix.

First set up a couple of zones. The idea is that pg1 is the master, pg2 the replica. Run, as root:

zap create-zone -z pg1 -t whole -o base -x 192.168.0.221 -U ptribble

zap create-zone -z pg2 -t whole -o base -x 192.168.0.222 -U ptribble

This simply creates a fairly basic zone, without much in the way of extraneous software installed. Adjust the IP addresses to suit, of course. And I've set them up so that I can use zlogin from my own account.

Then login to each zone, install postgres, and create a user.

zlogin pg1
zap install TRIBblix-postgres11 TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir /export/home/pguser
chown -hR pguser /export/home/pguser

And the same for pg2.

Then log in to the master as pguser.

zlogin -l pguser pg1

Now initialise a database, and start it up:




env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/initdb -E UTF8 -D ~/db
env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db


The next thing to do is create a PostgreSQL user that will run the streaming replication.

/opt/tribblix/postgres11/bin/psql -d postgres
CREATE ROLE replicate WITH REPLICATION LOGIN ;
set password_encryption = 'scram-sha-256';
SET
\password replicate
Enter new password: my_secret_password

Then you need to edit postgresql.conf (in the db directory) with the following settings:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 3 # or whatever
wal_keep_segments = 64 # or whatever
hot_standby = on


And set up authentication so that the user you just created can actually access the database remotely, by adding the following line to pg_hba.conf

host   replication   replicate     192.168.0.0/24    scram-sha-256

Ideally we would use hostssl so that the connection is encrypted, but that's out of scope for this example.

Then restart the master.
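
If you started postgres in the foreground as above, that just means ^C and starting it again; alternatively, pg_ctl will do the restart (adjust the data directory if yours differs):

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/pg_ctl restart -D ~/db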

Now log in to the slave.

zlogin -l pguser pg2

And all you have to do to replicate the data is run pg_basebackup:

/opt/tribblix/postgres11/bin/pg_basebackup \
  -h 192.168.0.221 -c fast -D ~/db -R -P \
  -U replicate --wal-method=stream

It will prompt you for the super secret password you entered earlier. Once that's completed you can start the slave:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

And that's it. Data you insert on the master will be written to the slave. In this mode, you can connect to the replica and issue queries to read the data, but you can't change anything.
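
If you want to convince yourself it really is streaming, ask each side - these are standard PostgreSQL views and functions, run locally in each zone as pguser. On pg1 (the master):

/opt/tribblix/postgres11/bin/psql -d postgres \
  -c 'SELECT client_addr, state FROM pg_stat_replication'

and on pg2 (the replica):

/opt/tribblix/postgres11/bin/psql -d postgres \
  -c 'SELECT pg_is_in_recovery()'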

Note that this doesn't do anything like manage failover or send client connections to the right place. It just makes sure the data is available. You need an extra layer of management to manage the master and slave(s) to actually get HA.

There are a whole variety of ways to do that, but the simplest way to test it is to stop the database on the master, which will make the slave unhappy. But if you know the master isn't coming back, you can promote the replica:

/opt/tribblix/postgres11/bin/pg_ctl promote -D ~/db

and it will spring into action, and you can point all your clients at it. The old master is now useless to you; the only course of action at this point is to wipe it, and then create a new replica based on the new master.

Tuesday, April 09, 2019

A teeny bug in jkstat char handling

While messing about with illuminate, I noticed an interesting oddity in the disk display: on the end of the product string was the word "Revision". That shouldn't be there, and iostat -En doesn't show it. The display comes from my JKstat code, so where have I gone wrong?

This comes from the sderr kstat, which is a named kstat of the device_error class.

A named kstat is just a map of keys and values. The key is a string, the value is a union so you need to know what type the data is in order to be able to interpret the bits (and the data type of each entry is stored in the kstat, so that's fine).

For that Product field, it's initialized like this:

kstat_named_init(&stp->sd_pid, "Product", KSTAT_DATA_CHAR);

OK, so it's of type KSTAT_DATA_CHAR. The relevant entry in the union here is value.c, which is actually defined as a char c[16] - the field is 16 characters in size, long enough to hold up to 128-bit ints - most of the numerical data doesn't take that much space.

(For longer, arbitrary length strings, you can stick a pointer to the string in that union instead.)

Back to iostat data. For a SCSI device (something using the sd driver), the device properties are set up in the sd_set_errstats() function in the sd driver. This does a SCSI enquiry, and then copies the Product ID straight out of the right part of the SCSI enquiry string:

strncpy(stp->sd_pid.value.c, un->un_sd->sd_inq->inq_pid, 16);

(If you're interested, you can see the structure of the SCSI enquiry string in the /usr/include/sys/scsi/generic/inquiry.h header file. The inq_pid comes from bytes 16-31, and is 16 bytes long.)

You can see the problem. The strncpy() just copies 16 bytes into a character array that's 16 bytes long. It fits nicely, but there's a snag - because it fits exactly, there's no trailing null!

The problem with JKstat here is that it is (or was, anyway) using NewStringUTF() to convert the C string into a java String, and that doesn't have any concept of length associated with it. So it starts from the pointer to the beginning of the c[] array, and keeps going until it finds a null to terminate the string.

And if you look at the sd driver, the Revision entry comes straight after the Product entry in memory, so what JNI is doing here is reading past the end of the Product value, carrying on until it finds the null at the end of the next name, "Revision", and taking the whole lot. It is, I suppose, fortunate that there is something vaguely sensible for it to find.

There doesn't appear to be a way of doing the right thing in JNI itself; the fix has to be to copy the correct amount of the value into a temporary string that does have a trailing null added.
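
In the JNI glue that fix looks something like this - a sketch rather than the actual JKstat code, where knp stands for whatever kstat_named_t pointer you're holding:

char buf[17];                      /* value.c is char[16], plus room for a trailing null */
strncpy(buf, knp->value.c, 16);    /* copy at most the 16 bytes that are actually there */
buf[16] = '\0';                    /* guarantee termination even if all 16 bytes are used */
jstring js = (*env)->NewStringUTF(env, buf);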

(And all the system tools written in C are fine, because they do have a way to just limit the read to 16 characters.)

Monday, April 01, 2019

Notes on web servers and client certificates

With https, web servers have digital certificates to encrypt and authenticate traffic.

Web servers can also require clients to present a valid certificate, which could be used for authentication and identity.

I've recently had the misfortune to end up delving into this, so here are some notes on diagnosing and testing this from a client perspective. One of the problems here is that this all takes place before anything http-related happens, so normal diagnostic techniques are useless - and there won't be anything logged on either the server or client side to work from.

So, rather than trying to work out the exact commands next time, this is an aide-memoire. And hopefully might be useful to others too.

The first thing is to work out whether a server expects a client certificate or not. (If you can get in without, then obviously it's not requiring one, but there are other ways connections can fail.)

Fortunately, openssl can initiate the connection, allowing you to see exactly what's going on:

openssl s_client -showcerts \
  -servername example.com \
  -connect example.com:443

Note that you'll usually need the -servername flag here, to stick the SNI name on the request; otherwise the server or load balancer or proxy won't know what to do with it.

In addition to printing out all the server certificates and some other diagnostics, this will tell you what, if any, client certificates are required. If none are expected, there will be a section that looks like:

---
No client certificate CA names sent
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---

If the server wants a client certificate, then it will tell you what certificates it wants:

---
Acceptable client certificate CA names
/CN=My Client CA
Client Certificate Types: RSA sign, DSA sign, ECDSA sign
Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---

The important thing here is the "/CN=My Client CA" - the server is telling you that it wants a certificate signed by that Certificate Authority.

Assuming you have such a certificate, how do you send it? Browsers have ways to import it, but it's often easier to diagnose it from the CLI, using curl or wget. You'll need both the certificate and the key, and the syntax is:

wget --certificate=mycert.crt --private-key=mycert.key \
  https://example.com/

or

curl --cert mycert.crt --key mycert.key \
  https://example.com/

Assuming the certificate you have is signed by the relevant CA, this will allow you to retrieve the page. If there's a problem, you might get some meaningful diagnostics.
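
It's also worth checking, before blaming the server, that the certificate you're presenting really was issued by the CA the server asked for. Assuming you have the CA certificate to hand as client-ca.crt (that filename is just for illustration):

openssl x509 -in mycert.crt -noout -subject -issuer
openssl verify -CAfile client-ca.crt mycert.crt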

The other thing is how to load those into a browser. Normally you have something like a .p12 file with both parts bundled. You can create one of those like so:

openssl pkcs12 -export -out mycert.p12 \
  -in mycert.crt -inkey mycert.key

You'll get asked for a passphrase to protect the .p12 file - it contains the key, so needs to be protected in transit.

You can also extract the key and certificate from a .p12 file:

openssl pkcs12 -in mycert.p12 -nokeys -out mycert.crt
openssl pkcs12 -in mycert.p12 -nocerts -out mycert.key
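
That last command will, by default, ask for a passphrase to encrypt the extracted key; if you just want it unencrypted for testing with curl or wget as above, add -nodes:

openssl pkcs12 -in mycert.p12 -nocerts -nodes -out mycert.key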