Tuesday, January 28, 2020

Some hardware just won't die

One of the machines I use to build SPARC packages for Tribblix is an old SunBlade 2000.

It's 18 years old, and is still going strong. Sun built machines like tanks, and this has outlasted the W2100z (Metropolis) that replaced it, and the Ultra 20 M2 that replaced that, and the Dell that replaced that.

It's had an interesting history. I used to work for the MRC, and our department used Sun desktops because they were the best value. Next to no maintenance costs, just worked, never failed, and were compatible with the server fleet. And having good machines more than paid back the extra upfront investment. (People are really expensive - giving them better equipment is very quickly rewarded through extra productivity.)

That compatibility gave us a couple of advantages. One was that we could simply chuck the desktops into the compute farm when they weren't being used, to give us extra power. The other was that when we turned up one morning to find 80% of our servers missing, we simply rounded up a bunch of desktops and promoted them, restoring full service within a couple of hours.

When the department was shut down all the computers were dispersed to other places. Anything over 3 years old had depreciated as an asset, and those were just given away. The SB2000 wasn't quite that old, but a research group went off to another university taking a full rack of gear and some of the desktops, found they weren't given anything like as much space as they expected, and asked me to keep the SB2000 in exchange for helping out with advice if they had a problem.

The snag with a SunBlade 2000 is that it's both huge and heavy. The domestic authorities weren't terribly enthusiastic when I came home with it and a pair of massive monitors.

The SB2000 ended up following me to my next job, where it was used for patch testing and then as a graphical station in the 2nd datacenter.

And it followed me to the job after, too. They gave me an entry-level SunBlade 1500; I brought in the SB2000 and its two 22-inch Sony CRTs.

After a while, we upgraded to Ultra 20 M2 workstations, which released the monster, initially again as a patch test box.

At around this time we were replacing production storage, which was a load of Sun fibre arrays hooked up to V880s, with either SAS raid arrays connected to X4200M2s, or thumpers for bulk image data. Which meant we had a number of random fibre arrays kicking around doing nothing.

And then someone shows up with an urgent project. They need to store and serve a pile of extra data, and it had been forgotten when the budget was put together. Could we help them out?

Half an hour later I had found some QLogic cards in the storeroom, borrowed some fibre cables from networking, shoved the cards in the free slots in the SB2000, hooked up the arrays, told ZFS to sort everything out, and we had a couple of terabytes of storage ready for the project.

It was actually a huge success and worked really well. Some time later a VP from the States was visiting and saw this Heath Robinson contraption we had put together. They were a little bit shocked (to put it mildly) to discover that a mission-critical customer-facing project worth millions of dollars was held together by a pile of refurbished rubbish and a member of staff's cast-offs; shortly thereafter the proper funding for the project magically appeared.

After changing jobs again it came home. By this time the kids were off to University and I had a decent amount of space for a home office, so it could stay. It immediately had Tribblix loaded on it and has been used once or twice a week ever since. And it might be a little slower than a modern machine, but it's definitely viable, and it's still showing no signs of shuffling off this mortal coil.

Saturday, November 02, 2019

Testing hardware

My relationship with Sun wasn't just about testing Solaris. One of the other things we were involved with was beta testing new hardware.

Doing beta testing of software contributed to us being offered the chance to test new hardware, of course. Sun could be reasonably sure they would get lots of honest feedback from us.

And also, we had given a huge amount of feedback via the sales organization. Usually of the form "we don't want systems like that, why can't you make something like this instead?".

Some examples: At one point we ended up buying Dell 6350s (running Solaris on Intel of course). They were a lot cheaper, for one thing. Paradoxically, we often had more memory in those 32-bit Intel machines than in the 64-bit SPARC servers, often down to cost but sometimes due to limitations in configuration. But a lot of it was that the 4U rack-mount Dell would fit in the datacenter, whereas we wouldn't have been able to fit a pile of E450s in, even if we could have afforded them.

Another example was that we bought a load of Ultra-5 workstations, some cheap shelving, and a KVM switch. Tied together with Grid Engine and its distributed make, they could get through some parallel tasks for a lot less than any server on the price list at the time. We tried (and failed) to persuade Sun to go even further with a cut-down system - we didn't need a CD drive or anything like that, it was just waste.

We also kept complaining about little things like chassis design, cabling, configuration, access and repair.

Eventually, we were asked to trial some new products.

One of the most interesting, and long running, was the B1600 blade system. As in blade servers, not Blade workstations. Aka Stiletto.

This had a variety of blades - simple cheap single-processor SPARCs (B100s - like a V120 or flapjack, but smaller), a single-processor AMD (B100x), and a twin-processor Xeon (B200x). There was also a plan for a variety of appliance blades, though the only one I think came out was a load-balancing device, and we didn't test those. If you think about the design, the whole thing full of SPARC blades was not much bigger than an Ultra-5, but far more powerful and easier to wire.

I had an unfortunate accident while on holiday just before we started the testing, where I broke my arm. I wasn't able to go to the "training" session and learn about the system, but then I'm a great believer that systems should actually be obvious to use. As the only UK customer though, we had one of the engineering team come and make a video of us unpacking, racking, and configuring the system.

So there was me just supervising, Andy with the video camera, and Geoff and Terry doing the lifting. All the way through from opening the box to having things ready to roll. So the project did cover things like the way it was packed in the carton, how you were able to lift it out safely, and how useful the bits of paper that came with it were.

We vastly preferred Sun rack cabinets - they're just so much stronger and more stable. But we decided that we would try putting the B1600 chassis into a third-party rack, as an additional test. This was a nightmare! The B1600 chassis didn't use traditional rackmount rails; it had some very thin wheels that slotted into side-rails, the tolerances were tiny, and you had to line the thing up to within a millimetre - and if you think about getting anything with cage nuts down to under a millimetre by eye, it was always going to be difficult. It took us the best part of an hour (part of that was us giving a running commentary, to be fair) and multiple attempts, including some where we thought it was nicely aligned but actually wasn't, so it could have fallen.

The video was widely shared inside Sun, as I understand it, and the fix was to supply a simple metal measuring bar that you could offer up to the rack to ensure everything was square and at the correct spacing. We ended up trying several designs as they optimised it. If you've ever wondered how those spacer bars originated, now you know!

Of course, we tried sticking it in a proper Sun rack, and it was racked perfectly in 5 seconds flat, before Andy could even get over there with the camera.

Sun didn't really help themselves at times. When we first got the x86 blades, we couldn't run Solaris on them and had to run Linux for a bit. Things like drivers and management interfaces took a while to be completed.

Another project we did was the original V40z, the first-generation 64-bit Opterons (this was just the Newisys Opteron reference design with a different bit of plastic tacked on the front). This was less about the actual hardware than about bringing in a 64-bit operating system and the overall ecosystem associated with it. The first thing we noticed about it was that as soon as you apply power, the fans scream at full tilt - it was incredibly loud. A nice feature of this generation of systems was that they had 2 management network interfaces, so you could daisy-chain the ILOMs in a rack, saving a huge number of external switch ports.

We also tested the V250, a tower server. One of the things we had complained about was that you never got a lot of choice in disk configurations - you either had too few (most of the Sun rack-mount range), or you needed the compute power of an E450 and had to buy a big metal box with mostly unused space. There was never a way to size a Sun box properly. We liked the V250 because it took a sensible number of disks, so for standalone databases it was great.

Monday, October 28, 2019

A brief history with Solaris

I first encountered Solaris (as in Solaris 2.x, as opposed to SunOS 4, which was retrospectively branded as Solaris 1.x) when we got a SPARCclassic workstation. That hardware didn't initially support SunOS 4, which made the shiny workstations useless doorstops, as nothing worked, and building stuff from source didn't work either.

Besides, Solaris 2.1 was utter garbage. It took decades to rid it of some of the more erratic design stupidities inherent in System V. (Cough. Printing. SAF.)

I just missed any serious association with Solaris 2.2, as the SS1000s I got to look after had been upgraded to 2.3 just before my arrival.

So, as a sysadmin, Solaris 2.3 was my first exposure to Solaris at scale. On the SS1000 you didn't have a choice, that was a completely new architecture that was never going to run SunOS 4, and we had several of them as the core of the service.

We built out NISplus. This had a bunch of, shall we call them quirks, and the early releases were pretty grim. But once the more irritating bugs got fixed, it served as a solid workhorse for years. As a network nameservice it was years ahead of its time - having proper administrative tooling and permissions, and a hierarchical structure. It was orders of magnitude better than the older NIS, and far better than anything available today. The SS5 running as our NISplus master did so running Solaris 2.3 for far longer than it probably should have.

(We were also one of the few places to use X/Open federated naming, another game-changing, state-of-the-art technology that Sun introduced and that is now lost without trace.)

There's a common rule that odd releases are bad, even releases are good. That didn't work with early releases of Solaris, they were all bad. But Solaris 2.4 was getting to be better - more stable, more performant, generally a better feel.

As you might expect, there was a pattern, and we found Solaris 2.5 to be pretty dreadful. We installed it on a couple of systems, but it was so poor we gave up. And then Solaris 2.5.1 was pretty decent, so the alternation pattern was starting to become established.

For me, Solaris 2.6 was a watershed. It was atrocious. We weren't exclusively a Solaris shop - we had RS/6000s running AIX, a decent SGI presence, some Linux, odd bits of other Unices, and still had SunOS 4 (an old ELC salvaged from the skip, running our multicast router to connect us to the mbone). But we were starting to like Solaris, as it was so much easier to manage than anything else out there, so I started to report the bugs I was hitting.

I was reporting bug after bug after bug. We had given feedback previously, of course, but at nothing like the scale we were doing here. And, unlike other vendors who slammed the door in our faces and told us to go away, the Sun engineers actually wanted the feedback and the bugs, and fixed things for us.

So when they were planning Solaris 2.7, they got me to test it before it was released, rather than letting all those bugs get out into the wild and have to deal with my irate bug reports afterwards.

This ended up with an odd anecdote. As a beta tester, I was sent the Solaris media just before the official release. And the CDs said "Solaris 7" on them. Sun didn't communicate the renumbering (dropping the leading 2.) very well internally, although clearly whoever pressed the CDs needed to know. So I was able to confirm to the rather sceptical Sun salesforce and reseller community in the UK that the renumbering wasn't a joke.

We tested Solaris 8, and all the updates, and then Solaris 9 and its updates. We found Solaris 8 to be a bit dull, to be honest, and shifted to Solaris 9. At this point we were tracking every release, and it was always better. It was rather annoying that the industry seemed to settle on Solaris 8, as that meant that some new hardware was only supported at launch on the old Solaris 8 rather than the current Solaris 9.

With Solaris 10, we got invited onto the Platinum Beta program. This basically means that you run the latest build, in production. As Sun Service hadn't even seen the release, any bugs or problems we had went straight back to Solaris engineering, and every customer in the program had a dedicated engineer we would deal with.

I also got to go out to Menlo Park a couple of times, at the start and end of the program. We got the inside scoop on all the new features from the people who wrote them.

Also with the Platinum Beta, a select few of us got hold of ZFS. You know how you build a prototype and throw it away, and then do it properly? Well, the version we had was that prototype. And yes, it was thrown away and ZFS was rewritten pretty much from scratch. That was why ZFS wasn't in Solaris 10 at launch, by the way. And the version we tested was a bit different to the way that it ended up working - for example, initially the pool didn't have an associated top-level mountpoint, so that pools and datasets were quite distinct. But the attitude of that ZFS testing was quite simple - they just sent us the zfs and zpool binaries, the kernel driver, and a 3-line crib sheet, and everything was supposed to be intuitive and obvious. If you couldn't work out how to do something, that was considered to be a bug.

Immediately after Solaris 10 (in fact, starting just before the release) we kicked off OpenSolaris, initially as a closed pilot - nobody really knew how it was going to work, or indeed if some lawyer would find a speck of dust to jam up the works and prevent the whole thing going live. But OpenSolaris launched and its descendants, yes I'm talking illumos, are still making a difference.

Monday, April 29, 2019

HA PostgreSQL on Tribblix with Patroni

When it comes to managing PostgreSQL replication, there are a number of options available.

Updated: If you're running Tribblix m22 or later, you'll need python-3.7 as shown below. On older releases, use python-2.7 in the commands below.

I looked at stolon, but it's not the only game in town. In terms of a fully managed system, there's also patroni.

In terms of overall functionality, stolon and patroni are pretty similar. They both rely on etcd (or something similar) for storing state; they both take care of running the postgres server with the right options, and reconfiguring it as necessary; they'll both promote a replica to continue service if the master fails.

So, here's how to set up a HA PostgreSQL cluster using patroni.

Before starting on anything like this with Tribblix, it's always a good idea to

zap refresh

so that you're up to date in terms of packages and overlays.

First create 3 zones, just like before:

zap create-zone -z node1 -t whole \
  -o base -O patroni -x 192.168.0.231

zap create-zone -z node2 -t whole \
  -o base -O patroni -x 192.168.0.232

zap create-zone -z node3 -t whole \
  -o base -O patroni -x 192.168.0.233

Building the zones like this, with the patroni overlay, will ensure that all the required packages are installed in the zones so you don't need to mess around with installing packages later.

Then zlogin to each node and run the etcd commands as before, to create the user and start etcd.

Now create a user to run postgres on each node

zlogin node1 (and 2 and 3)
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

Now you need to create yaml files containing the configuration for each node. See http://petertribble.co.uk/Solaris/patroni/ for the sample files I've used here.
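
If you just want a feel for the shape of these files before fetching them, here's a rough sketch of what node1.yaml looks like - the scope, IP address, and etcd port match this example, but the REST API port, data directory, bin directory, and passwords below are illustrative rather than being the values from my sample files:

# illustrative sketch only - use the real sample files from the URL above
scope: my-ha-cluster            # the cluster name used with patronictl below
name: node1                     # node2.yaml and node3.yaml differ in name and addresses

restapi:
  listen: 192.168.0.231:8008
  connect_address: 192.168.0.231:8008

etcd:
  host: 192.168.0.231:2379      # the local etcd member started earlier

postgresql:
  listen: 192.168.0.231:5432
  connect_address: 192.168.0.231:5432
  data_dir: /export/home/pguser/data/node1
  bin_dir: /opt/tribblix/postgres11/bin
  authentication:
    superuser:
      username: postgres
      password: supersecret
    replication:
      username: replicate
      password: replsecret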

Log in to each node in turn

pfexec zlogin -l pguser node1

wget http://petertribble.co.uk/Solaris/patroni/node1.yaml
/usr/versions/python-3.7/bin/patroni ${HOME}/node1.yaml

And it initializes a cluster, with just the one node so far, and that node will start off as the master.

Now the 2nd node

pfexec zlogin -l pguser node2

wget http://petertribble.co.uk/Solaris/patroni/node2.yaml
/usr/versions/python-3.7/bin/patroni ${HOME}/node2.yaml

And it sets it up as a secondary, replicating from node1.

What do things look like right now? You can check that with:

/usr/versions/python-3.7/bin/patronictl \
  -d etcd://192.168.0.231:2379 \
  list my-ha-cluster

Now the third node:

pfexec zlogin -l pguser node3

wget http://petertribble.co.uk/Solaris/patroni/node3.yaml
/usr/versions/python-3.7/bin/patroni ${HOME}/node3.yaml


You can force a failover by killing (or ^C) the patroni process on the master, which should be node1. You'll see one of the replicas coming up as master, and replication on the other replica change to use the new master. One thing I did notice is that patroni initiates the failover process pretty much instantly, whereas stolon waits a few seconds to be sure.

You can initiate a planned failover too:

/usr/versions/python-3.7/bin/patronictl \
  -d etcd://192.168.0.231:2379 \
  failover my-ha-cluster

It will ask you for the new master node, and for confirmation, and then you'll have a new master.

But you're not done yet. There's nothing to connect to. For that, patroni doesn't supply its own component (like stolon does with its proxy) but depends on a haproxy instance. The overlay install we used when creating the zone will have made sure that haproxy is installed in each zone, so all we have to do is configure and start it.

zlogin to each node, as root, and

wget http://petertribble.co.uk/Solaris/patroni/haproxy.cfg -O /etc/haproxy.cfg
svcadm enable haproxy

You don't have to set up the haproxy stats page, but it's a convenient way to see what's going on. If you go to the stats page

http://192.168.0.231:7000

you can see that it's got the active backend up and the secondaries marked as down - haproxy is checking the patroni REST API, which only shows the active postgres instance as up, so haproxy will route all connections through to the master. And, if you migrate the master, haproxy will follow it.
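
You can also poke that REST API directly, if you have curl to hand. Assuming the REST API is on patroni's default port of 8008 (it's whatever the restapi section of your yaml files says), the /master endpoint returns 200 on the current master and a failure code on the replicas:

curl -s -o /dev/null -w '%{http_code}\n' http://192.168.0.231:8008/master
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.0.232:8008/master
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.0.233:8008/master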

Which to choose? That's always a matter of opinion, and to be honest while there are a few differences, they're pretty much even.
  • stolon is in go, and comes as a tiny number of standalone binaries, which makes it easier to define how it's packaged up
  • patroni is in python, so needs python and a whole bunch of modules as dependencies, which makes deployment harder (which is why I created an overlay - there are 32 packages in there, over 2 dozen python modules)
  • stolon has its own proxy, rather than relying on a 3rd-party component like haproxy
As a distro maintainer, it doesn't make much difference - dealing with those differences and dependencies is part and parcel of daily life. For standalone use, I think I would probably tend towards stolon, simply because of the much smaller packaging effort.

(It's not that stolon necessarily has fewer dependencies, but remember that in go these are all resolved at build time rather than runtime.)

Thursday, April 25, 2019

HA PostgreSQL on Tribblix with stolon

I wrote about setting up postgres replication, and noted there that while it did what it said it did - ensured that your data was safely sent off to another system - it wasn't a complete HA solution, requiring additional steps to actually make any use of the hot standby.

What I'm going to describe here is one way to create a fully-automatic HA configuration, using stolon. There's a longer article about stolon, roughly explaining the motivations behind the project.

Stolon uses etcd (or similar) as a reliable, distributed configuration store. So this article follows on directly from setting up an etcd cluster - I'm going to use the same zones, the same names, the same IP addresses, so you will need to have got the etcd cluster running as described there first.

We start off by logging in to each zone using zlogin (with pfexec if you set your account up as the zone administrator when creating the zone):

pfexec zlogin node1 (and node2 and node3)

Followed by installing stolon and postgres on each node, and creating an account for them to use:

zap refresh
zap install TRIBblix-postgres11 TRIBblix-stolon TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

In all the following commands I'm assuming you have set your PATH correctly so it contains the postgres and stolon executables. Either add /opt/tribblix/postgres11/bin and /opt/tribblix/stolon/bin to the PATH, or prefix the commands with

env PATH=/opt/tribblix/postgres11/bin:/opt/tribblix/stolon/bin:$PATH

Log in to the first node as pguser.

pfexec zlogin -l pguser node1

Configure the cluster (do this just the once):

stolonctl --cluster-name stolon-cluster \
  --store-backend=etcdv3 init

It's saving the metadata to etcd, although at this point it's just a single key to mark the stolon cluster as existing.
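
If you're curious, you can see what it wrote by listing the keys in etcd - the same trick as at the end of the etcd article (the path assumes the Tribblix package layout; with older etcd clients you may need ETCDCTL_API=3 in the environment):

/opt/tribblix/etcd/bin/etcdctl get "" --prefix --keys-only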

Now we need a sentinel.

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3

It complains that there are no keepers, so zlogin to node1 in another window and start one of those up too:

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres0 --data-dir data/postgres0 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.231'

After a little while, a postgres instance appears. Cool!

Note that you have to explicitly specify the listen address. That's also the address that other parts of the cluster use, so you can't use "localhost" or '*', you have to use the actual address.


You also specify the postgres superuser password, and the account for replication and its password. Obviously these ought to be the same for all the nodes in the cluster, so they can all talk to each other successfully.

And now we can add a proxy, after another zlogin to node1:

stolon-proxy --cluster-name stolon-cluster \
  --store-backend=etcdv3 --port 25432

If now you point your client (such as psql) at port 25432 you can talk to the database through the proxy.
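
For example, something like this - the superuser account defaults to postgres, with the --pg-su-password we gave the keeper above:

/opt/tribblix/postgres11/bin/psql -h 192.168.0.231 -p 25432 -U postgres postgres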

Just having one node doesn't meet our desire to build a HA cluster, so let's add some more nodes.

Right, go to the second node,

pfexec zlogin -l pguser node2

and add a sentinel and keeper there:

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3


stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres1 --data-dir data/postgres1 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.232'

What you'll then see happening on the second node is that stolon will automatically set the new postgres instance up as a replica of the first one (it assumes the first one you run is the master).

Then set up the third node:

pfexec zlogin -l pguser node3

with another sentinel and keeper:

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres2 --data-dir data/postgres2 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.233'

You can also run a proxy on the second and third nodes (or on any other node you might wish to use, come to that). Stolon will configure the proxy for you so that it's always connecting to the master.

At this point you can play around, create a table, insert some data.
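
Something simple through the proxy is enough to prove the point (the table here is just a throwaway example):

/opt/tribblix/postgres11/bin/psql -h 192.168.0.231 -p 25432 -U postgres postgres

CREATE TABLE ha_test (id serial PRIMARY KEY, note text);
INSERT INTO ha_test (note) VALUES ('hello from stolon');
SELECT * FROM ha_test;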

And you can test failover. This is the real meat of the problem.

Kill the master (^C its keeper). It takes a while, because it wants to be sure there's actually a problem before taking action, but what you'll see is one of the slaves being promoted to master. And if you run psql against the proxies, they'll send your queries off to the new master. Everything works as it should.

Even better, if you restart the old failed master (as in, restart its keeper), then it successfully sets the old master up as a slave. No split-brain, you get your redundancy back.

I tried this a few more times, killing the new master (aka the original slave), and it failed across again.

I'm actually mighty impressed with stolon.

Setting up an etcd cluster on Tribblix

Using etcd to store configuration data is a common pattern, so how might you set up an etcd cluster on Tribblix?

Updated: With current etcd, you may need to add the --enable-v2=true flag, as shown below. For example, Patroni requires v2.

I'll start by creating 3 zones to build a 3-node cluster. For testing these could all be on the same physical system; for production you would obviously want them on separate machines.

As root:

zap refresh

zap create-zone -z node1 -t whole -o base -x 192.168.0.231

zap create-zone -z node2 -t whole -o base -x 192.168.0.232

zap create-zone -z node3 -t whole -o base -x 192.168.0.233


If you add the -U flag with your own username then you'll be able to use zlogin via pfexec from your own account, rather than always running it as root (in other words, subsequent invocations of zlogin could be pfexec zlogin.)

Then zlogin to node1 (and node2 and node3) to install etcd, and create a user to run the service.

zlogin node1

zap install TRIBblix-etcd
useradd -u 11798 -g staff -s /bin/bash -d /export/home/etcd etcd
passwd -N etcd
mkdir -p /export/home/etcd
chown -hR etcd /export/home/etcd

I'm going to use static initialization to create the cluster. See the clustering documentation.

You need to give each node a name (I'm going to use the zone name) and the cluster a name, here I'll use pg-cluster-1 as I'm going to use it for some PostgreSQL clustering tests. Then you need to specify the URLs that will be used by this node, and the list of URLs used by the cluster as a whole - which means all 3 machines. For this testing I'm going to use unencrypted connections between the nodes, in practice you would want to run everything over ssl.

zlogin -l etcd node1

/opt/tribblix/etcd/bin/etcd \
  --name node1 \
  --initial-advertise-peer-urls http://192.168.0.231:2380 \
  --listen-peer-urls http://192.168.0.231:2380 \
  --listen-client-urls http://192.168.0.231:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.231:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new \
  --enable-v2=true

The same again for node2, with the same cluster list, but its own
URLs.

zlogin -l etcd node2

/opt/tribblix/etcd/bin/etcd \
  --name node2 \
  --initial-advertise-peer-urls http://192.168.0.232:2380 \
  --listen-peer-urls http://192.168.0.232:2380 \
  --listen-client-urls http://192.168.0.232:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.232:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new \
  --enable-v2=true

And for node3:

zlogin -l etcd node3

/opt/tribblix/etcd/bin/etcd \
  --name node3 \
  --initial-advertise-peer-urls http://192.168.0.233:2380 \
  --listen-peer-urls http://192.168.0.233:2380 \
  --listen-client-urls http://192.168.0.233:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.233:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new \
  --enable-v2=true

OK, that gives you a 3-node cluster. Initially you'll see complaints about being unable to connect to the other nodes, but it will settle down once they've all started.
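
A quick way to convince yourself that the cluster really is healthy is to ask etcdctl (again, the path assumes the Tribblix package layout, and older etcd clients may need ETCDCTL_API=3 set):

/opt/tribblix/etcd/bin/etcdctl member list

/opt/tribblix/etcd/bin/etcdctl \
  --endpoints=http://192.168.0.231:2379,http://192.168.0.232:2379,http://192.168.0.233:2379 \
  endpoint health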

And that's basically it. I think in an ideal world this would be an SMF service, with svccfg properties defining the cluster. Something I ought to implement for Tribblix at some point.

One useful tip, while discussing etcd. How do you see what's been stored in etcd? Obviously if you know what the keys in use are, you can just look them up, but if you just want to poke around you don't know what to look for. Also, etcdctl ls, which is how we used to do it, has been removed. So to simply list all the keys:

etcdctl get "" --prefix --keys-only

There you have it.


Tuesday, April 23, 2019

Setting up replicated PostgreSQL on Tribblix

When you're building systems, it's nice to build in some level of resilience. After all, failures will happen.

So, we use PostgreSQL quite a bit. We actually use a fairly traditional replication setup - the whole of the data is pushed using zfs send and receive to a second system. Problem at the source? We just turn on the DR site, and we're done.

One of the reasons for that fairly traditional approach is that PostgreSQL has, historically, not had much built-in support for replication. Or, at least, not in a simple and straightforward manner. But it's getting a lot better.

Many of the guides you'll find are rather dated, and show old, rather clunky, and quite laborious ways to set up replication. With current versions of PostgreSQL it's actually pretty trivial to get streaming replication running, so here's how to demo it if you're using Tribblix.

First set up a couple of zones. The idea is that pg1 is the master, pg2 the replica. Run, as root:

zap create-zone -z pg1 -t whole -o base -x 192.168.0.221 -U ptribble

zap create-zone -z pg2 -t whole -o base -x 192.168.0.222 -U ptribble

This simply creates a fairly basic zone, without much in the way of extraneous software installed. Adjust the IP addresses to suit, of course. And I've set them up so that I can use zlogin from my own account.

Then login to each zone, install postgres, and create a user.

zlogin pg1
zap install TRIBblix-postgres11 TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir /export/home/pguser
chown -hR pguser /export/home/pguser

And the same for pg2.

Then log in to the master as pguser.

zlogin -l pguser pg1

Now initialise a database, and start it up:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/initdb -E UTF8 -D ~/db
env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db


The next thing to do is create a PostgreSQL user that will run the streaming replication.

/opt/tribblix/postgres11/bin/psql -d postgres
CREATE ROLE replicate WITH REPLICATION LOGIN ;
set password_encryption = 'scram-sha-256';
SET
\password replicate
Enter new password: my_secret_password

Then you need to edit postgresql.conf (in the db directory) with the following settings:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 3 # or whatever
wal_keep_segments = 64 # or whatever
hot_standby = on


And set up authentication so that the user you just created can actually access the database remotely, by adding the following line to pg_hba.conf

host   replication   replicate     192.168.0.0/24    scram-sha-256

Ideally we would use hostssl so the connection is encrypted, but that's out of scope for this example.

Then restart the master.
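
As postgres was started in the foreground above, that just means ^C and running the same postgres command again. If you'd rather, pg_ctl can do it for you - something like:

/opt/tribblix/postgres11/bin/pg_ctl restart -D ~/db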

Now log in to the slave.

zlogin -l pguser pg2

And all you have to do to replicate the data is run pg_basebackup:

/opt/tribblix/postgres11/bin/pg_basebackup \
  -h 192.168.0.221 -c fast -D ~/db -R -P \
  -U replicate --wal-method=stream

It will prompt you for the super secret password you entered earlier. Once that's completed you can start the slave:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

And that's it. Data you insert on the master will be written to the slave. In this mode, you can connect to the replica and issue queries to read the data, but you can't change anything.
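
You can see that easily enough by logging in to the replica and trying it - here mytest stands for whatever test table you created on the master:

zlogin -l pguser pg2
/opt/tribblix/postgres11/bin/psql -d postgres

SELECT count(*) FROM mytest;    -- works, and reflects rows inserted on the master
INSERT INTO mytest VALUES (1);  -- fails: cannot execute INSERT in a read-only transaction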

Note that this doesn't do anything like manage failover or send client connections to the right place. It just makes sure the data is available. You need an extra layer of management to manage the master and slave(s) to actually get HA.

There are a whole variety of ways to do that, but the simplest way to test it is to stop the database on the master, which will make the slave unhappy. But if you know the master isn't coming back, you can promote the replica:

/opt/tribblix/postgres11/bin/pg_ctl promote -D ~/db

and it will spring into action, and you can point all your clients to it. The old master is now useless to you; the only course of action you have at this point is to wipe it, and then create a new replica based on the new master.

Tuesday, April 09, 2019

A teeny bug in jkstat char handling

While messing about with illuminate, I noticed an interesting oddity in the disk display: the product string for each disk had "Revision" tacked on the end.

That shouldn't be there, and iostat -En doesn't show it. This comes from my JKstat code, so where have I gone wrong?

This comes from the sderr kstat, which is a named kstat of the device_error class.

A named kstat is just a map of keys and values. The key is a string, the value is a union so you need to know what type the data is in order to be able to interpret the bits (and the data type of each entry is stored in the kstat, so that's fine).

For that Product field, it's initialized like this:

kstat_named_init(&stp->sd_pid, "Product", KSTAT_DATA_CHAR);

OK, so it's of type KSTAT_DATA_CHAR. The relevant entry in the union here is value.c, which is actually defined as a char c[16] - the field is 16 characters in size, long enough to hold up to 128-bit ints - most of the numerical data doesn't take that much space.

(For longer, arbitrary length strings, you can stick a pointer to the string in that union instead.)

Back to iostat data. For a SCSI device (something using the sd driver), the device properties are set up in the sd_set_errstats() function in the sd driver. This does a SCSI enquiry, and then copies the Product ID straight out of the right part of the SCSI enquiry string:

strncpy(stp->sd_pid.value.c, un->un_sd->sd_inq->inq_pid, 16);

(If you're interested, you can see the structure of the SCSI enquiry string in the /usr/include/sys/scsi/generic/inquiry.h header file. The inq_pid comes from bytes 16-31, and is 16 bytes long.)

You can see the problem. The strncpy() just copies 16 bytes into a character array that's 16 bytes long. It fits nicely, but there's a snag - because it fits exactly, there's no trailing null!

The problem with JKstat here is that it is (or was, anyway) using NewStringUTF() to convert the C string into a Java String. And that doesn't have any concept of length associated with it. So it starts from the pointer to the beginning of the c[] array, and keeps going until it finds the null to terminate the string.

And if you look at the sd driver, the Revision entry comes straight after the Product entry in memory, so what JNI is doing here is reading past the end of the Product value, carrying on until it finds the null at the end of the next name, "Revision", and taking the whole lot. It is, I suppose, fortunate that there is something vaguely sensible for it to find.

There doesn't appear to be a way of doing the right thing in JNI itself, the fix has to be to copy the correct amount of the value into a temporary string that does have the trailing null added.
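
In outline the fix looks something like this - a sketch rather than the exact JKstat code, where kn is the kstat_named_t pointer being converted:

/* value.c is a char[16] with no guarantee of a trailing null, so copy it
   into a buffer one byte longer, terminate it, and only then build the String */
char buf[sizeof (kn->value.c) + 1];

(void) strncpy(buf, kn->value.c, sizeof (kn->value.c));
buf[sizeof (kn->value.c)] = '\0';
return (*env)->NewStringUTF(env, buf);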

(And all the system tools written in C are fine, because they do have a way to just limit the read to 16 characters.)