Saturday, November 02, 2019

Testing hardware

My relationship with Sun wasn't just about testing Solaris. One of the other things we were involved with was beta testing new hardware.

Doing beta testing of software contributed to us being offered the chance to test new hardware, of course. Sun could be reasonably sure they would get lots of honest feedback from us.

And also, we had given a huge amount of feedback via the sales organization. Usually of the form "we don't want systems like that, why can't you make something like this instead?".

Some examples here: At one point we ended up buying Dell 6350s (running Solaris on Intel of course). They were a lot cheaper, for one thing. Paradoxically, we often had more memory in those 32-bit Intel machines than in the 64-bit SPARC servers, often down to cost but sometimes due to configuration limitations. But a lot of it was that the 4U rack-mount Dell would fit in the datacenter, whereas we wouldn't have been able to fit a pile of E450s in, even if we could have afforded them.

Another example was that we bought a load of Ultra-5 workstations, some cheap shelving, and a KVM switch. Tied together with Grid Engine and its distributed make, they could get through some parallel tasks for a lot less than any server on the price list at the time. We tried (and failed) to persuade Sun to go even further with a cut-down system - we didn't need a CD drive or anything like that, it was just waste.

We also kept complaining about little things like chassis design, cabling, configuration, access and repair.

Eventually, we were asked to trial some new products.

One of the most interesting, and long running, was the B1600 blade system. As in blade servers, not Blade workstations. Aka Stiletto.

This had a variety of blades - a simple cheap single-processor SPARC (the B100s - like a V120 or flapjack, but smaller), a single-processor AMD (B100x), and a twin-processor Xeon (B200x). There was also a plan for a variety of appliance blades; the only one that came out, I think, was a load-balancing device, but we didn't test those. If you think about the design, the whole thing full of SPARC blades was not much bigger than an Ultra-5, but far more powerful and easier to wire.

I had an unfortunate accident while on holiday just before we started the testing, where I broke my arm. I wasn't able to go to the "training" session and learn about the system, but then I'm a great believer that systems should actually be obvious to use. As the only UK customer, though, we had one of the engineering team come and make a video of us unpacking, racking, and configuring the system.

So there was me just supervising, Andy with the video camera, and Geoff and Terry doing the lifting. All the way through from opening the box to having things ready to roll. So the project did cover things like the way it was packed in the carton, how you were able to lift it out safely, and how useful the bits of paper that came with it were.

We vastly preferred Sun rack cabinets - they're just so much stronger and more stable. But we decided that we would try putting the B1600 chassis into a third-party rack, as an additional test. This was a nightmare! The B1600 chassis didn't use traditional rackmount rails; it had some very thin wheels that slotted into side-rails, and the tolerances were tiny. You had to line the thing up to within a millimetre, and if you think about getting anything with cage nuts down to under a millimetre by eye, it was always going to be difficult. It took us the best part of an hour (part of that was us giving a running commentary, to be fair) and multiple attempts, including some where we thought it was nicely aligned but actually wasn't, so it could have fallen.

The video was widely shared inside Sun, as I understand it, and the fix was to supply a simple metal measuring bar that you could offer up to the rack to ensure everything was square and at the correct spacing. We ended up trying several designs as they optimised it. If you've ever wondered how those spacer bars originated, now you know!

Of course, we tried sticking it in a proper Sun rack, and it was racked perfectly there in 5 seconds flat before Andy could even get over there with the camera.

Sun didn't really help themselves at times. When we first got the x86 blades, we couldn't run Solaris on them and had to run Linux for a bit. Things like drivers and management interfaces took a while to be completed.

Another project we did was the original V40z, the first generation of 64-bit Opterons (this was just the Newisys Opteron reference design with a different bit of plastic tacked on the front). This was less about the actual hardware than about bringing in a 64-bit operating system and the overall ecosystem associated with it. The first thing we noticed about it was that as soon as you apply power, the fans scream at full tilt - it was incredibly loud. A nice feature of this generation of systems was that they had 2 management network interfaces, so you could daisy chain the ILOMs in a rack, saving a huge number of external switch ports.

We also tested the V250, a tower server. One of the things we had complained about was that you never got a lot of choice in disk configurations - you either had too few (most of the Sun rack-mount range), or you needed the compute power of an E450 and had to buy a big metal box with mostly unused space. There was never a way to size a Sun box properly. We liked the V250 because it took a sensible number of disks, so for standalone databases it was great.

Monday, October 28, 2019

A brief history with Solaris

I first encountered Solaris (as in Solaris 2.x, as opposed to the retrospectively branded SunOS 4 as Solaris 1.x) when we got a SPARC classic workstation. Initially, that hardware didn't support SunOS 4. That made the shiny workstations useless doorstops, as nothing worked, and building stuff from source didn't work either.

Besides, Solaris 2.1 was utter garbage. It took decades to rid it of some of the more erratic design stupidities inherent in System V. (Cough. Printing. SAF.)

I just missed any serious association with Solaris 2.2, as the SS1000s I got to look after had been upgraded to 2.3 just before my arrival.

So, as a sysadmin, Solaris 2.3 was my first exposure to Solaris at scale. On the SS1000 you didn't have a choice, that was a completely new architecture that was never going to run SunOS 4, and we had several of them as the core of the service.

We built out NISplus. This had a bunch of, shall we call them, quirks, and the early releases were pretty grim. But once the more irritating bugs got fixed, it served as a solid workhorse for years. As a network nameservice it was years ahead of its time - having proper administrative tooling and permissions, and a hierarchical structure. It was orders of magnitude better than the older NIS, and far better than anything available today. The SS5 running as our NISplus master did so on Solaris 2.3 for far longer than it probably should have.

(We were also one of the few places to use X/Open federated naming, another game-changing, state-of-the-art technology that Sun introduced and that is now lost without trace.)

There's a common rule that odd releases are bad, even releases are good. That didn't work with early releases of Solaris - they were all bad. But Solaris 2.4 was getting to be better - more stable, more performant, generally a better feel.

As you might expect, there was a pattern, and we found Solaris 2.5 to be pretty dreadful. We installed it on a couple of systems, but it was so poor we gave up. And then Solaris 2.5.1 was pretty decent, so the alternation pattern was starting to become established.

For me, Solaris 2.6 was a watershed. It was atrocious. We weren't exclusively a Solaris shop - we had RS/6000s running AIX, a decent SGI presence, some Linux, odd bits of other Unices, and still had SunOS 4 (an old ELC salvaged from the skip, running our multicast router to connect us to the mbone). But we were starting to like Solaris, as it was so much easier to manage than anything else out there, so I started to report the bugs I was hitting.

I was reporting bug after bug after bug. We had given feedback previously, of course, but at nothing like the scale we were doing here. And, unlike other vendors who slammed the door in our faces and told us to go away, the Sun engineers actually wanted the feedback and the bugs, and fixed things for us.

So when they were planning Solaris 2.7, they got me to test it before it was released, rather than letting all those bugs get out into the wild and have to deal with my irate bug reports afterwards.

This ended up with an odd anecdote. As a beta tester, I was sent the Solaris media just before the official release. And the CDs said "Solaris 7" on them. Sun didn't communicate the renumbering (dropping the leading 2.) very well internally, although clearly whoever pressed the CDs needed to know. So I was able to confirm to the rather sceptical Sun salesforce and reseller community in the UK that the renumbering wasn't a joke.

We tested Solaris 8 and all its updates, and then Solaris 9 and its updates. We found Solaris 8 to be a bit dull, to be honest, and shifted to Solaris 9. At this point we were tracking every release, and it was always better. It was rather annoying that the industry seemed to settle on Solaris 8, as that meant that some new hardware was only supported at launch on the old Solaris 8 rather than the current Solaris 9.

With Solaris 10, we got invited onto the Platinum Beta program. This basically means that you run the latest build, in production. As Sun Service hadn't even seen the release, any bugs or problems we had went straight back to Solaris engineering, and every customer in the program had a dedicated engineer we would deal with.

I also got to go out to Menlo Park a couple of times, at the start and end of the program. We got the inside scoop on all the new features from the people who wrote them.

Also with the Platinum Beta, a select few of us got hold of ZFS. You know how you build a prototype and throw it away, and then do it properly? Well, the version we had was that prototype. And yes, it was thrown away and ZFS was rewritten pretty much from scratch. That was why ZFS wasn't in Solaris 10 at launch, by the way. And the version we tested was a bit different to the way that it ended up working - for example, initially the pool didn't have an associated top-level mountpoint, so that pools and datasets were quite distinct. But the attitude of that ZFS testing was quite simple - they just sent us the zfs and zpool binaries, the kernel driver, and a 3-line crib sheet, and everything was supposed to be intuitive and obvious. If you couldn't work out how to do something, that was considered to be a bug.

Immediately after Solaris 10 (in fact, starting just before the release) we kicked off OpenSolaris, initially as a closed pilot - nobody really knew how it was going to work, or indeed whether some lawyer would find a speck of dust to jam up the works and prevent the whole thing going live. But OpenSolaris launched, and its descendants - yes, I'm talking about illumos - are still making a difference.

Monday, April 29, 2019

HA PostgreSQL on Tribblix with Patroni

When it comes to managing PostgreSQL replication, there are a number of options available.

Updated: If you're running Tribblix m22 or later, you'll need python-3.7 as shown below. On older releases, use python-2.7 in the commands below.

I looked at stolon, but it's not the only game in town. In terms of a fully managed system, there's also patroni.

In terms of overall functionality, stolon and patroni are pretty similar. They both rely on etcd (or something similar) for storing state; they both take care of running the postgres server with the right options, and reconfiguring it as necessary; they'll both promote a replica to continue service if the master fails.

So, here's how to set up a HA PostgreSQL cluster using patroni.

Before starting on anything like this with Tribblix, it's always a good idea to

zap refresh

so that you're up to date in terms of packages and overlays.

First create 3 zones, just like before:

zap create-zone -z node1 -t whole \
  -o base -O patroni -x 192.168.0.231

zap create-zone -z node2 -t whole \
  -o base -O patroni -x 192.168.0.232

zap create-zone -z node3 -t whole \
  -o base -O patroni -x 192.168.0.233

Building the zones like this, with the patroni overlay, will ensure that all the required packages are installed in the zones so you don't need to mess around with installing packages later.

Then zlogin to each node and run the etcd commands as before, to create the user and start etcd.

Now create a user to run postgres on each node

zlogin node1 (and 2 and 3)
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

Now you need to create yaml files containing the configuration for each node. See http://petertribble.co.uk/Solaris/patroni/ for the sample files I've used here.
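If you just want to see the general shape before fetching those, a node's file looks roughly like this - the key names are patroni's standard ones, but the specific values here (ports, paths, passwords) are illustrative rather than copied from my actual files:

scope: my-ha-cluster
name: node1

restapi:
  listen: 192.168.0.231:8008
  connect_address: 192.168.0.231:8008

etcd:
  host: 192.168.0.231:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
  initdb:
    - encoding: UTF8

postgresql:
  listen: 192.168.0.231:5432
  connect_address: 192.168.0.231:5432
  data_dir: /export/home/pguser/db
  bin_dir: /opt/tribblix/postgres11/bin
  authentication:
    superuser:
      username: postgres
      password: my_secret_password
    replication:
      username: replicate
      password: my_secret_password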

Log in to each node in turn

pfexec zlogin -l pguser node1

wget http://petertribble.co.uk/Solaris/patroni/node1.yaml
/usr/versions/python-3.7/bin/patroni ${HOME}/node1.yaml

And it initializes a cluster, with just the one node as yet, and that node will start off as the master.

Now the 2nd node

pfexec zlogin -l pguser node2

wget http://petertribble.co.uk/Solaris/patroni/node2.yaml
/usr/versions/python-3.7/bin/patroni ${HOME}/node2.yaml

And it sets it up as a secondary, replicating from node1.

What do things look like right now? You can check that with:

/usr/versions/python-3.7/bin/patronictl \
  -d etcd://192.168.0.231:2379 \
  list my-ha-cluster

Now the third node:

pfexec zlogin -l pguser node3

wget http://petertribble.co.uk/Solaris/patroni/node3.yaml
/usr/versions/python-3.7/bin/patroni ${HOME}/node3.yaml


You can force a failover by killing (or ^C) the patroni process on the master, which should be node1. You'll see one of the replicas coming up as master, and replication on the other replica change to use the new master. One thing I did notice is that patroni initiates the failover process pretty much instantly, whereas stolon waits a few seconds to be sure.

You can initiate a planned failover too:

/usr/versions/python-3.7/bin/patronictl \
  -d etcd://192.168.0.231:2379 \
  failover my-ha-cluster

It will ask you for the new master node, and for confirmation, and then you'll have a new master.

But you're not done yet. There's nothing to connect to. For that, patroni doesn't supply its own component (like stolon does with its proxy) but depends on a haproxy instance. The overlay install we used when creating the zone will have made sure that haproxy is installed in each zone; all we have to do is configure and start it.

zlogin to each node, as root, and

wget http://petertribble.co.uk/Solaris/patroni/haproxy.cfg -O /etc/haproxy.cfg
svcadm enable haproxy

You don't have to set up the haproxy stats page, but it's a convenient way to see what's going on. If you go to the stats page

http://192.168.0.231:7000

then you can see that it's got the active backend up and the secondaries marked as down - haproxy is checking the patroni REST API, which only shows the active postgres instance as up, so haproxy will route all connections through to the master. And, if you migrate the master, haproxy will follow it.
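If you're curious what that configuration boils down to, here's a rough sketch modelled on the example in the patroni documentation rather than my exact file - the Postgres frontend port (5000) and the patroni REST API port (8008) are assumptions here, while the stats port matches the 7000 above:

global
    maxconn 100

defaults
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen postgres
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 192.168.0.231:5432 maxconn 100 check port 8008
    server node2 192.168.0.232:5432 maxconn 100 check port 8008
    server node3 192.168.0.233:5432 maxconn 100 check port 8008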

Which to choose? That's always a matter of opinion, and to be honest while there are a few differences, they're pretty much even.
  • stolon is in go, and comes as a tiny number of standalone binaries, which makes it easier to define how it's packaged up
  • patroni is in python, so needs python and a whole bunch of modules as dependencies, which makes deployment harder (which is why I created an overlay - there are 32 packages in there, over 2 dozen python modules)
  • stolon has its own proxy, rather than relying on a 3rd-party component like haproxy
As a distro maintainer, it doesn't make much difference - dealing with those differences and dependencies is part and parcel of daily life. For standalone use, I think I would probably tend towards stolon, simply because of the much smaller packaging effort.

(It's not that stolon necessarily has fewer dependencies, but remember that in go these are all resolved at build time rather than runtime.)

Thursday, April 25, 2019

HA PostgreSQL on Tribblix with stolon

I wrote about setting up postgres replication, and noted there that while it did what it said it did - ensured that your data was safely sent off to another system - it wasn't a complete HA solution, requiring additional steps to actually make any use of the hot standby.

What I'm going to describe here is one way to create a fully-automatic HA configuration, using stolon. There's a longer article about stolon, roughly explaining the motivations behind the project.

Stolon uses etcd (or similar) as a reliable, distributed configuration store. So this article follows on directly from setting up an etcd cluster - I'm going to use the same zones, the same names, the same IP addresses, so you will need to have got the etcd cluster running as described there first.

We start off by logging in to each zone using zlogin (with pfexec if you set your account up as the zone administrator when creating the zone):

pfexec zlogin node1 (and node2 and node3)

Followed by installing stolon and postgres on each node, and creating an account for them to use:

zap refresh
zap install TRIBblix-postgres11 TRIBblix-stolon TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

In all the following commands I'm assuming you have set your PATH correctly so it contains the postgres and stolon executables. Either add /opt/tribblix/postgres11/bin and /opt/tribblix/stolon/bin to the PATH, or prefix the commands with

env PATH=/opt/tribblix/postgres11/bin:/opt/tribblix/stolon/bin:$PATH

Log in to the first node as pguser.

pfexec zlogin -l pguser node1

Configure the cluster (do this just the once):

stolonctl --cluster-name stolon-cluster \
  --store-backend=etcdv3 init

It's saving the metadata to etcd, although at this point it's just a single key to mark the stolon cluster as existing.

Now we need a sentinel.

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3

It complains that there are no keepers, so zlogin to node1 in another window and start one of those up too:

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres0 --data-dir data/postgres0 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.231'

After a little while, a postgres instance appears. Cool!

Note that you have to explicitly specify the listen address. That's also the address that other parts of the cluster use, so you can't use "localhost" or '*', you have to use the actual address.


You also specify the postgres superuser password, and the account for replication and its password. Obviously these ought to be the same for all the nodes in the cluster, so they can all talk to each other successfully.

And now we can add a proxy, after another zlogin to node1:

stolon-proxy --cluster-name stolon-cluster \
  --store-backend=etcdv3 --port 25432

If now you point your client (such as psql) at port 25432 you can talk to the database through the proxy.
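For example, from inside node1 (where this proxy is running), something like the following should connect you to the master through the proxy - it will prompt for the superuser password set on the keeper above:

/opt/tribblix/postgres11/bin/psql -h 127.0.0.1 -p 25432 -U postgres -d postgres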

Just having one node doesn't meet our desire to build a HA cluster, so let's add some more nodes.

Right, go to the second node,

pfexec zlogin -l pguser node2

and add a sentinel and keeper there:

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3


stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres1 --data-dir data/postgres1 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.232'

What you'll then see happening on the second node is that stolon will automatically set the new postgres instance up as a replica of the first one (it assumes the first one you run is the master).

Then set up the third node:

pfexec zlogin -l pguser node3

with another sentinel and keeper:

stolon-sentinel --cluster-name stolon-cluster \
  --store-backend=etcdv3

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres2 --data-dir data/postgres2 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \
  --pg-listen-address='192.168.0.233'

You can also run a proxy on the second and third nodes (or on any other node you might wish to use, come to that). Stolon will configure the proxy for you so that it's always connecting to the master.

At this point you can play around, create a table, insert some data.
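For example, connecting with psql through the proxy as above, a trivial test might be:

CREATE TABLE ha_test (id integer, note text);
INSERT INTO ha_test VALUES (1, 'written via the proxy');
SELECT * FROM ha_test;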

And you can test failover. This is the real meat of the problem.

Kill the master (^C its keeper). It takes a while, because it wants to be sure there's actually a problem before taking action, but what you'll see is one of the slaves being promoted to master. And if you run psql against the proxies, they'll send your queries off to the new master. Everything works as it should.

Even better, if you restart the old failed master (as in, restart its keeper), then it successfully sets the old master up as a slave. No split-brain, you get your redundancy back.

I tried this a few more times, killing the new master aka the original slave, and it fails across again.

I'm actually mighty impressed with stolon.

Setting up an etcd cluster on Tribblix

Using etcd to store configuration data is a common pattern, so how might you set up an etcd cluster on Tribblix?

Updated: With current etcd, you may need to add the --enable-v2=true flag, as shown below. For example, Patroni requires v2.

I'll start by creating 3 zones to create a 3-node cluster. For testing these could all be on the same physical system, for production you would obviously want them on separate machines.

As root:

zap refresh

zap create-zone -z node1 -t whole -o base -x 192.168.0.231

zap create-zone -z node2 -t whole -o base -x 192.168.0.232

zap create-zone -z node3 -t whole -o base -x 192.168.0.233


If you add the -U flag with your own username then you'll be able to use zlogin via pfexec from your own account, rather than always running it as root (in other words, subsequent invocations of zlogin could be pfexec zlogin.)

Then zlogin to node1 (and node2 and node3) to install etcd, and create a user to run the service.

zlogin node1

zap install TRIBblix-etcd
useradd -u 11798 -g staff -s /bin/bash -d /export/home/etcd etcd
passwd -N etcd
mkdir -p /export/home/etcd
chown -hR etcd /export/home/etcd

I'm going to use static initialization to create the cluster. See the clustering documentation.

You need to give each node a name (I'm going to use the zone name) and the cluster a name, here I'll use pg-cluster-1 as I'm going to use it for some PostgreSQL clustering tests. Then you need to specify the URLs that will be used by this node, and the list of URLs used by the cluster as a whole - which means all 3 machines. For this testing I'm going to use unencrypted connections between the nodes, in practice you would want to run everything over ssl.

zlogin -l etcd node1

/opt/tribblix/etcd/bin/etcd \
  --name node1 \
  --initial-advertise-peer-urls http://192.168.0.231:2380 \
  --listen-peer-urls http://192.168.0.231:2380 \
  --listen-client-urls http://192.168.0.231:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.231:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new \
  --enable-v2=true

The same again for node2, with the same cluster list, but its own URLs.

zlogin -l etcd node2

/opt/tribblix/etcd/bin/etcd \
  --name node2 \
  --initial-advertise-peer-urls http://192.168.0.232:2380 \
  --listen-peer-urls http://192.168.0.232:2380 \
  --listen-client-urls http://192.168.0.232:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.232:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new \
  --enable-v2=true

And for node3:

zlogin -l etcd node3

/opt/tribblix/etcd/bin/etcd \
  --name node3 \
  --initial-advertise-peer-urls http://192.168.0.233:2380 \
  --listen-peer-urls http://192.168.0.233:2380 \
  --listen-client-urls http://192.168.0.233:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.0.233:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.168.0.231:2380,node2=http://192.168.0.232:2380,node3=http://192.168.0.233:2380 \
  --initial-cluster-state new \
  --enable-v2=true

OK, that gives you a 3-node cluster. Initially you'll see complaints about being unable to connect to the other nodes, but it will settle down once they've all started.
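To confirm the cluster has actually formed, you can ask any member for its view of the membership and health - I'm assuming here that etcdctl lives alongside etcd in /opt/tribblix/etcd/bin:

/opt/tribblix/etcd/bin/etcdctl --endpoints=http://192.168.0.231:2379 member list
/opt/tribblix/etcd/bin/etcdctl --endpoints=http://192.168.0.231:2379 endpoint health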

And that's basically it. I think in an ideal world this would be an SMF service, with svccfg properties defining the cluster. Something I ought to implement for Tribblix at some point.

One useful tip, while discussing etcd. How do you see what's been stored in etcd? Obviously if you know what the keys in use are, you can just look them up, but if you just want to poke around you don't know what to look for. Also, etcdctl ls has been removed, which is how we used to do it. So to simply list all the keys:

etcdctl get "" --prefix --keys-only

There you have it.


Tuesday, April 23, 2019

Setting up replicated PostgreSQL on Tribblix

When you're building systems, it's nice to build in some level of resilience. After all, failures will happen.

So, we use PostgreSQL quite a bit. We actually use a fairly traditional replication setup - the whole of the data is pushed using zfs send and receive to a second system. Problem at the source? We just turn on the DR site, and we're done.

One of the reasons for that fairly traditional approach is that PostgreSQL has, historically, not had much built in support for replication. Or, at least, not in a simple and straightforward manner. But it's getting a lot better.

Many of the guides you'll find are rather dated, and show old, rather clunky, and quite laborious ways to set up replication. With current versions of PostgreSQL it's actually pretty trivial to get streaming replication running, so here's how to demo it if you're using Tribblix.

First set up a couple of zones. The idea is that pg1 is the master, pg2 the replica. Run, as root:

zap create-zone -z pg1 -t whole -o base -x 192.168.0.221 -U ptribble

zap create-zone -z pg2 -t whole -o base -x 192.168.0.222 -U ptribble

This simply creates a fairly basic zone, without much in the way of extraneous software installed. Adjust the IP addresses to suit, of course. And I've set them up so that I can use zlogin from my own account.

Then login to each zone, install postgres, and create a user.

zlogin pg1
zap install TRIBblix-postgres11 TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir /export/home/pguser
chown -hR pguser /export/home/pguser

And the same for pg2.

Then log in to the master as pguser.

zlogin -l pguser pg1

Now initialise a database, and start it up:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/initdb -E UTF8 -D ~/db
env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db


The next thing to do is create a PostgreSQL user that will run the streaming replication.

/opt/tribblix/postgres11/bin/psql -d postgres
CREATE ROLE replicate WITH REPLICATION LOGIN ;
set password_encryption = 'scram-sha-256';
SET
\password replicate
Enter new password: my_secret_password

Then you need to edit postgresql.conf (in the db directory) with the following settings:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 3 # or whatever
wal_keep_segments = 64 # or whatever
hot_standby = on


And set up authentication so that the user you just created can actually access the database remotely, by adding the following line to pg_hba.conf

host   replication   replicate     192.168.0.0/24    scram-sha-256

Ideally we would use hostssl so the connection is encrypted, but that's out of scope for this example.

Then restart the master.

Now log in to the slave.

zlogin -l pguser pg2

And all you have to do to replicate the data is run pg_basebackup:

/opt/tribblix/postgres11/bin/pg_basebackup \
  -h 192.168.0.221 -c fast -D ~/db -R -P \
  -U replicate --wal-method=stream

It will prompt you for the super secret password you entered earlier. Once that's completed you can start the slave:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

And that's it. Data you insert on the master will be written to the slave. In this mode, you can connect to the replica and issue queries to read the data, but you can't change anything.
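If you want to check that replication really is running, postgres can tell you - on the master, pg_stat_replication should list the replica, and on the replica, pg_is_in_recovery() should return true:

# on the master (pg1)
/opt/tribblix/postgres11/bin/psql -d postgres \
  -c 'SELECT client_addr, state, sync_state FROM pg_stat_replication;'

# on the replica (pg2)
/opt/tribblix/postgres11/bin/psql -d postgres \
  -c 'SELECT pg_is_in_recovery();'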

Note that this doesn't do anything like manage failover or send client connections to the right place. It just makes sure the data is available. You need an extra layer of management to manage the master and slave(s) to actually get HA.

There are a whole variety of ways to do that, but the simplest way to test it is to stop the database on the master, which will make the slave unhappy. But if you know the master isn't coming back, you can promote the replica:

/opt/tribblix/postgres11/bin/pg_ctl promote -D ~/db

and it will spring into action, and you can point all your clients to it. The old master is now useless to you; the only course of action you have at this point is to wipe it, and then create a new replica based on the new master.

Tuesday, April 09, 2019

A teeny bug in jkstat char handling

While messing about with illuminate, I noticed an interesting oddity in the disk display:



See that "Revision" on the end of the product string? It shouldn't be there, and iostat -En doesn't show it. This comes from my JKstat code, so where have I gone wrong?

This comes from the sderr kstat, which is a named kstat of the device_error class.

A named kstat is just a map of keys and values. The key is a string, the value is a union so you need to know what type the data is in order to be able to interpret the bits (and the data type of each entry is stored in the kstat, so that's fine).

For that Product field, it's initialized like this:

kstat_named_init(&stp->sd_pid, "Product", KSTAT_DATA_CHAR);

OK, so it's of type KSTAT_DATA_CHAR. The relevant entry in the union here is value.c, which is actually defined as a char c[16] - the field is 16 characters in size, long enough to hold up to 128-bit ints - most of the numerical data doesn't take that much space.

(For longer, arbitrary length strings, you can stick a pointer to the string in that union instead.)

Back to iostat data. For a SCSI device (something using the sd driver), the device properties are set up in the sd_set_errstats() function in the sd driver. This does a SCSI enquiry, and then copies the Product ID straight out of the right part of the SCSI enquiry string:

strncpy(stp->sd_pid.value.c, un->un_sd->sd_inq->inq_pid, 16);

(If you're interested, you can see the structure of the SCSI enquiry string in the /usr/include/sys/scsi/generic/inquiry.h header file. The inq_pid comes from bytes 16-31, and is 16 bytes long.)

You can see the problem. The strncpy() just copies 16 bytes into a character array that's 16 bytes long. It fits nicely, but there's a snag - because it fits exactly, there's no trailing null!

The problem with JKstat here is that it is (or was, anyway) using NewStringUTF() to convert the C string into a java String, and that doesn't have any concept of length associated with it. So it starts from the pointer to the beginning of the c[] array, and keeps going until it finds the null to terminate the string.

And if you look at the sd driver, the Revision entry comes straight after the Product entry in memory, so what JNI is doing here is reading past the end of the Product value, and it keeps going until it finds the null at the end of the next name, "Revision", and takes the whole lot. It is, I suppose, fortunate that there is something vaguely sensible for it to find.

There doesn't appear to be a way of doing the right thing in JNI itself; the fix has to be to copy the correct amount of the value into a temporary string that does have the trailing null added.
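Something along these lines is the sort of helper I mean - a sketch rather than the exact JKstat code, with a hypothetical function name:

#include <string.h>
#include <jni.h>

/*
 * Convert a KSTAT_DATA_CHAR value (up to 16 bytes, possibly not
 * null-terminated) into a java String safely.
 */
static jstring
kstat_char_to_jstring(JNIEnv *env, const char *val)
{
	char buf[17];

	(void) strncpy(buf, val, 16);
	buf[16] = '\0';		/* guarantee termination before JNI reads it */
	return ((*env)->NewStringUTF(env, buf));
}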

(And all the system tools written in C are fine, because they do have a way to just limit the read to 16 characters.)

Monday, April 01, 2019

Notes on web servers and client certificates

With https, web servers have digital certificates to encrypt and authenticate traffic.

Web servers can also require clients to present a valid certificate, which could be used for authentication and identity.

I've recently had the misfortune to end up delving into this, so here are some notes on diagnosing and testing this from a client perspective. One of the problems here is that this all takes place before anything http-related happens, so normal diagnostic techniques are useless - and there won't be anything logged on either the server or client side to work from.

So, rather than trying to work out the exact commands next time, this is an aide-memoire. And hopefully might be useful to others too.

The first thing is to work out whether a server expects a client certificate or not. (If you can get in without, then obviously it's not requiring one, but there are other ways connections can fail.)

Fortunately, openssl can initiate the connection, allowing you to see exactly what's going on:

openssl s_client -showcerts \
  -servername example.com \
  -connect example.com:443

Note that you'll usually need the -servername flag here, to stick on the SNI header, otherwise the server or load balancer or proxy won't know what to do with it.

In addition to printing out all the server certificates and some other diagnostics, this will tell you what, if any, client certificates are required. If none are expected, there will be a section that looks like:

---
No client certificate CA names sent
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---

If the server wants a client certificate, then it will tell you what certificates it wants:

---
Acceptable client certificate CA names
/CN=My Client CA
Client Certificate Types: RSA sign, DSA sign, ECDSA sign
Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---

The important thing here is the "/CN=My Client CA" - the server is telling you that it wants a certificate signed by that Certificate Authority.

Assuming you have such a certificate, how do you send it? Browsers have ways to import it, but it's often easier to diagnose it from the CLI, using curl or wget. You'll need both the certificate and the key, and the syntax is:

wget --certificate=mycert.crt --private-key=mycert.key \
  https://example.com/

or

curl --cert mycert.crt --key mycert.key \
  https://example.com/

Assuming the certificate you have is signed by the relevant CA, this will allow you to retrieve the page. If there's a problem, you might get some meaningful diagnostics.
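It's also worth checking exactly what client certificate you've been given - openssl will show you the subject, the issuing CA (which needs to match one of the acceptable CA names above), and the expiry date:

openssl x509 -in mycert.crt -noout -subject -issuer -enddate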

The other thing is how to load those into a browser. Normally you have something like a .p12 file with both parts bundled. You can create one of those like so:

openssl pkcs12 -export -out mycert.p12 \
  -in mycert.crt -inkey mycert.key

You'll get asked for a passphrase to protect the .p12 file - it contains the key, so needs to be protected in transit.

You can also extract the key and certificate from a .p12 file:

openssl pkcs12 -in mycert.p12 -nokeys -out mycert.crt
openssl pkcs12 -in mycert.p12 -nocerts -out mycert.key

Friday, March 22, 2019

Creating illumos packages for Tribblix

In the prior article in this series, I discussed how to build illumos-gate (and it applies to illumos-omnios too), using a Tribblix AMI on AWS.

After that, you'll have two directories of interest under illumos-gate.
  1. The proto area, specifically proto/root_i386, which is a fully installed copy of illumos.
  2. Under packages, an on-disk IPS package repository.
If you're using IPS, you could use that repository as is. Tribblix, however, uses SVR4 packages. So the next step is to convert the contents of the IPS repository into a set of SVR4 packages.

For this, there is an additional github repo you'll need to check out, in addition to the ones from the previous article.

cd ${HOME}/Tribblix
git clone https://github.com/tribblix/tribblix-transforms

What is this repo? It's a list of transformations that are applied to the package conversion process. Generally, rather than modify the gate or the build, if there's something I don't want to ship, or something I want to move between packages, or a file I want to change, then it gets transformed at the packaging stage.

The other thing you'll need to use my scripts is a signing certificate. In Tribblix, the illumos ELF objects are digitally signed (yes, I ought to extend this to all ELF objects I ship). Nothing actually uses this yet, but it's all there ready for when verification is required.

To create a signing key and certificate:

mkdir ${HOME}/f
cd ${HOME}/f
openssl req -x509 -newkey rsa:2048 -nodes \
  -subj "/O=MyOrganization/CN=my.domain.name" \
  -keyout elfcert.key -out elfcert.crt -days 3650

Of course, choose the subject to match your own requirements.

Then you can run the scripts in the tribblix-build repo to generate SVR4 packages. These parse the IPS manifests, extract the files from the IPS repo, create the SVR4 prototype file, and automatically generate SVR4 install scripts from the IPS metadata.



So, to create packages, run the following as root:

/illumos/Tribblix/tribblix-build/repo_all.sh \
  -T /illumos/Tribblix \
  -G /illumos/Illumos/illumos-gate \
  -S /illumos/f/elfcert

Of course, replace /illumos with wherever your build user's home directory is. (You can't use $HOME, because this is run as root where $HOME is likely to be different to that of the user who did the build.) The -T flag tells it where the Tribblix tools are checked out, -G where the gate was built, and -S where the signing certificate is. If you want to change the version of the packages you generate, there's a -V flag as well. I normally redirect the output to a log file, as it's quite verbose.

Wait a while, and you'll have a set of packages under /var/tmp/illumos-pkgs. Hopefully the pkgs directory there will have your packages, and the build and tmp directories will be empty (if the build directory isn't, it will have a half-created package or packages in it, so you'll be able to see which package or packages failed).
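If you want to sanity-check one of the generated packages, the standard SVR4 tools can read the datastream files directly - something like this, where the package name is just a placeholder:

pkginfo -l -d /var/tmp/illumos-pkgs/pkgs/SOMEpkg.pkg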

For omnitribblix it's very similar:

/illumos/Tribblix/tribblix-build/omni_all.sh \
  -T /illumos/Tribblix \
  -G /illumos/Illumos/omnitribblix \
  -S /illumos/f/elfcert

And the packages in this case end up in /var/tmp/omni-pkgs.

If you look, you'll see that there are 3 files for each package:

  • A .pkg file, which is in SVR4 datastream format
  • A .zap file, which is the (compressed) zap format Tribblix uses
  • An md5 checksum, which is used for basic validation in the package catalog
I then use those generated SVR4 packages to populate the package repository, and to build the Tribblix ISOs. The details of that will have to wait for another time.

Sunday, March 10, 2019

Tweaking the Tribblix installer

In the latest update to Tribblix, I've made a couple of minor tweaks to the installer.

The first is that compression (lz4) is now enabled on the root pool by default. It was possible in the past to give the -C flag to the installer, but now it's always on. There was a time in the past (the distant past) when enabling gzip compression could really hurt performance with some workloads, but the world has moved on, so enabling compression by default is a good thing.

The compression factor you see on something like the AWS image is:

NAME                 PROPERTY       VALUE  SOURCE
rpool                compressratio  1.92x  -
rpool/ROOT           compressratio  1.92x  -
rpool/ROOT/tribblix  compressratio  1.92x  -
rpool/export         compressratio  1.00x  -
rpool/export/home    compressratio  1.00x  -
rpool/swap           compressratio  2.04x  -

It's not bad, getting a factor of almost 2 essentially for free.

The second change is to partitioning. Traditionally, the root pool was created in a partition rather than the whole disk. You made the partition span the whole disk, but the disk was still partitioned in the traditional way.

As of m20.6, if you use -G instead of -B to the installer

./live_install.sh -G c1t0d0 [overlays...]

then it will create a whole-disk pool with an EFI System partition, to support booting systems with UEFI firmware. (As it says in the zpool man page.) Good for newer systems, but it brings another key capability.

For EC2 instances, this allows the rpool to be expanded. The base size is still 8G, but you can choose a different (larger) size for the EBS device when creating the instance.

When you initially log in, you still get the original size:

NAME    SIZE  ALLOC   FREE
rpool  7.50G   301M  7.21G

but you can use the following command:

zpool online -e rpool c2t0d0

and it magically expands to use the available space:

NAME    SIZE  ALLOC   FREE
rpool  11.5G   302M  11.2G

At the moment, this has to be done manually, but in future this expansion will happen automatically when the instance boots.

You can, of course, increase the size of an EBS volume while the instance is running. This takes a little while (the State is shown as "in-use - optimizing" while the expansion takes place). Unfortunately it appears at the moment that you need a reboot for the instance to rescan the devices so it knows there's more space.

Wednesday, February 27, 2019

Building illumos-gate on AWS (2019 version)

I've covered building illumos on AWS before, but the instructions there are a bit out of date. Of course, these will become outdated too, in time, but should still be useful.

First spin up an EC2 instance, the AMI you want to use for this is "Tribblix-m20.5", ami-7cf2181b in London. (If you want to use a different region, then you'll have to copy the AMI.)

For instance size, try an m4.2xlarge - a t2.micro really won't cut it, you won't even be able to install the packages. With an m4.2xlarge you're looking at 30-45 minutes to build illumos, depending on whether debug is enabled.

Add an external EBS device of at least 8G, there isn't enough space on the AMI's root partition to handle the build. When you add it, attach it at /dev/sdf (there's no real reason for this, but that will match the zpool create below).

Once it's booted up (I assume you know about security groups and keypairs, so you can ssh in as root), then the first thing you need to do is update the image:

zap refresh
zap update-overlay -a

Install the illumos-build overlay and tweak the install

zap install-overlay illumos-build
rm -f /usr/bin/cpp
cd /usr/bin ; ln -s ../gnu/bin/xgettext gxgettext

Create a user and a storage pool for them to use

zpool create -O compression=lz4 illumos c2t5d0
useradd -g staff -d /illumos -s /bin/bash illumos
chown illumos /illumos
passwd illumos

Log in as illumos, and clone the gate. What I do here is have a reference copy of the gate, and then I can clone that very quickly locally every time I want to do a build.

mkdir ${HOME}/Illumos-reference
cd ${HOME}/Illumos-reference
git clone https://github.com/illumos/illumos-gate

If you want to create omnitribblix too

git clone https://github.com/omniosorg/illumos-omnios

Clone the relevant Tribblix repo

mkdir ${HOME}/Tribblix
cd ${HOME}/Tribblix
git clone https://github.com/tribblix/tribblix-build

The scripts need to know where the Tribblix repo(s) are checked out

export THOME=${HOME}/Tribblix

Create a build area with the gate checked out

mkdir ${HOME}/Illumos
cd ${HOME}/Illumos
git clone ${HOME}/Illumos-reference/illumos-gate

and if you want to build omnitribblix too

git clone ${HOME}/Illumos-reference/illumos-omnios omnitribblix

You need the closed tarballs

wget -c \
  https://download.joyent.com/pub/build/illumos/on-closed-bins.i386.tar.bz2 \
  https://download.joyent.com/pub/build/illumos/on-closed-bins-nd.i386.tar.bz2

Right, you're now ready to do a build.

cd illumos-gate

For a release build:

${THOME}/tribblix-build/illumos/releasebuild m20.6

For a debug build with a gcc7 shadow

${THOME}/tribblix-build/illumos/debugbuild

For a debug build without the gcc7 shadow

${THOME}/tribblix-build/illumos/debugbuild -q

For omnitribblix, which assumes (and requires) that you've done a vanilla gate build first:

cd ${HOME}/Illumos/omnitribblix
${THOME}/tribblix-build/illumos/omnibuild m20lx.6

And there you have it: you should have a beautiful, cleanly built gate.

This differs from my previous recipe in a couple of key ways:
  1. I no longer require you to build in a zone, although I would still recommend doing so if you want to use the system for other things. But as this is an AWS instance, we can dedicate it to gate building.
  2. It's far more scripted. If you want to see what it's really doing, look inside the releasebuild, debugbuild, and omnibuild scripts.
Next time, I'll cover how the files from the build are converted into packages.

Sunday, February 10, 2019

Thoughts on SPARC support in illumos

One interesting property of illumos is that its legacy stretches back decades - there is truly ancient code rubbing shoulders with the very modern.

An area where we have really old code is on SPARC, where illumos has support in the codebase for a large variety of Sun desktops and servers.

There's a reasonable chance that quite a bit of this code is currently broken. Not because it's fundamentally poor code (although it's probably fair to say that the code quality is of its time, and a lot of it is really old), but because it lives within an evolving codebase and hasn't been touched in the lifetime of illumos, and likely much longer. Not only that, but it probably makes assumptions about being built with the old Studio toolchain rather than with gcc.

What of this code is useful and worth keeping and fixing, and what should be dropped?

A first step in this was that I have recently removed support for starfire - the venerable Sun E10K. It seems extremely unlikely that anyone is running illumos on such a machine. Or indeed that anyone has them running at all - they're museum pieces at this point.

A similar, if rather newer, class of system is the starcat, the Sun F15K and variants. Again, it's big, expensive, requires dedicated controller hardware, and is unlikely to be the kind of thing anyone's going to have lying about. (And, if you're a business, there's no real point in trying to make such a system work - you would be much better off, both operationally and financially, in getting a current SPARC system.)

And if nobody has such a system, then not only is the code useless, it's also untestable.

The domained systems, like starfire and starcat, are also good candidates for removal because of the relative complexity and uniqueness of their code. And it's not as if the design specs for this hardware are out there to study.

What else might we consider removing (with starfire done and starcat a given)?

  1. The serengeti, Sun-Fire E2900-E6800. Another big blob of complex code.
  2. The lw8 (lightweight 8), aka the V-1280. This is basically some serengeti boards in a volume server chassis.
  3. Anything using Sbus. That would be the Ultra-2, and the E3000-E6000 (sunfire). There are also the socal, sf, and bpp drivers. One snag with removing the Ultra-2 is that it's used as the base platform for the newer US-II desktops, which link back to it.
  4. The olympus platform. That's anything from Fujitsu. One slight snag here is that the M3000 was quite a useful box and is readily available on eBay, and quite affordable too.
  5. Netra systems. (Specifically NetraCT - there's a US-IIi NetraCT, and two US-IIe systems, the NetraCT-40 and the NetraCT-60. Code names montecarlo and makaha, with something about Tonga too. Also the CP2300, aka snowbird.)
  6. Server blade. I'm talking the old B100s blade here.
  7. Binary compatibility with SunOS 4 - this is kernel support for a.out, and libbc.
I'm not saying at this point that all of this code and platform support will go, just that these are the potential candidates. For example, I regard support for the M3000 as useful, and definitely worth thinking about keeping.

What does that leave as supported? Most of the US-II and US-III desktops, most of the V-series servers, and pretty much all the early sun4v (T1 through T3 chips) systems. In other words, the sort of thing that you can pick up second hand fairly easily at this point.

Getting rid of code that we can never use has a number of benefits:

  • We end up with a smaller body of code, that is thus easier to manage.
  • We end up with less code that needs to be updated, for example to make it gcc7 clean, or to fix problems found by smatch, or to enable illumos to adopt newer toolchains.
  • We can concentrate on the code that we have left, and improve its quality.
If we put those together into a single strategy, the overall aim is to take illumos for SPARC from a large body of unknown, untested, and unsupportable code to a smaller body of properly maintained, testable, and supportable code. Reduce quantity to improve quality, if you like.

As part of this project, I've looked through much of the SPARC codebase. And it's not particularly pretty. One reason for attacking starfire was that I was able to convince myself relatively quickly that I could come up with a removal plan that was well-bounded - it was possible to take all of it out without accidentally affecting anything else. Some of the other platforms need a bit more analysis to tease out all the dependencies and complexity - bits of code are shared between platforms in a variety of non-obvious ways.

The above represents my thoughts on what would be a reasonable strategy for supporting SPARC in illumos. I would naturally be interested in the views of others, and specifically if anyone is actually using illumos on any of the platforms potentially on the chopping block.

Friday, February 08, 2019

SPARC and tod modules on illumos

Following up from removing starfire support from illumos, I've been browsing through the codebase to identify more legacy code that shouldn't be there any more.

Along the way, I discovered a little tidbit about how the tod (time of day) modules - the interface to the hardware clock - work on SPARC.

If you look, there are a whole bunch of tod modules, and it's not at all obvious how they fit together - they all appear to be candidates for loading, and it's not obvious how the correct one for a platform is chosen.

The mechanism is actually pretty simple, if a little odd.

There's a global variable in the kernel named:

tod_module_name

This can be set in several ways - for some platforms, it's hard-coded in that platform's platmod. Or it could be extracted from the firmware (OBP). That tells the system which tod module should be used.

And the way this works is that each tod module has _init code that looks like

if (tod_module_name is myself) {
   initialize the driver
} else {
   do nothing
}

so at boot all the tod modules get loaded, but only the one that matches the name set by the platform actually initializes itself.

Later in boot, there's an attempt to unload all modules. Similarly the _fini for each driver essentially does

if (tod_module_name is myself) {
   I'm busy and can't be unloaded
} else {
   yeah, unload me
}
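In real C, the pattern looks roughly like this - a sketch with a hypothetical module name "todexample", leaving out the part where the module points tod_ops at its own routines:

#include <sys/modctl.h>
#include <sys/systm.h>
#include <sys/errno.h>

extern char *tod_module_name;	/* set by the platmod or from the firmware */

static struct modlmisc modlmisc = {
	&mod_miscops, "tod module for example hardware"
};

static struct modlinkage modlinkage = {
	MODREV_1, (void *)&modlmisc, NULL
};

int
_init(void)
{
	if (strcmp(tod_module_name, "todexample") == 0) {
		/* the platform asked for us: hook up tod_ops here */
	}
	return (mod_install(&modlinkage));
}

int
_fini(void)
{
	/* the active tod module refuses to unload */
	if (strcmp(tod_module_name, "todexample") == 0)
		return (EBUSY);
	return (mod_remove(&modlinkage));
}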

So, when the system finishes booting, you end up with only one tod module loaded and functional, and it's the right one.

Returning to the original question, can any of the tod modules be safely removed because no platform uses them? To be honest, I don't know. Some have names that match the platform they're for. It's pretty obvious, for example, that todstarfire goes with the starfire platform, so it was safe to take that out. But I don't know the module names returned by every possible piece of SPARC hardware, so it isn't really safe to remove some of the others. (And, as a further problem, I know that at least one is only referenced in closed source, binary only, platform modules.)