Tuesday, April 23, 2019

Setting up replicated PostgreSQL on Tribblix

When you're building systems, it's nice to build in some level of resilience. After all, failures will happen.

So, we use PostgreSQL quite a bit. We actually use a fairly traditional replication setup - the whole of the data is pushed using zfs send and receive to a second system. Problem at the source? We just turn on the DR site, and we're done.

One of the reasons for that fairly traditional approach is that PostgreSQL has, historically, not had much built in support for replication. Or, at least, not in a simple and straightforward manner. But it's getting a lot better.

Many of the guides you'll find are rather dated, and show old, rather clunky, and quite laborious ways to set up replication. With current versions of PostgreSQL it's actually pretty trivial to get streaming replication running, so here's how to demo it if you're using Tribblix.

First set up a couple of zones. The idea is that pg1 is the master, pg2 the replica. Run, as root:

zap create-zone -z pg1 -t whole -o base -x 192.168.0.221 -U ptribble

zap create-zone -z pg2 -t whole -o base -x 192.168.0.222 -U ptribble

These simply create fairly basic zones, without much in the way of extraneous software installed. Adjust the IP addresses to suit, of course. And I've set them up so that I can use zlogin from my own account.

Then login to each zone, install postgres, and create a user.

zlogin pg1
zap install TRIBblix-postgres11 TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir /export/home/pguser
chown -hR pguser /export/home/pguser

And the same for pg2.

Then log in to the master as pguser.

zlogin -l pguser pg1

Now initialise a database, and start it up:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/initdb -E UTF8 -D ~/db
env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db


The next thing to do is create a PostgreSQL user that will run the streaming replication.

/opt/tribblix/postgres11/bin/psql -d postgres
CREATE ROLE replicate WITH REPLICATION LOGIN ;
set password_encryption = 'scram-sha-256';
SET
\password replicate
Enter new password: my_secret_password

Then you need to edit postgresql.conf (in the db directory) with the following settings:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 3 # or whatever
wal_keep_segments = 64 # or whatever
hot_standby = on


And set up authentication so that the user you just created can actually access the database remotely, by adding the following line to pg_hba.conf:

host   replication   replicate     192.168.0.0/24    scram-sha-256

Ideally we would use hostssl so the connection is encrypted, but that's out of scope for this example.

Then restart the master.
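
If you started postgres in the foreground as above, you can simply stop it and start it again; alternatively pg_ctl will do the restart for you - a minimal sketch, assuming the same ~/db data directory:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/pg_ctl restart -D ~/db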

Now log in to the slave.

zlogin -l pguser pg2

And all you have to do to replicate the data is run pg_basebackup:

/opt/tribblix/postgres11/bin/pg_basebackup \
  -h 192.168.0.221 -c fast -D ~/db -R -P \
  -U replicate --wal-method=stream

It will prompt you for the super secret password you entered earlier. Once that's completed you can start the slave:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

And that's it. Data you insert on the master will be written to the slave. In this mode, you can connect to the replica and issue queries to read the data, but you can't change anything.
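
A quick way to convince yourself it's working (the repltest table here is just an example) is to create some data on the master, pg1:

/opt/tribblix/postgres11/bin/psql -d postgres \
  -c "CREATE TABLE repltest (msg text)" \
  -c "INSERT INTO repltest VALUES ('hello')"

and then query it on the replica, pg2, where pg_is_in_recovery() should also return true:

/opt/tribblix/postgres11/bin/psql -d postgres \
  -c "SELECT msg FROM repltest" \
  -c "SELECT pg_is_in_recovery()"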

Note that this doesn't do anything like manage failover or send client connections to the right place. It just makes sure the data is available. You need an extra management layer on top of the master and slave(s) to actually get HA.

There are a whole variety of ways to do that, but the simplest way to test it is to stop the database on the master, which will make the slave unhappy. But if you know the master isn't coming back, you can promote the replica:

/opt/tribblix/postgres11/bin/pg_ctl promote -D ~/db

and it will spring into action, and you can point all your clients to it. The old master is now useless to you; the only course of action you have at this point is to wipe it, and then create a new replica based on the new master.

Tuesday, April 09, 2019

A teeny bug in jkstat char handling

While messing about with illuminate, I noticed an interesting oddity in the disk display:

[screenshot: the illuminate disk display, with "Revision" tacked on to the end of the product string]

See that "Revision" on the end of the product string? It shouldn't be there, and iostat -En doesn't show it. This display comes from my JKstat code, so where have I gone wrong?

This comes from the sderr kstat, which is a named kstat of the device_error class.

A named kstat is just a map of keys and values. The key is a string; the value is a union, so you need to know what type the data is in order to interpret the bits (the data type of each entry is stored in the kstat, so that's fine).
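
You can see these name/value pairs from the shell with the kstat utility; for example, to pick out just the Product entries from the device_error class (the exact kstat names will depend on the disks in your system):

kstat -p -c device_error | grep Product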

For that Product field, it's initialized like this:

kstat_named_init(&stp->sd_pid, "Product", KSTAT_DATA_CHAR);

OK, so it's of type KSTAT_DATA_CHAR. The relevant entry in the union here is value.c, which is actually defined as a char c[16] - the field is 16 characters in size, long enough to hold up to 128-bit ints - most of the numerical data doesn't take that much space.

(For longer, arbitrary length strings, you can stick a pointer to the string in that union instead.)

Back to iostat data. For a SCSI device (something using the sd driver), the device properties are set up in the sd_set_errstats() function in the sd driver. This does a SCSI enquiry, and then copies the Product ID straight out of the right part of the SCSI enquiry string:

strncpy(stp->sd_pid.value.c, un->un_sd->sd_inq->inq_pid, 16);

(If you're interested, you can see the structure of the SCSI enquiry string in the /usr/include/sys/scsi/generic/inquiry.h header file. The inq_pid comes from bytes 16-31, and is 16 bytes long.)

You can see the problem. The strncpy() just copies 16 bytes into a character array that's 16 bytes long. It fits nicely, but there's a snag - because it fits exactly, there's no trailing null!

The problem with JKstat here is that it is (or was, anyway) using NewStringUTF() to convert the C string into a Java String, and that doesn't have any concept of length associated with it. So it starts from the pointer to the beginning of the c[] array, and keeps going until it finds the null to terminate the string.

And if you look at the sd driver, the Revision entry comes straight after the product entry in memory, so what JNI is doing here is reading past the end of the Product value, carrying on until it finds the null at the end of the next name, "Revision", and taking the whole lot. It is, I suppose, fortunate that there is something vaguely sensible for it to find.

There doesn't appear to be a way of doing the right thing in JNI itself; the fix has to be to copy the correct amount of the value into a temporary string that does have the trailing null added.

(And all the system tools written in C are fine, because they do have a way to just limit the read to 16 characters.)

Monday, April 01, 2019

Notes on web servers and client certificates

With https, web servers have digital certificates to encrypt and authenticate traffic.

Web servers can also require clients to present a valid certificate, which could be used for authentication and identity.

I've recently had the misfortune to end up delving into this, so here are some notes on diagnosing and testing this from a client perspective. One of the problems here is that this all takes place before anything http-related happens, so normal diagnostic techniques are useless - and there won't be anything logged on either the server or client side to work from.

So, rather than trying to work out the exact commands next time, this is an aide-memoire. And hopefully might be useful to others too.

The first thing is to work out whether a server expects a client certificate or not. (If you can get in without, then obviously it's not requiring one, but there are other ways connections can fail.)

Fortunately, openssl can initiate the connection, allowing you to see exactly what's going on:

openssl s_client -showcerts \
  -servername example.com \
  -connect example.com:443

Note that you'll usually need the -servername flag here, to stick on the SNI header, otherwise the server or load balancer or proxy won't know what to do with it.

In addition to printing out all the server certificates and some other diagnostics, this will tell you what, if any, client certificates are required. If none are expected, there will be a section that looks like:

---
No client certificate CA names sent
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---

If the server wants a client certificate, then it will tell you what certificates it wants:

---
Acceptable client certificate CA names
/CN=My Client CA
Client Certificate Types: RSA sign, DSA sign, ECDSA sign
Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---

The important thing here is the "/CN=My Client CA" - the server is telling you that it wants a certificate signed by that Certificate Authority.
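
If you already have a candidate certificate, you can check whether its issuer matches what the server is asking for - using the same mycert.crt file as in the examples below:

openssl x509 -in mycert.crt -noout -subject -issuer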

Assuming you have such a certificate, how do you send it? Browsers have ways to import it, but it's often easier to diagnose it from the CLI, using curl or wget. You'll need both the certificate and the key, and the syntax is:

wget --certificate=mycert.crt --private-key=mycert.key \
  https://example.com/

or

curl --cert mycert.crt --key mycert.key \
  https://example.com/

Assuming the certificate you have is signed by the relevant CA, this will allow you to retrieve the page. If there's a problem, you might get some meaningful diagnostics.

The other thing is how to load those into a browser. Normally you have something like a .p12 file with both parts bundled. You can create one of those like so:

openssl pkcs12 -export -out mycert.p12 \
  -in mycert.crt -inkey mycert.key

You'll get asked for a passphrase to protect the .p12 file - it contains the key, so needs to be protected in transit.
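
To check what ended up in the bundle, openssl can print a summary of the .p12 file (it will ask for that passphrase so it can verify the contents):

openssl pkcs12 -info -in mycert.p12 -noout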

You can also extract the key and certificate from a .p12 file:

openssl pkcs12 -in mycert.p12 -nokeys -out mycert.crt
openssl pkcs12 -in mycert.p12 -nocerts -out mycert.key
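
Note that the key extracted this way will itself be protected with a passphrase; if you want it unencrypted - say, to feed straight to curl or wget as above - add the -nodes flag (newer OpenSSL releases call it -noenc):

openssl pkcs12 -in mycert.p12 -nocerts -nodes -out mycert.key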

Friday, March 22, 2019

Creating illumos packages for Tribblix

In the prior article in this series, I discussed how to build illumos-gate (and it applies to illumos-omnios too), using a Tribblix AMI on AWS.

After that, you'll have two directories of interest under illumos-gate.
  1. The proto area, specifically proto/root_i386, that is a fully installed copy of illumos.
  2. Under packages, an on-disk IPS package repository.
If you're using IPS, you could use that repository as is. Tribblix, however, uses SVR4 packages. So the next step is to convert the contents of the IPS repository into a set of SVR4 packages.

For this, there is an additional github repo you'll need to check out, in addition to the ones from the previous article.

cd ${HOME}/Tribblix
git clone https://github.com/tribblix/tribblix-transforms

What is this repo? It's a list of transformations that are applied to the package conversion process. Generally, rather than modify the gate or the build, if there's something I don't want to ship, or something I want to move between packages, or a file I want to change, then it gets transformed at the packaging stage.

The other thing you'll need to use my scripts is a signing certificate. In Tribblix, the illumos ELF objects are digitally signed (yes, I ought to extend this to all ELF objects I ship). Nothing actually uses this yet, but it's all there ready for when verification is required.

To create a signing key and certificate:

mkdir ${HOME}/f
cd ${HOME}/f
openssl req -x509 -newkey rsa:2048 -nodes \
  -subj "/O=MyOrganization/CN=my.domain.name" \
  -keyout elfcert.key -out elfcert.crt -days 3650

Of course, choose the subject to match your own requirements.

Then you can run the scripts in the tribblix-build repo to generate SVR4 packages. These parse the IPS manifests, extract the files from the IPS repo, create the SVR4 prototype file, and automatically generate SVR4 install scripts from the IPS metadata.

So, to create packages, run the following as root:

/illumos/Tribblix/tribblix-build/repo_all.sh \
  -T /illumos/Tribblix \
  -G /illumos/Illumos/illumos-gate \
  -S /illumos/f/elfcert

Of course, replace /illumos with wherever your build user's home directory is. (You can't use $HOME, because this is run as root where $HOME is likely to be different to that of the user who did the build.) The -T flag tells it where the Tribblix tools are checked out, -G where the gate was built, and -S where the signing certificate is. If you want to change the version of the packages you generate, there's a -V flag as well. I normally redirect the output to a log file, as it's quite verbose.

Wait a while, and you'll have a set of packages under /var/tmp/illumos-pkgs. Hopefully the pkgs directory there will have your packages, and the build and tmp directories will be empty (if the build directory isn't, it will have a half-created package or packages in it, so you'll be able to see which package or packages failed).

For omnitribblix it's very similar:

/illumos/Tribblix/tribblix-build/omni_all.sh \
  -T /illumos/Tribblix \
  -G /illumos/Illumos/omnitribblix \
  -S /illumos/f/elfcert

And the packages in this case end up in /var/tmp/omni-pkgs.

If you look, you'll see that there are 3 files for each package:

  • A .pkg file, which is in SVR4 datastream format
  • A .zap file, which is the (compressed) zap format Tribblix uses
  • An md5 checksum, which is used for basic validation in the package catalog
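
If you want a quick sanity check of the output, the regular SVR4 tools can read the datastream files directly; this little loop, for example, just prints the summary information for each generated package:

for p in /var/tmp/illumos-pkgs/pkgs/*.pkg
do
  pkginfo -d "$p"
done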

I then use those generated SVR4 packages to populate the package repository, and to build the Tribblix ISOs. The details of that will have to wait for another time.

Sunday, March 10, 2019

Tweaking the Tribblix installer

In the latest update to Tribblix, I've made a couple of minor tweaks to the installer.

The first is that compression (lz4) is now enabled on the root pool by default. It was possible in the past to give the -C flag to the installer, but now it's always on. There was a time in the past (the distant past) when enabling gzip compression could really hurt performance with some workloads, but the world has moved on, so enabling compression by default is a good thing.

The compression factor you see on something like the AWS image is:

NAME                 PROPERTY       VALUE  SOURCE
rpool                compressratio  1.92x  -
rpool/ROOT           compressratio  1.92x  -
rpool/ROOT/tribblix  compressratio  1.92x  -
rpool/export         compressratio  1.00x  -
rpool/export/home    compressratio  1.00x  -
rpool/swap           compressratio  2.04x  -

It's not bad, getting a factor of almost 2 essentially for free.
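
For reference, that table is just the output of asking zfs for the compressratio property, so it's easy to check on your own system:

zfs get -r compressratio rpool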

The second change is to partitioning. Traditionally, the root pool was created in a partition rather than the whole disk. You made the partition span the whole disk, but the disk was still partitioned in the traditional way.

As of m20.6, if you use -G instead of -B to the installer

./live_install.sh -G c1t0d0 [overlays...]

then it will create a whole-disk pool with an EFI System partition to support booting systems with UEFI firmware. (As it says in the zpool man page.) Good for newer systems, but it brings another key capability.

For EC2 instances, this allows the rpool to be expanded. The base size is still 8G, but you can choose a different (larger) size for the EBS device when creating the instance.

When you initially log in, you still get the original size:

NAME    SIZE  ALLOC   FREE
rpool  7.50G   301M  7.21G

but you can use the following command:

zpool online -e rpool c2t0d0

and it magically expands to use the available space:

NAME    SIZE  ALLOC   FREE
rpool  11.5G   302M  11.2G

At the moment, this has to be done manually, but in future this expansion will happen automatically when the instance boots.

You can, of course, increase the size of an EBS volume while the instance is running. This takes a little while (the State is shown as "in-use - optimizing" while the expansion takes place). Unfortunately it appears at the moment that you need a reboot for the instance to rescan the devices so it knows there's more space.

Wednesday, February 27, 2019

Building illumos-gate on AWS (2019 version)

I've covered building illumos on AWS before, but the instructions there are a bit out of date. Of course, these will become outdated too, in time, but should still be useful.

First spin up an EC2 instance, the AMI you want to use for this is "Tribblix-m20.5", ami-7cf2181b in London. (If you want to use a different region, then you'll have to copy the AMI.)

For instance size, try an m4.2xlarge - a t2.micro really won't cut it; you won't even be able to install the packages. With an m4.2xlarge you're looking at 30-45 minutes to build illumos, depending on whether debug is enabled.

Add an external EBS device of at least 8G, there isn't enough space on the AMI's root partition to handle the build. When you add it, attach it at /dev/sdf (there's no real reason for this, but that will match the zpool create below).

Once it's booted up (I assume you know about security groups and keypairs, so you can ssh in as root), then the first thing you need to do is update the image:

zap refresh
zap update-overlay -a

Install the illumos-build overlay and tweak the install

zap install-overlay illumos-build
rm -f /usr/bin/cpp
cd /usr/bin ; ln -s ../gnu/bin/xgettext gxgettext
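
If you're not sure what the attached EBS volume is called on the illumos side, format will list the available disks - that's where the c2t5d0 in the zpool create below comes from, although yours may differ:

format </dev/null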

Create a user and a storage pool for them to use

zpool create -O compression=lz4 illumos c2t5d0
useradd -g staff -d /illumos -s /bin/bash illumos
chown illumos /illumos
passwd illumos

Log in as illumos, and clone the gate. What I do here is have a reference copy of the gate, and then I can clone that very quickly locally every time I want to do a build.

mkdir ${HOME}/Illumos-reference
cd ${HOME}/Illumos-reference
git clone https://github.com/illumos/illumos-gate

If you want to create omnitribblix too

git clone https://github.com/omniosorg/illumos-omnios

Clone the relevant Tribblix repo

mkdir ${HOME}/Tribblix
cd ${HOME}/Tribblix
git clone https://github.com/tribblix/tribblix-build

The scripts need to know where the Tribblix repo(s) are checked out

export THOME=${HOME}/Tribblix

Create a build area with the gate checked out

mkdir ${HOME}/Illumos
cd ${HOME}/Illumos
git clone ${HOME}/Illumos-reference/illumos-gate

and if you want to build omnitribblix too

git clone ${HOME}/Illumos-reference/illumos-omnios omnitribblix

You need the closed tarballs

wget -c \
  https://download.joyent.com/pub/build/illumos/on-closed-bins.i386.tar.bz2 \
  https://download.joyent.com/pub/build/illumos/on-closed-bins-nd.i386.tar.bz2

Right, you're now ready to do a build.

cd illumos-gate

For a release build:

${THOME}/tribblix-build/illumos/releasebuild m20.6

For a debug build with a gcc7 shadow

${THOME}/tribblix-build/illumos/debugbuild

For a debug build without the gcc7 shadow

${THOME}/tribblix-build/illumos/debugbuild -q

For omnitribblix, which assumes (and requires) that you've done a vanilla gate build first

cd ${HOME}/Illumos/omnitribblix
${THOME}/tribblix-build/illumos/omnibuild m20lx.6

And there you have it - you should have a beautiful, cleanly built gate.

This differs from my previous recipe in a couple of key ways:
  1. I no longer require you to build in a zone, although I would still recommend doing so if you want to use the system for other things. But as this is an AWS instance, we can dedicate it to gate building.
  2. It's far more scripted. If you want to see what it's really doing, look inside the releasebuild, debugbuild, and omnibuild scripts.

Next time, I'll cover how the files from the build are converted into packages.

Sunday, February 10, 2019

Thoughts on SPARC support in illumos

One interesting property of illumos is that its legacy stretches back decades - there is truly ancient code rubbing shoulders with the very modern.

An area where we have really old code is on SPARC, where illumos has support in the codebase for a large variety of Sun desktops and servers.

There's a reasonable chance that quite a bit of this code is currently broken. Not because it's fundamentally poor code (although it's probably fair to say that the code quality is of its time, and a lot of it is really old), but because it lives within an evolving codebase and hasn't been touched in the lifetime of illumos, and likely much longer. Not only that, but it probably makes assumptions about being built with the old Studio toolchain rather than with gcc.

What of this code is useful and worth keeping and fixing, and what should be dropped?

As a first step, I have recently removed support for starfire - the venerable Sun E10K. It seems extremely unlikely that anyone is running illumos on such a machine. Or indeed that anyone has them running at all - they're museum pieces at this point.

A similar, if rather newer, class of system is the starcat, the Sun F15K and variants. Again, it's big, expensive, requires dedicated controller hardware, and is unlikely to be the kind of thing anyone's going to have lying about. (And, if you're a business, there's no real point in trying to make such a system work - you would be much better off, both operationally and financially, in getting a current SPARC system.)

And if nobody has such a system, then not only is the code useless, it's also untestable.

The domained systems, like starfire and starcat, are also good candidates for removal because of the relative complexity and uniqueness of their code. And it's not as if the design specs for this hardware are out there to study.

What else might we consider removing (with starfire done and starcat a given)?

  1. The serengeti, Sun-Fire E2900-E6800. Another big blob of complex code.
  2. The lw8 (lightweight 8), aka the V-1280. This is basically some serengeti boards in a volume server chassis.
  3. Anything using Sbus. That would be the Ultra-2, and the E3000-E6000 (sunfire). There's also the socal, sf, and bpp drivers. One snag with removing the Ultra-2 is that it's used as the base platform for the newer US-II desktops, which link back to it.
  4. The olympus platform. That's anything from Fujitsu. One slight snag here is that the M3000 was quite a useful box and is readily available on eBay, and quite affordable too.
  5. Netra systems. (Specifically NetraCT - there's a US-IIi NetraCT, and two US-IIe systems, the NetraCT-40 and the NetraCT-60. Code names montecarlo and makaha (something about Tonga too). Also CP2300 aka snowbird.)
  6. Server blade. I'm talking the old B100s blade here.
  7. Binary compatibility with SunOS 4 - this is kernel support for a.out, and libbc.

I'm not saying at this point that all of this code and platform support will go, just that these are the potential candidates. For example, I regard support for the M3000 as useful, and definitely worth thinking about keeping.

What does that leave as supported? Most of the US-II and US-III desktops, most of the V-series servers, and pretty much all the early sun4v (T1 through T3 chips) systems. In other words, the sort of thing that you can pick up second hand fairly easily at this point.

Getting rid of code that we can never use has a number of benefits:

  • We end up with a smaller body of code, that is thus easier to manage.
  • We end up with less code that needs to be updated, for example to make it gcc7 clean, or to fix problems found by smatch, or to enable illumos to adopt newer toolchains.
  • We can concentrate on the code that we have left, and improve its quality.

If we put those together into a single strategy, the overall aim is to take illumos for SPARC from a large body of unknown, untested, and unsupportable code to a smaller body of properly maintained, testable, and supportable code. Reduce quantity to improve quality, if you like.

As part of this project, I've looked through much of the SPARC codebase. And it's not particularly pretty. One reason for attacking starfire was that I was able to convince myself relatively quickly that I could come up with a removal plan that was well-bounded - it was possible to take all of it out without accidentally affecting anything else. Some of the other platforms need a bit more analysis to tease out all the dependencies and complexity - bits of code are shared between platforms in a variety of non-obvious ways.

The above represents my thoughts on what would be a reasonable strategy for supporting SPARC in illumos. I would naturally be interested in the views of others, and specifically if anyone is actually using illumos on any of the platforms potentially on the chopping block.