Friday, December 31, 2021

The three strands of Information Technology

How are IT departments structured? I've seen it done in a variety of ways. It depends on the individual business, but over the years I've come up with a way to think about it.

When thinking about Information Technology (IT), it naturally splits into 3 separate strands:

IT for the business

This is the provision of facilities for HR, Finance, Sales, and the like; the basic facilities the organisation needs to operate as a business.

IT for the employee

This is the provision of the systems and tools employees need to be able to work at all: laptops, desktops, and mobile devices; communications systems such as telephony and email; and a way for staff to store and collaborate on documents.

IT for the customer

This is the provision of the services that your customers use, whether that's a product you sell in its own right, or a mechanism to sell other products.

The relative importance of these 3 strands depends on the nature of the business, of course. And very small organisations might not even have all 3 strands in any meaningful sense.

Structurally, there are two senior roles that an organisation might have: the CIO and the CTO. The natural layout is that the CIO looks after IT for the business and IT for the employee, while the CTO gets IT for the customer.

Splitting things this way works because the characteristics of the strands are quite different. The responsibilities of the CIO are inward-facing, those of the CTO are outward-facing. The work of the CIO is about managing standardised commodities, while the CTO's role is to provide differentiation. Polar opposites, in a way.

There's a third role, that of the CISO, responsible for information security. This is slightly different in that it cuts across all 3 strands. As such, if you have both a CIO and a CTO, it isn't entirely obvious which of the two, if either, should take on the CISO role.

Given the different nature of these 3 strands, where does the IT department (loosely defined as the people whose job is IT) fit? Should you even have one? The job requirements for the 3 strands are sufficiently different that having a separate IT team for each strand, rather than a central IT department, makes an awful lot of sense, with each team reporting to the CIO or CTO as appropriate. In particular, having a product developed in the CTO part of the organisation and then thrown over the wall to be run by an operations team in the CIO organisation is an organisational antipattern that never made any sense, and it was a major driver for DevOps.

Thus, when structuring the delivery of IT in an organisation, the divergent needs of the 3 different IT strands ought to be taken into account. The worst case is a single department that standardises on the same solution to deliver all 3 strands - standardisation is a common refrain of management, but what it really means here is that at least 2 strands (if not all 3) are delivered in a sub-standard way, often in a way that's actually completely unsuitable.

There is one central IT function that does cut across all 3 strands, in the same way that the CISO does at the management level: a compliance function or security office. But for most other functions, you're really looking at providing distinct deliveries for each strand.

Wednesday, December 22, 2021

The cost of cloud

Putting your IT infrastructure into the cloud seems to be the "in" thing. It's been around for a while, of course. And, like most things related to IT, there are tradeoffs to be made.

My rough estimate is that the unit cost of provisioning a service on AWS is about 3 times what it costs a competent IT organization to provide a similar service in house. Other people have arrived at the same number, and it hasn't really changed much over the last decade. (If you don't think 3x is right, consider what AWS' gross margin is.)
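
To put rough numbers on that (these are purely illustrative figures, not actual AWS prices or my measurements), here's the arithmetic behind the gross margin hint:

    # Purely illustrative arithmetic; the $100 in-house unit cost is made up.
    in_house_cost = 100.0            # what a competent in-house operation pays per unit
    aws_price = 3 * in_house_cost    # the rough 3x figure quoted above

    # If AWS's underlying costs were broadly comparable to those of a
    # competent in-house operation, a 3x price implies a gross margin
    # of about two thirds.
    gross_margin = (aws_price - in_house_cost) / aws_price
    print(f"implied gross margin: {gross_margin:.0%}")  # -> 67%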

Some services offered by AWS deviate from that simple 3x formula. The two obvious ones are network costs, which as Cloudflare have argued are many times higher than you would expect, and S3, which you're going to struggle to beat. (Although if you're going to use S3 as a distribution site then the network costs will get you; consider Wasabi for that.)

And yet, many organizations move to the cloud to "save money". I'm going to ignore the capex versus opex part of that, and simply note that many IT organizations have in-house operations that are neither very efficient nor cost-effective. In particular, traditional legacy IT infrastructures are ridiculously overpriced. (If you're using commercial virtualization platforms and/or SAN storage, you're overpaying by as much as a factor of 10, and getting an inferior product into the bargain. So while many organizations could save a huge amount of money by moving to the cloud, they could save even more by running their internal operations better.)

Often the cost saving associated with a migration - not just cloud, this applies to other transitions too - comes about not because the new solution is cheaper, but because a migration gives a business leverage to introduce better practices. Practices that, if used for your on-premise deployments, would save far more than the cloud ever could. Sometimes, you need to do an end run round an entrenched legacy IT empire.

Another consideration is that the cloud has often been touted as something where you pay for what you use, which isn't always quite correct. For many services, you pay for what you configure. And some services are nowhere near as elastic as you might wish.

Capacity planning doesn't go away either; if anything, it's more important to get the sizing right. And while you can easily buy more capacity, you have to ensure you have the financial capacity to pay the bills.

Note that I'm not saying you should always run your systems on-premise, nor that it will always be cheaper.

Below a certain scale, doing it yourself isn't financially beneficial. There's a minimum configuration of infrastructure you need in order to get something that works, and many small organizations have needs below that. But generally, the smaller providers are likely to be a better option in that case than full-on cloud offerings.

Having the operational capability to support your infrastructure is also crucial. If you're going to support your own hardware, you really need a team, which is going to set a minimum scale at which operations are worthwhile.

This becomes even more true if you need to deploy globally. It's difficult to do that in-house with a small team, and you have to be pretty large to be able to staff multiple teams in different geographies. A huge advantage of using the cloud for this is that you can deploy into pretty much any location without driving your costs insane. Who wants to hire a full team in every country you operate in? And operationally, it's the same wherever you go, which makes things a lot easier.

In recent times, the Coronavirus pandemic has also had an impact. End user access to colocation facilities has been restricted - we've been able to do repairs recently, but we've had to justify any datacenter visits as essential.

There are certain workloads that are well matched to the cloud, of course. Anything highly variable, with spikes above 3x the background, will be cheaper in the cloud, where you can deploy capacity just for the spike, than in house, where you either overprovision for peak load or accept that there's a spike you can't handle.
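
As a back-of-the-envelope sketch of that break-even (the 3x premium, the load profile, and all the numbers here are assumptions, not measurements): on premise you pay for peak capacity around the clock, while in the cloud you pay a higher unit rate but only for the capacity you actually run.

    # A minimal cost model, assuming cloud units cost 3x on-prem units and
    # that on-prem has to be provisioned for peak load. All numbers invented.
    UNIT = 1.0              # hypothetical on-prem cost per unit-hour
    CLOUD_UNIT = 3 * UNIT   # the rough 3x cloud premium discussed earlier

    background = 10         # units needed most of the time
    peak = 50               # units needed during the spike (5x background)
    spike_hours = 24        # hours per month the spike lasts
    hours = 730             # hours in a month

    # On-prem: provisioned for peak, running all the time.
    on_prem = peak * hours * UNIT
    # Cloud: background all month, extra capacity only during the spike.
    cloud = (background * hours + (peak - background) * spike_hours) * CLOUD_UNIT

    print(f"on-prem: {on_prem:.0f}  cloud: {cloud:.0f}")
    # With a 5x spike for a day a month, the cloud wins comfortably;
    # flatten the load and the 3x premium dominates instead.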

The cloud is also great for experimentation. You can try any number of memory and CPU configurations to see what works well. Much easier than trying to guess and buying equipment that isn't optimal. (This sort of sizing exercise is far less relevant if you have decent virtualization like zones.)

You can even spin up a range of entirely different systems. I do this when testing: just run each of a whole range of Linux distros for an hour or so.
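
A sketch of that sort of experiment using boto3 (the AMI ID is a placeholder, and the instance types are just examples of different CPU/memory shapes, not recommendations):

    # A rough sketch using boto3; the AMI ID is a placeholder.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch one instance of each candidate shape, tagged so they're
    # easy to find and terminate after the test run.
    for itype in ["t3.medium", "m5.large", "r5.xlarge"]:
        ec2.run_instances(
            ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder AMI
            InstanceType=itype,
            MinCount=1,
            MaxCount=1,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "purpose", "Value": "sizing-test"}],
            }],
        )
    # Run the workload on each shape for an hour or so, compare, terminate.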

What the above cases show is that even if the unit cost of cloud resources is high, the cloud gives you more of an opportunity to optimize the number of units you consume. And when it comes to scaling, this means the ability to scale down is far more important than the ability to scale up.

I use AWS for a lot of things, but I strongly regard the cloud as just another tool, to be used as occasion demands, rather than because the high priests say you should.

Monday, December 20, 2021

Keeping Java alive on illumos

Back in 2019, a new JEP (JDK Enhancement Proposal) appeared.

JEP 362: Deprecate the Solaris and SPARC Ports

Of course, for those of us running Solaris or illumos (which is the same platform as far as Java is concerned), this was a big deal. Losing support for a major language on the platform was potentially a problem.

The stated reason for removal was:

Dropping support for these ports will enable contributors in the OpenJDK Community to accelerate the development of new features that will move the platform forward.

Clearly, this reflected a belief that maintaining Solaris and/or SPARC was a millstone dragging Java down. Still, it's their project, and they can make whatever decisions they like, whatever those of us who thought it was a bad move might say.

Eventually, despite objections, the ports were removed towards the end of the JDK 15 cycle.

At which point I simply carried on building OpenJDK. All I did was take the patch from the commit that removed Solaris support, apply it backwards, and add on top the pkgsrc patches that Jonathan Perkin had originally developed to support a gcc port on Solaris and illumos - patches we had already been using extensively from JDK 11 onwards.
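
In concrete terms, the one-off setup looks something like this sketch (the file names are hypothetical; I'm assuming the removal commit has been saved out as a patch file):

    # A sketch of reconstructing the tree; file names are hypothetical.
    import subprocess

    # Re-instate the deleted Solaris code by applying the removal
    # commit in reverse.
    subprocess.run(["git", "apply", "-R", "solaris-removal.patch"], check=True)

    # Then layer the pkgsrc-derived gcc/illumos patches on top.
    subprocess.run(["git", "apply", "illumos-gcc.patch"], check=True)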

At that point I wasn't quite sure how sustainable this was. My aim was to support it as long as it wasn't proving too onerous or difficult, and my most optimistic hope was that we might be able to get to Java 17 which was planned as the next LTS release.

The modus operandi was really very simple. Every week a new tag is created upstream. Download the tag, apply the patches, fix any errors in the patch set, try a build, and hopefully fix any problems that break it.

Rinse and repeat, every week. The idea is that by doing it every week, it's a relatively small and manageable set of changes each time. Some weeks, it's just line number noise in the patches. Other weeks, it could be a more significant change. By loitering on the mailing lists, you become aware of what changes are coming up, which gives you a good idea of where to look when the build breaks.
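
Mechanically, the weekly routine amounts to something like the following sketch (the tag, paths, and patch directory names are hypothetical, and any rejected hunks still get fixed up by hand before the build):

    # A sketch of the weekly rebuild; tag and paths are hypothetical.
    import glob
    import subprocess

    TAG = "jdk-19+2"                      # the week's new upstream tag
    SRC = f"/build/{TAG}"
    PATCHES = "/build/jdk19-patches"      # per-release patch directory

    # Fetch the newly tagged source.
    subprocess.run(["git", "clone", "--depth", "1", "--branch", TAG,
                    "https://github.com/openjdk/jdk.git", SRC], check=True)

    # Apply the illumos patch set; rejects here mean hand-editing.
    for p in sorted(glob.glob(f"{PATCHES}/*.patch")):
        subprocess.run(["gpatch", "-p1", "-d", SRC, "-i", p], check=True)

    # Then try a build and see what breaks.
    subprocess.run(["bash", "configure"], cwd=SRC, check=True)
    subprocess.run(["gmake", "images"], cwd=SRC, check=True)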

Along the way, I've been cleaning up the patches to eliminate the SPARC code (you could put it back, but it's not a focus of this project) and most of the code to support the Studio toolchain (the version of Studio needed to build current Java isn't compatible with illumos anyway). So what we're left with is a straightforward Solaris/illumos+gcc port.

Most of the code changes I've needed to make are fairly straightforward procedural changes. Some functions have moved to different namespaces. Some function signatures have changed. There's also been a lot of work to consolidate a number of functions into common posix code, rather than have each OS provide a different implementation that might diverge and become hard to maintain.

Most of this was pretty simple. The only one that caused me a significant amount of work was the signal handling rewrite, which took several attempts to get to work at all.

And it's become fairly routine. Java 17 came along, eventually, and the builds were still succeeding and basic smoke-testing worked just fine. So, illumos has Java 17 available, just as I had hoped.

I originally packaged the builds on Tribblix, of course, which is where I'm doing the work. But I've also dropped tarballs of occasional builds so they can be downloaded and used on other illumos distributions.

Actually, the idea of those builds isn't so much that they're useful standalone; rather, they provide a bootstrap JDK that you can use to build Java yourself. That, given the bootstrap JDK and my java patches, ought to be fairly straightforward. (There's a separate patch directory for each jdk release - the directory name ought to be obvious.)

Which means that if you want Java 17 on OmniOS, you can have exactly that - it's built and packaged ready for you. Not only that, Dominik fixed some problems with my signal handling changes so that they work properly and without errors, which benefits everyone.

It doesn't stop there. In addition to the stream of quarterly updates (JDK 17, being an LTS release, will see these for some time yet), work is continuing on mainline. JDK 18 works just fine, and as it's ramping down for release it shouldn't have any breaking changes, so that's another release supported. I'm building JDK 19 as well, although that's only about one build in, so it hasn't really had any significant changes put into it yet.

The fact that a relatively unskilled developer such as myself can maintain an out of tree Java port for a couple of years, tracking all the upstream changes, does make you wonder if supporting Solaris was really that much of a blocker to progress. At the time my belief was that it wasn't Solaris support that was the problem, but the Studio toolchain, and I think that's been borne out by my experience. Not only that, but the consolidation and simplification of the various OS-specific code into common posix code shows that supporting a variety of modern operating systems really isn't that hard.