Monday, November 21, 2022

A decade of Tribblix

I seem to have just missed the anniversary, but it turns out that Tribblix has existed for slightly over a decade.

The initial blog post on Building Tribblix was published on October 24th, 2012. But the ISO image (milestone 0) was October 21st, and it looks like the packages were built on October 4th. So there's a bit of uncertainty about the actual date, and I had been playing around with some of the bits and pieces for a while before that.

There have been a lot of releases. We're now on Milestone 28, but there have been several update releases along the way, so I make it 42 distinct releases in total. That doesn't include the LX-enabled OmniTribblix variant (there have been 20 of those by the way).

The focus (given hardware availability) has been x86, naturally. But the SPARC version has seen occasional bursts of life. Now that I have a decent build system, it's catching up. Will there be an ARM version? Who knows...

Over the years there have been some notable highlights. It took a few releases to become fully self-hosting; package management had to be rebuilt; LibreOffice was ported; Xfce and MATE were added as fully functional desktop offerings (along with a host of others); a whole family of zones appeared, including a reimplementation of the traditional sparse root; it was made available on clouds like AWS and Digital Ocean; network install via iPXE arrived; huge numbers of packages were built (it's never-ending churn); and Java has been maintained by default.

And it's been (mostly) fun. Here's to the next 10 years!

Sunday, November 20, 2022

TREASURE - The Remote Execution and Access Service Users Really Enjoy

Many, many years ago I worked on a prototype of a software ecosystem I called TREASURE - The Remote Execution and Access Service Users Really Enjoy.

At the time, I was running the infrastructure and application behind an international genomics service. The idea was that we could centrally manage all the software and data for genomic analysis, provide high-end compute and storage capability, and amortize the cost across 20,000 academics so that individual researchers didn't have to maintain it all individually.

Originally, access was via telnet (I did say it was a long time ago). After a while we enabled X11, so that graphical applications would work (running X11 directly across the internet was fun).

Then along came the web. One of my interesting projects was to write a web server that would run with the privileges of the authenticated user. (This was before apache came along, by the way!) And clearly a web browser might be able to provide a more user-friendly and universal interface than a telnet prompt.

We added VNC as well (it came out of Cambridge and we were aware of it well before it became public), so that users could view graphical applications more easily. This had a couple of advantages: all the hard work and complexity was at our end, where we had control, and X11 is quite latency-sensitive, so performance improved.

But ultimately what I wanted to do was to run the GUI on the user's machine, with access to the user's files. Remember that the GUI is then not running where the software, genome databases, and all the compute power are located.

Hence the Remote Execution part of TREASURE - what we wanted was a system that would call across to a remote service to do the work, and return the result to the user. And the Access part was about making it accessible and transparent, which would lead to a working environment that people would enjoy using.

The core of TREASURE was originally a local GUI that knew how to run applications. Written in Java, it would therefore run on pretty much any client (and we had users with all sorts of Unix workstations, in addition to Windows making inroads). The clever bit was to replace the Java Runtime.getRuntime().exec() calls that ran applications locally with some form of remote procedure call. Being of its time, this might involve CORBA, RMI, SOAP, or JAX-WS, with data marshalled as XML. In fact, I implemented pretty much every remote call mechanism available (and this did come in useful, as other places made some services available using pretty random protocols). And then of course there's the server side, which was effectively a CGI script.

The other key part was to work out which files needed to be sent across. Sometimes it was obvious (it's a GUI, the user has selected a file to analyse), but sometimes we needed to send across auxiliary files as well. And on the server side it ran in a little sandbox, so you knew what output files had been generated and could return those.

Effectively, this was a production form of serverless computing running over 20 years ago. Only we called it GRID computing back then.

Another interesting feature of the architecture was the TREASURE CHEST, which was a source of applications. There were literally hundreds of possible applications you could run, and many more if you included interfaces to other providers. So rather than write all those into the app, there was a plugin system where you could download a jar file and run it as a plugin, and the TREASURE CHEST was where you could find these applications. Effectively an app store, in modern terminology.

Sadly the department got closed down due to political incompetence, so the project never really got beyond the prototype stage. And while I still have bits and pieces of code, I don't appear to have a copy of the whole thing. A lot of the components would need to be replaced, but the overall concept is still sound.

Tuesday, November 15, 2022

Tribblix for SPARC m25.1

Hot on the heels of the Tribblix Milestone 22 ISO for SPARC, it's now possible to upgrade that release to a newer version: m25.1.

(If the available versions look a bit random, that's because they are. Not every release on x86 was built for SPARC, and not all of the ones that were actually worked properly. So we have what we have.)

Aside from the underlying OS, the major jump in m25.1 for SPARC is that it brings in gcc7 (for applications; illumos itself is still built with gcc4), and generally there's a range of more modern applications available.

To upgrade m22 to m25.1 is a manual process. This is because there are steps that are necessary, and if you don't follow them exactly the system won't boot.

The underlying cause of the various problems in this process is that it's a big jump from m22 to m25.1, so you will hit bugs in the upgrade process that were only fixed in intermediate releases.

First, take a note of the current BE, eg tribblix. You might need it later if things go bad and you need to reboot into the current (hopefully working) release.
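
If you're not sure which one that is, beadm will show you (the current BE is the one flagged N in the Active column):

beadm list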

You can manually add available versions for upgrade with the following trick (this is just one line, despite how it might be formatted):

echo "m25.1|http://pkgs.tribblix.org/release-m25.1.sparc/TRIBzap.0.0.25.1.zap|Tribblix m25.1" >> /etc/zap/version.list

and check that's visible with

zap upgrade list

and then start the upgrade with

zap upgrade m25.1

Do not activate or reboot yet!

You MUST do the following:

beadm mount m25.1 /a
zap install -C /a TRIBshell-ksh93
pkgadm sync -q -R /a
beadm umount m25.1

and then you should be safe to reboot:

beadm activate m25.1
init 6

If it doesn't come back, you can boot into the previous release (the one whose name you took a note of earlier, remember) from the ok prompt:

boot -Z rpool/ROOT/tribblix

Once you're up and running on m25.1, it's time to clean up:

zap refresh

and then remove some of the old OpenSXCE packages:

zap uninstall \
SUNWfont-xorg-core \
SUNWfont-xorg-iso8859-1 \
SUNWttf-dejavu \
SUNWxorg-clientlibs \
SUNWxorg-xkb \
SUNWxvnc \
SUNWxwcft \
SUNWxwfsw \
SUNWxwice \
SUNWxwinc \
SUNWxwopt \
SUNWxwxft \
SUNWxwrtl \
SUNWxwplr \
SUNWxwplt

and then bring the packages up to date:

zap update-overlay -a

and this should give you a system that's in a workable state, roughly matching my active SPARC environment.

Monday, November 14, 2022

Tribblix for SPARC m22 ISO now available

I've made available a newer ISO image for Tribblix on SPARC.

This is an m22 ISO. So it's actually relatively old compared to the mainstream x86 release.

I actually had a number of random SPARC ISO images, but for a while I've had no way of testing any of them. (And many of the problems with the SPARC ISOs in general stem from the fact that I had no real way of testing them properly.)

Enter a newish T4-1 (thanks Andy!): I can now trivially create an LDOM, assign it a zvol for a root disk and an ISO image to boot from, and testing is easy again. And while some of the ISO images I have are clearly so broken as to not be worth considering, the m22 version looks pretty reasonable.

In terms of available application packages, it exactly matches the old m20 release. I do have newer packages on some of my test systems, but they are built with a newer gcc and so need a proper upgrade path. But that's going to be easier now too.

There is a minor error on the m22 ISO, in that the xz package shipped appears to be wrong. To fix, simply

zap install TRIBcompress-xz

and to update to the latest available applications (the ISO is from early 2020, the repo from the middle of 2021):

zap refresh
zap update TRIBlib-security-openssl
zap update-overlay -a

The reason for updating openssl on its own is that a number of applications are compiled against openssl 1.1.1, so you need to be sure that gets updated first.

The next step is to push on to something newer.

Tuesday, October 11, 2022

DevOps as an HR problem

I wrote about one way in which HR and IT can operate more closely, but there's another interaction between IT and HR that might not be so benign.

DevOps is ultimately about breaking down silos in IT (indeed, my definition of DevOps is a cultural structure in which teams work together to meet the needs of the business, rather than competing against each other to meet the needs of the team).

However, in a business, individuals and teams are actually playing a game in which the rules and criteria for success are set by HR in the shape of the (often annual) review cycle. And all too often promotions, pay rises, even restructuring, are based around individual and team performance in isolation. And who can blame individuals and teams for optimising their behaviour around the performance targets they've been set?

It's similar to Conway's Law, in which the outputs of an organisation mirror its organisational structure - here, the outputs of an organisation will mirror the performance targets that have been set. If you want to improve collaboration and remove silos, then make sure that HR are on board and get them to explicitly put those goals into the annual performance targets.

Tuesday, October 04, 2022

On the intersection between IT and HR

A while ago I mentioned The three strands of Information Technology, and how this was split into an internal-facing component (IT for the business, IT for the employee) and an external-facing one (IT for the customer).

In a pure technology company, there's quite a mismatch, with the customer-facing component being dominant and the internal-facing parts being minimised. In this case, do you actually need an IT department, in the traditional sense?

You need a (small) team to do the work, of course. But one possibility is to assign them not to a separate IT organization but to the HR department.

Why would you do this? Well, the primary role of internal IT in a technology company is simply to make sure that new starters get the equipment and capabilities they need on day one, and that they hand stuff back and get their access removed when they leave. And if there's one part of an organisation that knows when staff are arriving and leaving, it's the HR department. Integrating internal IT directly into the rest of the onboarding and offboarding process dramatically simplifies communication.

It helps security and compliance too. One of the problems you often see with the traditional setup where IT is completely separate from HR is that it can take forever to revoke a staff member's access when they leave; integrating the two functions massively shortens that cycle.

Friday, August 19, 2022

Tribblix m28, Digital Ocean, and vioscsi

A while ago, I wrote about deploying Tribblix on Digital Ocean.

The good news is that the same process works flawlessly with the recently released m28 Tribblix release.

If you recall from the original article, adding an additional storage volume didn't work. But we now have a vioscsi driver, so has the situation improved?

Yes!

All I did was select an additional volume when creating the droplet. Then if I run format, I see:

AVAILABLE DISK SELECTIONS:
       0. c3t0d1 <DO-Volume-2.5+ cyl 13052 alt 2 hd 255 sec 63>
          /pci@0,0/pci1af4,8@5/iport@iport0/disk@0,1
       1. c4t0d0 <Virtio-Block Device-0000-25.00GB>
          /pci@0,0/pci1af4,2@6/blkdev@0,0

That 25GB Virtio-Block device is the root device used for rpool; the other one is the 100GB additional volume. It's also visible in diskinfo:

TYPE    DISK                    VID      PID              SIZE          RMV SSD
SCSI    c3t0d1                  DO       Volume            100.00 GiB   no  no
-       c4t0d0                  Virtio   Block Device       25.00 GiB   no  no
-       c5t0d0                  Virtio   Block Device        0.00 GiB   no  no

(That empty c5t0d0 is the metadata service, by the way.)

Let's create a pool:

zpool create store c3t0d1

It just works. And performance isn't too shabby - I can read and write at 300MB/s.
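
And if you want to double-check, zpool will show the new pool sitting on that device:

zpool status store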

There you go. More options for running illumos.

 

Saturday, July 09, 2022

Tribblix and static networking on AWS

I've just made available the m27 AMIs for Tribblix. As usual, these are just available in London (eu-west-2).

One thing I've noticed repeatedly while running illumos on AWS is that network stability isn't great. The instance will occasionally drop off the network and stubbornly refuse to reclaim its IP address even if you reboot it. It's not just Tribblix: I run a whole lot of OmniOS on AWS, and that does the same thing.

The problem appears to be related to DHCP not being able to pick up the address (even though I can see it sending out the correct requests and getting what look like legitimate responses).

So what I do is convert the running instance from using NWAM and being a DHCP client to having statically configured networking. On first boot it needs to use DHCP, because it cannot know what its IP address and network configuration should be until it's booted up once and used DHCP to get the details. But it's extremely rare to take an AWS instance and change its networking - you would simply build new instances rather than modifying existing ones - so changing it to static is fine, and eliminates any possibility of DHCP failures messing you up in future.

In the past I've always done this manually, but now there's a much easier way if you're using m27 or later:

zap staticnet

will show you what the system will do, just as a sanity check, and then

zap staticnet -y

will implement the change.

Sunday, June 05, 2022

Cleaning up the Java illumos port

This was originally a Twitter thread; this is a more permanent, expanded version.

When support for Solaris and SPARC was removed from Java, the code that was removed fell into a number of distinct classes, not all of which are equally valuable.

Solaris platform support

First, there's the support for Solaris (and thus illumos) as an operating system (the os tree in hotspot, specifically, but there are other places in the tree that have OS-specific code), which we definitely want to retain in the illumos port of openjdk.

There really isn't much difference between Solaris and illumos from Java's point of view. We have a tiny bit of illumos-specific code in the patches, but the fact is that you can take what I've done and build it on Solaris pretty easily.

Solaris x86 cpu support

Then there's support for the x86 architecture on Solaris (and thus illumos) as a specific platform (the os_cpu tree in hotspot, specifically), which I also want to retain.

SPARC cpu support

Then there's support for SPARC systems (mostly Solaris, but the removal took out any other OS that wanted to support SPARC too), which I've decided to pass on (if anybody wants it, they're free to do the work, but it's too large a task for me).

With SPARC removed, the remaining platform (x86) is supported by multiple other operating systems, so the basic processor port is no longer my concern. What this means, also, is that it should be relatively straightforward to add in support for arm, for example, if we ever got to the stage where illumos gained a new hardware port.

Studio toolchain support

Then there's support for the (now legacy) Studio toolchain, which was a right pain and a great motivator for the removal; I can only say good riddance.

There's some junk around the fact that Studio uses different command line flags to gcc. Sometimes the code has to differ based on idiosyncrasies of the Studio compiler. A larger part is the way that Studio and gcc handle inline assembler, which I've almost (but not quite) fully cleaned out.

Support for legacy Solaris versions

Then, when you look closer, there are actually bits of code to support really old Solaris versions, or other components, that are simply no longer relevant, and it would be a good thing to get rid of all that complexity.

Some of this is complex workarounds for bugs that have long been fixed. Some things use dlopen() to handle the fact that not all releases have all the shared libraries. There are still some comments mentioning old Solaris versions that make you wonder whether they're still relevant.

Support for various quirks and peculiarities

Finally, there's a bunch of code to simply do weird or idiosyncratic things on Solaris just because you can, and I'm actually quite keen to rip all that out and just do things the correct and easy way. Historically, there have been certain interesting peculiarities on Solaris waiting to trip up the unwary, mostly for no good reason whatsoever.

As an example of that last one, Java contains its own implementation of the way that Solaris falls back on the NIS/NIS+ domain for DNS searches, and even looks for LOCALDOMAIN in the environment. Oh dear. I've eradicated that.

Another example is that there's still quite a lot of font handling code that's specifically referencing /usr/openwin, and has interesting knowledge of bugs in the OpenWindows font files.

Something I ought to look at is the way that printing is handled. Java thinks SunOS is SYS5 from the point of view of printing, which (thankfully) hasn't really been the case for a very long time. In illumos the state of printing is a mess, as we have a bunch of legacy support built in, while ideally we would just move to CUPS.

I'm also wondering whether to remove DTrace. I haven't seen it working for a while, the native DTrace in Java was always a bit problematic (the performance hit was massive and quite discouraging), and I've always disabled it in my builds.

Together, eliminating these peculiarities reduces the size of the port, makes it easier to maintain (because I don't have to keep fixing the patches as the source changes for other reasons), and makes the Java behaviour consistent with other platforms rather than being needlessly different - after all, Java is supposed to be cross-platform.

Tuesday, March 08, 2022

On password policies in the 21st century

 One of the scourges of corporate life was the forced monthly password change. As anyone who understands security will know, this was always a terrible idea - it leads to a culture of passwords that are weak, formulaic, and written down.

Another, more widespread scourge, is the use of devious complexity requirements.

Fortunately, the world is changing.

The NCSC called for people to update their approach to passwords.

The NIST Special Publication 800-63 on Digital Identity explicitly covers the fact that forced password changes and complexity rules shouldn't be applied. They call out the problems of bad policy - human behaviour is predictable in the face of stupid rules.

Even OWASP have got in on the act.

I think it's pretty clear that forced password changes and complexity rules are on the way out.

This is reinforced by the fact that the current UK Cyber Essentials certification requires you to follow the NCSC guidance. (Go to the Resources tab, and download the "Cyber Essentials Requirements for IT infrastructure".) Under User Access Control, it's pretty explicit - no password expiry, no password complexity requirements. Given that you need to have CE or equivalent to get UK government or NHS contracts now, this is a pretty big stick.

There's also a general push towards using MFA in concert with passwords.

There's another interesting question: how long do passwords need to be?

If you want to be really scared, look at the Hive Systems Password Table. You've probably seen this floating around recently. Almost all regular passwords can be trivially brute forced. If people have rainbow tables, game over.

Only it's not quite that simple. Both NIST and NCSC talk about a minimum of 8 characters. How is that possibly secure?

The point is that there's a huge gulf between running an optimised cracker on a GPU (incredibly quick), and trying to put a long list of passwords into a website or application. The former is 10s of billions of hashes per second; the latter is 10s of attempts per second. You're looking at a factor of a billion difference between the two attack vectors. And if you follow the recommendations for throttling and lockout, in reality an attacker will get a handful of attempts at most. If you look at the NIST guidance, it wants 8 characters for user-generated passwords, and only 6 characters for random machine-generated passwords.
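
To put some rough numbers on that (a back-of-the-envelope sketch, assuming an 8-character password drawn from a 62-character alphabet and the two attack rates above):

awk 'BEGIN {
  space = 62 ^ 8                          # ~2.2e14 possible passwords
  printf "offline at 1e10 guesses/s: about %.0f hours\n", space / 1e10 / 3600
  printf "online at 10 guesses/s: about %.0f years\n", space / 10 / 31557600
}'

Roughly six hours against an offline GPU cracker, versus the better part of a million years against a throttled online interface - that's the gulf being described.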

In practice, for most decent algorithms, rainbow tables normally only go up to 14-16 characters or so. This means two things. First, the ease of brute force and rainbow table attacks is such that you absolutely must keep the encrypted passwords protected, and assume that knowledge of the encrypted password means the password is compromised. And second, there's actually no benefit to a minimum password length between 8 and 16. You should allow longer (much longer) passwords, but the current attack vectors can either be met with 8 characters or require more than 16.

Happy passwording!

Sunday, March 06, 2022

The datacentre business seems to be very much alive

Last week I went to Cloud Expo Europe at ExCeL.

(Yes, there was a tube strike. No, that didn't affect me much. Had to walk from London Bridge Station to Tower Gateway to get to the DLR, but walking past HMS Belfast, Tower Bridge, and the Tower of London isn't such an imposition.)

Now, "Cloud Expo" is the umbrella event. There are a number of co-located shows - DevOps Live, Cloud and Cyber Security Expo, Big Data and AI World, and Data Centre World.

The one absolutely conspicuous thing to take away was that, despite it being notionally a Cloud Expo, the Data Centre World part was as big as all the others put together. It's all a bit swankier than when I was designing, building, and running small datacentres too.

But the place was awash with power - generators, UPS, PDUs (power strips can be really fancy and light up in all sorts of colours now), cabling. Not to mention DCIM, inventory systems, management software, security, cages, equipment lift systems, fire suppression. The whole shooting match, as it were.

This is interesting. I've had the impression for a while that the datacentre (or colocation, as a variation) business isn't in much of a decline, and this reinforced that view. There's still a lot of on-premise compute, it's not going away.

Despite the idea being propagated by some that the only way is Cloud, it appears that Cloud is additive to on-premise. The vendors I chatted to seemed to be going strong.

The rest of Cloud Expo was really quite muted. Not only was it small and quiet, but there was really nothing new out there. It had been 2 years since I was last out talking to vendors in person, and the impression I got was that the market is simply stagnant.

Like all business sectors, this one is cyclical, but I suspect that mourning the death of the datacentre and on-premise (including colocation) is premature.

Thursday, January 20, 2022

Tribblix updates and https

One good thing to have happened recently is the rise of Let's Encrypt, bringing https to all websites without all the hassle you previously had to go through to get a certificate.

One not quite so good event recently was the switch by Let's Encrypt to certificates signed by their own ISRG Root X1, and more excitingly the expiry of the prior DST Root CA X3 signing certificate.

My experience of this is that most things just worked, but I'm still seeing odd cases where clients can't connect. Generally, browsers work just fine; CLI tools are a bigger issue.

This might be due to a couple of issues. Sometimes the software itself guesses wrong (older openssl 1.0.2 for example); sometimes the system's CA bundle of trusted root certificates needs updating.
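
One quick way to see what's going on, if you have openssl to hand, is to ask it to verify the chain a server presents (using the package server here as an example); a stale CA bundle shows up as a verify error:

openssl s_client -connect pkgs.tribblix.org:443 </dev/null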

For a while now, the Tribblix package repositories have been served over https and the zap tool for package management has been configured to use https. There are cases where it falls foul of the above issues.

This might occur on older Tribblix releases - I've seen this on m22, for example.

It turns out that curl fails, but wget works. Again, that's an example of the inconsistency in behaviour that I see. You need to update the CA bundle on m22, but if the package update tool is broken that's a bit tricky.

There's an ugly hack, though, because zap will try wget if it can't find curl. So just move curl out of the way temporarily:

mv /usr/bin/curl /usr/bin/curl.t
zap refresh
zap update TRIBca-bundle
mv /usr/bin/curl.t /usr/bin/curl

and you should be good to go again.

There's another way, of course: edit the *.repo files in /etc/zap/repositories to change the URL from https to http. That's not particularly recommended (although the packages are signed and the signatures are checked).
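
If you did want to go down that route, something like this would do it (a quick sketch, assuming the URLs appear as plain https:// strings in those files):

cd /etc/zap/repositories
for f in *.repo
do
    sed 's|https://|http://|' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done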

One thing that last hack demonstrates is the value in using simple text files.

Tuesday, January 18, 2022

Inside zone installation

 How do zones actually get put together on Solaris and illumos? Specifically, how does a zone get installed?

There are various types of zones. The nomenclature here is a brand. A zone's brand defines how it gets installed and managed, and what its properties are. Often, this is mapped to a zone template, which is the default configuration for a zone of that type or brand.

 (By the way, this overlap between template and brand can be seen in the create subcommand of zonecfg. You do "create -t SUNWlx" to build a zone from a template, which is where the -t comes from. It's not the create that sets the brand, it's the template.)

 The templates are stored as xml files in /etc/zones. As are the configured zones, which is a bit confusing. So in theory, if you wanted to generate a custom template to save adding so much to your zonecfg each time, you could add your own enhanced template here. The actual zone list is in /etc/zones/index.
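
For example (a sketch I haven't actually tested; the MYsparse name is made up), you could copy one of the existing templates, edit it to add whatever you always end up configuring, and then reference it by name:

cp /etc/zones/TRIBsparse.xml /etc/zones/MYsparse.xml
# edit MYsparse.xml to taste
zonecfg -z newzone 'create -t MYsparse'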

 In fact, Tribblix has template zones, which are sparse-root zones built from a different image to the global zone. They are implemented by building an OS image that provides the file systems to be mounted read only, and a template xml file configured appropriately.

 One of the things in the template is the brand. That maps to a directory under /usr/lib/brand. So, for example, the TRIBsparse template in /etc/zones/TRIBsparse.xml sets the brand to be sparse-root, in addition to having the normal lofs mounts for /usr, /lib, and /sbin that you expect for a sparse-root zone. There's a directory /usr/lib/brand/sparse-root that contains everything necessary to manage a sparse-root zone.

 In there you'll find a couple more xml files - platform.xml and config.xml. A lot of what's in those is internal to zones. Of the two, config.xml is the more interesting here, because it has entries that match the zoneadm subcommands. And one of those is the install entry. For TRIBsparse, it is

 /usr/lib/brand/sparse-root/pkgcreatezone -z %z -R %R

 When you invoke zoneadm install, this script gets run, and you get the zone name (-z) and zonepath (-R) passed in automatically. There's not much else that you can specify for a sparse root zone. If you look at the installopts property in config.xml, there's just an h, which means that the user can specify -h (and will get the help).

 For a whole-root zone the install entry is similar, but installopts is now o:O:h - this is like getopts, so it's saying that you can pass the -o and -O flags, and that each must have an argument. These flags are used to define what overlays get installed in a whole-root zone. Having the installopts defined here means that zoneadm can validate the install command.

 So, for a given brand, we've now seen from config.xml what command will be called when you install a zone, and what options it's allowed.

 The point is that there's nothing special here. You can build a custom brand by writing your own install script, and if you need to pass arguments to it you can easily do so as long as you set installopts to match. When building all the zone brands for Tribblix, that's all I did.

 To reiterate, the install script is completely open. For existing ones, you can see exactly what it's going to do. If you want to create one, you can have it do anything you like in order to lay down the files you want in the layout you want.
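
As an illustration, a skeleton install script might look something like this (purely hypothetical, not one of the Tribblix brands; it assumes the -z/-R convention shown above plus an installopts of o:h):

#!/bin/sh
# hypothetical brand install script skeleton
ZONENAME=""
ZONEPATH=""
OVERLAY=""
while getopts "z:R:o:h" opt
do
    case $opt in
        z) ZONENAME="$OPTARG" ;;
        R) ZONEPATH="$OPTARG" ;;
        o) OVERLAY="$OPTARG" ;;
        h|*) echo "Usage: install [-o overlay]"; exit 2 ;;
    esac
done
# zones expect the zone's root under $ZONEPATH/root, with the zonepath locked down
mkdir -p "$ZONEPATH/root"
chmod 0700 "$ZONEPATH"
# ...lay the files down however you like: packages, a tarball, anything...
exit 0

All the interesting work happens where that last comment is; the wrapper is just argument handling.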

 As a crazy example, a long time ago I created a brand that built a sparse-root zone on a system using IPS packaging.

 There's a little bit of boilerplate (if you're going to create your own brands, it's probably easier to start with a copy of an existing one so you pick up the common actions that all zone installs do), but after that, the world's your oyster.

 Consider the alien-root zone in Tribblix. If you look at the installer for that, it's just dumping the contents of an iso image, tarball, or zfs send stream into the zone root. It does some cleanup afterwards, but generally it doesn't care what's in the files you give it - you can create an arbitrary software installation, tar it up, and install a zone from it.

 (In fact, I probably won't create more native zone types for Tribblix - the alien-root is sufficiently generic that I would extend that.)

 This generality in scripting goes beyond the install. For example, the prestate and poststate scripts are called before or after the zone transitions from one state to another, and you can therefore get your zone brand to do interesting things triggered by a zone transitioning state. One of the coolest uses here is the way that OmniOS implements on-demand vnics - the prestate script creates a vnic for a zone before a zone boots, and the poststate script tears it down after it halts. (Tribblix uses zap to manage vnics outside of zoneadm, so they're persistent rather than on-demand, it's just a different way of doing things.)
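
In outline (a sketch of the idea, not OmniOS's actual scripts; the link and vnic names are invented), those two hooks boil down to something like:

# prestate: create a vnic over a physical link before the zone boots
dladm create-vnic -l igb0 myzone0

# poststate: tear it down again once the zone has halted
dladm delete-vnic myzone0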

 As you can see, you aren't limited to the zone types supplied by your distribution. With enough imagination, you can extend zones in arbitrary ways.

Monday, January 17, 2022

Are software ecosystems a good thing?

 One way to judge the health or strength of a product might be to look at the ecosystem surrounding that product. But is this diagnostic?

 Note that there are several concepts here that are similar to the ecosystem. I'm not referring to the community, those people who might use or support the product. Nor am I talking about a marketplace, which is a source of artefacts that might be consumed by the product. Those are important in their own right, but they aren't what I mean when I'm talking about an ecosystem.

 No, an ecosystem is the set of other services or software that spring up to support or integrate with the product.

 There's one immediate problem here that's obvious if you think about it. Much of the ecosystem thus exists to address flaws or gaps in the product. Something that is more polished, more mature, and more finished will provide fewer opportunities for other products to add value.

 What this means, then, is that a thriving ecosystem is often a sign of weakness and immaturity, not strength. A good product will not need the extras and hangers on that come with an ecosystem.

 The notion of an ecosystem is tied in with that of MVP - Minimum Viable Product. The current trend is to launch a startup with just an MVP, rely on first mover advantage, and hope to actually finish the offering at a later date. By definition, an MVP cannot be complete, and will need a surrounding ecosystem in order to function at all. This is much more common now than in the past, when products - especially proprietary products - were not launched until they were in some sense done.

 Over time, too, an ecosystem will - or should - naturally diminish, as bugs are fixed and missing features filled in. The partners in the ecosystem will get frozen out, as their offerings become irrelevant (think ClusterHQ).

 As an example from the past, consider the ecosystems that built up around Windows and DOS. Whole industries were built on things like TCP stacks and centralized nameservices and storage (PC-NFS, even Netware). These were products reliant on fundamental failings of the product they supported. (Don't even get me started on antivirus software.)

 Fast forward, and I can't be the only one to recognise the CNCF landscape as a disaster area.