The Trouble with Tribbles...: Limiting CPU usage in a zone

Friday, September 30, 2011

Limiting CPU usage in a zone

By default, a Solaris zone has access to all the resources of the host it's running on. Normally, I've found this works fine - most applications I put in zones aren't all that resource hungry.

But if you do want to place some limits on a zone, then the zone configuration offers a couple of options.

First, you can simply allocate some CPUs to the zone:

add dedicated-cpu
set ncpus=4
end

Or, you can cap the cpu utilization of the zone:

add capped-cpu
set ncpus=4
end

I normally put all the configuration commands for a zone into a file and use zonecfg -f to build the zone; when modifying an existing zone, I create a fragment like the ones above and load it the same way.
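As a rough sketch of that workflow: the script below writes the capped-cpu fragment to a file and loads it with zonecfg -z and -f. The zone name (buildzone) and the file path are made up for illustration, and the zonecfg call is guarded so that the script only prints the command on a host without zonecfg.

```shell
#!/bin/sh
# Hypothetical zone name and fragment path; adjust for your system.
ZONE=buildzone
FRAGMENT=/tmp/${ZONE}-cpucap.cfg

# Write the capped-cpu fragment from the post into a command file.
cat > "$FRAGMENT" <<'EOF'
add capped-cpu
set ncpus=4
end
EOF

# Load the fragment only where zonecfg exists (i.e. on a Solaris host);
# elsewhere just show the command that would be run.
if command -v zonecfg >/dev/null 2>&1; then
    zonecfg -z "$ZONE" -f "$FRAGMENT"
else
    echo "would run: zonecfg -z $ZONE -f $FRAGMENT"
fi
```

Keeping the fragment in a file means the same change can be reviewed, kept with the rest of the zone's configuration, and replayed later.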

In terms of stopping a zone monopolizing a machine, the two are fairly similar. Depending on the need, I've used both.

When using dedicated-cpu, it's not just a limit but a guarantee. Those cpus aren't available to other zones. Sometimes that's exactly what you want, but it does mean that those cpus will be idle if the zone they're allocated to doesn't use them.

Also, with dedicated-cpu, the zone thinks it's only got the specified number of cpus (just run psrinfo to see). Sometimes this is necessary for licensing, but there was one case where I needed this for something else: consolidating some really old systems running a version of the old Netscape Enterprise Server, which would crash at startup. I worked out that this was because it collected performance statistics on all the cpus, and someone had decided that hard coding the array size at 100 (or something) would cover all future possibilities. That was, until I ran it on a T5140 with 128 cpus and it segfaulted. Just giving the zone 4 cpus allowed it to run just fine.

I use capped-cpu when I just want to stop a zone wiping out the machine. For example, I have a machine that runs application servers and a data build process. The data build process runs only rarely, but launches many parallel processes. When it had its own hardware that was fine: the machine would have occasional overload spikes but was otherwise OK. When it was shared with other workloads, we didn't want to change the process, so we have the build zone capped at 30 or 40 cpus (on a 64-way system) so there's plenty of cpu left over for other workloads.

One advantage of stopping runaways with capped-cpu is that you can limit each zone to, say, 80% of the system, and you can do that for all zones. It looks like you're overcommitting, but that's not really the case - uncapped is the same as a cap of all the cpus, so you're lower than that. This means that any one zone can't take the system out, but each zone still has most of the machine if it needs it (and the system has the available capacity).
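To make the 80% idea concrete, here is a small sketch that works out the cap for a 64-way machine and writes one capped-cpu fragment per zone. The zone names and output paths are invented for the example; capped-cpu takes a decimal ncpus value, which is why the cap is computed to two places.

```shell
#!/bin/sh
# Assumed values: a 64-way host, an 80% cap, and made-up zone names.
TOTAL_CPUS=64
CAP_PCT=80
ZONES="app1 app2 build"

# capped-cpu accepts a decimal ncpus, so compute the cap to two places.
NCPUS=$(awk -v n="$TOTAL_CPUS" -v p="$CAP_PCT" 'BEGIN { printf "%.2f", n * p / 100 }')

# Emit one zonecfg fragment per zone; every zone gets the same cap.
for zone in $ZONES; do
    cat > "/tmp/${zone}-cap.cfg" <<EOF
add capped-cpu
set ncpus=${NCPUS}
end
EOF
    echo "${zone}: capped at ${NCPUS} of ${TOTAL_CPUS} cpus"
done
```

The caps add up to well over 64 cpus across the three zones, but as noted above that is not an overcommit in any meaningful sense: each cap is only an upper bound, and it is lower than the implicit cap an unconfigured zone already has.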

The capability to limit memory also exists. I haven't yet had a case where that's been necessary, so have no practical experience to share.

2 comments:

David Magda said...: Personally I always tended to use "cpu-shares". Give the global zone (say) 100 shares, and each non-global zone 20, and this gives reasonable assurances that a zone won't take out the system.

This also allows 'burstable' performance, so that if most of the system is idle, but one zone needs more CPU it can get it. But, if other zones come alive and need CPU as well, they all balance out to proportionately shared resources.
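As a rough illustration of how those shares balance out, the sketch below computes each party's entitlement when everything is busy at once. The share values are the ones given above; the number of busy zones (5) is an assumption, and the arithmetic presumes the Fair Share Scheduler is in use, so that entitlement is a zone's shares divided by the total shares of everything competing for CPU.

```shell
#!/bin/sh
# Share values from the comment; the number of busy zones is an assumption.
GLOBAL_SHARES=100
ZONE_SHARES=20
BUSY_ZONES=5

# Under contention each zone's entitlement is its shares over the total
# shares of everything currently competing for CPU.
TOTAL=$((GLOBAL_SHARES + ZONE_SHARES * BUSY_ZONES))
GLOBAL_PCT=$((100 * GLOBAL_SHARES / TOTAL))
ZONE_PCT=$((100 * ZONE_SHARES / TOTAL))

echo "total shares: $TOTAL"
echo "global zone:  ${GLOBAL_PCT}%"
echo "each zone:    ${ZONE_PCT}%"
```

With fewer zones busy, the total shrinks and each remaining zone's percentage grows, which is the burstable behaviour described above.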

Using "capped-cpu" on top of this is also possible of course.

The other resource to always set is "max-lwps". I've had cases where bugs / accidental fork bombs have taken out entire physical machines because one zone gobbled up all the processes (even if the CPU/s was mostly idle).; 12:12 AM
Paul said...: There are some good examples of how to limit cpu usage in a zone in Oracle's white papers.

See 'Resource Partitioning with Pools' in the white paper “Effective Resource Management Using Oracle Solaris Resource Manager”

This is part 2 of a 4 part series

Part 1: “Introduction to Resource Management in Oracle Solaris and Oracle Database”
http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-054-intro-rm-419298.pdf

Part 2: “Effective Resource Management Using Oracle Solaris Resource Manager”
http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-054-intro-rm-419298.pdf

Part 3: “Effective Resource Management Using Oracle Database Resource Manager”
http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-056-oracledb-rm-419380.pdf

Part 4: “Resource Management Case Study for Mixed Workloads and Server Sharing”
http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-054-intro-rm-419298.pdf; 11:53 PM