Monday, March 28, 2016

Running illumos in 48M of RAM

Whilst tweaking mvi recently, I went back and had another look at just how minimal an illumos install I could make.

And, given sufficiently aggressive use of the rm command, the answer appears to be that it's possible to boot illumos in 48 meg of RAM.

No, it's not April 1st. 48 meg of RAM is pretty low. It's been a long time since I've seen a system that small.

I've added some option scripts to the mvi repo. (The ones with -fix.sh as part of their names.) You don't have to run mvi to see what these do.

First, I start with mvix.sh, which is fairly minimal up front.

Then I go 32-bit, which halves the size.

Then I apply the extreme option, which removes zfs (it's the single biggest driver left), along with a bunch of crypto and other unnecessary files. And I clean up lots of bits of grub that aren't needed.

I then abstracted out a nonet and nodisk option. The nonet script removes anything that looks like networking from the kernel, and the bits of userland that I add in order to be able to configure an interface. The nodisk script removes all the remaining storage device drivers (I only included the ones that you normally see when running under a hypervisor in the first place), along with the underlying scsi and related driver frameworks.
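
To give a flavour of what that involves, stripping out disk support comes down to deleting kernel modules from the image root - something of roughly this shape (an illustrative sketch, not the actual contents of the nodisk script; the image root variable is made up):

# illustrative only - the real script in the mvi repo differs in detail
cd ${DESTDIR}
# drop the disk drivers themselves (a 32-bit image, so no amd64 subdirectories)
rm -f kernel/drv/sd kernel/drv/sd.conf
rm -f kernel/drv/ahci kernel/drv/ahci.conf
rm -f kernel/drv/mpt_sas kernel/drv/mpt_sas.conf
# and the scsi framework module beneath them
rm -f kernel/misc/scsi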

What you end up with is a 17M root file system, which compresses down to a 6.8M root archive, which gets packaged up in an 8.7M iso.

For those interested, the iso is here. It should run in VirtualBox - in 32-bit mode - and you should be able to push the memory allocated by VirtualBox down to 48M. Scary.

Of course, it doesn't do much. It boots to a prompt; the only tools you have are ksh, ls, and du.

(Oh, and if you run du against the devices tree, it will panic.)

While doing this, I found that there are a lot of dependencies between modules in the illumos kernel. Not all of them are obvious, and trying to work out what's needed and what can be removed has involved large amounts of trial and error. That said, it takes less than 5 seconds to create an iso, and it takes longer for VirtualBox to start than for this iso to boot, so the cycle is pretty quick.

Sunday, March 27, 2016

Almost there with Solarus

Every so often I'll have a go at getting some games running on Tribblix. It's not a gaming platform, but having the odd distraction can't hurt.

I accidentally stumbled across Solarus, and was immediately intrigued. Back in the day, I remember playing Zelda on the SNES. A few years ago it was re-released for the Game Boy Advance, and I actually went out and bought the console and the game. I haven't added any more games - nothing seemed compelling - although we had a stock of old Game Boy games (all mono) and those still worked, which is great.

So I thought I would try and build Solarus for Tribblix. A quick look at the prerequisites suggested that most of them would be useful anyway, so the effort wouldn't be wasted in any event.

First up was SDL 2.0. I've had the older 1.2.15 version for a while, but hadn't had anything actually demand version 2 yet. That was easy enough, and because all the filenames are versioned, it can happily be installed alongside the older version.

While I was at it, I installed the extras SDL_image, SDL_net, SDL_ttf, smpeg, and SDL_mixer. The only build tweak needed was to supply LIBS="-lsocket -lnsl" to the configure script for smpeg and SDL_net. I needed to version the plaympeg binary installed by smpeg to avoid a conflict with the older version of smpeg, but that was about it.
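
For the record, that tweak is nothing more exotic than the following (the install prefix here is just an example):

# smpeg and SDL_net need the socket libraries spelled out explicitly on illumos
LIBS="-lsocket -lnsl" ./configure --prefix=/usr
gmake
gmake install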

Then came OpenAL, which turned out to be a bit tricky to find. Some of the obvious search results didn't seem appropriate. I have no idea whether the SourceForge site is the current one, but it appears to have the source code I needed.

Another thing that looks abandoned is modplug, where the Solarus folks have a github mirror of the copy that's right for them.

Next up, PhysicsFS. This isn't quite what you'd expect from the name - it's an abstraction layer that replaces traditional file system access.

On to Solarus itself. This uses cmake to build, and I ended up with the following incantation in a new empty build directory:

cmake ../solarus-1.4.5 \
 -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=NO \
 -DCMAKE_INSTALL_RPATH=/opt/solarus/lib \
 -DSOLARUS_USE_LUAJIT=OFF \
 -DCMAKE_INSTALL_PREFIX=/opt/solarus

Let's go through that. The CMAKE_INSTALL_PREFIX is fairly obvious - that's where I'm going to install it. And SOLARUS_USE_LUAJIT is necessary because I've got Lua, but have never had LuaJIT working successfully.

The two RPATH lines are necessary because of the strange way that cmake handles RPATH. Usually, it builds with RPATH set to the build location, then installs with RPATH stripped. This is plain stupid, but works when you simply dump everything into a single swamp. So you need to manually force it to put the correct RPATH into the binary (which is the sort of thing you would actually want a build system to get right on its own).
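
Once installed, it's worth checking that the forced RPATH actually made it into the binary. Something like this will do (the path assumes the main binary lands in /opt/solarus/bin, which may not be exact):

# show the RPATH/RUNPATH entries recorded in the dynamic section
elfdump -d /opt/solarus/bin/solarus | egrep 'RPATH|RUNPATH'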

Unfortunately, it doesn't actually work properly. There are two problems which I haven't really had a chance to look at - the first is that it fails fatally with an Xlib error if I move the mouse (which is a little odd as it doesn't actually use the mouse); the second is that it runs an order of magnitude or two slower than useful, so I suspect a timing error.

Still, the build is pretty simple and it's so close to working that it would be nice to finish the job.

Saturday, March 26, 2016

Tweaking MVI

A few months ago I first talked about minimal viable illumos, an attempt to construct a rather more minimalist bootable copy of illumos than the gigabyte-sized images that are becoming the norm.

I've made a couple of changes recently, which are present in the mvi repository.

The first is to make it easier for people who aren't me (and myself when I'm not on my primary build machine) to actually use mvi. The original version had the locations of the packages hardcoded to the values on my build machine. Now I've abstracted out package installation, which eliminates quite a lot of code duplication. And then I added an alternative package installation script which uses zap to retrieve and install packages from the Tribblix repo, just like the regular system does. So you can much more easily run mvi from a vanilla Tribblix system.
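
The shape of the abstraction is roughly this (an illustrative sketch - the function and helper names are made up, and the real scripts install into the image being built rather than the running system):

install_pkgs() {
    for pkg in "$@"
    do
        case ${PKG_METHOD} in
        zap)
            # fetch and install from the Tribblix package repo
            zap install ${pkg}
            ;;
        *)
            # hypothetical helper standing in for the original hardcoded logic,
            # using packages already present on the build host
            install_local_pkg ${pkg}
            ;;
        esac
    done
}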

What I would like to add is a script that can run on pretty much any (illumos) system. This isn't too hard, but would involve copying most of the functionality of zap into the install script. I'm holding off for a short while, hoping that a better mechanism presents itself. (By better, what I mean is that I've actually got a number of image creation utilities, and it would be nice to rationalise them rather than keep creating new ones.)

The second tweak was to improve the way that the size of the root archive is calculated, to give better defaults and adapt to variations more intelligently.

There are two slightly different mechanisms used to create the image. In mvi.sh, I install packages, and then delete what I'm sure I don't need; with mvix.sh I install packages and then only take the files I do need. The difference in size is considerable - for the basic installation mvi.sh is 127M and mvix.sh is 57M.

Rather than a common base image size of 192M, I've set mvi.sh to 160M and mvix.sh to 96M. These sizes give a reasonable amount of free space - enough that adding the odd package doesn't require the sizes to be adjusted.

I then have standard scripts to construct 32-bit and 64-bit images. A little bit of experimentation indicates that the 32-bit image ends up being half the size of the base, whereas the 64-bit image comes in at two thirds. (The difference is that in the 32-bit image, you can simply remove all 64-bit files. For a 64-bit kernel, you still need both 32-bit and 64-bit userland.) So I've got those scripts to simply scale the image size, rather than try and pick a new number out of the air.
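
In script terms that's just arithmetic on the base size, along these lines (a sketch of the idea rather than the exact code):

# base archive size for mvix.sh, in MB
BASE_SIZE=96
# 32-bit images come in at about half the base size
SIZE_32=$((${BASE_SIZE} / 2))
# 64-bit images at about two thirds - both 32-bit and 64-bit userland remain
SIZE_64=$((${BASE_SIZE} * 2 / 3))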

I also have a sample script to install Node.js. This again modifies the image size, just adding the extra space that Node needs. I've had to calculate this more accurately, as reducing the size of the base archive gave me less margin for error.

(As an aside, adding applications doesn't really work well in general with mvix.sh, as it doesn't know what dependencies applications might need - it only installs the bare minimum the OS needs to boot. Fortunately Node is fairly self-contained, but other applications are much less so.)

Sunday, March 13, 2016

Software selection - choice or constraint?

In Tribblix, software is preferentially managed using overlays rather than packages.

Overlays are groups of packages bundled together to supply a given software need - the question should be "what do you want to do?", with packages (and packaging) merely an implementation artifact in providing the answer to that question.

Part of the idea was that not only would the overlays match a user's mental model of what they want to install, but there would also be many fewer overlays than packages, making it much easier for the human brain to track that smaller number of items.

Now, it's true that there are fewer overlays than available packages. As of right now, there are 91 overlays and 1237 packages available for Tribblix. So that's better than an order of magnitude reduction in the number of things, and an enormous reduction in the possible combinations of items. However, it's no longer small in the absolute sense.

(Ideally, small for me means that it can be all seen on screen at once. If you have to page the screen to see all the information, your brain is also paging stuff in and out.)

So I've been toying with the idea of defining a more constrained set of overlays. Maybe along the lines of a desktop and server split, with small, large, and developer instances of each.

This would certainly lead to significant simplification. However, after trying it out I'm unconvinced. The key point is that genuine needs are rather more complicated than can be addressed by half a dozen neat pigeonholes. (Past experience with the install metaclusters in Solaris 10 was that they were essentially of no use to any particular user - they always needed to be customised.)

By removing choice, you're seriously constraining what users can do with the system. Worse, by crippling overlays you force users back into managing packages, which is one of the things I was trying to avoid in the first place.

So, I'm abandoning the idea of removing choice, and the number of overlays is going to increase as more applications are added. Which means that I'm going to have to think a lot harder about the UX aspect of overlay management.

Sunday, March 06, 2016

Load balancers - improving site reliability

You want your online service to be reliable - a service that isn't working is of no use to your users or customers. Yet the components that you're using - hardware and software - are themselves unreliable. How do you arrange things so that the overall service is more reliable than the individual components it's made up of?

This is logically two distinct problems. First, given a set of N systems able to provide a service, how do you maintain service if one or more of them fail? Second, given a single service, how do you make sure it's always available?

The usual solution here involves some form of load balancer: a software or hardware device that takes incoming requests and chooses a system to handle each one, only considering as candidates those systems that are actually working.

(The distinction between hardware and software here is more between prepackaged appliances and DIY. The point about the "hardware" solutions is that you buy a thing, and treat it as a black box with little access to its internals.)

For hardware appliances, most people have heard of F5. Other competitors in this space are Kemp and A10. All are relatively (sometimes eye-wateringly) expensive. Not necessarily bad value, mind, depending on your needs. At the more affordable end of the spectrum sits loadbalancer.org.

Evaluating these recently, one thing I've noticed is a general tendency to move upmarket. They're no longer load balancers; there's a new term here - ADC, or Application Delivery Controller. They may do SSL termination and add features such as simple firewalling, Web Application Firewalls, or Intrusion Detection and Threat Management. While this is clearly meant to add differentiation and keep ahead of the cannibalization of the market, much of the additional functionality simply isn't relevant for me.

Then there is a whole range of open source software solutions that do load balancing. Often these are also reverse proxies.

HAProxy is well known, and very powerful and flexible. It's not just for web traffic - it's very good at handling generic TCP. It's packed with features; my only criticism is that the configuration is rather monolithic.

You might think of Nginx as a web server, but it's also an excellent reverse proxy and load balancer. It doesn't quite have the range of functionality of HAProxy, but most people don't need anything that powerful anyway. One thing I like about Nginx is directory-based configuration - drop a configuration fragment into a directory, signal nginx, and you're off. If you're managing a lot of sites behind it, such a configuration mode is a godsend.
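
As an example, assuming your nginx.conf includes a conf.d directory (the paths and backend addresses here are purely illustrative), bringing a new load-balanced site online is little more than this:

# drop a fragment into conf.d and ask nginx to pick it up
cat > /etc/nginx/conf.d/example.conf <<'EOF'
upstream example_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}
server {
    listen 80;
    server_name example.com;
    location / {
        proxy_pass http://example_backend;
    }
}
EOF
nginx -s reload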

There's an interesting approach used in SNI Proxy. It assumes an incoming HTTPS connection has an SNI header on it, picks that out, and uses that to decide where to forward the TCP session. By using SNI, you don't have to put certificates on the proxy host, or get it to decrypt (and possibly re-encrypt) anything.

Offering simpler configuration are Pound and Pen. Neither is very keen on virtual hosting configurations. If all your backend servers are the same and you do all the virtual hosting there, then that's fine, but if you need to route to different sets of back end servers depending on the incoming request, they aren't a good choice.

For more dynamic configurations, there's vulcand, where you put all your configuration into Etcd. If you're into microservices and containers then it's definitely worth a look.

All the above load balancers assume that they're relatively reliable (or at least relatively stable) compared to the back end services they're proxying. So they give you protection against application or hardware failure, and allow you to manage replacement, upgrades, and general deployment tasks without affecting users. The operational convenience of being able to manage an application independently of its user-facing endpoint can be a huge win.

Achieving availability of the endpoint that customers connect to needs a little extra work. What's to stop the load balancer itself failing?

In terms of the application failing, that should be less of a concern. Compared to a fully-fledged business application, the proxy is fairly simple and usually stateless, so it has less to go wrong and can be restarted automatically pretty quickly if and when it fails.

But what if the underlying system goes away? That's what you need to protect against. And what you're really doing here is trying to ensure that the IP address associated with that service is always live. If it goes away, move it someplace else and carry on.

Ignoring routing tricks and things like VRRP and anycast, some solutions here are:

UCARP is a userland implementation of the Common Address Redundancy Protocol (CARP). Basically, hosts in a group monitor each other. If the host holding the address disappears, another host in the group will bring up a virtual interface with the required IP address. The bringup/teardown is delegated to scripts, allowing you to perform any other steps you might need to as part of the failover.
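
A typical ucarp setup looks something like this (the addresses, interface name, and script paths are illustrative; check the ucarp documentation for the exact options in your build):

# both hosts run this with the same vhid and password but their own real
# address; whichever is master brings up the virtual 192.0.2.10 via the script
ucarp -i e1000g0 -s 192.0.2.11 -v 10 -p secret -a 192.0.2.10 \
 --upscript=/etc/vip-up.sh --downscript=/etc/vip-down.sh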

Wackamole, which uses the Spread toolkit, is another implementation of the same idea. It's getting a bit old now and hasn't seen any work for a while.

A newer variation that might be seen as the logical successor to wackamole is vippy, which is built on Node. The downside here is that Node is a moving target, so vippy won't build as is on current versions of Node, and I had trouble building it at all.

As you can see, this is a pretty large subject, and I've probably only scratched the surface. If there are things I've missed, especially if they're relevant to illumos, let me know.

Friday, March 04, 2016

Supermicro - illumos compatible server configuration

We were recently in the market for some replacement servers. We run OmniOS in production, so something compatible with illumos is essential.

This is a little more tricky than it appears. Some of the things to be aware of are specific to illumos, while some are more generic. But, after spending some time reading an awful lot of spec sheets and with the help of the OmniOS mailing list, we got to a final spec that ought to be pretty good.

I'm going to talk about a subset of Supermicro systems here. While other vendors exist, it can be harder to put together a working spec.

To start with, Supermicro have a bewildering list of parts, so we narrowed things down by looking at 2U storage servers, with plenty of 2.5" disk slots in the front for future expansion.

Why 2.5"? Well, it allows you twice as many slots (24 as opposed to 12) so you have more flexibility. Also, the industry is starting to move away from 3.5" drives to 2.5", but that's a slow process. More to the point, most SSDs come in the 2.5" form factor, and I was planning to go to SSDs by default. (It's a good thing now, I'm thinking that choosing spinning rust now will look a bit daft in a few years time.) If you want bulk on the cheap, then something with 3.5" slots that you can put 4TB or larger SAS HDDs in might be better, or something like the DataON 1660 JBOD.

We're also looking at current motherboards. That means the X10 at this point. The older X9 are often seen in Nexenta configurations, and those will work too. But we're planning for the long haul, so want things to not turn into antiques for as long as possible.

So there is a choice of chassis. These are:
  • 216 - 24 2.5" drive slots
  • 213 - 16 2.5" drive slots
  • 826 - 12 3.5" drive slots
  • 825 - 8 3.5" drive slots
The ones with smaller numbers of drives have space at the front for something like a CD drive.

The next thing that's important, especially for illumos and ZFS, is whether it's got a direct attach backplane or whether it puts the disks behind an expander. Looking at the 216 chassis, you can have:
  • A or AC - direct attach backplane
  • E1C - single expander backplane
  • E2C - dual expander backplane
So something with A or AC in the name is direct attach, E1C is a single expander, and E2C is a dual expander. (With a dual expander setup you can use multipathing.)

Generally, for ZFS, you don't want expanders. Especially so if you have SATA drives - and many affordable SSDs are SATA. Vendors and salespeople will tell you that expanders never cause any problems, but most illumos folk appear allergic to the mere mention of them.

(I think this is one reason Nexenta-compatible configurations, and some of the preconfigured SuperServer setups, look expensive at first sight. They often have expanders, use exclusively SAS drives as a consequence, and SAS SSDs are expensive.)

So, we want the SC216BAC-R920LPB chassis. To connect up 24 drives, you'll need HBAs. We're using ZFS, so don't need (or want) any sort of hardware raid, just direct connectivity. So you're looking at the LSI 9300-8i HBA, which has 8 internal ports, and you're going to need 3 of them to connect all 24 drives.

For the motherboard, the X10 generation has a range of models; at this point, select how many and what speed network interfaces you want. The dual network boards have 16 DIMM slots, while the quad network boards have 24. The network cards are Intel i350 (1GbE) or X540 (10GbE), both of which are supported by illumos.

The 2U chassis can support a little 2-drive disk bay at the back of the machine; you can put a pair of boot drives in there and wire them up directly to the SATA ports on the motherboard, giving you an extra 2 free drive slots in the front. Note, though, that this is only possible with the dual network boards - the quad network boards take up too much room in the chassis. (It's not so much the extra network ports as such, but the extra DIMM slots.)

Another little quirk is that as far as I can tell the quad 1G board has fewer USB ports, and they're all USB3. You need USB2 for illumos, and I'm not sure if you can demote those ports down to USB2 or not.

So, if you want 4 network ports (to provide, for example, a public LACP pair and a private LACP pair), you want the X10DRi-T4+.
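
On the illumos side, each LACP pair is simply a link aggregation. With the X540 ports appearing as ixgbe instances (the link and aggregation names below are just examples), the public pair would be created along these lines:

# bundle two 10GbE ports into an LACP aggregation for the public network
dladm create-aggr -L active -l ixgbe0 -l ixgbe1 public0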

Any E5-2600 v3 CPUs will be fine. We're not CPU bound so just went for cheaper lower-power chips, but that's a workload thing. One thing to be aware of is that you do need to populate both sockets - if you just have one then you lose half of the DIMM slots (which is fairly obvious) and most of the PCI slots (which isn't - look at the documentation carefully if you're thinking of doing this, but better to get a single-socket motherboard in the first place).

As for drives, we went for a pair of small Intel S3510 for the boot drives, those will be mirrored using ZFS. For data, larger Intel S3610, as they support more drive writes - analysis of our I/O usage indicated that we are worryingly close to the DWPD (Drive Writes Per Day) of the S3510, so the S3610 was a safer choice, and isn't that much more expensive.
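
The sums behind that are straightforward; with made-up numbers for illustration:

# drive writes per day (DWPD) is simply daily write volume divided by capacity
# e.g. writing around 130GB a day to a 480GB drive:
echo "scale=2; 130 / 480" | bc
# => .27, uncomfortably close to the endurance rating of a read-optimised SSD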

Hopefully I'll be able to tell you how well we get on once they're delivered and installed.

Wednesday, March 02, 2016

Moving goalposts with openssl

The most recent openssl release fixed a number of security issues.

In order to mitigate against DROWN, SSLv2 was disabled by default. OK, that's a reasonable thing to do. The mere presence of the code is harmful, and improving security is a good thing. Right?

Well, maybe. Unfortunately, this breaks binary compatibility by default. Suddenly, a number of functions that used to be present in libssl.so have disappeared. In particular:

SSLv2_client_method
SSLv2_method
SSLv2_server_method

and the problem is that if you have an application that references those symbols, the linker can't find them, and your application won't even run.
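
This is easy to demonstrate with the runtime linker; once the new libssl is in place, checking a suspect binary or library (the path below is just an example) shows the damage:

# ldd -r forces symbol relocations and reports anything that can't be resolved;
# any 'symbol not found: SSLv2_...' lines mean the object won't run cleanly
ldd -r /usr/lib/libcurl.so.4 | grep SSLv2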

This hit pkg(5) on OmniOS, which is pretty nasty.

So I had a look around on Tribblix to see what would break. It's largely what you would expect:

  • curl, specifically libcurl
  • wget
  • neon
  • the python ssl module
  • the ruby ssl module
  • apache
  • mysql
And, of course, by the magic of dependency hell, that spreads to affect a large number of applications.

In many cases an application's SSLv2 code is only built if the corresponding calls are detected in libssl at build time. Which leads you into a chicken and egg situation - you have to install the newer openssl, thus breaking your system, in order to rebuild the applications that are broken.

And even if a distributor rebuilds what the distro ships, there's still any 3rd-party or locally built binaries which could be broken.

For Tribblix, I'm rebuilding what I can in any case, explicitly disabling SSLv2 (because the automatic detection is wrong). I'll temporarily ship openssl with SSLv2 enabled, until I've finally nailed everything.
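
For reference, putting SSLv2 back in the interim build is just a matter of a Configure option (the platform target and prefix here are illustrative):

# openssl 1.0.2g disables SSLv2 by default; enable-ssl2 restores the old ABI
./Configure solaris64-x86_64-gcc shared enable-ssl2 --prefix=/usr
make
make install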

But this is a game changer in terms of expectations. The argument before was that you should link dynamically in order to take advantage of updates to critical libraries rather than have to rebuild everything individually. Now, you have to assume that any security fix to a library you use could break compatibility.

For critical applications, another reason to build your own stack.