Thursday, December 24, 2015

The palatability of complexity

There seems to be a general trend to always add complexity to any system. Perhaps it's just the way most of our brains are wired, but we can't seem to help it.

Whether it's administrative tasks (filing your expenses), computer software (who hasn't suffered the dead hand of creeping featurism?), systems administration, or even building a tax system, the trend is always to keep adding extra layers of complexity.

Eventually, this stops when the complexity becomes unsustainable. People can rebel - they will go round the back of the system, taking short cuts to achieve their objectives without having to deal with the complexity imposed on them. Or they leave - for another company without the overblown processes, or another piece of software that is easier to use.

But there's another common way of dealing with the problem, one that is superficially attractive but has far worse consequences: the addition of what I'll call a palatability layer. Rather than address the underlying problem, an additional layer is added on top to make it easier to deal with.

This fails in two ways: you haven't actually eliminated the underlying complexity, and the layer you've added will itself grow in complexity until it reaches the palatability threshold. (At which point, someone will add another layer, and the cycle repeats.)

Sometimes, existing bugs and accidental implementation artefacts become embedded as dogma in the new palatability layer. Worse, over time all expertise gravitates to the outermost layer, leaving you with nobody capable of understanding the innermost internals.

On occasion, the palatability layer becomes inflated to the position of a standard. (Which perhaps explains why standards are often so poor, and there are so many to choose from.)

For example, computer languages have grown bloated and complex. Features have been added, dependencies have grown. Every so often a new language emerges as an escape hatch.

Historically, I've often been opposed to the use of Configuration Management, because it would end up being used to support complexity rather than enforcing simplicity. This is not a fault of the tool, but of the humans who would abuse it.

As another example, I personally use an editor to write code rather than an IDE. That way, I can't write overly complex code, and it forces me to understand every line of code I write.

Every time you add a palatability layer, while you might think you're making things better, in reality you're helping build a house of cards on quicksand.

Monday, December 14, 2015

The cost of user-applied updates

Having updated a whole bunch of my (proprietary) devices with OS updates today, I was moved to tweet:

Imagining a world in which you could charge a supplier for the time it takes you to regularly update their software

On most of my Apple devices I'm applying updates to either iOS or MacOS on a regular basis. Very roughly, it's probably taking away an hour a month - I'm not including the elapsed time for the update (you just schedule this so you get yourself a cup of coffee or something), but there's a bit of planning involved, some level of interaction during the process, and then the need to fix up anything afterwards that got mangled by the update.

I don't currently run Windows, but I used to have to do that as well. And web browsers and applications. And that used to take forever, although current hardware helps (particularly the move away from spinning rust).

And then there's the constant stream of updates to the installed apps, not all of which you can ignore - some games have regular mandatory updates, and if you don't apply them the game won't even start.

If you charge for the time involved at commercial rates, you could easily justify $100 per month or $1000 per year. It's a significant drain on time and productivity, a burden being pushed from suppliers onto end users. Multiply that by the entire user base and you're looking at it having a significant impact on the economy of the planet.

And that's when things go smoothly. Sometimes systems go and apply updates at inconvenient times - I once had Windows update suddenly decide to update my work laptop just as I was shutting it down to go to the airport. Or just before an important meeting. If the update interferes with a critical business function, then costs can skyrocket very easily.

You could avoid the manual interaction and its associated costs by making updates fully automatic, but then you end up giving users no way to prevent bad updates or to schedule them appropriately. Of course, if the things were adequately tested beforehand, or minimised, then there would be much less of a problem, but the update model seems to be to replace the whole shebang and not bother with testing. (Or worry about compatibility.)

It's not just time (or sanity), there's a very real cost in bandwidth. With phone or tablet images being measured in gigabytes, you can very easily blow your usage cap. (Even on broadband - if you're on metered domestic broadband then the usage cap might be 25GB/month, which is fine for email and general browsing, but OS and app updates for a family could easily hit that limit.)

The problem extends beyond computers (or things like phones that people do now think of as computers). My TV and BluRay player have a habit of updating themselves. (And one's significant other gets really annoyed if the thing decides to spend 10 minutes updating itself just as her favourite soap opera is about to start.)

As more and more devices are connected to the network, and update over the network, the problem's only going to get worse. While some updates are going to be necessary due to newly found bugs and security issues, there does seem to be a philosophy of not getting things right in the first place but shipping half-baked and buggy software, relying on being able to update it later.

Any realistic estimate of the actual cost involved in expecting all your end users to maintain the shoddy software that you ship is so high that the industry could never be expected to foot even a small fraction of the bill. Which is unfortunate, because a financial penalty would focus the mind and maybe lead to a much better update process.

Sunday, December 13, 2015

Zones beside Zones

Previously, I've described how to use the Crossbow networking stack in illumos to create a virtualized network topology with Zones behind Zones.

The result there was to create the ability to have zones on a private network segment, behind a proxy/router zone.

What, however, if you want the zones on one of those private segments to communicate with zones on a different private segment?

Consider the following two proxy zones:

A: address 192.168.10.1, subnet 10.1.0.0/16
B: address 192.168.10.2, subnet 10.2.0.0/16

And we want the zones in the 10.1.0.0 and 10.2.0.0 subnets to talk to each other. The first step is to add routes, so that packets from system A destined for the 10.2.0.0 subnet are sent to host B. (And vice versa.)

A: route add net 10.2.0.0/16 192.168.10.2
B: route add net 10.1.0.0/16 192.168.10.1

This doesn't quite work. The packets are sent, but recall that the proxy zone is doing NAT on behalf of the zones behind it. So packets leaving 10.1.0.0 get NATted on the way out and are delivered successfully to the 10.2.0.0 destination, but the reply packet gets NATted on its way back as well, so the connection doesn't actually work.

So, all that's needed is to not NAT the packets that are going to the other private subnet. Remember, the original NAT rules in ipnat.conf on host A would have been:

map pnic0 10.1.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 10.1.0.0/16 -> 0/32

and we don't want to NAT anything that is going to 10.2.0.0, which would be:

map pnic0 from 10.1.0.0/16 ! to 10.2.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 from 10.1.0.0/16 ! to 10.2.0.0/16 -> 0/32
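
The updated rules can be flushed and reloaded into the running ipfilter with something like:

# clear the existing NAT rules and active mappings, then load the new ones
ipnat -CF -f /etc/ipf/ipnat.conf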

And that's all there is to it. You now have a very simple private software-defined network with the 10.1 and 10.2 subnets joined together.

If you think this looks like the approach underlying Project Calico, you would be right. In Calico, you build up the network by managing routes (many more, as it's per-host rather than the per-subnet I have here), although Calico has a lot more manageability and smarts built into it than manually adding routes to each host.

While simple, there are obvious problems associated with scaling such a solution.

While adding and deleting routes isn't so bad, listing all the subnets in ipnat.conf would be tedious to say the least. The solution here would be to use the ippool facility to group the subnets.
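
For reference, grouping the subnets with ippool might look something like this in /etc/ipf/ippool.conf (a sketch - the pool number is arbitrary, and exactly how the pool gets wired into the map rules is something to check against the ipnat and ippool documentation):

table role = ipf type = tree number = 13
        { 10.1.0.0/16; 10.2.0.0/16; };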

How do we deal with a dynamic environment? While the back-end zones would come and go all the time, I expect the proxy/router zone topology to be fairly stable, so configuration churn would be fairly low.

The mechanism described here isn't limited to a single host; it easily spans multiple hosts. (With the simplistic routing as I've described it here, those hosts would have to be on the same network, but that's not a fundamental limitation.) My scripts in Tribblix just save details of how the proxy/router zones on a host are configured locally, so I need to extend the logic to a network-wide configuration store. That, at least, is well-known territory.

Thursday, December 10, 2015

Building an application in Docker

We have an application that we want to make easy for people to run. As in, really easy. And for people who aren't necessarily systems administrators or software developers.

The application ought to work on pretty much anything, but each OS platform has its quirks. So we support Ubuntu - it's pretty common, and it's available on most cloud providers too. And there's a simple install script that will set everything up for the user.

In the modern world, Docker is all the rage. And one advantage of Docker from the point of view of a systems administrator is that it decouples the application environment from the systems environment - if you're a Red Hat shop, you just run Red Hat on the systems, then add Docker and your developers can get an Ubuntu (or whatever) environment to run the application in. (The downside of this decoupling is that it gives people an excuse to write even less portable code than they do now.)

So, one way for us to support more platforms is to support Docker. I already have the script that does everything, so isn't it going to be just a case of creating a Dockerfile like this and building it:

FROM ubuntu:14.04
RUN my_installer.sh

Actually, that turns out to be surprisingly close. It fails on just one line. The Docker build process runs as root, and when we try to initialise postgres with initdb, it errors out, as it won't let you run postgres as root.

(As an aside, this notion of "root is unsafe" needs a bit of a rethink. In a containerized or unikernel world, there's nothing besides the one app, so there's no fundamental difference between root and the application user in many cases, and root in a containerized world is a bogus root anyway.)

OK, so we can run the installation as another user. We have to create the user first, of course, so something like:

FROM ubuntu:14.04
RUN useradd -m hbuild

USER hbuild
RUN my_installer.sh

Unfortunately, this turns out to fail all over the place. One thing my install script does is run apt-get via sudo to get all the packages that are necessary. We're user hbuild in the container and can't run sudo, and if we could we would get prompted, which is a bit tricky for the non-interactive build process. So we need to configure sudo so that this user won't get prompted for a password. Which is basically:

FROM ubuntu:14.04
RUN useradd -m -U -G sudo hbuild && \
    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
RUN my_installer.sh

Which solves all the sudo problems, but the script also references $USER (it creates some directories as root, then chowns them to the running user so the build can populate them), and the Docker build environment doesn't set USER (or LOGNAME, as far as I can tell). So we need to populate the environment the way the script expects:

FROM ubuntu:14.04
RUN useradd -m -U -G sudo hbuild && \
    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
ENV USER hbuild
RUN my_installer.sh

And off it goes, cheerfully downloading and building everything.

I've skipped over how the install script itself ends up on the image. I could use COPY, or even something very crude like:

FROM ubuntu:14.04
RUN apt-get install -y wget
RUN useradd -m -U -G sudo hbuild && \
    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
ENV USER hbuild
RUN cd /home/hbuild && \
    wget http://my.server/my_installer.sh && \
    chmod a+x my_installer.sh && \
    ./my_installer.sh

This all works, but is decidedly sub-optimal. Leaving aside the fact that we're running both the application and the database inside a single container (changing that is a rather bigger architectural change than we're interested in right now), the Docker images end up being huge, and you're downloading half the universe each time. So to do this properly you would add an extra RUN step that did all the packaging and cleaned up after itself, so you have a base layer to build the application on.
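
For instance, the packaging could be pulled into its own layer that cleans up after itself, giving a reusable base to build the application on. A sketch (the package list is purely illustrative):

FROM ubuntu:14.04
# install the build prerequisites and clean up in the same layer, so the
# package caches and lists don't get baked into the image
RUN apt-get update && \
    apt-get install -y wget build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*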

What this does show, though, is that it's not that hard to take an existing deployment script and wrap it inside Docker - all it took here was a little fakery of the environment to more closely align with how the script was expecting to be run.

Monday, November 30, 2015

Zones behind zones

With Solaris 10 came a host of innovations - ZFS, DTrace, and zones were the big three; there were also SMF, FMA, NFSv4, and a supporting cast of improvements across the board.

The next big followup was Crossbow, giving full network virtualization. It never made it back into Solaris 10, and it's largely gone under the radar.

Which is a shame, because it's present in illumos (as well as Solaris 11), and allows you to construct essentially arbitrary network configuration in software. Coupled with zones you can build a whole datacentre in a box.

Putting everything together is a bit tedious, however. One of the things I wanted with Tribblix is to enable people (myself being the primary customer) to easily take advantage of these technologies, to automate away all the tedious grunt work.

This is already true up to a point - zap can create and destroy zones with a single command. No more mucking around with the error-prone process of writing config files and long streams of commands; computers exist to do all this for us, leaving humans to worry about what they want to do rather than the minutiae of how to do it.

So the next thing I wanted to do was to have a zone that can act as a router or proxy (I've not yet really settled on a name), so you have a hidden network with zones that can only be reached from your proxy zone. There are a couple of obvious use cases:

  • You have a number of web applications in isolated zones, and run a reverse proxy like nginx or haproxy in your proxy zone to act as a customer-facing endpoint.
  • You have a multitier application, with just the customer-facing tier in the proxy zone, and the other tiers (such as your database) safely hidden away.

Of course, you could combine the two.

So the overall aim is that:

  • Given an appropriate flag and an argument that's a network description (ideally a.b.c.d/prefix in CIDR notation) the tool will automatically use Crossbow to create the appropriate plumbing, hook the proxy zone up to that network, and configure it appropriately
  • In the simplest case, the proxy zone will use NAT to forward packets from the zones behind it, and be the default gateway for those zones (but I don't want it to do any real network routing)
  • If you create a zone with an address on the hidden subnet, then again all the plumbing will get set up so that the zone is connected up to the appropriate device and has its network settings correctly configured

This will be done automatically, but it's worth walking through the steps manually.

As an example, I want to set up the network 10.2.0.0/16. By convention, the proxy zone will be connected to it with the bottom address - 10.2.0.1 in this case.

The first step is to create an etherstub:

dladm create-etherstub zstub0

And then a vnic over it that will be the interface to this new private network:

dladm create-vnic -l zstub0 znic0

Now, for the zone to be able to manage all the networking stuff it needs to have an exclusive-ip network stack. So you need to create another vnic for the public-facing side of the network; let's suppose you're going to use the e1000g0 interface:

dladm create-vnic -l e1000g0 pnic0

You create the zone with exclusive-ip and add the pnic0 and znic0 interfaces.
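
The network-related zonecfg entries look something like this (the zone name is illustrative, and the brand, zonepath and other settings are omitted):

# configure an exclusive-ip zone with both vnics
zonecfg -z proxy0
  set ip-type=exclusive
  add net
    set physical=pnic0
  end
  add net
    set physical=znic0
  end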

Within the zone, configure the address of znic0 to be 10.2.0.1/16.
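
Inside the proxy zone, that's just (a sketch using ifconfig; ipadm could equally be used):

ifconfig znic0 plumb
ifconfig znic0 inet 10.2.0.1/16 up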

You need to set up IP forwarding on all the interfaces in the zone:

ipadm set-ifprop -p forwarding=on -m ipv4 znic0
ipadm set-ifprop -p forwarding=on -m ipv4 pnic0

The zone also needs to NAT the traffic coming in from the 10.2 network. The file /etc/ipf/ipnat.conf needs to contain:

map pnic0 10.2.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 10.2.0.0/16 -> 0/32

and you need to enable ipfilter in the zone with the command svcadm enable network/ipfilter.

Then, if you create a zone with address 10.2.0.2, for example, you need to create a new vnic over the zstub0 etherstub:

dladm create-vnic -l zstub0 znic1

and allocate the znic1 interface to the zone. Then, in that new zone, set the address of znic1 to be 10.2.0.2 and its default gateway to be 10.2.0.1.
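
Inside the new zone, that amounts to something like this (again a sketch, using the names and addresses from above):

ifconfig znic1 plumb
ifconfig znic1 inet 10.2.0.2/16 up
# make the proxy zone the default gateway, persistently
route -p add default 10.2.0.1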

That's just about manageable. But in reality it gets far more complicated:

  • With multiple networks and zones, you have to dynamically allocate the etherstub and vnic names, they aren't fixed
  • You have to make sure to delete all the items you have created when you destroy a zone
  • You need to be able to find which etherstub is associated with a given network, so you attach a new zone to the correct etherstub
  • Ideally, you want all the hidden networks to be unique (you don't have to, but as the person writing this I can make it so to keep things simple for me)
  • You want to make sure you can't delete a proxy zone if there are zones on the network behind it
  • You want the zones to boot up with their networks fully and correctly configured (there's a lot going on here that I haven't even mentioned)
  • You may need to configure rather more of a firewall than the simple NAT configuration
  • In the case of a reverse proxy, you need a way to update the reverse proxy configuration automatically as zones come and go

Overall, there are a whole lot of hoops to jump through, and a lot of information to track and verify.

I'm about halfway through writing this at the moment, with most of the basic functionality present. I can, as the author, make a number of simplifying assumptions - I get to choose the naming convention, I can declare that the hidden networks must be unique, I can declare that I will only support simple prefixes (/8, /16, and /24) rather than arbitrary prefixes, and so on.

Thursday, November 26, 2015

Buggy basename

Every so often you marvel at the lengths people go to to break things.

Take the basename command in illumos, for example. This comes in two incarnations - /usr/bin/basename, and /usr/xpg4/bin/basename.

Try this:

# /usr/xpg4/bin/basename -- /a/b/c.txt
c.txt

Which is correct, and:

# /usr/bin/basename -- /a/b/c.txt
--

Which isn't.

Wait, it gets worse:

# /usr/xpg4/bin/basename /a/b/c.txt .t.t
c.txt

Correct. But:

# /usr/bin/basename /a/b/c.txt .t.t
c

Err, what?

Perusal of the source code reveals the answer to the "--" handling - it's only caught in XPG4 mode. Which is plain stupid; there's no good reason to deliberately restrict correct behaviour to XPG4.

Then there's the somewhat bizarre handling of the ".t.t" suffix. It turns out that the default basename command is doing pattern matching rather than the expected string matching. So the "." will match any character, rather than being interpreted literally. Given how commonly "." is used to separate the filename from its suffix, and the common usage of basename to strip off the suffix, this is a guarantee of failure and confusion. For example:

# /usr/bin/basename /a/b/cdtxt .txt
c

The fact that there's a difference here is actually documented in the man page, although not very well - it points you to expr(1), which doesn't tell you anything relevant.

So, does anybody rely on the buggy broken behaviour here?

It's worth noting that the ksh basename builtin and everybody else's basename implementation seem to do the right thing.

Fixing this would also get rid of a third of the lines of code, and we could ship one binary instead of two.

Tuesday, November 24, 2015

Replacing SunSSH with OpenSSH in Tribblix

I recently did some work to replace the SSH implementation used by Tribblix - the old SunSSH from illumos - with OpenSSH.

This was always on the list - our SunSSH implementation was decrepit and unmaintained, and there seemed little point in maintaining our own version.

The need to replace it has become more urgent recently, as the mainstream SSH implementations have drifted to the point that we're no longer compatible - our implementation will not interoperate at all with the one on modern Linux distributions with the default settings.

As I've been doing a bit of work with some of those modern Linux distributions, being unable to connect to them was a bit of a pain in the neck.

Other illumos distributions such as OmniOS and SmartOS have also recently been making the switch.

Then there was a proposal to work on the SunSSH implementation so that it was mediated - allowing you to install both SunSSH and OpenSSH and dynamically switch between them to ease the transition. Personally, I couldn't see the point - it seemed to me much easier to simply nuke SunSSH, especially as some distros had already made or were in the process of making the transition. But I digress.

If you look at OmniOS, SmartOS, or OpenIndiana, they carry a number of patches - in some cases a lot of patches - to bring OpenSSH more in line with the old SunSSH.

I studied these at some length, and largely rejected them. There are a couple of reasons for this:

  • In Tribblix, I have a philosophy of making minimal modifications to upstream projects. I might apply patches to make software build, or when replacing older components so that I don't break binary compatibility, but in general what I ship is as close to what you would get if you did './configure --prefix=/usr; make ; make install' as I can make it.
  • Some of the fixes were for functionality that I don't use, probably won't use, and have no way of testing. So blindly applying patches and hoping that what I produce still works, and doesn't arbitrarily break something else, isn't appealing. Unfortunately all the gssapi stuff falls into this bracket.

One thing that might change this in the future, and something we've discussed a little, is to have something like Joyent's illumos-extra brought up to a state where it can be used as a common baseline across all illumos distributions. It's a bit too specific to SmartOS right now, so won't work for me out of the box, and it's a little unfortunate that I've just about reimplemented all the same things for Tribblix myself.

So what I ship is almost vanilla OpenSSH. The modifications I have made are fairly few:

It's split into the same packages (3 of them) along just about the same boundaries as before. This is so that you don't accidentally mix bits of SunSSH with the new OpenSSH build.

The server has
KexAlgorithms +diffie-hellman-group1-sha1
added to /etc/ssh/sshd_config to allow connections from older SunSSH clients.

The client has
PubkeyAcceptedKeyTypes +ssh-dss
added to /etc/ssh/ssh_config so that it will allow you to send DSA keys, for users who still have just DSA keys.
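
If you'd rather not touch the global client config, the same settings can be given on the command line for a one-off connection (the host name here is illustrative):

ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 -o PubkeyAcceptedKeyTypes=+ssh-dss oldhost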

Now, I'm not 100% happy about the fact that I might have broken something that SunSSH might have done, but having a working SSH that will interoperate with all the machines I need to talk to outweighs any minor disadvantages.

Sunday, November 22, 2015

On Keeping Your Stuff to Yourself

One of the fundamental principles of OmniOS - and indeed probably its defining characteristic - is KYSTY, or Keep Your Stuff* To Yourself.

(*um, whatever.)

This isn't anything new. I've expressed similar opinions in the past. To reiterate - any software that is critical for the success of your business/project/infrastructure/whatever should be directly under your control, rather than being completely at the whim of some external entity (in this case, your OS supplier).

We can flesh this out a bit. The software on a system will fall, generally, into 3 categories:

  1. The operating system, the stuff required for the system to boot and run reliably
  2. Your application, and its dependencies
  3. General utilities

As an aside, there are more modern takes on the above problem: with Docker, you bundle the operating system with your application; with unikernels you just link whatever you need from classes 1 and 2 into your application. Problem solved - or swept under the carpet, rather.

Looking at the above, OmniOS will only ship software in class 1, leaving the rest to the end user. SmartOS is a bit of a hybrid - it likes to hide everything in class 1 from you and relies on pkgsrc to supply classes 2 and 3, and the bits of class 1 that you might need.

Most (of the major) Linux distributions ship classes 1, 2, and 3, often in some crazily interdependent mess that you have to spend ages unpicking. The problem is that you need to work extra hard to ensure your own build doesn't accidentally acquire a dependency on some system component (or that your build somehow reads a system configuration file).

Generally missing from these discussions is class 3 - the general utilities. Stuff that you could really do with having around to make your life easier, but where you don't really care about the specifics.

For example, it helps to have a copy of the GNU userland around. Way too much source out there needs GNU tar to unpack, or GNU make to build, or assumes various things about the userland that are only true of the GNU tools. (Sometimes the GNU tools aren't just a randomly incompatible implementation - occasionally they have capabilities that are missing from the standard tools, like in-place editing in gsed.)

Or a reasonably complete suite of compression utilities. More accurately, uncompression, so that you have a pretty good chance of being able to unpack some arbitrary format that people have decided to use.

Then there are generic runtimes. There's an awful lot of python or perl out there, and sometimes the most convenient way to get a job done is to put together a small script or even a one-liner. So while you don't really care about the precise details, having copies of the appropriate runtimes (and you might add java, erlang, ruby, node, or whatever to that list) really helps for the occasions when you just want to put together a quick throwaway component. Again, if your business-critical application stack requires that runtime, you maintain your own, with whatever modules you need.

There might also be a need for basic graphics. You might not want or need a desktop, but something is linked against X11 anyway. (For example, java was mistakenly linked against X11 for font handling, even in headless mode - a bug recently fixed.) Even if it's not X11, applications might use common code such as cairo or pango for drawing. Or they might need to read or write image formats for web display.

So the chances are that you might pull in a very large code surface, just for convenience. Certainly I've spent a lot of time building 3rd-party libraries and applications on OmniOS that were included as standard pretty much everywhere else.

In Tribblix, I've attempted to build and package software cognizant of the above limitations. So I supply as wide a range of software in class 3 as I can - this is driven by my own needs and interests, as a rule, but over time it's increasingly complete. I do supply application stacks, but these are built to live in a separate location, and are kept at arm's length from the rest of the system. This is then integrated with Zones in a standardized zone architecture in a way that can be managed by zap. My intention here is not necessarily to supply the building blocks that can be used by users, but to provide the whole application, fully configured and ready to go.

Sunday, November 15, 2015

The Phallacy of Web Scale

A couple of times recently I've had interviewers ask me to quickly sketch out the design for a web scale architecture. Of course, being a scientist by training, the first thing I did was to work out what sort of system requirements we were looking at, to see what sort of scale we might really need.

In both cases even my initial extreme estimate, just rounding everything up, didn't indicate a scaling problem. Most sites aren't Facebook or Google; they see limited use by a fraction of the population. The point here is that while web scale sites exist, they are the exception rather than the rule, so why does everyone think they have to go to the complexity and expense of architecting a "web scale" solution?

To set this into perspective, assume you want to track everyone in the UK's viewing habits. If everyone watches 10 programmes per day and channel-hops 3 times for each programme, and there are 30 million viewers, then that's about 1 billion data points per day, or roughly 10,000 per second. Each data point is small, so at 100 bytes each that's 100GB/day, or ~10 megabits/s. So you're still talking a single server. You can hold a week's data in RAM, a year on disk.

And most businesses don't need anything like that level of traffic.

Part of the problem is that most implementations are horrifically inefficient. The site itself may be inefficient - you know the ones that have hundreds of assets on each page, multi-megabyte page weight, widgets all over, that take forever to load and look awful - if their customers bother waiting long enough. The software implementation behind the site is almost certainly inefficient (and is probably trying to do lots of stupid things it shouldn't as well).

Another trend fueling this is the "army of squirrels" approach to architecture. Rather than design an architecture that is correctly sized to do what you need, it seems all too common to simply splatter everything across a lot of little boxes. (Perhaps because the application is so badly designed it has cripplingly limited scalability so you need to run many instances.) Of course, all you've done here is simply multiplied your scaling problem, not solved it.

As an example, see this article: Scalability! But at what COST? I especially like the quote that Big data systems may scale well, but this can often be just because they introduce a lot of overhead.

Don't underestimate the psychological driver, either. A lot of people want to be seen as operating at "web scale" or with "Big Data", either to make themselves or their company look good, to pad their own CV, or to appeal to unsophisticated potential employees.

There are problems that truly require web scale, but for the rest it's an ego trip combined with inefficient applications on badly designed architectures.

Thursday, November 12, 2015

On the early web

I was browsing around, as one does, when I came across a list of early websites. Or should I say, a seriously incomplete list of web servers of the time.

This was November 1992, and I had been running a web server at the Institute of Astronomy in Cambridge for some time. That wasn't the only technology we were using at the time, of course - there was Gopher, the emergent Hyper-G, WAIS, ftp, fsp, USENET, and a few others that never made the cut.

Going back a bit further in time, about a year earlier, is an email regarding internet connectivity in Cambridge. I vaguely remember this - I had just arrived at the IoA at the time and was probably making rather a nuisance of myself, having come back from Canada where the internet was already a thing.

I can't remember exactly when we started playing with the web proper, but it would have been some time about Easter 1992. As the above email indicates, 1991 saw the department having the staggering bandwidth of 64k/s, and I think it would have taken the promised network upgrade for us to start advertising our site.

Graphical browsers came quite late - people might think of Mosaic (which you can still run if you like), but to start with we just had the CERN line mode browser, and things like Viola. Around this time there were other graphical browsers - there was one in the Andrew system, as I recall, and chimera was aimed at the lightweight end of the scale.

Initially we ran the CERN web server, but it was awful - it burnt seconds of cpu time to deliver every page, and as soon as the NCSA server came out we switched to that, and the old Sun 630MP that hosted all this was much the happier for it. (That was the machine called cast0 in the above email - the name there got burnt into the site URL, it took a while for people to get used to the idea of adding functional aliases to DNS.)

With the range of new tools becoming available, it wasn't entirely obvious which technologies would survive and prosper.

With my academic background I was initially very much against the completely unstructured web, preferring properly structured and indexed technologies. In fact, one of the things I remember saying at the time, as the number of sites started to grow, was "How on earth are people going to be able to find stuff?". Hm. Missed business opportunity there!

Although I have to say that even with search engines, actually finding stuff on the web now is a total lottery - Google have made a lot of money along the way, though. One thing I miss, again from the early days of the web (although we're talking later in the 90s now) is the presence of properly curated and well maintained indices of web content.

Another concern I had about the web was that, basically, any idiot could create a web page, leading to most of the content being complete and utter garbage (both in terms of what it contained and how it was coded). I think I got the results of that one dead right, but it failed to account for the huge growth that the democratization of the web allowed.

After a couple of years the web was starting to emerge as a clear front runner. OK, there were only a couple of thousand sites in total at this point (I think that up to the first thousand or so I had visited every single one), and the concept was only just starting to become known to the wider public.

One of the last things I did at the IoA, when I left in May 1994, was to set up all the computers in the building running Mosaic, with it looping through all the pages on a local website showcasing some glorious astronomical images, all for the departmental open day. This was probably the first time many of the visitors had come across the web, and the Wow factor was incredible.

Wednesday, October 21, 2015

Tribblix Turns Three

It's a little hard to put a fixed date on when I started work on Tribblix.

The idea - of building a custom distribution - had been floating around my mind in the latter days of OpenSolaris, and I registered the domain back in 2010.

While various bits of exploratory work had been going on in the meantime, it wasn't until the autumn of 2012 that serious development started. Eventually, after a significant number of attempts, I was able to produce a functional ISO image. That was:

-rw-r--r--   1 ptribble 493049856 Oct 21  2012 tribblix-0m0.iso

The first blog post was a few days later, but I'm going to put October 21st as the real date of birth.

Which means that Tribblix is 3 years old today!

In that time it's gone from a simple toy to a fully fledged distribution; most of the original targets I set myself have been met; it's been my primary computing environment for a while; it's proving useful as a platform for interesting experiments; and I'm looking forward to taking it even further in the next few years.

Tuesday, October 20, 2015

Minimal Viable Illumos

I've been playing with the creation of several minimal variants of illumos recently.

I looked at how little memory a minimal illumos distro could be installed and run in. Note that this was a properly built distribution - correctly packaged, most features present (if disabled), running the whole set of services using SMF.

In another dimension, I considered illumos pureboot, something that was illumos, the whole of illumos, and nothing but illumos.

Given that it was possible to boot illumos just to a shell, without all the normal SMF services running, how minimal can you make such a system?

At this point, if you're not thinking JEOS, Unikernels, or things like IncludeOS, then you should be.

So the first point is that you're always running this under a hypervisor of some sort. The range of possible hardware configurations you need to worry about is very limited - hypervisors emulate a small handful of common devices.

Secondly, the intention is never to install this. Not directly anyway. You create an image, and boot and run that. For this experiment, I'm simply running from a ramdisk. This is the way the live image boots, or you PXE boot, or even the way SmartOS boots.

First, the starting set of packages, both in Tribblix (SVR4) and IPS naming:

  • SUNWcsd=SUNWcsd
  • SUNWcs=SUNWcs
  • TRIBsys-library=system/library
  • TRIBsys-kernel=system/kernel
  • TRIBdrv-ser-usbser=driver/serial/usbser
  • TRIBsys-kernel-platform=system/kernel/platform
  • TRIBdrv-usb=driver/usb
  • TRIBsys-kernel-dtrace=system/kernel/dtrace/providers
  • TRIBsys-net=system/network
  • TRIBsys-lib-math=system/library/math
  • TRIBsys-libdiskmgt=system/library/libdiskmgt
  • TRIBsys-boot-grub=system/boot/grub
  • TRIBsys-zones=system/zones
  • TRIBdrv-storage-ata=driver/storage/ata
  • TRIBdrv-storage-ahci=driver/storage/ahci
  • TRIBdrv-i86pc-platform=driver/i86pc/platform
  • TRIBdrv-i86pc-ioat=driver/i86pc/ioat
  • TRIBdrv-i86pc-fipe=driver/i86pc/fipe
  • TRIBdrv-net-e1000g=driver/network/e1000g
  • TRIBsys-boot-real-mode=system/boot/real-mode
  • TRIBsys-file-system-zfs=system/file-system/zfs

You can probably go further, but I wanted to at least allow the possibility of talking to a storage device.

There are a few packages here that you might wonder about:

  • usbser is actually needed, it's a hard dependency of consconfig_dacf
  • many admin commands link against the zones libraries, so I add those even though they're not strictly necessary in most scenarios
  • the system will boot and run without zfs, but will panic if you run find down the dev tree
  • the system will panic if the real-mode stuff is missing
  • grub is needed to make the iso and boot it

It's possible to construct a bootable iso from the above components, which can then be customized.

I took two approaches to this. The simple way is to start chopping out the files you don't want - man pages and includes, for example. The second is to drop all of userland and only put back the files you need, one by one. I tend not to tweak the kernel much; that's non-trivial and you're only looking at marginal gains.

Working out which files are necessary is trial and error, especially for shared libraries, many of which are loaded lazily - so you can't just use what the executable tells you, as some of the libraries it's linked against will never be pulled in.
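
One trick that helps is the runtime linker's own debugging output, which shows the objects that actually get loaded when the program runs, rather than what ldd reports (the binary here is just an example):

# watch which shared objects ld.so.1 really loads at run time
LD_DEBUG=files /usr/bin/ls 2>&1 | grep 'file='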

I've put together some scripts that know how to create an image suitable for 32-bit or 64-bit hardware; we can be specific, as we know exactly the environment we are going to run in - and you just build a new custom iso if things change, rather than trying to build a generic image.

To be useful, the system needs to talk to something. You'll see that I've installed e1000g, which is what VirtualBox and qemu will give you by default. First, we have to get the root filesystem mounted read-write:

/etc/fs/ufs/mount -o remount,rw /devices/ramdisk:a /

Normally, there's a whole lot of network configuration handled by SMF, and it's all rather complicated. So we have to do it all by hand, which turns out to be relatively simple:

/sbin/ifconfig e1000g0 plumb
env SMF_FMRI=svc:/net/ip:d /lib/inet/ipmgmtd
/sbin/ifconfig e1000g0 inet 192.168.59.59/24 up

You need ipmgmtd running, and it's expecting to be run under SMF, but the way it checks is to look for SMF_FMRI in the environment, so it's easy to fool.

If you've got your VirtualBox VM set up with a Host-only Adapter, you should be able to communicate with the guest. Not that there's anything present to talk to yet.

So I set up a simple Node.js server. Now node itself doesn't have many external dependencies - just the gcc4 runtime - and for basic purposes you just need the node binary and a js file with a 'hello world' http server.
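
Something along these lines is all it takes (a sketch - the file name and port are arbitrary, and the node binary is assumed to have been copied into the image):

cat > /hello.js <<'EOF'
// answer every request with a fixed string
require('http').createServer(function (req, res) {
  res.end('hello world\n');
}).listen(8080);
EOF
node /hello.js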

With that, I have 64M of data in a 22M boot archive that is put on a 25M iso that boots up in a few seconds with an accessible web server. Pretty neat.

While it's pretty specific to Tribblix and my own build environment, there's an mvi repository on github containing all the scripts I used to build this, for those interested.

Thursday, October 08, 2015

Deconstructing .pyc files

I've recently been trying to work out why python was recompiling a bunch of .pyc files. I haven't solved that, but I learnt a little along the way, enough to be worth writing down.

Python will recompile a .py file into a .pyc file if it thinks something's changed. But how does it decide something has changed? It encodes some of the pertinent details in the header of the .pyc file.

Consider a file. There's a foo.py and a foo.pyc. I open up the .pyc file in emacs and view it in hex. (ESC-x hexl-mode for those unfamiliar.)

The file starts off like this:

 03f3 0d0a c368 7955 6300 0000 ....

The first 4 bytes 03f30d0a are the magic number, and encode the version of python. There's a list of magic numbers in the source, here.

To check this, take the 03f3, reverse it to f303, which is 62211 decimal. That corresponds to 2.7a0 - this is python 2.7, so that matches. (The 0d0a is also part of the encoding of the magic number.) This check is just to see if the .pyc file is compatible with the version of python you're using. If it's not, it will ignore the .pyc file and may regenerate it.

The next bit is c3687955. Reverse this again to get the endianness right, and it's 557968c3. In decimal, that's 1434020035.

That's a timestamp, standard unix time. What does that correspond to?

perl -e '$f=localtime(1434020035); print $f'
Thu Jun 11 11:53:55 2015

And I can look at the file (on Solaris and illumos, there's a -e flag to ls to give us the time in the right format rather than the default "simplified" version).

/bin/ls -eo foo.py
-rw-r--r-- 1 root  7917 Jun 11 11:53:55 2015 foo.py

As you can see, that matches the timestamp on the source file exactly. If the timestamp doesn't match, then again python will ignore it.
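
Both fields can also be pulled out with od rather than doing the hex arithmetic by hand; on a little-endian machine (where od's unsigned decoding matches the on-disk byte order) something like this works:

# the 2-byte magic number at the start of the file (62211 here)
od -A n -t u2 -N 2 foo.pyc
# the 4-byte timestamp at offset 4, as a unix time (1434020035 here)
od -A n -t u4 -j 4 -N 4 foo.pyc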

This has consequences for packaging. SVR4 packaging automatically preserves timestamps; with IPS you need to use pkgsend -T to do so, as it's not done by default.

Tuesday, October 06, 2015

Software directions in Tribblix

Tribblix has been developing in a number of different directions. I've been working on trimming the live image, and strengthening the foundations.

Beyond this, there is a continual stream of updated packages. Generally, if I package it, I'll try and keep it up to date. (If it's downrev, it's usually for a reason.)

In the meantime I've found time for experiments in booting Tribblix in very little memory, and creating a pure illumos bootable system.

But I thought it worthwhile to highlight some of the individual packages that have gone into Tribblix recently.

The big one was adding LibreOffice, of course. Needless to say, this was a modest amount of work. (Not necessarily all that hard, but it's a big build, and the edit-compile-debug cycle is fairly long, so it took a while.) I need to go back and update LibreOffice to a more current version, but the version I now have meets all of my needs so I can invest time and energy elsewhere.

On the desktop, I added MATE, and incorporated SLiM as a login manager. Tribblix has a lot of desktop environments and window managers available, although Xfce is still the primary and best supported option. I finally added the base GTK engines and icon themes, which got rid of a lot of errors.

In terms of tools, there's now Dia, Scribus, and Inkscape.

Tribblix has always had a retro streak. I've added gopher, gophervr, and the old Mosaic browser. There are other old X11 tools that some of you may remember - xcoral, xsnow, xsol, xshisen. If only I could get xfishtank working again.

I've been keeping up with Node.js releases, of course. But the new kid on the block is Go, and that's included in Tribblix. Current versions work very well, and now that we've got past the cgo problems, there's a whole raft of modern software written in Go that's available to us. The next one up is probably Rust.

Fun with SPARC emulators

While illumos supports both SPARC and x86 platforms, it would be a fair assessment that the SPARC support is a poor relation.

There are illumos distributions that run on SPARC - OpenSXCE has for a while, and Tribblix and DilOS also have SPARC images available; both are actively maintained. The mainstream distributions are x86-only.

A large part of the lack of SPARC support is quite simple - the number of users with SPARC hardware is small; the number of developers with SPARC hardware is even smaller. And you can see that the SPARC support is largely in the hands of the hobbyist part of the community. (Which is to be expected - the commercial members of the community are obviously not going to spend money on supporting hardware they neither have nor use.)

Absent physical hardware, are there any alternatives?

Perhaps the most obvious candidate is qemu. However, the sparc64 implementation is fairly immature. In other words, it doesn't work. Tribblix will start to boot, and does get a little way into the kernel before qemu crashes. I think it's generally agreed that qemu isn't there yet.

The next thing I tried is legion, which is the T1/T2 simulator from the opensparc project. Having built this, attempting to boot an iso image immediately fails with:

FATAL: virtual_disk not supported on this platform

which makes it rather useless. (I haven't investigated to see if support can be enabled, but the build system explicitly disables it.) Legion hasn't been updated in a while, and I can't see that changing.

Then I came across the M5 simulator. This supports a number of systems, not just SPARC. But it's an active project, and claims to be able to emulate a full SPARC system. I can build it easily enough; running it needs the opensparc binary download from legion (note - you need the T1 download, version 1.5, not the newer T2 version). The instructions here appear to be valid.

With M5, I can try booting Tribblix for SPARC. And it actually gets a lot further than I expected! Just not far enough:


cpu Probing I/O buses


Sun Fire T2000, No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.20.0, 256 MB memory available, Serial #1122867.
[mo23723 obp4.20.0 #0]
Ethernet address 0:80:3:de:ad:3, Host ID: 80112233.



ok boot
Boot device: vdisk  File and args:
hsfs-file-system
Loading: /platform/sun4v/boot_archive
ramdisk-root ufs-file-system
Loading: /platform/sun4v/kernel/sparcv9/unix
\
panic[cpu0]/thread=180e000: lgrp_traverse: No memory blocks found

Still, that's illumos bailing; there aren't any errors from M5.

Overall, I think that M5 shows some promise as a SPARC emulator for illumos.

Friday, October 02, 2015

Notifications from SMF (and FMA)

In illumos, it's possible to set the system up so that notifications are sent whenever anything happens to an SMF service.

Unfortunately, however, the illumos documentation is essentially non-existent, although the Solaris documentation on the subject should still be accurate.

The first thing is that you can see the current notification state by looking at svcs -n:

Notification parameters for FMA Events
    Event: problem-diagnosed
        Notification Type: smtp
            Active: true
            reply-to: root@localhost
            to: root@localhost

        Notification Type: snmp
            Active: true

        Notification Type: syslog
            Active: true

    Event: problem-repaired
        Notification Type: snmp
            Active: true

    Event: problem-resolved
        Notification Type: snmp
            Active: true

The first thing to realize here is that first line - these are the notifications sent by FMA, not SMF. There's a relationship, of course, in that if an SMF service fails and ends up in maintenance, then an FMA event will be generated, and notifications will be sent according to the above scheme.

(By the way, the configuration for this comes from the svc:/system/fm/notify-params:default service, which you can see the source for here. And you'll see that it basically matches exactly what I've shown above.)

Whether you actually receive the notifications is another matter. If you have syslogd running, which is normal, then you'll see the syslog messages ending up in the log files. Getting the email or SNMP notifications relies on additional services. These are

service/fault-management/smtp-notify
service/fault-management/snmp-notify

and if these are installed and enabled, they'll send the notifications out.

You can also set up notifications inside SMF itself. There's a decent intro available for this feature, although you should note that illumos doesn't currently have any of the man pages referred to. This functionality uses the listnotify, setnotify, and delnotify subcommands to svccfg. The one thing that isn't often covered is the relationship between the SMF and the FMA notifications - it's important to understand that both exist, in a strangely mingled state, with some non-obvious overlap.

You can see the global SMF notifications with:

/usr/sbin/svccfg listnotify -g

This will come back with nothing by default, so the only notifications you'll see configured are the FMA ones. To get SMF to email you if any service goes offline, then

/usr/sbin/svccfg setnotify -g to-offline mailto:admin@example.com

and you can set this up at a per-service level with

/usr/sbin/svccfg -s apache22 setnotify to-offline mailto:webadmin@example.com
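
The corresponding listnotify and delnotify subcommands let you review and remove what has been set, for example:

/usr/sbin/svccfg -s apache22 listnotify
/usr/sbin/svccfg -s apache22 delnotify to-offline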

Now, while the SMF notifications can be configured in a very granular manner - you can turn them on and off by service, you can control exactly which state transitions you're interested in, and you can route individual notifications to different destinations - when it comes to the FMA notifications all you have is a big hammer. It's all or nothing, and you can't be selective about where notifications end up (beyond the smtp vs snmp vs syslog channels).

This is unfortunate, because SMF isn't the only source of telemetry that gets fed into FMA. In particular, the system hardware and ZFS will generate FMA events if there's a problem. If you want to get notifications from FMA if there's a problem with ZFS, then you're also going to get notified if an SMF service breaks. In a development environment, this might happen quite a lot.

Perhaps the best compromise I've come up with is to have FMA notifications disabled in a non-global zone, and configure SMF notifications explicitly there. Then, just have FMA notifications in the global zone. This assumes you have nothing but applications in zones, and all the non-SMF events will get caught in the global zone.

Tuesday, September 29, 2015

Improving the foundations of Tribblix

Mostly due to history, the software packages that make up Tribblix come from 3 places.

In the first category, I'm using an essentially unmodified illumos-gate. The only change to the build is the fix for 5188 so that SVR4 packaging has no external dependencies (or the internal one of wanboot). I then create packages, applying a set of transforms (many of which simply avoid delivering individual files that I see no valid reason to ship - who needs ff or volcopy any more?).

The second category is the historical part. Tribblix was bootstrapped from another distro. Owing to the fact that the amount of time I have is rather limited, not all the bits used in the bootstrapping have yet been removed. These tend to be in the foundations of the OS, which makes them harder to replace (simply due to where they sit in the dependency tree).

In the latest update (the 0m16 prerelease) I've replaced several key components that were previously inherited. Part of this is so that I'm in control of these components (which is a good thing); another part is simply that they needed upgrading to a newer version.

One key component here is perl. I've been building my own versions of perl to live separately, but decided it was time to replace the old system perl (I was at 5.10) with something current (5.22). This in itself is easy enough. I then have to rebuild illumos because it's connected to perl, and that's a slightly tricky problem - the build uses the Sun::Solaris module, which comes from the build. (Unfortunately it uses the copy on the system rather than the bits it just built.) So I had to pull the bits out from the failed build, install those on the system, and then the build goes through properly.

Another - more critical - component is libxml2. Critical because SMF uses it, so if you get that wrong you break the system's ability to boot. After much careful study of both the OmniOS and OpenIndiana build recipes, I picked a set of options and everything worked first time. Phew!

(Generally, I'll tend to the OpenIndiana way of doing things, simply because that's where the package I'm trying to replace came from. But I usually look at multiple distros for useful hints.)

A broader area was compression support. I updated zlib along with libxml2, but also went in and built my own p7zip, xz, and bzip2, and then started adding additional compression tools such as lzip and pigz.

The work isn't done yet. Two areas I need to look at are the Netscape Portable Runtime, and the base graphics libraries (tiff, jpeg, png). And then there's the whole X11 stack, which is a whole separate problem - because newer versions start to require KMS (which we don't have) or have gone 64-bit only (which is still an open question, and a leap I'm not yet prepared to take).

Monday, September 28, 2015

Trimming the Tribblix live image

When Tribblix boots from the installation ISO, it reads in two things: the root archive, as a ramdisk, and /usr mounted from solaris.zlib via lofi.

In preparation for the next update, I've spent a little time minimizing both files. Part of this was alongside my experiments on genuinely memory-constrained systems; working out what's necessary in extreme cases can guide you into better behaviour in normal circumstances. While I don't necessarily expect installing onto a 128M system to be a routine occurrence, it would be good to keep 1G or even 512M within reach.

One of the largest single contributors to /usr was perl. It turns out that the only critical part of the system that needs perl is intrd, which is disabled in the live environment anyway. So, perl's not needed.

Another significant package is GNU coreutils. On closer investigation, the only reason I needed this was for a script that generated a UUID which is set as a ZFS property on the root file system (it's used by beadm to match up which zone BE matches the system BE). Apart from the fact that this functionality has recently been integrated into illumos, using the GNU coreutils was just being lazy (perhaps it was necessary under Solaris 10, where this script originated, but the system utilities are good enough now).

I also had the gcc runtime installed. The illumos packages don't need it, but some 3rd-party packages did - compile with gcc and you tend to end up with libgcc_s being pulled in. There are a variety of tricks with -nostdlib and -static-libgcc that are necessary to avoid this. (And I wish I understood better exactly what's happening, as it's too much like magic for my liking.)
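
As an illustration of the sort of thing involved (a sketch, not the exact recipe used for any particular package):

# link the gcc support routines statically so the result doesn't need libgcc_s.so
gcc -o myprog myprog.c -static-libgcc
# and confirm that libgcc_s no longer shows up as a runtime dependency
ldd myprog | grep libgcc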

No single change is huge, but the overall footprint of the live image has been reduced by 25%, which is worthwhile. It also counteracts the seemingly inevitable growth of the base system, so I have to worry less about whether I can justify every single driver or small package that might be useful.

Friday, September 25, 2015

illumos pureboot

In my previous article, I discussed an experiment in illumos minimization.

Interestingly, there was a discussion on IRC that wandered off into realms afar, but it got me thinking about making oddball illumos images.

So then I thought - how would you build something that was pure illumos? As in illumos, the whole of illumos, and nothing but illumos.

Somewhat surprisingly, this works. For some definition of works, anyway.

The idea is pretty simple. After building illumos, you end up with the artefacts that are created by the build populating a proto area. This has the same structure as a regular system, so you can find usr/bin/ls under there, for example.

So all I do is create a bootable image that is the proto area from a build.

The script is here.

What does this do?
  • Copies the proto area to a staging area, so it can be manipulated
  • Modifies inittab
  • Sets up grub for boot
  • Copies the kernel state files from the running system (otherwise, they're blank and the kernel is clueless)
  • Creates a block device and builds a ufs file system on it
  • Copies the staging area to it
  • Compresses the block device
  • Copies it back to the platform staging area
  • Creates a bootable iso
There's a bit more detail, but those are the salient points.
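
For the curious, the core of it looks something like this - a minimal sketch with illustrative paths and sizes, not the actual script:

PROTO=/path/to/proto/root_i386     # proto area from an illumos build
STAGE=/var/tmp/pureboot            # staging copy of the proto area
ISOROOT=/var/tmp/isoroot           # what will become the iso contents

# copy the proto area somewhere it can be modified safely
mkdir -p $STAGE
(cd $PROTO && find . | cpio -pdum $STAGE)

# ... edit $STAGE/etc/inittab, and copy the kernel state files
# (e.g. /etc/path_to_inst and /etc/devices) from the running system ...

# create a backing file, build a ufs file system on it, and populate it
mkfile 512m /var/tmp/pureboot.img
LOFI=$(lofiadm -a /var/tmp/pureboot.img)       # e.g. /dev/lofi/1
RLOFI=$(echo $LOFI | sed 's:/lofi/:/rlofi/:')
echo y | newfs $RLOFI
mount -F ufs $LOFI /mnt
(cd $STAGE && find . | cpio -pdum /mnt)
umount /mnt
lofiadm -d $LOFI

# compress the ramdisk and place it where grub expects to find it
gzip -9 /var/tmp/pureboot.img
mkdir -p $ISOROOT/platform/i86pc
cp /var/tmp/pureboot.img.gz $ISOROOT/platform/i86pc/boot_archive
# the iso root also needs grub and the kernel (copied from $STAGE/platform),
# then mkisofs is run against $ISOROOT with the eltorito boot options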

My (non-debug) proto area seems to be around 464MB, so a 512MB ramdisk works just fine. You could start deleting (or, indeed, adding) stuff to tune the image you create. The ISO image is 166MB, ready for VirtualBox.

The important thing to realise is that illumos, of itself, does not create a complete operating system. Even core OS functionality requires additional third-party components, which will not be present in the image you create. In particular, libxml2, zlib, and openssl are missing. What this means is that anything depending on these will not function. The list of things that won't work includes SMF, which is an integral part of the normal boot and operations.

So instead of init launching SMF, I have it run a shell instead. (I actually do this via a script rather than directly from init; this allows me to put up a message, and also to run other startup commands if so desired.)
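
In inittab terms, that means the sysinit entry which would have launched svc.startd points at a small wrapper script instead; something along these lines (the script name and message are illustrative, not the actual contents of the image):

smf::sysinit:/startup.sh >/dev/msglog 2<>/dev/msglog </dev/console

where /startup.sh is little more than:

#!/bin/sh
# print a banner, run any other startup commands, then hand over a shell
echo "Welcome to pure illumos - no SMF, no services, just a shell."
exec /sbin/sh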

This is what it looks like:


A surprising amount of stuff actually works in this environment. Certainly most of the standard unix commands that I've tried are just fine. Although it has to be appreciated that none of the normal boot processing has been done at this point, so almost nothing has been set up. (And / won't budge from being read-only, which is a little confusing.)

Sunday, September 20, 2015

How low can Tribblix go?

One of the things I wanted to do with Tribblix was to allow it to run in places that other illumos distros couldn't reach.

One possible target here is systems with less than gargantuan memory capacities.

(Now, I don't have any such systems. But VirtualBox allows you to adjust the amount of memory in a guest very easily, so that's what I'm doing.)

I started out by building a 32-bit only image. That is, I built a regular (32- and 64-bit combined) image, and simply deleted all the 64-bit pieces. You can do this slightly better by building custom 32-bit only packages, but it was much simpler to identify all the directories named amd64 and delete them.

(Why focus on 32-bit here? The default image has both a 32-bit and 64-bit kernel, and many libraries are shipped in both flavours too. So removing one of the 32-bit or 64-bit flavours will potentially halve the amount of space we need. It makes more sense to drop the 64-bit files - it's easier to do, and it's more likely that real systems with minimal memory are going to be 32-bit.)
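
Mechanically, the deletion is a one-liner run against the staging copy of the image (the path here is illustrative):

# strip every directory named amd64, wherever it appears in the tree
cd /path/to/image-staging
find . -type d -name amd64 -prune -exec rm -rf {} +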

The regular boot archive in Tribblix is 160M in size (the file on the ISO is gzip-compressed and ends up being about a third of that), but it's loaded into memory as a ramdisk so the full size is a hard limit on how much memory you're going to need to boot the ISO. You might be able to run off disk with less, as we'll see later. The 32-bit boot archive can be shrunk to 90M, and still has a little room to work in.

The other part of booting from media involves the /usr file system, which is a compressed lofi mount from a file. In the upcoming release I've removed perl from the live boot (it's only needed for intrd, which is disabled anyway if you're booting from media), which saves a bit of space, and the 32-bit version of /usr is about a third smaller than the regular combined 32/64-bit variant. Without any additional changes, it comes to about 171M.

So, the boot archive takes a fixed 90M, and the whole of /usr takes 171M. Let's call that just over 256M of basic footprint.

I know that regular Tribblix will boot and install quite happily with 1G of memory, and previous experience is that 768M is fine too.

So I started with a 512M setup. The ISO boots just fine. I tried an install to ZFS. The initial part of the install - which is a simple cpio of the system as booted from media - worked fine, if very slowly. The second part of the base install (necessary even if you don't add more software) adds a handful of packages. This is where it really started to struggle: it just about managed the first package and then ground completely to a halt.

Now, I'm sure you could tweak the system a little further to trim the size of both the boot archive and /usr, or tweak the amount of memory ZFS uses, but we're clearly close to the edge.

So then I tried exactly the same setup, installing to UFS instead of ZFS. And it installs absolutely fine, and goes like greased lightning. OK, the conclusion here is that if you want a minimal system with less than 512M of memory, then don't bother with ZFS but aim at UFS instead.

Reducing memory to 256M, the boot and install to UFS still work fine.

With 192M of memory, boot is still good, but the install is starting to get a bit sluggish.

If I go down to 128M of memory, the ISO won't boot at all.

However, if I install with a bit more memory, and then reduce it later, Tribblix on UFS works just fine with 128M of memory. Especially if you disable a few services. (Such as autofs cron zones-monitoring zones power fmd. Not necessarily what you want to do in production, but this isn't supposed to be production.)
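
For reference, that's just (definitely not a recipe for production use):

# turn off a few non-essential services to claw back some memory
svcadm disable autofs cron zones-monitoring zones power fmd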

It looks like 128M is a reasonable practical lower limit. The system is using most of the 128M (it's starting to write data to swap, so there's clearly not much headroom).

Going lower also starts to hit real hard limits. While 120M is still good, 112M fails to boot at all (I get "do_bop_phys_alloc Out of memory" errors from the kernel - see the fakebop source). I'm sure I could go down a bit further, but I think the next step is to start removing drivers from the kernel, which will reduce both the installed boot archive size and the kernel's memory requirements.

I then started to look more closely at the boot archive. On my test machine, it was 81M in size. Removing all the drivers I felt safe with dropped it down to 77M. That still seems quite large.

Diving into the boot archive itself, and crawling through the source for bootadm, I then found that the boot archive was a ufs archive that's only 25% full. It turns out that the boot archive will be hsfs if the system finds /usr/bin/mkisofs, otherwise it uses ufs. And it looks like the size calculation is a bit off, leading to an archive that's massively oversized. After installing mkisofs and rebuilding the boot archive, I got back to something that was 17M, which is much better.
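
If you want to try this yourself, the rebuild is a one-liner once mkisofs is somewhere bootadm can find it - a sketch using the standard paths:

# with /usr/bin/mkisofs present, the rebuilt archive comes out as hsfs
bootadm update-archive -f
# check the result (this is the 32-bit archive; amd64 has its own copy)
ls -lh /platform/i86pc/boot_archive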

On testing with the new improved boot archive, boot with 96M, or even 88M, of memory is just fine.

Down to 80M of memory, and I hit the next wall. The system looks as though it will boot reasonably well, but /etc/svc/volatile fills up and you run out of swap. I suspect this is before it's had any opportunity to add the swap partition, but once it's in that state it can't progress.

Overall, in answer to the question in the title, a 32-bit variant of Tribblix will install (using UFS) on a system with as little as 192M of memory, and run on as little as 96M.

Sunday, September 06, 2015

Fixing SLiM

Having been using my version of SLiM as a desktop login manager for a few days, I had seen a couple of spontaneous logouts.

After minimal investigation, this was a trivial configuration error on my part. And, fortunately, easy to fix.

The slim process is managed by SMF. This ensures that it starts at boot (at the right time - I've written it so that it depends on the console-login service, so it launches as soon as the console login is ready) and that it gets restarted if it exits for whatever reason.

So I had seen myself being logged out on a couple of different occasions. Once when exiting a VNC session (as another user, no less); the other time whilst running a configure script.

A quick look at the SMF log file, in /var/svc/log/system-slim:default.log, gives an immediate hint:
[ Sep  5 13:37:17 Stopping because process dumped core. ]
So, a process in the slim process contract - which is all processes launched from the desktop - dumped core, SMF spotted it happening, and restarted the whole desktop session. You really don't want that, especially as a desktop session can be made up of essentially arbitrary applications, and random core dumps are not entirely unexpected.

So, the fix is a standard one, which I had forgotten entirely. Just add the following snippet to the SMF manifest:

<property_group name='startd' type='framework'>
     <!-- sub-process core dumps shouldn't restart session -->
     <propval name='ignore_error' type='astring'
         value='core,signal' />
</property_group>
and everything is much better behaved.
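
If you'd rather not edit and re-import the manifest on an installed system, the same property can be applied to the live service with svccfg - assuming the FMRI matches the log file name above:

# add the startd property group (harmless error if it already exists),
# set the property, and refresh so the restarter picks it up
svccfg -s svc:/system/slim:default addpg startd framework
svccfg -s svc:/system/slim:default setprop startd/ignore_error = astring: core,signal
svcadm refresh svc:/system/slim:default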

Thursday, September 03, 2015

Tribblix Graphical Login

Up to this point, login to Tribblix has been very traditional. The system boots to a console login, you enter your username and password, and then start your graphical desktop in whatever manner you choose.

That's reasonable for old-timers such as myself, but we can do better. The question is how to do that.

OpenSolaris, and thus OpenIndiana, have used gdm, from GNOME. I don't have GNOME, and don't wish to be forever locked in dependency hell, so that's not really an option for me.

There's always xdm, but it's still very primitive. I might be retro, but I'm also in favour of style and bling.

I had a good long look at LightDM, and managed to get that ported and running a while back. (And part of that work helped get it into XStreamOS.) However, LightDM is a moving target: it's evolving off in other directions, and it's quite a complicated beast. As a result, while I did manage to get it to work, I was never happy enough to enable it.

I've gone back to SLiM, which used to be hosted at BerliOS. The current source appears to be here. It has the advantage of being very simple, with minimal dependencies.

I made a few modifications and customizations, and have it working pretty well. As upstream doesn't seem terribly active, and some of my changes are pretty specific, I decided to fork the code; my repo is here.

Apart from the basic business of making it compile correctly, I've put in a working configuration file, and added an SMF manifest.

SLiM doesn't have a very good mechanism for selecting what environment you get when you log in. By default it will execute your .xinitrc (and fail horribly if you don't have one). There is a mechanism where it can look in /usr/share/xsessions for .desktop files, and you can use F1 to switch between them, but there's currently no way to filter that list, or tell it what order to show them in, or have a default. So I switched that bit off.

I already have a mechanism in Tribblix to select the desktop environment, called tribblix-session. This allows you to use the setxsession and setvncsession commands to define which session you want to run, either in regular X (via the .xinitrc file) or using VNC. So my SLiM login calls a script that hooks into and cooperates with that, and then falls back on some sensible defaults - Xfce, MATE, WindowMaker, or - if all else fails - twm.
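
Usage is as simple as it sounds; the session names below are just examples of what tribblix-session might know about:

# pick the desktop started by a regular X login
setxsession mate
# and, separately, what a VNC session should run
setvncsession xfce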

It's been working pretty well so far. It can also do automatic login for a given user, and there are magic logins for special purposes (console, halt, and reboot, with the root password).

Now what I need is a personalized theme.

Wednesday, September 02, 2015

The 32-bit dilemma

Should illumos, and the distributions based on it - such as Tribblix - continue to support 32-bit hardware?

(Note that this is about the kernel and 32-bit hardware, I'm not aware of any good cause to start dropping 32-bit applications and libraries.)

There are many good reasons to go 64-bit. Here are a few:

  • Support for 32-bit SPARC hardware never existed in illumos (Sun had already dropped it with Solaris 10, before OpenSolaris)
  • Most new hardware is 64-bit, new 32-bit systems are very rare
  • Even "old" x86 hardware is now 64-bit
  • DragonFly BSD went 64-bit
  • Solaris 11 dropped 32-bit
  • SmartOS is 64-bit only
  • Applications - or runtimes such as Go and Java 8 - are starting to only exist in 64-bit versions
  • We're maintaining, building, and shipping code that effectively nobody uses, and which is therefore essentially untested
  • The time_t problem (traditional 32-bit time runs out in 2038)

So, I know I'm retro and all, but it's getting very hard to justify keeping 32-bit support.

Going to a model where we just support 64-bit hardware has other advantages:

  • It makes SPARC and x86 equivalent
  • We can make userland 64-bit by default
  • ...which makes solving the time_t problem easier
  • ...and any remaining issues with large files and 64-bit inode numbers go away
  • We can avoid isaexec
  • At least on x86, 64-bit applications perform better
  • Eliminating 32-bit drivers and kernel makes packages and install images smaller

Are there any arguments for keeping 32-bit hardware support?

  • It's a possible differentiator - a feature we have that others don't. On the other hand, if the potential additional user base is zero then it makes no difference
  • Some 32-bit hardware does still exist (mainly in embedded contexts)
Generally, though, the main argument against ripping out 32-bit hardware support is that it's going to be a certain amount of work, and the illumos project doesn't have that much in the way of spare resources, so the status quo persists.

My own plan for Tribblix was that once I had got to releasing version 1, then version 2 would drop 32-bit hardware support. (I don't need illumos to drop it; I can remove the support as I postprocess the illumos build and create packages.) As time goes on, I'm starting to wonder whether to drop 32-bit earlier.

Saturday, August 29, 2015

Tribblix meets MATE

One of the things I've been working on in Tribblix is to ensure that there's a good choice of desktop options. This ranges from traditional window managers (all the way back to the old awm) to modern lightweight desktop environments.

The primary desktop environment (because it's the one I use myself most of the time) is Xfce, but I've had Enlightenment available as well. Recently, I've added MATE as an additional option.

OK, here's the obligatory screenshot:


While it's not quite as retro as some of the other desktop options, MATE has a similar philosophy to Tribblix - maintaining a traditional environment in a modern context. As a continuation of GNOME 2, I find it to have a familiar look and feel, but I also find it to be much less demanding both at build and run time. In addition, it's quite happy with older hardware or with VNC.

Building MATE on Tribblix was very simple. The dependencies it has are fairly straightforward - there aren't that many, and most of them you would tend to have anyway as part of a modern system.

To give a few hints, I needed to add dconf, a modern intltool, itstool, iso-codes, libcanberra, zenity, and libxklavier. Having downloaded the source tarballs, I built packages in this order (this isn't necessarily the strict dependency order, it was simply the most convenient):
  • mate-desktop
  • mate-icon-theme
  • eom (the image viewer)
  • caja (the file manager)
  • atril (the document viewer, disable libsecret)
  • engrampa (the archive manager)
  • pluma (the text editor)
  • mate-menus
  • mateweather (which is pretty massive)
  • mate-panel
  • mate-session
  • marco (the window manager)
  • mate-backgrounds
  • mate-themes (from 1.8)
  • libmatekbd
  • mate-settings-daemon
  • mate-control-center
The code is pretty clean: I needed a couple of fixes, but overall very little needed to be changed for illumos.

The other thing I added was the murrine gtk2 theme engine. I had been getting odd warnings mentioning murrine from applications for a while, but MATE was competent enough to give me a meaningful warning.

I've been pretty impressed with MATE, it's a worthy addition to the available desktop environments, with a good philosophy and a clean implementation.

Monday, August 10, 2015

Whither open source?

According to the Free Software Definition, there are 4 essential freedoms:

  • The freedom to run the program as you wish, for any purpose (freedom 0).
  • The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  • The freedom to redistribute copies so you can help your neighbor (freedom 2).
  • The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

Access to the source code and an open-source license are necessary preconditions for software freedom, but not sufficient.

And, unfortunately, we are living in an era where it is becoming ever more difficult to exercise the freedoms listed above.

Consider freedom 0. In the past, essentially all free software ran perfectly well on essentially every hardware platform and operating system. At the present time, much of what claims to be open-source software is horrendously platform-specific - sometimes by ignorance (I don't expect every developer to be able to test on all platforms), but there's a disturbing trend of deliberately excluding non-preferred platforms.

There is increasing use of new languages and runtimes, which are often very restricted in terms of platform support. If you look at some of the languages like Node.js, Go, and Rust, you'll see that they explicitly target the common hardware architectures (x86 and ARM), deliberately and consciously excluding other platforms. Add to that the trend for self-referential bootstrapping (where you need X to build X) and you can see other platforms frozen out entirely.

So, much of freedom 0 has been emasculated. What of freedom 1?

Yes, I might be able to look at the source code. (Although, in many cases, it is opaque and undocumented.) And I might be able to crack open an editor and type in a modification. But actually being able to use that modification is a whole different ball game.

Actually building software from source often enters you into a world of pain and frustration. Fighting your way through Dependency Hell, struggling with arcane and opaque build systems, becoming frustrated with the vagaries of the autotools (remember how the configure script works - it makes a bunch of random and unsubstantiated guesses about the state of your system, then ignores half the results, and often needs explicitly overriding, making a mockery of the "auto" part), only to discover that "works on my system" is almost a religion.

Current trends like Docker make this problem worse. Rather than paying lip-service to portability by dealing with the vagaries of multiple distributions, authors can now restrict the target environment even more narrowly - "works in my docker image" is the new normal. (I've had some developers come out and say this explicitly.)

The conclusion: open-source software is becoming increasingly narrow and proprietary, denying users the freedoms they deserve.

Sunday, August 02, 2015

Blank Zones

I've been playing around with various zone configurations on Tribblix. This is going beyond the normal sparse-root, whole-root, partial-root, and various other installation types, into thinking about other ways you can actually use zones to run software.

One possibility is what I'm tentatively calling a Blank zone. That is, a zone that has nothing running. Or, more precisely, just has an init process but not the normal array of miscellaneous processes that get started up by SMF in a normal boot.

You might be tempted to use 'zoneadm ready' rather than 'zoneadm boot'. This doesn't work, as you can't get into the zone:

zlogin: login allowed only to running zones (test1 is 'ready').
So you do actually need to boot the zone.

Why not simply disable the SMF services you don't need? This is fine if you still want SMF and most of the services, but SMF itself is quite a beast, and the minimal set of service dependencies is both large and extremely complex. In practice, you end up running most things just to keep the SMF dependencies happy.
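
You can get a feel for just how tangled that dependency graph is with svcs; for example:

# what this instance depends on, and what depends on it
svcs -d svc:/milestone/multi-user:default
svcs -D svc:/milestone/multi-user:default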

Now, SMF is started by init using the following line (I've trimmed the redirections) from /etc/inittab

smf::sysinit:/lib/svc/bin/svc.startd

OK, so all we have to do is delete this entry, and we just get init. Right? Wrong! It's not quite that simple. If you try this then you get a boot failure:

INIT: Absent svc.startd entry or bad contract template.  Not starting svc.startd.
Requesting maintenance mode

In practice, this isn't fatal - the zone is still running - but apart from leaving you wondering why it's behaving like this, it would be nice to have the zone boot without errors.

Looking at the source for init, it soon becomes clear what's happening. The init process is now intimately aware of SMF, so essentially it knows that its only job is to get startd running, and startd will do all the work. However, it's clear from the code that it's only looking for the smf id in the first field. So my solution here is to replace startd with an infinite sleep.

smf::sysinit:/usr/bin/sleep Inf

(As an aside, this led to illumos bug 6019, as the manpage for sleep(1) isn't correct. Using 'sleep infinite' as the manpage suggests led to other failures.)

Then, the zone boots up, and the process tree looks like this:

# ptree -z test1
10210 zsched
  10338 /sbin/init
    10343 /usr/bin/sleep Inf

To get into the zone, you just need to use zlogin. Without anything running, there aren't the normal daemons (like sshd) available for you to connect to. It's somewhat disconcerting to type 'netstat -a' and get nothing back.

For permanent services, you could run them from inittab (in the traditional way), or have an external system that creates the zones and uses zlogin to start the application. Of course, this means that you're responsible for any required system configuration and for getting any prerequisite services running.
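
For example (the service and path names here are entirely hypothetical), either of these approaches would do:

# inside the zone: a traditional respawning inittab entry
ap:3:respawn:/opt/myapp/bin/myserver

# or, from the global zone, start things by hand once the zone is up
zlogin test1 /opt/myapp/bin/startup.sh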

In particular, this sort of trick works better with shared-IP zones, in which the network is configured from the global zone. With an exclusive-IP zone, all the networking would need to be set up inside the zone, and there's nothing running to do that for you.

Another thought I had was to use a replacement init. The downside to this is that the name of the init process is baked into the brand definition, so I would have to create a duplicate of each brand to run it like this. Just tweaking the inittab inside a zone is far more flexible.

It would be nice to have more flexibility. At the present time, I either have just init, or the whole of SMF. There's a whole range of potentially useful configurations between these extremes.

The other thing is to come up with a better name. Blank zone. Null zone. Something else?