Everything in its right place

18 July 2024

I spend most of my free time using a computer, and I consider myself someone who’s particularly keen on archiving information. However, whenever I need to record something (photos from a holiday, music to listen to, whatever), I feel a deep anxiety that I don’t really know where to put everything. I don’t really feel like I know what’s on my computer.

Part of the problem is due to how information is laid out on a computer. In every operating system I’m aware of, files are organised into folders, folders are allowed to contain folders, and you can link between folders. You can even have cycles (at least on Linux), so it’s not strictly a tree, even though that’s how people generally think about them.

David Weinberger wrote a book about information categorisation in 2007 (there is an expository talk about the book on YouTube), in which he rails against hierarchical organisation. His prescription, roughly: just slap a bunch of tags on everything, build a good search engine, and you’re good to go. He seems to argue that, basically, we only organise things into hierarchies because that’s often a requirement in the physical world, and for conceptual organisation it’s just an arbitrary limitation.

Of course I’m paraphrasing, and he does make some points I agree with: in particular that there is no single hierarchy that works for all domains, and that people must be free to come up with their own taxonomy. And in a certain sense he’s right that hierarchical organisation is arbitrary! If you prefer hierarchical organisation, just tag all your stuff so that each file gets exactly one tag! If you find yourself with too many tags, you can make “subtags”, so music might have a subtag pop, which we refer to as music/pop just to distinguish it from snap/crackle/pop. This is literally identical to the classic tree-based file system.
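To make that equivalence concrete, here is a minimal Python sketch. It assumes each file carries exactly one slash-separated tag, and the `build_tree` helper and the file names are made up for illustration. Under that assumption the tag mapping unfolds into nested dictionaries that look just like a folder tree.

```python
def build_tree(tag_of):
    """Turn a one-tag-per-file mapping into nested dicts shaped like a folder tree."""
    tree = {}
    for filename, tag_path in tag_of.items():
        node = tree
        # Each component of the tag path becomes one level of "folder".
        for part in tag_path.split("/"):
            node = node.setdefault(part, {})
        node[filename] = None  # a file is a leaf, so it has no children
    return tree


# Hypothetical files, each with exactly one (sub)tag.
tag_of = {
    "song.mp3": "music/pop",
    "aria.flac": "music/opera",
    "cereal.jpg": "snap/crackle/pop",
}
tree = build_tree(tag_of)
```

The two pop tags end up in different branches, which is exactly what the music/pop naming was for. The moment a file is allowed a second tag, this unfolding stops being a tree.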

Okay, so we have this idea: the tag-based filesystem. It’s a strict generalisation of the hierarchical filesystem. But maybe the specialisation was useful? To see what I mean, imagine somebody got rid of the folder structure on your computer right now. Are you happy? Do you feel organised? Do you even know where to begin organising all this stuff?

So while tags and trees are not strictly speaking alternatives, let’s pretend they’re alternatives for a moment. Are there situations where trees are better than tags, or vice versa? I think there are some criteria you can use to decide:

  1. Coming up with good tags is strictly easier than coming up with a good classification. It’s really easy to know if something is a “holiday photo”. It’s much harder to decide whether “holiday photos” is one of the disjoint categories of photos.
  2. A classification is easier to reason about than a tag system. This is basically because a tag system supports a richer semantics.
  3. Classification allows for easy discoverability.
  4. Tagging allows for incompleteness: if you don’t understand something yet, you can just leave it untagged. The list of untagged stuff can sit around like a queue, waiting for you to understand it. And if you don’t have complete knowledge of the thing, you can just apply the tags you do understand.
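The last point, the untagged queue, is easy to sketch in Python. This is only an illustration: the `TagStore` class and the file names are hypothetical and not taken from any real tool.

```python
class TagStore:
    """A toy tag store: every file maps to a (possibly empty) set of tags."""

    def __init__(self):
        self.tags = {}

    def add(self, filename):
        # Registering a file does not force any decision about it.
        self.tags.setdefault(filename, set())

    def tag(self, filename, *labels):
        # Apply only the tags you are sure about; more can follow later.
        self.tags.setdefault(filename, set()).update(labels)

    def untagged(self):
        # The queue of things you have not understood yet.
        return sorted(f for f, labels in self.tags.items() if not labels)

    def with_tag(self, label):
        return sorted(f for f, labels in self.tags.items() if label in labels)


store = TagStore()
for name in ("IMG_0001.jpg", "IMG_0002.jpg", "scan.pdf"):
    store.add(name)

store.tag("IMG_0001.jpg", "holiday", "photo")
store.tag("IMG_0002.jpg", "photo")  # partial knowledge: a photo, but of what?

pending = store.untagged()
photos = store.with_tag("photo")
```

A folder tree has no equivalent of `pending`: every file has to live somewhere from the moment it is saved.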

On balance, my basic feeling is that when something really is a tree, organising it as a tree is better. It is hard to imagine someone arguing that Linnaean taxonomy would be improved if it weren’t organised as a tree. And it’s quite nice to be able to organise your music collection first by artist, then by album, then by track. A tagging system (like ratings, or genres) can sit on top of that, but the broad domain classification encodes your complete understanding of the domain into the representation. The converse, of course, is that if you don’t have complete understanding, you shouldn’t pretend to.

tmsu is a simple mechanism for layering tags over a tree in software. It sits alongside your hierarchical filesystem, and stores a bunch of file tags in a big database. The database is stored in a single file at the root of the tree, so if you want more specialised tagging “lower down”, you can put a separate database there. For instance, if all your photos sit in a big “media” directory, you can have a database at the root of that directory that stores all your tags to do with photographs, and maybe doesn’t know anything about how you categorise music.
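The “lower down” idea amounts to a nearest-ancestor lookup: starting from a directory, walk upwards until you find a database. The Python sketch below illustrates that idea only. It is not how tmsu itself is implemented, and the `tags.db` file name and the `nearest_database` function are assumptions of mine.

```python
import tempfile
from pathlib import Path

DB_NAME = "tags.db"  # hypothetical name, not tmsu's real on-disk layout


def nearest_database(start, db_name=DB_NAME):
    """Return the closest database at or above the directory `start`, or None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / db_name
        if candidate.is_file():
            return candidate
    return None


# Build a throwaway tree: a general database at the root, and a more
# specialised one at the root of "media".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "media" / "photos").mkdir(parents=True)
    (root / "documents").mkdir()
    (root / DB_NAME).touch()
    (root / "media" / DB_NAME).touch()

    photos_db = nearest_database(root / "media" / "photos").relative_to(root)
    documents_db = nearest_database(root / "documents").relative_to(root)
```

Here anything under media resolves to the media database, while documents falls back to the one at the root.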