Tuesday, January 3, 2012

Building a Tagged and Revisioned File system

This blog post consists of discussions with vinod regarding the file system in his blog post about os.next and my mindless mumblings during the 12 hour drive to and from Phoenix for the Christmas 2011 holidays!

Version control has been around for a really long time. Tagging, however, is a fairly new concept, that became popular largely due to Flickr and Gmail. Tagging, instead of creating folders in the native file system for the desktop, seems like an extremely useful feature, especially, for documents, photos, music. Usage of tags on the web has demonstrated that it is convenient to tag/label content to categorize them under multiple buckets. Similarly, for content typically found on a desktop PC, we can apply the same concept of tagging/labeling instead of creating folders. This would allow retrieval of files with tags without having to look for them in specific folders. This is again an attribute that we use extensively on Flickr and Gmail and other cloud content storage/retrieval applications and bringing this feature to the desktop might make storage/retrieval of content as easy as that on the web. This approach of tagging in file system vs. using folders, surprisingly is not a new idea and Googling around revealed a couple of posts: tagfs-tag-semantics-for-hierarchical-file-systemsa-tagged-filesystem.

Version control, on the other hand has been used primarily for managing source code on projects and is usually an action that is manually carried out by the user and typically not used for content on the desktop. However, if files on the desktop automatically get revisioned, then it would make this a background activity in the file system and the user would not have to explicitly checkin/checkout their files. This would be an extremely useful in cases where we clicked on "Save" instead of "SaveAs"  and lost the original copy of a document that we were editing. Again, auto-revisioning is not a new idea. It has been around for a long time but surprisingly hasn't become a default feature in all desktop file systems. Mac OS X has good internal backup and versioning support with Time Machine and Lion introduced the auto-save the feature.

Now, combining the flexibility and convenience of tagging with the benefits of auto-revisioning (rather than explicitly placing a file under some version control) would be an interesting combination for the desktop environment, we think. Here are some design and implementation ideas to build a file system that supports both tagging and auto-revisioning:

Goals:
- tagging instead of folders
- auto revisioning of files and tags
- don't break make and other build tools like autoconf, automake, configure, ant etc.
- accomodate duplicate file names (for example, digital cameras auto-number photos and the file system could the potentially have duplicate file names referring to totally different content)

Design Basics:
- prototype implementation on linux using fuse
- when a file or tag is created or revisioned it is associated with an 32 (or x) bit id.
- by default a file without any tags is equivalent to being in the "root" folder.
- tags and files are revisioned objects and by default accessing either will give their latest versions.
- tags and metadata stored as json/bson (hopefully makes the metadata future proof)
- tags can be ordered or unordered
- ordered tags can be used in cases where a fixed folder hierarchy is necessary (source code, for example) and are associated with parent and children tags (similar to the  Nested Labels feature in GMail)
- tags are also be versioned. suffix @@ followed by a version number for a specific tag version. this allows the user to look at what files were tagged on a specific version, similar to what files were in a folder at a specific version
- file are auto-versioned whenever they get modified. suffix @@ followed by a version number for a specific version
- use forward diffs for revisions - so when a file gets modified to version 2, go and update version 0 in the backing store to contain the diff between v0 and v1. similarly when v3 gets created, update v1 as diff between v2 and v1. this will allow the latest version to be viewed without having to traverse the diff hierarchy and old versions can be pruned from the filesystem, if necessary, without having to go and fix the  diff hierarchy. this approach adds an extra overhead of having to update old versions when new ones are created but it can be setup as a background low priority activity as its primary purpose is to reduce the space used by the backing store. this idea of forward diffs can be applied to the scheme of branches also where there could be multiple children to a particular revision. when a branch is created, v0 on that branch is a full copy of the file on the branch point. this allows the files on the branch to get revisioned using the same approach as that on the main branch.

Tag:
- all metadata stored as json/bson
- root.tag file always present and contains the info for the "root" tag
- tag files contain following info:
* name of the tag
* parent tag id (empty string if directly under root. allows "orphaned" files to be tagged to a parent tag)
* ordered child tag ids (allows a directory structure that may be necessary for some source code stored in the file system)
* file ids that are tagged with this tag (basic means of lookup to file_id.dat and file_id.meta)

Files:
- stored in a separate common folder as file_id.dat
- metadata for a file (timestamp etc and associated tags) stored as file_id.meta in json format 

Example:
root.tag
{
  "parent": "",
  "tags": ["1234abcd", "5678abce"],
  "ordered_tags": ""
  "files": ["2345defc", "9876beef"]
}

1234abcd.tag
{
  "parent": "root",
  "tags": "",
  "ordered_tags": ""
  "files": ["2574dddd", "aeee5621"]
}

2574dddd.meta
{
  "name": "somefile.txt",
  "tags": "1234abcd",
  "last_modified": "01-01-2012 1300 PST"
  ...
  ...
  ...
}

2574dddd.dat
<contents of somefile.txt>

Versioning with Tags:
When versioning is added to the above scheme of Tag and Files, the following adjustments need to be made:
- tag files split into two. first one is versioned_tag_id.tag. this contains the information as before but is a file that gets stored as diffs. so, when a new file gets added to this tag, this versioned tag file moves to a newer version with an updated entry for "files". similar thing happens when a file gets removed from this tag.  tag_id.tag is the higher level metadata that contains the mapping between the tag_id and all its corresponding versioned_tag_id.tag files. it is basically a json file containing a list of version numbers and corresponding versioned_tag_id files. this also allows branches and "release" labels to be added to tags.
- nothing really changes with regard to files. they continue to be tracked as file_id.dat except that they are stored as diffs described in the basics section. the specific version of a file in a branch is listed in the entry of the tag and lookups occur based on that.
- to speed up lookup of the latest version of a file special tags (such as CURRENT or LATEST) could be used in the tag file.

Same example, now with versioning:
root.tag
{
  "parent": "",
  "tags": ["1234abcd", "5678abce"],
  "ordered_tags": ""
  "files": ["2345defc", "9876beef"]
}

1234abcd.tag
{
  "parent": "root",
  "tags": "",
  "ordered_tags": [
      {"0": "1235abcd"},
      {"1": "1236abcd"},
    ]
}

1235abcd.tag
{
  "parent": "root",
  "tags": "",
  "ordered_tags": ""
  "files": ["2574dddd"]
}

1236abcd.tag

{
  "parent": "root",
  "tags": "",
  "ordered_tags": ""
  "files": ["2574dddd", "aeee5621"]
}

2574dddd.meta
{
  "name": "somefile.txt",
  "tags": "1234abcd",
  "last_modified": "01-01-2012 1300 PST"
  ...
  ...
  ...
}

2574dddd.dat
<contents of somefile.txt>

Stretch:
- Add a mini webserver with REST api's for all the tag operations and file accesses (GET for operations and POST for data). This will allow a file explorer to be build possibly using javascript. Additional benefit is that access to files on the file system from anywhere on the network can be accomplished without any samba/nfs style mechanisms. They just need to open a browser and connect to the webserver of the machine they want to view the files on. There is an issue with permissions but this can be resolved in a scheme similar to htaccess file used in Apache via a scheme of login's.

Performance and Scalability:
Hard to comment on these without actually building it and carrying out measurements but it seems like the flexibility gained may outweigh the additional overhead of using json. It will also largely depend on the content that is stored in the file system. It seems like the data on a personal laptop/desktop would be an ideal candidate for a file system like this. It would certainly perform horribly if you use it as a file system for storing your database archive! I think medium sized (a wezel word) source code dev tree may perform well in this file system but mixing it other vcs like subversion may cause strange thrashing behavior between the auto-revisioning and the versioning mechanisms of that vcs.

The approach of adding versioning on top of the tagging approach would possibly allow for versioning to be disabled for certain "mount points" or file types like build objects.

In conclusion, the concept of tagged and auto-revisioned file system can be implemented as a user space file system on Linux. The next step would be to actually prototype this design, add varieties of files and tags to the files and see if the added behavior has value and the performance overhead is not unbearable.

[Update in light of vinod's discovery of fossil dvcs]:
This entire design can be re-adjusted using one of the approaches below:
- use fossil and discard the schemes proposed here
- use sqlite as backing datastore instead of pile-of-files (.meta, .tag and .dat) and possibly the scheme of hash instead of tag/file_id
- modify fossil to retrofit this tag concept instead of folders
- add tagging/versioning to a libsqlfs - a fuse file system based on sqlite