by vifon 1002 days ago
TMSU is the only such system I found useful without being cumbersome. After years of trying to use Git Annex, it was refreshing that TMSU doesn't alter the files in any way, merely storing all the (meta)data out-of-band in a separate DB.

These days I use TMSU via my own Emacs-based UI almost every single day, so thank you for that!


Storing metadata out-of-band strikes me as key to any usable content management system within a realistically complex space.

Naming schemes and directory hierarchies have some limited application, but ultimately there will be data which simply won't be shoehorned into any such system, and an externally-managed catalogue tying together disparate elements is required.

(Keeping that catalogue up to date and consistent is a whole 'nother issue.)
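
As a rough illustration of what "out-of-band" means in practice, here is a minimal Python sketch of a tag catalogue kept in a separate SQLite database. The schema and the function names (`open_catalogue`, `tag`, `query`) are my own assumptions for the example, not TMSU's actual schema or API; the only point is that the files themselves are never touched.

```python
import sqlite3


def open_catalogue(db_path=":memory:"):
    """Open (or create) a catalogue database; the schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_tag ("
        " path TEXT NOT NULL,"
        " tag TEXT NOT NULL,"
        " PRIMARY KEY (path, tag))"
    )
    return conn


def tag(conn, path, *tags):
    """Attach tags to a path. Only the database changes, never the file."""
    conn.executemany(
        "INSERT OR IGNORE INTO file_tag (path, tag) VALUES (?, ?)",
        [(path, t) for t in tags],
    )
    conn.commit()


def query(conn, *tags):
    """Return the sorted paths that carry every one of the given tags."""
    if not tags:
        return []
    placeholders = ",".join("?" for _ in tags)
    rows = conn.execute(
        f"SELECT path FROM file_tag WHERE tag IN ({placeholders}) "
        "GROUP BY path HAVING COUNT(DISTINCT tag) = ? ORDER BY path",
        (*tags, len(set(tags))),
    )
    return [row[0] for row in rows]
```

Nothing here notices when a file is renamed or deleted, which is exactly the consistency problem mentioned above.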

I do like the idea of a virtual filesystem in which elements are effectively search dimensions, which leads to an interesting notion that search is identity.

That is, a search will produce one of three possible result sets:

- Null, that is, no matches.

- Plural, that is, a list of matches.

- Unity, that is, one matching item.

In the last case, the search providing a single result is an identity of that result. (It may not be a stable identity over time, but it is at least for the present.)
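
The three cases can be sketched in a few lines of Python. This is only a toy, assuming the catalogue is a list of dicts and a search is an exact match on every given field; `ResultKind`, `search`, `classify`, and `identify` are names invented for the example.

```python
from enum import Enum


class ResultKind(Enum):
    NULL = "null"      # no matches
    UNITY = "unity"    # exactly one match
    PLURAL = "plural"  # a list of matches


def search(catalogue, **criteria):
    """Return every item whose fields equal all the given criteria."""
    return [
        item for item in catalogue
        if all(item.get(key) == value for key, value in criteria.items())
    ]


def classify(results):
    """Map a result list onto the null / unity / plural distinction."""
    if not results:
        return ResultKind.NULL
    if len(results) == 1:
        return ResultKind.UNITY
    return ResultKind.PLURAL


def identify(catalogue, **criteria):
    """Treat the search as an identity: return the item only if it is unique."""
    results = search(catalogue, **criteria)
    return results[0] if classify(results) is ResultKind.UNITY else None
```

Adding an item to the catalogue can turn a unity result into a plural one, which is the instability over time noted above.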

Where a list is returned, the size of the list determines how usable it is, and how it is usable. Ten items can be quickly scanned to find the relevant item(s), if they exist. 100 or 1,000 items can often still be managed manually, though they'll typically take some time. Somewhere between 100 and a few thousand items, though, you're in the range where automated assessments or filtering becomes necessary.

Large libraries themselves typically have tens of thousands to millions of items. The largest book collections (Library of Congress, British Library) have roughly 150 million books (or equivalents). Other records may exist in greater numbers: periodicals, financial records, databases. Facebook has reported ~5 billion items posted daily for some years now. (I suspect most of those are trivial, but that still leaves a large number of potentially non-trivial items.) Surveillance and other large-scale data collection systems may be larger still.

> Storing metadata out-of-band strikes me as key to any usable content management system within a realistically complex space.

Yes and no. I was specifically comparing it to Git Annex, which is hard to categorize in these terms. It forces every file to become a symbolic link to the actual file living in `.git/annex/`, and every query then temporarily mutates the hierarchy of directories storing these symbolic links. I found the latter disruptive enough (in particular for the directory mtimes) that I was actively avoiding doing any such queries. See: https://git-annex.branchable.com/tips/metadata_driven_views/

On the other hand my current setup involves TMSU queries which result in virtual Emacs directory editor (dired) views that don't affect anything else. I don't even use the FUSE functionality of TMSU.

The situation is one of compromise.

The core problem is that there are formats and storage modes which don't readily allow for the modification of the item itself. Editing PDFs is already a pain; applying metadata to some entry in a database or wiki, or on a third-party website, isn't possible at all.

The remaining options seem to me analogues of practices with books.

It's possible to "rebind" an item and include bibliographic information in the equivalent of the fly-leaves of that work, much as a library may rebind a book and apply a label with inventory number(s), call number(s), and/or bibliographic data to that book. Since physical objects are inherently modifiable and enclosable, this makes sense. The digital analogues vary, but are at least theoretically available (e.g., enclosing a work in an archive format which includes metadata). See the WARC (Web ARChive) file format for example: <https://en.wikipedia.org/wiki/WARC_(file_format)>. Epub files and software packaging formats such as RPM and DEB (the format used by APT) are other examples of standardised file structures which encapsulate others.
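
A minimal sketch of the "rebinding" idea in Python, using a plain ZIP container as a stand-in. This is not WARC or any standard layout: the `metadata.json` member, the `payload/` prefix, and the `wrap`/`unwrap` helpers are all assumptions made up for the example. The original bytes are stored unmodified alongside the metadata.

```python
import io
import json
import zipfile


def wrap(payload_name, payload, metadata):
    """Enclose the payload bytes and a metadata record in one ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The "fly-leaf": bibliographic data stored beside the work.
        archive.writestr("metadata.json", json.dumps(metadata, sort_keys=True))
        # The work itself, byte-for-byte unchanged.
        archive.writestr("payload/" + payload_name, payload)
    return buffer.getvalue()


def unwrap(archive_bytes):
    """Return (metadata, {name: bytes}) from an archive made by wrap()."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        payload = {
            name[len("payload/"):]: archive.read(name)
            for name in archive.namelist()
            if name.startswith("payload/")
        }
    return metadata, payload
```

The trade-off is that every tool which wants the original file now has to know how to open the wrapper.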

The other practice of a library is to abstract out the metadata to a catalogue, effectively a metadata index.

In a physical library archive, this involves a cataloguing process, as part of item acquisition workflows. The metadata for many traditionally-published works is already centralised such that it can be obtained from specific organisations such as the US Library of Congress, the British Library, the OCLC (originally the Ohio College Library Center, which both has its own item identifier and manages the Dewey Decimal Classification), the International ISBN Agency or one of its national affiliates (e.g., Bowker, in the US), and the International DOI Foundation (for DOI assignments: digital object identifiers, used extensively in academic journals). Circulation of items is managed through a circulation desk, both for external lending and for managing, tracking, and reshelving books used within the library itself but not borrowed externally.

For digital media, the equivalents would be either some sort of management system, which would require an application-specific interface, or a filesystem which incorporates not only metadata but workflow management. I'm leaning toward the latter concept as more universal, though that also raises the question of how to deal with workflows in which contents leave or enter that filesystem context itself.

But with a filesystem, you have a number of additional possibilities available:

- The pairing of works with metadata is automated.

- Workflows can be integrated into filesystem actions. The act of creating a file would also create the file metadata (to a greater extent than present inode entries do).

- Additionally, introducing the notion of process status means that the filesystem itself could distinguish between works which have no or only default cataloguing data applied and those which have had additional metadata added (say, from an external look-up or automated heuristics based on file contents).

- Renames and deletions are now managed through the filesystem itself rather than third-party tools.

- I'd like to see both versioning (changes to a given file) and relationships (source, derived, referenced, and referencing works) tracked as well.

- Different forms of a work could be tracked together: the markup-language source and generated output versions (PDF, PS, ePub, HTML, plain text, etc.) of a text. Translations. Audio formats. Different performances of a work. Optical scans of printed material. Photographs of visual or plastic arts (sculpture), or architecture. See FRBR (Functional Requirements for Bibliographic Records), and the Work, Expression, Manifestation, Item distinctions: <https://en.wikipedia.org/wiki/Functional_Requirements_for_Bi...>.

- Ideally, some sort of highly-invariant fingerprinting such that different versions of a work can be identified and matched despite different formats or slight modifications (intentional differences between editions, translations, errors or damage introduced over time). Traditional whole-work hashes fail to offer this, though segmented and normalised hashes or vectors might be able to do so.
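
To give the fingerprinting point some shape, here is one possible reading of "segmented and normalised hashes", sketched in Python: normalise the text to lowercase words, hash overlapping word windows (shingles), and compare the resulting sets by Jaccard similarity. The helper names and the window size `k=3` are arbitrary choices for the example. It tolerates formatting changes and small edits, but would do nothing for translations or scans.

```python
import hashlib
import re


def normalise(text):
    """Lowercase and keep only alphanumeric words, dropping punctuation and spacing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(text, k=3):
    """Return the set of truncated SHA-1 hashes of every k-word window."""
    words = normalise(text)
    if not words:
        return set()
    if len(words) < k:
        windows = [" ".join(words)]
    else:
        windows = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha1(w.encode("utf-8")).hexdigest()[:16] for w in windows}


def similarity(a, b):
    """Jaccard similarity of two fingerprints: 1.0 identical, 0.0 disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A whole-work hash changes completely when a single comma changes, whereas here the score degrades gradually with the amount of differing text.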

Again, the largest problem with a filesystem approach is in imports and exports from that filesystem-based archive. That would probably best be achieved through wrapper formats and applications or servers for other organisations or the general public.