by sayhello 5391 days ago
You are right, having a binary diffing mechanism would make the rest simpler to implement, but that diffing mechanism itself would be hard to build!

We thought about going pure diff at the beginning and decided that implementing that method would give us less flexibility in the short term.

I speculate that to do so, we would need to implement our own virtual machine (we are using VirtualBox) and/or disk image format.

2 comments

Mmm, I take your point - what ideas are out there to do that, though? I agree it is hard ...

The good news is that for the "known/common stuff" you can always have a central database of the random/temp stuff they generate.

E.g., MySQL generates tmp files here and there, and so on - so you could profile all that common stuff in that way.

Then for the uncommon stuff or your custom things, you declaratively say "do not track", as you do with code today.
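A minimal sketch of that idea, assuming the shared profiles and the user's "do not track" rules are both plain glob patterns. The profile names, the paths, and the `is_tracked` helper are all made up for illustration; they are not a real list of what MySQL or anything else writes.

```python
from fnmatch import fnmatchcase

# Hypothetical shared profiles: glob patterns for files that well-known
# software regenerates on its own. Illustrative guesses, not an
# authoritative list.
KNOWN_PROFILES = {
    "mysql": ["/tmp/mysql*", "/var/lib/mysql/*.pid"],
    "generic": ["/tmp/*", "*.swp"],
}

def is_tracked(path, installed, user_ignores=()):
    """Return False if path matches a shared profile or a user 'do not track' rule."""
    patterns = list(user_ignores)
    for name in installed:
        patterns.extend(KNOWN_PROFILES.get(name, []))
    return not any(fnmatchcase(path, pattern) for pattern in patterns)
```

The user-supplied patterns play the role of a `.gitignore`; the profiles cover the common software so most people never have to write one.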

Moreover, you can do it in a way that is crowdsourced - i.e., if it's a cloud service, when people declare that such and such in MongoDB is random/temp, then you learn it for all users.
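One way the crowdsourcing could work, sketched under the assumption that a pattern becomes shared once enough distinct users have reported it. The `SharedIgnoreList` class and the threshold are hypothetical, and the pattern in the usage note is just an example.

```python
from collections import defaultdict

class SharedIgnoreList:
    """Promotes a pattern to the shared profile once enough distinct users report it."""

    def __init__(self, min_reports=3):  # threshold is an arbitrary assumption
        self.min_reports = min_reports
        self._reports = defaultdict(set)  # (software, pattern) -> set of user ids

    def report(self, user, software, pattern):
        # A set means repeated reports from the same user count only once.
        self._reports[(software, pattern)].add(user)

    def patterns_for(self, software):
        return sorted(
            pattern
            for (name, pattern), users in self._reports.items()
            if name == software and len(users) >= self.min_reports
        )
```

For instance, after enough different users call `report(user, "mongodb", "/data/db/mongod.lock")`, that pattern shows up in `patterns_for("mongodb")` for everyone.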

At the end of the day there is a limited number of things people use and for the long tail it is OK for people to be declarative I guess.

But on the flip side I can see how that could end up being a nightmare.

But wouldn't it be nice to do something as simple as pull/commit/push for general purpose computers?

That piece is the less difficult one, I think - you could always use existing VM infrastructure and do (1) restore, (2) apply changes, and (3) save back. I wouldn't mess with disk formats and so on - no need.
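The three steps could be sketched roughly like this. It is only a toy model: plain dicts of path -> content stand in for VM snapshots, and every function name is made up. In practice each step would map onto whatever snapshot operations the existing VM tooling provides.

```python
import copy

def restore(snapshots, name):
    """Step 1: start from a copy of a saved state (a dict of path -> content)."""
    return copy.deepcopy(snapshots[name])

def apply_changes(state, changes):
    """Step 2: apply a change set; a value of None means 'delete this path'."""
    for path, content in changes.items():
        if content is None:
            state.pop(path, None)
        else:
            state[path] = content
    return state

def save(snapshots, name, state):
    """Step 3: store the result as a new named snapshot."""
    snapshots[name] = copy.deepcopy(state)

def commit(snapshots, base, changes, new_name):
    """Run restore -> apply -> save and return the new snapshot."""
    state = restore(snapshots, base)
    save(snapshots, new_name, apply_changes(state, changes))
    return snapshots[new_name]
```

The base snapshot is never modified, so the change set behaves a bit like a commit on top of it.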
Thanks for your thoughts, PabloOsinaga.

I am just spitting out ideas without deep thought but a discussion might prove fruitful.

It could come down to classifying types of data: some to be ignored, some to be careful with (like data or code), others to blindly overwrite (critical security updates).

The point would be to treat the data as "dumb", but to keep in mind that some data are "dumber" than others.

Perhaps if we had our own disk image format, we could mark certain types of data to be ignored, for instance.

Our VM would know what to mark as ignored for, say, POSIX systems, and that aspect could be configurable.
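To make the three categories concrete, here is a rough sketch, assuming paths are classified by glob rules where the first match wins. The `Policy` names, the rule list, and the paths are hypothetical placeholders, not a real POSIX profile.

```python
from enum import Enum
from fnmatch import fnmatchcase

class Policy(Enum):
    IGNORE = "ignore"        # never tracked, e.g. temp files
    CAREFUL = "careful"      # user data or code: keep local copy, flag conflicts
    OVERWRITE = "overwrite"  # e.g. critical security updates: incoming wins

# Hypothetical, configurable rules; first match wins.
POSIX_RULES = [
    ("/tmp/*", Policy.IGNORE),
    ("/var/run/*", Policy.IGNORE),
    ("/usr/lib/*", Policy.OVERWRITE),
    ("/home/*", Policy.CAREFUL),
]

def classify(path, rules=POSIX_RULES, default=Policy.CAREFUL):
    for pattern, policy in rules:
        if fnmatchcase(path, pattern):
            return policy
    return default  # unknown paths get the cautious treatment

def merge(local, incoming, rules=POSIX_RULES):
    """Merge incoming files into local; return (merged, conflicts)."""
    merged, conflicts = dict(local), []
    for path, content in incoming.items():
        policy = classify(path, rules)
        if policy is Policy.IGNORE:
            continue
        if policy is Policy.OVERWRITE or path not in local:
            merged[path] = content
        elif local[path] != content:
            conflicts.append(path)  # CAREFUL: keep local, report the conflict
    return merged, conflicts
```

Defaulting unknown paths to `CAREFUL` is a design guess: it seems safer to flag a conflict than to silently drop or overwrite something nobody classified.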