Hacker News new | ask | show | jobs
by haberman 4481 days ago
You're right inasmuch as I shouldn't have implied that unsandboxed interpretation is the only option.

But my larger point still stands; the fundamental tradeoff is still "power of the payload" vs "guarantees to the container." Even in the case of sandboxed execution, the container loses two important guarantees compared with non-executable data formats like JSON:

1. I can know a priori roughly how much CPU I will spend evaluating this payload.

2. I can know that the payload halts.

This is why, for example, the D language in DTrace is intentionally not Turing-complete.

2 comments

I agree 100% with you, but #1 isn't completely true. The counterexample is the ZIP bomb (http://en.wikipedia.org/wiki/Zip_bomb) Whenever you unzip anything you got from outside, you should limit the time spent and the amount of memory written.
If excess CPU/non-halting behavior is the issue, you could run the code with a timeout.
You could. But that has downsides also:

1. imposing CPU limits incurs an inherent CPU overhead and code complexity.

2. if those limits are hit, you can't tell whether the code just ran too long or whether it was in an infinite loop.

So now if we fully evaluate the options, the choice is between:

1. A purely data language like JSON: simple to implement, fast to parse, decoder can skip over parts it doesn't want, etc.

2. A Turing-complete data format: have to implement sandboxing and CPU limits (both far trickier security attack surfaces), have configure CPU limits, when CPU limits are exceeded the user doesn't know whether the code was in an infinite loop or not, maybe have to re-configure CPU limits.

Sure, sometimes all the work involved in (2) is worth it, that's why we have JavaScript in web browsers after all. But a Turing-complete version of JSON would never have taken off like JSON did for APIs, because it would be far more difficult and perilous to implement.

I have to agree here. General Turing-completeness was known from the beginning to imply undecidable questions -- about it's structure, running time, memory and so on. I don't think this has a place as the 'data'.

Abstractions exist for a reason -- this is analogous to source/channel coding separation or internet layers. They don't have to be that way, but are there for a reason.

Someone could change my opinion, though. Provide me a data format which proves certain things about it's behavior and that would be a nice counterexample.