Hacker News new | ask | show | jobs
by iot_devs 975 days ago
As very long term goal, I want to work on a project like this but on much larger scale.

I have a tiny prototype based on Go and soufflé.

The overall goal would be to figure out classical error conditions like nill pointers deference.

If I can figure out if a pointer will be nil in some execution branch, there is no reason why a computer cannot do the same.

However it is very hard work and the part I am currently stuck in is creating a reasonable abstraction level where I can reason (in soufflé) about code and not about an AST.

I personally would see this as an human race level upgrades. Imagine feeding your code to a CI that spit back something like: "you will have a panic at line 156 when your input is > 4"

Moreover we will be able to check the "real code" starting from main down to all execution paths. Which is much more powerful than current static analysis that usually analyse a single function without considering how the function itself is invoked.

Beside the actual implementation difficulties, the other big elephant in the room is the amount of resources needed to run this kind of analysis.

(I am less concerned about that as computation resources exploded due to AI and solving SAT is getting faster and faster)

5 comments

No, static analysis can also analyze multiple functions. What you are looking for is called interprocedural analysis and is more expensive than single function analysis (aka intraprocedural analysis). More generally what you want is context-sensitive (knows the calling context) and probably flow-sensitive (knows the control flow). The more precise the analysis, the more expensive they are. There are points-to analysis that can be pretty precise, e.g. https://arxiv.org/abs/1801.09189

> I am less concerned about that as computation resources exploded due to AI and solving SAT is getting faster and faster

Sadly we are usually very concerned about this. The complexity of these problems are usually NP-hard, so we usually have either heuristics (imprecise), exponential algorithms, or fixed-parameter tractable algorithms (practically speaking, O(2^c n) for a pretty large c), which are either imprecise or too slow. This is still an area with active research. The main problem is that, even if your computational power gets 2x faster, you cannot solve problems 2x larger within the same amount of time. It doesn't scale... (but of cause, usually we are dealing with sparse problems so the scaling is not in LOC for example)

> The overall goal would be to figure out classical error conditions like nill pointers deference.

> If I can figure out if a pointer will be nil in some execution branch, there is no reason why a computer cannot do the same.

Note, this is called flow-sensitive typing (also called type narrowing) and I think that typescript does it.

https://en.wikipedia.org/wiki/Flow-sensitive_typing

> I personally would see this as an human race level upgrades. Imagine feeding your code to a CI that spit back something like: "you will have a panic at line 156 when your input is > 4"

A model checker can do that!

See this

https://model-checking.github.io/kani/tutorial-kinds-of-fail...

Other techniques are also possible

https://github.com/viperproject/prusti-dev#quick-example

(Here I could link a lot of things, I just selected two Rust projects to illustrate)

This works better if you are able to provide contracts in your API that says which guarantees you provide. Alternatively, asserts are useful too.

Much of this kind of analysis is relatively easy in Datalog/Souffle. You can see a tutorial paper here: https://yanniss.github.io/points-to-tutorial15.pdf

> the part I am currently stuck in is creating a reasonable abstraction level where I can reason (in soufflé) about code and not about an AST

I recommend aiming for an abstraction level similar to the one in the paper (starts on page 9). Once you have that target, writing relations to convert an AST fact such as `binary_op("=", leftId, rightId)` to `Move(to, from)` is fairly straightforward!

If we want to prevent dereferencing null pointers then it seems like our time would be better spent improving type systems and language syntax. Shift the issue left instead of detecting problems after code has already been written.
Are you aware of symbolic execution?