Hacker News new | ask | show | jobs
by Animats 1194 days ago
DARPA is funding this. Good.

They haven't reached inter-procedural static analysis yet, which means they can't solve the big problem: how big is an array? Most of the troubles in C come from that. Whoever creates the array knows how big it is. Everybody else is guessing.

A bit of machine learning might help here. If you see

    void dosomethingwitharray(int arr[], size_t n) {}
a good conjecture is that n is the length of arr. So, the question is, if this is translated to

    fn dosomethingwitharray(arr: &[i64]) {}
does it break anything? Both caller and callee have to be analyzed. The C caller has the constraint

    assert_eq!(arr.len(), n);
That's a proof goal. If a simple SMT-type prover can prove that true., then the call can be simplified to just use an ordinary Rust slice. If not, conversion to Rust has to drop to those ugly C pointer forms, preferably with a comment inserted. So you need something that makes good guesses, which is a large language model kind of thing, and something which checks them, which is a formalism kind of thing.

The process can be assisted by putting asserts in the original C, as checks on the C and hints to the conversion process. That's probably the cleanest way to provide human assistance.

I've wanted this for conversion of OpenJPEG code to Rust. That's a tangle of code doing wavelet transforms, with long blocks of touchy subscripting and arithmetic, plus encoders and decoders for an overly complex binary format containing offsets and lengths. Someone recently ran it through c2rust. The unsafe Rust code works. It's compatible with the original C - it segfaults for the same test cases which cause the C code to segfault. This is why a naive transpiler isn't too helpful.

(The date at the bottom of the article is 2022-06-13. Has there been further progress?)

6 comments

> a good conjecture is that n is the length of arr

As an old osdev currently enjoying using rust instead, I would say I wish.

N might be the length of arr. it might also correspond to the number of elements of (implicit) type t that would fit in the unsigned char array arr. It might be the length of the array minus space for a trailing char (either minus one or minus sizeof(char) bytes). Or it could be the size plus one, because why not.

What you just described is the absolute bane of my existence at work in embedded firmware development. Though typically size_t does mean what it says on the tin — the horrific drivers usually use some other unsigned int type if they’re doing dumb stuff with “n” there.
Which is why it's only a conjecture. You need static analysis, SAT solving, and run-time checks to validate that conjecture.

Using something like GPT-4 on this problem is promising. It's probably going to be right most of the time, and its errors can be caught by the next phase of the analysis. That's about what you'd get if you put junior programmers on language conversion.

> The date at the bottom of the article is 2022-06-13

The date was wrong; sorry, my mistake. The article reflects progress as of early January 2023. We're actively working on the lifting feature and will post a follow-up post once the tooling are sufficiently mature to be tested by the community.

> The date at the bottom of the article is 2022-06-13. Has there been further progress?

The article links to their github repo:

https://github.com/immunant/c2rust

There's commits in the last hour, so at least some signal of life.

One could rewrite the C code to pass a pointer to an array with a length in the function prototype. Then you could also get bounds checking in C with UBSan.

https://godbolt.org/z/dPWPo1rrv

n could be the subset still requiring sorting; or any other subset of the array; a start index; it could be anything. assert(len(arr) == n) is quite the wild conjecture.
> does it break anything?

probably the abi if nothing else?