UndoDB – The interactive time travel debugger for Linux C/C++ for debugging (undo.io)
109 points by droideqa 390 days ago
7 comments

FOSS alternative: https://rr-project.org/
What's the difference with RR?
Undo (where I'm CTO) has existed for longer than RR and its real benefit is that it scales to use cases where RR (for one reason or another) isn't a fit.

Technically:

* Doesn't need hardware performance counters - runs on more CPUs and on cloud systems (where performance counters are often blocked).

* Can attach and detach at any time - means you get to record just a subset of program execution that's interesting.

* You can ship our recording tech with your application and control it by API, so you can grab crash recordings on customer systems.

* Supports programs that share memory with non-recorded processes.

* Supports direct device access (e.g. DPDK).

* Accelerated debugging features - searching within recordings using parallel processing, and accelerated conditional breakpoints that are a few thousand times faster than native GDB.

* We provide a stable, patched fork of GDB that we're occasionally told is more stable than the default.

For many people's use cases none of these really matter - they should use RR if they're not already.

But if you need any of these things then Undo can give you time travel debugging. In practice, it's usually big software organisations that we deal with because they have development pain and the extreme requirements we can match.

> * You can ship our recording tech with your application and control it by API, so you can grab crash recordings on customer systems.

That’s actually pretty neat.

[rr developer here]

Undo has cool features like Live Recording that we don't have in rr. They don't need access to the hardware PMU which is a big advantage in some situations. They can handle accesses to shared memory in cases where rr can't. https://undo.io/resources/undo-vs-rr/ is a good resource.

If you don't have access to the hardware PMU then you can try https://github.com/sidkshatriya/rr.soft (which is a modification of the rr debugger).

It may not be commercial quality but it's open source and free :)

[I built rr.soft]

Undo also supports Java and Scala: https://docs.undo.io/java/index.html
AFAIK it records multithreaded applications on multiple threads and CPUs, whereas rr records them on a single OS thread. Not sure about replay. I've never used Undo though, so I'm not sure how much better it is.
rr does support multithreaded and multi-process applications by allowing only a single thread to run at a time, as Undo does [1]. (edit note - that's only about multithreading; Undo might have parallel multi-process recording)

[1]: https://undo.io/resources/undo-performance-benchmarks/ - "Undo serializes their execution"

I stand corrected, not sure where I heard this then.
https://undo.io/resources/undo-vs-rr/ does note parallel recording for multi-process (not multi-threaded), so perhaps that.
Free Windows equivalent: WinDbg Time Travel Debugging (https://learn.microsoft.com/en-gb/windows-hardware/drivers/d...).
WinDbg's time travel debug is really cool and more people should know about it. I'm always a little sad that it's not (so far!) officially integrated in something like VS Code.

Before it was released publicly I believe Microsoft had been using it internally to share recordings on bug reports against massive pieces of software like Office. So it's a serious piece of tech.

I used it (iDNA) on the Windows team starting around 2006 or so and we were able to resolve bugs in minutes that had been open for years. It was absolute magic.
gdb should already have a similar feature?
GDB does:

https://sourceware.org/gdb/current/onlinedocs/gdb.html/Proce...

But it's limited. It's really cool that it's integrated by default but it doesn't scale to big applications / workloads.

RR and Undo both use GDB as a user interface, though, so any skills you have there will carry over.

A lot of people don't know about, and don't use, GDB's reverse debugging. It is an awesome, hidden feature that more developers should know about :)

All those "Oh wait, I missed it..." debugging sessions and those "What exactly changed over there?" questions become answerable.

It does, but it is really sad by comparison with rr and UndoDB. You could use it to record a few function calls or perhaps if you’re lucky a whole frame of your game but not a whole program.
if you could get this working on embedded arm cpus, I think you'd be surprised how many customers there would be.
Time travel debugging on embedded ARM has been available for over 20 years via trace probes [1].

The product that gave the category its name, TimeMachine (hence "time-travel debugging" rather than other attempted names such as reversible, bidirectional, record-replay, etc.), was available in 2003 and supported the ARM7 [2]. Note that this is not the ARMv7 architecture; it is the ARM7 chip [3], in use from 1993-2001.

From what I know, the ARM7 was one of the first ARM designs implementing the Embedded Trace Macrocell (ETM) which could output the instruction and data trace data used to support trace probe-based time travel debugging.

[1] https://jakob.engbloms.se/archives/1564

[2] https://www.ghs.com/products/probe.html

[3] https://en.m.wikipedia.org/wiki/ARM7

What's limiting us is that Undo does need a Linux kernel - so traditional embedded programming wouldn't be a fit. Embedded Linux could work and we do support ARM64.

I've thought a bit about how you might support time travel on bare metal embedded - but there are sometimes hardware-assisted solutions there already (Lauterbach's Trace32 was one we came across).

Let me save you a click:

Pricing & Licensing

A UDB floating license costs $7,900 per year.

If you’re going to spend money, then you would be better off using rr and paying for Pernosco. Pernosco is amazing.
Thanks for the Pernosco tip. It really looks amazing, and you can try it for free as a GitHub user.
You’re welcome.
rr is awesome and is free and open and all that. How much better could this possibly be?
Undo co-founder here. rr is indeed awesome. If it works for your use-case, you should use it!

Undo is mostly used by companies whose world is complex enough that rr doesn't work for them, and they understand how powerful time travel debugging is.

There has now been a LOT of engineering invested by a lot of very smart people into Undo, so it does also have a lot of polish and nice features.

But honestly, if rr is working for you, that's great. I'm just glad you're not doing printf debugging the whole time :)

They have a comparison page: https://undo.io/resources/undo-vs-rr/

I was in talks with them recently because I kept running into limitations with rr. The main advantages for my use case were that Undo doesn't have the same dependency on hardware timers, which means the ARM support is much better, you can run it in a VM (e.g. a cloud machine), and you can do replays on different systems.

A couple minor notes:

- If your program is very light on syscalls (i.e. basically entirely in-memory computation), rr can go to a basically 1.0x slowdown. In particular this means you can run benchmarks in it at full capacity, provided that I/O is outside of the repeated part (e.g. if sometimes the bench is noticeably slower, you can replay and see if some important loads/stores crossed a cacheline/page). You can even "perf record" / "perf stat" a replay if you want to! (none of this is too useful, but it's fun! Gathering repeated stats over the same execution for more resolution might be useful with proper tooling though)

- rr does have an in-memory buffer of recording data.

- rr recordings should be portable within the architecture, as long as the replay hardware has the extensions the recorder did (or if replayer-unsupported features are disabled at record-time).

I regularly deal with 3 different architectures. I can go and spin up a cloud instance every time I want to run rr (and in fact that's the solution I've been working with), but it's just annoying enough to justify spending a couple hours in sales calls.
Well, if you have a Google L5 making ~365k [1] then it would need to make them ~2.2% more productive overall to be worth it when just considering direct pay. If we consider a Google L3 at ~187k then it would need to make them ~4.2% more productive overall.

This, of course, ignores employee benefits and overhead which usually amount to ~100% extra costs over direct pay. So that is now ~1.1% and ~2.1%, respectively.

And that ignores the fact that you need to pay people less than they produce to be profitable which probably drops us down to ~0.5% and ~1.0%, respectively.

[1] https://www.levels.fyi/companies/google/salaries/software-en...

edit: Incorrectly linked to product designer instead of software engineer levels.
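
The break-even arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `break_even_gain` and both multiplier parameters are made up here, the 2x overhead factor is the "~100% extra costs" figure from the comment, and the 2x "value produced" factor is purely an assumption chosen to roughly match the last pair of numbers.

```python
LICENSE_COST = 7_900  # USD per year, the UDB floating-license price quoted in this thread


def break_even_gain(direct_pay, overhead_multiplier=1.0, value_multiplier=1.0):
    """Return the fractional productivity gain at which the license pays for itself.

    overhead_multiplier scales direct pay up to fully loaded cost
    (2.0 corresponds to "~100% extra costs over direct pay").
    value_multiplier is a hypothetical ratio of value produced to cost;
    2.0 is an assumption for illustration, not a figure from the thread.
    """
    return LICENSE_COST / (direct_pay * overhead_multiplier * value_multiplier)


if __name__ == "__main__":
    for label, pay in (("L5", 365_000), ("L3", 187_000)):
        direct = break_even_gain(pay)
        loaded = break_even_gain(pay, overhead_multiplier=2.0)
        produced = break_even_gain(pay, overhead_multiplier=2.0, value_multiplier=2.0)
        print(f"{label}: {direct:.1%} on direct pay, "
              f"{loaded:.1%} fully loaded, {produced:.1%} against value produced")
```

Swapping in your own salary and overhead figures shows how sensitive the conclusion is to those inputs, which is the point the replies below argue about.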

OK... Most of us don't know what a "google l5" is, so I guess we can safely ignore this. Heh.
The major failing of such "just 1% / a cup of coffee" arguments is that there is an infinite number of things you could pay for with the same potential productivity promise, without any hard data on whether those promises are true. So the fact that you can use a calculator and divide to get a low % doesn't help you much, if at all.
No-one is going to spend $8K out of pocket to A/B test this on themselves. Of all the things you could be doing to improve your productivity, this is some high hanging fruit.
If you have a US employer who is unwilling to spend 8 k$ on software engineering productivity then they are penny wise, pound foolish. It literally costs 10x that for a single junior engineer. And, as I pointed out, the net productivity improvement you need to see to justify that expense is minuscule.

If your employer really is skeptical, then they can run an A/B test over a small group of engineers to prove out changes in productivity. But not even being willing to run that test when it is so cheap is just management incompetence.

Engineers are ridiculously expensive. In electrical engineering, where the engineers are generally less well-paid than in software, employers routinely spend multiple hundreds of thousands of dollars per engineer per year in tooling. Not being willing to spend 8 k$ on a test of well known technology and attempting to identify mere single digit percentage improvements is just stupid.

Not everyone is Google. Some people work for themselves, or have very small teams, or live in a developing country, and don't have lots of spare cash lying around.

Please try to understand that the world is not as simple and black and white as you'd like.

Another more powerful option: Intel SDE / PinPlay