by Nition 346 days ago
Thanks for linking that. I've tried Thunderbird a couple of times in the past and quite liked it, but that thread has put me off using it forever.

Even if the bug is fiendishly hard to track down and reproduce, you'd think there would be some additional safety checks they could add that would at least let it fail with an error message instead of actual data loss.


It's not unlikely that similar problems exist with other mail programs, but since they are closed source you don't see it.
People would still complain about them on forums, often ones run by the company who makes the client! I'm often reading threads of issues on Apple's public support forums. Being open or closed source has nothing to do with hearing about problems.
Closed software doesn't have open bug trackers, so there's no systematic way to find out.

An acquaintance of mine was twice hit with a bug that corrupted Word documents stored on iCloud if editing on her iPad. Searching online yielded others with the same problem from more than one year ago...

I was able to find complaints fairly easily. I had them listed but HN ate my comment. Search "Missing emails" instead of "delete all emails" as the latter tends to provide instructions about how to bulk delete.

  > Being open or closed source has nothing to do with hearing about problems.
Also, pay attention to observation bias and userbase bias.

If my dad faced this issue, he'd never post online. He'd call me or go to a computer repair shop. That's what your average user will do.

Open Source users tend to be a bit more tech savvy. There's that famous article about Linux gamers reporting way more bugs than average users and how it can be accidentally misinterpreted as "why develop for Linux?" These frequency biases are a big part of this. Plus, OSS tends to do better bug tracking.

  > you'd think there would be some additional safety checks they could add that would at least let it fail with an error message instead of actual data loss.
My guess is that these would exist, and do.

I think you've just made an assumption about a bug that was reported 17 years ago: that nothing has been done since. It looks like they can't reproduce it, *making it impossible to mark as fixed* even if it was. But I wouldn't assume nothing was done.

Also remember that Gmail, Outlook, and others are in play here. They also maintain trashed items for 30 days, making it easy to recover. As the provider, they shouldn't make it easy to mass delete things either, right? TB is just the interface. Frankly, I'm not sure I know how to permanently delete emails with it. I'm not sure I can. But the interaction here should result in multiple lines of defense.

It's not just one report from 17 years ago, it's 194 comments with the most recent one from nine months ago. It doesn't seem like mitigation steps have been implemented.
See, you don't understand. Fixing the bug before reproducing it would violate the process.
Well, I do think the Thunderbird team should investigate and fix this. But it is almost impossible to fix a bug you can't reproduce and have no clue why it might be happening.
> But it is almost impossible to fix a bug you can't reproduce and have no clue why it might be happening.

No, not at all. It's very easy.

This bug involves taking an inappropriate action under corrupted conditions. You don't need to know how those conditions arose. All you have to do is check whether they currently obtain, and - if so - refrain from taking the inappropriate action.

For this bug, that looks like this:

1. When we're executing a "move"...

2. Before deleting the original messages...

3. Check whether the copies are identical to the originals...

4. And if not, delete the copies instead of the originals.

At this point, the bug can't occur. The "root cause" bug, where your buggy logic says that you copied a bunch of messages even though you didn't, can still occur, but it can no longer delete any messages.
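The four steps above can be sketched roughly as follows. This is only an illustration, not Thunderbird's code: folders are modeled as plain dicts mapping a message id to raw bytes, and `safe_move` and `digest` are hypothetical names.

```python
import hashlib


def digest(raw: bytes) -> str:
    """Content fingerprint used to compare an original with its copy."""
    return hashlib.sha256(raw).hexdigest()


def safe_move(source: dict, dest: dict, ids: list) -> list:
    """Move messages from source to dest (both hypothetical in-memory folders).

    An original is deleted only after its copy has been read back and
    matches. Returns the ids that were actually moved.
    """
    moved = []
    for msg_id in ids:
        original = source[msg_id]
        dest[msg_id] = original            # executing the "move": copy first
        copy = dest.get(msg_id)            # read the copy back
        if copy is None or digest(copy) != digest(original):
            dest.pop(msg_id, None)         # bad copy: delete the copy instead
            continue                       # the original stays where it was
        del source[msg_id]                 # verified: now delete the original
        moved.append(msg_id)
    return moved
```

If the destination silently drops or mangles a write, `safe_move` leaves that message in `source` and simply reports fewer moved ids, so the underlying copy bug can still happen without taking the originals with it.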

So…do it. Sounds like it’d make a great case study that would get a person tons of attention and praise on HN, a real feather to put in one’s cap.

Literally nothing stopping anyone in this thread from opening a PR with this reportedly “very easy” fix that’s eluded developers for nearly two decades, and is so terrible folks swear off Thunderbird forever because I guess for email very basic rules for backing up data don’t apply (or something?) and/or Gmail and Outlook are implicitly trustworthy?

> and is so terrible folks swear off Thunderbird forever because I guess for email very basic rules for backing up data don’t apply (or something?)

Well, this bug literally causes Thunderbird to delete your original copies of data during the backup process, so I'm not sure why backing up your data is supposed to be the solution.

Thunderbird stores mail locally on disk.

If you're keeping backups of your disk, then this bug is not unrecoverable.

Did a developer ever try? Reading the issue, I found only one person asking for test cases and trying to close it.
One of the many comments on the issue notes that although the bug has reoccurred in every version of Windows, it might not get much attention from developers because it is catalogued as something specific to Windows XP.

Nobody in the intervening nine years followed up by updating the bug's metadata, though. It's still "Windows XP only".

I try to never underestimate the incompetence/lack of concern people can have when it comes to addressing major product issues, but if this has been open for 17 years and is so widely known, somebody has surely looked into it and determined it’s not so easy.
and then they simultaneously determined "yeah, we might eat your data. Let's not warn anyone about that AT ALL, let's keep the feature activated and let the users lose their data". This behavior ought to be criminal.

  > Let's not warn anyone about that AT ALL, let's keep the feature activated and let the users lose their data
How did you conclude this?

IDK why the assumption is that safety measures haven't been created. You wouldn't mark the bug as resolved if you put in safety features, right? You *ONLY MARK AS RESOLVED* after reproducing the bug and *VERIFYING* that it won't happen again. Right? Dear god I hope this is what you do, because otherwise you are prematurely closing bugs.

Make no mistake - I am not absolving them of leaving this issue unaddressed lol just saying if it was easy they’d likely have handled it. It’s probably difficult or they just don’t know, so they keep putting it off and decided that not enough users are affected for real consequences (which is wrong to do)
It's not criminal, but you're entitled to a full refund of Thunderbird in the event it happens.
if (!user_requested_mass_delete && delete_requests_past_second > 10) throw("we sure seem to be deleting a lot of stuff from the server")
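For concreteness, here is a small runnable sketch of that one-liner as a sliding-window counter. The class name, the injected `now` timestamp, and the defaults (10 deletes per 1 second) are assumptions made for illustration. Note that deletes issued before the threshold is crossed still go through.

```python
from collections import deque


class DeleteRateGuard:
    """Refuses unrequested deletes once too many land inside a time window."""

    def __init__(self, max_deletes: int = 10, window_seconds: float = 1.0):
        self.max_deletes = max_deletes
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def check(self, now: float, user_requested_mass_delete: bool = False) -> None:
        """Record one delete request at time `now`; raise if over the limit."""
        if user_requested_mass_delete:
            return
        # Forget requests that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        self._timestamps.append(now)
        if len(self._timestamps) > self.max_deletes:
            raise RuntimeError(
                "we sure seem to be deleting a lot of stuff from the server"
            )
```

A caller would invoke `guard.check(time.monotonic())` before each delete request and abort the operation if it raises.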
I would not want my email client to be relying on such brittle and incorrect heuristics.

A better workaround would be to keep deleted emails around for some time so users have the option to restore them if the bug triggers. But this has drawbacks such as potential privacy breakage (you meant to delete mails because you don't want any chance that anybody sees them) or free disk space management (your local drive is overloaded and you want to urgently free up space) or UX confusion (this is a de facto trash but Thunderbird already has such a feature)
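A minimal sketch of that retention idea, assuming an in-memory store and a caller-supplied clock; `RetentionStore` and its method names are hypothetical. It also shows the disk-space drawback directly: nothing is freed until `purge_expired` runs.

```python
class RetentionStore:
    """Holds deleted messages for a grace period before really dropping them."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self.live = {}    # message id -> raw bytes
        self._held = {}   # message id -> (deleted_at, raw bytes)

    def delete(self, msg_id: str, now: float) -> None:
        """Remove a message from view, but keep it restorable."""
        self._held[msg_id] = (now, self.live.pop(msg_id))

    def restore(self, msg_id: str) -> None:
        """Bring back a message that is still within its grace period."""
        _, raw = self._held.pop(msg_id)
        self.live[msg_id] = raw

    def purge_expired(self, now: float) -> list:
        """Permanently drop held messages older than the retention period."""
        expired = [
            msg_id
            for msg_id, (deleted_at, _) in self._held.items()
            if now - deleted_at >= self.retention_seconds
        ]
        for msg_id in expired:
            del self._held[msg_id]
        return expired
```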

Ultimately, what needs to be done is make the code robust, make sure there are no race conditions, etc.

Well, would you rather have a brittle heuristic lose all of your mail?

My point wasn’t that this is a great solution, just that it is very easy and almost certainly better than doing nothing for almost two decades.

> Well, would you rather have a brittle heuristic lose all of your mail?

That's not what's happening. I wouldn't expect such a heuristic to be currently present. There is a bug, not something intentional.

> almost certainly better than doing nothing

No, because with such a heuristic, you add behavior that's difficult for the user to understand well and to work with. With such a heuristic, you will lose some mails and at some point the process stops in the middle. Which mails have you lost? What is "many" mails? 10? 100? What if my computer is fast and is deleting 100s of mails per second, losing all the mails anyway? What if it is slow and never triggers the heuristic?

If the heuristic does trigger, you end up with a mixed situation where you still have lost some stuff, but not all, and it'll be impossible to understand which ones. It doesn't fix the issue (you still lose email), just makes it even more difficult to understand even for the devs when they inevitably need to track down related issues. You really don't want to willingly add mechanisms that feel like they are non-deterministic: they are hard to debug, and hard for the users to grasp.

A way better solution is backups anyway: if you care not to lose your emails, you should be backing them up. From the beginning, your local TB mails are not a proper backup of your IMAP account because it's two-way synchronized so you need a backup somewhere else.

A still better workaround is disabling the move to local folder feature and making people copy and then manually delete mails.

Not saying your heuristic is not a good idea or clever (it is clever and could lead to further good ideas), just that after reflection, it should probably not be implemented. It barely starts to address the issue and adds complexity for everyone involved.

Just do a complete rewrite in rust, that will solve all the issues
Except the bug was filed in 2008. Back then, Rust was Graydon Hoare's personal project that Mozilla wouldn't start funding until a year later. Rust was written in OCaml and the famed borrow checker wouldn't be in place until 2010. The first public release was v0.1 in 2012 and the first stable release 1.0 wouldn't happen till 2015. The language was very different back then with sigils, garbage collection and green threading as language features. So this bug was already bugging people when Rust was just an embryo that was still years away from birth.

Now even if we neglect the timeline, Rust only guarantees memory safety. If TB is deleting mails on the server too, then the corruption is happening over IMAP connections as well. Does that sound like a memory safety bug to you? Perhaps it is. But how do we eliminate the possibility of a logical bug that Rust won't protect you against, when nobody has any clue even now? And all that aside, if you're going to rewrite it in Rust, you might as well start a new project in Rust instead of porting an old design that may potentially contain a language-agnostic logical Heisenbug.

I'm not trying to be hostile here. I started using Rust in 2013 (I have 12 years of experience in a 10 year old language, and a bunch of repos that I can't compile anymore unless I compile the compiler from old commits somehow!). I wouldn't use C or C++ for any of these applications - I simply don't have enough competence to avoid the kind of bugs that Rust protects me from (despite being a hardware engineer with more knowledge about memory management than about type system theory). Despite all that, statements like this will only cause an unwanted backlash against Rust. Not that you're entirely wrong, but some people are so offended by such suggestions for reasons that are still under investigation, that they start a crusade against Rust [1].

[1] https://fosstodon.org/@goku12/114077011555069124

Is whichever part of Mozilla that runs Thunderbird going to rehire the rust team now?
Honestly? It might.
A better approach might be to feed all of this into an LLM to have it figure it out. If it finds a bug and has a fix, reproducing it might be easier and a test could potentially be written.

I don’t think LLMs are the answer to everything, but this would be a good test for newer generations of LLMs as they’re developed.

Worst case, it deletes all of your emails, but that would’ve happened anyway, right? =)

Reproducing bugs is a luxury and not even close to required for analyzing and fixing issues. Even if the issue is external (hardware, antivirus, etc.), the code can be changed to be more defensive and only ever delete the original when the new data has been successfully written and verified.
You're right, but you're also wrong.

The problem is you can never close the bug report if you can't reproduce. I guess, you could, as the other commenter suggests, mathematically prove that it can't happen, but otherwise you're prematurely closing it.

How do you differentiate that you solved the bug and not a similar looking bug?

  > the code can be changed to be more defensive and only ever delete the original when the new data has been successfully written and verified.
But this doesn't solve the problem.

  - What if it is an upstream issue? They have to be connected, since they are deleting data. Maybe it is completely a bug on their end? Doesn't matter how defensive you are if the bug was "anytime an email has 'man man' and is pulled between 00:00-00:04 everything deletes" then what can you do? 
  - What if the user was hacked and the hacker just deleted all the data?
  - What if the user was just dumb and deleted the data themselves? Either they did it unknowingly or they were embarrassed to say anything.
  - What if it is another program on the user's computer that is deleting the data because of some weird unexpected collision?
I'm sure you can think of more situations that still won't solve the problem.

How do you close the report if you can not make strong guarantees that it is resolved?

A luxury? Not even close to required? You are not afraid of words! I'm not looking forward to receiving a bug report from you!

Yeah, reproducing is not theoretically mathematically necessary. In theory you could prove your code is correct with formal methods¹. Now, nobody does this because it is impractical (borderline impossible), so in practice reproducing is useful enough to be almost essential:

- it lets you study how your code behaves in the problematic case and identify what's causing the exact issue the user is seeing

- it lets you check that your fix does indeed address the bug

I have indeed already fixed trivial bugs without reproduction cases from a vague description of a bug because I'm intimately familiar with the code and it immediately rings a bell: the cause is immediately obvious. But that's not the usual case.

> the code can be changed to be more defensive and only ever delete the original when the new data has been successfully written and verified.

What if the code is already designed like this (and I sure do hope it is currently written like that, because that's almost common sense, if not the only sensible way of moving something) but somehow fails for some currently inexplicable reason? It smells like a race condition to me.

In the case of the discussed bug, users have described a reproduction case that's not 100%. But someone will need to find a 100% reproduction case. Users, or devs. It will not be optional. You can't play a guessing game, attempt to fix the code and hope for the best. You might be able to actually fix the bug, but without much confidence. Best case, you'll be able to find a reproduction case after fixing the bug (that you'll probably use as a functional test), to prove you fixed the bug for this specific case you found. You'll not be 100% sure you addressed the user's case.

A bug can hide another one, so you could find and fix a bug, but the issue is still present in the user's case. You can only be sure with their reproduction case.

But I agree that it is hard to reproduce a race condition.

¹ which in practice applies to code of trivial size (static analysis), or consists in checking a model but not the actual implementation (model checking), or does apply useful checks but is not exhaustive and has false positives / negatives (static analysis), or does apply useful exhaustive checks but only on a limited number of executions (runtime verification, and we do have functional tests that serve a similar purpose in practice - and you'll actually need the reproduction case here so you have the right execution to check), or requires you to write your code in a specific language (stuff like coq) and you cross your fingers that this specific language's implementation is itself correct. In short: not applicable here.

  > it is almost impossible to fix a bug you can't reproduce
It's also impossible to mark a bug report as resolved if you can't reproduce it.

You could have fixed the bug (especially since a lot of TB was rewritten), but if you can't reproduce the bug you wouldn't know it was solved, only that people stopped reporting it. This is actually a common occurrence with long standing bugs.

You know what else they could do? They could disable a feature that deletes large volumes of email the user doesn't intend to delete.
I don’t remember the last time I deleted an email. I’ve marked things as spam, archived things but not deleted in a long while.
I delete email everyday.
Sometimes you want to delete a whole bunch of mails, don't you?
I've updated my comment for clarity. The bug (which I've never encountered in more than 20 years as a Thunderbird user) is that users move messages to a local email folder, but the messages are deleted from the server without actually downloading them. At a minimum they should disable that operation. The guy that originally reported it worked at Sun and lost hundreds of work messages as a result of this bug. AFAICT the user wouldn't be affected if they did a copy of the messages and then manually deleted them from the server folder after confirming the copy was successful.
How do you fix a bug you can't reproduce?

It's a genuine question because I'm puzzled here.

A very small number of users have this bug (and tbf, it's a really bad bug), and are unable to consistently reproduce it and it seems none of the developers have been able to (the seemingly random nature of the bug occurring is not helping). How is it supposed to be fixed?

You add more and more diagnostics (e.g. logging) in that area till you manage to track down the bug. Over several years this should be possible. At that point you can either fix the bug directly or do it properly by first reproducing the bug (in a test) and then fixing it.
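As a rough illustration of what "more diagnostics" could look like around a copy-then-delete move, here is a sketch using Python's standard `logging` module. The folder model (dicts of id to bytes), the logger name `mailmove`, and the function `logged_move` are all made up for the example and are not Thunderbird internals.

```python
import logging

log = logging.getLogger("mailmove")  # hypothetical logger name


def logged_move(source: dict, dest: dict, ids: list) -> None:
    """Copy-then-delete move that records the state around each step."""
    log.info("move start: ids=%d source=%d dest=%d",
             len(ids), len(source), len(dest))
    for msg_id in ids:
        raw = source[msg_id]
        dest[msg_id] = raw
        stored = dest.get(msg_id)
        if stored != raw:
            # Capture what a bug report needs: intended write vs. what is there.
            log.error("copy mismatch: id=%s expected_bytes=%d stored_bytes=%s",
                      msg_id, len(raw), None if stored is None else len(stored))
            raise RuntimeError(f"copy of {msg_id} could not be verified")
        del source[msg_id]
        log.debug("moved id=%s bytes=%d", msg_id, len(raw))
    log.info("move done: source=%d dest=%d", len(source), len(dest))
```

With a file handler attached to the logger, a user who hits the problem can upload a log that shows which step went wrong, even when nobody can reproduce it on demand.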
How do you close a bug you cannot reproduce?

Said another way - If they can't reproduce it, they can't close it.

They may well have fixed it already, but without a way to reproduce it the only prudent behavior is to leave it open and wait for the next diagnostic file to be uploaded.

That's not the only prudent behaviour, as the OP said, the prudent behaviour is to add more diagnostics and guards against the conditions that lead up to the bug.
Okay, let's assume more diagnostics and guards were added.

Now re-answer the above questions with these assumptions.

  - How do you fix a bug you can't reproduce?
  - How do you *close* a bug report when you can't reproduce? 
Being generous here, we're assuming there's 17 years worth of diagnostics and safety guards added but through that time the bug still isn't reproducible. Let's try to answer the questions under these assumptions.
The way I've dealt with that in the past is putting it into Review or whatever the equivalent is, making a note ("cannot repro, but attempted potential fix in version XXXX, moving to review, please reopen if anyone reports this again") and then if nobody reports it still happening for x amount of time (e.g. 12 months), closing it. Can always reopen it if it gets reported again beyond that.
For starters, put a lot more effort into reproducing it.
- You can try harder to reproduce it.

- You can extend logging to gather additional information to reproduce it.

- You can try to reason about the code and figure out possible causes.

- You can attempt to formally verify the correctness of the code.

- You can put guards into the code against unexpected states and actions.

- You can verify the correct result of previous actions before any destructive actions.

- If all fails you can scrap the piece of code in question since it seems to be beyond your ability to maintain.

> How do you fix a bug you can't reproduce?

You strangle it from the edges.