Well, I do think the thundebird team should investigate and fix this. But it is almost impossible to fix a bug you can't reproduce and have no clue why it might be happening.
> But it is almost impossible to fix a bug you can't reproduce and have no clue why it might be happening.
No, not at all. It's very easy.
This bug involves taking an inappropriate action under corrupted conditions. You don't need to know how those conditions arose. All you have to do is check whether they currently obtain, and - if so - refrain from taking the inappropriate action.
For this bug, that looks like this:
1. When we're executing a "move"...
2. Before deleting the original messages...
3. Check whether the copies are identical to the originals...
4. And if not, delete the copies instead of the originals.
At this point, the bug can't occur. The "root cause" bug, where your buggy logic says that you copied a bunch of messages even though you didn't, can still occur, but it can no longer delete any messages.
So…do it. Sounds like it’d make a great case study that would get a person tons of attention and praise on HN, a real feather to put in one’s cap.
Literally nothing stopping anyone in this thread from opening a PR with this reportedly “very easy” fix that’s eluded developers for nearly two decades, and is so terrible folks swear off Thunderbird forever because I guess for email very basic rules for backing up data don’t apply (or something?) and/or Gmail and Outlook are implicitly trustworthy?
> and is so terrible folks swear off Thunderbird forever because I guess for email very basic rules for backing up data don’t apply (or something?)
Well, this bug literally causes Thunderbird to delete your original copies of data during the backup process, so I'm not sure why backing up your data is supposed to be the solution.
One of the many comments on the issue notes that although the bug has reoccurred in every version of Windows, it might not get much attention from developers because it is catalogued as something specific to Windows XP.
Nobody in the intervening nine years followed up by updating the bug's metadata, though. It's still "Windows XP only".
I try to never underestimate the incompetence/lack of concern people can have when it comes to addressing major product issues, but if this has been open for 17 years and is so widely known, somebody has surely looked into it and determined it’s not so easy.
and then they simultaneously determined "yeah, we might eat your data. Lets not warn anyone about that AT ALL, lets keep the feature activated and let them users lose their data". This behavior ought to be criminal.
> Lets not warn anyone about that AT ALL, lets keep the feature activated and let them users lose their data
How did you conclude this?
IDK why the assumption is that safety measures haven't been created. You wouldn't mark the bug as resolved if you put in safety features, right? You *ONLY MARK AS RESOLVED* after reproducing the bug and *VERIFYING* that it won't happen again. Right? Dear god I hope this is what you do, because otherwise you are prematurely closing bugs.
I'd agree with you that the fact that the bug is still open after 17 years isn't the problem, but the issue is that people are still (as of 10 months ago) running into the issue of their mail being deleted. If they'd secretly implemented "safety measures" as you suggest that wouldn't be happening.
Looking at the timeline, it's possible that they've addressed a few of the bugs that result in data loss several years ago, and it's possible that the latest guy who ran into the problem within the last year triggered it in new ways or under new conditions but it's clear that the problem of thunderbird deleting messages from the server when copies haven't successfully been saved during a move operation wasn't solved by any "safety measures" 9 months ago and it's doubtful that it's been solved now.
My guess is that because thunderbird ultimately doesn't bother to make sure that messages are successfully and accurately copied before it removes them from the server it'll only be a matter of time before someone else stumbles on some other set of circumstances which results in data loss when messages are being moved.
are you being for real? did you see anything as such in the bug listing? but even IF they did put safeguards in place, the fact that this is SEVENTEEN YEARS, no warning, functionality still enabled without ANY WARNING losing people data. unforgivable.
How can you possibly justify this behavior? I understand they dont owe the world any software, fine, but dont knowingly publish stuff that KILLS PEOPLES DATA without atleast a warning
Make no mistake - I am not absolving them of leaving this issue unaddressed lol just saying if it was easy they’d likely have handled it. It’s probably difficult or they just don’t know, so they keep putting it off and decided that not enough users are affected for real consequences (which is wrong to do)
I would not want my email client to be relying on such brittle and incorrect heuristics.
A better workaround would be to keep deleted emails around for some time so users have the option to restore them if the bug triggers. But this has drawbacks such as potential privacy breakage (you meant to delete mails you don't want the chance that anybody sees it) or free disk space management (your local drive is overloaded and you want to urgently free up space) or ux confusion (this is a de facto trash but Thunderbird already has such a feature)
Ultimately, what needs to be done is make the code robust, make sure there are no race conditions, etc.
> Well, would you rather have a brittle heuristic lose all of your mail?
That's not what's happening. I wouldn't expect such an heuristic to be currently present. There is a bug, not something intentional.
> almost certainly better than doing nothing
No, because with such an heuristic, you add behavior that's difficult for the user to understand well and to work with. With such an heuristic, you will lose some mails and at some point the process stops in the middle. Which mails have you lost? What is "many" mails? 10? 100? What if my computer is fast and is deleting 100s of mails per seconds, losing all the mails anyway? What if it is slow and never triggers the heuristic?
If the heuristic does trigger, you end up with a mixed situation where you still have lost some stuff, but not all, and it'll be impossible to understand which ones. It doesn't fix the issue (you still lose email), just makes it even more difficult to understand even for the devs when they inevitable need to track down related issues. You really don't want to willingly add mechanisms that feel like they are non-deterministic: they are hard to debug, and hard for the users to grasp.
A way better solution is backups anyway: if you care not to lose your emails, you should be backing them up. From the beginning, your local TB mails are not a proper backup of your IMAP account because it's two-way synchronized so you need a backup somewhere else.
A still better workaround is disabling the move to local folder feature and make people copy and then manually delete mails.
Not saying your heuristic is not a good idea or clever (it is clever and could lead to further good ideas), just that after reflection, it should probably not be implemented. It barely starts to address the issue and adds complexity for everyone involved.
Except the bug was filed in 2008. Back then, Rust was Graydon Hoare's personal project that Mozilla wouldn't start funding until a year later. Rust was written in OCaml and the famed borrow checker wouldn't be in place until 2010. The first public release was v0.1 in 2012 and the first stable release 1.0 wouldn't happen till 2015. The language was very different back then with sigils, garbage collection and green threading as language features. So this bug was already bugging people when Rust was just an embryo that was still years away from birth.
Now even if we neglect the timeline, Rust only guarantees memory safety. If TB is deleting mails on the server too, then the corruption is happening over IMAP connections as well. Does that sound like a memory safety bug to you? Perhaps it is. But how do we eliminate the possibility of a logical bug that Rust won't protect you against, when nobody has any clue even now? And all that aside, if you're going to rewrite it in Rust, you might as well start a new project in Rust instead of porting an old design that may potentially contain a language-agnostic logical Heisenbug.
I'm not trying to be hostile here. I started using Rust in 2013 (I have 12 years of experience in a 10 year old language, and a bunch of repos that I can't compile anymore unless I compile the compiler from old commits somehow!). I wouldn't use C or C++ for any of these applications - I simply don't have enough competence to avoid the kind of bugs that Rust protects me from (despite being a hardware engineer with more knowledge about memory management than about type system theory). Despite all that, statements like this will only cause an unwanted backlash against Rust. Not that you're entirely wrong, but some people are so offended by such suggestions for reasons that are still under investigation, that they start a crusade against Rust [1].
A better approach might be to feed all of this into an LLM to have it figure it out. If it finds a bug and has a fix, reproducing it might be easier and a test could potentially be written.
I don’t think LLMs are the answer to everything, but this would be a good test for newer generations of LLMs as they’re developed.
Worst case- it deletes all of your emails, but that would’ve happen anyway, right? =)
Reproducing bugs is a luxury and not even close to required for analyzing and fixing issues. Even if the issue is external (hardware, antivirus, etc.), the code can be changed to be more defensive and only ever delete the original when the new data has been successfully written and verified.
The problem is you can never close the bug report if you can't reproduce. I guess, you could, as the other commenter suggests, mathematically prove that it can't happen, but otherwise you're prematurely closing it.
How do you differentiate that you solved the bug and not a similar looking bug?
> the code can be changed to be more defensive and only ever delete the original when the new data has been successfully written and verified.
But this doesn't solve the problem.
- What if it is an upstream issue? They have to be connected, since they are deleting data. Maybe it is completely a bug on their end? Doesn't matter how defensive you are if the bug was "anytime an email has 'man man' and is pulled between 00:00-00:04 everything deletes" then what can you do?
- What if the user was hacked and the hacker just deleted all the data?
- What if the user was just dumb and deleted the data themselves. Either not knowingly or were embarrassed to say anything.
- What if it is another program on the user's computer that is deleting the data because of some weird unexpected collision?
I'm sure you can think of more situations that still won't solve the problem.
How do you close the report if you can not make strong guarantees that it is resolved?
A luxury? Not even close to required? You are not afraid of words! I'm not looking forward to receive a bug report from you!
Yeah, reproducing is not theoretically mathematically necessary. In theory you could prove your code is correct with formal methods¹. Now, nobody does this because it is impractical (borderline impossible), reproducing is in practice so useful as to be almost essential:
- it lets you study how your code behave in the problematic case and identify what's causing the exact issue the user is seeing
- it lets you check that your fix does indeed address the bug
I have indeed already fixed trivial bugs without reproduction cases from a vague description of a bug because I'm intimately familiar with the code and it immediately rings a bell: the cause is immediately obvious. But that's not the usual case.
> the code can be changed to be more defensive and only ever delete the original when the new data has been successfully written and verified.
What if the code is already designed like this (and I sure do hope it is currently written like that, because that's almost common sense, if not the only sensible way of moving something) but somehow fails for some currently inexplicable reason? It smells race condition to me.
In the case of the discussed bug, users have described a reproduction case that's not 100%. But someone will need to find a 100% reproduction case. Users, or devs. It will not be optional. You can't play a guessing game, attempt to fix the code and hope for the best. You might be able to actually fix the bug, but without much confidence. Best case, you'll be able to find a reproduction case after fixing the bug (that you'll probably use as a functional test), to prove you fixed the bug for this specific case you found. You'll not be 100% sure you addressed the user's case.
A bug can hide another one, so you could find and fix a bug, but the issue is still present in the user's case. You can only be sure with their reproduction case.
But I agree that it is hard to reproduce a race condition.
¹ which in practice applies to code of trivial size (static analysis), or consists in checking a model but not the actual implementation (model checking), or does apply useful checks but is not exhaustive and has false positives / negatives (static analysis), or does apply useful exhaustive checks but only on a limited number of executions (runtime verification, and we do have functional tests that serve a similar purpose in practice - and you'll actually need the reproduction case here so you have the right execution to check), or requires you to write your code in a specific language (stuff like coq) and you cross your fingers that this specific language's implementation is itself correct. In short: not applicable here.
> it is almost impossible to fix a bug you can't reproduce
It's also impossible to mark a bug report as resolved if you can't reproduce it.
You could have fixed the bug (especially since a lot of TB was rewritten) but if you can't reproduce the bug you wouldn't know it was solved only that people stopped reporting it. This is actually a common occurrence with long standing bugs.
I've updated my comment for clarity. The bug (which I've never encountered in more than 20 years as a Thunderbird user) is that users move messages to a local email folder, but the messages are deleted from the server without actually downloading them. At a minimum they should disable that operation. The guy that originally reported it worked at Sun and lost hundreds of work messages as a result of this bug. AFAICT the user wouldn't be affected if they did a copy of the messages and then manually deleted them from the server folder after confirming the copy was successful.
A very small number of users have this bug (and tbf, it's a really bad bug), and are unable to consistently reproduce it and it seems none of the developers have been able to (the seemingly random nature of the bug occurring is not helping). How is it supposed to be fixed?
You add more and more diagnostics (e.g. logging) in that area till you manage to track down the bug. Over several years this should be possible.
At that point you can either fix the bug directly or do it properly by first reproducing the bug (in a test) and then fixing it.
Said another way - If they can't reproduce it, they can't close it.
They may well have fixed it already, but without a way to reproduce it the only prudent behavior is to leave it open and wait for the next diagnostic file to be uploaded.
That's not the only prudent behaviour, as the OP said, the prudent behaviour is to add more diagnostics and guards against the conditions that lead up to the bug.
Okay, let's assume more diagnostics and guards were added.
Now re-answer the above questions with these assumptions.
- How do you fix a bug you can't reproduce?
- How do you *close* a bug report when you can't reproduce?
Being generous here, we're assuming there's 17 years worth of diagnostics and safety guards added but through that time the bug still isn't reproducible. Let's try to answer the questions under these assumptions.
If you've added guards and diagnostics, then you close it until someone else files a follow-up, then it can be re-opened. There's no sense keeping it open unless there are ongoing reports of the issue.
The way I've dealt with that in the past is putting into into Review or whatever the equivalent is, make a note ("cannot repro, but attempted potential fix in version XXXX, moving to review, please reopen if anyone reports this again) and then if nobody reports it still happening for x amount of time (e.g. 12 months), close it. Can always reopen it if it gets reported again beyond that.