by formalsystems 1986 days ago
I got the sense while crawling data from their API that the engineering quality is poor at Parler. Dates were represented as strings in "YYYYMMDDHHMMSS" format (so today would be "20210113053923") instead of UNIX timestamps, certain fields were duplicated for no reason (e.g. every object would have an identical "id" and "_id" key), counts of impressions/comments/etc would be the display strings rather than raw numbers (so "2k" or "5m"), and various moderation flags were in place like a boolean "sensitive" which was always false, even for posts that had been downvoted significantly.
6 comments

Dates were represented as strings in "YYYYMMDDHHMMSS" format (so today would be "20210113053923") instead of UNIX timestamps

Such a representation naturally avoids the Y2K38 problem, and could go beyond Y10K. It's traditional in Windows and DOS (neither of which has the Y2K38 problem) to store timestamps as a structure of fields.

The other things you noted I agree with, however.

If they're using a JavaScript 53-bit int representation for the seconds (or an int64_t cast down to a JavaScript big int) then it's a Y142711K problem, by which point the Imperium of Mankind will hopefully have settled on a more robust format.
The tech-priests will have lost the ability to fix it.
That's how we ended up with the 2038 problem!
I expect Slaanesh and friends will manage to sabotage that somehow.
You can also instantly read them which makes troubleshooting easier. I mean sure, if your shit is too slow maybe switch to less text in release mode but YAGNI.
Well, assuming they're storing the strings as ASCII, that's 98 bits (14 characters at 7 bits each). The Y2K38 problem is for 32-bit integers, so a 64-bit integer would be way, way more than needed for human needs for foreseeable generations.
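The numbers thrown around above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the mean Gregorian year length is an assumption, and the Y142711K figure appears to correspond to roughly 2**52 seconds (the full JavaScript safe-integer range, 2**53 - 1, would be about twice that).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_YEAR = 365.2425 * 86400  # assumption: mean Gregorian year

# Signed 32-bit seconds counter: the last representable instant (Y2K38).
print(EPOCH + timedelta(seconds=2**31 - 1))  # 2038-01-19 03:14:07+00:00

# 2**52 seconds expressed in years: roughly 142.7 million.
years_52 = 2**52 / SECONDS_PER_YEAR
print(f"{years_52:,.0f} years")

# 14 ASCII characters ("YYYYMMDDHHMMSS") at 7 bits each.
bits = 14 * 7
print(bits)  # 98
```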
Doesn't seem to me like Parler will have to worry about Y2K38...
A timestamp is a timestamp. It isn't a date. If you need a date, use a proper date/time data type.
All timestamps have to start somewhere. If you want to avoid DST changes and leap seconds, you can use MJD, TAI or GPS time instead of UTC, but you might as well format it nicely so that you can see roughly at what (civil) date something happened.
ISO 8601 is a good one.
Nice, that makes sense. I was unaware and found it strange when I plugged it into JavaScript's Date constructor and got an "Invalid Date" error.
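For anyone hitting the same parse failure, here is a minimal sketch of reading the compact string in Python. It assumes the 14 digits are laid out as YYYYMMDDHHMMSS and that the value is in UTC; the thread confirms neither, so treat both as guesses.

```python
from datetime import datetime, timezone

raw = "20210113053923"  # example value quoted in the thread

# Assumption: YYYYMMDDHHMMSS, interpreted as UTC.
parsed = datetime.strptime(raw, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

print(parsed.isoformat())       # 2021-01-13T05:39:23+00:00
print(int(parsed.timestamp()))  # 1610516363 (seconds since the UNIX epoch)
```

An explicit format string is needed because the value carries no separators and no "T", so general-purpose date parsers tend to reject it.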
This.

Plus, it's unambiguously human readable, for users, bystanders, platform developers, everyone. There's a useful usability principle in there.

Of all the things to criticise Parler's tech folks over, using ISO8601 (minus the non-digit characters) shouldn't be one.
Is ISO8601 without punctuation still ISO8601? Most log parsers I have seen would not pick up the Parler format, e.g.:

https://docs.python.org/3/library/datetime.html#datetime.dat...

https://github.com/elastic/logstash/blob/v1.4.2/patterns/gro...

Yes... kind of. Per https://en.wikipedia.org/wiki/ISO_8601, there is a "basic format" without separators and an "extended format" that includes them for readability. However, a T is still required to separate the date and time in the most recent version of the standard.
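To make the basic/extended distinction concrete, here is a small sketch that renders one instant three ways with `strftime`. The instant is the example from the top of the thread, assumed to be UTC; the variable names are just for illustration.

```python
from datetime import datetime, timezone

moment = datetime(2021, 1, 13, 5, 39, 23, tzinfo=timezone.utc)

basic = moment.strftime("%Y%m%dT%H%M%SZ")         # basic format: no separators, keeps the "T"
extended = moment.strftime("%Y-%m-%dT%H:%M:%SZ")  # extended format: with separators
parler_style = moment.strftime("%Y%m%d%H%M%S")    # the format described in the thread: no "T"

print(basic)         # 20210113T053923Z
print(extended)      # 2021-01-13T05:39:23Z
print(parler_style)  # 20210113053923
```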
ISO 8601 is pretty absurd when you actually read it. `2021-W02` and `--01-14` are valid, as is `--1013` (quick! guess what that means! and beware that `-1013` is valid too!)

Please, everyone, use a single format at all times in your systems. I don't really care what it is, though I'm fond of `2021-01-14T06:28:08Z` because it's unambiguous. But don't just say "use ISO 8601", it's far too vague and you'll inevitably have variations.

Without having read the spec...

* `2021-W02` means the second (ISO) week of 2021. Perfectly valid and used in a lot of planning.

* `--01-14` - I'm assuming this is a recurring date: every 14 Jan for every year

* `--1013` - at 1PM every 10th of the month? Guessing here

I believe ISO 8601 is an ISO codification of a DIN standard, and based on other standards processes I'm guessing some German manufacturing companies were the only ones who bothered showing up, so their internal software practices were encoded into the spec because no-one else cared.

That is such a common problem when standardizing that I've started to force my clients to have at least one person from each entity in project teams.

Often the biggest entity will end up accidentally forcing their practices, sometimes sub-optimal, to entire organizations, simply by having the manpower to show up to meetings.

Edit: `--1013` is 13 Oct in any year: https://en.wikipedia.org/wiki/ISO_8601#Truncated_representat...

(`--01-14` is Jan 14 in any year, the last dash is "optional").

The "duration" (`P`) and "repetition" (`R`) syntax is also pretty wild.

RFC 3339 is a profile of ISO 8601 that is much more limited but still provides the timestamp format everybody expects when you say “ISO 8601”:

https://tools.ietf.org/html/rfc3339

Indeed, what you really want to say is "use RFC3339" (https://www.ietf.org/rfc/rfc3339.txt)
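One way to hold a system to a single format, as suggested upthread, is to route every timestamp through one pair of helpers. A minimal sketch, assuming UTC only and whole seconds; `to_wire` and `from_wire` are made-up names, and the fixed layout covers only one narrow slice of what RFC 3339 allows (no offsets, no fractional seconds).

```python
from datetime import datetime, timezone

RFC3339_UTC = "%Y-%m-%dT%H:%M:%SZ"  # one fixed layout, UTC only


def to_wire(moment: datetime) -> str:
    """Serialize an aware datetime as a UTC string in the fixed layout."""
    return moment.astimezone(timezone.utc).strftime(RFC3339_UTC)


def from_wire(text: str) -> datetime:
    """Parse the fixed layout; raises ValueError for anything else."""
    return datetime.strptime(text, RFC3339_UTC).replace(tzinfo=timezone.utc)


stamp = "2021-01-14T06:28:08Z"  # example value from the thread
print(to_wire(from_wire(stamp)))  # 2021-01-14T06:28:08Z
```

Oddities such as `--1013` fail loudly in `from_wire` instead of being silently accepted as some other ISO 8601 variant.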
IDK the issue OP saw with using ISO over UNIX timestamps, but one reason why you might want accuracy down to the second for dates is with providing accurate relative time/date across timezones.
I think the display strings thing is because exact number of impressions etc is slightly sensitive information. The whole site was "gamed" from the start, but providing exact vote counts makes it easier for other people to game. I guess. Don't really know, but I do believe that the numbers given by reddit, for example, are exact, but fake. Fuzzed a bit. HN also hides some of this, or behaves misleadingly, your downvotes don't always count, I think.
They would display numbers less than 1000 as-is, and only start adding the "k" and "m" suffix after the 4-digit and 7-digit threshold was crossed.
But how could they maintain an accurate count? Maybe they were just persisting the user-friendly format alongside the actual count...
The endpoint of the API is probably just rounding the accurate number and returning a friendly number... or it's all bullshit anyway.
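A rough sketch of what such an endpoint might be doing, and why the display string is a poor thing to hand to API clients. Both functions are hypothetical, and truncating (rather than rounding) is an assumption; nothing in the thread says how Parler actually derived the strings.

```python
def display_count(n: int) -> str:
    """Format a raw count the way the thread describes: 999, 2k, 5m."""
    if n < 1_000:
        return str(n)
    if n < 1_000_000:
        return f"{n // 1_000}k"    # assumption: truncate, not round
    return f"{n // 1_000_000}m"


def parse_display(text: str) -> int:
    """Best-effort inverse; the discarded digits cannot be recovered."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = text[-1]
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


print(display_count(999))        # 999
print(display_count(2_345))      # 2k
print(display_count(5_600_000))  # 5m
print(parse_display("2k"))       # 2000
```

The round trip turns 2,345 into 2,000, so a client that only ever sees the string cannot reconstruct the stored count.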
If I remember correctly mongo stores the id in “_id” and has a getter for “id” so maybe they just iterated all the keys of the model when they stringified their output
Elasticsearch, too. In either case, it looks like they're just piping raw backend responses to the API endpoint without removing unnecessary fields.
Yep, that's an indication of Elasticsearch being used (and not transforming documents to a standard representation that strips such fields).
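The missing transformation step could be as small as an allow-list applied before serializing. A minimal sketch: `PUBLIC_FIELDS`, `to_public` and the sample document are all invented for illustration and are not based on Parler's actual schema.

```python
PUBLIC_FIELDS = ("id", "body", "created_at")  # hypothetical allow-list


def to_public(doc: dict) -> dict:
    """Copy only allow-listed keys so backend-only fields never leak."""
    out = {key: doc[key] for key in PUBLIC_FIELDS if key in doc}
    # If the backend only supplied "_id", expose it once under "id".
    if "id" not in out and "_id" in doc:
        out["id"] = str(doc["_id"])
    return out


raw_doc = {
    "_id": "abc123",
    "id": "abc123",
    "body": "hello",
    "created_at": "20210113053923",
    "deleted": True,  # stand-in for an internal soft-delete flag
}
print(to_public(raw_doc))
# {'id': 'abc123', 'body': 'hello', 'created_at': '20210113053923'}
```

An allow-list fails closed: a field added to the backend later stays private until someone explicitly exposes it.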
It seems like they basically just exposed a lot of data directly, as apparently most of their APIs didn’t enforce any authentication or hide records that had been soft deleted.

Apparently the records were strictly sequential, which I don’t believe is true for Mongo which IIRC includes the node ID in part of it.

One big advantage of using string representations of dates is avoiding misunderstood timezone calculations that may or may not occur at various layers of the backend stack. The downside of course is storage space.
I think most JSON libraries encode dates in something that's closer to what Parler is doing than what you think is correct (e.g., using ISO 8601 or something)

I could see the argument for representing impressions as a string (especially if it's updated asynchronously and denormalized like that). The major downside is localization.