Hacker News new | ask | show | jobs
by yakshaving_jgt 1663 days ago
Why 5, exactly? This struck me as odd in the article. Perhaps I missed something. Are there other statuses? Why are statuses numeric?
3 comments

It's arbitrary.

I have a field, post_status, in my backend database, that I use to categorize posts. Each category is a numeric code so SQL can filter it relatively quickly. I have statuses for active posts, dead posts, ignored links, links needing review, etc.

It's a way for me to sort through my scraper's results quickly.

I think you have a case of premature optimisation there, as I wrote in a recent comment[0].

[0]: https://news.ycombinator.com/item?id=29430281

Not sure what's premature here. The optimization is to allow me, a human, to find a certain class of database records quickly. I also chose a method that I understand to be snappy on the SQL side as well.

What would you suggest as a non-optimized alternative? That might make your point about premature optimization clearer.

There is indeed a trade-off, and the direction I would have chosen is to use meaningful status names as opposed to magic numbers. My reasoning being that maintenance cost in terms of how self-explanatory the system is makes more sense to me economically than obscuring the meaning behind some of the code/data for a practically non-existent performance benefit.

After all, hardware is cheap, but developer time isn't.

For a more concrete example, I might have chosen the value `'pending'` (or similar) instead of `5`. Active listings might have status `'active'`. Expired ones might have status `'expired'`, etc.

Integer columns are significantly faster and smaller than strings in a SQL database. It adds up quickly if you have a sufficiently large database.

I use the following scheme:

   1 - exhausted
   0 - alive
  -1 - blocked (by my rules)
  -2 - redirected
  -3 - error
The author is scraping fewer than 1,000 records per day, or roughly 365,000 records per year.

On my own little SaaS project, the difference between querying an integer and a varchar like “active” is imperceptible, and that’s in a table with 7,000,000 rows.

It would take the author 19 years to run into the scale that I’m running at, where this optimisation is meaningless. And that’s assuming they don’t periodically clean their database of stale data, which they should.

So this looks like a premature optimisation to me, which is why it stood out as odd to me in the article.

I'd put it closer to the category of best practices than premature optimization. It's pretty much always a good idea. It's not that not doing this will break things, the alternative is slower and uses more resources in a way that affects all queries since larger datatypes elongate the records, and locality is tremendously important all aspects of software performance.
I disagree. I think a better "best practice" is to make the meaning behind the code as clear as possible. In this case, the code/data is less clear, and there is zero performance benefit.
There is absolutely a performance benefit to reducing your row sizes. It both reduces the amount of disk I/O and the amount of CPU cache misses and in many cases also increases the amount of data that can be kept in RAM.

You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.

You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.

Was just looking at how to do this with an enum today! Read my mind. :)

The performance benefit is negligible at the scale the author of the article is operating at. You already alluded to this point being context-dependent earlier when you said:

> if you have a sufficiently large database

Roughly 360,000 rows per year is not sufficiently large. It's tiny.

It's arbitrary.