Hacker News new | ask | show | jobs
by bakuninsbart 1459 days ago
SQL is great at what it is designed to do, and absolutely horrible for anything else. Sometimes I have had to use tools that only allow (a subset of) SQL for querying data. (Especially BI) Doing regex date validation in a json derived from a string in an sql dialect without function-definitions is horrendous. These kinds of "why the f do i have to use sql for this"-moments happened surprisingly often to me working as a data analyst.

But then I'm trying to do relatively simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets, while in SQL it would be a simple statement that anyone can understand after reading it once.

3 comments

>Doing regex date validation in a json derived from a string in an sql dialect without function-definitions is horrendous.

Doing regex is horrendous, but doing it on SQL in a modern database is no more difficult than in a full-fledged programming language. Most modern DBs have strong JSON support and built-in regex functions

1. JSON_CAST/JSON_PARSE your data

2. REGEXP_EXTRACT() on the result, here's several date validator regex from a SO post (https://stackoverflow.com/questions/15491894/regex-to-valida...)

And that's it. In fact in many cases it's probably faster to do it natively in SQL than to export it to python or R and parse there.

The problem you are describing is probably rooted in the specific SQL dialect you had to use. Selecting from inside JSON strings and matching regular expressions should be a no brainer. And it is straight forward in PostgreSQL, e.g.
> simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets

With dplyr, wouldn’t this be mutate(x = case_when(… ~ y, TRUE ~ x)) or the same but with ifelse?

I haven't used much R, but how would you do something like:

  CASE WHEN
  SUM(daily_revenue) OVER (PARTITION BY department, TRIM(SUBSTR(region, 5)) IN ('North','West','Misc')) > 
  AVG(revenue) OVER (ORDER BY sale_time ASC rows BETWEEN 28 PRECEDING AND CURRENT ROW)
  AND NOT COALSECE(had_prev_month_party, FALSE)

  THEN pizza_party_points + 5

  WHEN <above> AND had_prev_month_party THEN pizza_party_points + 3

  WHEN MIN(sale_time) over (PARTITION BY department) = DATE_TRUNC('month', current_date) then 5

  ELSE GREATEST(pizza_party_points - 1, 0)

  END as pizza_party_performance_points_current

this example may be a bit esoteric but it actually draws from lots of real-world cases (comparing a partitioned sum with an overall average, checking conditions on other columns, messy data parsing and manipulation, implicit waterfall nature of CASE WHEN vs. explicit elseif, etc)