by chrisaycock 1637 days ago
I built my own table-oriented language out of frustrations I had with time-series analysis:

https://www.empirical-soft.com

Empirical has statically typed Dataframes. It can infer the type of a file's contents at compile time using a ton of metaprogramming techniques.

  >>> let trades = load("trades.csv")
  
  >>> trades
   symbol                  timestamp    price size
     AAPL 2019-05-01 09:30:00.578802 210.5200  780
     AAPL 2019-05-01 09:30:00.580485 210.8100  390
      BAC 2019-05-01 09:30:00.629205  30.2500  510
      CVX 2019-05-01 09:30:00.944122 117.8000 5860
     AAPL 2019-05-01 09:30:01.002405 211.1300  320
     AAPL 2019-05-01 09:30:01.066917 211.1186  310
     AAPL 2019-05-01 09:30:01.118968 211.0000  730
      BAC 2019-05-01 09:30:01.186416  30.2450  380
      CVX 2019-05-01 09:30:01.639577 118.2550 2880
      ...                        ...      ...  ...
Functions have generic typing by default; the caller determines the type instantiation. Here is a weighted average:

  >>> func wavg(ws, vs) = sum(ws * vs) / sum(ws)
Queries are built into the language. Here is a five-minute volume-weighted average price:

  >>> from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)
   symbol           timestamp       vwap
     AAPL 2019-05-01 09:30:00 210.305724
      BAC 2019-05-01 09:30:00  30.483875
      CVX 2019-05-01 09:30:00 119.427733
     AAPL 2019-05-01 09:35:00 202.972440
      BAC 2019-05-01 09:35:00  30.848397
      CVX 2019-05-01 09:35:00 119.431601
     AAPL 2019-05-01 09:40:00 204.671388
      BAC 2019-05-01 09:40:00  30.217362
      CVX 2019-05-01 09:40:00 117.224763
      ...                 ...        ...
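For readers who don't know the syntax, here is a rough plain-Python sketch of what that query computes. It is not Empirical code and says nothing about how Empirical implements it: `bar` and `vwap_by_bar` are illustrative stand-ins, and the sample rows are just the first three from the table above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def wavg(ws, vs):
    """Weighted average of vs using weights ws."""
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)


def bar(ts, width):
    """Round a timestamp down to the start of its bar (hypothetical helper)."""
    epoch = datetime(1970, 1, 1)
    return ts - (ts - epoch) % width


def vwap_by_bar(trades, width=timedelta(minutes=5)):
    """Group (symbol, timestamp, price, size) rows by symbol and bar, then take the VWAP."""
    groups = defaultdict(lambda: ([], []))
    for symbol, ts, price, size in trades:
        sizes, prices = groups[(symbol, bar(ts, width))]
        sizes.append(size)
        prices.append(price)
    return {key: wavg(sizes, prices) for key, (sizes, prices) in groups.items()}


# Sample rows copied from the first three lines of the trades table.
trades = [
    ("AAPL", datetime(2019, 5, 1, 9, 30, 0, 578802), 210.52, 780),
    ("AAPL", datetime(2019, 5, 1, 9, 30, 0, 580485), 210.81, 390),
    ("BAC", datetime(2019, 5, 1, 9, 30, 0, 629205), 30.25, 510),
]

for (symbol, start), vwap in vwap_by_bar(trades).items():
    print(symbol, start, round(vwap, 6))
```

Nothing here is checked before it runs: a misspelled field or a wrong column order only shows up once the loop reaches it.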
Everything is statically typed. Misspelled column names, for example, result in an error before the script is even run!

This is pretty cool. I've had thoughts (or dreams, more accurately :) of a language like this every time I get a runtime 'type error in q. I gotta say, I prefer q's syntax, though :)
q with static typing and a sensible pricing model would be amazing.

I do think that q's main strength is not its speed, but the fact that qSQL statements are first-class citizens in the language - no network hops, no awkward marshalling and unmarshalling of data, no awkward mismatch around how to use nulls, nans, tz-aware timestamps etc.

I started Empirical with the goal of "q like Haskell". The end result went in a radically different direction, but the guiding light has always been to have a statically typed language where tables and queries are a first-class operation.

The source code is publicly available under AGPL with the Commons Clause:

https://github.com/empirical-soft/empirical-lang

How does it handle dirty data? Does it assign an "any" type?

Also, why do you think embedding data frames is not possible?

Missing and poorly formatted input is given a type-specific value. E.g., for Float64 it is nan and for Int64 it is nil.

  >>> Int64("5")
  5

  >>> Int64("5b")
  nil
If inferencing cannot determine a consistent type from a CSV file, then the column will just be a String.
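As a loose illustration of that fallback rule, here is a Python sketch. It is an assumption-laden analogy, not Empirical's inference algorithm: `parse_int` and `infer_column_type` are made-up names, `None` stands in for nil, and treating empty cells as "missing, ignore for inference" is my guess rather than something stated above.

```python
def parse_int(text):
    """Parse an integer, returning None (standing in for nil) on bad input."""
    try:
        return int(text)
    except ValueError:
        return None


def _parses(parser, text):
    """True if parser accepts text without raising ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


def infer_column_type(cells):
    """Pick the narrowest type that every non-empty cell fits, else String."""
    present = [c for c in cells if c.strip() != ""]  # assumption: blanks are missing values
    for name, parser in (("Int64", int), ("Float64", float)):
        if present and all(_parses(parser, c) for c in present):
            return name
    return "String"


print(parse_int("5"), parse_int("5b"))
print(infer_column_type(["780", "390", ""]))
print(infer_column_type(["210.52", "30"]))
print(infer_column_type(["AAPL", "30"]))
```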

I don't know what you mean by "embedding" a Dataframe.

On the website you linked:

"Embedding Dataframes into an existing language would not be possible."

I don't think it would be an issue for languages with good metaprogramming facilities.

Ah, I see what you're referring to.

The hardest thing is the load() function, particularly in the REPL. It looks dynamic, but is actually static. Pulling off this sleight of hand requires both type providers and automatic compile-time function evaluation on arbitrary expressions.

F# is the only other language I know of that has type providers. They invented it.

As for CTFE, languages like Zig and D require the user to indicate when to evaluate something ahead of time. I wanted this to happen automatically and still be available for compound expressions, user-defined functions, user-defined types, etc. Doing that requires tracking purity (no state or IO) in an expression, plus a mechanism to actually do the evaluation. I've never seen a language take it to the extreme that Empirical does.
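To make "track purity, then evaluate ahead of time" concrete, here is a toy Python sketch. It is only an analogy and not how Empirical works: the `PURE_FUNCTIONS` whitelist, `is_pure`, and `try_fold` are all hypothetical, and Python has no real compile phase, so "ahead of time" here just means "before the rest of the program would run".

```python
import ast

# Hypothetical whitelist of functions treated as pure (no state, no IO).
PURE_FUNCTIONS = {"abs": abs, "len": len, "max": max, "min": min, "sum": sum}


def is_pure(node):
    """True if the expression uses only literals, operators, and whitelisted calls."""
    if isinstance(node, ast.Expression):
        return is_pure(node.body)
    if isinstance(node, ast.Constant):
        return True
    if isinstance(node, (ast.List, ast.Tuple)):
        return all(is_pure(e) for e in node.elts)
    if isinstance(node, ast.BinOp):
        return is_pure(node.left) and is_pure(node.right)
    if isinstance(node, ast.UnaryOp):
        return is_pure(node.operand)
    if isinstance(node, ast.Call):
        return (
            isinstance(node.func, ast.Name)
            and node.func.id in PURE_FUNCTIONS
            and not node.keywords
            and all(is_pure(a) for a in node.args)
        )
    return False  # variables, attribute access, etc. are conservatively impure


def try_fold(source):
    """Return (True, value) if source is pure and was evaluated early, else (False, None)."""
    tree = ast.parse(source, mode="eval")
    if not is_pure(tree):
        return (False, None)
    code = compile(tree, "<ctfe>", "eval")
    return (True, eval(code, {"__builtins__": {}}, dict(PURE_FUNCTIONS)))


print(try_fold("sum([1, 2, 3]) * 2"))         # pure, so it is folded
print(try_fold("open('trades.csv').read()"))  # IO, so it is left for run time
```

The sketch is conservative: anything it cannot prove pure is deferred. A pure expression can still fail when evaluated (a division by zero, say), which a real compiler would have to report or defer.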

So an existing statically typed language would need (1) a REPL interface, (2) purity tracking, (3) compile-time function evaluation, (4) some kind of types-as-parameters setup, and (5) array notation. Most existing statically typed languages don't have a REPL; the ones that do generally lack array notation. I couldn't find a language that did all of that plus type providers and automated CTFE on arbitrary expressions.

Hence, I had to create my own language.

I've written something similar in Julia; you can see the record type used in https://www.juliapackages.com/p/namedtuples. The full library, which is not open source, uses this type for time series analysis. It's all type safe and allows expressions such as x = vwap( ts, 5) - l1( vwap( ts, 5)) through to a time moving PCA. Julia makes writing this sort of thing short and quick. The total impl was only a thousand lines or so of code.
I checked your website; do you have an example of how to load data from a file into NamedTuples? Specifically, can NamedTuples infer type from an external source?

Also, do you have an example of what a displayed table looks like? Julia has a DataFrames package that can display a table. I am curious to know how your time-series library displays a table.

I don't think I get it. I do a lot of pandas in a bank so I recognize your dataframes for what they are, but what advantage do you have over python+pandas?

I hate Python (I'm a Java dev helping Quants), but it's that or KDB, and I think I could murder the creator of KDB :D And I have to admit Pandas is instinctive, Python is easy enough to extend, what are you doing that's so important you made a language for it?

Empirical is statically typed. Python and q/kdb+ are dynamically typed.

I spent years using those products in finance. I would set up a simulation that would crash after four hours because of a misspelled column name. Empirical prevents that by refusing to run a script that has a type error or unresolved identifier. No more crashed overnight sims!
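In a dynamic language the closest workaround is a manual up-front check. Here is a minimal Python sketch of that idea, with hypothetical helpers `read_header` and `check_columns`. It is not static typing and not what Empirical does; it only fails fast at start-up, and only for the columns you remember to list.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first line of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def check_columns(header, referenced):
    """Raise before any work starts if a referenced column is not in the header."""
    missing = [name for name in referenced if name not in header]
    if missing:
        raise NameError(
            f"unknown column(s): {', '.join(missing)}; available: {', '.join(header)}"
        )


sample = "symbol,timestamp,price,size\nAAPL,2019-05-01 09:30:00.578802,210.52,780\n"
header = read_header(sample)

check_columns(header, ["size", "price"])  # passes silently
try:
    check_columns(header, ["sizee", "price"])  # misspelled on purpose
except NameError as err:
    print("refusing to start:", err)
```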

> python+pandas

Another advantage is supporting SQL-like syntax natively (and not having to use pandas' awkward, bolted-on API).