Show HN: GraFlo - Universal ETL tool for property KG (Neo4j, TigerGraph, Arango) (github.com)
5 points by acrostoic 231 days ago
We built GraFlo after repeatedly writing the same boilerplate code to transform datasets and load them into Neo4j, ArangoDB, and TigerGraph. Every time we had a new dataset (OpenAlex, IBES financial data, Debian packages), we'd write yet another custom ETL script with the same problems: ID generation, type coercion, deduplication, and database-specific quirks.

GraFlo is a declarative framework that handles this once and for all. You define your graph structure in a database-agnostic schema - vertices, edges, properties, and how they map to your source data (CSV, SQL, JSON, XML). GraFlo then generates the ingestion code for your target database. The key insight: while Neo4j, ArangoDB, and TigerGraph are all idiosyncratic, the underlying property graph model is universal. We crystallized that into a single abstraction layer.

What GraFlo handles automatically:

- Consistent ID generation across vertices and edges

- Type coercion (strings to dates, numbers, etc.)

- Vertex and edge deduplication

- Generating database-specific ingestion scripts
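
To make the first three items concrete, here is a minimal Python sketch of what deterministic ID generation, type coercion, and deduplication can look like in a hand-written ETL script. It is not GraFlo's code or API: `make_id`, `coerce`, `dedupe`, the `types` mapping, and the sample rows are all hypothetical names and data chosen for illustration.

```python
import hashlib
from datetime import date


def make_id(label, *key_parts):
    """Deterministic ID: the same label and key values always give the same ID."""
    raw = "|".join([label, *(str(p).strip().lower() for p in key_parts)])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]


# Hypothetical type names mapped to the callables that perform the coercion.
COERCERS = {"int": int, "float": float, "date": date.fromisoformat, "str": str}


def coerce(row, types):
    """Convert each string field to its declared type (default: keep as str)."""
    return {col: COERCERS[types.get(col, "str")](val) for col, val in row.items()}


def dedupe(vertices):
    """Keep the first vertex seen for each ID, preserving input order."""
    seen = {}
    for v in vertices:
        seen.setdefault(v["id"], v)
    return list(seen.values())


# Made-up sample rows; the second is a near-duplicate of the first.
rows = [
    {"name": "requests", "released": "2011-02-14", "downloads": "100"},
    {"name": "Requests ", "released": "2011-02-14", "downloads": "100"},
    {"name": "flask", "released": "2010-04-01", "downloads": "42"},
]
types = {"released": "date", "downloads": "int"}

vertices = [
    {"id": make_id("Package", row["name"]), **coerce(row, types)} for row in rows
]
unique = dedupe(vertices)

# Edge IDs can be derived the same way from the two endpoint IDs.
edge_id = make_id("DEPENDS_ON", unique[1]["id"], unique[0]["id"])
```

Because the ID is a hash of normalized key values, the two spellings of the first package collapse into one vertex, and re-running the script on the same data produces the same IDs.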

It's plug-and-play in the sense that swapping from Neo4j to ArangoDB only requires changing the target database type in your config (Docker Compose examples provided).
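
As a rough illustration of why a single config switch can be enough, the sketch below renders a parameterized vertex upsert for two targets from the same vertex list. The `config` layout and `render_upsert` are assumptions made for this example, not GraFlo's actual config format or generated output; only the Cypher `MERGE` and AQL `UPSERT` statement shapes are standard. TigerGraph is left out for brevity.

```python
def render_upsert(target, label):
    """Return a parameterized upsert statement for one vertex type."""
    if target == "neo4j":
        # Cypher: match-or-create on the id, then merge in the properties.
        return f"MERGE (n:{label} {{id: $id}}) SET n += $props"
    if target == "arangodb":
        # AQL: look up by document key, then insert or update the document.
        return f"UPSERT {{ _key: @id }} INSERT @doc UPDATE @doc IN {label}"
    raise ValueError(f"unsupported target: {target}")


# Hypothetical config: changing "target" is the only edit needed to switch.
config = {"target": "neo4j", "vertices": ["Author", "Paper"]}
statements = [render_upsert(config["target"], v) for v in config["vertices"]]
```

Setting `config["target"]` to `"arangodb"` yields the AQL variants for the same vertex types, which is the N schemas rather than N × M scripts idea in miniature.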

We've used it to build knowledge graphs from academic publications, financial datasets, and package dependencies. Instead of maintaining N × M scripts (N datasets, M databases), we maintain N schemas.

On the roadmap: SQL/API integration (e.g., automatically generating GraFlo configs from SQL schemas).

Would love feedback from anyone working with graph databases or building knowledge graphs.

2 comments

Cool! These are indeed very common graph-building steps.

Thinking out loud here, but some of these were supposed to be solved with RML (https://rml.io/) for the RDF paradigm. I witnessed a bit of their evolution: it started with operations similar to GraFlo's, and eventually they built some support for arbitrary Java code. For example, say you want your node ID to be generated by concatenating the values of the firstName column and the lastName column, but only after some weird string normalization (think of making sure everything is UTF-8)... you wouldn't want to make your schema mappings Turing-complete, so you'd eventually have to allow for calling other functions. Anyway, all of that was for RDF graphs; it's cool to see something like this for property graphs.
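
The firstName/lastName case could be sketched as follows, assuming the mapping stays declarative and only refers to a registered function by name. Everything here is hypothetical (`TRANSFORMS`, `person_id`, the `mapping` layout) and is neither RML's nor GraFlo's API; Unicode NFKC normalization plus case folding stands in for the "weird string normalization".

```python
import unicodedata


def normalize(text):
    """Fold Unicode variants and case so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKC", text).strip().casefold()


def person_id(row):
    """Build a node ID from the normalized first and last name columns."""
    return f"{normalize(row['firstName'])}_{normalize(row['lastName'])}"


# Registry of user functions; the mapping itself only names an entry.
TRANSFORMS = {"person_id": person_id}

# Hypothetical declarative mapping: data only, no embedded logic.
mapping = {"vertex": "Person", "id": {"call": "person_id"}}


def resolve_id(mapping, row):
    """Look up the function named in the mapping and apply it to the row."""
    return TRANSFORMS[mapping["id"]["call"]](row)


# Precomposed and decomposed accents resolve to the same ID.
a = resolve_id(mapping, {"firstName": "Jos\u00e9", "lastName": "Garc\u00eda "})
b = resolve_id(mapping, {"firstName": "Jose\u0301", "lastName": " GARCI\u0301A"})
```

The mapping language itself stays simple, and anything that needs real logic lives in ordinary code behind a name.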

Are there any tools for data migration when swapping database engines? Thanks.

Not directly - GraFlo is for the ingestion side, not migration. Migration between different property graph DBs isn't trivial (and sometimes not even possible) because they're organized in fundamentally different ways. Some are much more flexible with uniqueness constraints, indexes, or how they handle certain graph patterns.

But the nice thing is: if you have your source data and GraFlo schema, regenerating your graph in a different DB is trivial. GraFlo handles indexes and constraints for each target database. It's like having the recipe instead of trying to reverse-engineer the cake.