by jnewhouse 819 days ago
In the SQL interface, both sources and sinks are treated as tables: you SELECT FROM sources and INSERT INTO sinks. Right now it is up to the user to specify a source's types correctly for deserialization. What happens when they are wrong depends somewhat on the source, since some data formats are stricter than others. Parquet will fail hard at read time, while JSON will coerce as best it can and, depending on the bad_data parameter, can drop the offending data instead of failing the job: https://doc.arroyo.dev/connectors/overview#bad-data
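
To make the fail-versus-drop distinction concrete, here is a minimal Python sketch of lenient JSON deserialization against a declared schema. It is not Arroyo's implementation: the `deserialize` function, the `SCHEMA` mapping, and the literal policy values "fail" and "drop" are all assumptions chosen for illustration.

```python
import json

# Hypothetical declared schema: column name -> Python type (not Arroyo's API).
SCHEMA = {"id": int, "name": str}


def deserialize(lines, schema, bad_data="fail"):
    """Parse JSON lines against a schema, coercing values where possible.

    bad_data="fail" raises on the first record that cannot be parsed or
    coerced; bad_data="drop" silently skips that record instead.
    """
    if bad_data not in ("fail", "drop"):
        raise ValueError(f"unknown bad_data policy: {bad_data!r}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            # Best-effort coercion, e.g. the string "7" becomes the int 7.
            row = {col: typ(record[col]) for col, typ in schema.items()}
        except (ValueError, KeyError, TypeError) as err:
            if bad_data == "drop":
                continue  # skip the bad record and keep the job running
            raise ValueError(f"bad record: {line!r}") from err
        rows.append(row)
    return rows
```

With `bad_data="drop"`, a record such as `{"id": "oops", "name": "b"}` is skipped while `{"id": "7", "name": "a"}` is coerced and kept; with `bad_data="fail"`, the same bad record raises and stops processing.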

Currently we don't support much in the way of changing configuration in external systems, instead focusing on defining long-running pipelines.

What did you have in mind for an HTTP source? We have a polling HTTP source, as well as a WebSocket source:

https://doc.arroyo.dev/connectors/polling-http
https://doc.arroyo.dev/connectors/websocket


Let me take a look - thank you!

So if I'm understanding correctly, you actually read data directly from (say) S3? It isn't copied from S3 and stored locally (i.e., as a bunch of local .arrow files)?

(Apologies if I'm ignorant of the underlying tech - I think this is really cool and I'm just trying to wrap my head around what happens between "I upload some data to S3" and "we get query results")

Yep, pretty much. Right now filesystem^ sources are finite, scanning the target path at operator startup time and processing all matching files. This processing is done by opening an asynchronous reader, courtesy of the object_store crate.

^We call these Filesystem Sources/Sinks to match terminology present in other streaming systems, but I'm not in love with it.
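
The scan-once behavior described above can be sketched roughly as follows. This is a simplified, synchronous Python illustration against a local directory, not the actual Rust operator (which reads asynchronously through object_store); `scan_once` and `open_finite_source` are hypothetical names. In this sketch the file list is a snapshot taken at startup, so files that appear later are never read.

```python
import fnmatch
import os


def scan_once(root, pattern):
    """Return every file under root whose name matches pattern, sorted."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def open_finite_source(root, pattern):
    """Scan root immediately, then return an iterator over (path, line) pairs.

    The scan happens here, at "startup". The returned iterator only reads
    the files found by that scan and ends once they are exhausted.
    """
    files = scan_once(root, pattern)

    def records():
        for path in files:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    yield path, line.rstrip("\n")

    return records()
```

Calling `open_finite_source("/data", "*.json")` (the path is a placeholder) would yield every line of the JSON files present at that moment and then finish, which is what makes the source finite.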