by jnewhouse
819 days ago
In the SQL interface, both sources and sinks are treated as tables: you SELECT FROM sources and INSERT INTO sinks. Right now it is incumbent on the user to correctly specify the types of a source for deserialization. What happens if you get this wrong is somewhat source-dependent, as some data formats are stricter than others. Parquet will fail hard at read time, while JSON will coerce as best it can, optionally dropping the bad data instead of failing the job, depending on the bad_data parameter: https://doc.arroyo.dev/connectors/overview#bad-data

Currently we don't support much in the way of changing configuration in external systems, instead focusing on defining long-running pipelines.

What did you have in mind for an HTTP source? We have a polling HTTP source as well as a WebSocket source:
https://doc.arroyo.dev/connectors/polling-http
https://doc.arroyo.dev/connectors/websocket
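
To make the fail-versus-drop behaviour of the bad_data parameter concrete, here is a rough Python sketch. It is not Arroyo's implementation: the deserialize function, the SCHEMA dict, and the "fail"/"drop" string values as used here are hypothetical stand-ins, meant only to show how a lenient JSON reader might coerce values to the declared types and then either fail or skip a record that can't be coerced.

```python
import json

# Hypothetical declared schema (column name -> Python type), standing in
# for the column types a user would declare on a source table.
SCHEMA = {"user_id": int, "amount": float}


def deserialize(lines, schema, bad_data="fail"):
    """Parse JSON lines against a declared schema.

    bad_data="fail" raises on the first record that can't be parsed or coerced;
    bad_data="drop" skips that record and keeps going.
    """
    if bad_data not in ("fail", "drop"):
        raise ValueError(f"unknown bad_data policy: {bad_data!r}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            # Coerce each declared column to its declared type, e.g. "42" -> 42.
            rows.append({name: typ(record[name]) for name, typ in schema.items()})
        except (ValueError, KeyError, TypeError) as err:
            if bad_data == "drop":
                continue  # discard the bad record, keep the job running
            raise ValueError(f"bad record {line!r}: {err}") from err
    return rows
```

With input lines '{"user_id": "42", "amount": 9.5}' and '{"user_id": 7, "amount": "oops"}', the "drop" policy keeps only the first row (coercing "42" to an integer), while the "fail" policy raises on the second.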
So if I'm understanding correctly, you actually read data directly from (say) S3? It isn't copied from S3 and stored locally (i.e., as a bunch of local .arrow files)?
(Apologies if I'm ignorant of the underlying tech. I think this is really cool and I'm just trying to wrap my head around what happens between "I upload some data to S3" and "we get query results".)