by dmpetrov
586 days ago
DataChain makes no assumptions about the metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc. Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet. With PDF/HTML, you will likely produce multiple documents per file, which is also supported: just `yield my_result` multiple times from map(). Check out the video: https://www.youtube.com/watch?v=yjzcPCSYKEo
Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...
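
To make the "multiple documents per file" idea concrete, here is a minimal sketch in plain Python. It does not use DataChain's API: `extract_chunks`, the record fields, and the paragraph-splitting rule are all hypothetical, chosen only to show a generator that yields several JSON-serializable results for one input file.

```python
import json
from typing import Iterator


def extract_chunks(file_name: str, text: str) -> Iterator[dict]:
    """Yield one JSON-serializable record per non-empty paragraph.

    Hypothetical helper: the field names are illustrative, not a DataChain schema.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        # Each yield produces a separate result for the same source file.
        yield {
            "file": file_name,
            "chunk_index": index,
            "text": paragraph,
            "num_words": len(paragraph.split()),
        }


# Already-extracted text stands in for a parsed PDF/HTML document.
records = list(
    extract_chunks("report.pdf", "First page text.\n\nSecond page, more text.")
)
print(json.dumps(records, indent=2))
```

A function shaped like this could presumably be handed to the library in place of `my_result`; how it is registered depends on DataChain's actual API, which this sketch does not cover.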
Forgive my ignorance, but what is "json-pair"?