by jerednel
592 days ago
Cool! Does this assume the unstructured data already has a corresponding metadata file? My most common use cases involve getting PDFs or HTML files, and I have to parse the metadata to store along with the embedding. Would I have to run a process to extract file metadata into JSON for every embedding/chunk? Would the keys be created from the document, e.g. title+chunk_no? I'm very interested in this because documents from clients are subject to random changes and I don’t have very robust systems in place.
|
Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it into its internal dataset structure (SQLite), which can be exported to CSV/Parquet.
In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported: just `yield my_result` multiple times from map().
Check out the video: https://www.youtube.com/watch?v=yjzcPCSYKEo Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...
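
To make the "multiple documents per file" idea concrete, here is a minimal sketch of a generator that yields one JSON-serializable record per chunk. It uses only the standard library and does not call DataChain itself. The names `ChunkRecord` and `split_into_chunks`, the fixed-size chunking, and the `title:chunk_no` key scheme are all assumptions for illustration, loosely following the question above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator


@dataclass
class ChunkRecord:
    """Hypothetical per-chunk metadata record (not a DataChain type)."""
    key: str        # assumed key scheme: "<title>:<chunk_no>"
    title: str
    chunk_no: int
    text: str


def split_into_chunks(title: str, text: str, chunk_size: int = 200) -> Iterator[ChunkRecord]:
    """Yield one record per fixed-size chunk of already-extracted text."""
    for chunk_no, start in enumerate(range(0, len(text), chunk_size)):
        yield ChunkRecord(
            key=f"{title}:{chunk_no}",
            title=title,
            chunk_no=chunk_no,
            text=text[start:start + chunk_size],
        )


# Stand-in for text parsed out of a PDF or HTML file.
records = list(split_into_chunks("contract-v2", "lorem ipsum " * 50, chunk_size=200))

# Each record converts to a plain dict, so it serializes to JSON directly.
print(json.dumps(asdict(records[0]), indent=2))
```

A key derived only from title and chunk number will change meaning if a client edits the document and the chunk boundaries shift, so including a content hash or file version in the key may be worth considering.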