Hacker News new | ask | show | jobs
by jerednel 592 days ago
Cool! Does this assume the unstructured data already has a corresponding metadata file?

My most common use cases involve getting PDFs or HTML files and I have to parse the metadata to store along with the embedding.

Would I have to run a process to extract file metadata into JSONs for every embedding/chunk? Would keys created based off document be title+chunk_no?

Very interested in this because documents from clients are subject to random changes and I don’t have very robust systems in place.

2 comments

DataChain has no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChian will automatically serialize it to internal dataset structure (SQLite), which can be exported to CSV/Parquet.

In case of PDF/HTML, you will likely produce multiple documents per file which is also supported - just `yield return my_result` multiple times from map().

Check out video: https://www.youtube.com/watch?v=yjzcPCSYKEo Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...

> However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Forgive my ignorance, but what is "json-pair"?

It's not a format :)

It's simpliy about linking metadata from a json to a corresponding image or video file, like pairing data003.png & data003.json to a single, virtual record. Some format use this approach: open-image or laion datasets.

Thanks for the explanation!
> DataChain has no assumptions about metadata format.

Could your metadata come from something like a Postgres sql statement? Or an iceberg view?

Absolutely, that's a common scenario!

Just connect from your Python code (like the lambda in the example) to DB and extract the necessary data.

What relevant metadata is there in an HTML file?
I guess, it involves splitting a file into smaller document snippets, getting page numbers and such, and calculating embeddings for each snippet—that’s the usual approach. Specific signals vary by use case.

Hopefully, @jerednel can add more details.

For HTML it's markup tags...h1's, page title, meta keywords, meta descriptions.

My retriever functions will typically use metadata in combination with the similarity search to do impart some sort of influence or for reranking.