| HN Mirror

by jerednel 592 days ago

Cool! Does this assume the unstructured data already has a corresponding metadata file?

My most common use cases involve getting PDFs or HTML files and I have to parse the metadata to store along with the embedding.

Would I have to run a process to extract file metadata into JSON files for every embedding/chunk? Would the keys derived from each document be title+chunk_no?

Very interested in this because documents from clients are subject to random changes and I don’t have very robust systems in place.

2 comments

dmpetrov 592 days ago

DataChain has no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet.

In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported: just `yield my_result` multiple times from map().

Check out video: https://www.youtube.com/watch?v=yjzcPCSYKEo Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...
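To make the "multiple documents per file" idea concrete, here is a minimal plain-Python sketch of a generator that yields one record per chunk, keyed as title plus chunk number as asked above. It deliberately does not call DataChain; `ChunkMeta`, `split_into_chunks`, the `#` key separator, and the fixed chunk size are all hypothetical, and a stdlib dataclass stands in for the Pydantic object mentioned in the comment.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator


@dataclass
class ChunkMeta:
    # Hypothetical record layout; these field names are not from DataChain.
    key: str
    title: str
    chunk_no: int
    text: str


def split_into_chunks(title: str, text: str, size: int = 40) -> Iterator[ChunkMeta]:
    """Yield one record per fixed-size slice of `text`."""
    for chunk_no, start in enumerate(range(0, len(text), size)):
        yield ChunkMeta(
            key=f"{title}#{chunk_no}",  # assumed key scheme: title + chunk number
            title=title,
            chunk_no=chunk_no,
            text=text[start:start + size],
        )


# One input document produces several output records.
records = list(split_into_chunks("report.pdf", "x" * 100))
print(json.dumps([asdict(r) for r in records], indent=2))
```

Because the key is derived only from the title and the chunk index, re-running the split on a changed document overwrites the same keys rather than creating new ones, which may or may not be what you want when client files change.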

nbbaier 592 days ago

> However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Forgive my ignorance, but what is "json-pair"?

dmpetrov 592 days ago

It's not a format :)

It's simply about linking metadata from a JSON file to a corresponding image or video file, like pairing data003.png & data003.json into a single, virtual record. Some formats use this approach, such as open-image or LAION datasets.
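The pairing described here can be sketched in plain Python by grouping file names on their stem. This is only an illustration of the idea, not how DataChain implements it; `pair_by_stem` and its `meta_suffix` parameter are hypothetical names.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pair_by_stem(paths, meta_suffix=".json"):
    """Return (data_path, metadata_path) tuples for files sharing a stem."""
    groups = defaultdict(dict)
    for p in map(PurePosixPath, paths):
        kind = "meta" if p.suffix == meta_suffix else "data"
        groups[p.with_suffix("")][kind] = p

    pairs = []
    for stem in sorted(groups):
        group = groups[stem]
        # Keep only stems that have both halves of the pair.
        if "data" in group and "meta" in group:
            pairs.append((str(group["data"]), str(group["meta"])))
    return pairs


pairs = pair_by_stem(["data003.png", "data003.json", "data004.png", "notes.json"])
print(pairs)
```

Files without a partner (`data004.png` and `notes.json` in this example) are simply dropped; a real pipeline might prefer to log them instead.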

nbbaier 591 days ago

Thanks for the explanation!

spott 592 days ago

> DataChain has no assumptions about metadata format.

Could your metadata come from something like a Postgres SQL statement? Or an Iceberg view?

dmpetrov 592 days ago

Absolutely, that's a common scenario!

Just connect from your Python code (like the lambda in the example) to the DB and extract the necessary data.
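As a rough sketch of "connect to the DB and extract the necessary data", the snippet below looks up per-file metadata with a parameterized query. It uses an in-memory SQLite database from the standard library as a stand-in for Postgres so it runs anywhere; the `doc_meta` table, its columns, and `lookup_metadata` are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_meta (filename TEXT PRIMARY KEY, client TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO doc_meta VALUES (?, ?, ?)",
    [("a.pdf", "acme", 3), ("b.html", "globex", 1)],
)


def lookup_metadata(conn, filename):
    """Return a metadata dict for `filename`, or None if the DB has no row."""
    row = conn.execute(
        "SELECT client, version FROM doc_meta WHERE filename = ?",
        (filename,),
    ).fetchone()
    if row is None:
        return None
    return {"filename": filename, "client": row[0], "version": row[1]}


print(lookup_metadata(conn, "a.pdf"))
```

A function like this could be called once per file from whatever mapping step processes the documents. With a real Postgres server you would swap in a Postgres driver, whose placeholder syntax differs from SQLite's `?`.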

Kiro 592 days ago

What relevant metadata is there in an HTML file?

dmpetrov 592 days ago

I guess it involves splitting a file into smaller document snippets, getting page numbers and such, and calculating embeddings for each snippet. That’s the usual approach. Specific signals vary by use case.

Hopefully, @jerednel can add more details.

jerednel 592 days ago

For HTML it's markup tags...h1's, page title, meta keywords, meta descriptions.
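For anyone wondering what pulling those fields out looks like, here is a small sketch using only the standard library's `html.parser`. The `HeadMetadataParser` class and the shape of the returned dict are my own choices for illustration; a real pipeline might well use a dedicated HTML library instead.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the page title, h1 texts, and meta keywords/description."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": "", "h1": [], "keywords": "", "description": ""}
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "h1":
            self._current = "h1"
            self.meta["h1"].append("")
        elif tag == "meta":
            attr_map = dict(attrs)
            name = (attr_map.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr_map.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.meta["title"] += data
        elif self._current == "h1":
            self.meta["h1"][-1] += data


def extract_html_metadata(html):
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.meta


sample = (
    "<html><head><title>Quarterly Report</title>"
    '<meta name="keywords" content="finance, q3">'
    '<meta name="description" content="Q3 results">'
    "</head><body><h1>Overview</h1><p>text</p><h1>Details</h1></body></html>"
)
meta = extract_html_metadata(sample)
print(meta)
```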

My retriever functions will typically use metadata in combination with the similarity search to impart some sort of influence, or for reranking.
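One possible reading of "metadata in combination with the similarity search" is a score that adds a small bonus when query terms appear in a document's title or keywords. The sketch below is an assumption about what such a reranker might look like, not the commenter's actual code; `rerank`, the `boost` weight, and the candidate dict layout are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec, query_terms, candidates, boost=0.1):
    """Order candidate ids by cosine similarity plus a metadata-match bonus."""
    scored = []
    for cand in candidates:
        haystack = f"{cand['title']} {cand['keywords']}".lower()
        hits = sum(term.lower() in haystack for term in query_terms)
        score = cosine(query_vec, cand["vec"]) + boost * hits
        scored.append((score, cand["id"]))
    return [cand_id for _, cand_id in sorted(scored, reverse=True)]


candidates = [
    {"id": "a", "vec": [1.0, 0.0], "title": "Pricing", "keywords": "plans"},
    {"id": "b", "vec": [0.9, 0.1], "title": "Refund policy", "keywords": "billing"},
]
print(rerank([1.0, 0.0], [], candidates))
print(rerank([1.0, 0.0], ["refund"], candidates))
```

With no query terms the order follows pure similarity; once "refund" matches the second document's title, the bonus is enough to move it ahead. The size of `boost` is the knob that controls how much influence the metadata has.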
