Hacker News
by Kiro 594 days ago
What relevant metadata is there in an HTML file?
1 comment

I'd guess it involves splitting a file into smaller snippets, recording metadata such as page numbers for each one, and calculating an embedding per snippet; that's the usual approach. The specific signals vary by use case.
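A rough sketch of that split-and-embed step, in case it helps. Everything here is made up for illustration: `Snippet`, `chunk_pages`, and `toy_embed` are hypothetical names, and `toy_embed` is just a bag-of-words counter standing in for whatever embedding model you would actually call.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Snippet:
    text: str
    page: int  # metadata: 1-based page the snippet came from
    embedding: Counter = field(default_factory=Counter)


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def chunk_pages(pages: list[str], max_words: int = 50) -> list[Snippet]:
    """Split each page into snippets of at most max_words words,
    keeping the page number alongside each snippet."""
    snippets = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), max_words):
            text = " ".join(words[start:start + max_words])
            snippets.append(Snippet(text, page_number, toy_embed(text)))
    return snippets
```

Fixed-size word windows are the simplest possible splitter; splitting on headings or paragraphs is a common alternative when the structure is available.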

Hopefully, @jerednel can add more details.

For HTML it's the markup tags: h1s, the page title, meta keywords, meta descriptions.
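For what it's worth, pulling those fields out needs nothing beyond the standard library. This is only a minimal sketch using `html.parser`; the `MetadataParser` class and `extract_metadata` helper are names invented for this example, and real pages may want a more forgiving parser.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the page title, h1 text, and meta keywords/description."""

    def __init__(self):
        super().__init__()
        self.metadata = {"title": "", "h1": [], "keywords": "", "description": ""}
        self._current = None  # "title" or "h1" while we are inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.metadata["h1"].append("")  # start a new h1 entry
        elif tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.metadata[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.metadata["title"] += data
        elif self._current == "h1":
            self.metadata["h1"][-1] += data


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    meta = parser.metadata
    meta["title"] = meta["title"].strip()
    meta["h1"] = [text.strip() for text in meta["h1"]]
    return meta
```

The resulting dict can be stored next to each snippet's embedding so the retriever can filter or boost on it later.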

My retriever functions typically use metadata in combination with the similarity search, either to influence the scores directly or for reranking.
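To make that concrete, here is one way such a retriever could look. This is an assumption about the general shape, not the actual code: `cosine`, `metadata_boost`, `retrieve`, the `weight` value, and the doc dict layout (`id`, `embedding`, `metadata`) are all hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def metadata_boost(query: str, metadata: dict, weight: float = 0.1) -> float:
    """Add `weight` for each distinct query term found in title, h1s, or keywords."""
    fields = " ".join(
        [metadata.get("title", ""), *metadata.get("h1", []), metadata.get("keywords", "")]
    ).lower()
    field_terms = set(fields.replace(",", " ").split())
    query_terms = set(query.lower().split())
    return weight * len(query_terms & field_terms)


def retrieve(query: str, query_vec: list[float], docs: list[dict], top_k: int = 3) -> list[str]:
    """Rank docs by similarity plus metadata boost; return the top_k ids."""
    scored = [
        (cosine(query_vec, doc["embedding"]) + metadata_boost(query, doc["metadata"]), doc["id"])
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```

Adding the boost to the similarity score is the simplest option; a two-stage variant would take the top results by similarity first and then rerank only those by metadata.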