OP here. I took the unofficial IKEA US dataset (originally scraped by jeffreyszhou) and converted all 30,511 products into a flat, markdown-like protocol called CommerceTXT. The goal: see whether a flatter structure is more efficient for LLM context windows. The results:
- Size: 30,511 products across 632 categories.
- Efficiency: The text version uses ~24% fewer tokens (3.6M saved in total) than the equivalent minified JSON.
- Structure: Files are organized in folders (e.g. /products/category/), which helps with testing hierarchical retrieval routers.

The link goes to the dataset on Hugging Face, which has the full benchmarks. Parser code is here: https://github.com/commercetxt/commercetxt

Happy to answer questions about the conversion logic!
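For anyone who wants to get a feel for the token comparison without downloading the dataset, here is a minimal sketch. Everything in it is an assumption for illustration: the product record and its field names are made up, `to_flat_text` is a hypothetical "key: value" layout and not the actual CommerceTXT syntax, and `rough_token_count` is a crude regex proxy rather than a real LLM tokenizer. Real BPE tokenizers often merge JSON punctuation into single tokens, so this proxy will likely overstate the saving compared to the ~24% reported above.

```python
import json
import re

# Hypothetical product record; field names are illustrative, not taken from the dataset.
product = {
    "name": "Example bookcase",
    "category": "bookcases",
    "price": 79.99,
    "currency": "USD",
    "in_stock": True,
}


def to_flat_text(record: dict) -> str:
    """Render a record as a '# title' line plus 'key: value' lines.

    This layout is an assumption for the sketch, not the CommerceTXT spec.
    """
    lines = [f"# {record['name']}"]
    for key, value in record.items():
        if key != "name":
            lines.append(f"{key}: {value}")
    return "\n".join(lines)


def rough_token_count(text: str) -> int:
    """Crude proxy: count word runs and individual punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# Minified JSON: no whitespace after separators.
minified = json.dumps(product, separators=(",", ":"))
flat = to_flat_text(product)

json_tokens = rough_token_count(minified)
flat_tokens = rough_token_count(flat)
saving = 1 - flat_tokens / json_tokens

print(f"minified JSON: {json_tokens} proxy tokens")
print(f"flat text:     {flat_tokens} proxy tokens")
print(f"relative saving: {saving:.0%}")
```

To get numbers comparable to the benchmark, you would swap `rough_token_count` for the tokenizer of the model you care about and run it over the real files instead of a single made-up record.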
For example, Google’s indexers already use this to surface pricing data. https://developers.google.com/search/docs/appearance/structu...