by wruza
825 days ago
The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.

You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.
Data to scrape:
title: Name of the business
type: The business nature like Cafe, Coffee Shop, many others
phone: The phone number of the business
address: Address of the business, can be a state, country or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending on the business
description: Extra information that is not mentioned yet in any of the data
service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
is_operating: Whether the business is operating
HTML:
{html}
|
|
Lower-end models do not have the attention to complete tasks like this; GPT-4 Turbo generally will. But for an optimal pipeline you should really split the task into individual units: extract each attribute you want independently, then combine the results however you want. Asking for JSON upfront is similarly suboptimal.
I have high confidence that I could accomplish this task with a high degree of accuracy using a lower-end model.
Edit: I am not suggesting that an LLM is better than whatever traditional parsing methods they may use, simply that the way they are doing it is wrong for an LLM workflow.
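
To make the "individual units" idea concrete, here is a minimal Python sketch of one way it could look. It is an illustration under assumptions, not anyone's actual pipeline: `complete` is a hypothetical stand-in for whatever model call you use (prompt string in, text out), `NOT_FOUND` is a made-up sentinel, and the field list is a subset of the quoted prompt. Each attribute gets its own small prompt that asks for a bare value, and the JSON is assembled in code at the end rather than requested from the model.

```python
import json
from typing import Callable, Dict, Optional

# Field names and descriptions copied from the quoted prompt (subset only).
FIELDS: Dict[str, str] = {
    "title": "Name of the business",
    "phone": "The phone number of the business",
    "address": "Address of the business, can be a state, country or a full address",
    "hours": "Business operating hours",
    "rating": "Rating of the business",
}

# Hypothetical sentinel the model is asked to return when a field is absent.
NOT_FOUND = "NOT_FOUND"


def build_prompt(name: str, description: str, html: str) -> str:
    """Build one small prompt per attribute, asking for a bare value, not JSON."""
    return (
        "Extract one value from the HTML below.\n"
        f"Field: {name} ({description})\n"
        f"Reply with the value only, or {NOT_FOUND} if it is absent.\n\n"
        f"HTML:\n{html}"
    )


def extract_business(
    html: str, complete: Callable[[str], str]
) -> Dict[str, Optional[str]]:
    """Run one model call per field and combine the answers into a dict.

    `complete` is an assumed callable: it takes a prompt and returns text.
    """
    record: Dict[str, Optional[str]] = {}
    for name, description in FIELDS.items():
        answer = complete(build_prompt(name, description, html)).strip()
        # Map the sentinel (or an empty reply) to None.
        record[name] = None if answer in ("", NOT_FOUND) else answer
    return record


def to_json(record: Dict[str, Optional[str]]) -> str:
    """Serialize in code, so the model never has to emit valid JSON itself."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Usage would be something like `to_json(extract_business(page_html, complete=my_model_call))`, where `my_model_call` is whatever wrapper you already have. The trade-off is one call per field instead of one call total, in exchange for shorter, easier prompts and no JSON formatting burden on the model.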