| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by wruza 825 days ago

The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.

  You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.

  Data to scrape: 
  title: Name of the business
  type: The business nature like Cafe, Coffee Shop, many others
  phone: The phone number of the business
  address: Address of the business, can be a state, country or a full address
  years_in_business: Number of years since the business started
  hours: Business operating hours
  rating: Rating of the business
  reviews: Number of reviews on the business
  price: Typical spending on the business
  description: Extra information that is not mentioned yet in any of the data
  service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
  is_operating: Whether the business is operating
  
  HTML: 
  {html}

1 comments

infecto 825 days ago

This should be higher up. This whole blog post is mostly worthless because the way they are extracting data is less than optimal.

Lower end models do not have the attention to complete tasks like this, GPT4Turbo will generally have the capability. But to have an optimal pipeline you should really be splitting up these tasks into individual units. You extract each attribute you want independently and then combine it back together however you want. Also asking for JSON upfront is equally suboptimal in the whole process.

I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Edit: I am not suggesting that an LLM is more optimal than what ever traditional parsing methods they may use, simply the way they are doing it is wrong from an LLM flow.

link

ilyazub 821 days ago

> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Cool, cool. I'm super interested. Please share the process and the results.

link

wruza 825 days ago

Also, my (limited) experience with prompts tells that you want to invest more into the “You are” part. I’ll share my understanding, corrections are appreciated.

LLMs aren’t people even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a protocol of conversation. The “AI” side in the chat isn’t an LLM itself, it’s a character (and so are you, it completes your “You: …” replies too - that’s where the driver app stops it and allows you to interfere). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have it, no GPT access for me). I found out that properly crafting a character card makes a huge difference and can resolve the whole classes of issues.

Idk what will work best in this case, but I’d start with describing which sort of a bot, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/tech skills and problem solving abilities, what other people think of it, their experience and so on. Maybe would add few examples of interactions in a free form. Then in the task message I’d tell it more and specific details about that json.

One more note - at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat that AI character gets asked to turn into something else, which poisons the setting.

Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.

I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.

link

mhuffman 825 days ago

Do you have an example of a more optimal prompt to share?

link

infecto 825 days ago

The prompt does not matter as much as the workflow which is describe above. 1) Extract one attribute at a time. 2) Don't ask for json during extraction, but on binary small attributes it might not matter as much.. 3) Combine the data later.

There are differences that can be marked on how different models perform against the same raw prompt but generally the workflow is what matters more. The raw text prompt will be dependent on what model you are using as there are those differences but I don't think its a level of "prompt engineering" like we had a year ago.

link