by benl_c 545 days ago
The backend is still a mess of code, so no. It's not too hard to do though. The prompt I used to extract locations is:

"The text provided are enviornmental science papers. They often (but not always) will include references to locations where this science is relevant, for example a study might be of soil around a small town, in this case the town would be the relevant location, extract all locations that are relevant or the subject of the science done, do not extract any locations that are related to the location of or institutions, organisations, or laboratories. So for example exclude the location of government departments and CSIRO laboratories. If there are no relevant locations, please return an empty array. Each location should be extracted in a form suitable for calling the Nominatim geocode API in Python via geopy. Also, extract out a short context string that describes the context in which this location is referenced. Please provide the output in JSON format."

Then I passed each extracted location through both Nominatim and the Google Geocoder. Google worked better.
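For anyone wanting to try the geocoding step, here is a minimal sketch of trying several geocoders in turn. The function name `geocode_with_fallback` and the "first hit wins" ordering are my own assumptions, not how the backend above actually works. It only relies on each geocoder being a callable that returns an object with `latitude` and `longitude` (or `None`), which is what geopy's `geocode` methods do.

```python
from typing import Callable, Iterable, Optional, Tuple

Coords = Tuple[float, float]


def geocode_with_fallback(
    query: str, geocoders: Iterable[Callable[[str], object]]
) -> Optional[Coords]:
    """Return (lat, lon) from the first geocoder that finds the query."""
    for geocode in geocoders:
        try:
            hit = geocode(query)
        except Exception:
            # Timeouts, quota errors etc.: move on to the next service.
            continue
        if hit is not None:
            return (hit.latitude, hit.longitude)
    return None


# Hypothetical wiring with geopy (needs network access and an API key):
#   from geopy.geocoders import GoogleV3, Nominatim
#   google = GoogleV3(api_key="YOUR_KEY")
#   nominatim = Nominatim(user_agent="location-extraction-sketch")
#   geocode_with_fallback("Wagga Wagga, NSW, Australia",
#                         [google.geocode, nominatim.geocode])
```

Putting Google first just reflects the observation that it worked better; swap the order if you would rather stay on the free service where possible.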

One thing that didn't work great in the prompt above was excluding the location of places where the authors worked. They sometimes got included anyway.

2 comments

> One thing that didn't work great in the prompt above was excluding the location of places where the authors worked. They sometimes got included anyway.

Have you tried adding the institutions as an explicit property in the JSON response and just ignoring the second list?

I’ve had much better luck with having LLMs explicitly choose a different label when working with similar types of entities than asking the LLM to exclude them via prompting. This way you can also spot ambiguity if the LLM adds a location to both arrays.
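A rough sketch of what I mean, assuming the model returns two arrays. The keys `study_locations` and `institution_locations`, and the `name`/`context` fields, are made-up names for illustration.

```python
import json
from typing import Dict, List, Tuple

Location = Dict[str, str]


def split_locations(raw_json: str) -> Tuple[List[Location], List[Location]]:
    """Return (kept study locations, ambiguous ones listed in both arrays)."""
    data = json.loads(raw_json)
    study = data.get("study_locations", [])
    institutions = data.get("institution_locations", [])

    # Compare on a normalised name so "Canberra" and " canberra" match.
    inst_names = {loc["name"].strip().lower() for loc in institutions}

    kept, ambiguous = [], []
    for loc in study:
        if loc["name"].strip().lower() in inst_names:
            ambiguous.append(loc)
        else:
            kept.append(loc)
    return kept, ambiguous
```

The institution list is simply thrown away afterwards; the ambiguous list is what you would eyeball by hand.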

I have not done that, but I like that strategy, not just for this use case but as a general idea: replacing exclusion with finer-grained categorisation. One thing I did do is use a regex to preprocess the papers and remove bibliographies, which were a really big source of noise. The titles of referenced papers would often mention a location that was not directly relevant to the paper itself.
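The bibliography stripping was roughly along these lines. This is a reconstruction for illustration rather than the exact regex; the heading list and the name `strip_bibliography` are assumptions. It cuts everything from the last line that consists only of a references-style heading.

```python
import re

# A line holding only a heading such as "References", "7. References"
# or "Bibliography:" (case-insensitive).
_REFS_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?"
    r"(?:references|bibliography|literature cited|works cited)"
    r"[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_bibliography(text: str) -> str:
    """Drop the reference list; return text unchanged if no heading is found."""
    matches = list(_REFS_HEADING.finditer(text))
    if not matches:
        return text
    # Use the last heading so an earlier stray match does not eat the body.
    return text[: matches[-1].start()].rstrip()
```

Note that this also drops any appendix printed after the reference list, which was an acceptable loss for this purpose.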

The Atlas is also trying to answer the question "Can we build inaccurate and incomplete systems with LLMs that are still useful?".

Cool project. Note that you can now force structured output instead of asking for JSON: https://platform.openai.com/docs/guides/structured-outputs
Thanks, structured output makes a lot more sense. The pydantic approach at the link looks straightforward.
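For reference, a sketch of what the schema could look like as plain JSON Schema instead of pydantic, combining it with the two-list idea from upthread. All field names and the schema name `paper_locations` are hypothetical. As far as I understand the linked guide, strict mode wants every property listed under `required` and `additionalProperties` set to false on each object, so treat the exact shape as an assumption to check against the docs.

```python
# One extracted place plus the sentence-level context it appeared in.
LOCATION = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "context": {"type": "string"},
    },
    "required": ["name", "context"],
    "additionalProperties": False,
}

# Two separate arrays: places the science is about, and places where
# authors or institutions happen to be based (ignored downstream).
SCHEMA = {
    "type": "object",
    "properties": {
        "study_locations": {"type": "array", "items": LOCATION},
        "institution_locations": {"type": "array", "items": LOCATION},
    },
    "required": ["study_locations", "institution_locations"],
    "additionalProperties": False,
}

# Value intended for the `response_format` argument of a chat completion
# request; no API call is made in this sketch.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "paper_locations",
        "strict": True,
        "schema": SCHEMA,
    },
}
```

With this in place, the "please return an empty array" and "provide the output in JSON format" sentences in the prompt should no longer be needed, since an empty result is just two empty arrays.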