by dcferreira
1169 days ago
> This way you can write functions that e.g. take only a URL as argument, load the content from that site, and then summarize it with the LLM. I'm currently doing a project where this would be very helpful, but I can't think of what I'd need to send the LLM.
In my case I'm scraping headlines from many news websites, currently by hand with XPath. What would be the way to use LLMs here? Just sending the HTML wouldn't work, as it's too many tokens. I could probably send all the <a> tags, but then how could I be sure the LLM doesn't choose too many or too few?
|
(The previous example is also good)
A few things you could consider:
1. We have a utility for getting content out of HTML at marvin.utilities.strings.html_to_content. That would probably significantly compress it.
2. Chunk the HTML into batches that fit in context, send each batch to an AI function that summarizes it (you could instruct the AI function to optimize the summary to help with title generation), then send all the resulting summaries to a title generator.
3. We have a suite of HTML loader classes that will probably be ready for production in a couple releases (see https://github.com/PrefectHQ/marvin/blob/main/src/marvin/loa...) but you could try them out now (note: these use parts of Marvin beyond just AI functions, so I'm not recommending it as a drop-in right now). Our loader classes are (ideally) designed to do more than just chunk the input; depending on the nature of the input we do different preprocessing steps to help with insight.
4. Experiment and let us know what you learn - we can incorporate it into a loader class if it's effective.
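To make points 1 and 2 concrete, here is a rough sketch using only the Python standard library. It does not use Marvin's actual API: LinkTextExtractor, batch_by_chars, and map_reduce are hypothetical names invented for illustration, the character budget is a crude stand-in for a real token count, and map_fn / reduce_fn are placeholders for whatever AI functions you define (for example, one that picks out headlines from a batch and one that merges the per-batch results).

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator


class LinkTextExtractor(HTMLParser):
    """Hypothetical helper: keep only the visible text of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self._depth = 0  # > 0 while we are inside an <a> element
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace and skip empty links.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.links.append(text)
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_link_texts(html: str) -> list[str]:
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def batch_by_chars(items: Iterable[str], max_chars: int) -> Iterator[list[str]]:
    """Group items into batches whose joined length stays within max_chars.

    Characters are only a rough proxy for tokens (an assumption of this sketch).
    A single item longer than max_chars still gets a batch of its own.
    """
    batch: list[str] = []
    size = 0
    for item in items:
        cost = len(item) + 1  # +1 for the newline used when joining
        if batch and size + cost > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(item)
        size += cost
    if batch:
        yield batch


def map_reduce(
    items: Iterable[str],
    map_fn: Callable[[str], str],     # placeholder for a per-batch AI function
    reduce_fn: Callable[[str], str],  # placeholder for the final AI function
    max_chars: int = 8000,
) -> str:
    partials = [map_fn("\n".join(b)) for b in batch_by_chars(items, max_chars)]
    return reduce_fn("\n".join(partials))
```

You would call something like map_reduce(extract_link_texts(page_html), pick_headlines, merge_headlines), where pick_headlines and merge_headlines are your own AI functions. Sending only link text rather than raw HTML cuts the token count a lot, but it doesn't by itself answer the too-many/too-few question. That part still needs experimenting, e.g. spot-checking the output against your existing XPath results.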