by sergiotapia 652 days ago
In Elixir, I select the `<body>`, remove all `<script>` and `<style>` tags, then extract the text.

This gives roughly the innerText you'd get in a browser, which is light enough to pass straight into LLMs.

    defp extract_inner_text(html) do
      html
      |> Floki.parse_document!()
      |> Floki.find("body")
      # Returning nil from the callback drops the node and everything inside it.
      |> Floki.traverse_and_update(fn
        {tag, _attrs, _children} when tag in ["script", "style"] ->
          nil

        node ->
          node
      end)
      |> Floki.text(sep: " ")
      |> String.trim()
      |> String.replace(~r/\s+/, " ")
    end
1 comment

An example of where this approach is problematic: many ecommerce product pages embed JSON in `<script>` tags and use it to dynamically update sections of the page. Stripping every `<script>` tag throws that data away along with the actual JavaScript.
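One way around this, as a rough sketch rather than anything the parent comment does: collect the contents of JSON-typed script tags before stripping scripts, and return them next to the text. The module name `PageText`, the `extract/1` function, the returned map shape, and the two `type` values checked are all assumptions for illustration; real pages may stash their data under other types or inside plain inline scripts. The `Mix.install` line is only there so the snippet runs as a standalone script.

    # Only needed when run as a standalone script; in a Mix project, add Floki to deps instead.
    Mix.install([{:floki, "~> 0.36"}])

    defmodule PageText do
      # Hypothetical helper: visible text plus the raw bodies of JSON script tags.
      # Assumption: embedded data lives in scripts with one of these type attributes.
      @json_selectors [
        "script[type='application/json']",
        "script[type='application/ld+json']"
      ]

      def extract(html) do
        body = html |> Floki.parse_document!() |> Floki.find("body")

        # Grab the JSON payloads first, since the next step deletes every script tag.
        # Results are grouped by selector, not by document order.
        embedded_json =
          Enum.flat_map(@json_selectors, fn selector ->
            body |> Floki.find(selector) |> Enum.map(&script_body/1)
          end)

        text =
          body
          |> Floki.traverse_and_update(fn
            {tag, _attrs, _children} when tag in ["script", "style"] -> nil
            node -> node
          end)
          |> Floki.text(sep: " ")
          |> String.replace(~r/\s+/, " ")
          |> String.trim()

        %{text: text, embedded_json: embedded_json}
      end

      # A script tag's children are its raw text; join them and trim whitespace.
      defp script_body({_tag, _attrs, children}) do
        children |> Enum.filter(&is_binary/1) |> Enum.join() |> String.trim()
      end
    end

The JSON strings are returned undecoded, so it's up to the caller to parse them or just append them to the prompt. This only looks inside `<body>`, same as the original function, so anything embedded in `<head>` is still missed.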