
by 1vuio0pswjnm7 1205 days ago

Please share some examples of webpages with data to be wrangled that support this statement:

"The reality is that shit is hard, doesn't scale (classic blocking for-loop or async saturation), and comes with thorny maintenance/security issues."

Every web user's needs are different. One person might have a task that they struggle to accomplish while another might have one that presents no major challenges. As a web user, I transform web pages to CSV or SQL. I log HTTP and network requests. I do this for free using open source software. No web browser needed. No docker image needed. Works on both Linux and BSD.

For me, the web is a dataset from which I retrieve data/information. "Tech" companies want the web to be more like a video game, with visuals and constant interactivity.

2 comments

portInit 1205 days ago

Thanks for this, we're still trying to figure out these details ourselves.

Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging, like credential management, domain throttling, scheduling, checkpointing, export, etc.

It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.

Initially, we were mainly focused on webpages, and wanted to bypass the visuals of the browser and use a headless browser to fulfill network requests, render JS, etc., then convert the page to a flat table of enriched elements. With APIs as another data set, there is some work for us to do around language.

We're now trying to figure out which workflows are most relevant for crul to optimize around. Honestly, we just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.


nine_k 1205 days ago

Do you have an easy way to transform e.g. a typical Amazon product page into nicely structured data? Not that it's trying to be very video-gamey.


1vuio0pswjnm7 1205 days ago

What should the structure look like? If it is CSV, what are the columns, i.e., what specific data does it need to include?

Taking a quick look at the Amazon site these product pages appear to be enormous in size. Interestingly, the website requires a "viewport-width" header. Otherwise one gets directed to a CAPTCHA.
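As a rough illustration of the header point, here is a minimal Python sketch that builds (but does not send) a request carrying a "viewport-width" header. The URL and the value "1280" are assumptions for illustration only; the observation above says only that the header is required, not what value is expected.

```python
from urllib.request import Request

def build_request(url, viewport_width="1280"):
    """Build a GET request that carries a viewport-width header.

    The default width is a placeholder value, not something confirmed
    by the site; adjust it after testing against the real page.
    """
    return Request(url, headers={"viewport-width": viewport_width})

# Hypothetical URL; nothing is fetched until the request is passed to
# urllib.request.urlopen().
req = build_request("https://example.com/some-product-page")
print(req.header_items())
```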

The product page I checked already has some structured data in the form of JSON, including keys such as

   "title":"xxxxxxxxxx"
   "displayPrice":"$000.00"
   "priceAmount":000.00
   "currencySymbol":"$"
   "integerValue":"000"
   "decimalSeparator":"."
   "fractionalValue":"00"
   "symbolPosition":"left"
   "asin": "xxxxxxxxx"
   "asin":"xxxxxxxxxx"
   "acAsin":"xxxxxxxxxx"
   "buyingOptionTypes":["NEW"]
   "productAsin":"xxxxxxxxxx"
   "mediaAsin":"xxxxxxxxxxx"
   "parentAsin":"xxxxxxxxx"
   "asinList":"xxxxxxxxxx"

Thus, CSV with product name, price and ASIN would appear to be easy. No need to mess with the HTML.
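A minimal sketch of that idea in Python, assuming the page text has already been saved or fetched. The key names ("title", "displayPrice", "asin") are the ones listed above; the sample page, the values in it and the helper names are made up for illustration. It takes the first occurrence of each key, which may not be the right one on a real page where a key appears several times.

```python
import csv
import io
import re

# Hypothetical stand-in for a fetched product page; values are placeholders.
SAMPLE_PAGE = '''
<script>{"title":"Example Widget","displayPrice":"$12.34","asin": "B000000000"}</script>
'''

def first_string_value(page, key):
    """Return the first string value found for a JSON key, or None."""
    pattern = r'"%s"\s*:\s*"((?:[^"\\]|\\.)*)"' % re.escape(key)
    match = re.search(pattern, page)
    return match.group(1) if match else None

def page_to_csv_row(page, fields=("title", "displayPrice", "asin")):
    """Format the chosen fields as one CSV row, leaving missing ones empty."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([first_string_value(page, key) or "" for key in fields])
    return out.getvalue()

print(page_to_csv_row(SAMPLE_PAGE), end="")
```

The regular expression only picks up string values, so a numeric key such as "priceAmount" would need a separate pattern.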

Other data such as, e.g., delivery time, seller, where the item ships from and number left in stock can be extracted from the HTML.

Delivery time is in a <span> that contains "data-csa-c-delivery-time".

Seller, shipping info and number left in stock are under a <span> with class="a-size-base _p13n-desktop-sims-fbt_fbt-desktop_shipping-info-show-box__17yWM"
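For the delivery time, a minimal sketch using only Python's standard library might look like the following. It assumes the delivery time is carried as the value of the "data-csa-c-delivery-time" attribute on the <span>; the sample HTML and its value are invented for illustration, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class DeliveryTimeParser(HTMLParser):
    """Collect values of data-csa-c-delivery-time attributes on <span> tags."""

    def __init__(self):
        super().__init__()
        self.delivery_times = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        for name, value in attrs:
            if name == "data-csa-c-delivery-time":
                self.delivery_times.append(value)

# Invented sample; a real page would be far larger.
SAMPLE_HTML = (
    '<div><span data-csa-c-delivery-time="Tuesday, January 3">'
    'FREE delivery</span></div>'
)

def extract_delivery_times(html):
    parser = DeliveryTimeParser()
    parser.feed(html)
    parser.close()
    return parser.delivery_times

print(extract_delivery_times(SAMPLE_HTML))
```

The same approach could be extended to the shipping-info <span> by matching on its class attribute and collecting the text inside it.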

One needs to decide what data one wants from the page.

The way to present an example on which to evaluate a "new" solution such as the one in this thread is to present a problem, e.g.,

Get data items x, y and z from website xyzexample.com.

In the majority of cases I see submitted to HN, it is impossible to benchmark these "new" solutions against existing ones because no example websites are ever provided.


nine_k 1205 days ago

The output may be a collection of CSV files, or a JSON file with nicely structured data, because the page certainly has a pretty visible structure, with various data blocks.
