by epistasis 5403 days ago

I don't really get the point. Unix has been fantastic at scraping and munging text for decades.

  curl http://weather.yahoo.com/united-states/california/san-jose-2488042/ | sed -n '/Current conditions/s/.*id="yw-temp">\([0-9]\+\).*/\1/p'

It may be fragile, but any method of extracting data out of HTML is going to be fragile when the provider changes design or layout.
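
As a rough offline illustration of the same extraction, here is a sketch that runs against a fixed string instead of the live page. The HTML fragment is made up for the example and is only an assumption about what the markup looked like. The `-n` flag and trailing `p` print just the line where the substitution succeeded, and `[0-9][0-9]*` stands in for the GNU-only `\+`.

```shell
# Hypothetical sample page; the real Yahoo markup is an assumption.
html='<html>
<div>Current conditions <span id="yw-temp">72</span></div>
<div>Forecast <span id="yw-temp">80</span></div>
</html>'

# Print only the captured digits from the "Current conditions" line.
extract_temp() {
  sed -n '/Current conditions/s/.*id="yw-temp">\([0-9][0-9]*\).*/\1/p'
}

temp=$(printf '%s\n' "$html" | extract_temp)
echo "$temp"
```

If the provider renames the `id` or moves the label to another line, the pattern silently matches nothing, which is the fragility described above.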

A little knowledge of grep, sed, and awk, plus other simple Unix text utilities such as join, comm, cut, and paste, goes a long, long way.
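
To give a feel for how a few of these combine, here is a minimal sketch using join, cut, and comm. The file names and sample data are invented for the example.

```shell
# Made-up sample data: city IDs mapped to names, and to temperatures.
tmp=$(mktemp -d)
printf '1 san-jose\n2 oakland\n3 fresno\n' > "$tmp/cities"
printf '1 72\n3 95\n' > "$tmp/temps"

# join matches lines on the first field (both inputs must be sorted on it);
# cut then keeps only the name and temperature columns.
joined=$(join "$tmp/cities" "$tmp/temps" | cut -d' ' -f2,3)
echo "$joined"

# comm -23 lists IDs present in the first file but missing from the second.
cut -d' ' -f1 "$tmp/cities" > "$tmp/ids_all"
cut -d' ' -f1 "$tmp/temps" > "$tmp/ids_temp"
missing=$(comm -23 "$tmp/ids_all" "$tmp/ids_temp")
echo "$missing"

rm -r "$tmp"
```

The join step behaves like an inner join on the ID column, and comm finds the rows that have no match.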

Jacob4u2 5403 days ago

The example happens to use "munging" text, but I think the GP is trying to make the point that you can't use sed (effectively) to parse, for instance, a collection of database entries from an SQL server in the same way that LINQ would be able to do so.

The tl;dr I got from the article was that LINQ is effective at working with sets of data, not just sets of text data from a text file.

_delirium 5403 days ago

True, although Plan 9 pushed that part of the Unix philosophy even further, to the point where it arguably handles some of those more general cases as well, with "structural regexes" that work on things other than collections of lines: http://doc.cat-v.org/bell_labs/structural_regexps/
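
Structural regexes proper are a feature of the sam editor, which most systems don't have installed. As a loose approximation only, and not the mechanism from the linked paper, awk's paragraph mode shows the basic idea of matching against records that are not single lines. The sample records below are invented for the example.

```shell
# Made-up records, separated by blank lines rather than one per line.
records='name: alice
lang: awk

name: bob
lang: sed

name: carol
lang: awk'

# RS="" switches awk to paragraph mode: each blank-line-separated block is one record.
# FS="\n" makes each line of the block a separate field.
awk_users=$(printf '%s\n' "$records" |
  awk 'BEGIN { RS = ""; FS = "\n" } /lang: awk/ { sub(/^name: /, "", $1); print $1 }')
echo "$awk_users"
```

The pattern is tested against the whole multi-line record, so a match on the second line selects the name from the first.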

Jacob4u2 5403 days ago

Isn't the general consensus that regexes are hard to maintain and debug? I'm not sure that "structural regexes" are solving the right problem.

I think maybe we're not shooting at the same baskets though (basketball reference, apologies if you're not from the US); I'm trying to write software applications, not shell scripts.