| Problem 1 - Extract the values of <h2> tags from NYT front page

NB. In 1.htm, NYT is using the <h3> tag for headlines, not <h2> as in the 2020 video.

Solution A - Use UNIX utilities

grep -o "<h3[^\>]*>[^\<]*" 1.htm |sed -n '/indicate-hover/s/.*\">//p'
The grep utility is ubiquitous, but the -o option is not. https://web.archive.org/web/20201202103125/https://pubs.open... For example, Plan9 grep does not have an -o option. This solution is fast and flexible, but not portable. There are myriad portable alternatives using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials these can still be faster than Python (due to Python startup time alone).

Solution B - Use flex to make small, fast, custom utilities

Create a file called 1.l that contains

 int fileno(FILE *);
 #define jmp (yy_start) = 1 + 2 *
 #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
%s xa xb
%option noyywrap noinput nounput
%%
\<h3 jmp xa;
<xa>\> jmp xb;
<xb>\< jmp 0;
<xb>[^<]* echo;putchar(10);
.|\n
%%
int main(){ yylex();exit(0);}
Then compile with something like

flex -8iCrf 1.l
cc -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy1

And finally,

yy1 < 1.htm
This is faster than Python.

Solution C - Extract values from JSON instead of HTML

The file 1.htm contains a large proportion of what appears to be JSON. I wrote a quick and dirty WIP JSON reformatter called yy059 that takes web pages as input. https://news.ycombinator.com/item?id=31174088

yy059 < 1.htm|sed -n '/promotionalHeadline\":\"[^\"]/p'|cut -d\" -f4
Sure enough, the JSON contains the headlines. One could rewrite Solution B to extract from the JSON instead of the HTML. |
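For anyone without grep -o or yy059, here is a rough sketch of both extractions using only tr, sed and awk. The function names extract_h3 and json_headlines are made up for illustration. The first assumes the headline text directly follows an <h3> tag whose attributes contain indicate-hover, as in Solution A. The second assumes headline values contain no escaped quotes, the same limitation as the cut -d\" step above, and that the local awk copes with very long input lines. This is an awk stand-in, not the flex rewrite suggested above.

```shell
# Hypothetical helper: a stand-in for the grep -o pipeline in Solution A.
# Reads HTML on stdin and prints the text right after each matching <h3> tag.
# Unlike the flex -i build in Solution B, the match is case-sensitive.
extract_h3() {
  # Join the input into one line (echo restores the final newline),
  # then start a new line at every '<' so each line reads "tag attrs>text".
  { tr '\n' ' '; echo; } |
    tr '<' '\n' |
    sed -n '/^h3[^>]*indicate-hover[^>]*>/s/^[^>]*>//p'
}

# Hypothetical helper: pull promotionalHeadline values out of embedded JSON
# without reformatting it first. Reads stdin, prints one value per line.
json_headlines() {
  awk '{
    key = "\"promotionalHeadline\":\""
    s = $0
    while ((i = index(s, key)) > 0) {
      s = substr(s, i + length(key))        # text right after the key
      j = index(s, "\"")                    # closing quote of the value
      if (j == 0) break                     # no closing quote on this line
      if (j > 1) print substr(s, 1, j - 1)  # skip empty values
      s = substr(s, j)
    }
  }'
}
```

Usage would be along the lines of extract_h3 < 1.htm and json_headlines < 1.htm. Neither has been timed against yy1.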