| Problem 3 - Extract totalcount value from <span> tag in Craigslist job pages

Create a file called 3.l containing:

 int fileno(FILE *);
 #define jmp (yy_start) = 1 + 2 *
%s xa xb xc
%option noyywrap noinput nounput
%%
\<ul\40id=\"jjj0\" jmp xa;
<xa>"</ul>" yyterminate();
<xa><a\40href=\" jmp xb;
<xb>\" putchar(10);jmp xa;
<xb>[^\"]* fprintf(stdout,"%s%s","https://newyork.craigslist.org",yytext);
.|\n
%%
int main(){ yylex();exit(0);}

Compile:

flex -8iCrf 3.l
cc -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy3

yy3 extracts and prints the URLs for the job pages.

Create a file called 4.l containing:

 int fileno(FILE *);
 #define jmp (yy_start) = 1 + 2 *
 #define echo do{if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
%s xa xb xc xd xe
%option noyywrap noinput nounput
%%
\<h1\40class=\"cattitle\" jmp xa;
<xa>\<a\40href jmp xb;
<xb>\"\> jmp xc;
<xc>[^<]* fprintf(stdout,"%s ",yytext);jmp xd;
<xd>\<span\40class=\"totalcount\"\> jmp xe;
<xe>\< jmp 0;
<xe>[0-9]* echo;putchar(10);
.|\n
%%
int main(){ yylex();exit(0);}

Compile:

flex -8iCrf 4.l
cc -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy4

yy4 extracts and prints the job category name and totalcount.

We can either solve this in steps where we create files, or we can do it as a single pipeline. I personally find that breaking a problem into discrete steps is easier.

In steps:

echo http://newyork.craigslist.org|yy025|nc -vv proxy 80|yy045 > 1.htm;
ka;yy3 < 1.htm|yy025|nc -vv proxy 80|yy045 > 2.htm;ka-;
yy4 < 2.htm;

As a single pipeline:

echo http://newyork.craigslist.org|yy025|nc -vv proxy 80|yy045|yy3|(ka;yy025)|nc -vv proxy 80|yy045|yy4;ka-

Shortened further by using a shell script called nc0 for the yy025|nc|yy045 sequence:

echo https://newyork.craigslist.org|nc0|yy3|(ka;nc0)|yy4
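
The nc0 script itself is not shown here. A minimal sketch of what it might look like, written as a shell function and assuming it does nothing more than wrap the three-stage sequence used above (the real script may differ):

```shell
#!/bin/sh
# Sketch only: wraps the yy025|nc|yy045 sequence from the pipelines above.
# "proxy" and port 80 are the same endpoint used in the original commands.
# Reads URLs on stdin, writes the retrieved pages on stdout.
nc0() {
    yy025 | nc -vv proxy 80 | yy045
}
```

Because yy025 reads its header settings from the environment, running (ka;nc0) in a subshell would still pick up Connection=keep-alive under this sketch.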

Thanks to yy025, we are using HTTP/1.1 pipelining. This is a feature of HTTP that almost 100% of httpd's support (I cannot name one that doesn't); however, neither "modern" browsers nor cURL can take advantage of it. Multiple HTTP requests are made over a single TCP connection. Unlike the Python tutorial in the video, we are not "hammering" a server with multiple TCP connections at the same time, nor are we making a number of successive TCP connections that could "trigger a block". We are following the guidance of the RFCs, which historically recommended that clients not open many connections to the same host at the same time. Here we open only one for retrieving all the jobs pages. Adding a delay between requests is unnecessary. We allow the server to return the results at its own pace. For most websites, this is remarkably fast. Craigslist is an anomaly and is rather slow.

What are ka and ka-? yy025 sets HTTP headers according to environment variables. For example, the value of Connection is set to "close" by default. To change it:

Connection=keep-alive yy025 < url-list|nc -vv proxy 80 >0.htm

Another way is to use aliases:

alias ka="export Connection=keep-alive;set|sed -n /^Connection/p";
alias ka-="export Connection=close;set|sed -n /^Connection/p";
ka;yy025 < url-list|nc -vv proxy 80 >0.htm;ka-
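
To make the pipelining concrete, here is a rough sketch of the kind of request stream that gets written to the single TCP connection. This is not yy025; pipeline_requests and emit_request are hypothetical helpers, and the sketch assumes every URL in the list is on the same host, contains no whitespace, and that the final request should carry "Connection: close" so the server ends the session after the last response.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only (not the yy025 source).

# Print one HTTP/1.1 GET request for URL $1 with Connection header value $2.
emit_request() {
    rest=${1#*://}            # drop the scheme, e.g. "https://"
    host=${rest%%/*}          # everything up to the first slash
    case $rest in
        */*) path=/${rest#*/} ;;
        *)   path=/ ;;        # bare host: request the root document
    esac
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: %s\r\n\r\n' \
        "$path" "$host" "$2"
}

# Read URLs on stdin and write all requests back to back. Every request
# but the last uses $Connection (default "close"); the last uses "close".
pipeline_requests() {
    keep=${Connection:-close}
    prev=
    while IFS= read -r url; do
        if [ -n "$prev" ]; then
            emit_request "$prev" "$keep"
        fi
        prev=$url
    done
    if [ -n "$prev" ]; then
        emit_request "$prev" close
    fi
    return 0
}
```

All requests are written before any response is read, and the responses come back in order on the same connection, which is why no delay between requests is needed.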

yy025 is intended to be used with djb's envdir. Custom sets of headers can thus be defined in a directory.

This solution uses fewer resources, both on the client side and on the server side, than a Python approach. It is probably faster, too. |