yy025 is a flexible utility I wrote to generate custom HTTP from URLs. It is controlled through environmental variables. nc is a tcpclient, such as netcat. proxy is a HOSTS file entry for a localhost TLS proxy. The sequence "yy025|tcpclient" is normally contained in a shell script that adds a <base href> tag, something like
For example, Plan9 grep does not have an -o option.
This solution is fast and flexible, but not portable.
There are myriad other portable solutions using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials these can still be faster than Python (due to Python start up time alone).
Solution B - Use flex to make small, fast, custom utilities
Create a file called 1.l that contains
int fileno(FILE *);
#define jmp (yy_start) = 1 + 2 *
#define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
%s xa xb
%option noyywrap noinput nounput
%%
\<h3 jmp xa;
<xa>\> jmp xb;
<xb>\< jmp 0;
<xb>[^<]* echo;putchar(10);
.|\n
%%
int main(){ yylex();exit(0);}
yy4 extracts and prints the job catgeory name and totalcount
We can either solve this in steps where we create files or we can do it as a single pipeline. I personally find breaking a problem into discrete steps is easier.
Thanks to yy025, we are using HTTP/1.1 pipelining. This is a feature of HTTP that almost 100% of httpd's support (I cannot name one that doesn't) however neither "modern" browsers nor cURL cannot take advantage of it. Multiple HTTP request are made over a single TCP connection. Unlike the Python tutorial in the video we are not "hammering" a server with multiple TCP connections at the same time, nor are we making a number of successive TCP connections that could "trigger a block". We are following the guidance of the RFCs which historically recommended that clients not open many connections to the same host at the same time. Here we only open one for retrievng all the jobs pages. Adding a delay between requests is unnecessary. We allow the server to return the results at its own pace. For most websites, this is remarkably fast. Craigslist is an anamaly and is rather slow.
What are ka and ka-. yy025 sets HTTP headers acording to environmental variables. For example, the value of Connection is set to "close" by default. To change it,
Retrieving the HTML
yy025 is a flexible utility I wrote to generate custom HTTP from URLs. It is controlled through environmental variables. nc is a tcpclient, such as netcat. proxy is a HOSTS file entry for a localhost TLS proxy. The sequence "yy025|tcpclient" is normally contained in a shell script that adds a <base href> tag, something like yy045 is a utility that removes chunked transfer encoding.The benefit of using separate, small programs that do one thing will be illustrated in the solution for Problem 3.