by Aloisius
4580 days ago
I'd have to check the last crawl settings, but I believe the last crawl was set to truncate at 1 MB (response body size, so that could be 1 MB uncompressed or 1 MB compressed depending on what the source web server sent out). At one point I tried out a 10 MB limit, but we try to limit crawls to webpages and few are that big, and occasionally we'd hit sites on ISDN-speed connections that would slow down the whole thing. For the next crawl, we'll mark which pages are truncated and which aren't (an oversight in the last crawl) so at least you can skip over them. Also, hopefully you'll find the new metadata files to be a little clearer. We switched over to the same format Internet Archive uses and it contains quite a bit more data (xpath truncated paths for each link, for instance).
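Since the last crawl doesn't flag truncated pages, one rough workaround is to treat any stored body that reaches the limit as possibly cut off. This is only a sketch: it assumes "1 MB" means 1024 * 1024 bytes (the actual crawler setting may differ), and the names `TRUNCATE_LIMIT`, `maybe_truncated`, and `usable_bodies` are made up for illustration, not part of any crawl tooling.

```python
# Assumption: "1 MB" means 1024 * 1024 bytes; the real crawler limit may differ.
TRUNCATE_LIMIT = 1024 * 1024


def maybe_truncated(body: bytes, limit: int = TRUNCATE_LIMIT) -> bool:
    """Return True if the stored body is at least `limit` bytes long.

    The length is measured on the bytes as stored (compressed or not),
    so this suggests truncation but does not prove it.
    """
    return len(body) >= limit


def usable_bodies(bodies, limit: int = TRUNCATE_LIMIT):
    """Yield only the bodies that were probably stored in full."""
    for body in bodies:
        if not maybe_truncated(body, limit):
            yield body
```

A page that happens to be exactly at the limit would be skipped too, so this errs on the side of dropping a few complete pages rather than keeping cut-off ones.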