Hacker News new | ask | show | jobs
by mulmen 844 days ago
Can you be more specific? What filtering did you apply? How many columns did you select?
1 comments

SELECT page, url, payload FROM `{table}` WHERE page like '%{site_domain}/%' AND url like '%[EXAMPLE.COM]%'
I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?
Yeah, 'LIKE' ops usually give you a full table scan, which is brutal. If it was my own data I'd chop the fields up and index them properly - which is the issue here, it's not your data, so you don't get a say in the indexes, but you do have to pay per row scanned even if you can't apply an index of your own.
Seems like an ideal case for pre-processing. You still have to do one full scan but you only have to do one scan.

I’m not familiar with your use case or BigQuery but in Redshift I’d just do a COPY to a local table from S3 then do a CREATE TABLE AS SELECT with some logic to split those URLs for your purpose.

You might even be able to do it all in one step with Spectrum.