by shanty 2578 days ago
I wasn't responsible for the data intake, but I know that the data was extensive and always included time on page, full URL, and other request information (often POST data).

I know that HTTPS posed a technical hurdle that our company and data providers worked around after about 6 months.

My guess is that it was some MITM-type collection. Some data providers gave us IPs and some just gave us a tokenized ID. I don't know if ISPs provided IPs, but probably not.

Note that we did lots of data linking. Let's say an ISP provided us your age, URL, and timestamp. We would link that to data from another provider that supplied past purchases, URL, and timestamp (shopping toolbars/plugins do this) to get a bigger picture of who you are.
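To make the linking step concrete, here is a minimal sketch of joining two feeds on URL plus a near-matching timestamp. This is not our actual pipeline: the record classes, the `token_id` field, the example URLs, and the 2-second tolerance are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IspRecord:
    token_id: str   # hypothetical tokenized ID from the provider
    age: int
    url: str
    timestamp: int  # Unix seconds


@dataclass(frozen=True)
class ToolbarRecord:
    past_purchases: tuple
    url: str
    timestamp: int  # Unix seconds


def link_records(isp_records, toolbar_records, max_skew_seconds=2):
    """Pair records that share a URL and have timestamps within the tolerance."""
    # Index one feed by URL so each lookup only scans same-URL candidates.
    by_url = {}
    for rec in toolbar_records:
        by_url.setdefault(rec.url, []).append(rec)

    linked = []
    for isp in isp_records:
        for tb in by_url.get(isp.url, []):
            if abs(isp.timestamp - tb.timestamp) <= max_skew_seconds:
                linked.append({
                    "token_id": isp.token_id,
                    "age": isp.age,
                    "url": isp.url,
                    "past_purchases": tb.past_purchases,
                })
    return linked


# Made-up sample rows: only the first pair matches on both URL and time.
isp_rows = [
    IspRecord("tok-1", 34, "https://shop.example/item/42", 1_500_000_000),
    IspRecord("tok-2", 51, "https://news.example/story", 1_500_000_100),
]
toolbar_rows = [
    ToolbarRecord(("running shoes",), "https://shop.example/item/42", 1_500_000_001),
    ToolbarRecord(("garden hose",), "https://news.example/story", 1_500_009_999),
]
linked = link_records(isp_rows, toolbar_rows)
```

The point is that neither feed needs a shared user identifier. A URL visited at nearly the same moment is enough of a key to merge the age from one source with the purchase history from the other.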


>get a bigger picture of who you are.

Sorry if I'm reading too much into this, but are you saying this data being collected and sold contains PII?

Well, PII is a bit of a nebulous term. Some websites still transfer some signup/user info in URL parameters or unencrypted responses. We would even see SSNs pop up now and then.

Most data being sold has had some good-faith effort made to remove PII, but that's never 100% complete, and by combining multiple data sources, an industrious person or team could de-anonymize your data. We were mostly doing this type of work for segmentation and persona analysis. Targeting an individual was never a goal, but it would not have been terribly difficult.

I'll give you an example. We might receive all URLs a person visited. Many contain personal information that would not be caught by the usual PII filtering process: https://mail.google.com/mail/u/1/#search/my+viagra+prescript...
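Here is a rough sketch of why a pattern-based filter misses that kind of URL. The two regexes (SSN-shaped numbers and email addresses) are my assumption of what a typical filter looks for, not any specific vendor's rules, and the signup URL is invented. The Gmail URL is the truncated one from above, used as-is.

```python
import re
from urllib.parse import urlsplit, unquote_plus

# Assumed patterns for a naive filter: SSN-shaped numbers and email addresses.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
]


def passes_naive_pii_filter(url):
    """Return True when no known PII pattern appears anywhere in the URL."""
    return not any(pattern.search(url) for pattern in PII_PATTERNS)


def gmail_search_terms(url):
    """Decode the free-text search carried in a '#search/...' fragment."""
    fragment = urlsplit(url).fragment            # e.g. 'search/my+viagra+...'
    return unquote_plus(fragment.partition("/")[2])


signup_url = "https://signup.example/confirm?ssn=123-45-6789"  # invented example
gmail_url = "https://mail.google.com/mail/u/1/#search/my+viagra+prescript..."

signup_passes = passes_naive_pii_filter(signup_url)  # the SSN pattern catches this one
gmail_passes = passes_naive_pii_filter(gmail_url)    # nothing pattern-shaped to catch
search_terms = gmail_search_terms(gmail_url)
```

The signup URL gets flagged because the SSN has a recognizable shape. The Gmail search URL goes straight through, since free text about a prescription does not look like any structured identifier, yet it says a lot about the person.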