by rmdashrfroot 2858 days ago
I wrote a collection of Dockerfiles for images running Python 2.7 or Python 3.6 + Selenium with either Chrome or Firefox and using Xvfb for the X display (necessary for running Selenium headlessly).

https://github.com/seanpianka/docker-python-xvfb-selenium-ch...

Using this in conjunction with AWS Step Functions, Lambda, and ECS, it cost me mere cents a month to run a headless scraper task in the cloud.

1 comments

Can you elaborate a bit? Sounds interesting. I had never heard of AWS Step Functions before.

What does your workflow look like?

Not OP, but to elaborate on AWS Step Functions:

In short - this gives you the ability to pass the output of one lambda function to the input of another lambda function.
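
To make the "output of one becomes input of the next" idea concrete, here is a minimal local sketch. The handler names, event keys, and the `run_chain` helper are all hypothetical; in the real service each handler would be its own Lambda function and Step Functions would do the chaining.

```python
# Hypothetical handlers; in Step Functions each would be a separate Lambda function.
def copy_snapshot(event, context=None):
    """First state: pretend to copy a snapshot and return its identifier."""
    return {
        "snapshot_id": event["source_snapshot"] + "-copy",
        "region": event["target_region"],
    }


def restore_instance(event, context=None):
    """Second state: receives the first state's return value as its event."""
    return {
        "instance_id": "staging-from-" + event["snapshot_id"],
        "region": event["region"],
    }


def run_chain(handlers, initial_input):
    """Local stand-in for the state machine: feed each output into the next handler."""
    payload = initial_input
    for handler in handlers:
        payload = handler(payload)
    return payload


result = run_chain(
    [copy_snapshot, restore_instance],
    {"source_snapshot": "prod-snap", "target_region": "us-west-2"},
)
print(result)
```

Nothing here talks to AWS; it only shows the data flow between steps.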

An example is one that I've written to regularly create a new copy of our Production RDS database in Ireland as a Staging RDS database in Oregon:

1. CloudWatch Event starts Step Function on the 15th

2. Copy last Production snapshot from Ireland to Oregon

3. Restore this snapshot as a new RDS instance (it fails until the snapshot is available and is retried with exponential backoff - this is a Step Functions feature)

4. In parallel:

  - Add tags to the instance (Once it's available)

  - Delete the snapshot copy (When finished restoring)

  - Modify the new instance with security groups and subnets

     - In parallel:

        - Run a SQL query to anonymize all of the PII columns for GDPR compliance as data has now left the EU.

        - Call out to the Cloudflare API to update our DNS entry with the new RDS endpoint.

           - Delete the old Staging database instance
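
The steps above could be sketched as an Amazon States Language definition, built here as a Python dict so it can be dumped to JSON. This is only a rough sketch, not the poster's actual state machine: the state names, the function names in the placeholder Lambda ARNs, and the retry numbers are all assumptions.

```python
import json


def task(name, next_state=None, retry=False):
    """Build a Task state; the Lambda ARN is a placeholder, not a real resource."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-west-2:123456789012:function:" + name,
    }
    if retry:
        # Assumed values: wait 30s, then double the wait on each further attempt.
        state["Retry"] = [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,
            "MaxAttempts": 8,
            "BackoffRate": 2.0,
        }]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def branch(start_at, states):
    """One branch of a Parallel state is itself a small state machine."""
    return {"StartAt": start_at, "States": states}


# Inner "In parallel" block that runs after the instance is modified.
after_modify = {
    "Type": "Parallel",
    "End": True,
    "Branches": [
        branch("AnonymizePII", {"AnonymizePII": task("anonymize-pii")}),
        branch("UpdateDNS", {
            "UpdateDNS": task("update-dns", next_state="DeleteOldStaging"),
            "DeleteOldStaging": task("delete-old-staging"),
        }),
    ],
}

definition = {
    "StartAt": "CopySnapshot",
    "States": {
        "CopySnapshot": task("copy-snapshot", next_state="RestoreSnapshot"),
        "RestoreSnapshot": task("restore-snapshot", next_state="Finalize", retry=True),
        "Finalize": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                branch("AddTags", {"AddTags": task("add-tags", retry=True)}),
                branch("DeleteSnapshotCopy", {
                    "DeleteSnapshotCopy": task("delete-snapshot-copy", retry=True),
                }),
                branch("ModifyInstance", {
                    "ModifyInstance": task("modify-instance", next_state="AfterModify", retry=True),
                    "AfterModify": after_modify,
                }),
            ],
        },
    },
}

print(json.dumps(definition, indent=2))
```

The scheduled start from step 1 lives outside the definition, as a rule that targets the state machine.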

Do you run into problems with Lambda's 5-minute maximum execution time for those kinds of operations? I'd like to do something similar to this for both RDS and DynamoDB, but the execution time will often surpass 5 minutes, meaning I'd have to run a Step Functions worker on EC2 or ECS. That opens up a whole bunch of complexity with managing the worker code and its deployment, which I'd rather avoid if possible.

With the current implementation, there are no problems hitting the limit. As mentioned in my comment below, our anonymization query would be the heaviest step, but it's designed to be quick since we don't care about unique values for most data.

If we did, though, Fargate would be a great solution, but you wouldn't be able to feed data back into the next step without some additional complexity. Maybe have the next step poll an SQS queue, an S3 file, or a database entry for the next bit of data it needs, and just fail until it finds it. Once the Fargate task (or whatever) has done its job and placed its output in your method of choice, the step could continue.
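
A minimal sketch of that "fail until the data shows up" step, assuming a hypothetical `lookup` function standing in for the SQS/S3/database check. The exception name, event keys, and the example S3 path are made up for illustration; the retrying itself would come from a Retry rule on the state.

```python
class ResultNotReady(Exception):
    """Raised so the state fails and a Retry rule can re-run it later."""


def make_handler(lookup):
    """Build a handler around `lookup`, a hypothetical function that returns
    the long-running job's output, or None if it is not there yet."""
    def handler(event, context=None):
        result = lookup(event["job_id"])
        if result is None:
            raise ResultNotReady("no result yet for job " + event["job_id"])
        return {"job_id": event["job_id"], "result": result}
    return handler


# A dict stands in for the queue, bucket, or table the long-running task writes to.
store = {}
handler = make_handler(store.get)

try:
    handler({"job_id": "job-1"})
    first_attempt = "succeeded"
except ResultNotReady:
    first_attempt = "failed"

# The long-running task finishes and records where its output went.
store["job-1"] = "s3://example-bucket/job-1/output.json"
second_attempt = handler({"job_id": "job-1"})

print(first_attempt, second_attempt)
```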

> Run a SQL query to anonymize all of PII columns for GDPR compliance as data has now left the EU.

Do you ensure the replacement values 'make sense' in the context of the application? I.e., are names turned into fake names?

If so, I would love to hear more about how you handle the complexities of this. If not, it's still a wonderful pipeline that I'm putting in my ideas box, thanks for sharing.

Nope! I probably could with the Faker library, but we don't care about that, and doing so would require a much heavier query. My query looks like this, so it runs extremely fast and isn't an issue on Lambda:

  `UPDATE Users set FirstName = 'FAKEFIRSTNAME', LastName = 'FAKELASTNAME', StreetAddress = '123 FAKE ST.', Zip = '10001', PrimaryEmail = Cast(NewId() as varchar(36)) + '@x.com', Phone = '555-555-5555'`
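
For anyone who wants to try the same idea locally, here is a rough sketch using Python's built-in SQLite as a stand-in for the real database. The table layout and sample rows are assumptions; `NewId` is registered by hand because SQLite has no such function, and `||` replaces `+` for string concatenation.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")

# SQLite has no NewId(), so register a stand-in that returns a random UUID string.
conn.create_function("NewId", 0, lambda: str(uuid.uuid4()))

# Assumed schema and made-up sample rows, for illustration only.
conn.execute(
    "CREATE TABLE Users (FirstName TEXT, LastName TEXT, StreetAddress TEXT, "
    "Zip TEXT, PrimaryEmail TEXT, Phone TEXT)"
)
conn.executemany(
    "INSERT INTO Users VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Ada", "Example", "1 Real Rd.", "90210", "ada@example.com", "555-000-0001"),
        ("Bob", "Sample", "2 Real Rd.", "90211", "bob@example.com", "555-000-0002"),
    ],
)

# Same shape as the query above: constants everywhere except the email,
# which gets a fresh UUID per row so it stays unique.
conn.execute(
    "UPDATE Users SET FirstName = 'FAKEFIRSTNAME', LastName = 'FAKELASTNAME', "
    "StreetAddress = '123 FAKE ST.', Zip = '10001', "
    "PrimaryEmail = NewId() || '@x.com', Phone = '555-555-5555'"
)

rows = conn.execute("SELECT FirstName, PrimaryEmail FROM Users").fetchall()
for first_name, email in rows:
    print(first_name, email)
```

It is a single pass over the table with no per-row lookups, which is why this kind of query stays fast.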