It's a thousand-paper-cuts scenario: nothing big and specific, just running into little problems constantly. Honestly, that's worse than running into big problems. It turns what should be a simple change into three hours of hammering an API with random combinations of inputs to find the magical incantation that makes it work correctly.
Some examples:
Cloud Storage doesn't support multi-part uploads. The best it seems to have is the ability to compose objects. Honestly that's a better system than dodgy multi-part or resumable uploads, but there are hard restrictions on composing objects. You can't compose more than 32 objects, and you can't compose more than 2 layers deep. So with two iterations you can compose at most 32 x 32 = 1024 objects. That's not great for uploading large objects through our servers in small chunks. If our chunks are, say, 10MB, then the largest final file size we can achieve is only about 10 gigs.
On AWS when connecting EC2 to RDS we just threw together the VPC and then configured the EC2 servers with the RDS's hostname. Easy. On GCP we basically _had_ to use Cloud SQL Proxy. Now, again, it seems that Cloud SQL Proxy is a better system overall, but it required fiddling with our server setup, upgrading our MySQL library (which caused other issues), and other random dickery. Another annoyance.
We use Go for our backend servers, and GCP's Go API libraries are all autogenerated, and might as well not be documented. We frequently receive the opaque error "required: required" when trying to blindly figure out the API. It's become an office joke. "Why won't Ubuntu recognize this Wifi card?" "Because required: required man, obviously."
Google App Engine's dev_appserver.py completely broke after an update, caused in part by another Google library being installed (protobuf...). Still not sure if the fix was rolled into a release yet...
The web interface frequently breaks and requires manually refreshing, and it's generally slow and unresponsive on the best of days. It also loves to switch me to my personal account and throw errors because I don't have access to the project I was trying to access...
The "scopes" for launching a compute instance aren't documented to the extent that we know which ones provide what privileges. Really the whole privilege system on GCP is a mess and pales in comparison to AWS. I recall some obvious permissions were just outright missing a few weeks ago.
We have some Go code that uses the API to launch a compute instance. When specifying the scopes on the command line for launching an instance they seemed to require being accompanied by the service account "email" address. So in the Go code we specify the service account email and the scopes. One day during development I forgot to set the service account, didn't notice, and everything worked as normal...
I was not able to find an obvious place where preemptible instances report being killed. Not in the activity logs or the serial console log (which is not saved/available when the instance shuts down). *shrugs* I didn't feel like looking deeper into it.
Startup scripts specified when launching a compute instance run every time the instance starts. Makes sense in retrospect given the name, but it's in contrast to AWS where the script runs once, and in contrast to the example startup script given in the documentation (which installs things ... not something a script that runs every time the machine boots up should do). And it's not very helpful. A script that runs once ever is more practical than a script that runs every boot.
Figuring out exactly how to cook up my own compute images in a format that GCP likes required finding a random video on YouTube from a Google developer.
Some of the documentation (this was either for Datastore or some part of App Engine) is actually just a bunch of marketing copy with no technical meat to it, leaving me to just assume how various features work (because they aren't actually documented anywhere else).
New strange behavior from MySQL running on Cloud SQL that we still haven't nailed down (random lock contentions) that we never encountered on RDS.
Random networking failures on fresh compute instances.
Random upload failures to Cloud Storage.
Transferring objects from one bucket in Cloud Storage to another bucket using the transfer interface resulted in the ACLs being lost for all the objects.
Random things get deprecated every other week. Image aliases last week, something about the Cloud Storage metadata was weird the week before that, etc.
The CLI randomly failing to query for the list of compute instances for tab completion, instead just tab completing an instance that was deleted 10 minutes ago.
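To make the compose arithmetic from the Cloud Storage item above concrete, here is a minimal sketch. It takes the limits exactly as described in that item (32 sources per compose, 2 layers deep) and does not check them against current documentation; the constant and function names are made up for illustration.

```python
# Limits as stated in the comment above; treat them as assumptions.
MAX_SOURCES_PER_COMPOSE = 32  # objects allowed in one compose call
MAX_COMPOSE_DEPTH = 2         # layers of composition allowed


def max_composed_size_mb(chunk_mb: int) -> int:
    """Largest final object, in MB, when every uploaded chunk is chunk_mb."""
    # Two layers of 32-way composition give 32 * 32 = 1024 leaf chunks.
    max_chunks = MAX_SOURCES_PER_COMPOSE ** MAX_COMPOSE_DEPTH
    return max_chunks * chunk_mb


if __name__ == "__main__":
    size_mb = max_composed_size_mb(10)
    print(f"{size_mb} MB (~{size_mb / 1024:.0f} GB)")  # 10240 MB (~10 GB)
```

Under those limits, the only way to raise the ceiling is to raise the chunk size, which is exactly what uploading in small chunks through our servers is trying to avoid.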
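For the startup-script item above, one common way to get run-once behaviour out of a script that runs on every boot is a marker file. This is only a sketch of that general pattern, not anything GCP-specific: `run_once`, the marker file name, and the stand-in setup step are all hypothetical, and the demo uses a temporary directory rather than a real path on the instance.

```python
import os
import tempfile


def run_once(marker_path, first_boot_setup):
    """Run first_boot_setup() only if marker_path does not exist yet.

    Returns True if setup ran, False if it was skipped.
    """
    if os.path.exists(marker_path):
        return False
    first_boot_setup()
    # Write the marker only after setup returns, so a setup that raised
    # is retried on the next boot.
    with open(marker_path, "w") as f:
        f.write("done\n")
    return True


if __name__ == "__main__":
    # Simulate two boots; on a real instance the marker would need to
    # live somewhere that persists across reboots.
    with tempfile.TemporaryDirectory() as d:
        marker = os.path.join(d, "first-boot.done")
        for boot in (1, 2):
            ran = run_once(marker, lambda: print("installing packages..."))
            print(f"boot {boot}: setup ran = {ran}")
```

It works, but it's boilerplate that a runs-once script wouldn't need in the first place.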
The HN comment box wasn't big enough for me to write about all the ways AWS drives me insane. So, here is my blog post on 1000 cuts by AWS: https://medium.com/google-cloud/the-future-of-cloud-computin.... I feel Google Cloud is much better engineered, with a focus on developer happiness and productivity.
Talking specifically about my field (cloud, big data, and data science), it's so painful to build a decent data stack that can handle a few terabytes of data, let alone petabytes. Google Cloud (Pub/Sub, Dataflow & BigQuery) makes it a breeze to handle petabytes of data. You can literally debug a petabyte-scale pipeline while it's running. Unified logs, metrics, monitoring, and alerting are another feature that shows how well the Google Cloud platform is built with developers in mind.
Totally hear you on the death by a thousand tiny cuts :(
GCS provides multi-part and resumable uploads (https://cloud.google.com/storage/docs/json_api/v1/how-tos/up...), though I agree that the docs make it hard to find given how deeply they are nested. We use resumable uploads in Firebase Storage (mobile GCS: firebase.google.com/docs/storage) to great effect, and routinely upload some pretty huge files with no problems.
Definitely hear you on autogenerated libs sucking: the gcloud-* libs are designed to address some of those issues. gcloud-golang is still under development (https://github.com/googlecloudplatform/gcloud-golang), but might be a good place to start.
GCP is working to address a number of permissions issues with Cloud IAM (https://cloud.google.com/iam), which will provide more fine grained control over resources. I believe Cloud PubSub already uses this model.
Firebase (which shares certain services with GCP) has free developer support (firebase.google.com/support), and as you can imagine, we're inundated with questions and have two teams working 24/7 to address them. Free developer support is a great thing for developers, but providing high quality support at Google scale is probably the hardest thing to do--people just don't scale the same way machines do.
That's why so many of us are active on social media/HN/etc., we want to talk directly to developers and get feedback so we can improve our products, but we typically aim for high quality feedback (like this, thank you :), where we can engage with savvy developers to solve their problems, or at least get actionable feedback to guide our roadmap (x is a bad experience, have you considered y and z which would save me n hours). Ideally, this feedback trickles down into all areas of the product, and even across products (when it comes to permissions, console changes, docs, etc.), though it can take some time to implement those changes.
(Disclosure: PM on Firebase, and work closely with Cloud)
> Google App Engine's dev_appserver.py completely broke after an update, caused in part by another Google library being installed (protobuf...). Still not sure if the fix was rolled into a release yet...
We've been having a lot of fun with how tricky namespace packages are in Python. We've got a fix in for this issue that should hopefully be in the next SDK release, and we're looking into ways to better isolate dev_appserver from the OS environment.
A simple workaround is to activate an empty virtualenv before running dev_appserver.