I built Degap on several AWS (Amazon Web Services) services. These include DynamoDB for storage, Cognito for authentication, Lambda for compute, and others. To coordinate the deployment and management of AWS resources for Degap, I use CloudFormation. This lets me express my infrastructure as code, and this code can be edited, refactored, and version controlled alongside the rest of the code for Degap. Each time I deploy, CloudFormation reconciles the resources I've already deployed with the new desired state of my stack.
I really like this system, but I recently made a mistake because I hadn't thought its implications all the way through. Here's the story of what I did and how I fixed it.
How CloudFormation works: CloudFormation attempts to bring your AWS resources to the state that you express in your YAML code. This can fail for many reasons. It could be something as straightforward as a syntax error in your code. Or you may be requesting a change that CloudFormation doesn't allow, like changing the type of a resource. Once CloudFormation decides it can't fulfill your request, it starts rolling back its changes.
How CloudFormation works with Lambda functions: To deploy a Lambda function implemented in the Go programming language, CloudFormation needs to have access to your compiled binary. This binary needs to be packaged in a zip archive and uploaded to S3. To accommodate this, I have a deployment script that compiles and archives all the Lambda functions that are part of Degap and uploads them to a deployment S3 bucket before doing any CloudFormation deployments. Over time, this deployment bucket grows without bound.
My builds are reproducible, so I already key my deployment archives by a hash of the binary content. Even so, my deployment bucket eventually grew to multiple GB in size. Here's where I made the mistake: I turned on automatic expiration in S3 to delete older Go binary archives.
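For reference, content-keyed naming of the kind mentioned above can be sketched in a few lines of Go. This is a minimal sketch, not the actual Degap deployment script: the function name `archiveKey` is hypothetical, and the choice of SHA-256 is an assumption (it matches the 64-character hex keys that appear later in this post, but the post doesn't name the hash).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// archiveKey returns a content-addressed S3 key for a compiled binary.
// Assumptions for illustration: SHA-256 as the hash and a ".zip" suffix.
func archiveKey(binary []byte) string {
	sum := sha256.Sum256(binary)
	return hex.EncodeToString(sum[:]) + ".zip"
}

func main() {
	// Placeholder bytes standing in for a compiled Lambda binary.
	binary := []byte("example binary contents")
	fmt.Println(archiveKey(binary))
}
```

Because the key depends only on the content, a reproducible build of unchanged code maps to the same key, so re-uploading it never adds a new object. The bucket still grows by one archive every time a function's binary actually changes.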
I ran a CloudFormation deployment that failed for some reason. I don't particularly remember why it failed, but CloudFormation immediately tried to roll back. To do so, it tried to fetch the original S3 binary archives it used for the last successful deployment. These were gone: the expiration rule had already deleted them.
Uh oh. In the beginnings of panic, I deleted the functions entirely to try to coax CloudFormation to roll forward. However, CloudFormation refused to update my stack while the stack was in a UPDATE_ROLLBACK_FAILED state.
As the panic rose, I looked up how to recover from a failed rollback. The consistently excellent re:Post had a direct answer: "How do I update my CloudFormation stack when it's stuck in the UPDATE_ROLLBACK_FAILED state?" Unfortunately, while its advice was correct, it didn't fit my situation: I had substacks in UPDATE_ROLLBACK_FAILED that the failed deployment hadn't actually updated, and they couldn't be selected under "Resources to skip".
In the end, what I actually did was to stop trying to get ahead of the CloudFormation machinery and instead help it do what it was trying to do: roll back.
I rebuilt and re-archived old versions of each of my Lambda functions by hand. Having good tagging of releases in source control was hugely helpful, but getting this 100% right wasn't essential: this version would only need to be live long enough to finish the rollback and the subsequent rollforward.
I uploaded these archives back to the original S3 keys (filenames) I used in the original deployment. Thankfully, my failed deployments were logged to CloudTrail as UpdateFunctionCode* events, which record the bucket and key of the archive that couldn't be found:
{
  ...
  "errorMessage": "Error occurred while GetObject. S3 Error Code: NoSuchKey. S3 Error Message: The specified key does not exist.",
  "requestParameters": {
    "functionName": "DegapStack-WebPushStack-XXXXXXXX-XXXXXXXXFunction-XXXXXXXX",
    "s3Bucket": "degap-deployment-bucket-XXXXXXXX-XXXXXXXX",
    "s3Key": "fcf0cfca3f9013385e323007ff77f53ef442bfd8b77189b3c312613577841fc0.zip",
    ...
  },
  ...
}
In this case, I just needed to upload my newly built archive to s3://degap-deployment-bucket-XXXXXXXX-XXXXXXXX/fcf0cfca3f9013385e323007ff77f53ef442bfd8b77189b3c312613577841fc0.zip
Finally, I continued the rollback from the AWS Console.
After this, the rollback succeeded. Further deployments started proceeding to completion, and I was back to having a fully deployed stack.
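With more than a handful of functions, reading each bucket and key out of CloudTrail by eye gets tedious. The following is a minimal sketch of pulling the location out of an event like the one above. The type `cloudTrailEvent` and the function `missingArchiveLocation` are hypothetical names, the struct models only the fields visible in the excerpt, and the sample event in `main` is a trimmed stand-in with placeholder values.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// cloudTrailEvent models only the fields shown in the excerpt above;
// real events carry many more fields, which are ignored here.
type cloudTrailEvent struct {
	ErrorMessage      string `json:"errorMessage"`
	RequestParameters struct {
		FunctionName string `json:"functionName"`
		S3Bucket     string `json:"s3Bucket"`
		S3Key        string `json:"s3Key"`
	} `json:"requestParameters"`
}

// missingArchiveLocation returns the s3:// location that a failed
// UpdateFunctionCode* event asked for.
func missingArchiveLocation(raw []byte) (string, error) {
	var ev cloudTrailEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	p := ev.RequestParameters
	if p.S3Bucket == "" || p.S3Key == "" {
		return "", errors.New("event has no s3Bucket/s3Key")
	}
	return "s3://" + p.S3Bucket + "/" + p.S3Key, nil
}

func main() {
	// Trimmed sample event with placeholder values.
	raw := []byte(`{
		"errorMessage": "Error occurred while GetObject. S3 Error Code: NoSuchKey.",
		"requestParameters": {
			"functionName": "ExampleFunction",
			"s3Bucket": "example-deployment-bucket",
			"s3Key": "0123abcd.zip"
		}
	}`)
	loc, err := missingArchiveLocation(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(loc) // prints s3://example-deployment-bucket/0123abcd.zip
}
```

Each location it returns is a destination that needs a rebuilt archive before the rollback can continue.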
However, some things were still broken. I couldn't sign in, and requests to my API server failed.
With the panic starting to subside, I retraced my steps. I had deleted some Lambda functions entirely in an effort to get the deployment working. When I saw that that failed, I'd manually recreated new Lambda functions with the same names. CloudFormation seems to have adopted those Lambda functions just fine, but my permissions, roles, and policies seemed not to be applying anymore.
Now that deploys were working, though, this was a simple thing to fix. Lambda functions are stateless, so I had no problem simply having CloudFormation deploy new ones.
I renamed each of my Lambda functions in my CloudFormation template by adding a suffix. In my case, Registrar became RegistrarTwo. A subsequent deploy destroyed all the existing functions, created new ones, and hooked up all the permissions. Another rename to remove all the suffixes and another deploy later, and my stack was back to responding to API requests and successfully handling Cognito hook events.
The moral of this story: be careful with your deployment bucket.
Don't let binary archives expire out of your deployment bucket until you're sure you'll never need to roll back to a version that depends on them. Definitely don't let binary archives expire while they're still deployed to production.
Use a lot of care when manually cleaning up old binary archives. Emptying out the bucket entirely is usually not the right move.
Remember that rolling back is CloudFormation's default defense against something going wrong. Anything you do that can get in the way of this leads you down a bad path.
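One way to make these rules concrete is to decide what may be deleted from two inputs: the set of keys still referenced by deployed functions, and a minimum age. The sketch below is hypothetical and not something Degap runs. The names `archive` and `expirable` are invented for illustration, and it assumes you can obtain the list of currently deployed keys and each object's age from elsewhere.

```go
package main

import "fmt"

// archive describes one object in the deployment bucket.
type archive struct {
	Key     string
	AgeDays int // days since the object was last modified
}

// expirable returns the keys that are safe to delete under two rules:
// the archive is not referenced by any deployed function, and it is
// at least minAgeDays old. Archives still in use are never returned.
func expirable(archives []archive, deployed map[string]bool, minAgeDays int) []string {
	var out []string
	for _, a := range archives {
		if deployed[a.Key] {
			continue // still deployed: a rollback may need it
		}
		if a.AgeDays < minAgeDays {
			continue // too recent to be sure it won't be needed
		}
		out = append(out, a.Key)
	}
	return out
}

func main() {
	archives := []archive{
		{"old-unused.zip", 90},
		{"old-deployed.zip", 90},
		{"new-unused.zip", 1},
	}
	deployed := map[string]bool{"old-deployed.zip": true}
	fmt.Println(expirable(archives, deployed, 30)) // prints [old-unused.zip]
}
```

A purely age-based expiration rule, like the one that caused this incident, is the same function without the `deployed` check, and that check is what would have protected the rollback.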
It's back up! Check it out. https://degap.app/