Almost every infrastructure decision I endorse or regret after 4 years
A startup's infrastructure lead reviews four years of technology decisions, candidly labeling each as an endorsement or a regret. The post offers a practical, experience-based guide to choices across AWS, Kubernetes, SaaS, and internal processes. It is a useful read for engineers who want real-world lessons on scaling infrastructure without repeating common mistakes.
The Lowdown
Jack Lindamood, an infrastructure lead at a rapidly scaling startup, shares a comprehensive retrospective of nearly every significant infrastructure decision made over the past four years. He categorizes each choice as either an 'endorse' (recommend) or 'regret' (advise against), providing valuable lessons drawn from hands-on experience.
- AWS Decisions: Endorses choosing AWS over GCP for better support, EKS for managed Kubernetes, RDS for managed databases, Redis ElastiCache, ECR for container registries, AWS VPN, and Control Tower Account Factory for Terraform. Regrets include EKS managed add-ons because of their customization limitations, AWS premium support because its high cost is hard to justify when the team already has AWS knowledge in-house, and Bottlerocket for EKS nodes because of networking and debugging challenges.
- Process & Tools: Endorses automating post-mortem processes with a Slack bot, using Notion and PagerDuty templates for incident management, implementing a two-tiered alerting system with regular reviews, holding monthly cost tracking meetings, and managing infrastructure with GitOps. Regrets include not using Function as a Service (FaaS) more for CPU workloads, letting multiple applications share a single database (which leads to technical debt), and using built-in post-mortem tools, which lack customization.
- SaaS Choices: Endorses Notion for documentation, Slack for communication, Linear over Jira for issue tracking, and PagerDuty for alerting. Regrets include not adopting a dedicated identity platform such as Okta early on, and choosing Datadog, whose pricing model becomes expensive, particularly for Kubernetes clusters and AI workloads.
- Software & Configuration: Endorses schema migration by diff, Ubuntu for dev servers, AppSmith for internal tools, Helm for Kubernetes package management, Kubernetes itself, Karpenter for node management, ExternalSecrets for Kubernetes secret management, ExternalDNS, cert-manager, and Terraform over CloudFormation. Regrets include the complexity of Bazel for Go services, not adopting OpenTelemetry early, and SealedSecrets for making secret management harder for developers.
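To make "schema migration by diff" concrete, here is a minimal Python sketch of the general idea: describe the desired schema, compare it with the current one, and generate the statements that close the gap. It is an illustration only, not the tooling used in the original post. The `diff_schema` function, the dictionary-based schema format, and the table names are all hypothetical, and real diff tools handle many more cases (indexes, constraints, renames).

```python
from typing import Dict, List

# Hypothetical schema representation: table name -> {column name: SQL type}
Schema = Dict[str, Dict[str, str]]


def diff_schema(current: Schema, desired: Schema) -> List[str]:
    """Return SQL statements that move `current` toward `desired`.

    Additive changes become executable statements. Destructive or risky
    changes (drops, type changes) are emitted as SQL comments for review.
    """
    statements: List[str] = []
    for table in sorted(desired):
        cols = desired[table]
        if table not in current:
            col_defs = ", ".join(f"{name} {sqltype}" for name, sqltype in cols.items())
            statements.append(f"CREATE TABLE {table} ({col_defs});")
            continue
        for name, sqltype in cols.items():
            if name not in current[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype};")
            elif current[table][name] != sqltype:
                statements.append(
                    f"-- REVIEW: {table}.{name} type {current[table][name]} -> {sqltype}"
                )
        for name in current[table]:
            if name not in cols:
                statements.append(f"-- REVIEW: drop column {table}.{name}")
    for table in sorted(current):
        if table not in desired:
            statements.append(f"-- REVIEW: drop table {table}")
    return statements


if __name__ == "__main__":
    current = {"users": {"id": "BIGINT", "email": "TEXT"}}
    desired = {
        "users": {"id": "BIGINT", "email": "TEXT", "created_at": "TIMESTAMP"},
        "teams": {"id": "BIGINT"},
    }
    for stmt in diff_schema(current, desired):
        print(stmt)
```

The appeal of this style is that developers edit only the desired schema and let the tool derive the migration, rather than hand-writing an ordered chain of migration scripts.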
Lindamood's detailed account provides actionable insights for any startup or engineering team building and scaling their infrastructure, highlighting the practical trade-offs and unexpected consequences of seemingly minor decisions.