Description
You’re the person who keeps 50+ SaaS products alive when everyone else is guessing. We’re after DevOps engineers who can navigate unfamiliar AWS accounts, stabilize chaos, and push uptime past 99.9% with real monitoring, real automation, and real RCAs. You’ll decompose hairy projects into one-day tasks, ship production-grade Python or JavaScript, and treat AI as your junior.
Most teams brag about “cloud” while babysitting pets. We’re industrializing reliability across dozens of acquired products where the original authors have moved on and the documentation has gaps. That’s the fun: you’ll use agents and modern tooling to investigate new systems 5–10x faster, codify what you learn, and automate it so the next outage never happens. Instead of evaluating you on certs and tool logos, we’ll watch you troubleshoot live, write a real 5-Whys that lands on one preventable root cause, and build automations that survive production reality.
This is not an L2 “run the playbook” job. In this role, you write the playbooks, design the rollout from dev to staging to 10% to 100% of production with soak and rollback triggers, and build the monitors that catch the edge cases. You reject unsafe changes before someone flips a switch. You separate infrastructure faults you own from application defects Engineering owns, and you assign permanent fixes to the right team.
You’ll sit at the engineering core of reliability, owning infrastructure projects, incident response and RCAs, and change requests with copy-paste-executable runbooks. If you’ve already owned a serious SaaS product and want to scale that discipline to a fleet, step in. Bring expert-level AWS, production-grade coding chops, ruthless scope control, and daily, critical use of AI tools. If you’re ready to keep the lights on, please apply.
What you will be doing
- Delivering complex infrastructure migrations, consolidations, production-grade automations, and monitoring changes
- Triaging production outages, implementing immediate fixes, and writing root cause analyses with permanent fixes assigned to the responsible teams
- Creating, reviewing, and executing changes in production, including validating whether a proposed change is safe to execute
What you will NOT be doing
- Living in Jira and endless status meetings – we value people who can drive solutions, not just track problems
- Maintaining outdated systems indefinitely – you’ll be empowered to drive meaningful improvements
- Getting blocked by bureaucratic approval chains – you’ll have the authority to execute immediate fixes to resolve incidents
Key responsibilities
- Drive reliability and standardization of cloud infrastructure across our growing product portfolio by implementing robust monitoring, automation, and AWS best practices
Candidate requirements
- Deep AWS infrastructure expertise (this is our primary platform – other cloud experience alone won’t cut it)
- Experience managing production infrastructure at a scale of hundreds of containers
- Experience scripting with Python and Bash for day-to-day administration
- Experience managing and migrating production databases across multiple engines (including MySQL, PostgreSQL, Oracle, and Microsoft SQL Server)
- Experience with infrastructure automation (Terraform, Ansible, or CloudFormation)
