Senior Site Reliability Engineer - Cloud

Couchbase,
Manchester, Greater Manchester

Overview

Job Description

This role has primary accountability for designing, implementing, and operating Couchbase's Cloud platforms. The team operates with a "run what you write" philosophy, and each engineer is responsible for deploying and operating the code they write. A successful candidate must have demonstrable experience in at least one programming language and previous work in SaaS application development and operations. The ideal candidate will also have prior experience developing applications on any of the three major cloud platforms: AWS, Azure, or GCP. This role is also open to remote work within the UK, as our teams are distributed all over the world!

Responsibilities

* Design, creation, and provisioning of infrastructure.
* Deploy and maintain applications.
* Design, build, manage, and operate the infrastructure and configuration of SaaS applications with a focus on automation and infrastructure as code.
* Design, build, manage, and operate the infrastructure-as-a-service layer (hosted and cloud-based platforms) that supports the different platform services.
* Develop comprehensive monitoring solutions to provide full visibility into the different platform components using tools and services such as Kubernetes, Prometheus, Grafana, ELK, Datadog, and New Relic.
* Experience working within an Agile/Scrum SDLC.
* Integrate different components and develop new services with a focus on open source to allow minimal-friction developer interaction with the platform and application services.
* Identify and troubleshoot any availability and performance issues at multiple layers of deployment: hardware, operating environment, network, and application.
* Evaluate performance trends and expected changes in demand and capacity, and establish the appropriate scalability plans.
* Troubleshoot and solve customer issues on production deployments.
* Ensure that SLAs are met in executing operational tasks.
* Collaborate with other engineers to implement operational solutions while defining and adhering to industry best practices.
* Experience in building and managing virtualized systems (KVM, OVM, Containers/Docker) and ability to read and understand source code.
* Systematic problem-solving approach, combined with a strong sense of ownership and drive.
* Conduct periodic on-call duties.
* Working knowledge of information security issues.
* Firm grasp of at least one modern programming language, beyond advanced scripting (Shell, Perl, Python).
* Solid experience using configuration management frameworks (e.g. Chef, Puppet).
* Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
* Experience writing automation tools and eagerness to "automate all the things".

Qualifications

* 5+ years of related professional experience.
* Strong experience with Infrastructure as Code and Configuration Management tools.
* Demonstrated knowledge of the ELK stack.
* Experience with Prometheus/Grafana for metrics aggregation/visualization.
* Configuration of CI/CD pipelines using Jenkins.
* Experience using Kubernetes.
* Experience with automation tools/platforms.
* Experience with alerting and monitoring tools.
* Experience with Terraform is a plus.
* Experience working in a highly distributed company is a plus.
* Experience writing backend applications is not required but definitely a plus.
* Experience working within an Agile/Scrum SDLC.
* Align a portion of your day with the business hours of the Pacific Time Zone (UTC-8).