DevOps Engineer, Distributed Systems

Cloudflare,
London, Greater London

Job Description

About Us

At Cloudflare, we have our eyes set on an ambitious goal: to help build a better Internet. Today the company runs one of the world's largest networks, which powers trillions of requests per month. Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare have all web traffic routed through its intelligent global network, which gets smarter with every request. As a result, they see significant improvement in performance and a decrease in spam and other attacks. Cloudflare was recognized by the World Economic Forum as a Technology Pioneer and named to Entrepreneur Magazine's Top Company Cultures list.

We realize people do not fit into neat boxes. We are looking for curious and empathetic individuals who are committed to developing themselves and learning new skills, and we are ready to help you do that. We cannot complete our mission without building a diverse and inclusive team. We hire the best people based on an evaluation of their potential and support them throughout their time at Cloudflare. Come join us!

About the team

We are a team of software engineers writing and running large, critical distributed systems. When we deploy, we deploy on thousands of servers across almost 200 data centers all around the globe. We have recently moved to Kubernetes and we need help pushing forward the automation and orchestration of our systems.

We believe solid automation and testing increase productivity and make engineers happy. We already have automation, CI pipelines and testing in place, but we are looking for somebody to help us drive more projects in this area. You will help us improve our automation and tooling and replace some of it with better technology. There is also the opportunity to help us write Go for our distributed systems.

What you'll do

Quicksilver is Cloudflare's replicated key-value store.
It replicates our data all around the world, across thousands of servers and around 200 data centers. It handles all kinds of network conditions and is queried multiple times whenever a request crosses our network.

In this role, you can expect to work on:

* Automation and orchestration systems that deploy across thousands of physical servers.
* Systems that decide what the replication topology should be across our services.
* CI pipelines that allow us to deploy with confidence changes that immediately affect mission-critical systems.
* Monitoring systems that give us early detection of symptoms caused by performance and integrity degradation in our replicated database.

We use a variety of tools, some integrated with the company platform and others team-specific, which are required to connect and run all these systems as smoothly as possible. You will contribute directly to our orchestration, deployment and rollback plans, automation, operation, monitoring, alerting and troubleshooting. If working on non-trivial engineering problems to achieve reliability at scale excites you, you might be a match for this role.

Examples of desirable skills, knowledge and experience

* We look for people with experience in Python and Go who feel comfortable working on Linux systems.
* You know what reproducible builds, immutable infrastructure and good observability mean. We aim for that.
* Our tooling is based on: Kubernetes, Helm, Docker, Prometheus, Salt, Debian, Git, Python and Go.
* You have exposure to containerization/clustering technology and you will help us improve our architecture.
* You believe that single points of failure are a thing of the past.
* We are open to all levels of experience, from junior to senior. We look for curiosity and creativity balanced with pragmatism.

What Makes Us Special

We're not just a highly ambitious, large-scale technology company. We're a highly ambitious, large-scale technology company with a soul.
Fundamental to our mission to help build a better Internet is protecting the free and open Internet.

Project Galileo: We equip politically and artistically important organizations and journalists with powerful tools to defend themselves against attacks that would otherwise censor their work. This is technology already used by Cloudflare's enterprise customers, provided at no cost.

Athenian Project: We created Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration.

Path Forward Partnership: Since 2016, we have partnered with Path Forward, a nonprofit organization, to create 16-week positions for mid-career professionals who want to get back to the workplace after taking time off to care for a child, parent, or loved one.

1.1.1.1: We released 1.1.1.1 to help fix the foundation of the Internet by building a faster, more secure and privacy-centric public DNS resolver. This is available publicl