Trainline
London, Greater London
Site Reliability Engineer (SRE) - Observability
Overview
Job Description
Our Mission
Trainline is the leading independent rail and coach travel platform, selling rail and coach tickets to millions of travellers worldwide. Via our highly rated website and mobile app, people can seamlessly search, book and manage their journeys all in one place. We bring together millions of routes, fares and journey times from 260 rail and coach carriers across 45 countries. We offer our customers the best price for their journey and smart, real-time travel information on the go. Our aim is to make rail and coach travel easier and more accessible, encouraging people to make more environmentally sustainable travel choices.
Introducing the Platform Delivery organisation
The Platform Delivery team cover all areas of infrastructure, reliability, platform and operations engineering across public cloud and data centres: Windows and Linux builds, deployment and management, CDN configuration, load balancing, PKI and a variety of other technologies that combine to provide the Platform for all other teams to use.
In the Platform Improvement team, you will:
- Design, build and implement tools to aid observability, identification and resolution of incidents that occur in live production environments, with a strong emphasis on reducing MTTR as a metric
- Actively troubleshoot escalated production problems
- Contribute to incident retrospectives as someone who is knowledgeable enough to explain what may have occurred at the level of TCP, TLS, HTTP and operating systems
- Work with Dev and Ops teams to promote and expand SRE concepts in both a consultancy and a hands-on fashion, identifying how they best fit their services and benefit them and the wider group of technology teams
- Reduce MTTR by working with other teams to understand their situation and to surface and present the right data
- Apply anomaly detection and failure prediction in live environments
- Improve platform visibility by identifying the metrics to base decisions on, sourcing them if we don't record them, and equally explaining which metrics are not that valuable
- See data presentation as a socio-technological problem: anyone can create a dashboard; what we need is the most pertinent metrics presented in the most speedily understood, human-consumable way to affect the MTTR of an incident
- Contribute to capacity projects to recognise issues and changes in traffic before they become impacting
- Contribute to scaling projects, understanding the benefits and risks of scaling architecture on demand and the challenges of achieving it for the differing profiles of our services
What you'll bring (Essential)
- SRE concepts such as SLIs, SLOs and error budgets
- Observability concepts (RED/USE)
- Strong understanding of HTTP (status codes in detail, nuances of HTTP headers, cookies, connection and request life cycle)
- Strong understanding of TCP: lifecycle, connection and termination scenarios
- Strong understanding of load balancing (HTTP and TCP) and reverse proxy concepts
- Application/service architecture concepts (threads, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff)
- Knowledge of OS-level resources, file descriptors, open files
- AWS: EC2, S3 and config management, VPC, networking
- Elasticsearch: architecture of nodes, logical flow of a write, logical flow of a read, architecture of indices and shards, some API knowledge, key cluster health/usage metrics, writing queries, aggregations, watches, mappings, schema
- Logstash: tuning, pipelines, writing parsing config that includes some enrichment, health/usage metrics
- Kibana: strong knowledge of object management, graph/dashboard tips and tricks for best human consumption and lowest-cost Elasticsearch queries, Timelion, ML
- Filebeat/Rsyslog: understanding of log shipping concepts in terms of read registries, buffers, compression, log rotations, inclusions, exclusions, etc.
- Grafana: graph/dashboard tips and tricks for best human consumption and lowest-cost datasource queries
- Ubuntu/Debian: several years' experience running and troubleshooting the Linux OS and the apps/packages/services running on it; understand bottlenecks, understand the symptoms and where to see them
- InfluxDB and other time series databases: experience in building and maintaining time series databases, downsampling, rollups and cardinality
- Experience with deviation, moving average and other mathematical functions
- General understanding of the challenges of long-term data retention and its impact on query latency, storage, compute and data architecture
- Scripting languages such as Bash, Python, Ruby
Our Culture
Everything begins with great people. As well as aptitude, we put a heavy emphasis on attitude.
Coaches Over Heroes - We prioritise the focus on being one team over elevating the heroics of an individual. For us, the true heroes are those individuals who are excellent at nurturing and coaching, and generous in sharing their knowledge with others.
Well-being - Everything that we do takes into account the morale of every member