Lead Service Reliability Engineer

2 weeks ago


Singapore NodeFlair Full time

Job Summary:

Salary
S$9,000 - S$16,500 / Monthly

Job Type
Full time

Seniority
Lead

Years of Experience
At least 8 years

Tech Stacks
Strategy, Zipkin, GitLab, CircleCI, AWS, Terraform, Docker, Jenkins, Go, Docker Swarm, Shell Script, Jaeger, Swarm, CI, ELK, EKS, Shell, Java, Grafana, Prometheus, Kubernetes, Ansible, Ruby, Python


As a Service Reliability Engineer (SRE), you will take a multifaceted approach to ensure technical excellence and operational efficiency within the infrastructure domain.

Specializing in reliability, resilience and system performance, you take a lead role in championing the principles of Site Reliability Engineering.

By strategically integrating automation, monitoring and incident response, you facilitate the evolution from traditional operations to a more customer-focused and agile approach.

Emphasizing shared responsibility and a commitment to continuous improvement, you cultivate a collaborative culture, enabling organizations to meet and exceed their reliability and business objectives.


Job responsibilities:

  • You will be responsible for understanding requirements or SRE goals in depth from both tech and business perspectives
  • You will provide solutions to improve reliability, including identifying and implementing mechanisms and architectures that enable fault tolerance and faster median time to respond and median time to detect
  • You will be responsible for enhancing the incident management process, including the development of an incident prioritization matrix, triage, communication, mitigation, postmortem analysis and implementation of corrective actions
  • You will manage client stakeholder expectations and queries during production incidents, providing detailed technical analysis of issues and remediation plans for mitigation and future prevention, and act as the interface for C-level executives if and when needed
  • You will liaise with client engineering teams and build trust and productive relationships with senior client stakeholders and team leads to influence them in making better decisions
  • You will be responsible for identifying opportunities for enhancing system performance and reliability in alignment with business SLAs, SLOs, KPIs and objectives, and provide guidance and assistance to SRE teams in implementing the identified improvements
  • You will oversee and mentor other SREs on the team, contributing to their growth and development

Job qualifications:

Technical skills

  • You can program with one or more high-level languages such as Python, Golang, Shell scripting, Ruby or Java
  • You are familiar with DevOps and GitOps practices, driving the integration of observability automation into CI/CD pipelines, e.g. GitLab, Jenkins, CircleCI or equivalent
  • You have in-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Ansible, ARM and CloudFormation for provisioning and managing infrastructure
  • You have expertise in observability, logging, tracing and monitoring tools such as Grafana (Loki and Tempo), Prometheus, Graylog, Jaeger, Zipkin, ELK stack or equivalent
  • You have a strong understanding of container-based architecture and hands-on experience with orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
  • You have a good understanding of essential concepts such as quality gates encompassing SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortem methodologies, synthetic monitoring, distributed tracing, end-user monitoring and performance testing
  • You have experience with network load balancing, security tech stacks, Transport Layer Security (TLS) and certificate management, and an understanding of standard networking protocols and configurations

Professional skills

  • You have strong communication and articulation skills, and are proficient in English
  • You are able to convey resolutions to audiences with varying degrees of technical/business proficiency and bring them to consensus
  • You have excellent problem-solving and analytical skills, with a focus on continuous improvement
  • You have good listening and presentation skills
  • You solve challenging problems and difficult-to-debug issues with a never-give-up attitude
  • You can collaborate with cross-functional engineering teams to conduct capacity planning and scalability assessments, and design solutions for handling current and future growth
  • You have the ability to work under pressure, with composure, during production incidents
  • You understand requirements provided by the client on both technical and business aspects, and can break them down for successful implementation
  • You're willing to be part of a rotation- and need-based team that is available 24x7

Other things to know:

Learning and development:


There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you.

But we also balance autonomy with the strength of our cultivation culture. This m