Site Reliability Specialist

6 months ago


Singapore, Singapore IHiS Full time

Position Overview

The Reliability Lead will support the Reliability Principal in strategy discussions with senior management on application and system improvement, and will also manage the Reliability team.

He/She will ensure that existing site reliability engineering (SRE) initiatives, such as monitoring availability, uplifting capability and automation, are on track. He/She will also assist the Reliability Principal and Engineering Teams in reviewing the reliability program to take stock of successes and challenges and refine the program. He/She will be in charge of the management reports that describe the current situation and recommend the next steps.

As Lead of the Reliability team, which consists of experienced engineers and product specialists, he/she will coach the engineering and service management teams to help them improve application reliability through tools, monitoring and prevention activities. He/She will collaborate with the applications, incident management (IOC) and infrastructure support teams to identify and implement procedures, tools and scripts that will improve reliability and reduce downtime while improving automation.

Role & Responsibilities

• Strive for automation either by coding it or by leading and influencing engineers to build systems that are easy to run in production

• Identify significant projects that result in substantial cost savings

• Identify changes for the production architecture from the reliability, performance and availability perspective with a data-driven approach

• Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage, lowering operating costs for all our customers

• Identify parts of the system that do not scale, provide immediate palliative measures and drive long-term resolution of these incidents

• Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives

• Know a domain really well and radiate that knowledge through recorded demos, discussions in DNA (Design and Automation) meetings, or Incident Reviews

• Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again

• Set an example for the team of SREs with positive and inclusive leadership and discussion of work

• Show ownership of a major part of the infrastructure

• De-escalate any conflicts inside the team

Requirements

• Bachelor’s degree in computer science or other highly technical, scientific discipline

• Ability to program (structured and OO) with one or more high-level languages, such as Python, Java, C#, and JavaScript

• Experience with infrastructure technologies like Operating Systems (Windows and Linux), networking, storage, virtualisation

• Familiar with testing automation tools

• Have a sense of urgency to deliver & iterate fast

• A proactive approach to spotting problems, areas for improvement, and performance bottlenecks

• Previous success in software engineering

• Specialise in 1 or 2 of the following:

  • Strong software engineering, with the ability to code fixes for defects or vulnerabilities in our systems

  • Use infrastructure automation tools such as Chef or Ansible to efficiently manage our infrastructure

  • Implement "Infrastructure as Code" using Terraform and CI/CD for automation

  • Load balancing and high availability architecture of applications, including Proxies and CDN, through the use of F5

  • OpenShift and containerizing our system

  • Administer and manage high-availability, high-performance Microsoft SQL Server or Oracle cluster

  • Monitoring and Metrics in Dynatrace, ELK or eG and integrations with Dynatrace / ITSM

  • Logging infrastructure

  • Key, certificate and secret management

  • Backend storage management and scaling

  • Disaster Recovery and High Availability strategy

NOTE: It only takes a few minutes to apply for a meaningful career in HealthTech - GO FOR IT

M-2022-2160


