Site Reliability Engineering

7 days ago


Singapore INFINITE COMPUTER SOLUTIONS PTE LTD Full time

**Job Description Summary**

The Managed Services Cross Technology Engineer (L2) SRE is a developing engineering role, responsible for providing a managed service to clients to ensure that their IT infrastructure and systems remain operational.

By proactively monitoring, identifying, investigating, and resolving technical incidents and problems, the Managed Services Cross Technology Engineer (L2) SRE is able to restore service to clients.

**The primary objective of this role is to ensure that systems are reliable, scalable, and efficient, with minimal manual intervention.**

From an operations perspective:

- Ensuring High Availability and Uptime
  - Keep production systems running smoothly and within defined Service Level Objectives (SLOs).
  - Minimize downtime and reduce Mean Time to Recovery (MTTR) during incidents.
- Automating Operations
  - Identify and eliminate manual, repetitive tasks (also called “toil”).
  - Build automation for deployment, monitoring, incident response, and infrastructure management.
- Managing Incidents and On-Call
  - Respond quickly to outages and performance degradation.
  - Lead or participate in incident response, create postmortems, and implement preventative measures.
- Monitoring and Observability
  - Set up monitoring, logging, and alerting tools to detect issues before they impact users.
  - Provide visibility into system health and performance.
- Capacity Planning and Performance Optimization
  - Ensure systems can handle current and future loads.
  - Optimize infrastructure usage and reduce waste (cost-efficiency).
- Bridging Development and Operations
  - Advocate for and implement DevOps and SRE best practices.

SREs make sure that production systems are always available, fast, and efficient by combining software engineering with traditional IT operations practices.

This role may also contribute to or support project work as and when required.

**Key Responsibilities**:

- Develops automation scripts to reduce manual intervention, cutting recurring operational toil.
- Sets up and maintains monitoring and alerting using tools like Prometheus, Grafana, and PagerDuty.
- Participates in on-call rotations, driving fast resolution of P1/P2 incidents and contributing to root cause analysis and postmortem documentation.
- Carries out development work on deployment pipelines (Jenkins, GitLab CI/CD).
- Hardens security and compliance across production systems via configuration management and patching.
- Proactively monitors the work queues.
- Performs operational tasks to resolve all incidents/requests in a timely manner and within the agreed SLA.
- Updates tickets with resolution tasks performed.
- Identifies, investigates, and analyses issues and errors prior to or when they occur, and logs all such incidents in a timely manner.
- Captures all required and relevant information for immediate resolution.
- Provides second-level support for all incidents and requests, and identifies the root cause of incidents and problems.
- Communicates with other teams and clients to extend support.
- Executes changes with clear identification of risks and mitigation plans to be captured into the change record.
- Where applicable, follows the shift handover process, highlighting any key tickets to be focussed on and handing over upcoming critical tasks to be carried out in the next shift.
- Escalates all tickets to seek the right focus from the CoE and other teams and, if needed, continues the escalation to management.
- Works with automation teams for effort optimization and automating routine tasks.
- Works across various other resolver groups (internal and external), such as service providers and TAC.
- Identifies problems and errors before they impact a client’s service.
- Leads and manages all initial client escalations for operational issues.
- Contributes to the change management process by logging all change requests with complete details for standard and non-standard changes, including patching and any other changes to Configuration Items.
- Ensures all changes are carried out with proper change approvals.
- Plans and executes approved maintenance activities.
- Audits and analyses incident and request tickets for quality and recommends improvements with updates to knowledge articles.
- Produces trend analysis reports for identifying tasks for automation, leading to a reduction in tickets and optimization of effort.
- May also contribute to or support project work as and when required.
- May work on implementing and delivering Disaster Recovery functions and tests.
- Performs any other related task as required.

**Knowledge and Attributes**:

- Ability to communicate and work across different cultures and social groups.
- Ability to plan activities and projects well in advance, taking into account possible changing circumstances.
- Ability to maintain a positive outlook at work.
- Ability to work well in a pressurized environment.
- Ability to work hard and put in longer hours when it is necessary.
- Ability to adapt t



  • Singapore DHATCH CONSULTANCY PTE. LTD. Full time

    Site Reliability Engineer: **Preferred Qualifications** - 3+ years of experience in site reliability engineering, DevOps, or software engineering roles. - Proven skills in: - Monitoring & alerting tools (Grafana, New Relic) - CI/CD pipelines (Git, Jenkins, GitHub Actions, etc.) - Container orchestration (Docker, Kubernetes) - Infrastructure-as-code...


  • Singapore eTeam Full time

    Description Site Reliability Engineer (SRE) We are looking for a seasoned Site Reliability Engineer (SRE) with 5–10 years of experience to join our Platform Engineering team. This role is ideal for someone who thrives in a fast‑paced environment, is passionate about reliability, and enjoys solving complex challenges. You will play a key role in building...


  • Singapore ETEAM WORKFORCE PTE. LTD. Full time

    Roles & Responsibilities Position: Site Reliability Engineer (SRE) Work Mode - Onsite/Hybrid Timing - 9am to 6 pm Duration – 1 Year (Highly extendable) Salary: 6018 SGD Work Location: Robinson Road, Singapore Job Description About the Role We are looking for a seasoned Site Reliability Engineer (SRE) with 5+ years of experience to join our Platform...


  • Singapore ETEAM WORKFORCE PTE. LTD. Full time

    Position: Site Reliability Engineer (SRE) Work Mode - Onsite/Hybrid Timing - 9am to 6 pm Duration – 1 Year (Highly extendable) Salary: 6018 SGD Work Location: Robinson Road, Singapore About the Role We are looking for a seasoned Site Reliability Engineer (SRE) with 5+ years of experience to join our Platform Engineering team. This role is ideal for someone...


  • Singapore NTT Data Singapore Full time $120,000 - $200,000 per year

    As a Site Reliability Engineer you will be filling a mission-critical role ensuring that our systems are healthy, monitored, automated, fault tolerant and designed to scale. You will collaborate and work closely with engineering teams to continually improve our production services, facilitating fast delivery of new products, and reducing downtime. Key...


  • Singapore eTeam Full time

    Direct message the job poster from eTeam Are you passionate about reliability, performance, and scalability? Join our dynamic engineering team and help build robust systems that power innovation! Site Reliability Engineer (SRE) Budget: Up to SGD 6,000/month Experience: 5–10 years Key Responsibilities Design, build, and maintain scalable, reliable...


  • Singapore ABAXX SINGAPORE PTE. LTD. Full time

    Site Reliability Engineer - Networking We are seeking a competent candidate joining our Infrastructure Team for the mission building and operating a MAS regulated marketplace and clearing house. This role is ideal for someone with a strong foundation in AWS services, infrastructure as code, and cloud security, who is passionate about building scalable, secure,...


  • Singapore Crystal Equation Corporation Full time

    We are seeking a skilled Site Reliability Engineer (SRE) to join our team. SRE will be responsible for keeping all internal user-facing applications and other production systems running smoothly. This hybrid role involves a combination of both development and operations skills to build and manage systems that are both efficient and reliable. The Enterprise...


  • Singapore Abaxx Commodity Futures Exchange and Clearinghouse Full time

    Site Reliability Engineer - Networking We are seeking a competent candidate joining our Infrastructure Team for the mission building and operating a MAS regulated marketplace and clearing house. This role is ideal for someone with a strong foundation in AWS services, infrastructure as code, and cloud security, who is passionate about building scalable,...