Site Reliability Engineer

1 day ago


Singapore Avepoint Full time

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will design and operate the GitLab, AWS, and Kubernetes-based infrastructure and solutions that power our runtime platform, ensuring its stability, scalability, and performance.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation

• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health

• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management

• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance

• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance

• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement

• Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation

• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation

• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements:


• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).

• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.

• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.

• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.

• Strong documentation skills and experience in knowledge sharing across teams.

• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, plus familiarity with multi-cloud or hybrid environments.

• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.

• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).

• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).

• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.

• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).

• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.

• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.

• Experience developing Kubernetes operators in Go, working with service mesh technologies, or practising Chaos Engineering is a plus.

Soft skills:


• Proactive in identifying problems and recommending strategic solutions.

• Excellent problem-solving skills with a robust analytical mindset.

• Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.

• Ability to remain calm and effective under pressure, especially during incident response.

• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.

• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.

• Leadership and mentoring capabilities that contribute to a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.

