Site Reliability Engineer

1 week ago


Singapur, Singapore Avepoint Full time

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform. As a Site Reliability Engineer, you will design and operate the GitLab, AWS, and Kubernetes-based infrastructure and solutions that power our platform, ensuring its stability, scalability, and performance.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for:

Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.

Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.

Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.

Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.

Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.

Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.

Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements:

• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.

Soft skills:

• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment, are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.



  • Singapur, Singapore NetEase Games Full time

    Overview Join to apply for the Site Reliability Engineer role at NetEase Games. As a leading internet technology company based in China, NetEase provides premium online services centered around content creation and operates a broad gaming ecosystem. Job Description Site Reliability Engineering (SRE) refers to using software engineering methods to manage...


  • Singapur, Singapore APPLE SOUTH ASIA PTE. LTD. Full time

    Summary At Apple, new ideas have a way of becoming excellent products, services, and customer experiences very quickly. Bring passion and dedication to your job and there’s no telling what you could accomplish. The people here at Apple don’t just build products - they craft the kind of wonder that’s revolutionized entire industries. It’s the...


  • Singapur, Singapore PERSOL SINGAPORE PTE. LTD. Full time

    Overview Site Reliability Engineer (SRE) – An excellent Site Reliability Engineer (SRE) opportunity is available in a cutting-edge, fast-growing cloud environment. Job Purpose Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure. Job Responsibilities Manage and support AWS services, ensuring uptime,...


  • Singapur, Singapore PERSOL SINGAPORE PTE. LTD. Full time

    Cloud Site Reliability Engineer (AWS) An excellent Cloud Site Reliability Engineer opportunity has just arisen in a global brand supporting mission‑critical government systems. Job Purpose Ensure reliable, secure, and automated cloud operations supporting mission‑critical systems and compliance needs. Responsibilities Manage and support AWS cloud...


  • Singapur, Singapore Crystal Equation Corporation Full time

    Overview We are seeking a skilled Site Reliability Engineer (SRE) to join our team. SRE will be responsible for keeping all internal user-facing applications and other production systems running smoothly. This hybrid role involves a combination of both development and operations skills to build and manage systems that are both efficient and reliable. The...


  • Singapur, Singapore Thales Full time

    Overview Join to apply for the Site Reliability Engineer role at Thales. Location: Singapore, Singapore Thales is a global technology leader trusted by governments, institutions, and enterprises to tackle their most demanding challenges. From quantum applications and artificial intelligence to cybersecurity and 6G innovation, our solutions empower critical...


  • Singapur, Singapore E-Solutions Full time

    Job Title: Site Reliability Engineer (SRE) Experience: 8+ years (including 3+ years in Java) About the Role: We’re looking for a skilled Site Reliability Engineer with strong Java and cloud-native development experience to design, build, and maintain reliable, scalable systems on Kubernetes and AWS. You’ll work closely with development and platform teams...


  • Singapur, Singapore Razer Inc. Full time

    Join to apply for the Site Reliability Engineer role at Razer Inc. Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working across a team located across 5 continents. Razer is...


  • Singapur, Singapore TikTok Full time

    Overview Responsibilities About the team TikTok Shop is a content e-commerce business utilising international short video products as carriers. Our aim is to become the preferred choice for users seeking to discover and purchase affordable, high-quality products. We provide users with tailored, vibrant, and efficient consumption experiences while enabling...


  • Singapur, Singapore Manpower Singapore Full time

    Site Reliability Engineer - Global Support Apply for the Site Reliability Engineer - Global Support role at Manpower Singapore. Responsibilities Deploy and manage overseas games infrastructure, including game monitor system and login services. Monitor and dashboard game observability to ensure reliability, scalability, and security. Analyze game...