
Site Reliability Engineer
4 days ago
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will design and operate the GitLab, AWS, and Kubernetes-based infrastructure that powers the platform, ensuring its stability, scalability, and performance.
Responsibilities
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
- Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
- Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
- Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
- Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
- Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
- Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
- Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
- Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
- Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
- Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
- Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
- Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
- Strong documentation skills and experience in knowledge sharing across teams.
- Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, plus familiarity with multi-cloud or hybrid environments.
- Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
- Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
- Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
- Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
- Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
- Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
- Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
- Experience developing Kubernetes operators in Go, and experience with service mesh technologies and chaos engineering, is a plus.
- Proactive in identifying problems and recommending strategic solutions.
- Excellent problem-solving skills with a robust analytical mindset.
- Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.
- Ability to remain calm and effective under pressure, especially during incident response.
- Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
- Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
- Leadership and mentoring capabilities that contribute to a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Data Security Software Products