Site Reliability Engineer

4 days ago


Singapore, Singapore NTT SINGAPORE PTE. LTD. Full time

Must have skills

Development knowledge of Bash scripting, Java, Python, React or Angular. Working experience with Elasticsearch and Prometheus.

Job description

  • Maintain the open-source-based application monitoring infrastructure; enhance, optimize, and migrate to new solutions if required.
  • Support application teams in migrating to the latest OpenShift versions, deploy stateful and stateless apps, and troubleshoot issues on Kubernetes/OpenShift platforms.
  • Work with application developers to implement application instrumentation libraries and frameworks.
  • Maintain the metrics data store using Prometheus, including administration and tuning such as cardinality optimization and resource optimization.
  • Maintain distributed tracing infrastructure such as OpenTelemetry (OTel), Jaeger, and Zipkin, including administrative functions and tuning such as sampling strategy. Troubleshoot distributed tracing in microservices.
  • Perform production support for enterprise logging platforms such as the ELK stack and the Grafana LGTM stack.
  • Implement alerting infrastructure and integrate it with PagerDuty, MS Teams, and any other software that needs alert-based mitigation or action. Assist the application support team in defining alerting rules for enterprise business apps.
  • Deploy and administer visualization tools such as Grafana and Elastic. Create reusable dashboard templates and implement RBAC for the entire user base.
  • Educate the dev community on observability and help build an observability culture. Assist developers in identifying golden signals, defining SLIs and SLOs for enterprise applications, and calculating error budgets, MTTD, and MTTR.
  • Troubleshoot infrastructure issues in the observability infrastructure on Linux VMs and Kubernetes pods.
  • Set up and secure reverse proxies, secure all application endpoints with TLS, and enable MFA, LDAPS, and OAuth as required.
  • Configure CI/CD pipelines for all the monitoring infrastructure and services. Modify and extend existing pipelines to cater to multiple environments and regions.



  • Singapore, Singapore NetEase Games Full time

    Overview: As a leading internet technology company based in China, NetEase provides premium online services centered around content creation and operates a broad gaming ecosystem. Job Description: Site Reliability Engineering (SRE) refers to using software engineering methods to manage...


  • Singapore, Singapore APPLE SOUTH ASIA PTE. LTD. Full time

    Summary: At Apple, new ideas have a way of becoming excellent products, services, and customer experiences very quickly. Bring passion and dedication to your job and there’s no telling what you could accomplish. The people here at Apple don’t just build products - they craft the kind of wonder that’s revolutionized entire industries. It’s the...


  • Singapore, Singapore PERSOL SINGAPORE PTE. LTD. Full time

    Overview: An excellent Site Reliability Engineer (SRE) opportunity is available in a cutting-edge, fast-growing cloud environment. Job Purpose: Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure. Job Responsibilities: Manage and support AWS services, ensuring uptime,...


  • Singapore, Singapore PERSOL SINGAPORE PTE. LTD. Full time

    Cloud Site Reliability Engineer (AWS): An excellent Cloud Site Reliability Engineer opportunity has just arisen at a global brand supporting mission‑critical government systems. Job Purpose: Ensure reliable, secure, and automated cloud operations supporting mission‑critical systems and compliance needs. Responsibilities: Manage and support AWS cloud...


  • Singapore, Singapore Crystal Equation Corporation Full time

    Overview: We are seeking a skilled Site Reliability Engineer (SRE) to join our team. The SRE will be responsible for keeping all internal user-facing applications and other production systems running smoothly. This hybrid role combines development and operations skills to build and manage systems that are both efficient and reliable. The...


  • Singapore, Singapore Thales Full time

    Location: Singapore, Singapore. Overview: Thales is a global technology leader trusted by governments, institutions, and enterprises to tackle their most demanding challenges. From quantum applications and artificial intelligence to cybersecurity and 6G innovation, our solutions empower critical...


  • Singapore, Singapore E-Solutions Full time

    Job Title: Site Reliability Engineer (SRE). Experience: 8+ years (including 3+ years in Java). About the Role: We’re looking for a skilled Site Reliability Engineer with strong Java and cloud-native development experience to design, build, and maintain reliable, scalable systems on Kubernetes and AWS. You’ll work closely with development and platform teams...


  • Singapore, Singapore Razer Inc. Full time

    Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team located across 5 continents. Razer is...


  • Singapore, Singapore TikTok Full time

    About the team: TikTok Shop is a content e-commerce business utilising international short video products as carriers. Our aim is to become the preferred choice for users seeking to discover and purchase affordable, high-quality products. We provide users with tailored, vibrant, and efficient consumption experiences while enabling...


  • Singapore, Singapore Manpower Singapore Full time

    Site Reliability Engineer - Global Support. Responsibilities: Deploy and manage overseas game infrastructure, including the game monitoring system and login services. Monitor and build dashboards for game observability to ensure reliability, scalability, and security. Analyze game...