Site Reliability Engineers/Platform Engineers
3 days ago
Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team spread across 5 continents. Razer is also a great place to work, providing you the unique, gamer-centric #LifeAtRazer experience that will accelerate your growth, both personally and professionally.
Job Responsibilities
We are looking for Site Reliability Engineers (SRE) and Platform Engineers to join our AI Software team. In this role, you will ensure the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems.
As a Platform Engineer, you could also design, build, and operate the core platforms that enable scalable AI model serving, data pipelines, and microservices across our organization. This role focuses on Kubernetes-based systems, cloud infrastructure, developer productivity tooling, automation, and the reliability of shared services.
You'll work closely with software engineers, AI teams and release teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment. This role is ideal for someone who enjoys building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.
Essential Duties and Responsibilities
- Design, deploy, and manage container-native DevOps platforms based on Kubernetes to support microservices, AI model serving, data engineering, and software application workloads.
- Build Proof-of-Concepts leveraging CNCF and Kubernetes-native technologies to validate architectural patterns and platform enhancements.
- Architect secure, scalable infrastructure for AI services, GPU workloads, and distributed systems.
- Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.
- Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.
- Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.
- Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.
- Ensure reliability and uptime for experimental and production AI software environments.
- Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.
- Reduce MTTR through actionable alerts, runbooks, and automated diagnostics.
- Automate infrastructure using IaC (Terraform/CloudFormation) and configuration management.
- Improve release workflows and integrate with QA for smooth handoff to Release Candidate testing.
- Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.
- Participate in the team's on-call rotation to support 24/7 uptime for critical systems.
Qualifications
- 4+ years of relevant experience in Platform Engineering/SRE, DevOps, infrastructure engineering, or cloud operations.
- Strong understanding of system design, networking, web technologies, and distributed high-traffic systems.
- Experience operating production services with significant availability or scaling demands.
- Strong knowledge of web technologies such as HTTP, REST, SSL, load balancers, and web proxies (NGINX).
- Comfortable with Linux and Docker administration.
- Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL).
- Strong ability to code and script (preferably Bash and Python).
- Proficiency in building and maintaining pipelines for , Go, Python, or similar languages.
- Strong experience with modern CI/CD systems, GitOps practices, and tools (Jenkins, ArgoCD, Argo Workflows).
- Ability to use or quickly pick up a wide variety of open source technologies and automation tools.
- Experience with Infrastructure-as-Code (Terraform, Helm, Kustomize).
- Understanding of GPU-based workloads and resource scheduling.
- Familiarity with vector databases, embeddings, and inference pipelines.
- Comfort with frequent, incremental code testing and deployment.
- Good analytical skills, with the ability to debug deployment problems without relying on developers.
- Deep hands-on technical expertise and problem-solving skills.
- Ability to work in a collaborative, technically challenging environment with rapidly changing requirements.
Education & Experience
- Bachelor's or Master's degree in computer science, AI, or a similar discipline from an accredited institution.
Travel Requirements
- Role is based in the Singapore office and may require up to one trip per year.
Are you game?
-
Cloud Platform Site Reliability Engineer
2 weeks ago
Singapore Barings LLC Full time Cloud Platform Site Reliability Engineer. Locations: Hong Kong: SG - SINGAPORE - 1 WALLICH ST. Time type: Full time. Posted on: Posted 30+ Days Ago. Job requisition id: JR\_ At Barings, we are as invested in our associates as we are in our clients. We recognize those who work diligently for us and...
-
Site Reliability Engineer
2 weeks ago
Singapore eTeam Full time Description: Site Reliability Engineer (SRE). We are looking for a seasoned Site Reliability Engineer (SRE) with 5–10 years of experience to join our Platform Engineering team. This role is ideal for someone who thrives in a fast-paced environment, is passionate about reliability, and enjoys solving complex challenges. You will play a key role in building...
-
Site Reliability Engineer
2 days ago
Singapore ETEAM WORKFORCE PTE. LTD. Full time Position: Site Reliability Engineer (SRE) Work Mode - Onsite/Hybrid Timing - 9 am to 6 pm Duration – 1 Year (Highly extendable) Salary: 6018 SGD Work Location: Robinson Road, Singapore About the Role: We are looking for a seasoned Site Reliability Engineer (SRE) with 5+ years of experience to join our Platform Engineering team. This role is ideal for someone...
-
Site Reliability Engineer
1 week ago
Singapore ETEAM WORKFORCE PTE. LTD. Full time Roles & Responsibilities Position: Site Reliability Engineer (SRE) Work Mode - Onsite/Hybrid Timing - 9 am to 6 pm Duration – 1 Year (Highly extendable) Salary: 6018 SGD Work Location: Robinson Road, Singapore Job Description About the Role: We are looking for a seasoned Site Reliability Engineer (SRE) with 5+ years of experience to join our Platform...
-
Site Reliability Engineer, Traffic Platform
1 hour ago
Singapore ByteDance Full time Site Reliability Engineer, Traffic Platform About the Team: Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed infrastructures. Our SREs are tasked to ensure the traffic services are reliable, fault-tolerant, efficiently scalable and cost-effective. You will have the opportunity...
-
Site Reliability Engineer
1 hour ago
Singapore ByteDance Full time Site Reliability Engineer - Media Platform Responsibilities: Build global infrastructure for multimedia transport, storage and processing, to serve billions of users all over the world. Engage in global production system management, such as monitoring, emergency response, capacity planning and optimization. Build tools, automations, visualizations and...
-
Site Reliability Engineer
4 days ago
Singapore ABAXX SINGAPORE PTE. LTD. Full time Site Reliability Engineer - Networking. We are seeking a competent candidate to join our Infrastructure Team, whose mission is building and operating a MAS-regulated marketplace and clearing house. This role is ideal for someone with a strong foundation in AWS services, infrastructure as code, and cloud security, who is passionate about building scalable, secure,...
-
Site Reliability Engineer, Traffic Platform
5 days ago
Singapore ByteDance Full time Responsibilities About ByteDance: Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create...