Site Reliability Engineer
3 weeks ago
We are looking for Site Reliability Engineers (SREs) to join our AI Software team. In this role, you will ensure the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems. You'll work closely with software engineers, AI teams, and release teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment. This role is ideal for someone who enjoys building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.

Essential Duties and Responsibilities
Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.
Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.
Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.
Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.
Ensure reliability and uptime for experimental and production AI software environments.
Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.
Reduce MTTR through actionable alerts, runbooks, and automated diagnostics.
Automate infrastructure using IaC (Terraform/CloudFormation) and configuration management.
Improve release workflows and integrate with QA for smooth handoff to Release Candidate testing.
Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.
Participate in on-call rotations, incident reviews, and continuous improvement initiatives.
Qualifications
4+ years of relevant experience in SRE, DevOps, infrastructure engineering, or cloud operations.
Experience operating production services with significant availability or scaling demands.
Strong knowledge of web technologies such as HTTP, REST, SSL, load balancers, and web proxies (NGINX).
Comfortable with Linux and Docker administration.
Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL).
Strong ability to code and script (preferably Bash and Python).
Ability to use or quickly pick up a wide variety of open source technologies and automation tools.
Understanding of GPU-based workloads and resource scheduling.
Familiarity with vector databases, embeddings, and inference pipelines.
Comfort with frequent, incremental code testing and deployment.
Good analytical skills to debug deployment problems without help from developers.
Deep hands-on technical expertise and problem-solving skills.
Ability to work in a collaborative, technically challenging environment with rapidly changing requirements.

Education & Experience
Bachelor's or Master's degree in computer science, AI, or a similar discipline from an accredited institution.

Travel Requirements
Role based in the Singapore office; may require up to one trip per year.
-
DevOps / Site Reliability Engineer
3 weeks ago
Singapore | Qube Research & Technologies | Full time
Qube Research & Technologies (QRT) is a global quantitative and systematic investment manager, operating in all liquid asset classes across the world. We are a technology and data driven group implementing a scientific approach to investing. Combining data, research,...
-
Site Reliability Engineer
3 weeks ago
Singapore | GroupBy | Full time
About Rezolve Ai
Rezolve Ai (NASDAQ: RZLV) is an industry leader in AI-powered solutions, specializing in enhancing customer engagement, operational efficiency, and revenue growth. The Brain Suite delivers advanced tools that harness artificial intelligence to optimize processes, improve decision-making,...
-
Site Reliability Engineer
3 weeks ago
Singapore | Fastmarkets | Full time
Company Overview
Fastmarkets is an industry-leading price-reporting agency (PRA) and information provider for global commodities, offering price data, news, analytics and events for agriculture, forest products, metals and mining, and new-generation energy markets. Founded in 1865, it employs over 600 people across the UK, US, China, India, Singapore,...
-
Site Reliability Engineer
3 weeks ago
Singapore | Crystal Equation Corporation | Full time
We are seeking a skilled Site Reliability Engineer (SRE) to join our team. The SRE will be responsible for keeping all internal user-facing applications and other production systems running smoothly. This hybrid role combines development and operations skills to build and manage systems that are both efficient and reliable. The Enterprise...
-
Site Reliability Engineer
3 weeks ago
Singapore | Fastmarkets | Full time
Fastmarkets is an industry-leading price-reporting agency (PRA) and information provider for global commodities, providing price data, news, analytics and events for the agriculture, forest products, metals and mining and new-generation energy markets. Fastmarkets' data is critical for customers seeking to understand and predict dynamic, sometimes opaque...
-
Site Reliability Engineer
3 weeks ago
Singapore | Crystal Equation Corporation | Full time
We are seeking a skilled Site Reliability Engineer (SRE) to join our team. The SRE will be responsible for keeping all internal user-facing applications and other production systems running smoothly. This hybrid role combines development and operations skills to build and manage systems that are both efficient and reliable. The Enterprise...
-
Site Reliability Engineer
2 weeks ago
Singapore | Medium | Full time
About Rezolve Ai
Rezolve Ai (NASDAQ: RZLV) is an industry leader in AI-powered solutions, specializing in enhancing customer engagement, operational efficiency, and revenue growth. The Brain Suite delivers advanced tools that harness artificial intelligence to optimize processes, improve decision-making, and enable seamless digital experiences. As a leader in...
-
Site Reliability Engineer
3 weeks ago
Singapore | Viasat | Full time
About us
One team. Global challenges. Infinite opportunities. At Viasat, we're on a mission to deliver connections with the capacity to change the world. For more than 35 years, Viasat has helped shape how consumers, businesses, governments and militaries around the globe communicate. We're looking for people who think big, act fearlessly, and create an...
-
Site Reliability Engineer
3 weeks ago
Singapore | Starry Recruitment | Full time
Site Reliability Engineer (SRE) – Singapore
Responsibilities
Support the operation and maintenance of overseas cloud-based services, ensuring platform stability, reliability, and performance; proactively identify and resolve system bottlenecks.
Follow internal operational processes, taking ownership of incident management, service request management,...
-
Site Reliability Engineer
3 weeks ago
Singapore | Viasat | Full time
One team. Global challenges. Infinite opportunities. At Viasat, we're on a mission to deliver connections with the capacity to change the world. For more than 35 years, Viasat has helped shape how consumers, businesses, governments and militaries around the globe communicate. We're looking for people who think big, act fearlessly, and create an inclusive...