AIOps Engineer

16 hours ago


Central Region, Singapore | Sri Trang Agro-Industry Public Company Limited | Full time | $120,000 - $200,000 per year

Position: AIOps Engineer

Location: Central Singapore

Department: IT Operations / Infrastructure & Cloud

Reports to: Head of IT Operations / IT Infrastructure Manager

Job Overview:

We are seeking a hands-on, visionary, and technically deep AIOps Engineer with a cloud-native-first mindset to play a leading role in developing our Solutions Platform, a cutting-edge Dev/Data/ML/AI/LLM-Ops platform for scalable AI/ML/Agentic innovation across Sri Trang. This role is designed for someone who thrives in complex environments, enjoys problem-solving at scale, and can architect resilient, high-performance, and automated infrastructure on multi-cloud platforms.

Join us to build scalable, intelligent, and automated infrastructures that power AI, ML, and Agentic applications at Sri Trang Group. You'll be driving CI/CD pipelines, cloud-native deployments, and AI-enhanced solutions, ensuring our systems are not only reliable but also smart enough to heal themselves.

This is not just a role; it's a mission to redefine how AI is built, tested, deployed, and monitored at scale in our organization. If you are up to the challenge, we will be happy to get in touch with you.

Key Responsibilities:

  • DevOps – The Foundation of Your Role:
    • Develop and implement a comprehensive DevOps strategy that aligns with Sri Trang Group's business objectives and AI transformation goals.
    • Architect and optimize CI/CD pipelines to support high-frequency deployments.
    • Build and maintain cloud-native infrastructures (preferably Azure) using Infrastructure as Code (ARM, Terraform).
    • Automate as much as possible, from deployments to monitoring, aiming for zero-touch operations wherever feasible.
    • Drive observability and monitoring using cutting-edge tools like Azure Monitor, Grafana, Prometheus, and Datadog.
    • Manage CPU/GPU computing resources and workloads for seamless scalability.
  • Data Operations – Because without data we can't develop AI:
    • Collaborate with Data Engineering and Infrastructure teams to ensure the availability, quality, and timeliness of data for model training, fine-tuning, and serving.
    • Automate workflows supporting large-scale data preparation for AI/ML/Agentic applications.
    • Integrate version control systems and CI/CD tools (preferably Azure DevOps) to streamline the deployment of scalable data pipelines.
    • Work extensively with cloud vendors (AWS, Azure, Google Cloud Platform, etc.) to scale data infrastructure, leveraging cloud-native architectures like serverless computing and distributed data systems.
    • Collaborate with data engineers, data scientists, and analysts to continuously refine deployment processes.
  • Machine Learning (ML), DevOps, and Data Engineering – Where Dev Meets AI:
    • Collaborate with Data Scientists to deploy, monitor, and scale AI/ML models in production using MLflow, TensorFlow Serving, TorchServe, NVIDIA Triton, etc.
    • Collaborate with Data Scientists to automate model versioning, drift detection, and retraining for optimal performance.
    • Collaborate with Data Scientists to design ML pipelines with AzureML, Airflow, or Kubeflow for efficient data and model workflows.
    • Ensure cost-efficient inference through model optimization and resource scaling on CPU/GPU instances.
  • Large Language Model Operations – Keeping up with What's Coming:
    • Collaborate with Data Scientists to optimize deployment and fine-tuning of LLMs like DeepSeek, BERT, and Llama.
    • Collaborate with Data Scientists to work with vector databases to enhance real-time inference and implement Agentic AI.
    • Help Data Scientists enable scalable AI applications through prompt engineering and model optimization.
  • Artificial Intelligence for IT Operations – Make the Infrastructure Smarter:
    • In collaboration with Data Scientists, Data Engineers, and Infrastructure teams, implement AI-powered monitoring and anomaly detection to predict failures before they happen.
    • Use AI-driven automation for root cause analysis and self-healing infrastructure.
    • Enhance operational efficiency with intelligent incident response mechanisms.
  • Subject Matter Expertise: Be the go-to expert on Dev/Data/ML/AI/LLM-Ops engineering best practices, spearheading state-of-the-art implementation in our team.
  • Documentation: Develop comprehensive documentation for Dev/Data/ML/AI/LLM-Ops processes and systems. Provide training and support to team members and stakeholders on tools and best practices.

Required Qualifications:

  • Education: Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field. A Master's degree is preferred but not required.
  • Experience:
    • with PhD) years of experience in any of DevOps (Development and Operations), DataOps (Data Operations), MLOps (Machine Learning Operations), AIOps (Artificial Intelligence for IT Operations), or LLMOps (Large Language Model Operations), coupled with expertise in SRE and Cloud Engineering.
    • Strong coding skills in Python, Bash, and PowerShell for automation and scripting.
  • Technical Skills:
    • Deep expertise in CI/CD and multi-cloud platforms (Azure preferred; also AWS and GCP).
    • Hands-on experience deploying and managing ML models in production environments.
  • Detail-Oriented:
    • Passionate about automation, AI-driven infrastructure, and making systems smarter to the highest possible standard.

How to stand out from the rest:

  • Certification in Azure (e.g., Azure AI Engineer Associate or Azure DevOps Engineer Expert).
  • Familiarity with feature stores and model registries.
  • Experience with data versioning tools like DVC.
  • MLOps Pipelines Development: experience with on-premises and edge deployment is a big plus.
  • Familiarity with AIOps and LLMOps concepts, tools, and strategies.
  • Technical Skills: knowledge of tools and technologies such as Docker, Kubernetes, SQL, Spark, Hadoop, Kafka, ONNX, and ETL processes is a big plus.
  • Continuous Integration and Deployment: experience with A/B testing and model validation in production environments is highly desirable.

