AIOps Engineer

2 weeks ago

Central Region, Singapore | Sri Trang Agro-Industry Public Company Limited | Full time | $120,000 - $200,000 per year

Position: AIOps Engineer

Location: Central Singapore

Department: IT Operations / Infrastructure & Cloud

Reports to: Head of IT Operations / IT Infrastructure Manager

Job Overview:

We are seeking a hands-on, visionary, and technically deep AIOps Engineer with a cloud-native-first mindset who will play a leading role in developing our Solutions Platform, a cutting-edge Dev/Data/ML/AI/LLM-Ops platform for scalable AI/ML/Agentic innovation across Sri Trang. This role is designed for someone who thrives in complex environments, enjoys problem-solving at scale, and can architect resilient, high-performance, automated infrastructure on multi-cloud platforms.

Join us to build scalable, intelligent, and automated infrastructures that power AI, ML, and Agentic applications at Sri Trang Group. You'll be driving CI/CD pipelines, cloud-native deployments, and AI-enhanced solutions, ensuring our systems are not only reliable but also smart enough to heal themselves.

This is not just a role; it's a mission to redefine how AI is built, tested, deployed, and monitored at scale in our organization. If you are up for the challenge, we will be happy to get in touch with you.

Key Responsibilities:

DevOps – The Foundation of Your Role:
Develop and implement a comprehensive DevOps strategy that aligns with Sri Trang Group's business objectives and AI transformation goals.
Architect and optimize CI/CD pipelines to support high-frequency deployments.
Build and maintain cloud-native infrastructures (preferably Azure) using Infrastructure as Code (ARM, Terraform).
Automate as much as possible, from deployments to monitoring, aiming for zero-touch operations wherever feasible.
Drive observability and monitoring using cutting-edge tools like Azure Monitor, Grafana, Prometheus, and Datadog.
Manage CPU/GPU computing resources and workloads for seamless scalability.
Data Operations – Because without data we can't develop AI:
Collaborate with Data Engineering and Infrastructure teams to ensure the availability, quality, and timeliness of data for model training, fine-tuning, and serving.
Automate workflows supporting large-scale data preparation for AI/ML/Agentic applications.
Integrate version control systems and CI/CD tools (preferably Azure DevOps) to streamline the deployment of scalable data pipelines.
Work extensively with cloud vendors (AWS, Azure, Google Cloud Platform, etc.) to scale data infrastructure, leveraging cloud-native architectures such as serverless computing and distributed data systems.
Collaborate with data engineers, data scientists, and analysts to continuously refine deployment processes.
Machine Learning (ML), DevOps, and Data Engineering – Where Dev Meets AI:
Collaborate with Data Scientists to deploy, monitor, and scale AI/ML models in production using MLflow, TensorFlow Serving, TorchServe, NVIDIA Triton, etc.
Collaborate with Data Scientists to automate model versioning, drift detection, and retraining for optimal performance.
Collaborate with Data Scientists to design ML pipelines with Azure ML, Airflow, or Kubeflow for efficient data and model workflows.
Ensure cost-efficient inference through model optimization and resource scaling on CPU/GPU instances.
Large Language Model Operations – Keeping up with What's Coming:
Collaborate with Data Scientists to optimize deployment and fine-tuning of LLMs like DeepSeek, BERT, and Llama.
Collaborate with Data Scientists to work with vector databases to enhance real-time inference and implement Agentic AI.
Help Data Scientists enable scalable AI applications through prompt engineering and model optimization.
Artificial Intelligence for IT Operations – Make the Infrastructure Smarter:
In collaboration with Data Scientists, Data Engineers, and Infrastructure teams, implement AI-powered monitoring and anomaly detection to predict failures before they happen.
Use AI-driven automation for root cause analysis and self-healing infrastructure.
Enhance operational efficiency with intelligent incident response mechanisms.
Subject Matter Expertise: Be the go-to expert on Dev/Data/ML/AI/LLM-Ops engineering best practices, spearheading state-of-the-art implementations in our team.
Documentation: Develop comprehensive documentation for Dev/Data/ML/AI/LLM-Ops processes and systems. Provide training and support to team members and stakeholders on tools and best practices.

Required Qualifications:

Education: Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field. A Master's degree is preferred but not required.
Experience:
with PhD) years of experience in any of DevOps (Development and Operations), DataOps (Data Operations), MLOps (Machine Learning Operations), AIOps (Artificial Intelligence for IT Operations), or LLMOps (Large Language Model Operations), coupled with expertise in SRE and Cloud Engineering.
Strong coding skills in Python, Bash, and PowerShell for automation and scripting.
Technical Skills:
Deep expertise in CI/CD and multi-cloud platforms (AWS, GCP, and preferably Azure).
Hands-on experience deploying and managing ML models in production environments.
Detail-Oriented:
Passionate about automation, AI-driven infrastructure, and making systems smarter at the highest standard possible.

How to stand out from the rest:

Certification in Azure (e.g., Azure AI Engineer Associate or Azure DevOps Engineer Expert).
Familiarity with feature stores and model registries.
Experience with data versioning tools like DVC.
MLOps Pipeline Development:
Experience with on-premises and edge deployment is a big plus.
Familiarity with AIOps and LLMOps concepts, tools, and strategies.
Technical Skills: Knowledge of tools and technologies such as Docker, Kubernetes, SQL, Spark, Hadoop, Kafka, ONNX, and ETL processes is a big plus.
Continuous Integration and Deployment: Experience with A/B testing and model validation in production environments is highly desirable.
