
System Engineer
2 weeks ago
Job Description:
Situated in the heart of Singapore's Central Business District, Rakuten Asia Pte. Ltd. is Rakuten's Asia Regional headquarters. Established in August 2012 as part of Rakuten's global expansion strategy, Rakuten Asia comprises various businesses that provide essential value-added services to Rakuten's global ecosystem. Through advertisement product development, product strategy, and data management, among others, Rakuten Asia is strengthening Rakuten Group's core competencies to take the lead in an increasingly digitalized world.
The AI & Data Division (AIDD) spearheads data science & AI initiatives by leveraging data from Rakuten Group. We build a platform for large-scale field experimentation using cutting-edge technologies to provide critical insights that enable faster and better contributions to our business. Our division boasts an international culture created by talented employees from around the world. Following the strategic vision “Rakuten as a data-driven membership company”, AIDD is expanding its data & AI related activities across multiple Rakuten Group companies.
As a System Engineer (GPU Infrastructure & Platform Engineering), you will build, scale, and optimize the GPU cluster infrastructure that supports both training (e.g., ranking models, LLMs) and inference workloads. Your focus will be on designing and building a GPU platform with sophisticated scheduling, elasticity, and quota management, ensuring efficient utilization, scalability, and stability for Rakuten’s AI workloads.
Key Responsibilities
- Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
- Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput and low-latency model deployment.
- Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
- Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
- Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health.
- Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).
Mandatory Qualifications
- 3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
- Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., Kubeflow, Volcano, custom schedulers).
- Strong programming skills in Go or Python for platform development, automation, and tooling.
- Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
- Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
- Strong teamwork and communication skills, with a passion for solving infrastructure challenges.
Nice-to-Have Skills
- Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
- Familiarity with NVIDIA Triton or a similar serving framework, including tuning serving parameters to balance latency against throughput.
- Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues.
- Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
- Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).
Why Join Us?
- Build and scale cutting-edge GPU infrastructure for ranking models, LLMs, and real-time AI.
- Work with global AI/ML teams to solve high-impact infrastructure challenges.
- Opportunity to shape the future of Rakuten’s GPU platform for scalability and efficiency.
-
Senior Systems Engineer
1 week ago
Singapore ATT System Full time
Role and Responsibilities: - Design, deploy, and manage scalable, secure cloud infrastructure (AWS, Azure, or GCP). - Administer, configure, and optimize Unix/Linux systems (RHEL, CentOS, Ubuntu, AIX, Solaris). - Develop and maintain automation scripts for system provisioning and configuration (Bash, Python, Ansible). - Monitor system health and...
-
System Engineer
1 week ago
Singapore PTC SYSTEM (S) PTE LTD Full time
System Engineer (Network Security). Duties and Responsibilities: - Work with customers to undertake the design, installation, configuration, and maintenance of network system solutions, including upgrades and migrations across heterogeneous networks or systems - Work in a fast-paced environment with tight schedules and dynamic project teams which include...
-
System Engineer
1 week ago
Singapore SYSNET SYSTEM AND SOLUTIONS PTE. LTD. Full time
- Provide leadership to the IT services support team. - Process accounts relating to user on-boarding, off-boarding and internal movement. - Handle day-to-day monitoring of client monitoring systems and follow up on IT infrastructure and security alerts. - Generate reports from the security monitoring systems and keep track of the remediation progress by the...
-
IT Manager
1 week ago
Singapore ATT System Full time
Role and Responsibilities: - Manage IT infrastructure, including on-premises and cloud-based servers, storage, and networking systems. - Ensure high availability and performance through proactive monitoring and maintenance of systems. - Plan and implement IT infrastructure upgrades to meet evolving business needs. - Troubleshoot and resolve...
-
System Engineer
4 weeks ago
Singapore PTC SYSTEM (S) PTE LTD Full time
Roles & Responsibilities: Deployment/implementation and support of AI/HPC (GPU) infrastructure solutions that include but are not limited to servers, virtualization, storage, networking, and the AI/ML/HPC software stack. Project documentation such as Design, Statement of Work, As-Built document, Performance Test, System...
-
System Engineer
2 weeks ago
Singapore PTC System (S) Pte Ltd Full time
Responsibilities: Deployment/implementation and support of AI/HPC (GPU) infrastructure solutions that include but are not limited to servers, virtualization, storage, networking, and the AI/ML/HPC software stack. Project documentation such as Design, Statement of Work, As-Built document, Performance Test, System Integration Test, User Acceptance Test. Lead projects...
-
System Engineer
3 days ago
Singapore PTC System (S) Pte Ltd Full time $90,000 - $120,000 per year
Responsibilities: Deployment/implementation and support of AI/HPC (GPU) infrastructure solutions that include but are not limited to servers, virtualization, storage, networking, and the AI/ML/HPC software stack. Project documentation such as Design, Statement of Work, As-Built document, Performance Test, System Integration Test, User Acceptance Test. Lead projects or work...
-
System Engineer
4 weeks ago
Singapore PTC SYSTEM (S) PTE LTD Full time
Duties and responsibilities: • Collaborate with and provide presales support to the sales team by developing technical solutions to the client's business needs. • Provide consultancy to customers by understanding their needs and translating them into feasible solutions. • Assist the sales team in responding to technical specifications of Tenders...
-
Project Engineer
1 week ago
Singapore ATT System Full time
Role and Responsibilities: - Responsible for project system design, equipment installation, configuration, testing and commissioning - Implement & ensure proper project management disciplines, structure and processes to drive scope estimation, prioritization and delivery of the project - Manage client expectations and ensure proper scope management -...
-
System Engineer
2 weeks ago
Singapore OMNI-PLUS SYSTEM LIMITED Full time
- Administer and develop Domain Controller ADDS (Active Directory Domain Services) and Terminal Server to achieve centralized control. - Install and manage Virtualization Technology (Hyper-V) - Secure and structure network resources with shared folder permissions and patching, and manage and housekeep storage using Synology NAS and Windows Server - Manage domain...