
System Engineer
2 weeks ago
Job Description:
Situated in the heart of Singapore's Central Business District, Rakuten Asia Pte. Ltd. is Rakuten's Asia Regional headquarters. Established in August 2012 as part of Rakuten's global expansion strategy, Rakuten Asia comprises various businesses that provide essential value-added services to Rakuten's global ecosystem. Through advertisement product development, product strategy, and data management, among others, Rakuten Asia is strengthening Rakuten Group's core competencies to take the lead in an increasingly digitalized world.
The AI & Data Division (AIDD) spearheads data science & AI initiatives by leveraging data from Rakuten Group. We build a platform for large-scale field experimentation using cutting-edge technologies to provide critical insights that enable faster and better contributions to our business. Our division boasts an international culture created by talented employees from around the world. Following the strategic vision “Rakuten as a data-driven membership company”, AIDD is expanding its data & AI related activities across multiple Rakuten Group companies.
As a System Engineer (GPU Infrastructure & Platform Engineering), you will build, scale, and optimize the GPU cluster infrastructure that supports both training (e.g., ranking models, LLMs) and inference workloads. Your focus will be on designing and building a GPU platform with sophisticated scheduling, elasticity, and quota management, ensuring efficient utilization, scalability, and stability for Rakuten’s AI workloads.
Key Responsibilities
- Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
- Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput and low-latency model deployment.
- Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
- Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
- Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health.
- Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).
Mandatory Qualifications
- 3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
- Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., KubeFlow, Volcano, custom schedulers).
- Strong programming skills in Go or Python for platform development, automation, and tooling.
- Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
- Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
- Strong teamwork and communication skills, with a passion for solving infrastructure challenges.
Nice-to-Have Skills
- Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
- Familiarity with NVIDIA Triton or a similar serving framework, and with tuning serving parameters to balance latency and throughput.
- Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues.
- Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
- Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).
Why Join Us?
- Build and scale cutting-edge GPU infrastructure for ranking models, LLMs, and real-time AI.
- Work with global AI/ML teams to solve high-impact infrastructure challenges.
- Opportunity to shape the future of Rakuten’s GPU platform for scalability and efficiency.