Large Scale Distributed Training Specialist

1 week ago


Singapore beBeeDistributedTraining Full time $125,000 - $175,000
Distributed Training & Inference Optimization Specialist

We are looking for a skilled specialist to maximize the performance and efficiency of large-scale training and inference workloads on our GPU clusters.

Key Responsibilities:
  1. Optimize LLM training frameworks: Maximize GPU utilization and reduce training time using PyTorch, DeepSpeed, Megatron-LM, and FSDP.
  2. Profile and optimize distributed training bottlenecks: Identify and resolve NCCL issues, CUDA kernel inefficiencies, and communication overhead.
  3. Implement and tune inference optimizations: Achieve low-latency and high-throughput LLM serving using quantization, dynamic batching, KV caching, vLLM, TensorRT-LLM, Triton, and SGLang.
  4. Collaborate with infrastructure teams: Improve GPU cluster scheduling, resource allocation, and fault tolerance for large-scale training jobs.
  5. Develop benchmarking tools: Measure and improve training throughput, memory efficiency, and inference latency.
  6. Research and apply cutting-edge techniques: Optimize LLM performance using mixture-of-experts and speculative decoding.
Requirements:
  • Hands-on experience: 3+ years in GPU-accelerated ML training and inference optimization, preferably for LLMs or large-scale deep learning models.
  • Deep expertise: PyTorch, DeepSpeed, FSDP, or Megatron-LM, with experience in distributed training optimizations.
  • Strong knowledge: LLM inference optimizations, including quantization, pruning, KV caching, and continuous batching.
  • Bachelor's degree or higher: Computer Science, Engineering, or related field.
Why Join Us?

  • Work on cutting-edge LLM training and inference optimization at scale.
  • Directly impact our AI infrastructure by improving efficiency and reducing costs.
  • Collaborate with global AI/ML teams on high-impact challenges.
  • Opportunity to research and implement state-of-the-art GPU optimizations.



  • Singapore beBeeReliability Full time $125,000 - $175,000

    A senior site reliability engineer is needed to ensure the smooth operation of large-scale distributed systems. Design, deploy, and manage CI/CD pipelines to deliver software consistently. Administer, scale, and optimize Kubernetes deployments for high availability and fault tolerance. Architect and maintain microservices infrastructure for seamless...


  • Singapore beBeeSoftwareEngineer Full time $80,000 - $120,000

    Infrastructure Software Engineer. This role is a key part of our large-scale AI training and inference infrastructure, focusing on performance and efficiency in the recommendation domain. You will be optimizing the end-to-end stack for model training and inference, working on distributed systems, model/system co-design, GPU optimizations, and more. Your main...


  • Singapore Rakuten Asia Pte Ltd Full time

    Distributed Training & Inference Optimization Engineer (LLM) at Rakuten Asia Pte Ltd. 4 days ago.


  • Singapore beBeeOptimization Full time $150,000 - $200,000

    Optimization Engineer. Develop, optimize, and improve large-scale model operators for inference and training. Collaborate closely with hardware experts to co-optimize software and hardware systems. Key Responsibilities: - Design high-performance large model compilation systems through self-developed NPU software and hardware systems. - Conduct research and...


  • Singapore RISKDATA CONSULTING PTE. LTD. Full time

    We are hiring a Data Engineering Technologist - Large-Scale Distributed Systems with the requirements below. Responsibilities: - Design and maintain ETL/ELT pipelines using Spark for batch and streaming data. - Manage and optimize Hadoop clusters (HDFS, YARN) for scalability and reliability. - Build and maintain Hive data models, partitions, and queries for...


  • Singapore beBeeDataEngineer Full time $180,000 - $240,000

    As a seasoned data engineering professional, you will play a key role in the design and implementation of large-scale distributed systems. Job Description: The ideal candidate will possess a strong background in data engineering or big data development, with hands-on experience in Spark (Core, SQL, Streaming). They should also have a solid understanding of...


  • Singapore beBeeDataEngineer Full time

    As a seasoned data engineering professional, you will play a key role in the design and implementation of large-scale distributed systems. Job Description The ideal candidate will possess a strong background in data engineering or big data development, with hands-on experience in Spark (Core, SQL, Streaming). They should also have a solid understanding...


  • Singapore beBeeDataEngineering Full time $120,000 - $180,000

    Large Scale Data Engineering Expert. We are seeking an experienced Large Scale Data Engineering Expert to join our team. As a key member of our team, you will play a vital role in designing and maintaining large-scale distributed systems. The ideal candidate will have hands-on experience with Spark (Core, SQL, Streaming), a good understanding of Hadoop (HDFS,...


  • Singapore beBeeDataEngineering Full time

    Large Scale Data Engineering Expert We are seeking an experienced Large Scale Data Engineering Expert to join our team. As a key member of our team, you will play a vital role in designing and maintaining large-scale distributed systems. The ideal candidate will have hands-on experience with Spark (Core, SQL, Streaming), a good understanding of Hadoop...


  • Singapore beBeeHigh Full time $120,000 - $180,000

    Job Description: High-Performance Computing Engineer. We are seeking a highly skilled High-Performance Computing Engineer to join our team. The ideal candidate will have expertise in designing and optimizing large-scale computing systems for artificial intelligence applications. The successful candidate will be responsible for developing large model operators,...