Site Reliability Engineer - Machine Learning Systems
1 month ago
- Responsible for ensuring our ML systems are operating and running efficiently for large model deployment, training, evaluation, and inference
- Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios
- Responsible for resource management and planning, cost and budget, including computing and storage resources
- Responsible for global system disaster recovery, cluster machine governance, stability of business services, resource utilisation improvement and operation efficiency improvement
- Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
- Be part of the global team roster that ensures system and business on-call support
The Large Model Team has a long-term vision and determination in the field of AI, with research directions covering NLP, CV, speech, and other areas. Relying on the abundant data and computing resources of the platform, the team has continued to invest in relevant fields and has launched its own general large model, providing multi-modal capabilities.
The Machine Learning (ML) System sub-team combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI.
In our team, you'll have the opportunity to build a large-scale heterogeneous system integrating GPU/NPU/RDMA/Storage and keep it running steadily and reliably, deepen your expertise in coding, performance analysis and distributed systems, and be involved in the decision-making process. You'll also be part of a global team with members from the United States, China and Singapore working collaboratively towards a unified project direction.
Requirements
Qualifications
Minimum Qualifications
- Bachelor's degree or above, majoring in Computer Science, Computer Engineering or a related field;
- Strong proficiency in at least one programming language, such as Go, Python or Shell, in a Linux environment;
- Strong hands-on experience with Kubernetes and containers, with at least 3 years of relevant operations and maintenance experience;
Preferred Qualifications
- Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly; a strong sense of responsibility, good learning and communication skills, self-motivation and team spirit;
- Good documentation habits, with the ability to write and update workflow and technical documentation on time as required;
- Experience operating and maintaining large-scale distributed ML systems;
- Experience operating and maintaining GPU servers.