Site Reliability Engineer

2 weeks ago


Singapore This is an IT support group Full time

About Doubao (Seed)
Founded in 2023, the ByteDance Doubao (Seed) Team is dedicated to pioneering advanced AI foundation models. Our goal is to lead in cutting-edge research and drive technological and societal advancements.
With a strong commitment to AI, our research areas span deep learning, reinforcement learning, language, vision, audio, AI infrastructure, and AI safety. Our team has labs and research positions across China, Singapore, and the US.
Leveraging substantial data and computing resources and through continued investment in these domains, we have developed a proprietary general-purpose model with multimodal capabilities. In the Chinese market, Doubao models power over 50 ByteDance apps and business lines, including Doubao, Coze, and Dreamina, and are available to external enterprise clients via Volcano Engine. Today, the Doubao app stands as the most widely used AIGC application in China.
Why Join Us
Creation is the core of ByteDance's purpose. Our products are built to help imaginations thrive. This is doubly true of the teams that make our innovations possible. Together, we inspire creativity and enrich life - a mission we work towards every day. To us, every challenge, no matter how ambiguous, is an opportunity: to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always.
At ByteDance, we create together and grow together. That's how we drive impact - for ourselves, our company, and the users we serve. Join us.
About the Team
The ByteDance Large Model Team is committed to developing the most advanced AI large model technology in the industry, becoming a world-class research team, and contributing to technological and social development. The team has a long-term vision and determination in the field of AI, with research directions covering NLP, CV, speech, and other areas. Relying on the abundant data and computing resources of the platform, the team has continued to invest in relevant fields and has launched its own general large model, providing multi-modal capabilities.
The Machine Learning (ML) System sub-team combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems/services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI.
In our team, you'll have the opportunity to build large-scale heterogeneous systems integrating with GPU/NPU/RDMA/Storage and keep them running steadily and reliably, enrich your expertise in coding, performance analysis, and distributed systems, and be involved in the decision-making process. You'll also be part of a global team with members from the United States, China, and Singapore working collaboratively towards a unified project direction.
Responsibilities
Responsible for ensuring our ML systems run efficiently for large model deployment, training, evaluation, and inference.
Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
Responsible for resource management and planning, including cost and budget, for computing and storage resources.
Responsible for global system disaster recovery, cluster machine governance, stability of business services, resource utilization improvement, and operation efficiency improvement.
Build software tools, products, and systems to monitor and manage the ML infrastructure and services efficiently.
Be part of the global team roster that ensures system and business on-call support.
Qualifications
Minimum Qualifications
Bachelor's degree or above, majoring in Computer Science, Computer Engineering, or a related field.
Strong proficiency in at least one programming language such as Go/Python/Shell in a Linux environment.
Strong hands-on experience with Kubernetes and containers, and 3 years of relevant operations and maintenance experience.
Preferred Qualifications
Excellent logical analysis skills, with the ability to sensibly abstract and decompose business logic; a strong sense of responsibility, good learning and communication abilities, self-motivation, and good team spirit.
Good documentation principles and habits, with the ability to write and update workflow and technical documentation on time as required.
Experience in the operation and maintenance of large-scale distributed ML systems.
Experience in operation and maintenance of GPU servers.
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.