Site Reliability Engineer

2 weeks ago


Singapore NodeFlair Full time

**Job Summary**:
**Salary**
S$9,471 - S$18,942 / Monthly EST

**Job Type**
Permanent

**Seniority**

Mid

**Years of Experience**
At least 3 years

**Tech Stacks**
C++ Go Shell Linux Kubernetes Python
- The Machine Learning (ML) System team combines systems engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world.
- In our team, you'll have the opportunity to build large-scale heterogeneous systems that integrate GPU, RDMA and storage, and to keep them running stably and reliably. You'll enrich your expertise in coding, performance analysis and distributed systems, and be involved in the decision-making process. You'll also be part of a global team, with members from the United States, China and Singapore, working collaboratively towards a unified project direction.

**Responsibilities**:

- 1. Responsible for ensuring our internal systems are operating efficiently for model development, training and deployment;
- 2. Responsible for resource management and planning, cost and budget, including computing and storage resources;
- 3. Responsible for global system disaster recovery, cluster machine governance, stability of business services, resource utilisation improvement and operation efficiency improvement;
- 4. Build software products and systems to monitor and manage the ML infrastructure and services;
- 5. Be part of the global team roster that ensures system and business on-call support;
- 6. Research, design, and develop computer and network software or specialised utility programs;
- 8. Update software, enhance existing software capabilities, and develop and direct software testing and validation procedures;
- 9. Work with computer hardware engineers to integrate hardware and software systems and develop specifications and performance requirements;

**Qualifications**

- 1. Bachelor's degree or above, majoring in Computer Science, Computer Engineering or a related field;
- 2. At least 3 years of working experience;
- 3. Strong proficiency in at least one programming language, such as C++, Go, Python or Shell, in a Linux environment;
- 4. Strong hands-on experience with Kubernetes and containers, including more than 1 year of relevant operations and maintenance experience;
- 5. Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly;
- 6. Good documentation habits, with the ability to write and update workflow and technical documentation on time as required;
- 7. A strong sense of responsibility, good learning and communication skills, self-motivation and good team spirit;

**Preferred**

- 1. Experience in the operation and maintenance of large-scale distributed systems;
- 2. Experience in the operation and maintenance of GPU servers;

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.



  • Singapore IDEMIA Full time

    Purpose: This role plays a critical part in ensuring reliability, scalability, and performance of our systems and services. You will work closely with development and...


  • Singapore beBeeSiteReliability Full time $90,000 - $120,000

    Unlock Your Full Potential in Site Reliability Engineering. About the Role: This is an exciting opportunity to work with a global banking institution, leveraging your skills in production management and site reliability engineering to drive business growth. Develop and implement proactive, predictive models for shift production management using SRE...


  • Singapore DHATCH CONSULTANCY PTE. LTD. Full time

    Site Reliability Engineer: **Preferred Qualifications** - 3+ years of experience in site reliability engineering, DevOps, or software engineering roles. - Proven skills in: - Monitoring & alerting tools (Grafana, New Relic) - CI/CD pipelines (Git, Jenkins, GitHub Actions, etc.) - Container orchestration (Docker, Kubernetes) - Infrastructure-as-code...


  • Singapore HCLTech Full time

    This role combines software and systems engineering to build, run, and maintain highly performant, distributed, fault-tolerant and resilient financial systems. Site Reliability Engineers focus on ensuring a joyful customer journey. As a Site Reliability Engineer you will be filling a...


  • Singapore Tardis Group Full time

    About the Company: A rapidly growing technology firm operating at the forefront of artificial intelligence and advanced software solutions. The company fosters a fast-paced, collaborative, and innovation-driven culture, uniting talent across...


  • North-East Singapore PERSOLKELLY Full time

    The Site Reliability Engineer is responsible for ensuring the reliability, scalability, and efficiency of our systems and infrastructure. This role involves monitoring, troubleshooting, and resolving issues to maintain optimal performance. The engineer will also collaborate with cross-functional teams to automate processes and improve system reliability....