
Site Reliability Engineer
2 weeks ago
Responsibilities
TikTok will be prioritising applicants who have a current right to work in Singapore, and do not require TikTok's sponsorship of a visa.
TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. TikTok has global offices including Los Angeles, New York, London, Paris, Berlin, Dubai, Singapore, Jakarta, Seoul and Tokyo.
Why Join Us
At TikTok, our people are humble, intelligent, compassionate and creative. We create to inspire - for you, for us, and for more than 1 billion users on our platform. We lead with curiosity and aim for the highest, never shying away from taking calculated risks and embracing ambiguity as it comes. Here, the opportunities are limitless for those who dare to pursue bold ideas that exist just beyond the boundary of possibility. Join us and make impact happen with a career at TikTok.
About the Team
The Machine Learning (ML) System team combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world.
In our team, you'll have the opportunity to build a large-scale heterogeneous system that integrates GPU, RDMA and storage, and to keep it running stably and reliably. You will deepen your expertise in coding, performance analysis and distributed systems, and be involved in the decision-making process. You'll also be part of a global team with members from the United States, China and Singapore working collaboratively towards a unified project direction.
**Responsibilities**:
1. Responsible for ensuring our internal systems are operating efficiently for model development, training and deployment;
2. Responsible for resource management and planning, as well as cost and budget, covering computing and storage resources;
3. Responsible for global system disaster recovery, cluster machine governance, stability of business services, resource utilisation improvement and operation efficiency improvement;
4. Build software products and systems to monitor and manage the ML infrastructure and services;
5. Be part of the global team roster that provides on-call support for systems and business services.
**Qualifications**:
1. Bachelor's degree or above in Computer Science, Computer Engineering or a related field;
2. Strong proficiency in at least one programming language such as C++, Go, Python or Shell in a Linux environment;
3. Strong hands-on experience with Kubernetes and containers, and more than 1 year of relevant operations and maintenance experience;
4. Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly;
5. Good documentation principles and habits, with the ability to write and update workflow and technical documentation on time as required;
6. A strong sense of responsibility, good learning ability, communication skills, self-drive and team spirit.
**Preferred**:
1. Experience in the operation and maintenance of large-scale distributed systems;
2. Experience in the operation and maintenance of GPU servers.
TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
-
Site Reliability Engineer
7 days ago
Singapore IDEMIA Full time
Purpose: This role plays a critical part in ensuring reliability, scalability, and performance of our systems and services. You will work closely with development and...
-
Site Reliability Engineer
2 weeks ago
Singapore IDEMIA Full time
Purpose: This role plays a critical part in ensuring reliability, scalability, and performance of our systems and services. You will work closely with development and...
-
Site Reliability Engineer
2 weeks ago
Singapore beBeeSiteReliability Full time $90,000 - $120,000
Unlock Your Full Potential in Site Reliability Engineering
About the Role: This is an exciting opportunity to work with a global banking institution, leveraging your skills in production management and site reliability engineering to drive business growth. Develop and implement proactive, predictive models for shift production management using SRE...
-
Site Reliability Engineer
1 week ago
Singapore beBeeSiteReliability Full time
Unlock Your Full Potential in Site Reliability Engineering
About the Role: This is an exciting opportunity to work with a global banking institution, leveraging your skills in production management and site reliability engineering to drive business growth. Develop and implement proactive, predictive models for shift production management using SRE...
-
Site Reliability Engineer
5 days ago
Singapore DHATCH CONSULTANCY PTE. LTD. Full time
Site Reliability Engineer: **Preferred Qualifications** - 3+ years of experience in site reliability engineering, DevOps, or software engineering roles. - Proven skills in: - Monitoring & alerting tools (Grafana, New Relic) - CI/CD pipelines (Git, Jenkins, GitHub Actions, etc.) - Container orchestration (Docker, Kubernetes) - Infrastructure-as-code...
-
Site Reliability Engineer
1 week ago
Singapore HCLTech Full time
This role combines software and systems engineering to build, run, and maintain high-performance, distributed, fault-tolerant and resilient financial systems. Site Reliability Engineers focus on ensuring a joyful customer journey. As a Site Reliability Engineer you will be filling a...
-
Site Reliability Engineer
7 days ago
Singapore HCLTech Full time
This role combines software and systems engineering to build, run, and maintain high-performance, distributed, fault-tolerant and resilient financial systems. Site Reliability Engineers focus on ensuring a joyful customer journey. As a Site Reliability Engineer you will be filling a...
-
Site Reliability Engineer
2 weeks ago
Singapore Tardis Group Full time
About the Company: A rapidly growing technology firm operating at the forefront of artificial intelligence and advanced software solutions. The company fosters a fast-paced, collaborative, and innovation-driven culture, uniting talent across...
-
Site Reliability Engineer
1 week ago
North-East Singapore PERSOLKELLY Full time
The Site Reliability Engineer is responsible for ensuring the reliability, scalability, and efficiency of our systems and infrastructure. This role involves monitoring, troubleshooting, and resolving issues to maintain optimal performance. The engineer will also collaborate with cross-functional teams to automate processes and improve system reliability....