Site Reliability Engineer

Singapore ByteDance Full time

[About ByteDance]
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Helo, and Resso, as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.

[About the Team]
The Datacenter Infrastructure Engineering team supports the company's fast growth by building and operating hyperscale datacenters. The team manages the end-to-end lifecycle of the server fleet and provides cloud solutions and various infrastructure services, ensuring that they are scalable and reliable.

[Responsibilities]
As the [Site Reliability Engineer - Infrastructure Engineering], you would be responsible for at least one, if not all, of these areas:
**Infrastructure**:

- Build, expand and operate global infrastructures, including large-scale systems in public and private clouds, data centers and content delivery networks.
- Build tools, automations, visualizations and monitors to facilitate the operation and optimization of the global infrastructure.
- Help improve the whole lifecycle of infrastructure services, from inception and design through development, deployment, user support and refinement.
- Support the production environment end to end by responding to performance and reliability issues and participating in on-call rotations.

**Security**:

- Conduct security reviews of core corporate and production infrastructure.
- Carry out security updates and protect enterprise infrastructure at the system and network level.
- Drive enterprise-focused security improvements to products and services.
- Build security tools and processes for critical infrastructure protection, monitoring and remediation.

**Traffic**:

- Build tools, automations, visualizations and monitors to facilitate the operation and optimization of the traffic infrastructure.
- Provide primary operational support and engineering for traffic infrastructure systems.
- Gather and analyze metrics to assist in performance tuning and fault finding.

[Minimum Qualifications]
- Bachelor’s degree in Computer Science or equivalent with 3+ years of relevant experience.
- Experience in one or more programming languages such as Java, Python, C++, or Go, or scripting experience in Shell and Python.
- Ability to thrive in a fast-paced environment.
- Relevant experience working in a datacenter or in a large-scale, high-traffic infrastructure environment.

As a Site Reliability Engineer with the Infrastructure Engineering team, you would also be expected to be an expert in at least one, if not all, of these areas:
**Infrastructure**:

- Experience working with cloud infrastructure.
- Experience in building solutions with AWS, Google, Azure and other cloud services.
- Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.
- Experience working with Unix/Linux systems, from kernel to shell and beyond.
- Experience working with system libraries, file systems, and client-server protocols.
- Experience in designing, analyzing, and building automation and tools for large scale systems.
- Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment.

**Security**:

- Experience in network security, such as DDoS and WAF protection.
- Experience with security protocols such as TLS, including protocol features and updates.
- Experience with VPNs and building encrypted communication channels.
- Experience conducting infrastructure security reviews and patching potential security vulnerabilities.
- Experience in one or more programming languages such as Java, C++, Go, or scripting experience in Shell and Python.

**Traffic**:

- Experience working with traffic systems, from CDNs to load balancers and beyond.
- Experience working with network devices, remote management systems, and client-server protocols.
- Knowledge of network infrastructure and/or routing.
- Experience with Layer 4 / Layer 7 load balancers.
- Knowledge of protocols like TCP/IP, HTTP, RPC, TLS, etc.
- Experience working with containerized environments.
- Experience in one or more programming languages such as Java, C++, Go, or scripting experience in Shell and Python.


