Senior Site Reliability Engineer - Remote

1 week ago


Singapore ClickHouse Full time

**About ClickHouse**:
We are the company behind the popular open-source, high-performance columnar OLAP database management system for real-time analytics. ClickHouse works 100-1000x faster than traditional approaches. By offering a true column-based DBMS, it allows systems to generate reports from petabytes of raw data with sub-second latencies. With an amazing community already adopting our open-source technology, we are now embracing our journey in delivering cloud-first solutions to delight our customers.

With top adopters such as Lyft, Cisco, and eBay - not only do our products work at lightning speed, so do we.

We are an open and collaborative company. Our colleagues are curious, engaged and excited about what they do. If you want to work in an environment where you can learn, grow, be an agent of change and have your voice heard - then please read on.

We are committed to providing our customers with reliable and secure services, so we are building out our newly formed Site Reliability Engineering team. As one of the first joiners to our Reliability Engineering Team at ClickHouse, you will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of the cloud infrastructure that runs ClickHouse databases. You will collaborate with teams such as Control Plane, Dataplane, Core, Security, Support and Operations, and guide them in designing and implementing scalable, secure, highly available and fault-tolerant distributed systems. You will also own incident management and response, post-mortem analysis (including running blameless postmortems), and continuous improvement of our ClickHouse services. You will leverage your software engineering expertise to develop software platforms and tools that optimize the operational and engineering efficiency of ClickHouse Cloud. This role is a unique opportunity to make a significant impact on our elastic, limitless-scale, high-performance, serverless ClickHouse Cloud.

**What will you do?**
- Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
- Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane and ClickHouse Core) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
- Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
- Continuously improve the reliability and performance of our ClickHouse services.
- Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
- Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

**About you**:

- Bachelor's or Master's degree in Computer Science or a related field.
- At least 8 years of experience in Site Reliability Engineering or a related field.
- Previous experience using ClickHouse in production.
- Hands-on experience with Go and/or Python.
- Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
- Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
- Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
- Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
- You are a strong problem solver and have solid production debugging skills.
- You are passionate about efficiency, availability, scalability, and data governance.
- You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward.
- You have a high level of responsibility, ownership, and accountability.
- Excellent communication and interpersonal skills.

**Compensation**:
This role offers cash compensation and a stock options grant. For roles based in the **United States**, you can find above our typical starting salary ranges for this role, depending on your specific location.

**Perks**:

- **Flexible work environment** - ClickHouse is a distributed company offering remote-first work to all employees
- **Healthcare** - Employer contributions towards your healthcare.
- **Equity in the company** - Every new team member who joins our company receives stock options.
- **Time off** - Flexible time off in the US, generous entitlement in all countries.
- **A $500 home office setup** if you're a remote employee.
- **Employee-driven international mobility**:

**Culture - We All Shape It**

As part of our first 200 employees, you will be instrumental


