Senior Data/ML Engineer

5 days ago


Remote, Singapore BetterData Pte Ltd Full time


**Who Are We Looking For**:
This role requires someone familiar with the **dynamic nature of a startup**, capable of rapidly designing and implementing scalable solutions. You'll work closely with research teams to optimize performance and ensure seamless integration of systems, handling data from **financial institutions, government agencies, consumer brands, and internet companies**.

**Key Responsibilities**:
**Strong understanding of ML concepts and algorithms**:
Practical experience running models in production alongside AI and data science teams, turning their research code into scalable, production-ready systems.

**Data Ingestion & Integration**:
Ingest data from **enterprise relational databases** such as **Oracle**, **SQL Server**, **PostgreSQL**, and **MySQL**, as well as **enterprise SQL-based data warehouses** like **Snowflake**, **BigQuery**, **Redshift**, **Azure Synapse**, and **Teradata** for large-scale analytics.

**Data Validation & Quality Assurance**:
Ensure ingested data conforms to predefined **schemas**, checking data types, missing values, and field constraints.

Implement **data quality checks** for nulls, outliers, and duplicates to ensure data reliability.
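
To make the checks above concrete, here is a minimal sketch in plain Python. The helper name `run_quality_checks`, the list-of-dicts record layout, and the 1.5 × IQR outlier rule are illustrative assumptions rather than anything the posting specifies; a real pipeline would more likely use a validation library such as Pandera.

```python
from statistics import quantiles


def run_quality_checks(rows, key, numeric_field):
    """Count nulls, duplicate keys, and IQR outliers in a list of dict records.

    Hypothetical helper for illustration only.
    """
    # Nulls: records where the numeric field is missing or None.
    null_count = sum(1 for r in rows if r.get(numeric_field) is None)

    # Duplicates: records whose key value has already been seen.
    seen, duplicate_count = set(), 0
    for r in rows:
        if r[key] in seen:
            duplicate_count += 1
        seen.add(r[key])

    # Outliers: values more than 1.5 * IQR beyond the quartiles (assumed rule).
    values = [r[numeric_field] for r in rows if r.get(numeric_field) is not None]
    outliers = []
    if len(values) >= 4:
        q1, _, q3 = quantiles(values, n=4)
        spread = 1.5 * (q3 - q1)
        outliers = [v for v in values if v < q1 - spread or v > q3 + spread]

    return {"nulls": null_count, "duplicates": duplicate_count, "outliers": outliers}
```

The function only reports findings; whether a null or an outlier should fail the pipeline or just be logged is a policy decision left to the caller.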

**Data Transformation & Processing**:
**Design scalable data pipelines** for **batch processing**, deciding between **distributed computing** tools like **Spark**, **Dask**, or **Ray** when handling extremely large datasets across multiple nodes, and **single-node tools** like **Polars** and **DuckDB** for more lightweight, efficient operations. The choice will depend on the size of the data, system resources, and performance requirements.

**Leverage Polars** for high-speed, in-memory data manipulation when a large dataset still fits in memory on a single node.

**Utilize DuckDB** for on-disk query execution, offering SQL-like operations with minimal overhead, suitable for environments that need a balance between memory use and query performance.

Seamlessly transform **Pandas-based research code** into **production-ready pipelines**, ensuring efficient memory usage and fast data access without adding unnecessary complexity.

**Data Storage & Retrieval**:
Work with internal data representations such as **Parquet**, **Arrow**, and **CSV** to support the needs of our generative models, choosing the appropriate format based on **data processing and performance needs**.

**Distributed Systems & Scalability**:
Ensure that the system can **scale efficiently from a single node to multiple nodes**, providing **graceful scaling** for users with varying compute capacities.

Optimize **SQL-based queries** for performance and scalability in **enterprise SQL environments**, ensuring efficient querying across large datasets.

**GPU Acceleration & Parallel Processing**:
Utilize **GPU acceleration** and **parallel processing** to improve performance in large-scale model training and data processing.

**Data Lineage & Metadata Management (Reduced Emphasis)**:
Implement **basic data lineage** for auditability, ensuring traceability in data transformations when required.

Manage metadata as needed to document pipelines and workflows.

**Error Handling, Recovery, & Performance Monitoring**:
Design robust **error handling** mechanisms, with **automatic retries** and **data recovery** in case of pipeline failures.

Track performance metrics such as **data throughput**, **latency**, and **processing times** to ensure efficient pipeline operations at scale.
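
As a rough sketch of what automatic retries with basic timing might look like, the snippet below wraps a pipeline step in exponential backoff using only the standard library. The name `run_with_retries`, its parameters, and the shape of the returned metrics are assumptions made for illustration, not a description of the team's actual tooling.

```python
import time


def run_with_retries(step, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `step()` with retries and exponential backoff.

    Returns (result, metrics). Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            elapsed = time.perf_counter() - start
            # Attempt count and wall-clock time are the simplest metrics to track.
            return result, {"attempts": attempt, "seconds": elapsed}
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. A production version would typically retry only on specific, transient exception types rather than on every `Exception`.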

**Documentation & Reporting**:
Create clear **documentation** of data pipelines, workflows, and system architectures to enable smooth handovers and collaboration across teams.

**Essential Skills and Qualifications**:
**High Priority**:
Hands-on experience **scaling data pipelines** and **machine learning systems** to handle **hundreds of millions to billions of rows** in enterprise environments.

4+ years of experience building scalable data solutions with **Python** and libraries such as:
**Data Science Libraries**: **Pandas**, **NumPy**, **Scikit-learn**.

**Scaling Libraries**: **Polars** for in-memory processing and **DuckDB** for efficient on-disk queries.

Ability to **choose the right framework** (e.g., **Dask**, **Ray**, **Polars**, **DuckDB**) depending on the workload and environment, with a focus on balancing simplicity and scalability.

Experience in **data validation** and ensuring data quality with tools like **Pandera** or **Pydantic**.

Proficiency in building **ETL/ELT pipelines** and managing data across **relational databases**, **data warehouses**, and **cloud storage**.

Strong knowledge of **GPU parallelization** for deep learning models using **PyTorch**.

**Good to Have**:
Experience with logging and monitoring in production environments.

Understanding of **data lineage** and **metadata management** systems to support data transparency.

Famil


