
Let’s talk Architecture: Limits of Configuration-driven Ingestion Pipelines

Alexander Gilevich

from EPAM (Spain)

About speaker

Programming-as-a-Passion, Architecture-as-a-Job

Data Solutions Architect and Tech Lead with a passion for programming and for well-architected, highly available and scalable applications. 12+ years of production experience in IT. Certified Databricks/Azure/AWS Solutions Architect Expert.

About speaker's company

EPAM Systems, Inc. is an American company that specializes in software engineering services, digital platform engineering, and digital product design. Since 1993, EPAM has helped customers digitally transform their businesses through a unique blend of world-class software engineering, design and consulting services. EPAM is a founding member of the MACH Alliance.

Abstracts

broad

Need to continuously ingest data from numerous disparate and/or non-overlapping data sources and then merge them together into one huge knowledge graph to deliver insights to your end users?

Pretty cool, huh? And what about multi-tenancy, mirroring access policies and data provenance? Perhaps, incremental loading of data? Or monitoring the current state of ingestion in a highly-decoupled distributed microservices-based architecture?

In my talk I will tell you our story: it all started with a simple idea of building connectors, and we ended up building fully configurable and massively scalable data ingestion pipelines that deliver disparate pieces of data into a single data lake for later decomposition and digestion in a multi-tenant environment. All this while allowing customers and business analysts to create and configure their own ingestion pipelines in a friendly, bespoke pipeline designer, with each pipeline building block being a separate decoupled microservice (think Airflow, AWS Step Functions and Azure Logic Apps). Furthermore, we'll touch on such aspects as choreography vs orchestration, incremental loading strategies, ingestion of access control policies (ABAC, RBAC, ACLs), parallel data processing, how frameworks can help in the implementation of cross-cutting concerns, and even briefly talk about the benefits of knowledge graphs.


Building a distributed, highly available system is a huge undertaking that requires a lot of upfront design and consideration.

One such system, which this talk considers, is an ingestion platform that:
a) ingests data from a set of external data sources
b) enables end users to configure managed data pipelines to ingest data of arbitrary type and shape
c) is microservice-based
d) transfers data securely and reliably from on-premises environments or from another cloud

The discussion consists of two major parts: why we set out to build such a system and how we accomplished it.

The first part revolves around the key business requirements that drove the majority of our decisions. Our clients were primarily from the engineering domain, so the data that we had to ingest can be described as technical requirements and specifications for building complex machinery or equipment (e.g. Part A is contained within Part B which is contained within Part C, can tolerate temperatures in the range [-30°; +40°], is of size X x Y x Z and of weight N kg).
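
To make the example above concrete, here is a minimal sketch of how such part specifications and their containment relations could be held as a small graph. It is an illustration only, not the platform's actual data model: the names `Part`, `PartGraph`, `add_part` and `ancestors` are hypothetical, and all numeric values except the [-30; +40] temperature range are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Part:
    """A hypothetical part record; field names are illustrative only."""
    name: str
    min_temp: float  # lower bound of the tolerated temperature range
    max_temp: float  # upper bound of the tolerated temperature range
    size: tuple[float, float, float]  # X x Y x Z
    weight_kg: float


@dataclass
class PartGraph:
    parts: dict[str, Part] = field(default_factory=dict)
    # Edge "child is contained within parent": child name -> parent name.
    contained_in: dict[str, str] = field(default_factory=dict)

    def add_part(self, part: Part, parent: str | None = None) -> None:
        self.parts[part.name] = part
        if parent is not None:
            self.contained_in[part.name] = parent

    def ancestors(self, name: str) -> list[str]:
        """Follow containment edges upwards, nearest container first."""
        chain: list[str] = []
        seen = {name}  # guards against accidental cycles in the input
        current = self.contained_in.get(name)
        while current is not None and current not in seen:
            chain.append(current)
            seen.add(current)
            current = self.contained_in.get(current)
        return chain


graph = PartGraph()
# Made-up sample values, except the temperature range of Part A.
graph.add_part(Part("C", -40.0, 60.0, (900.0, 600.0, 400.0), 120.0))
graph.add_part(Part("B", -30.0, 50.0, (300.0, 200.0, 100.0), 15.0), parent="C")
graph.add_part(Part("A", -30.0, 40.0, (50.0, 40.0, 10.0), 0.8), parent="B")

print(graph.ancestors("A"))  # ['B', 'C']
```

Storing containment as explicit edges is what makes questions such as "which assemblies contain Part A?" a simple traversal, which is the kind of query a knowledge graph is meant to answer.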

What stands out the most is that the system had to be entirely configuration-driven, meaning that end users with very little knowledge of the platform had to be able to configure the behavior of the data pipelines themselves in a user-friendly UI. While this requirement alone significantly complicates the design, it was crucial in our case, as we had a number of customers with their own processes, requirements and policies which dictated how to process and manage their data. This in turn led us to the second requirement: extensibility. We had to make sure that the system could be extended to support new use cases and mostly proprietary data sources. And because data is often considered to be one of the most valuable assets of any data-driven company, we had to ensure that we could securely access and transfer the data from one cloud to another.

The second part of the talk sheds light on the implementation of these requirements. Having three types of components (connector, operator and uploader) allowed us to build a system that can be extended with new implementations of these components. Implementing these components as relatively small microservices allowed us to develop and maintain them independently in a very scalable and agile fashion. Microservices also allowed us to scale the platform in and out at arbitrary points in time, so that it can process data with unknown churn patterns and is ready to ingest potentially huge volumes of data. These are just a few examples of the architectural decisions that we had to make.
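
The idea of a configuration-driven pipeline assembled from connectors, operators and uploaders can be sketched roughly as follows. This is a simplified, in-process illustration under stated assumptions, not the platform's implementation: in the system described above each building block is a separate microservice, whereas here each one is a plain function. The registry names, the configuration keys (`type`, `options`) and the sample components are all hypothetical.

```python
def static_connector(options):
    """Connector: reads records from a source (here, an in-memory list)."""
    return iter(options["records"])


def rename_operator(options):
    """Operator: builds a function that renames one field of a record."""
    src, dst = options["from"], options["to"]

    def apply(record):
        out = dict(record)  # copy, so the input record is not mutated
        out[dst] = out.pop(src)
        return out

    return apply


def list_uploader(options):
    """Uploader: delivers records to a sink (here, a Python list)."""
    return options["sink"].append


# Registries map the type name used in configuration to an implementation.
# Supporting a new source or transformation means registering one more entry.
CONNECTORS = {"static": static_connector}
OPERATORS = {"rename": rename_operator}
UPLOADERS = {"list": list_uploader}


def run_pipeline(config):
    """Assemble a pipeline from configuration and run it once."""
    read = CONNECTORS[config["connector"]["type"]]
    operators = [
        OPERATORS[op["type"]](op.get("options", {}))
        for op in config.get("operators", [])
    ]
    upload = UPLOADERS[config["uploader"]["type"]](config["uploader"]["options"])

    count = 0
    for record in read(config["connector"]["options"]):
        for operator in operators:
            record = operator(record)
        upload(record)
        count += 1
    return count


sink = []
config = {
    "connector": {
        "type": "static",
        "options": {"records": [{"name": "A", "weight_kg": 0.8},
                                {"name": "B", "weight_kg": 15.0}]},
    },
    "operators": [{"type": "rename", "options": {"from": "name", "to": "part"}}],
    "uploader": {"type": "list", "options": {"sink": sink}},
}
processed = run_pipeline(config)
```

Because the pipeline is described purely as data, a UI such as a pipeline designer only has to produce a configuration like `config`; the extensibility requirement is then met by adding new entries to the registries rather than changing `run_pipeline`.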

At the end of the talk, I will say a bit about what we had to do with all the ingested data and which alternatives we considered before deciding to build such a system ourselves.

The talk was accepted to the conference program

Other talks on this topic

Photo
Dismantling Big Data with DuckDB

Yoav Nordmann

Tikal Knowledge

specific
Photo
Cloud Costs with ClickHouse and OpenCost

Denys Kondratenko

Altinity

specific
Photo
Azure cloud architecture for high availability and low latency

Florian Lenz

neocentric GmbH - Azure Cloud Developer / Architect

specific
Photo
Open Source Ecosystem for ClickHouse on Kubernetes

Denys Kondratenko

Altinity

specific
Photo
Architectures that we can use with .NET

Alexej Sommer

Capgemini

broad
Photo
Beyond Caching: Valkey's Advanced Data Structures in Action

Viktor Vedmich

Amazon Web Services

specific
Photo
The Art of Decision Making: Balancing Trade-Offs in Software Architecture

Florian Lenz

neocentric GmbH - Azure Cloud Developer / Architect

broad
Photo
Blending Product Thinking with Architecture

Joel Tosi

Dojo and Co

broad
Photo
The simplest way to build resilient applications

Francesco Guardiani

Restate GmbH

broad
Photo
Achieving True Layered Separation with Hexagonal Architecture in Spring Boot

Adrian Kodja

softgarden e-recruiting GmbH

specific
Photo
Serverless First Mindset: seize opportunities, know your limits and experience real success stories

Florian Lenz

neocentric GmbH - Azure Cloud Developer / Architect

specific
Photo
Using Heterogeneous Computing in Databases

Aleksandr Borgardt

OtterStax

specific
Photo
Federate it! Limits of GraphQL-based architectures.

Alexander Gilevich

EPAM

specific
Photo
The Anatomy of a Distributed JavaScript Runtime

Peter van Vliet

Masking Technology

broad
Photo
Writing a TSDB from Scratch: Performance Optimization

Roman Khavronenko

VictoriaMetrics

specific
Photo
Mastering Software Design: Best Practices for Building Robust Applications

Ambesh Singh

Visionet Systems Deutschland

broad
Photo
The forgotten broker-less message queue

Aivars Kalvans

Ebury

specific
Photo
Exploring the Tradeoffs of Event-Driven Architecture in Microservices

Florian Lenz

neocentric GmbH - Azure Cloud Developer / Architect

specific
Photo
Organizational Sustainability with Platform Engineering

Lesley Cordero

The New York Times

specific
Photo
Mindset by Design: Transforming How You Build Software

Mihaela-Roxana Ghidersa

Signant Health

broad
Photo
ML/AI in the cloud - State of the Art in 2025

Federico Fregosi

OpsGuru

broad
Photo
Just Use Postgres for Everything

Giorgi Dalakishvili

Space International

specific
Photo
REST or gRPC: Best practices for modern architectures

Kristina Kraljić

PIS d.o.o.

specific