Tech Internals Conf is the largest conference for developers of complex and high-load systems

Let’s talk Architecture: Limits of Configuration-driven Ingestion Pipelines

Alexander Gilevich

from EPAM (Spain)

About speaker

Programming-as-a-Passion, Architecture-as-a-Job

Data Solutions Architect and Tech Lead with a passion for programming and well-architected highly-available and scalable applications. 12+ years of production experience in IT. Certified Databricks/Azure/AWS Solutions Architect Expert.

About speaker's company

EPAM Systems, Inc. is an American company that specializes in software engineering services, digital platform engineering, and digital product design. Since 1993, EPAM has helped customers digitally transform their businesses through a unique blend of world-class software engineering, design and consulting services. EPAM is a founding member of the MACH Alliance.

Abstracts

broad

Need to continuously ingest data from numerous disparate and/or non-overlapping data sources and then merge them together into one huge knowledge graph to deliver insights to your end users?

Pretty cool, huh? And what about multi-tenancy, mirroring access policies and data provenance? Perhaps, incremental loading of data? Or monitoring the current state of ingestion in a highly-decoupled distributed microservices-based architecture?

In my talk I will tell you our story: it all started with a simple idea of building connectors, and we ended up building fully configurable and massively scalable data ingestion pipelines that deliver disparate pieces of data into a single data lake for later decomposition and digestion in a multi-tenant environment. All this while allowing customers and business analysts to create and configure their own ingestion pipelines in a bespoke, user-friendly pipeline designer, with each pipeline building block being a separate decoupled microservice (think Airflow, AWS Step Functions and Azure Logic Apps). Furthermore, we'll touch on such aspects as choreography vs orchestration, incremental loading strategies, ingestion of access control policies (ABAC, RBAC, ACLs), parallel data processing, how frameworks can help implement cross-cutting concerns, and even briefly talk about the benefits of knowledge graphs.

Building a distributed, highly available system is a huge undertaking that requires a lot of upfront design and consideration.

One such system, which will be considered in this talk, is an ingestion platform that:
a) ingests data from a set of external data sources
b) enables end users to configure managed data pipelines to ingest data of arbitrary type and shape
c) is microservice-based
d) securely and reliably transfers data from on-premises environments or from another cloud

The discussion consists of two major parts: why we set out to build such a system and how we accomplished it.

The first part revolves around the key business requirements that drove the majority of our decisions. Our clients were primarily from the engineering domain, so the data we had to ingest can be described as technical requirements and specifications for building complex machinery or equipment (e.g. Part A is contained within Part B, which is contained within Part C, can tolerate temperatures in the range [-30°; +40°], is of size X x Y x Z and of weight N kg).
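The containment example above maps naturally onto graph edges. The following is a minimal sketch only, assuming a plain dictionary as the edge store; the names `contained_in`, `properties` and `containers_of` are hypothetical and not taken from the platform described in the talk.

```python
# Hypothetical edge store: each part points to the part that directly contains it.
contained_in = {"Part A": "Part B", "Part B": "Part C"}

# Hypothetical node properties; the abstract gives the range but not the unit.
properties = {"Part A": {"min_temp": -30, "max_temp": 40}}


def containers_of(part, edges):
    """Return every part that contains `part`, nearest container first."""
    chain = []
    seen = set()
    # The `seen` set guards against cycles in malformed input data.
    while part in edges and part not in seen:
        seen.add(part)
        part = edges[part]
        chain.append(part)
    return chain


if __name__ == "__main__":
    print(containers_of("Part A", contained_in))
```

Walking the edges answers transitive questions such as "which assemblies does Part A end up in?" without storing the indirect relationship explicitly.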

What stands out the most is that the system had to be entirely configuration-driven, meaning that end users with very little knowledge of the platform had to be able to configure the behavior of the data pipelines themselves in a user-friendly UI. While this requirement alone significantly complicates the design, it was crucial in our case, as we had a number of customers with their own processes, requirements and policies that dictated how their data had to be processed and managed. This led us to the second requirement: extensibility. We had to make sure that the system could be extended to support new use cases and data sources, most of them proprietary. And because data is often considered one of the most valuable assets of any data-driven company, we had to ensure that we could securely access and transfer the data from one cloud to another.
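To make "configuration-driven" more concrete, here is a minimal sketch of how a pipeline definition produced by a UI might be validated before it runs. Everything in it is an assumption for illustration: the config shape, the `KNOWN_KINDS` registry, the kind names and the `validate_pipeline` function are hypothetical; only the three roles (connector, operator, uploader) come from the abstract itself.

```python
# Hypothetical registry of component kinds available on the platform.
KNOWN_KINDS = {
    "connector": {"rest_api", "sql_database"},
    "operator": {"filter", "rename_fields"},
    "uploader": {"data_lake"},
}


def validate_pipeline(config):
    """Return a list of human-readable errors; an empty list means valid."""
    steps = config.get("steps", [])
    if len(steps) < 2:
        return ["a pipeline needs at least a connector and an uploader"]
    errors = []
    for index, step in enumerate(steps):
        # Assumed layout: connector first, uploader last, operators in between.
        if index == 0:
            expected = "connector"
        elif index == len(steps) - 1:
            expected = "uploader"
        else:
            expected = "operator"
        role, kind = step.get("role"), step.get("kind")
        if role != expected:
            errors.append(f"step {index}: expected role '{expected}', got '{role}'")
        elif kind not in KNOWN_KINDS[role]:
            errors.append(f"step {index}: unknown {role} kind '{kind}'")
    return errors


if __name__ == "__main__":
    pipeline = {
        "steps": [
            {"role": "connector", "kind": "sql_database"},
            {"role": "operator", "kind": "filter"},
            {"role": "uploader", "kind": "data_lake"},
        ]
    }
    print(validate_pipeline(pipeline))
```

Keeping the registry separate from the validation logic is one way to serve the extensibility requirement: supporting a new proprietary source would mean registering a new connector kind rather than changing the validator.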

The second part of the talk sheds light on how these requirements were implemented. Having three types of components (connector, operator and uploader) allowed us to build a system that can be extended with new implementations of these components. Implementing the components as relatively small microservices allowed us to develop and maintain them independently in a scalable and agile fashion. Microservices also allowed us to scale the platform in and out at arbitrary points in time, so that it can process data with unknown churn patterns and is ready to ingest potentially huge volumes of data. These are just a few examples of the architectural decisions that we had to make.
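The three component types can be pictured as three small interfaces. The sketch below is an in-process illustration under stated assumptions, not the platform's actual design: in the system described here each component is a separate microservice, whereas this example chains plain Python objects. The method names (`read`, `apply`, `upload`), the example classes and `run_pipeline` are all hypothetical.

```python
from typing import Iterable, Iterator, List, Protocol

Record = dict  # hypothetical record shape


class Connector(Protocol):
    def read(self) -> Iterator[Record]: ...


class Operator(Protocol):
    def apply(self, records: Iterable[Record]) -> Iterator[Record]: ...


class Uploader(Protocol):
    def upload(self, records: Iterable[Record]) -> int: ...


class ListConnector:
    """Toy connector that reads records from an in-memory list."""

    def __init__(self, rows: List[Record]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class AddTenant:
    """Toy operator that tags every record with a tenant identifier."""

    def __init__(self, tenant: str) -> None:
        self.tenant = tenant

    def apply(self, records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            yield {**record, "tenant": self.tenant}


class InMemoryUploader:
    """Toy uploader that stores records in a list and reports the count."""

    def __init__(self) -> None:
        self.stored: List[Record] = []

    def upload(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.stored.append(record)
            count += 1
        return count


def run_pipeline(connector: Connector, operators: List[Operator], uploader: Uploader) -> int:
    """Chain connector -> operators -> uploader; return the number of records uploaded."""
    stream: Iterable[Record] = connector.read()
    for operator in operators:
        stream = operator.apply(stream)
    return uploader.upload(stream)


if __name__ == "__main__":
    sink = InMemoryUploader()
    run_pipeline(ListConnector([{"id": 1}, {"id": 2}]), [AddTenant("acme")], sink)
    print(sink.stored)
```

Because each role exposes a single narrow method, a new data source or transformation only has to satisfy one interface, which is the kind of extension point the three-component split is meant to provide.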

At the end of the talk, I will say a bit about what we had to do with all the ingested data and which alternatives we considered before deciding to build such a system ourselves.

The talk was declined
