What if I built my own data platform?

Why?

For several reasons, professional and personal as well as technical.

To apply all the data skills I acquired at Oktopus S.A. and then at Edda Luxembourg S.A. after our merger.
To demonstrate Infra/DevOps skills that I have never had the chance to apply in a professional project.
To serve as demonstration and training material for all the tools presented.

As mentioned, the platform in this series of articles serves as a demonstration, with minimal configuration, to show that you can develop on your machine before deploying to a dev/test/quality/prod environment.

Another post, following this series, will showcase a secure data platform with centralized access rights management.

How?

Speaking of tools, I will present and use the following technologies:

Kubernetes

Kind, to set up a k8s dev/demo cluster
Helm, to deploy pre-packaged applications into this cluster
Helmfile, to store the configuration of Helm releases

Infra

PostgreSQL, a classic and performant database
MinIO, an on-premises object store with an S3-compatible API, used here in place of Amazon S3
cert-manager, to automatically issue SSL/TLS certificates, and trust-manager, to automatically distribute the trust chain
Traefik, a reverse-proxy with dynamic configuration
A container image registry, useful for some of the tools

Data

Airbyte, for ingesting data from disparate sources (included mainly to try the tool out)
Nessie, as a metastore for Iceberg data
Trino as a query engine
Dagster, for orchestrating dataset pipelines
OpenMetadata, to have a data catalog
Superset, to create dashboards

As this list shows, I have opted for:

containers and Kubernetes clusters, to make applications easier to deploy and scale
open-source or semi open-source tools, to ensure flexibility and scalability
a lakehouse architecture, to separate storage from compute resources, improving the performance and scalability of analytic workloads

Use cases

To demonstrate this platform, I'll use two use cases. The first takes movie data from IMDb, using freely available CSV files. The second uses structured data recording the status of Chargy-brand EV charging stations in Luxembourg. I have been storing snapshots every 5 minutes for a year and a half. The Postgres data I'm providing covers only one month, as the whole dataset is too big to store in a publicly available Docker image.
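
To give a rough sense of scale for the second use case, here is a back-of-the-envelope sketch in Python. The 30-day month and the 548-day approximation of a year and a half are assumptions made purely for illustration, and since the number of stations is not stated here, the counts are per station.

```python
# Rough per-station snapshot counts for a 5-minute sampling interval.
# Assumptions (illustrative only): a 30-day month, 548 days for ~1.5 years.

MINUTES_PER_DAY = 24 * 60
SNAPSHOT_INTERVAL_MINUTES = 5


def snapshots_per_station(days: int) -> int:
    """Number of snapshots recorded for one station over `days` days."""
    return days * MINUTES_PER_DAY // SNAPSHOT_INTERVAL_MINUTES


one_month = snapshots_per_station(30)
full_history = snapshots_per_station(548)

print(f"one month:    {one_month} snapshots per station")
print(f"full history: {full_history} snapshots per station")
```

Multiplying these figures by the number of stations gives an estimate of total rows. Under these assumptions the full history is roughly 18 times larger than the one-month extract, which is consistent with shipping only one month in the public image.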

Overview of my data platform

Let’s get started!

Links to the series' posts

1st part, the infra
