Why?
For several reasons, professional and personal as well as technical:
- To apply all the data skills I acquired at Oktopus S.A., and then at Edda Luxembourg S.A. after our merger.
- To demonstrate Infra/DevOps skills that I have never had the chance to apply professionally in a project.
- To serve as demonstration and training material for all the tools presented.
As mentioned, the platform in this series of articles serves as a demonstration, with minimal configuration, to show that you can develop on your machine before deploying to a dev/test/quality/prod environment.
Another post, following this series, will showcase a secure data platform with centralized access rights management.
How?
Speaking of tools, I will present and use the following technologies:
- PostgreSQL, a classic and performant relational database
- MinIO, an S3-compatible object store that runs on-premises and stands in for Amazon S3
- cert-manager to issue TLS certificates automatically, and trust-manager to distribute the trust chain automatically
- Traefik, a reverse proxy with dynamic configuration
- A container image registry, needed by some of the tools
- Airbyte, for ingesting data from disparate sources (included mainly to try the tool out)
- Nessie, as the metastore for the Iceberg data
- Trino, as the query engine
- Dagster, for orchestrating dataset pipelines
- OpenMetadata, as the data catalog
- Superset, for building dashboards
As this list shows, I have deliberately chosen:
- containers and Kubernetes clusters, to make the applications easier to deploy and scale
- open-source or semi-open-source tools, to ensure flexibility and scalability
- a lakehouse architecture, to separate storage from compute, which improves the performance and scalability of analytic workloads
Use cases
To demonstrate this platform, I'll use two use cases. The first uses movie data from IMDb, which is freely available as CSV files. The second uses structured data recording the status of the Chargy-brand EV charging stations in Luxembourg. I have been storing a snapshot every 5 minutes for a year and a half. The Postgres data I provide covers only one month, as the full dataset is too large to ship in a publicly available Docker image.
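To give a rough sense of scale for the second use case, here is a minimal sketch that counts how many 5-minute snapshot runs fit in a given period. The 30-day month and the 548-day approximation of "a year and a half" are my assumptions for illustration, and the function name is hypothetical. The result is a number of snapshot runs, not a number of rows, since the number of stations captured per run is not stated here.

```python
from datetime import timedelta

# Snapshot interval taken from the text above: one snapshot every 5 minutes.
SNAPSHOT_INTERVAL = timedelta(minutes=5)


def snapshots_in(period: timedelta, interval: timedelta = SNAPSHOT_INTERVAL) -> int:
    """Return how many whole snapshot intervals fit in `period`."""
    # Floor division of two timedeltas yields an int.
    return period // interval


per_day = snapshots_in(timedelta(days=1))
per_month = snapshots_in(timedelta(days=30))      # assumption: a 30-day month
full_history = snapshots_in(timedelta(days=548))  # assumption: roughly 1.5 years

print(per_day, per_month, full_history)
```

Under these assumptions, the one-month extract holds roughly one eighteenth of the snapshot runs in the full history, which is why it is much easier to distribute.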
Let’s get started!
- Part 1: the infrastructure