From Raw Data to Readiness: The Hero’s Journey of Data Integration


When people talk about integrating data into a data warehouse, it often sounds simple: identify the data source, connect to it, load the data, and 💥boom💥, everything is available. But as anyone who has ever worked with data knows, reality is considerably more complex. Over the past few years I have integrated many different datasets for clients as a data specialist, and every new case calls for an individual approach that combines technical understanding with sound business decisions.

For this article we use a randomly generated sample dataset based on a real-world scenario (in this case a simplified credit card account structure). The data is fictional, but it reflects typical challenges that appear across industries. This lets us explain the most important checks and pitfalls without disclosing any confidential information.

Where Does the Data Come From?

Before we start building a pipeline, it is crucial to understand how the data is delivered. The sources vary widely and include Excel or CSV files, APIs, database exports, or files in cloud storage. But beyond the format, we also need to know how often the data is updated, how its structure (schema) changes over time, and whether new deliveries overwrite existing content or append to it. These characteristics determine how we design the architecture and monitoring needed for a reliable process.

Recognizing these characteristics allows us to build systems that are not only functional but also robust.

The first step, however, is to understand what kind of data we are dealing with, because that influences how we process and monitor the data and which solution we build. The short sketch after the list below shows one way to make these characteristics explicit.

  1. Batch data arrives at regular intervals (e.g. daily reports or weekly exports) and is processed in consistent chunks.
  2. One-off datasets, often legacy loads or ad hoc files, appear only once and therefore need to be checked especially thoroughly.
  3. Streaming data flows in continuously in real time; it requires robust, responsive pipelines that can also cope with fluctuations in volume and latency.
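As a minimal, purely illustrative sketch (all field names and values here are assumptions, not part of the sample dataset), such a source profile could be captured alongside the pipeline code:

```python
from dataclasses import dataclass

@dataclass
class SourceProfile:
    """Illustrative description of how a data source is delivered (all fields assumed)."""
    name: str
    file_format: str        # e.g. "csv", "parquet", "api"
    delivery: str           # "batch", "one-off" or "streaming"
    update_frequency: str   # e.g. "daily", "weekly", "n/a"
    load_mode: str          # "append" or "overwrite"
    schema_stable: bool     # does the structure change between deliveries?

# Hypothetical profile for the simulated credit card transactions
transactions_source = SourceProfile(
    name="PurchaseTransaction",
    file_format="csv",
    delivery="batch",
    update_frequency="daily",
    load_mode="append",
    schema_stable=True,
)
```

Writing this down once makes it much easier to decide later how the pipeline should ingest, monitor, and reload each source.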

The Blueprint: Mapping Data Relationships

When working with multiple data tables, creating an Entity Relationship Diagram (ERD) is essential. It helps clarify how tables relate to each other:

  • what the keys are
  • how cardinality is managed
  • what dependencies exist

Picture 1: Entity Relationship Diagram (ERD) showing the relationships between the data and data structures

This particular ERD reflects the simulated credit card system used for our analysis, but the concept applies universally. An ERD improves communication, documents assumptions, and reduces implementation errors.
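One lightweight way to keep an ERD actionable is to record its keys and cardinalities in code and let them drive later checks. The table names below follow our simulated schema, while the AccountID and TransactionID key columns are assumptions made for illustration:

```python
# Hypothetical key/relationship map derived from the ERD (some names assumed)
RELATIONSHIPS = [
    # (child table, foreign key, parent table, primary key, cardinality)
    ("CreditCardAccount", "CustomerID", "Customer", "CustomerID", "many_to_one"),
    ("PurchaseTransaction", "AccountID", "CreditCardAccount", "AccountID", "many_to_one"),
    ("PurchaseTransaction", "RetailerID", "Retailer", "RetailerID", "many_to_one"),
]

PRIMARY_KEYS = {
    "Customer": "CustomerID",
    "CreditCardAccount": "AccountID",        # assumed key column
    "PurchaseTransaction": "TransactionID",  # assumed key column
    "Retailer": "RetailerID",
}
```

Keeping this map next to the pipeline code documents the assumptions and can feed the join and validation steps later in this post.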

Metadata, on the other hand, is the silent hero in this context, the invisible framework that holds everything together. While often overlooked, it is foundational to data reliability. It gives us context: what a column means, how values are encoded, and whether the data follows a particular standard. Without metadata, even the cleanest dataset becomes a puzzle. Well-documented metadata prevents misinterpretation, streamlines pipeline development, and saves time during debugging or scaling. In real-world terms, metadata bridges the gap between data producers and data consumers by creating a common language everyone can rely on.

Column Name     | Table               | Data Type | Description                        | Nullable | Source System
CustomerID      | Customer            | Integer   | Unique identifier for a customer   | No       | CRM Export
TransactionDate | PurchaseTransaction | Date      | Date of the transaction            | No       | Payment Processor
Balance         | CreditCardAccount   | Float     | Current account balance            | Yes      | Finance System
AccountStatus   | CreditCardAccount   | String    | Status of the credit card account  | No       | Internal
RetailerID      | Retailer            | Integer   | ID of the retailer                 | No       | Merchant Feed
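Once metadata like the table above is available in machine-readable form, it can drive simple automated checks. The following sketch assumes the metadata has been exported to a CSV with the same columns as the table; the file names are hypothetical:

```python
import pandas as pd

# Hypothetical paths; in practice the metadata might live in a data catalog
metadata = pd.read_csv("metadata.csv")            # columns: ColumnName, Table, DataType, Nullable, ...
transactions = pd.read_csv("purchase_transaction.csv")

# Check presence and nullability rules for the columns that belong to this table
for _, rule in metadata[metadata["Table"] == "PurchaseTransaction"].iterrows():
    col = rule["ColumnName"]
    if col not in transactions.columns:
        print(f"Missing expected column: {col}")
    elif rule["Nullable"] == "No" and transactions[col].isnull().any():
        print(f"Unexpected nulls in non-nullable column: {col}")
```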

The Superpower Unlocked: Flattening and Exploring the Data

When dealing with normalized data, where information is spread across a fact table and multiple dimension tables, it’s crucial to understand the relationships and how they are maintained. These structures are often not obvious from the fact table alone and can only be fully understood by referencing metadata and sampling the data. That’s why, before beginning any in-depth analysis, I flatten the structure by denormalizing the tables into a single, unified fact table. This allows me to perform more complex analysis more efficiently, especially in Python.
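In pandas, this flattening step is essentially a chain of left joins from the fact table onto its dimension tables. The sketch below is a minimal version for our simulated schema; the file names and the AccountID join key are assumptions, and the validate argument doubles as a cardinality check against the ERD:

```python
import pandas as pd

transactions = pd.read_csv("purchase_transaction.csv")   # fact table
accounts = pd.read_csv("credit_card_account.csv")        # dimension
customers = pd.read_csv("customer.csv")                  # dimension
retailers = pd.read_csv("retailer.csv")                  # dimension

# Left-join each dimension onto the fact table; validate= guards the expected cardinality
flat = (
    transactions
    .merge(accounts, on="AccountID", how="left", validate="many_to_one")
    .merge(customers, on="CustomerID", how="left", validate="many_to_one")
    .merge(retailers, on="RetailerID", how="left", validate="many_to_one")
)
print(flat.shape)
```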

With a denormalized dataset in hand, the next step is to explore the data. This step is about establishing trust: validating that the data is complete, consistent, and meaningful.

Data exploration helps us identify gaps, such as missing values, inconsistent entries, or unbalanced distributions. It also allows us to test hypotheses and gain an intuitive understanding of what the dataset is saying before we move into structured analysis or machine learning.

Picture 2: Transaction Amount Distribution: This histogram visualises how often specific transaction values appear in the dataset.

Statistical checks like value distributions and frequency counts reveal whether the data behaves as expected. One way to see this in action on our sample dataset is through a histogram showing the frequency of different transaction amounts.
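Continuing with the flattened DataFrame flat from the sketch above, a histogram like the one in Picture 2 takes only a few lines with matplotlib; the TransactionAmount column name is an assumption:

```python
import matplotlib.pyplot as plt

# Distribution of transaction values (column name assumed)
plt.hist(flat["TransactionAmount"].dropna(), bins=50)
plt.xlabel("Transaction amount")
plt.ylabel("Frequency")
plt.title("Transaction Amount Distribution")
plt.show()
```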

Picture 3: Monthly Transactions vs. Average Amount Spent.

In Picture 3, we see that while the number of transactions remains relatively consistent over time, the average amount spent fluctuates significantly. This inconsistency may reflect different customer usage patterns, promotions, or seasonal influences. But it can also signal potential issues with the data, such as a misplaced decimal point or a formatting error during ingestion.
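A comparison like the one in Picture 3 can be reproduced with a simple monthly aggregation; TransactionID and TransactionAmount are assumed column names, while TransactionDate comes from the metadata table above:

```python
import pandas as pd

# Ensure the date column is a real datetime before grouping by month
flat["TransactionDate"] = pd.to_datetime(flat["TransactionDate"])

monthly = (
    flat
    .groupby(flat["TransactionDate"].dt.to_period("M"))
    .agg(n_transactions=("TransactionID", "count"),
         avg_amount=("TransactionAmount", "mean"))
)
print(monthly)
```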

Picture 4: Null Value Summary Heatmap

The heatmap in Picture 4 shows the percentage of null or missing values across each column in the denormalized dataset. Null values directly impact both data integrity and data usability. From a data integrity perspective, nulls can indicate system or process gaps in how the data was collected, transferred, or ingested. They might reflect inconsistencies in business and/or system processes. For example, consistently missing values in ContactInfo could point to a misconfigured CRM export or a permissions issue blocking that data.

By checking for nulls at the start, we safeguard the reliability of any conclusions drawn from the data. This proactive review helps avoid time-consuming rework later and builds confidence in the data’s readiness for production use.
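The numbers behind a null heatmap like Picture 4 are straightforward to compute; the seaborn call below is one common way to visualise the pattern, again using the flattened DataFrame from earlier:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Share of missing values per column, highest first
null_share = flat.isnull().mean().sort_values(ascending=False)
print(null_share)

# Visual pattern of missing values across rows and columns
sns.heatmap(flat.isnull(), cbar=False)
plt.title("Null Value Summary")
plt.show()
```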

The Investigator’s Toolkit: Summaries That Guide Us

Before we commit to using any dataset, we run key summaries to assess its quality, structure, and readiness for downstream use. These checks are not just about validating content. They’re about confirming that the dataset reflects reality, aligns with expectations, and is ready for production use. Here’s a structured checklist we follow during this validation phase, with a short pandas sketch after each group showing what the checks can look like in practice:

Structural & Content Checks

  • Row count is consistent across time or source files
  • All expected columns are present
  • Critical fields (e.g. IDs, timestamps) have acceptable fill rates
  • Primary keys are unique (no duplicates)
  • Foreign key references are valid and resolvable
  • Null values are scanned and their impact is assessed
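As a rough illustration, some of the structural checks above could look like this in pandas, reusing the tables from the flattening sketch (TransactionID and the expected row count are assumptions):

```python
# Row count against a known reference (hypothetical expectation)
expected_rows = 100_000
assert len(flat) > 0, "Dataset is empty"
print(f"Row count: {len(flat)} (expected roughly {expected_rows})")

# Primary key uniqueness on the fact table (TransactionID assumed)
duplicate_keys = transactions["TransactionID"].duplicated().sum()
print(f"Duplicate transaction IDs: {duplicate_keys}")

# Foreign key resolvability: every CustomerID on an account must exist in Customer
unresolved = ~accounts["CustomerID"].isin(customers["CustomerID"])
print(f"Unresolvable customer references: {unresolved.sum()}")

# Fill rates for critical fields
print(flat[["CustomerID", "TransactionDate"]].notnull().mean())
```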

Distribution & Logic Checks

  • Numeric values are within expected ranges and show reasonable distribution
  • Categorical values match known domain lists (e.g., statuses, types)
  • Timestamps are complete, sequential, and reasonable (no future dates or gaps)
  • Outliers are identified and explained
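The distribution and logic checks can be sketched just as briefly; the allowed status values and the non-negativity rule for amounts are assumptions made for this example:

```python
import pandas as pd

# Numeric ranges and basic distribution
print(flat["TransactionAmount"].describe())
print(f"Negative amounts: {(flat['TransactionAmount'] < 0).sum()}")

# Categorical values against a known domain list (allowed values assumed)
allowed_statuses = {"ACTIVE", "BLOCKED", "CLOSED"}
unexpected = set(flat["AccountStatus"].dropna().unique()) - allowed_statuses
print(f"Unexpected account statuses: {unexpected}")

# Timestamps: no future dates, and a quick look at monthly coverage
assert flat["TransactionDate"].max() <= pd.Timestamp.today(), "Future-dated transactions found"
print(f"Distinct months covered: {flat['TransactionDate'].dt.to_period('M').nunique()}")
```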

Business Context & Relevance

  • Grouped summaries (by region, customer, product) align with known figures
  • Data has sufficient coverage across business segments
  • All expected time periods are included (especially in time series datasets)
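Finally, the business context checks usually come down to comparing grouped summaries with figures the business already knows; the reference total and the reporting window below are placeholders:

```python
import pandas as pd

# Grouped totals per retailer, to be compared with known figures (placeholder reference)
known_total = 1_250_000.00  # hypothetical figure from the finance team
per_retailer = flat.groupby("RetailerID")["TransactionAmount"].sum()
print(per_retailer.head())
print(f"Overall total: {flat['TransactionAmount'].sum():.2f} vs. expected {known_total:.2f}")

# Coverage: every expected month should be present in the time series (window assumed)
expected_months = pd.period_range("2023-01", "2023-12", freq="M")
missing_months = set(expected_months) - set(flat["TransactionDate"].dt.to_period("M").unique())
print(f"Missing months: {sorted(str(m) for m in missing_months)}")
```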

These checks don’t just help uncover issues; they also enable data-driven decisions about whether the dataset is ready for analytics, reporting, or machine learning. Performing these validations early ensures trust in the data and avoids rework later in the project lifecycle.

The Plot Twist: It’s Not Just Yes or No

It’s tempting to approach data intake with a binary mindset: either the data is good enough, or it’s not. But in reality, the answer isn’t that simple. Our role isn’t just to say yes or no; it’s to do the homework, dig into the details, and understand what’s really there.

That means going beyond a surface check and looking at completeness, consistency, and statistical behaviour. It means assessing the volume of data, how often it updates, and how reliable the inputs are. And, most importantly, it means aligning everything with what the business actually wants to do.

We also need to remain aware of the broader context: tight timelines, evolving goals, and what it would mean to base a product or model on incomplete or misleading data. Pulling in data isn’t just a technical task; it’s a strategic decision. And that’s why these early checks matter so much.

In the end, it’s not just about whether the data exists; it’s about whether it can be trusted, used effectively, and built upon.

Final Thoughts

Start strong, stay curious, and don’t skip the foundations. That’s how we turn raw data into something powerful.
