Flattening SQL Tables: Navigating Complex Data Structures
Reshape and Organize Your SQL Tables
In contrast to practice datasets, the data you encounter in real-world scenarios as a data engineer, scientist, or analyst often presents significant challenges.
Beyond having missing values, duplicates, and other undesirable elements, your data might also have a complicated structure.
For instance, my first professional project involved processing a data source that yielded nested JSON objects.
Since that initial experience nearly two years ago, I have engaged with and developed numerous nested data sources for end users.
For those working with data upstream, such as data engineers, nested data is often beneficial.
Properly ingested nested data is efficiently stored and can help to prevent duplication.
However, this format can complicate matters for downstream users.
For example, if a data analyst needs to utilize your table in a complex query that involves multiple joins or WITH clauses, the need to UNNEST() values adds an extra layer of complexity.
Thus, to prevent passing this challenge downstream, you might need to create a flattened version of your source table.
While the SQL for this process is essentially an extension of SELECT col FROM table, the unnest operations can grow quite intricate.
To demonstrate how to flatten a SQL table, I will utilize the BigQuery public dataset the_met, specifically focusing on the vision_api_data table.
Understanding the RECORD Type
In contrast to my prior discussions on complex SQL data types, which focused solely on nested records, this guide will show how to merge nested values with their flattened versions to create a more intuitive and user-friendly table.
This makes the vision_api_data table an ideal subject for analysis, as evidenced by its schema:
If you've followed my previous writings on complex data types, you'll be familiar with the RECORD type.
To clarify: a RECORD indicates a nested column and can hold columns of the same type, as illustrated by the faceAnnotations field:
A RECORD field can also include varying data types, including other RECORD fields (known as nested records):
These RECORD types can become quite intricate, as evident when examining the schema for imagePropertiesAnnotation:
Understanding how the data is organized is crucial before you can effectively transform it, as will be illustrated below.
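One way to inspect this organization without clicking through the BigQuery UI is to query the dataset's INFORMATION_SCHEMA.COLUMN_FIELD_PATHS view, which lists every field path, including fields inside RECORD columns. The query below is a minimal sketch; it assumes you have permission to query metadata for the public dataset.

```sql
-- List each field path in vision_api_data together with its data type.
-- REPEATED RECORD columns show up as ARRAY<STRUCT<...>>.
SELECT
  field_path,
  data_type
FROM
  `bigquery-public-data`.the_met.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  table_name = 'vision_api_data'
ORDER BY
  field_path;
```

Paths such as cropHintsAnnotation.cropHints make it easier to see which levels are arrays (and need UNNEST()) and which are plain structs (reachable with dot notation).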
UNNEST and Further UNNEST
By the end of this guide, you will have seen the UNNEST() function many times.
Repetitive as it is, UNNEST() is often the best (and sometimes the only practical) way to access the fields of a REPEATED column.
To begin, let's address one of the least nested columns.
I will examine labelAnnotations, a RECORD of REPEATED mode.
It's important to note that when using the UNNEST() function, you can assign an alias to the column, similar to how you would with a table in a JOIN.
Additionally, remember to include a comma between the source table and the UNNEST() call.
The comma is shorthand for a CROSS JOIN between each source row and the elements of that row's own array.
SELECT
  l_a.properties,
  l_a.locations,
  l_a.boundingPoly,
  l_a.topicality,
  l_a.confidence,
  l_a.score,
  l_a.description,
  l_a.locale,
  l_a.mid
FROM
  bigquery-public-data.the_met.vision_api_data,
  UNNEST(labelAnnotations) AS l_a
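To see what the comma join does to individual rows, here is a small self-contained sketch. The sample CTE and its values are made up for illustration and are not part of the_met dataset. It compares the comma (CROSS JOIN) form with a LEFT JOIN, which is worth knowing about because the comma form silently drops rows whose array is empty.

```sql
-- Hypothetical inline data: object 1 has two labels, object 2 has none.
WITH sample AS (
  SELECT 1 AS object_id, ['statue', 'bronze'] AS labels
  UNION ALL
  SELECT 2 AS object_id, ARRAY<STRING>[] AS labels
)
-- Comma join: one row per array element, so object 2 disappears.
SELECT 'comma join' AS join_style, object_id, label
FROM sample, UNNEST(labels) AS label
UNION ALL
-- LEFT JOIN: object 2 is kept, with a NULL label.
SELECT 'left join' AS join_style, object_id, label
FROM sample LEFT JOIN UNNEST(labels) AS label
ORDER BY join_style, object_id;
```

The comma join contributes two rows (both for object 1), while the LEFT JOIN contributes three. If every source row must survive flattening, the LEFT JOIN form may be the safer choice.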
Just like any other type, we can apply functions to manipulate un-nested columns. For instance, I can use ROUND() on the score column to refine the output.
SELECT
  l_a.properties,
  l_a.locations,
  l_a.boundingPoly,
  l_a.topicality,
  l_a.confidence,
  ROUND(l_a.score, 2) AS score,
  l_a.description,
  l_a.locale,
  l_a.mid
FROM
  bigquery-public-data.the_met.vision_api_data,
  UNNEST(labelAnnotations) AS l_a
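Now that we’ve un-nested the labelAnnotations column, let's add two columns from the source table, object_id and logoAnnotations, alongside the un-nested fields. Note that logoAnnotations is selected as-is, without UNNEST().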
SELECT
  object_id,
  logoAnnotations,
  l_a.properties,
  l_a.locations,
  l_a.boundingPoly,
  l_a.topicality,
  l_a.confidence,
  ROUND(l_a.score, 2) AS score,
  l_a.description,
  l_a.locale,
  l_a.mid
FROM
  bigquery-public-data.the_met.vision_api_data,
  UNNEST(labelAnnotations) AS l_a
Next, we will attempt to flatten a more deeply nested column.
cropHintsAnnotation has four layers of nesting.
Because it includes ARRAY types within STRUCT types, we cannot simply un-nest each record. We must utilize both the un-nest function and dot notation to access these nested records.
Similar to the un-nesting of labelAnnotations, we can assign an alias to the column after the un-nest operation; I will use “ch” to refer to fields within this RECORD (see the last line in the code snippet).
SELECT
  object_id,
  logoAnnotations,
  l_a.properties,
  l_a.locations,
  l_a.boundingPoly,
  l_a.topicality,
  l_a.confidence AS label_confidence,
  ROUND(l_a.score, 2) AS score,
  l_a.description,
  l_a.locale,
  l_a.mid,
  ch.importanceFraction,
  ch.confidence
FROM
  bigquery-public-data.the_met.vision_api_data,
  UNNEST(labelAnnotations) AS l_a,
  UNNEST(cropHintsAnnotation.cropHints) AS ch
It’s apparent that more than a simple un-nesting of cropHintsAnnotation is required. Since “importanceFraction” and “confidence” are contained within cropHints, which is an array inside cropHintsAnnotation, we need to access that array as part of the un-nesting process.
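The pattern of an array inside a struct can be reproduced on a tiny scale. In the sketch below, the sample CTE and its values are invented for illustration; only the field names mirror the ones used above. The outer column is a STRUCT, so it is reached with dot notation, and only the inner array is passed to UNNEST().

```sql
-- Hypothetical inline data shaped like cropHintsAnnotation:
-- a STRUCT that contains an ARRAY of STRUCTs.
WITH sample AS (
  SELECT
    1 AS object_id,
    STRUCT(
      [
        STRUCT(0.8 AS confidence, 1.0 AS importanceFraction),
        STRUCT(0.6 AS confidence, 0.5 AS importanceFraction)
      ] AS cropHints
    ) AS cropHintsAnnotation
)
SELECT
  object_id,
  ch.confidence,
  ch.importanceFraction
FROM
  sample,
  -- UNNEST(cropHintsAnnotation) would fail because a STRUCT is not an array;
  -- dot notation reaches the array that lives inside it.
  UNNEST(cropHintsAnnotation.cropHints) AS ch;
```

This returns one row per element of cropHints, two rows in this example.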
Next, we will focus on the final part of the cropHints object, boundingPoly, which holds vertex variables “x” and “y.”
SELECT
  object_id,
  logoAnnotations,
  l_a.properties,
  l_a.locations,
  l_a.boundingPoly,
  l_a.topicality,
  l_a.confidence AS label_confidence,
  ROUND(l_a.score, 2) AS score,
  l_a.description,
  l_a.locale,
  l_a.mid,
  ch.importanceFraction,
  ch.confidence,
  b_v.x,
  b_v.y
FROM
  bigquery-public-data.the_met.vision_api_data,
  UNNEST(labelAnnotations) AS l_a,
  UNNEST(cropHintsAnnotation.cropHints) AS ch,
  UNNEST(ch.boundingPoly.vertices) AS b_v
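Once vertices are flattened into separate rows, nothing in the output records which corner each row represents. If that order matters to your consumers, one option is UNNEST's WITH OFFSET clause, which exposes each element's position in its array. The sketch below uses invented inline data (the sample CTE and the vertex_index alias are assumptions, not part of the original query).

```sql
-- Hypothetical inline data: one crop hint whose bounding polygon has four vertices.
WITH sample AS (
  SELECT
    1 AS object_id,
    [
      STRUCT(
        0.8 AS confidence,
        STRUCT(
          [
            STRUCT(0 AS x, 0 AS y),
            STRUCT(10 AS x, 0 AS y),
            STRUCT(10 AS x, 5 AS y),
            STRUCT(0 AS x, 5 AS y)
          ] AS vertices
        ) AS boundingPoly
      )
    ] AS cropHints
)
SELECT
  object_id,
  ch.confidence,
  vertex_index,  -- zero-based position of the vertex within its array
  b_v.x,
  b_v.y
FROM
  sample,
  UNNEST(cropHints) AS ch,
  UNNEST(ch.boundingPoly.vertices) AS b_v WITH OFFSET AS vertex_index
ORDER BY
  vertex_index;
```

The second UNNEST() refers to ch, the alias produced by the first, which is how each level of nesting is chained to the one above it.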
Constructing the Final Flattened Table
While these query operations aid in performing un-nesting, we aim to format the result into something accessible for data consumers.
There are three primary formats in which an un-nested table can be made available to an end user:
- View
- Table
- Dashboard
To wrap up this guide, I will convert this query into a view that any hypothetical BigQuery user within our organization can access.
After creating the view using the BigQuery UI, it will appear as follows:
Once established, we can access the view like any standard table.
Note how the schema now lacks any nested values:
-- Access the view
SELECT * FROM your_project.sample_dataset.vision_api_flattened
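As an alternative to the BigQuery UI, a view like this could also be created with a DDL statement. The following is a sketch rather than the exact definition used above: it keeps only a subset of the columns, the crop_confidence alias is my own addition to avoid two columns named confidence, and your_project.sample_dataset is the same placeholder used in the query above.

```sql
-- Sketch: create (or overwrite) the flattened view with DDL.
-- Replace your_project.sample_dataset with a dataset you can write to.
CREATE OR REPLACE VIEW `your_project.sample_dataset.vision_api_flattened` AS
SELECT
  object_id,
  l_a.description,
  ROUND(l_a.score, 2) AS score,
  l_a.confidence AS label_confidence,
  ch.importanceFraction,
  ch.confidence AS crop_confidence,
  b_v.x,
  b_v.y
FROM
  `bigquery-public-data.the_met.vision_api_data`,
  UNNEST(labelAnnotations) AS l_a,
  UNNEST(cropHintsAnnotation.cropHints) AS ch,
  UNNEST(ch.boundingPoly.vertices) AS b_v;
```

Swapping CREATE OR REPLACE VIEW for CREATE OR REPLACE TABLE would materialize the result as a table instead, at the cost of storing the flattened rows.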
A Final Caution
While flattening a table can facilitate easier access to its contents, several risks must be considered before proceeding with such an operation.
When you unnest a RECORD type, you're not merely extracting a value from a container; you're fundamentally changing the structure of your data.
This can lead to duplicates and other extraneous data that may hinder analysis.
Flattening a table will typically increase the number of rows, as demonstrated by executing this query:
WITH un_nested AS (
  SELECT
    object_id,
    logoAnnotations,
    l_a.properties,
    l_a.locations,
    l_a.boundingPoly,
    l_a.topicality,
    l_a.confidence AS label_confidence,
    ROUND(l_a.score, 2) AS score,
    l_a.description,
    l_a.locale,
    l_a.mid,
    ch.importanceFraction,
    ch.confidence,
    b_v.x,
    b_v.y
  FROM
    bigquery-public-data.the_met.vision_api_data,
    UNNEST(labelAnnotations) AS l_a,
    UNNEST(cropHintsAnnotation.cropHints) AS ch,
    UNNEST(ch.boundingPoly.vertices) AS b_v
),
nested AS (
  SELECT
    object_id,
    logoAnnotations,
    NULL AS properties,
    NULL AS locations,
    NULL AS boundingPoly,
    NULL AS topicality,
    NULL AS label_confidence,
    NULL AS score,
    NULL AS description,
    NULL AS locale,
    NULL AS mid,
    NULL AS importanceFraction,
    NULL AS confidence,
    NULL AS x,
    NULL AS y
  FROM
    bigquery-public-data.the_met.vision_api_data
)
SELECT COUNT(1) AS count, "un_nested" AS table
FROM un_nested
UNION ALL
SELECT COUNT(1) AS count, "nested" AS table
FROM nested
Note that the NULL placeholders give both CTEs the same column layout. Strictly speaking, the final UNION ALL only combines the two COUNT results, so it requires only that those two SELECT statements match.
This method lets us count the rows in both the un-nested query we wish to convert to the flattened table and the original table.
Alternatively, the un-nested count could have been run against the new flattened view we created.
Regardless, it's evident that there is a substantial difference in row counts:
The un-nested table adds over 2 million rows!
This is a crucial factor to consider when assessing the long-term usability and accessibility of your table.
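The growth follows directly from the cross joins: each additional UNNEST() multiplies the row count by the length of the array it expands. The sketch below illustrates the arithmetic on invented inline data (the sample CTE, its values, and the column aliases are assumptions for illustration only).

```sql
-- Hypothetical inline data: one object with 3 labels and 4 vertices.
WITH sample AS (
  SELECT
    1 AS object_id,
    ['statue', 'bronze', 'sculpture'] AS labels,
    [
      STRUCT(0 AS x, 0 AS y),
      STRUCT(10 AS x, 0 AS y),
      STRUCT(10 AS x, 5 AS y),
      STRUCT(0 AS x, 5 AS y)
    ] AS vertices
)
SELECT
  object_id,
  -- Product of the two array lengths for this object.
  ANY_VALUE(ARRAY_LENGTH(labels) * ARRAY_LENGTH(vertices)) AS expected_rows,
  -- Rows actually produced by flattening both arrays.
  COUNT(*) AS flattened_rows
FROM
  sample,
  UNNEST(labels) AS label,
  UNNEST(vertices) AS vertex
GROUP BY
  object_id;
```

Both columns come out to 12 (3 labels times 4 vertices) for a single source row. Every label is repeated once per vertex, which is also why aggregating label fields on a flattened table can overcount unless you deduplicate first.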
Despite the potential drawbacks, offering a flattened table can greatly assist data consumers who may not be as comfortable with nested data.
Ultimately, understanding nested data and knowing how to flatten a SQL table are essential skills in your SQL toolkit.