Semi-Structured Data: How to work with nested JSON in Snowflake

As the amount of data in the world continues to grow, the formats in which it is stored have also evolved. 
Gone are the days of strictly structured data. Instead we are now dealing more and more with semi-structured formats such as XML and JSON. 
These formats, while providing more flexibility and expressiveness, can also make working with the data more challenging. 
With APIs outputting JSON data, it's essential to have a robust and efficient way to handle this data. 
Snowflake offers a solution to this problem by providing an easy-to-use way for working with semi-structured and nested semi-structured data. 
In this post, we will explore how to use Snowflake to work with JSON data, using one of my favorite topics - Star Wars - as an example. 
Join me as we dive into the world of Star Wars characters and discover how to handle this data.

Before we can start working, we first need to create a new database to hold our data. 

In Snowflake, we can use the following command to create a new database:
With our database in place, the next step is to create a new schema specifically for our Star Wars data. 
We use this command:

Now that we have our schema set up, we need data to work with. 

I have shared a file on my Azure Blob Storage that contains JSON data, which we will use for our examples. 
https://patricksdata.blob.core.windows.net/starwars/AllSWCharacters.json


This data comes from an API created by Yoann Cribier, which is available on the GitHub page.

https://github.com/akabab/starwars-api

To access the data from the Azure Blob Storage, we first need to create a stage in Snowflake. 

A stage is a location where data files are temporarily stored before they are loaded into a table. 
Stages can be used to store data files in various formats, such as CSV, JSON, Avro, and Parquet.
Stages serve as a buffer between the data files and the Snowflake tables, allowing for the data to be loaded and transformed before it is inserted into the final table. 
Stages also enable the data files to be compressed, encrypted, and partitioned to optimize performance and reduce costs.

In this case, we will create a stage for our Azure Blob Storage:

To confirm that the stage has been created successfully, we can use the following command:

To see a list of the files that are stored on the stage, we use:

Now that we have our data on the stage, we need to create a file format that Snowflake can use to read the JSON files. 

To create a file format we use the this code:

We confirm that the file format has been created successfully:

Since we have our file format ready now, we create a table in the new database to hold our JSON data.

The next command creates a table named "CHARACTERS" within the "STARWARS" schema. 

The "RAW" column is of type VARIANT, which allows us to store JSON data in a single column without having to define the schema beforehand. 
This allows for more flexibility when working with the data, as you don't have to worry about the structure of the data changing over time.

This command will show the content of the table, as the table is empty it will show only the header.

Next we can use the COPY INTO command to load the data from the stage into the table, using the file format that we created earlier. 

The following code loads the data from the stage into the table, using our file format.

After the data is loaded, we can use the SELECT statement again to query the data and check the output. 
Note that Snowflake sorts the data alphabetically, so the output may look a bit different from the original JSON data.


Since that the data is now loaded into the table, we can use the functions that Snowflake provides to query the data and extract the information that we need.

The first query extracts specific columns from the JSON data (feel free to add more columns from the file):


The next query uses the FLATTEN function to extract the nested array of affiliations from the JSON data:


The last query combines the first two queries to extract all the columns and affiliations in one query:


These are just a few examples of the types of queries that you can run on semi-structured data using Snowflake. 

The functions provided by Snowflake give you a lot of flexibility when working with semi-structured data, and allow you to extract and manipulate the data in a way that is meaningful to you.

But there is a small problem with our query above.
All rows without an affiliation gets lost (id 47-50).
Thats why we create this query with a CTE (Common table expression) and a LEFT JOIN.


Since we dont lose data now, we create a view on top of this code:

Lastly we run the select * to have a look at the finished normalized data.


In conclusion, we have shown how to work with semi-structured JSON data in Snowflake, using the functions provided by the platform. 

We have loaded the data from an Azure Blob Storage into a Snowflake table and used the FLATTEN function to extract the nested array of affiliations. 

We also addressed the problem of losing rows with no affiliations by using a CTE and a LEFT JOIN. 

Additionally, we created a view to simplify querying the data.

Click on the chart button of the query result window.


As a fun fact, we can see that most of the characters in our JSON data are affiliated with the Galactic Republic. 

And as a Star Wars fan, I have to say that the Star Wars Universe never gets old and is always a great topic to explore.

With Snowflake, we can easily work with the data and discover new insights about the characters, factions, and events in the Star Wars universe.

Thank you for following along, and I'll see you next time for maybe more data exploration with Snowflake?


Greetings

Patrick :)




Comments

Popular posts from this blog

Streaming Data with Kafka in Python: A small walkthrough

Data Transfer from Azure SQL to Snowflake: My Experience Using Slowly Changing Dimensions, Change Data Capture, and Column Data Masking with Fivetran