Sign in
Log inSign up
Array data in BigQuery

Array data in BigQuery

Mr. Yap's photo
Mr. Yap
·Feb 28, 2020

A walk through of working with semi-structured data (ingesting JSON, Array data types) inside BigQuery.

BigQuery is Google's fully managed, NoOps, low cost analytics database.

With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator.

BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights instead of taking care of server, connectivity and backup. Hence, the word NoOps.

NoOps (no operations) is the concept that an IT environment can become so automated and abstracted from the underlying infrastructure that there is no need for a dedicated team to manage software in-house.[link]

Normally in SQL you will have a single value for each row like this list of fruits below:

0.png

If you wanted a list of fruit items for each person at the store, it could look something like this:

0.png

In traditional relational database SQL, you would look at the repetition of names and immediately think to split the above table into two separate tables: Fruit Items (raspberry, blackberry, strawberry, cherry, orange, apple) and People. (Sally, Frederick) in our example.

This is a common approach for transactional databases like mySQL. The process is called normalization going from one table to many.

For data warehousing, data analysts often go the reverse direction. Bringing many separate tables into one large reporting table, know as denormalization.

A potential issues if you stored all your data in one giant table, the table row size could be too large for traditional reporting databases

The array data type!

A different approach that stores data at different levels of granularity all in one table using repeated fields.

0.png

What looks strange about the previous table?

  • It's only two rows.
  • There are multiple field values for Fruit in a single row.
  • The people are associated with all of the field values.

An easier way to interpret the Fruit array:

0.png

Both of these tables are exactly the same. There are two key learning here:

  • An array is simply a list of items in brackets [ ]
  • BigQuery visually displays arrays as flattened. It simply lists the value in the array vertically (note that all of those values still belong to a single row)

A real-life example

0.png

In the above results, 2 rows are returned - one for each day (2017-07-31 & 2017-08-01)

Arrays are called REPEATED fields in BigQuery. BigQuery natively supports arrays. Array values must share a data type - either all strings or all numbers. No mixing together.

YES

['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit_array

NO

['raspberry', 'blackberry', 'strawberry', 'cherry', 1234567] AS fruit_array