Approaches to Store Key-Value Pairs in Google BigQuery
Introduction
Google BigQuery is an enterprise data warehouse offered on Google Cloud Platform. In this post I cover two approaches I have tried for storing key-value pairs in a BigQuery table for my web analytics use case, along with my findings for each approach.
My requirement was to store key-value pairs in a BigQuery table so that the collected web page data can be easily extracted, filtered, and aggregated for analysis. The set of keys stored in the table is not fixed; users can store any key of their choice.
Data Modeling
Approach 1: Storing Key-Value Pairs as Nested Objects
In this approach, an ARRAY of STRUCTs is used to store the key-value pairs, where each STRUCT contains two strings, i.e. STRUCT<key STRING, value STRING>. The column data type is ARRAY<STRUCT<key STRING, value STRING>>.
Example:
[
  {"key": "key1", "value": "value1"},
  {"key": "key2", "value": "value2"},
  {"key": "key3", "value": "value3"},
  {"key": "key4", "value": "value4"}
]

An array of objects where each object has a key and a value field.
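As a quick sketch of this shape (not tied to any BigQuery API), a plain dict can be converted into the array-of-structs form, which is also the JSON layout you would use when loading data into such a column:

```python
import json

def to_kv_array(d):
    """Convert a dict into the ARRAY<STRUCT<key, value>> JSON shape."""
    return [{"key": k, "value": v} for k, v in d.items()]

# A hypothetical row with a kv_pairs column in the nested representation.
row = {"kv_pairs": to_kv_array({"key1": "value1", "key2": "value2"})}
print(json.dumps(row))
```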
Approach 2: Storing Key-Value Pairs as String
This is the simplest approach: all key-value pairs are stored in a single string, separated by delimiters.
Example:
key1=value1&key2=value2&key3=value3&key4=value4

'&' separates the pairs; '=' separates the key and value within a pair.
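A minimal Python sketch of this encoding and its inverse, assuming keys and values never contain '&' or '=' (real data containing the delimiters would need escaping, e.g. URL-encoding):

```python
def encode_kv(d):
    """Join key-value pairs into a single '&'-delimited string."""
    return "&".join(f"{k}={v}" for k, v in d.items())

def decode_kv(s):
    """Split the delimited string back into a dict."""
    return dict(pair.split("=", 1) for pair in s.split("&"))

s = encode_kv({"key1": "value1", "key2": "value2"})
print(s)  # prints "key1=value1&key2=value2"
```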
I created the table below for approach 2; the key-value pairs are stored as a string in the kv_pairs column of approach2_table.
Querying Data
Approach 1
In the query below, UNNEST is used to extract the key-values from the kv_pairs column. UNNESTing kv_pairs flattens the kv_pairs arrays so that the required key/value can be queried. For more details on how to use UNNEST, refer to this blog post.
This table has 15,289,214,877 rows for the date 2020-11-01.
The query took 255 seconds to execute.
bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT (SELECT value FROM UNNEST(kv_pairs) WHERE key = "isp_name"))
 FROM dataset.approach1_table
 WHERE session_date = "2020-11-01"'

Waiting on bqjob_xxxxxxxxx_000000000000_1 ... (255s) Current status: DONE
+-------+
|  f0_  |
+-------+
| 39630 |
+-------+
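What the correlated subquery does per row can be mimicked in Python: for each row, scan the kv_pairs array for the matching key, then take the distinct set of values. The rows below are made up purely for illustration:

```python
# Three hypothetical rows in the nested representation.
rows = [
    {"kv_pairs": [{"key": "isp_name", "value": "isp_a"},
                  {"key": "browser_language", "value": "en"}]},
    {"kv_pairs": [{"key": "isp_name", "value": "isp_b"}]},
    {"kv_pairs": [{"key": "isp_name", "value": "isp_a"}]},
]

def lookup(kv_pairs, key):
    """Equivalent of (SELECT value FROM UNNEST(kv_pairs) WHERE key = ...)."""
    return next((p["value"] for p in kv_pairs if p["key"] == key), None)

# Equivalent of COUNT(DISTINCT ...) over the rows.
distinct_isps = {lookup(r["kv_pairs"], "isp_name") for r in rows}
print(len(distinct_isps))  # prints 2
```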
Approach 2
In the query below, the REGEXP_EXTRACT function pulls the isp_name values out of the kv_pairs string, and the distinct count is calculated.
This table holds the same data as approach1_table, except that the kv_pairs column in approach2_table is a delimited string. Hence this table also has 15,289,214,877 rows for the date 2020-11-01.
The query took 109 seconds to execute.
bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT REGEXP_EXTRACT(kv_pairs, r"isp_name=([^&]*)"))
 FROM dataset.approach2_table
 WHERE session_date = "2020-11-01"'

Waiting on bqjob_xxxxxxxxx_000000000000_2 ... (109s) Current status: DONE
+-------+
|  f0_  |
+-------+
| 39630 |
+-------+
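The regular expression itself can be checked locally: ([^&]*) captures everything after the key name up to the next '&' (or the end of the string). The sample string here is made up:

```python
import re

kv_pairs = "browser_language=en&isp_name=some_isp&user_consent=yes"
m = re.search(r"isp_name=([^&]*)", kv_pairs)
print(m.group(1))  # prints "some_isp"
```

One caveat worth noting: an unanchored pattern like this would also match inside a longer key that ends with the same text (e.g. a hypothetical real_isp_name). Anchoring the pattern to the start of a pair, r"(?:^|&)isp_name=([^&]*)", avoids that.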
Querying Multiple Key Values
Approach 1
Querying and Aggregating TWO key-values
bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT (SELECT value FROM UNNEST(kv_pairs) WHERE key = "browser_language")),
        COUNT(DISTINCT (SELECT value FROM UNNEST(kv_pairs) WHERE key = "isp_name"))
 FROM dataset.approach1_table
 WHERE session_date = "2020-11-01"'

Waiting on bqjob_xxxxxxxxx_000000000000_3 ... (285s) Current status: DONE
+-----+-------+
| f0_ |  f1_  |
+-----+-------+
| 326 | 39630 |
+-----+-------+

Querying and Aggregating THREE key-values

bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT (SELECT value FROM UNNEST(kv_pairs) WHERE key = "browser_language")),
        COUNT(DISTINCT (SELECT value FROM UNNEST(kv_pairs) WHERE key = "isp_name")),
        COUNT(DISTINCT (SELECT value FROM UNNEST(kv_pairs) WHERE key = "user_consent"))
 FROM dataset.approach1_table
 WHERE session_date = "2020-11-01"'

Waiting on bqjob_xxxxxxxxx_000000000000_4 ... (375s) Current status: DONE
+-----+-------+-----+
| f0_ |  f1_  | f2_ |
+-----+-------+-----+
| 326 | 39630 |  1  |
+-----+-------+-----+
Approach 2
Querying and Aggregating TWO key-values
bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT REGEXP_EXTRACT(kv_pairs, r"browser_language=([^&]*)")),
        COUNT(DISTINCT REGEXP_EXTRACT(kv_pairs, r"isp_name=([^&]*)"))
 FROM dataset.approach2_table
 WHERE session_date = "2020-11-01"'

Waiting on bqjob_xxxxxxxxx_000000000000_5 ... (135s) Current status: DONE
+-----+-------+
| f0_ |  f1_  |
+-----+-------+
| 326 | 39630 |
+-----+-------+

Querying and Aggregating THREE key-values

bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT REGEXP_EXTRACT(kv_pairs, r"browser_language=([^&]*)")),
        COUNT(DISTINCT REGEXP_EXTRACT(kv_pairs, r"isp_name=([^&]*)")),
        COUNT(DISTINCT REGEXP_EXTRACT(kv_pairs, r"user_consent=([^&]*)"))
 FROM dataset.approach2_table
 WHERE session_date = "2020-11-01"'

Waiting on bqjob_xxxxxxxxx_000000000000_6 ... (194s) Current status: DONE
+-----+-------+-----+
| f0_ |  f1_  | f2_ |
+-----+-------+-----+
| 326 | 39630 |  1  |
+-----+-------+-----+
Adding New KVs by Updating Existing Row
Approach 1
Using the DML statement below, I updated a row to append a new key-value pair <"my_key", "my_value"> to the repeated array column kv_pairs.
The update statement took 254 seconds to execute.
bq query --use_legacy_sql=false \
'UPDATE dataset.approach1_table
 SET kv_pairs = ARRAY_CONCAT(kv_pairs, ARRAY<STRUCT<key STRING, value STRING>>[("my_key", "my_value")])
 WHERE session_date = "2020-11-01"
   AND session_id = 4216564460077758
   AND page_url = "/v1/payments/payment/PAYID=xxxxxx"'

Waiting on bqjob_xxxxxxxxx_000000000000_7 ... (254s) Current status: DONE
Number of affected rows: 1
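In list terms, the ARRAY_CONCAT here simply appends one more struct to the matched row's array. A sketch with made-up data:

```python
# A hypothetical row's kv_pairs array before the update.
kv_pairs = [{"key": "isp_name", "value": "some_isp"}]

# What ARRAY_CONCAT(kv_pairs, [STRUCT("my_key", "my_value")]) does.
kv_pairs = kv_pairs + [{"key": "my_key", "value": "my_value"}]
print(len(kv_pairs))  # prints 2
```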
Approach 2
Using the DML statement below, I updated a row to append a new key-value pair <"my_key", "my_value"> to the string column kv_pairs.
The update statement took 109 seconds to execute.
bq query --use_legacy_sql=false \
'UPDATE dataset.approach2_table
 SET kv_pairs = CONCAT(kv_pairs, "&my_key=my_value")
 WHERE session_date = "2020-11-01"
   AND session_id = 4216564460077758
   AND page_url = "/v1/payments/payment/PAYID=xxxxxx"'

Waiting on bqjob_xxxxxxxxx_000000000000_8 ... (109s) Current status: DONE
Number of affected rows: 1
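The string update is just a concatenation; appending "&my_key=my_value" keeps the string parseable by the same regular expressions used for querying. A sketch with a made-up value:

```python
import re

kv_pairs = "isp_name=some_isp&browser_language=en"
kv_pairs = kv_pairs + "&my_key=my_value"  # what the UPDATE's CONCAT does
print(re.search(r"my_key=([^&]*)", kv_pairs).group(1))  # prints "my_value"
```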
Summary
Storing key-values as a delimited string outperforms the array-of-structs approach for both querying and updating the table: query execution time is roughly 2x slower with an array of structs. The reason for this slowness is that UNNESTing performs an implicit self-join on the table. It is possible that Google will optimize this later and the query time will improve.
Until Google optimizes how nested objects are handled inside BigQuery, it is better to store key-values as a delimited string and use regular expressions to extract values from the string column.