Effortless Ways to handle Google Analytics 4 data in BigQuery like a pro
A Head Start: Still hanging around with Google’s Universal Analytics? It’s Time to be more profound with The New GA4! By Vaishnavee Baskaran
In the trail of exploring more business values through data analysis, the integration of Google Analytics into BigQuery is quite challenging. We will fathom the challenges while using Google BigQuery with GA4 data in this blog and also straightforward solutions for it. Before all else, we will read through some main changes made in GA4 to overhaul Universal Analytics.
- Cross-device and cross-platform tracking
- Different reporting interface
- Even pageview is considered an event as GA4 uses event-based measures
- Tracking ID starts with ‘G-’
- Create data streams instead of views
- Different event tracking setup
- A new set of engagement metrics
- Default IP anonymization feature
GA4 is a TRIUMPH of Data Tracking
Google brought in GA4 as a replacement for Universal Web Analytics in the fall of 2020 making the businesses analyze useful metrics and KPIs not only with Web Analytics but also with App analytics. Even Though Google announces a new property type as App+Web in 2019, it is designed for any organization, even those without a native iOS or Android mobile application. To know more about the evolution of GA4, please read Google Analytics Evolution and also check out the difference between UA and GA4.
Universal Analytics buzz off
Until July 1, 2023, Google announces normal data processing in Universal Analytics properties. Though after July 1, 2023, users will be able to access only previously processed data in Universal Analytics property for at least six months. Google recommends the users who created Google Analytics properties before October 14, 2020, should change to GA4 Properties compulsorily. For more information, please visit UA goes away and to make a switch to GA4.
Event-based measurement model
The main difference between Universal Analytics and GA4 is Universal Analytics session records are based on different types of hits such as page hits, event hits, eCommerce hits, and social interaction hits, but GA4 data is event-based where all user interactions are captured as an event.
Events are triggered from the user’s interaction or activity performed with an element implanted on websites and/or mobile apps and/or firebase. There are four types of events in GA4.
Automatically collected Events
- As the name says, these events are automatically collected events that are triggered by GA4 on predefined user activities.
Enhanced measurement events
- This event is the same as automatically collected events but with a difference in enabling options. These events will be logged only if the user enables them.
Recommended Events
- Recommended events do not automatically get collected or triggered unless the user manually implements them.
Custom Events
- These events are not tracked by default as other event types, rather it enables users to create and use their events. Hence, Google recommends that users double-check existing events whether the event the user needs already exists or not. For instance, button click, sign up events, etc.
For more information on events, check out the Events reference page.
Quick Integration Setup
I assume that the reader already knows about Google BigQuery and possesses basic knowledge of SQL. BigQuery plays a vital role in turning the big memory of data into useful information for business insights. Google enables Cloud platform users to import the GA4 analytics raw data in Google BigQuery with no charges (if exceeds fixed sandbox limits, then charges incur).
Please read this document on how to link BigQuery to Google Analytics 4
In the wake of importing GA4 into BigQuery, your BigQuery project stores the imported GA4 analytics raw data in a dataset named ‘analytics_<property_id>’ which contains multiple tables. Each table in the dataset represents the collection of data for each day. The dataset always contains two data tables,
- events_ (<number of days>) — all the session data from the previous <number of days> is present in the table.
- events_ intraday_(<current date>) — gives the session data of the current day.
In the above figure, the number after analytics_ is the property ID of GA4 having the dataset consisting of imported GA4 raw data and the events_ consists of day-by-day analytics. The first events_ (509) contains all sessions of the last 509 days, which means more than a year of data is present in the table.
Our first challenge is to deal with consolidating all events of analytics to analyze and report weekly, monthly, and even yearly performance.
The above query retrieves overall events (days) in GA4, which eventually takes a lot of time and space. To query sessions of specific days or periods, we use WILDCARD Queries.
Query multiple events with Static Date
The above query retrieves all sessions between the date range 01–01–2022 and 28–02–2022. In GA4 analytics raw data, each session is named by the respective date in the format of a string which is called ‘Static Date’.
Query multiple sessions with Dynamic Date
With the help of dynamic dates in a BigQuery, we can retrieve the information for a specific period such as the last 30 days, yesterday, and so on.
In the above query, DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) takes the last 30 days interval from the current date (today) and for DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) the query gives yesterday’s analytics as the interval is 1 from the current date.
Query multiple sessions with a Hybrid Date
BigQuery also enables the use of both the static and dynamic date range.
We can also select the preferred date with the year, month, and date selection option in the BigQuery UI as shown in the below screenshot.
Now we will dive into the structure of GA4 events, where the column contains not only single data but also multiple records or arrays. For instance, in the below picture, the field name such as event_params, privacy_info, user_properties, geo, app_info, user_ltv, trafficSource, device, and so on has a drop-down list.
These dropdowns are the array of various subfields in each of the columns in the table which makes the GA4 analytics data table in JSON format. The JSON table consists of multiple records of value inside each field and subfield. Here comes the special tips to handle nested fields in the GA4 dataset.
How to handle repeated records in SQL?
The solution is nothing but, flattening the nested fields (also said like un-nesting the nested field) which is in the form of an array. In the above picture, you could see hierarchical arrangements of the nested field or repeated record-like events_param to its subfield page. Hence, the flattened nested field will be represented as (<columnname.fieldname.subfieldname>). In our example, it is events_param.value.string_value, but if you use this type of pointer addressing directly in the SQL query for repeated records, the BigQuery prompts an error.
Hence, we use UNNEST or CROSS JOIN UNNEST in the way that BigQuery understands and fetches the required information from the flattened nested field or repeated records. For example, if you want to get analytics information specifically conditioned for the file_download event with extension ‘.pdf’, then use ‘UNNEST (events_param) AS ep’ where ep is used in place of events_param as below.
The above query fetches event and user information such as ‘file_download’ (list of downloaded files by which user_id and using which platform). The UNNEST makes it possible, by creating a new column events_param.value.string_value as key and its single bit of data as value.
For simple subfields like category in device as in the picture above, the field names can be addressed directly by giving their column name device. For example, in the below query, device.category is being used directly without the help of UNNEST as <fieldname.subfieldname> to call that particular subfield value country.
How to handle multiple repeated records?
To use more than one repeated record simultaneously in the query, we can join two UNNEST in one query easily by chaining nested fields together.
In the above example, we want to unnest two different nested fields event_params and user_properties, but BigQuery prompts an error if you use both the UNNEST in the same query.
Hence, we use WITH clause to unnest multiple repeated records in BigQuery. We have created two tables for each UNNEST (event_params) and UNNEST (user_properties) in the above example query. This query retrieves a list of all downloaded files by which user in what device category, platform, and country. As mentioned earlier, the query used two nested fields by chaining them together in one query. The same WITH clause can be used to handle more than two nested fields in BigQuery.
How to convert Date in string type into a Date format?
As given below, using PARSE_DATE in the query, the string type date field is converted into proper date format. Using the UNNEST name, the nested field can be used with SELECT statement in a query mainly wherever the array is referenced and if it is not referenced to an array otherwise STRUCT, then use a simple naming convention for nested fields as <columnname.fieldname.subfieldname>.
With these basic handling techniques, you can easily access GA4 analytic data in BigQuery. For more information, please read wildcard queries.
WITH table_1 AS (
SELECT DISTINCT
event_date,
ep.value.string_value AS file_path,
REGEXP_EXTRACT(ep.value.string_value, r’.([a-zA-Z0–9-]+\.[a-zA-Z0–9-.]+$)’) AS file_name,
stream_id,
platform,
geo.country,
geo.continent,
device.category,
device.web_info.browser
FROM `project_name.analytics_propertyID.events_*`,
UNNEST (event_params) AS ep
WHERE
_TABLE_SUFFIX BETWEEN ‘20220101’
AND FORMAT_DATE(‘%Y%m%d’,DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = ‘file_download’
AND ep.value.string_value LIKE ‘%.pdf%’
AND ep.key = ‘file_name’
),
table_2 AS (
SELECT DISTINCT
event_date,
user_id,
usr.key,
usr.value.string_value,
usr.value.int_value,
usr.value.float_value,
FROM `project_name.analytics_propertyID.events_*`, UNNEST (user_properties) AS usr
WHERE
_TABLE_SUFFIX BETWEEN ‘20220101’
AND FORMAT_DATE(‘%Y%m%d’,DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
)
SELECT *
FROM table_1
LEFT JOIN table_2
USING (event_date)
GA4 properties can capture additional information such as Coupons, discounts (new), Promotions, and Refunds. Please refer to GA4 Exploration to know more about GA4 features.
Conclusion
Since Universal Analytics made our work effortless and serene all over the years, now it could be a little over our head in the commencing phase to migrate to new Google Analytics 4 properties. On the brighter side, GA4 has enormous essential features and intriguing events tracing than UA, and also hoping for more new GA4 features from Google.
While on the other hand, to make the migration process easier, Google recommends users handle a parallel tracking approach. Which is nothing but, the users already having existing UA property can create a new GA4 property that collects data in parallel to UA. For more information, please visit Parallel tracking and Migrate E-commerce data from UA to GA4.
Upcoming datadice blog posts for this month
Further Links
This post is part of the Data Analytics series from datadice and unfold data handling, manipulation, visualisation techniques using Google Cloud Platforms in accompany with building ML models in AI platform.
Check out our LinkedIn account, to get insights into our daily working life and get important updates about BigQuery, Looker Studio (formerly called Google Data Studio), and marketing analytics
We also started with our own YouTube channel. We talk about important DWH, BigQuery, Looker Studio (formerly called Google Data Studio), and many more topics. Check out the channel here.
If you want to learn more about how to use Google Looker Studio and take it to the next level in combination with BigQuery, check our Udemy course here.
If you are looking for help to set up a modern and cost-efficient data warehouse or analytical dashboards, send us an email at hello@datadice.io and we will schedule a call.
Originally published at https://www.datadice.io.