BigQuery Cost Efficiency
Best practices to avoid costly behaviour on the Cloud. By Ahmed Masmoudi
Google Data Studio is a powerful visualisation tool loaded with multiple possibilities that we talked about in the previous blog posts of this series.
This time we are going to talk about the best practices to save some bucks using this tool.
GDS or Google Data Studio does not have paid services included, nevertheless, the major data connector for this tool is BigQuery alongside other services that can be connected to it as well.
This blog post will mainly focus on BigQuery’s cost-effectiveness as it is the main service used for data warehousing before illustrating the results on Data Studio for a lot of analysts.
Google Cloud Platform offers BigQuery as a fully-managed, server-less data warehouse that enables scalable analysis over petabytes of data. BigQuery is well known for its optimised querying costs. However, on a higher scale of data, costs can become more significant. The great news is that your bill can be slashed to a minimal amount using a variety of techniques to optimise your scheduled queries with much better efficiency and quicker results.
BigQuery is measuring its users’ expenses through two main criteria; storage cost and analytical cost. For instance, active storage costs $0.020 per GB and $5.00 per TB for the on-demand analysis which is a significant amount of money for major analytical projects.
BigQuery free services:
- Batch loading data
- Automatic re-clustering
- Exporting data from tables
- Deleting table, views or partitions
- Metadata operations
- Cached queries
- Queries that result in an error
- Query data processed for the first 1TB of data per month
BigQuery paid services:
Storage
- Active storage
- Long-term storage
- Streaming inserts
Query processing
- On-demand
- Flat-rate
So in order to have your organisation be more cost-effective while still preserving a high query speed of processing terabytes of data within a few seconds, the following tips for both kinds of costs are the way to go.
Storage
Set an expiration date for your data
Configure default table expiration on your dataset. Expiration can be set at the table and partition level as well.
Only edit necessary tables
In BigQuery, your data table’s price will automatically drop by about 50% if your table has not been edited for 90 days.
Avoid data duplicates
BigQuery uses a federated data access model that allows you to query your data directly from other Google services like Cloud Bigtable, Cloud Storage, Google Drive and Cloud SQL. That way you do not have to copy your data stored in other cloud services in BigQuery to cut the storage expenses.
Make use of the BigQuery backup process
BigQuery offers a seven-day history of changes to your table, which allows you to query a point-in-time snapshot of your data. So you can revert back the data without the need to use recovery backups for certain use cases. Even for deleted tables, BigQuery offers a two-day history to revert the deletion.
Query Processing
Only query the data you need
For some queries, the screening of every possible record is not necessary. So when exploring schemas or proof-testing a query’s logic, LIMIT expression comes in handy. This process will reduce the execution time and limit the number of processed rows.
Also, another good habit to adopt while querying your data is the restricted SELECT, where you only select the desired subset of columns for your result. This will significantly reduce the processing time because BigQuery scans the data, one field at a time, decompresses it, and then executes the requested functions.
In this query example, we tried to query the COVID-19 cases table which is considered as a small data table. Nevertheless, the first query with all columns took 6 times the amount of data processed compared to selecting specific features to query as shown above.
Set up controls
Another tip here is to set the maximum bytes billed setting to limit query costs. When you set maximum bytes billed, the number of bytes that the query will read is estimated before the query execution. If the number of estimated bytes is beyond the limit, then the query fails without incurring a charge.
As shown above, BigQuery always offers an estimate of the queries’ processed bytes and time plus the actual numbers at the bottom that will show up after firing up the query. So setting up controls here will prevent causing a lot of expenses when executing a cost-heavy query.
Partitioning
Under the hood, BigQuery is using columnar storage format instead of the traditional row-oriented format. The columnar format is optimised for analytical queries because it allows ignoring of record fields that are not part of the query.
With the huge amount of data to store on the cloud, BigQuery offers a partition by feature with a maximum of 4000 partitions per partitioned table for faster access to the right chunks of data without the need for scanning all records.
A good use case is to partition your table by the record timestamp with a predefined period of time for each partition.This column is frequently used to indicate the time a specific row has been stored. As a result of this scenario, querying data from a specific record timestamp would be much faster than querying a traditional table with no partitions because only the partitions with a date of storage within the predefined time frame will be scanned.
Clustering
BigQuery also offers clustering fields that you can predefine, then the data will be sorted by the values of that field into separate blocks.
When running a query that filters, aggregates, or joins data based on the clustering field, BigQuery will intelligently determine which blocks of data to read. This is called block pruning or cluster pruning.
In the following query, we replace our traditional COVID-19 cases table by a partitioned and clustered table. The new table will be stored based on partitions of the dates of the recorded cases, and we specified that each day will be stored in a separate partition ( we can also do partitions per week or month for instance). We also clustered our data by number of confirmed cases intervals (usually clustering is used for unique identifiers for large datasets).
Data caching
BigQuery is well-known for its caching feature, where you can query a table from its cached data cost-free.
If you rerun the same query within a reasonable time frame, the results may be fetched from the cache and therefore no cost is incurred. The caching time frame can be tuned to avoid multiple useless scans of the data.
Materialise query results in stages
Instead of running your query for results every time you need them, you can break it into stages where each stage materialises the query results by writing them to a destination table. This way you can reduce the processing time and save a lot on your bill.
Monitoring cost
In order to manage your costs on BigQuery, Google Cloud’s Stackdriver monitoring and auditing provides insights into BigQuery usage. Plus, BigQuery offers querying for retrieving real-time metadata about the jobs running into your project. Using the job metadata, you can find out all executed queries within a specified time period, sort them by cost, bytes processed and group them by user, dataset or project.
Conclusion
The impact of the previous instructions is not a fraction of your bill, however, optimising your workflow with the data can help you save up to 90% of your expenses.
Google lately has been posting on their blog many stories about big companies with 7 digits spent on the cloud successfully saving up a lot of money that way either migrating from other cloud providers or even being already a Google Cloud partner from the beginning.
So the cost efficiency topic is worth investigating for all businesses no matter how big they are.
Further links
This post is part of our Google Data Studio series in the Data School of datadice. We will be sharing the best insights that we’ve learned through years of creating data solutions.
If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check our Udemy course here.