⚡️ BigQuery Query Optimization: Partition Pruning with Subqueries vs SQL Variables

Vara Tumpala included in Cloud Computing Data Engineering Google Cloud

2025-12-12 1220 words 6 minutes

Contents

📋 Overview

BigQuery Partitioning divides a large table into smaller, manageable segments called partitions based on the values of a specific column. BigQuery supports three types of partitioning:

Time-unit partitioning: Partition based on a TIMESTAMP, DATE, or DATETIME column into daily, hourly, daily, monthly, or yearly segments.
Integer-range partitioning: Partition based on ranges of values in a specific INTEGER column, defined by a start value, end value, and interval.
Ingestion-time partitioning: Automatically partitions data based on the time when BigQuery ingests the data using a pseudo column _PARTITIONTIME.

This strategy improves query performance by limiting the amount of data scanned and reducing the cost of the query by reading only the relevant partitions of a table. This process is known as partition pruning.

BigQuery automatically prunes partitions when the query contains filters on the partitioning column with static literal values. However, using subqueries to determine partition values can prevent BigQuery from effectively pruning partitions, leading to higher costs and slower performance.

In this article, we will explore an approach that disables partition pruning and demonstrate how to rewrite the query to enable it.

📊 The Scenario: Querying the “Latest N” Partitions

Imagine there is a large historical dataset partitioned by day on the date column. This table holds years of data, amounting to several gigabytes. The task is to analyze only the data from the 10 most recent days.

The following sample dataset is used throughout this article to demonstrate partition pruning behavior.

Property	Value
Project ID	`xxxxxxxxxxx`
Dataset ID	`lab_dataset`
Table Name	`lcl_data`
Table Size	`9.37 GB`
Number of Partitions	`830`
Partitioning Column	`date_processed` (DATE type)
Partition Type	`Time-unit partitioning (daily)`
Number of records	`167932474`

🕵️‍♂️ The Flawed Approach: Dynamic Subquery in the `WHERE` Clause

A common approach is to use a subquery to find the latest partition dates and feed them into the main query’s WHERE clause.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
-- ❌ This query scans the ENTIRE table!
SELECT *
FROM `xxxxxxxxxxx.lab_dataset.lcl_data`
WHERE date_processed IN (
    -- Subquery to dynamically find the 10 most recent partitions
    SELECT 
      PARSE_DATE('%Y%m%d', PARTITION_ID)
    FROM `xxxxxxxxxxx.lab_dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE TABLE_NAME = 'lcl_data'
      AND PARTITION_ID != '__NULL__'
    ORDER BY PARSE_DATE('%Y%m%d', PARTITION_ID) DESC
    LIMIT 10
);

⚠️ Critical Issue: While this looks correct, it forces BigQuery to scan the entire table because the query planner cannot resolve the list of dates from the subquery before it executes the query, so it must scan all partitions.

After running the query, a look at the job information will show a high number for “Bytes billed.”

💸 Result: Scanning full table (9.37 GB) instead of just 10 partitions.

BigQuery partition pruning disabled (flawed subquery approach)

🛠️ The Solution: Replace the Subquery with a SQL Variable

✨ Key Insight: The correct approach is to decouple the discovery of partitions from the main data query. You do this by finding the partition dates first and storing them in a SQL variable. The main query then uses this SQL variable which has the partition dates in it, allowing the query planner to perform pruning correctly.

This single principle can be applied whether you are working in the BigQuery console or programmatically.

1️⃣ In the BigQuery Console (using SQL Scripting)

You can use a BEGIN...END block to run a multi-statement script. The DECLARE statement runs first, populating the variable. The main query then uses it.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
-- ✅ This works! Partition pruning is enabled.
BEGIN
  -- Step 1: Discover the partitions and store them in a variable.
  DECLARE PARTITION_DATES ARRAY<DATE> DEFAULT ARRAY(SELECT PARSE_DATE('%Y%m%d', PARTITION_ID)
    FROM `xxxxxxxxxxx.lab_dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE TABLE_NAME = 'lcl_data'
      AND PARTITION_ID != '__NULL__'
    ORDER BY PARSE_DATE('%Y%m%d', PARTITION_ID) DESC
    LIMIT 10
  );

  -- Step 2: Use the static variable in the main query.
  SELECT *
  FROM `xxxxxxxxxxx.lab_dataset.lcl_data`
  WHERE date_processed IN UNNEST(PARTITION_DATES);
END;

2️⃣ Programmatically (e.g., in Python)

The same two-step logic applies. First, you run a query to get the dates. Second, you use those dates as a parameter in your main query.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Step 1: Run a query to discover the partition dates.
partition_query = f"""
    SELECT ARRAY(
        SELECT PARSE_DATE('%Y%m%d', PARTITION_ID)
        FROM `{PROJECT_ID}.{DATASET_ID}.INFORMATION_SCHEMA.PARTITIONS`
        WHERE TABLE_NAME = '{TABLE_ID}' AND PARTITION_ID != '__NULL__'
        ORDER BY PARSE_DATE('%Y%m%d', PARTITION_ID) DESC
        LIMIT {PARTITION_LIMIT}
    )
"""
# Extract partition dates from result
partition_dates = []
for row in partition_result:
    partition_dates = row[0]
    break

# Step 2: Use the result as a query parameter in the main query.
main_query = f"""
    SELECT *
    FROM `{FULL_TABLE_ID}`
    WHERE date_processed IN UNNEST(@PARTITION_DATES)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter('PARTITION_DATES', 'DATE', partition_dates)
    ]
)
query_job = client.query(main_query, job_config=job_config)

📚 Full code example: Check out the 🔗 bq_partition_pruning_subquery_vs_sqlparam_demo.py script

✅ Result: Partition Pruning Enabled!

🎉 After implementing the above solution, running the query will show a significantly lower “Bytes billed” value, indicating that partition pruning is now effectively enabled. It has 2 queries instead of 1:

🔍 First query gets the partition dates
📊 Second query fetches data from only those partitions

📈 Breakdown of bytes billed:

🔹 First query (partition discovery): ~10 MB
🔹 Second query (main data query with pruning): ~124 MB
🟢 Total: ~134 MB (vs. ❌ 9.37 GB without pruning) — 98% savings!

Below is a parent-and-children layout to present the result meaningfully: the parent shows both queries (BEGIN…END), and the two children show each job detail.

Parent: BEGIN...END block showing both queries executed

Parent: Execution overview of BEGIN...END block (two queries).

Child 1: Job info for partition dates query

Child 1: Job information for partition dates query.

Child 2: Job info for pruned partitions data query

Child 2: Job information for pruned partitions data query.

🔍 Verify Only 10 Partitions Were Scanned

✔️ To confirm that partition pruning is working correctly, you can query the INFORMATION_SCHEMA.PARTITIONS table to calculate the total billable bytes for the 10 most recent partitions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
select *, 
 round((sum(total_billable_bytes) over(order by partition_id rows between unbounded preceding and unbounded following))/(1024*1024)) || ' MB' as total_billable_bytes_sum
 from (
    SELECT table_schema, 
      table_name, 
      partition_id, 
      total_rows,
      total_logical_bytes,
      total_billable_bytes
    FROM `xxxxxxxxxxx.lab_dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE TABLE_NAME = 'lcl_data' 
      AND PARTITION_ID != '__NULL__'
    ORDER BY PARSE_DATE('%Y%m%d', PARTITION_ID) DESC
    limit 10)
ORDER BY PARSE_DATE('%Y%m%d', PARTITION_ID) DESC

✅ Expected Result: The total_billable_bytes_sum column should be around ~124 MB (not 9.37 GB!)

💰 The Financial Impact: A Stunning Difference

🚀 This single optimization can dramatically reduce your BigQuery costs. By eliminating the subquery and using a SQL variable, you ensure BigQuery only scans the partitions you actually need.

📊 The table below compares the resource consumption before and after implementing this fix, revealing the substantial impact on both performance and cost:

Metric	❌ Subquery Approach	✅ SQL Variable Approach
Bytes Billed	`9.37 GB`	`~134 (10 + 124) MB`
Partition Pruning	❌ NO (All partitions scanned)	✅ YES (Only 10 partitions scanned)
Cost Savings	Baseline	Over 98%

🎯 Conclusion

📌 Key Takeaway: Using subqueries in the WHERE clause to determine partition values can disable BigQuery’s partition pruning, leading to inefficient queries that scan the entire table.

✨ The Fix: By separating the partition discovery step from the main data query and using SQL variables or query parameters, you can enable partition pruning effectively.

🏆 The Impact: This approach results in significant performance improvements and cost savings (often 90%+) when querying large partitioned tables in BigQuery.