hook

__all__ = ['BigqueryHook']   (module attribute)

BigqueryHook

Bases: BaseHook

Hook for interacting with Google BigQuery.

This hook provides methods for managing datasets, tables, and jobs in BigQuery. It uses the google-cloud-bigquery library to communicate with the BigQuery API.

__init__()

Initializes the BigqueryHook.

Creates a BigQuery client instance.
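
A minimal instantiation sketch. The import path below is assumed from this page's module name ("hook") and may differ in your codebase; credentials are whatever google-cloud-bigquery picks up by default (e.g. application default credentials).

```python
from hook import BigqueryHook  # hypothetical import path, adjust to your project layout

hook = BigqueryHook()  # the constructor creates the underlying google.cloud.bigquery client
```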

build_table_id(project, dataset, table)

Builds a BigQuery table ID string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| project | str | The Google Cloud project ID. | required |
| dataset | str | The BigQuery dataset ID. | required |
| table | str | The BigQuery table ID. | required |

Returns:

| Type | Description |
| --- | --- |
| str | The fully qualified table ID in the format 'project.dataset.table'. |
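
A usage sketch with placeholder identifiers, assuming a BigqueryHook instance named `hook` (see the first sketch); only the 'project.dataset.table' output format is taken from this page.

```python
table_id = hook.build_table_id(
    project="my-project",  # placeholder project ID
    dataset="analytics",   # placeholder dataset ID
    table="events",        # placeholder table ID
)
# Expected result: "my-project.analytics.events"
```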

execute_load_job(from_filepath, to_project, to_dataset, to_table, job_config, timeout=240)

Executes a BigQuery load job from a URI.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| from_filepath | str | The GCS URI of the source file. | required |
| to_project | str | The Google Cloud project ID for the destination table. | required |
| to_dataset | str | The BigQuery dataset ID for the destination table. | required |
| to_table | str | The BigQuery table ID for the destination table. | required |
| job_config | LoadJobConfig | The load job configuration to use. | required |
| timeout | int | The timeout for the job in seconds. | 240 |
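
A hedged sketch of loading a CSV file, assuming a hook instance named `hook`; the GCS URI and identifiers are placeholders, and the LoadJobConfig is built directly with google-cloud-bigquery here (setup_job_config, documented below, can produce one instead).

```python
from google.cloud import bigquery

# Build a load job configuration by hand (placeholder settings).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    autodetect=True,
)

hook.execute_load_job(
    from_filepath="gs://my-bucket/exports/events.csv",  # placeholder GCS URI
    to_project="my-project",
    to_dataset="analytics",
    to_table="events",
    job_config=job_config,
    timeout=240,
)
```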

execute_query_job(query, to_project, to_dataset, to_table, to_write_disposition, to_time_partitioning, timeout=480)

Executes a BigQuery query job.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | The SQL query to execute. | required |
| to_project | str | The Google Cloud project ID for the destination table (if any). | required |
| to_dataset | str | The BigQuery dataset ID for the destination table (if any). | required |
| to_table | str | The BigQuery table ID for the destination table (if any). | required |
| to_write_disposition | str | The write disposition if writing to a table. | required |
| to_time_partitioning | dict | Configuration for time-based partitioning if writing to a table. | required |
| timeout | int | The timeout for the job in seconds. | 480 |

Raises:

| Type | Description |
| --- | --- |
| TimeoutError | If the job times out. |
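
A hedged sketch of writing query results to a date-partitioned destination table, assuming a hook instance named `hook`. The 'WRITE_TRUNCATE' disposition string and the 'type'/'field' keys of to_time_partitioning are assumptions based on the BigQuery API and the setup_job_config docs below, not values confirmed for this method.

```python
hook.execute_query_job(
    query="SELECT user_id, COUNT(*) AS n FROM `my-project.analytics.events` GROUP BY user_id",
    to_project="my-project",
    to_dataset="analytics",
    to_table="user_counts",
    to_write_disposition="WRITE_TRUNCATE",  # assumed BigQuery disposition string
    to_time_partitioning={"type": "DAY", "field": "created_at"},  # keys assumed per setup_job_config docs
    timeout=480,
)
```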

export_to_gcs(from_project, from_dataset, from_table, to_filepath)

Exports a BigQuery table to Google Cloud Storage (GCS).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| from_project | str | The Google Cloud project ID of the source table. | required |
| from_dataset | str | The BigQuery dataset ID of the source table. | required |
| from_table | str | The BigQuery table ID of the source table. | required |
| to_filepath | str | The GCS URI where the table will be exported. | required |
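
A usage sketch with placeholder names, assuming a hook instance named `hook`; the wildcard in the destination URI is a common BigQuery export convention for sharded output, not something this page specifies.

```python
hook.export_to_gcs(
    from_project="my-project",
    from_dataset="analytics",
    from_table="events",
    to_filepath="gs://my-bucket/exports/events-*.csv",  # placeholder GCS URI
)
```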

get_all_columns(rows)

Gets a unique set of all column names from a list of rows.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| rows | list | A list of dictionaries, where each dictionary represents a row. | required |

Returns:

| Type | Description |
| --- | --- |
| set | A set of unique column names. |
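
A small example assuming a hook instance named `hook`; the result is the union of keys across all rows.

```python
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "email": "b@example.com"},
]
columns = hook.get_all_columns(rows)
# Expected: {"id", "name", "email"}
```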

get_dataset(dataset)

Gets a BigQuery dataset, creating it if it doesn't exist.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | str | The BigQuery dataset ID. | required |

Returns:

| Type | Description |
| --- | --- |
| google.cloud.bigquery.dataset.Dataset | The BigQuery dataset. |
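
A usage sketch assuming a hook instance named `hook`; the dataset ID is a placeholder.

```python
dataset = hook.get_dataset("analytics")  # created on first use if it does not already exist
print(dataset.dataset_id)                # google.cloud.bigquery Dataset objects expose dataset_id
```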

get_query_results(query, timeout=480)

Executes a query and returns the results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | The SQL query to execute. | required |
| timeout | int | The timeout for the query in seconds. | 480 |

Returns:

| Type | Description |
| --- | --- |
| google.cloud.bigquery.table.RowIterator | An iterator of rows resulting from the query. |

Raises:

| Type | Description |
| --- | --- |
| TimeoutError | If the query times out. |
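
A usage sketch assuming a hook instance named `hook`; the query, table, and column names are placeholders.

```python
rows = hook.get_query_results(
    "SELECT name, value FROM `my-project.analytics.events` LIMIT 10",
    timeout=480,
)
for row in rows:
    print(row["name"], row["value"])  # RowIterator yields Row objects with key access
```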

get_rows_from_table(project, dataset, table, timeout=480)

Retrieves all rows from a BigQuery table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| project | str | The Google Cloud project ID. | required |
| dataset | str | The BigQuery dataset ID. | required |
| table | str | The BigQuery table ID. | required |
| timeout | int | The timeout for the query in seconds. | 480 |

Returns:

| Type | Description |
| --- | --- |
| google.cloud.bigquery.table.RowIterator | An iterator of rows from the table. |
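
A usage sketch assuming a hook instance named `hook`; identifiers are placeholders.

```python
rows = hook.get_rows_from_table(
    project="my-project",
    dataset="analytics",
    table="events",
)
print(sum(1 for _ in rows))  # count rows by exhausting the iterator
```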

get_table(project, dataset, table, schema, partition_column)

Gets a BigQuery table, creating it if it doesn't exist.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| project | str | The Google Cloud project ID. | required |
| dataset | str | The BigQuery dataset ID. | required |
| table | str | The BigQuery table ID. | required |
| schema | list | A list of dictionaries representing the table schema. Each dictionary should have 'key', 'type', and 'mode' keys. | required |
| partition_column | str | The name of the column to use for time-based partitioning. If None, the table will not be partitioned. | required |

Returns:

| Type | Description |
| --- | --- |
| google.cloud.bigquery.table.Table | The BigQuery table. |
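
A hedged sketch assuming a hook instance named `hook`; the schema dictionaries follow the 'key'/'type'/'mode' shape documented above, while the field names and types themselves are placeholders.

```python
schema = [
    {"key": "id", "type": "STRING", "mode": "REQUIRED"},
    {"key": "amount", "type": "FLOAT", "mode": "NULLABLE"},
    {"key": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
]

bq_table = hook.get_table(
    project="my-project",
    dataset="analytics",
    table="payments",
    schema=schema,
    partition_column="created_at",  # pass None to skip time-based partitioning
)
```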

list_datasets()

Lists all datasets in the current project.

Returns:

| Type | Description |
| --- | --- |
| Iterator[google.cloud.bigquery.dataset.DatasetListItem] | An iterator of dataset list items. |
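
A small sketch assuming a hook instance named `hook`.

```python
for item in hook.list_datasets():
    print(item.dataset_id)  # DatasetListItem exposes dataset_id
```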

load_file(from_filepath, from_file_format, from_separator, from_skip_leading_rows, from_quote_character, from_encoding, to_project, to_dataset, to_table, to_mode, to_schema, to_time_partitioning)

Loads data from a file in GCS to a BigQuery table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| from_filepath | str | The GCS URI of the source file. | required |
| from_file_format | str | The format of the source file. | required |
| from_separator | str | The delimiter for CSV files. | required |
| from_skip_leading_rows | int | Number of leading rows to skip for CSV. | required |
| from_quote_character | str | Quote character for CSV files. | required |
| from_encoding | str | File encoding. | required |
| to_project | str | Destination Google Cloud project ID. | required |
| to_dataset | str | Destination BigQuery dataset ID. | required |
| to_table | str | Destination BigQuery table ID. | required |
| to_mode | str | Write disposition (e.g., 'overwrite', 'WRITE_APPEND'). | required |
| to_schema | list | Schema for the destination table. | required |
| to_time_partitioning | dict | Configuration for time-based partitioning. | required |
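
A hedged end-to-end sketch assuming a hook instance named `hook`; all values are placeholders, and the 'csv' format and 'overwrite' mode strings are taken from the examples in the setup_job_config docs below rather than from an exhaustive list of accepted values.

```python
hook.load_file(
    from_filepath="gs://my-bucket/exports/events.csv",  # placeholder GCS URI
    from_file_format="csv",
    from_separator=",",
    from_skip_leading_rows=1,
    from_quote_character='"',
    from_encoding="UTF-8",
    to_project="my-project",
    to_dataset="analytics",
    to_table="events",
    to_mode="overwrite",
    to_schema=None,  # None -> schema autodetection (per setup_job_config docs)
    to_time_partitioning={"type": "DAY", "field": "created_at"},
)
```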

setup_job_config(from_file_format, from_separator, from_skip_leading_rows, from_quote_character, from_encoding, to_mode, to_schema, to_time_partitioning)

Configures a BigQuery load job.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| from_file_format | str | The format of the source file (e.g., 'csv', 'json'). | required |
| from_separator | str | The delimiter used in CSV files. | required |
| from_skip_leading_rows | int | The number of leading rows to skip in CSV files. | required |
| from_quote_character | str | The character used to quote fields in CSV files. | required |
| from_encoding | str | The encoding of the source file. | required |
| to_mode | str | The write disposition for the load job (e.g., 'overwrite', 'WRITE_APPEND'). | required |
| to_schema | list | The schema for the destination table. If None, autodetect is used. | required |
| to_time_partitioning | dict | Configuration for time-based partitioning. Should include 'type' and 'field'. | required |

Returns:

| Type | Description |
| --- | --- |
| google.cloud.bigquery.job.LoadJobConfig | The configured load job configuration object. |

Raises:

| Type | Description |
| --- | --- |
| Exception | If the file format is not supported. |
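
A hedged sketch assuming a hook instance named `hook`; the argument values mirror the examples in the table above, and passing None for to_schema relies on the documented autodetect behaviour.

```python
job_config = hook.setup_job_config(
    from_file_format="csv",
    from_separator=",",
    from_skip_leading_rows=1,
    from_quote_character='"',
    from_encoding="UTF-8",
    to_mode="overwrite",
    to_schema=None,  # None -> schema autodetection
    to_time_partitioning={"type": "DAY", "field": "created_at"},
)
# The returned LoadJobConfig can then be passed to execute_load_job.
```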

update_table_schema(bq_table, rows)

Updates the schema of a BigQuery table if new columns are present in the rows.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| bq_table | Table | The BigQuery table object. | required |
| rows | list | A list of dictionaries representing the rows to be inserted. | required |

Returns:

| Type | Description |
| --- | --- |
| google.cloud.bigquery.table.Table | The updated BigQuery table object. |
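
A hedged sketch assuming a hook instance named `hook`; how the new column's type and mode are chosen is not specified on this page, so the comments only reflect what the description above states.

```python
bq_table = hook.get_table(
    project="my-project",
    dataset="analytics",
    table="events",
    schema=[{"key": "id", "type": "STRING", "mode": "REQUIRED"}],
    partition_column=None,
)
rows = [{"id": "1", "source": "api"}]                # "source" is not in the current schema
bq_table = hook.update_table_schema(bq_table, rows)  # returned table's schema should now cover "source"
```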

write(project, dataset, table, schema, partition_column, rows)

Writes rows to a BigQuery table.

This method ensures the dataset and table exist, updates the table schema if necessary, and then inserts the rows.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| project | str | The Google Cloud project ID. | required |
| dataset | str | The BigQuery dataset ID. | required |
| table | str | The BigQuery table ID. | required |
| schema | list | A list of dictionaries representing the table schema. | required |
| partition_column | str | The name of the column for time-based partitioning. | required |
| rows | list | A list of dictionaries representing the rows to insert. | required |

Raises:

| Type | Description |
| --- | --- |
| Exception | If there are errors during the insertion process. |
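
A hedged sketch assuming a hook instance named `hook`; the schema follows the 'key'/'type'/'mode' shape documented for get_table, and all names and row values are placeholders.

```python
hook.write(
    project="my-project",
    dataset="analytics",
    table="signups",
    schema=[
        {"key": "user_id", "type": "STRING", "mode": "REQUIRED"},
        {"key": "signed_up_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
    ],
    partition_column="signed_up_at",
    rows=[
        {"user_id": "u-123", "signed_up_at": "2024-01-01T00:00:00Z"},
    ],
)
```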