__all__ = ['BigqueryHook'] (module attribute)
BigqueryHook
Bases: BaseHook
Hook for interacting with Google BigQuery.
This hook provides methods for managing datasets, tables, and jobs in BigQuery.
It uses the google-cloud-bigquery library to communicate with the BigQuery API.
__init__()
Initializes the BigqueryHook.
Creates a BigQuery client instance.
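A minimal usage sketch; the import path is an assumption (adjust it to wherever BigqueryHook lives in your project). The `hook` instance created here is reused in the sketches below.

```python
# Assumed import path; replace with the actual module that defines BigqueryHook.
from hooks.bigquery import BigqueryHook

hook = BigqueryHook()  # instantiating the hook creates the underlying BigQuery client
```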
build_table_id(project, dataset, table)
Builds a BigQuery table ID string.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
project | str | The Google Cloud project ID. | required |
dataset | str | The BigQuery dataset ID. | required |
table | str | The BigQuery table ID. | required |

Returns:

Type | Description |
---|---|
str | The fully qualified table ID in the format 'project.dataset.table'. |
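For example, with placeholder project, dataset, and table names, a call would be expected to return the dotted identifier:

```python
table_id = hook.build_table_id("my-project", "analytics", "events")
# table_id == "my-project.analytics.events"
```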
execute_load_job(from_filepath, to_project, to_dataset, to_table, job_config, timeout=240)
Executes a BigQuery load job from a URI.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
from_filepath | str | The GCS URI of the source file. | required |
to_project | str | The Google Cloud project ID for the destination table. | required |
to_dataset | str | The BigQuery dataset ID for the destination table. | required |
to_table | str | The BigQuery table ID for the destination table. | required |
job_config | LoadJobConfig | The configured LoadJobConfig for the load job. | required |
timeout | int | The timeout for the job in seconds. Defaults to 240. | 240 |
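A sketch of a load from GCS; the bucket, project, dataset, and table names are placeholders, and the LoadJobConfig here is built with the standard google-cloud-bigquery API rather than with setup_job_config:

```python
from google.cloud import bigquery

# A hand-built job config; setup_job_config (documented below) can produce one from simpler arguments.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)

hook.execute_load_job(
    from_filepath="gs://my-bucket/exports/events.csv",  # placeholder GCS URI
    to_project="my-project",
    to_dataset="analytics",
    to_table="events",
    job_config=job_config,
    timeout=240,
)
```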
execute_query_job(query, to_project, to_dataset, to_table, to_write_disposition, to_time_partitioning, timeout=480)
Executes a BigQuery query job.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
query | str | The SQL query to execute. | required |
to_project | str | The Google Cloud project ID for the destination table (if any). | required |
to_dataset | str | The BigQuery dataset ID for the destination table (if any). | required |
to_table | str | The BigQuery table ID for the destination table (if any). | required |
to_write_disposition | str | The write disposition if writing to a table. | required |
to_time_partitioning | dict | Configuration for time-based partitioning if writing to a table. | required |
timeout | int | The timeout for the job in seconds. Defaults to 480. | 480 |

Raises:

Type | Description |
---|---|
TimeoutError | If the job times out. |
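A sketch of a query job that writes its result to a destination table; the identifiers are placeholders, and the write disposition and partitioning keys follow the conventions described for setup_job_config below:

```python
hook.execute_query_job(
    query=(
        "SELECT event_date, user_id, COUNT(*) AS n "
        "FROM `my-project.analytics.events` GROUP BY event_date, user_id"
    ),
    to_project="my-project",
    to_dataset="analytics",
    to_table="events_per_user",
    to_write_disposition="WRITE_TRUNCATE",                       # standard BigQuery write disposition
    to_time_partitioning={"type": "DAY", "field": "event_date"},  # 'type' and 'field' keys
    timeout=480,
)
```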
export_to_gcs(from_project, from_dataset, from_table, to_filepath)
Exports a BigQuery table to Google Cloud Storage (GCS).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
from_project | str | The Google Cloud project ID of the source table. | required |
from_dataset | str | The BigQuery dataset ID of the source table. | required |
from_table | str | The BigQuery table ID of the source table. | required |
to_filepath | str | The GCS URI where the table will be exported. | required |
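A sketch with placeholder identifiers; the destination is a GCS URI:

```python
hook.export_to_gcs(
    from_project="my-project",
    from_dataset="analytics",
    from_table="events",
    to_filepath="gs://my-bucket/exports/events.csv",  # placeholder destination URI
)
```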
get_all_columns(rows)
Gets a unique set of all column names from a list of rows.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
rows | list | A list of dictionaries, where each dictionary represents a row. | required |

Returns:

Type | Description |
---|---|
set | A set of unique column names. |
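An illustration of the documented behaviour with made-up rows:

```python
rows = [{"id": 1, "name": "a"}, {"id": 2, "country": "BR"}]
hook.get_all_columns(rows)
# -> {"id", "name", "country"}
```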
get_dataset(dataset)
Gets a BigQuery dataset, creating it if it doesn't exist.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset | str | The BigQuery dataset ID. | required |

Returns:

Type | Description |
---|---|
google.cloud.bigquery.dataset.Dataset | The BigQuery dataset. |
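For example (the dataset ID is a placeholder):

```python
dataset = hook.get_dataset("analytics")  # created first if it does not already exist
```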
get_query_results(query, timeout=480)
Executes a query and returns the results.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
query | str | The SQL query to execute. | required |
timeout | int | The timeout for the query in seconds. Defaults to 480. | 480 |

Returns:

Type | Description |
---|---|
google.cloud.bigquery.table.RowIterator | An iterator of rows resulting from the query. |

Raises:

Type | Description |
---|---|
TimeoutError | If the query times out. |
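A sketch that iterates the returned RowIterator; the query and column names are placeholders:

```python
rows = hook.get_query_results("SELECT name, total FROM `my-project.analytics.totals`")
for row in rows:
    print(row["name"], row["total"])  # Row objects support key access
```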
get_rows_from_table(project, dataset, table, timeout=480)
Retrieves all rows from a BigQuery table.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
project | str | The Google Cloud project ID. | required |
dataset | str | The BigQuery dataset ID. | required |
table | str | The BigQuery table ID. | required |
timeout | int | The timeout for the query in seconds. Defaults to 480. | 480 |

Returns:

Type | Description |
---|---|
google.cloud.bigquery.table.RowIterator | An iterator of rows from the table. |
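For example, with placeholder identifiers:

```python
for row in hook.get_rows_from_table("my-project", "analytics", "events", timeout=480):
    print(dict(row))  # bigquery Row objects convert cleanly to dicts
```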
get_table(project, dataset, table, schema, partition_column)
Gets a BigQuery table, creating it if it doesn't exist.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
project | str | The Google Cloud project ID. | required |
dataset | str | The BigQuery dataset ID. | required |
table | str | The BigQuery table ID. | required |
schema | list | A list of dictionaries representing the table schema. Each dictionary should have 'key', 'type', and 'mode' keys. | required |
partition_column | str | The name of the column to use for time-based partitioning. If None, the table will not be partitioned. | required |

Returns:

Type | Description |
---|---|
google.cloud.bigquery.table.Table | The BigQuery table. |
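A sketch using the documented schema format ('key', 'type', and 'mode' per column); the identifiers and column names are placeholders:

```python
schema = [
    {"key": "event_id", "type": "STRING", "mode": "REQUIRED"},
    {"key": "event_date", "type": "TIMESTAMP", "mode": "NULLABLE"},
]

table = hook.get_table(
    project="my-project",
    dataset="analytics",
    table="events",
    schema=schema,
    partition_column="event_date",  # pass None to create an unpartitioned table
)
```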
list_datasets()
Lists all datasets in the current project.
Returns:

Type | Description |
---|---|
google.cloud.bigquery.dataset.DatasetListItem | An iterator of dataset list items. |
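For example:

```python
for ds in hook.list_datasets():
    print(ds.dataset_id)  # DatasetListItem exposes the dataset ID
```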
load_file(from_filepath, from_file_format, from_separator, from_skip_leading_rows, from_quote_character, from_encoding, to_project, to_dataset, to_table, to_mode, to_schema, to_time_partitioning)
Loads data from a file in GCS to a BigQuery table.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
from_filepath | str | The GCS URI of the source file. | required |
from_file_format | str | The format of the source file. | required |
from_separator | str | The delimiter for CSV files. | required |
from_skip_leading_rows | int | Number of leading rows to skip for CSV. | required |
from_quote_character | str | Quote character for CSV files. | required |
from_encoding | str | File encoding. | required |
to_project | str | Destination Google Cloud project ID. | required |
to_dataset | str | Destination BigQuery dataset ID. | required |
to_table | str | Destination BigQuery table ID. | required |
to_mode | str | Write disposition (e.g., 'overwrite', 'WRITE_APPEND'). | required |
to_schema | list | Schema for the destination table. | required |
to_time_partitioning | dict | Configuration for time-based partitioning. | required |
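A sketch of a CSV load, reusing the placeholder identifiers and schema from the get_table example above; 'overwrite' and 'WRITE_APPEND' are the write-disposition values mentioned in this reference:

```python
hook.load_file(
    from_filepath="gs://my-bucket/exports/events.csv",
    from_file_format="csv",
    from_separator=",",
    from_skip_leading_rows=1,
    from_quote_character='"',
    from_encoding="UTF-8",
    to_project="my-project",
    to_dataset="analytics",
    to_table="events",
    to_mode="overwrite",
    to_schema=schema,  # same 'key'/'type'/'mode' dicts as above; None would trigger autodetect
    to_time_partitioning={"type": "DAY", "field": "event_date"},
)
```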
setup_job_config(from_file_format, from_separator, from_skip_leading_rows, from_quote_character, from_encoding, to_mode, to_schema, to_time_partitioning)
Configures a BigQuery load job.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
from_file_format | str | The format of the source file (e.g., 'csv', 'json'). | required |
from_separator | str | The delimiter used in CSV files. | required |
from_skip_leading_rows | int | The number of leading rows to skip in CSV files. | required |
from_quote_character | str | The character used to quote fields in CSV files. | required |
from_encoding | str | The encoding of the source file. | required |
to_mode | str | The write disposition for the load job (e.g., 'overwrite', 'WRITE_APPEND'). | required |
to_schema | list | The schema for the destination table. If None, autodetect is used. | required |
to_time_partitioning | dict | Configuration for time-based partitioning. Should include 'type' and 'field'. | required |

Returns:

Type | Description |
---|---|
google.cloud.bigquery.job.LoadJobConfig | The configured load job object. |

Raises:

Type | Description |
---|---|
Exception | If the file format is not supported. |
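A sketch producing a LoadJobConfig for a JSON load with schema autodetection; all arguments are required by the signature, so the CSV-specific ones are still passed with placeholder values:

```python
job_config = hook.setup_job_config(
    from_file_format="json",        # 'csv' and 'json' are the documented formats
    from_separator=",",             # CSV-only setting, still passed per the signature
    from_skip_leading_rows=0,
    from_quote_character='"',
    from_encoding="UTF-8",
    to_mode="WRITE_APPEND",
    to_schema=None,                 # None -> schema autodetect
    to_time_partitioning={"type": "DAY", "field": "event_date"},
)
```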
update_table_schema(bq_table, rows)
Updates the schema of a BigQuery table if new columns are present in the rows.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
bq_table | Table | The BigQuery table object. | required |
rows | list | A list of dictionaries representing the rows to be inserted. | required |

Returns:

Type | Description |
---|---|
google.cloud.bigquery.table.Table | The updated BigQuery table object. |
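A sketch continuing the placeholder table and schema from the examples above:

```python
rows = [{"event_id": "abc", "event_date": "2024-01-01T00:00:00Z", "country": "BR"}]
bq_table = hook.get_table("my-project", "analytics", "events", schema, "event_date")
bq_table = hook.update_table_schema(bq_table, rows)  # adds any columns present in rows but missing from the table schema
```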
write(project, dataset, table, schema, partition_column, rows)
Writes rows to a BigQuery table.
This method ensures the dataset and table exist, updates the table schema if necessary, and then inserts the rows.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
project | str | The Google Cloud project ID. | required |
dataset | str | The BigQuery dataset ID. | required |
table | str | The BigQuery table ID. | required |
schema | list | A list of dictionaries representing the table schema. | required |
partition_column | str | The name of the column for time-based partitioning. | required |
rows | list | A list of dictionaries representing the rows to insert. | required |

Raises:

Type | Description |
---|---|
Exception | If there are errors during the insertion process. |
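A sketch of an end-to-end write with the placeholder schema and identifiers used above:

```python
hook.write(
    project="my-project",
    dataset="analytics",
    table="events",
    schema=schema,
    partition_column="event_date",
    rows=[{"event_id": "abc", "event_date": "2024-01-01T00:00:00Z"}],
)
```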