Loading and Preparing Customer Data for BigQuery ML

Overview

Data preparation is a foundational step in any machine learning project, impacting both model performance and interpretability. Properly formatted, well-structured data is essential for building robust models and producing reliable predictions. In this project, we use Google BigQuery for both data storage and model training. This approach minimizes data movement, takes advantage of GCP’s scalability, and keeps the entire machine learning pipeline centralized.


Data Schema

To build an effective churn prediction model, the data schema must capture customer attributes that correlate with churn likelihood. We define each column’s type and mode based on characteristics commonly linked to churn in telecommunications, such as tenure, contract type, and monthly charges. Using accurate data types, like STRING for categorical fields and FLOAT for numeric fields, helps BigQuery ML interpret and process the data correctly.

Schema Fields and Types

Each column in our schema has been chosen to provide insights into customer behaviors:

  • customerID (STRING): A unique identifier for each customer.
  • gender (STRING): Demographic data, useful for segmentation.
  • SeniorCitizen (INTEGER): Indicates whether the customer is a senior citizen, a factor that can affect churn patterns.
  • Contract (STRING): Identifies contract type, impacting churn likelihood.
  • MonthlyCharges (FLOAT): Financial metric, often a direct churn indicator.
  • Churn (BOOLEAN): The target variable, representing whether the customer has churned.

This schema creates a structured, organized dataset, making it easy to build and evaluate models.
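
For reference, the same structure can be expressed as BigQuery DDL, where INT64, FLOAT64, and BOOL are the Standard SQL equivalents of the INTEGER, FLOAT, and BOOLEAN types listed above. The statement below is a sketch only: in this project the table is actually created through Terraform (the google_bigquery_table resource referenced by the load job later on), and the dataset and table names are placeholders.

-- DDL sketch of the schema described above (table is created via Terraform in this project).
CREATE TABLE IF NOT EXISTS <DATASET_NAME>.<TABLE_NAME> (
  customerID     STRING,   -- unique customer identifier
  gender         STRING,   -- demographic attribute
  SeniorCitizen  INT64,    -- senior-citizen indicator
  Contract       STRING,   -- contract type
  MonthlyCharges FLOAT64,  -- monthly billing amount
  Churn          BOOL      -- target label
);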


Data Ingestion with BigQuery Jobs

Once the data schema is defined, we upload the raw customer data from Google Cloud Storage (GCS) to BigQuery using a BigQuery load job. This approach efficiently imports CSV data, ensuring consistency and data integrity.

Terraform Configuration for Data Loading

The following Terraform configuration sets up a BigQuery load job that ingests data from a CSV file in GCS into the BigQuery table. Key options include source_format for the CSV format, write_disposition to overwrite existing data, and null_marker to handle null values.

resource "google_bigquery_job" "load_customer_data" {
  job_id = "${var.project}_${random_string.this.result}"

  location = var.region

  load {
    destination_table {
      project_id = google_bigquery_table.this.project
      dataset_id = google_bigquery_table.this.dataset_id
      table_id   = google_bigquery_table.this.table_id
    }

    source_uris       = ["gs://${google_storage_bucket.data_bucket.name}/${google_storage_bucket_object.customer_data.name}"]
    source_format     = "CSV"
    write_disposition = "WRITE_TRUNCATE"
    skip_leading_rows = 1
    autodetect        = true
    field_delimiter   = ","
    null_marker       = "\\N"
  }

  labels = var.labels

  depends_on = [google_storage_bucket_object.customer_data]
}

Explanation

  • source_uris: Points to the CSV object in the GCS bucket where the raw data is stored.
  • source_format: Specifies that the input format is CSV.
  • write_disposition: Ensures that any previous data in the table is overwritten.
  • null_marker: Defines a marker for null values to avoid issues when processing incomplete data.
  • skip_leading_rows: Skips the first row of the CSV, which contains the column headers.

By automating the data load process with Terraform, we create a reproducible workflow that can be triggered consistently.


Common Errors and Troubleshooting

Data ingestion can sometimes result in errors. Below are common issues and how to address them:

  1. Schema Mismatch: Ensure the data types in the CSV file match the BigQuery schema exactly.
  2. NULL Handling: If null values aren't recognized, adjust the null_marker setting.
  3. Invalid Field Delimiters: Ensure the field_delimiter in the BigQuery job matches the CSV format (e.g., commas for CSV).
  4. Permission Errors: Verify that the service account has access to both GCS and BigQuery.

If errors occur, check the BigQuery job logs in the console for detailed messages. Making sure your data meets schema expectations helps avoid these common pitfalls.
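
If you prefer to check from SQL rather than the console, recent load jobs and their error details can also be pulled from BigQuery’s INFORMATION_SCHEMA jobs views. The query below is a sketch that assumes the dataset lives in the US multi-region; change the region qualifier to match var.region.

-- List recent load jobs and surface any error details.
SELECT
  job_id,
  state,
  error_result.reason  AS error_reason,
  error_result.message AS error_message
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'LOAD'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;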


Code Snippets for Data Validation

After loading data, validating it in BigQuery ensures that it meets the schema requirements and accurately reflects the raw data. The following SQL query checks the schema and reviews the first few rows of data.

SELECT *
FROM <DATASET_NAME>.<TABLE_NAME>
LIMIT 10;

Running this query confirms that all columns are loaded correctly and allows you to visually inspect sample data, verifying that columns such as customerID, MonthlyCharges, and Churn are correctly populated.
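
Beyond eyeballing sample rows, a few aggregate checks catch most load problems. The query below is a sketch using the same placeholder dataset and table names: it counts total rows, flags unexpected NULLs in key columns, and summarizes the Churn label.

-- Aggregate sanity checks on the loaded table.
SELECT
  COUNT(*)                        AS total_rows,
  COUNTIF(customerID IS NULL)     AS null_customer_ids,
  COUNTIF(MonthlyCharges IS NULL) AS null_monthly_charges,
  COUNTIF(Churn)                  AS churned_customers
FROM <DATASET_NAME>.<TABLE_NAME>;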


In this article, we’ve covered the steps for loading and preparing customer data for use in BigQuery ML. By setting up a robust data ingestion pipeline, we lay the foundation for training an accurate churn prediction model.
