spark-operator

Integration with Google Cloud Storage and BigQuery

This document describes how to use Google Cloud services, such as Google Cloud Storage (GCS) and BigQuery, as data sources or sinks in SparkApplications. For a detailed tutorial on building Spark applications that access GCS and BigQuery, please refer to Using Spark on Kubernetes Engine to Process Data in BigQuery.

A Spark application requires the GCS and BigQuery connectors to access GCS and BigQuery using the Hadoop FileSystem API. One way to make the connectors available to the driver and executors is to use a custom Spark image with the connectors built-in, as this example Dockerfile shows. An image built from this Dockerfile is located at gcr.io/ynli-k8s/spark:v2.3.0-gcs.
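A minimal Dockerfile along these lines might look as follows. This is an illustrative sketch, not the contents of the example Dockerfile: the base image tag, connector jar URLs, and paths are assumptions and should be matched to your Spark and Hadoop versions.

```dockerfile
# Sketch: add the GCS and BigQuery connector jars to a Spark base image.
# Base image tag and connector URLs below are assumptions; check the
# connector release pages for versions matching your Spark/Hadoop build.
FROM gcr.io/spark-operator/spark:v2.3.0

# Place the connector jars on Spark's classpath.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar /opt/spark/jars/
ADD https://storage.googleapis.com/hadoop-lib/bigquery/bigquery-connector-latest-hadoop2.jar /opt/spark/jars/

# Custom Hadoop configuration files baked into the image (see below).
COPY conf/core-site.xml /opt/spark/conf/
COPY conf/spark-env.sh /opt/spark/conf/
```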

The connectors require certain Hadoop properties to be set properly to function. Hadoop properties can be set either through a custom Hadoop configuration file, namely core-site.xml, in a custom image, or via the spec.hadoopConf section in a SparkApplication. The example Dockerfile mentioned above shows the use of a custom core-site.xml and a custom spark-env.sh that points the environment variable HADOOP_CONF_DIR to the directory in the container where core-site.xml is located. The example core-site.xml and spark-env.sh can be found here.
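As a sketch, a spark-env.sh along these lines makes Spark pick up a core-site.xml placed under /opt/spark/conf. The directory path is an assumption; the actual example files are the ones linked above.

```shell
# spark-env.sh (sketch): point Hadoop at the directory holding core-site.xml.
# /opt/spark/conf is an assumed location; match it to where your image
# copies the custom Hadoop configuration files.
export HADOOP_CONF_DIR=/opt/spark/conf
```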

The GCS and BigQuery connectors need to authenticate with the GCS and BigQuery services before they can use them. The connectors support using a GCP service account JSON key file for authentication. The service account must be granted the IAM roles necessary for accessing GCS and/or BigQuery. The tutorial has detailed information on how to create a service account, grant it the right roles, furnish a key, and download the JSON key file. To tell the connectors to use a service account JSON key file for authentication, the following Hadoop configuration properties must be set:

google.cloud.auth.service.account.enable=true
google.cloud.auth.service.account.json.keyfile=<path to the service account JSON key file in the container>

The most common way of getting the service account JSON key file into the driver and executor containers is to mount the key file in through a Kubernetes secret volume. Detailed information on how to create a secret can be found in the tutorial.
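As a sketch, such a secret could be created from a downloaded key file like this. The secret name gcs-bq and the key file name key.json are chosen to line up with the example manifest; the local path is a placeholder for wherever you saved the key.

```shell
# Create a Kubernetes secret holding the service account key file.
# key.json becomes the file name under the mount path inside the containers.
kubectl create secret generic gcs-bq --from-file=key.json=/path/to/downloaded/key.json
```

With the mount path /mnt/secrets used in the example below, the key file then appears in the containers as /mnt/secrets/key.json, which is the value given to google.cloud.auth.service.account.json.keyfile.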

Below is an example SparkApplication using the custom image at gcr.io/ynli-k8s/spark:v2.3.0-gcs with the GCS/BigQuery connectors and the custom Hadoop configuration files above built-in. Note that some of the necessary Hadoop configuration properties are set using spec.hadoopConf. Those properties are in addition to the ones set in the built-in core-site.xml; they are set here instead of in core-site.xml because of their application-specific nature, whereas the ones in core-site.xml apply to all applications using the image. Also note how the Kubernetes secret named gcs-bq that stores the service account JSON key file gets mounted into both the driver and executors. The environment variable GCS_PROJECT_ID must be set when using the image at gcr.io/ynli-k8s/spark:v2.3.0-gcs.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: foo-gcs-bg
spec:
  type: Java
  mode: cluster
  image: gcr.io/ynli-k8s/spark:v2.3.0-gcs
  imagePullPolicy: Always
  hadoopConf:
    "fs.gs.project.id": "foo"
    "fs.gs.system.bucket": "foo-bucket"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/key.json"
  driver:
    cores: 1
    secrets:
    - name: "gcs-bq"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: foo
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "512m"
    secrets:
    - name: "gcs-bq"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: foo
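Assuming the manifest above is saved as spark-gcs-bq.yaml (a hypothetical file name), the application can then be submitted to a cluster where the operator is running in the usual way:

```shell
# Submit the SparkApplication custom resource to the cluster.
kubectl apply -f spark-gcs-bq.yaml

# Check the application state reported by the operator.
kubectl get sparkapplication foo-gcs-bg -o=jsonpath='{.status.applicationState.state}'
```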