Working on On-prem/External Airflow with Google Cloud Platform (GCP)

Ashish Patel
Published in Codebrace
4 min read · Mar 11, 2021


If you are just getting started with Airflow, Google Cloud Composer is the best solution: it creates all the required services, manages the Kubernetes cluster via GKE, and everything connects like magic.

But if you already have an on-prem Airflow installation, or Airflow running on some other cloud provider, and want to connect it to GCP, you will have to do a couple of things to get everything up and running.

Local Setup via Docker

You might want to test this setup in a local Airflow instance before deploying your DAG to your production instance.

There are 2 ways to install Airflow on your machine:
1. Running Airflow locally on your machine
2. Running Airflow via Docker

If you have already decided not to work with Docker, good luck with that; here are the docs you can refer to.
I love Docker, so that is the path I will help you with.

Step 1. Install Docker (see the official install guide for more).
Step 2. Use Docker Compose to get the instance up and running in a couple of minutes.

mkdir docker-local && cd docker-local
# download the docker-compose file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.0.1/docker-compose.yaml'
# initialize the Airflow environment
docker-compose up airflow-init
# start the services
docker-compose up

Airflow should now be available on port 8080 (http://localhost:8080 by default).
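If you want a quick programmatic check that the webserver is up, Airflow exposes a /health endpoint. A minimal sketch, assuming the requests package is available on your machine:

# quick sanity check that the Airflow webserver is reachable
import requests

resp = requests.get("http://localhost:8080/health")  # default port from the docker-compose file
print(resp.json())  # reports metadatabase and scheduler health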

Local setup for Airflow

For more details, refer to the official documentation.

Connecting Airflow to Google Cloud Platform

  • Now, this is the exciting part. First things first: you will have to choose a way for your Airflow to authenticate with GCP so that it can execute all the tasks.

Step 1 — Creating Service Account Key

  • There are mainly 2 ways to do it: using default credentials or using a service account key. With a key, you can use either the key file path or the key as JSON (see the provider docs for more).
  • We will be using the service account key as JSON (you can also use the key file path and store the file; just remember to add a volume in Docker so it can be found inside the container).
  • Let's create our service account key: select your project from the dropdown and go to GCP Console > IAM & Admin > Service Accounts.
Service Accounts on GCP
  • Now create a new service account or click on an existing one (it must have the required permissions to perform the operations), then click on Manage keys via the 3-dot menu on the right side.
  • Create a new key and store it safely.
  • Choose JSON as the type and create; the key file will be downloaded to your machine as JSON.
  • That's it; now it's time to create a connection in Airflow (if you want, you can sanity-check the key first, as sketched below).
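A quick way to verify that the downloaded key actually authenticates, before wiring it into Airflow, is to load it with the google-auth library (already installed if you have the Google provider). This is just an optional sketch; the key path is a placeholder.

# optional sanity check: load the service account key with google-auth
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/path/to/your-key.json"  # placeholder: the JSON key you just downloaded
)
print(creds.service_account_email, creds.project_id)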

Step 2 — Creating GCP Connection in Airflow

  • If we were using GCP Cloud Composer, these connections would already be configured for us by default (more in the Composer docs).
  • Here we will have to create them manually, so let's create one (a programmatic alternative is sketched after the note below).
  • Go to Admin > Connections > Add in the Airflow UI.
  • Choose the Google Cloud connection type, add your GCP project name, then copy the JSON from the key file and paste it into the Keyfile JSON field (remember: when you open this connection for editing next time the field will look empty, but the key is still stored securely in Airflow).
Creating Google Cloud Connection in Airflow

Note — Here the Conn Id is ‘google_cloud_default’, as this is the default connection name for GCP; if we choose another name, we will need to pass the connection id (gcp_conn_id) to the operators.
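If you prefer to create the connection in code rather than through the UI (handy for the Docker setup), here is a minimal sketch using Airflow's Connection model. The project id and key path are placeholders, and the prefixed extra field names are the ones used by the Google provider around Airflow 2.0.

# sketch: create the 'google_cloud_default' connection programmatically
import json

from airflow.models import Connection
from airflow.utils.session import create_session

with open("/path/to/your-key.json") as f:  # placeholder: path to the downloaded key
    keyfile_dict = f.read()

conn = Connection(
    conn_id="google_cloud_default",
    conn_type="google_cloud_platform",
    extra=json.dumps(
        {
            "extra__google_cloud_platform__project": "my-gcp-project",  # placeholder project id
            "extra__google_cloud_platform__keyfile_dict": keyfile_dict,
        }
    ),
)

with create_session() as session:
    # only add the connection if it does not exist yet
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)

You could run this as a one-off script inside the webserver container, or achieve the same with the airflow connections add CLI command.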

Step 3 — Running DAGs using a Dataproc Cluster

  • Now, let's create a sample DAG that connects to GCP and runs a word count program on a Dataproc cluster.

Please find the DAG here; a rough sketch of what it looks like follows.
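For readers who just want the shape of it, here is a minimal sketch of such a DAG, written against a recent apache-airflow-providers-google release. The project id, region, bucket and cluster sizing are placeholders, not values from the original DAG.

# sketch: word count on Dataproc using the classic Hadoop examples jar
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-gcp-project"   # placeholder: your project id
REGION = "europe-west1"         # placeholder: your region
CLUSTER_NAME = "wordcount-cluster"
BUCKET = "my-bucket"            # placeholder: your GCS bucket

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

# the word count example shipped with Dataproc images
WORDCOUNT_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "hadoop_job": {
        "main_jar_file_uri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
        "args": ["wordcount", f"gs://{BUCKET}/input/", f"gs://{BUCKET}/output/"],
    },
}

with DAG(
    dag_id="dataproc_wordcount",
    start_date=datetime(2021, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
        # uses the 'google_cloud_default' connection unless gcp_conn_id is set
    )

    run_wordcount = DataprocSubmitJobOperator(
        task_id="run_wordcount",
        project_id=PROJECT_ID,
        region=REGION,
        job=WORDCOUNT_JOB,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # tear the cluster down even if the job fails
    )

    create_cluster >> run_wordcount >> delete_cluster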

  • Upload the file to the dags/ folder on Airflow (if you are using the local setup, you can just put it into dags/ inside docker-local/).
  • Now, let's enable the DAG and run it.
  • After some time, we should see that it completes without any errors.
  • That's it, you have configured your Airflow to work with Google Cloud Platform.


Ashish Patel
Codebrace

Big Data Engineer at Skyscanner, loves competitive programming and Big Data.