
Intro to Google Dataflow (Hosted Apache Beam)

Posted on: February 6, 2022 at 12:00 AM



Background

Google’s Dataflow is a managed runner for Apache Beam. Apache Beam is a unified, portable batch and streaming data processing framework. Dataflow is one of the most popular Beam runners; FlinkRunner and SparkRunner are also popular choices in the community. Here’s a capability matrix for your reading pleasure.
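
Because Beam separates the pipeline definition from the runner, the same code can target Dataflow, Flink, Spark, or the local DirectRunner just by changing pipeline options. A minimal sketch with the Beam Python SDK; the project id and bucket paths are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline runs on any Beam runner; only the options change.
options = PipelineOptions(
    runner="DataflowRunner",             # or "DirectRunner", "FlinkRunner", "SparkRunner"
    project="my-gcp-project",            # hypothetical project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
     | "Lengths" >> beam.Map(len)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/line-lengths"))
```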

Why Dataflow

Overall Architecture

[Figure: overall architecture]

V2 portable runner

https://www.youtube.com/watch?v=tXdnPKPnY3E&list=PLjYq1UNvv2UcrfapfgKrnLXtYpkvHmpIh&t=396s

From code to dataflow job

https://www.youtube.com/watch?v=udKgN1_eThs&t=250s

Read

Summary

To summarize, a pipeline manifests in the following stages (a short code sketch follows the list):

  1. Code
  2. Graph (graph construction time)
    1. Traverses the pipeline from the main entry point to generate the nodes of the graph
    2. Executes locally on the machine that runs the pipeline (this step differs depending on whether a classic or Flex template is used)
    3. Validates all referenced resources (GCS, Pub/Sub, etc.)
    4. Checks for other errors
  3. Job (job creation time)
    1. Translated to JSON and sent to the Dataflow service endpoint
    2. The service validates the JSON and replies with a jobId
    3. Becomes a job on the Dataflow Service
  4. Job Execution Time
    1. The Dataflow service starts provisioning worker VMs. Serialized processing functions from the execution graph and the required libraries are downloaded to the worker VMs, and the Dataflow service starts distributing the data bundles to be processed on these VMs.
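
As a rough illustration of the split between graph construction time and job creation time: building the pipeline object only constructs the graph locally, and nothing is sent to the Dataflow service until run() is called. A sketch (project id and paths are hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical
)

# Graph construction time: the SDK traverses this code locally and builds the graph.
pipeline = beam.Pipeline(options=options)
(pipeline
 | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
 | "Upper" >> beam.Map(str.upper)
 | "Write" >> beam.io.WriteToText("gs://my-bucket/output"))

# Job creation time: run() translates the graph, submits it to the Dataflow
# service endpoint, and the service replies with a job id.
result = pipeline.run()

# Job execution time: the service provisions workers and distributes bundles;
# optionally block until the job finishes.
result.wait_until_finish()
```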

Sample execution graph (WordCount)

[Figure: sample execution graph for WordCount]
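
For reference, a graph like the one above comes from a WordCount pipeline along these lines (a minimal sketch; input/output paths are hypothetical):

```python
import re
import apache_beam as beam

# Runs on the local DirectRunner by default; point the options at Dataflow
# to reproduce the execution graph above.
with beam.Pipeline() as p:
    (p
     | "ReadLines" >> beam.io.ReadFromText("input.txt")
     | "ExtractWords" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | "CountWords" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
     | "Write" >> beam.io.WriteToText("counts"))
```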

Services unique to Dataflow

Shuffle Service

https://youtu.be/udKgN1_eThs?t=519

Streaming Engine

https://youtu.be/tXdnPKPnY3E?list=PLjYq1UNvv2UcrfapfgKrnLXtYpkvHmpIh&t=396

FlexRS

https://youtu.be/udKgN1_eThs?t=659
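
If memory serves, these three services are switched on through pipeline options rather than code changes; the flags below are my best recollection and worth double-checking against the current Dataflow docs (Dataflow Shuffle is now the default for batch jobs in supported regions):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Each flag applies to a different kind of job (Shuffle and FlexRS to batch,
# Streaming Engine to streaming); pick whichever matches your pipeline.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--experiments=shuffle_mode=service",  # Dataflow Shuffle (batch)
    "--enable_streaming_engine",           # Streaming Engine (streaming)
    "--flexrs_goal=COST_OPTIMIZED",        # FlexRS: delayed, cost-optimized scheduling (batch)
])
```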

Dataflow graph optimizations

https://youtu.be/udKgN1_eThs?t=74

Auto scaling and sharding

These are two different concepts. Personally, I think of them as:

Auto-sharding = dynamically adjusting partition (shard) sizes
Auto-scaling = dynamically adjusting the worker pool size
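
In practice, auto-scaling is configured through worker options, while auto-sharding is mostly a matter of letting the runner pick the shard count for sinks. A sketch, with hypothetical project id and paths:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",                  # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",        # hypothetical
    autoscaling_algorithm="THROUGHPUT_BASED",  # auto-scaling: Dataflow adjusts the worker pool
    max_num_workers=50,                        # upper bound for the worker pool
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/input*.txt")
     | beam.Map(len)
     # num_shards=0 (the default) lets the runner auto-shard the output files.
     | beam.io.WriteToText("gs://my-bucket/out", num_shards=0))
```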

Auto sharding and scaling

https://youtu.be/tXdnPKPnY3E?list=PLjYq1UNvv2UcrfapfgKrnLXtYpkvHmpIh&t=985

Auto-sharding

https://www.youtube.com/watch?v=udKgN1_eThs&t=461s

New features

Dataflow Prime

https://youtu.be/udKgN1_eThs?t=813

Resources