A bit of context first...
As some of you may know, in 2004 Google released the MapReduce paper, which became the cornerstone of a whole new set of open source technologies making up the big data ecosystem as we know it (Hadoop, Pig, Hive, Spark, Kafka, etc.). Meanwhile, Google followed its own path, developing other tools (not that open source, or even that well known) to process data for its own services.
In 2015, Google presented the Google Dataflow service as the culmination of that development, including it as a service within its Cloud platform. A year ago Google open-sourced the Dataflow SDK and donated it to the Apache Software Foundation under the name of Apache Beam.
Apache Beam is aiming pretty high. It tries to unify the two parallel roads taken by the open source community and Google, and to be a liaison between both ecosystems.
In this two-part post we will introduce Google Dataflow and Apache Beam. We’ll talk about their main features, and we’ll see some examples.
In this first post, we start with Google Dataflow.
So... what is Dataflow?
Dataflow is a unified cloud-based service developed by Google that allows us to process large amounts of data. These processes can be either batch or streaming.
This is a nice definition that we can find everywhere, but for the sake of clarification:
Google Dataflow is mainly composed of three parts:
- A workflow model and a programming tool, in other words, an SDK.
- A highly orchestrated infrastructure of clusters, deployed in the Google Cloud Platform.
- An online platform to monitor and manage processes (aka jobs).
The management and optimization of the cluster infrastructure is transparent to the user. For some developers this could be a con instead of a pro, but for Google it’s one of the main features: forget about resource management and configuration, and just focus on developing your business model!
What does Dataflow offer us?
Serverless. Dataflow is presented as a self-managed and transparent system for the user. In this way the user can forget about orchestration and configuration of clusters and focus on developing the business model.
Autoscaling. It is not known exactly what algorithm is used, but the fact is that Dataflow increases or decreases the number of «workers» (Google Compute Engine instances/virtual machines) depending on the needs of the process, so that it runs in a reasonable amount of time. The number of workers assigned may vary slightly depending on the time at which the process is launched or the region where our resources are hosted.
Unified model. Dataflow has a unified programming model for data processing such as ETL operations, batch, or streaming processes. In other words, we can use pretty much the same code to run both batch and streaming processes. The change from one model to another is made through minor adjustments.
Pipelines. Each process is defined as a pipeline, which represents the set of steps that read the data, apply whatever transformations we need, and finally store the results in an external sink.
SDK. Dataflow includes a development API for Java and Python.
What is a process in Dataflow?
Dataflow models a process as a pipeline. A pipeline is an ordered and controlled sequence of steps that reads data, transforms it, and finally stores it.
A pipeline consists of the following elements:
PCollection: Data structure that represents the set of data to be processed at a given step. Its size is virtually unlimited: it can hold a bounded dataset (batch) or an unbounded one (streaming).
Transform: An operation that we apply to the dataset in a PCollection. The output can be one or more PCollections.
I/O Sources and Sinks: They represent the source and destination of the data, which can be other Google platforms such as BigQuery, Cloud Storage, or Bigtable.
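To make these three pieces a bit more concrete, here is a minimal plain-Python sketch of the same shape: a source, a chain of transforms, and a sink. It deliberately does not use the Dataflow SDK. The names `read_source`, `run_pipeline`, `write_sink`, `strip_blank` and `to_upper` are hypothetical, invented for illustration, and ordinary Python iterables stand in for PCollections.

```python
from typing import Callable, Iterable, List

# In this sketch a "transform" is just a function from one collection to another.
Transform = Callable[[Iterable[str]], Iterable[str]]


def read_source(lines: List[str]) -> Iterable[str]:
    """Hypothetical source: yields records one by one, like reading a file."""
    for line in lines:
        yield line


def run_pipeline(source: Iterable[str], transforms: List[Transform]) -> Iterable[str]:
    """Apply each transform in order; the output of one step feeds the next."""
    collection = source
    for transform in transforms:
        collection = transform(collection)
    return collection


def write_sink(collection: Iterable[str], sink: List[str]) -> None:
    """Hypothetical sink: appends every result to an in-memory list."""
    sink.extend(collection)


def strip_blank(records: Iterable[str]) -> Iterable[str]:
    """Transform that drops empty or whitespace-only records."""
    return (r for r in records if r.strip())


def to_upper(records: Iterable[str]) -> Iterable[str]:
    """Transform that upper-cases every record."""
    return (r.upper() for r in records)


output: List[str] = []
write_sink(run_pipeline(read_source(["a", "", "b"]), [strip_blank, to_upper]), output)
print(output)
```

In the real service the collections are distributed across workers rather than held in one process, so this only mirrors the logical structure, not the execution model.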
Let’s see an example…
Let’s say we have a file in which each line carries a code and some numerical values associated with it, and we want to perform a couple of simple operations over the lines: counting the number of occurrences of each code, and averaging the numerical values associated with it.
But let’s try with a bigger sample: for instance, a file with 25 million lines. This file will be stored in a Google Cloud Storage directory, and so will the output file.
The next code will do the trick.
This main method uses a couple of custom transformations (ParDo transforms running our own functions) to implement the calculations we want to make. The other transformations are predefined in the SDK.
This transformation reads each line and loads it into a bean.
This other one performs the actual calculations.
And that’s pretty much it.
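As a rough illustration of those two steps (parse each line into a bean, then aggregate per code), here is a plain-Python sketch. It is not the Dataflow SDK code from this post. It assumes, purely for illustration, that each line looks like `code,value` with a single numeric value; the names `Record`, `parse_line` and `count_and_average` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Record(NamedTuple):
    """Stand-in for the 'bean' that holds one parsed line."""
    code: str
    value: float


def parse_line(line: str) -> Record:
    # ASSUMPTION: lines look like "code,value"; the real file format may differ.
    code, raw_value = line.strip().split(",", 1)
    return Record(code=code.strip(), value=float(raw_value))


def count_and_average(records: Iterable[Record]) -> Dict[str, Tuple[int, float]]:
    """Return {code: (occurrences, average value)} for all records."""
    totals = defaultdict(lambda: [0, 0.0])  # code -> [count, running sum]
    for record in records:
        totals[record.code][0] += 1
        totals[record.code][1] += record.value
    return {code: (count, total / count) for code, (count, total) in totals.items()}


lines = ["A,10", "B,4", "A,20"]
result = count_and_average(parse_line(line) for line in lines)
print(result)
```

On a distributed runner the grouping by code would happen across workers; here a single dictionary plays that role.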
Now, to execute this process we need to run this code locally, which will launch it as a job in the Dataflow management console. There we can see it as a DAG (Directed Acyclic Graph) that represents the pipeline we have designed.
Along with it, we get the information of the job that has been generated.
We also get some useful parameters about the resources used to run the job.
Since it seems that the process ended successfully, we access the Google Cloud Storage path that we specified in our code, and we find the generated output file (in fact, files! Because the output is written in parallel, Dataflow splits it into several shards by default rather than dumping it into a single file; in this case, the output is divided into four different files).
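If a single file is needed downstream, one option is to merge the shards after downloading them. The sketch below is plain Python working on local copies of the files. It assumes, as a labeled guess, that the shards are named like `output-00000-of-00004`, and `merge_shards` is a hypothetical helper, not part of any SDK.

```python
import glob
import os
import tempfile
from pathlib import Path
from typing import List


def merge_shards(pattern: str, merged_path: str) -> List[str]:
    """Concatenate all files matching `pattern`, in name order, into `merged_path`."""
    shard_paths = sorted(glob.glob(pattern))  # zero-padded names sort correctly
    with open(merged_path, "w", encoding="utf-8") as merged:
        for shard in shard_paths:
            merged.write(Path(shard).read_text(encoding="utf-8"))
    return shard_paths


# Small local demo with two fake shards (ASSUMED naming scheme).
with tempfile.TemporaryDirectory() as workdir:
    for index, text in enumerate(["a\n", "b\n"]):
        Path(workdir, f"output-{index:05d}-of-00002").write_text(text, encoding="utf-8")
    merged_file = os.path.join(workdir, "merged.txt")
    merged_from = merge_shards(os.path.join(workdir, "output-*-of-*"), merged_file)
    merged_text = Path(merged_file).read_text(encoding="utf-8")

print(len(merged_from), repr(merged_text))
```

Note that the order of records across shards is not necessarily meaningful, so the merged file should be treated as an unordered set of lines.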
The following is an example of the content output.
Google Dataflow is a really powerful tool, and quite simple to use, especially because it makes the whole hardware layer abstract and transparent, which allows us to focus on optimizing our business model and saving some time in the process. Although, of course, this is not free.
In addition, it offers native integration with different tools of the Google ecosystem, such as Pub/Sub, BigQuery, Bigtable or Cloud Storage.
On the other hand, Dataflow has some interesting control mechanisms while ingesting data in a PCollection. This is done by setting consumption windows and triggers depending on reception time of the data.
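To give an idea of what a window does, here is a plain-Python sketch that assigns timestamped events to fixed-size windows. It is only a conceptual model and does not use the Dataflow SDK: `assign_fixed_windows` is a hypothetical function, events are assumed to be `(timestamp in seconds, payload)` pairs, and triggers (which decide when a window’s contents are emitted) are left out entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload), an assumed shape


def assign_fixed_windows(events: Iterable[Event], size_seconds: float) -> Dict[float, List[str]]:
    """Group payloads by the start time of the fixed window they fall into."""
    windows: Dict[float, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # The window start is the timestamp rounded down to a multiple of the size.
        window_start = (timestamp // size_seconds) * size_seconds
        windows[window_start].append(payload)
    return dict(windows)


events = [(1.0, "a"), (59.0, "b"), (61.0, "c"), (125.0, "d")]
windows = assign_fixed_windows(events, size_seconds=60)
print(windows)
```

With 60-second windows, the first two events share a window and the other two each land in their own, which is what lets an unbounded stream be aggregated in finite pieces.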
Google Dataflow is a great set of tools with major potential. However, we do miss more openness in terms of integration with tools outside the Google ecosystem, such as Kafka, HDFS, Parquet or SQL. There are also some improvements that could be made, for instance, a job scheduler. Google suggests scheduling our jobs programmatically using the App Engine Cron Service, although this option is not consistent with the ease of use of the rest of the services. But hey, it’s a start.
See you in the next post, where we’ll talk about the open-sourced Dataflow: Apache Beam!