What is Apache Beam?
Apache beam is an open source, unified programming model that defines and executes data processing pipelines. These pipelines can be both batch and streaming. It is exposed via several sdks that allow to execute a pipeline in different processing engines, aka, runners. The supported runners so far are:
- Apache Spark
- Apache Flink
- Apache Apex
- Google Cloud Dataflow
- Apache Gearpump. Still incubating though
- DirectRunner. A local runner for tests and debug
In other words, what Apache Beam promises is that we can execute the same code in each of these runners, with fundamentally no changes, saving the costs and time that would involve to do so. However, we must point out that, of course, changes have to be made, even though they are minor and related the runner that we are going to use. In any case, the data processing code (the transformations) remains invariable.
As we discussed in the first part of this post. Apache Beam, is originated from Dataflow sdk, so it inherits basically all its benefits as well as the internal programming model. For more details, check out the post.
Beside of that, Apache Beam aims to be a kind of a bridge between both Google and open source ecosystems.
A code example
Apache Beam supports java and python. And since it’s more widespread, we will see an example in java. So, let’s see what an Apache Beam code looks like.
The following is executed with the Dataflow runner.
And this other one, with the Spark runner.
As we can see, they differ only in the creation of that pipeline options that do depend on the Runner (the type, the input/output routes and some more necessary parameter for the runner used). Once this initialization is covered, the rest of the code is identical (for more details see the example on the first post). Of course this is a very simple example but in a real project this part is way much bigger.
Let’s run this thing
We run both codes and we obtain the following DAGs.
First, from Google Dataflow console:
And then from Spark UI, which internally transforms it into two jobs.
The first job would be something like this:
And one of the stages of it would be:
On paper, Apache Beam offers what no other tool is capable to give. That is, being able to run a same code in Spark, Dataflow, Flink… with just some small adjustments and without growing any trauma in the process. In its favor plays that the sdk expose a simple and manageable code semantics allowing us to encapsulate the data in beans and manipulating them in any form. In fact, the execution of these examples has been surprisingly easy. And as Dataflow sdk, Beam offers an impressive flexibility, in terms of using the same code in both batch and stream mode.
The not so great stuff are that Beam is not heavily tested enough in real and complex projects, in order to see the performance of that promised flexibility. On the other hand, Google/Apache Beam claims that when abstracting the features of the different runners, they have opted neither for the intersection of all the common functionalities (too few), nor for the union of all of them (too complex). Instead of that, the approach taken was to bet on «where data processing should be going» according to the Beam Project Management Committee, and expect to influence the roadmap of the different supported engines. That is why they maintain a matrix of capabilities with each supported feature per runner.
In any case, Apache Beam is a player to not lose track of.
I’m fascinated by how some people -and by extension, some teams- performs at the highest with great levels of engagement, commitment and motivation, while others not that much. My job as a Scrum Master is to understand this gap and be able to help the teams I work with to grow. When I’m not doing this, I love watching movies, travel and drink tea.