Building real time data pipelines with Kafka Streams I

Many interesting projects have been developed within the stream processing field in the last years. Most of us could name open source projects (Apache Spark, Apache Storm, Apache Flink) or proprietary services (Google DataFlow or AWS Lambda) that are very well-known solutions for real time scenarios.

However, the objective of this post to present Kafka Streams as an easy alternative to all these frameworks, explaining why we have chosen it as our real time layer in all the projects where the Kappa Architecture was suitable.

This article will be focused in Kafka Streams, which relies on Apache Kafka. (you don’t have be a Kafka hero… but certain knowledge of the basics of Kafka is presumed to follow this post). It will be explained from an architectural point of view, reviewing the characteristics of this technology and the conclusions that lead us to choose it. Anyhow, and in order to satisfy your insane coder needs, some of the more powerful features of Kafka Streams (windowing and stores) will be shown in a following article with an easy example.

Kafka Streams: technology review

According to Jay Kreps, Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics (or calls to external services, or updates to databases, or whatever)

Two meaningful ideas in that definition:

  1. The obvious one. The name of this technology should have provided you a lead on this: the source for our streaming pipeline must be a topic (or topics) in a Kafka cluster, and there are no other sources admitted for Kafka Streams. We can admit that not all the scenarios apply, but if you have a Kafka cluster in your architecture, please, keep on reading. It’s worth it.
  2. The excellent one. Kafka Streams is a library we can use in our Java or Scala applications. Let me repeat it for you: It’s only a library and not a distributed framework such as Spark or Flink. It means that there are no cluster manager or daemons to manage besides your Kafka cluster and your application. For this reason, applications developed using Kafka Streams are easier to code, deploy and operate. In addition to this, developing connectors to store information in external services is really simple.

Despite it’s only a library, Kafka Streams is able to solve a lot of the main problems that can be found in stream processing scenarios. These are the main features of this technology:

  • One event at a time processing with milliseconds latency
  • Stateful processing including windowing, joins and aggregations
  • Distributed processing and fault-tolerance with fast failover
  • Reprocessing capabilities so you can recalculate in case your code changes
  • No-downtime rolling deployments
  • Two levels API: a convenient DSL suitable for most of the cases, and the low-level processor API to be used in the more complex scenarios

The concepts

A stream is an unbounded dataset which is continuously updating. We create an stream by reading data from a Kafka topic (or topics). Like a topic, the stream is divided in one or more partitions. These partitions can be defined as an, ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair.

A stream processor is a node in the business logic that represents a single processing step.

A topology defines the computational logic of the data processing that needs to be performed by a stream processing application. All the input or output topics used inside a topology will be in the same Kafka cluster. Kafka Streams is not designed to exchange data between different clusters.

A streaming application defines one or more topologies, and execute that logic for all the events in the input topics. We must emphasize that these applications don’t run inside a Kafka broker. As they are client applications, the run in separate JVMs wherever we want to deploy them.

An application instance is any running instance or “copy” of your application. Application instances are the primary means to elasticly scale and parallelize your application, and they also contribute to making it fault-tolerant.

Parallelizing and scaling

Using the concept of consumer groups of Kafka, all the application instances will collaborate in the consumption of the events in a topic, so It’s very easy to parallelize and scale Kafka Streams applications. We will configure the consumer group name  of our streams application with the config parameter application.id

The main idea here is related to the maximum parallel tasks that an application can handle. This number can be inferred from the number of partitions of the input topics, as a streaming task will be created for each partition. These tasks will be executed by a thread in an application instance. Finally, each thread will execute at least one streaming task and thus it will consume at least messages from a partition.

 

So, given that, how can we scale and parallelize our application?
   * Horizontally, scaling instances of our application.
   * Vertically, adding new threads in each instance of our application using the config parameter num.stream.threads
   * Combining both solutions

Easy, isn’t it?. We will reach the maximum degree of parallelism in the moment we have the same number of threads consuming events as partitions in the input topic. In this scenario, if the throughput we’ve achieved isn’t enough for our requirements, the first action must be increasing the number of partitions of the topic and just then, scaling the number of consuming threads.

Fault Tolerance

In case of need, all the streaming tasks can be restarted without problem by different threads or even instances of an application.

How is this possible? The answer to this question is again related to Kafka, as it stores the last single processing point for each task.

As we said before, Kafka Streams processes one event at a time, and once an event has completed successfully a topology, its offset is committed to the cluster and a new event is asked to be processed. (the topic __consumer_offsets stores the last offset consumed for each partition). In case of error, Kafka Streams follows a fail-fast policy: processing is immediately stopped whenever an uncaught exception is throwed. A new streaming task will resume the consumption of the affected partition, starting in the same event that killed the original task.

Furthemore, we could have running more threads than partitions in the input topic. In this scenario, some threads would be in a passive status waiting for the death of an active thread in order to resume the processing of its partition.

This feature could represent a huge advantage in distributed and high available environments (like Kubernetes or Mesos + Marathon deployments), because an error during the process would cause an immediate restart. This is a perfect fit if your error is caused by a third party. Let’s imagine we are trying to write events into a database, which is not reachable for a while: the cluster manager will restart your application indefinitely until the database is available again, producing a self healing system.

But, what happens if we are causing the error?  For instance, we might receive an event we are not able to parse, producing an exception. Our application would die and it’d be restarted… but we would be stuck in and endless loop, as we’d be receiving the same message forever. Because of that, we should be careful with these kind of processing, and fortunately, Kafka Streams provide us with tools and patterns (for instance, using flatMap instead of map) that will help us to avoid them.

In conclusion, Kafka Streams imposes us a fail fast policy, but deciding which errors will cause the application to fail is up to us. We must be careful with those decisions in order to build robust systems.

In addition to this, if you are planning to use more advance features of Kafka Streams in your application, like stores or windowing, we have good news for you as well. They are also fault tolerant because they have been designed to use Kafka topics as backend, which means they are also replicated and partitioned.

Processing semantics

Since version 0.11 was released, exactly-one guarantees can be enabled inside the Kafka stack. However, whenever our processing pipeline involves other systems or protocols that can’t provide this quality of service, we will be talking about at least once semantics.

This QoS may be good enough for most applications, but in case your backend or your business logic don’t support idempotent writing, you will be forced to write additional custom code to avoid duplicates. The good news here are that Kafka Streams provide very useful tools for writing this deduplication logic, like key/value stores and a low level API which will ease this development.

Back pressure

Kafka Streams uses a depth-first processing strategy (an event goes through a whole topology before another event is requested). This way, events are stored in Kafka topics and no in-memory buffer is used to store data. That implies that we don’t need to implement any other back pressure mechanism. Besides, Kafka consumers are designed to use a pull-based model, which is an advantage in order to control the reading speed of the clients (for instance, poll records and interval, or bytes fetched by the server may be tuned in the consumer to control reading performance).

The same scheme applies in case of having more complex topologies, based in several sub-topologies, as events will be sent between them through Kafka topics.

stream1.to("topic");

stream2 = builder.stream("topic"); 

Dynamic routing

At the moment of writing this article, Kafka 2.0 has just  been released. This version includes one feature that we missed in older releases: dynamic routing. We find that choosing the destiny topic of an event at runtime according to its content is a very powerful trait.

Before this release, all output topics had to be known beforehand, and thus it was not possible to send output records dynamically. From now on, routing decisions could be made without restarting an application.

The solution adopted to solve this problem has been to overload the to method, including a TopicNameExtractor, which is an simple object

 public interface TopicNameExtractor<K, V> {
    String extract(K key, V value, RecordContext recordContext);
}

...

// KStream.java
void to(final TopicNameExtractor<K, V> topicChooser);

The only concern that cluster administrators must be aware of is that topics are not automatically created if they don’t exist while messages are being sent. In other words, topics can be chosen dynamically, but they must have been created previously.

Conclusions

In spite of being a young technology, Apache Kafka is widely used in big data architectures and has become almost a standard in real time messaging scenarios. Kafka Streams is a piece of the Kafka ecosystem that it’s evolving quickly lately, taking advantage of the traction that Kafka is having worldwide.

 

In addition to this, the fact that Kafka Streams is a library that can be used with any other Java dependencies, is a great advantage that must be considered when you are choosing a stream processing framework. This feature will allow you to integrate easily with any packaging or deployment technology, specially with cloud-native stacks, obtaining resilient applications, which will be easy to develop, operate and scale (at the moment we are using Spring boot, Docker and DC/OS)

Hey, it was great so far. Where is my code?

In a following article we will show some of the more powerful features with a full  but simple example: both APIs (DSL and processor API), windowing and key/value stores will be explained. Stay tuned for more rock ‘n roll…

David Oñoro

I'm a telecommunications engineer with more than 10 years of experience with event oriented architectures (messaging middlewares, complex event processing, etc). Over the last years, I've specialized myself in big data architectures, focusing special interest in real time and streaming technologies. When I'm not working in the Datio's architecture team, you can find me taking care of two little kids. Or maybe swimming in case I'm lucky...

More Posts

I’m a telecommunications engineer with more than 10 years of experience with event oriented architectures (messaging middlewares, complex event processing, etc). Over the last years, I’ve specialized myself in big data architectures, focusing special interest in real time and streaming technologies. When I’m not working in the Datio’s architecture team, you can find me taking care of two little kids. Or maybe swimming in case I’m lucky…

Deja un comentario

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *