In the first part of this post, we will explain why it is important to process incoming data as soon as possible, and the technical problems that arise when developing a stream processing system, problems that cloud providers try to solve. In the second part, we will show how a (very simple) kappa architecture can be deployed using managed services in Amazon Web Services (AWS) and Google Cloud Platform (GCP).
The rise of stream processing: fast data
Every second a company generates a huge amount of data that can be processed, and this amount is only increasing with the rise of the Internet of Things, where everything tends to be connected. Big Data emerged to solve the problem of processing these huge amounts of data with commodity hardware and extracting insights that produce value for the company or its customers. But what makes Big Data really “big” is the stream of input data generated at a very high rate, and this stream needs to be analyzed as soon as possible so that we can get insights and act or alert on an event with low latency.
Traditional Big Data processing saves the rapidly generated data into persistent data stores and then processes it in batches, with a high latency between when new data arrives and when it is processed. If the data is not processed quickly enough, the benefits of Big Data are lost, and this is where fast data comes in. It is no longer just about processing huge amounts of data; it is about doing it fast. Fast data gives the company the ability to react sooner to events and insights, increasing the value generated.
So, by now it’s clear that processing streaming data is necessary, but how can it be done? These kinds of systems need a complex architecture that can be built in different ways, and there are a lot of decisions to make.
When a new project starts and you work as a Big Data Architect, there are a lot of decisions to be made. You have to decide which technologies you are going to use, and that is not an easy task.
First, you have to choose a processing framework; for real-time analytics this could be Apache Spark, Apache Flink, Apache Storm, Kafka Streams, and so on.
Then, you have to define the way source data is ingested into the platform. This depends on the kind of source you are dealing with; options include Apache Kafka, Apache Sqoop, Apache Flume, Apache NiFi, etc.
Obviously, you also have to define where the data is going to be stored. This could be a distributed file system like HDFS, a NoSQL database like Apache HBase, a query engine like Impala, or a relational database (which could be an MPP database for better performance) like PostgreSQL.
This task becomes even more complicated when you consider that your platform may have to serve different use cases, and you will need to combine these different technologies to define a solution that covers all of them.
Consider the number of different names that have appeared so far, and this is only a small subset of the available open source tools.
Also, in case you aren’t overwhelmed enough, there is another important decision to be made: where will the platform be deployed? You can install all these technologies on your own servers (known as on-premise), or you can deploy the architecture on a cloud service. But even if you deploy your servers in the cloud, you still have to install, configure, and manage all the technologies the project needs. That is not a trivial thing to do and, without a good systems team, it can be very painful.
Cloud providers like Google, Amazon, or Microsoft have realized that Big Data teams often want to focus only on the data and are not comfortable installing and deploying the platform needed to process it, so in their clouds they offer managed services that allow you to set up the required resources with a simple command.
There is a need to process data that arrives at high rates with low latency in order to get insights fast, and that requires an architecture designed for it.
You may be wondering: what is a kappa architecture? It is an architecture for real-time processing systems that tries to resolve the disadvantages of the Lambda Architecture. It is better explained here. Briefly, it works in three layers: a message broker that receives the source data and makes it available for real-time consumption, a stream processing framework that transforms the input data into the required views, and a serving database that holds the transformed, up-to-date data ready to be queried.
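The three layers can be sketched in a few lines of plain Python. This is only an illustrative toy under obvious assumptions: an in-memory queue stands in for the message broker (Kafka in a real deployment), a function for the stream processing framework, and a dictionary for the serving database.

```python
from collections import defaultdict
from queue import Queue

# Layer 1: message broker (stand-in for Kafka or a similar system).
broker = Queue()

# Layer 3: serving database (stand-in for HBase, PostgreSQL, etc.).
serving_db = defaultdict(int)

def process_stream(broker, serving_db):
    """Layer 2: stream processor. Consumes events one by one and keeps
    an always-up-to-date view (here, a simple count per event type)."""
    while not broker.empty():
        event = broker.get()
        serving_db[event["type"]] += 1  # update the materialized view

# A producer publishes events to the broker...
for event_type in ["click", "click", "purchase"]:
    broker.put({"type": event_type})

# ...the processor turns them into the required view...
process_stream(broker, serving_db)

# ...and the view can be queried with low latency.
print(dict(serving_db))  # {'click': 2, 'purchase': 1}
```

In a real kappa architecture the processor runs continuously against an append-only log rather than draining a finite queue, but the flow of data through the three layers is the same.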
I first got in touch with Big Data using Spark and the Hadoop ecosystem, working with cloud providers. Now, as a Big Data Developer, I work with IaaS, Terraform, Ansible, Scala, Spark… I am a vocational Computer Scientist and I enjoy learning new technologies and discovering new things.