In the first part of the post, we introduced the need for stream data processing and how difficult it is for a Big Data Architect to design a solution to accomplish it. We also presented the Kappa Architecture as a stream data processing model, which we will now use to show how cloud providers try to reduce the complexity of deploying this kind of system. Let's start; clear your mind, this is going to be dense…
Deploying Kappa Architecture on the cloud
The following pictures show how the Kappa Architecture looks on AWS and GCP.
They look similar, right? In fact, they are very close to each other, as we will see by diving a little deeper.
The heart: message broker
As we said, the core of the Kappa Architecture is the message broker. The original post refers directly to Apache Kafka, a distributed and fault-tolerant publish-subscribe messaging system. In Kafka, N publishers send messages into a topic (which can be partitioned to improve performance and reliability). Consumers "pull" messages from a topic, so there can be N consumers reading from the same topic, and all messages are read by all consumers. Both AWS and GCP have a managed service that works like Kafka:
- Amazon Kinesis: this is the AWS solution for publish-subscribe messaging. The Kafka topic equivalent is called a stream. A stream is composed of shards, and the number of shards provisioned determines the throughput, and also the price. Each shard supports an input rate of 1 MB/s, up to 1,000 record puts per second, and an output (read) rate of 2 MB/s. The maximum record size is 1 MB. The number of shards can be modified during the life of the stream. So, to have a "topic", you only have to create a stream and set the number of shards. You pay for each hour a shard is deployed and for the PUT API requests made. Kinesis also has native integration with Spark Streaming, just like Kafka.
- Cloud Pub/Sub: the same kind of service, this time from GCP. It's even easier to manage than Kinesis: you only have to create a topic and register each subscription (consumer). Then it works like Kafka, and it scales up and down automatically to handle your requests. By default it's limited to a 100 MB/s input rate and a 200 MB/s output rate, but you can ask Google to raise these limits if you need to. You pay per operation (writes and reads), and it gets cheaper as the number of operations grows. It's not natively supported in Spark, but there is some code on the internet that can help you with that.
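Those per-shard limits turn Kinesis capacity planning into a small calculation. Here is a sketch, assuming the published per-shard quotas above and a made-up workload:

```python
import math

# Published per-shard limits for a Kinesis stream (see above).
SHARD_IN_MBPS = 1.0          # 1 MB/s write throughput per shard
SHARD_RECORDS_PER_S = 1000   # 1,000 record puts per second per shard
SHARD_OUT_MBPS = 2.0         # 2 MB/s read throughput per shard

def shards_needed(in_mbps, records_per_s, out_mbps):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        math.ceil(in_mbps / SHARD_IN_MBPS),
        math.ceil(records_per_s / SHARD_RECORDS_PER_S),
        math.ceil(out_mbps / SHARD_OUT_MBPS),
    )

# Hypothetical workload: 5 MB/s in, 3,500 records/s, 12 MB/s out (several readers).
print(shards_needed(5, 3500, 12))  # → 6, driven by the read throughput
```

Since you pay per shard-hour, the cheapest correct choice is exactly this maximum; resizing the stream later changes the number, not the formula.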
Where amazing happens: stream processing framework
In the first part of the post we talked about the complexity of installing, configuring and managing the required resources, and this affects the processing framework especially. Due to its performance, the processing framework selected is Spark Streaming. Imagine that you could start a cluster with Spark and Hadoop on YARN by submitting a one-line command, and in less than 10 minutes you would have a cluster configured and ready to go. Awesome, right?
This is the proposal of Amazon EMR (AWS) and Cloud Dataproc (GCP). Both are nearly identical, so they will be explained together. These services let you decide how many nodes you want, the type (CPU, memory and disk space) of the master and worker nodes, and whether you want the cluster to stay running to receive jobs or to execute a single job and shut down when it finishes. The service then creates the required nodes (instances) in Amazon EC2 or Google Compute Engine and installs Spark configured to work with those nodes. In addition, you can resize a running cluster, adding or removing nodes, and the service handles it transparently (which is not a trivial thing). Both services let you use low-cost instances as nodes to get extra performance when needed without increasing costs (spot instances as task nodes in AWS, preemptible workers in GCP). These instances only offer resources to YARN (NodeManager) and don't store HDFS data (DataNode), so if the provider reclaims them there is no data loss.

Bootstrapping the cluster takes around 2 minutes in GCP and 7 minutes in AWS (this may change). EMR lets you choose which products to install (Hadoop, Spark, Sqoop, Zeppelin, Impala, etc.), while Dataproc automatically installs Spark, Hadoop, Hive and Pig. You can also use initialization scripts to install any other software you want (bootstrap actions in AWS, initialization actions in GCP), with official script repositories from Amazon and Google. With these services you pay for the time each node is deployed, according to the kind of node it is, plus an additional charge for using the managed service. An important difference is that Google bills by the minute, while Amazon bills by the hour. This becomes significant for short jobs that only take a couple of minutes.
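To see how much the billing granularity matters, here is a rough sketch. The $0.30/node-hour price and the 12-minute job are made-up numbers, and real pricing also adds the managed-service surcharge and may impose minimum billing periods:

```python
import math

def cost(job_minutes, nodes, price_per_node_hour, granularity_minutes):
    """Cost of a job when usage is rounded UP to the billing granularity."""
    billed_units = math.ceil(job_minutes / granularity_minutes)
    return nodes * billed_units * price_per_node_hour * (granularity_minutes / 60)

# Hypothetical: 10 nodes at $0.30/node-hour running a 12-minute job.
per_minute = cost(12, 10, 0.30, 1)    # GCP-style per-minute billing
per_hour = cost(12, 10, 0.30, 60)     # AWS-style per-hour billing
print(per_minute, per_hour)  # → 0.6 vs 3.0: per-hour billing costs 5x for this job
```

For long-running clusters the rounding is negligible; it is precisely the "submit one short job, destroy the cluster" pattern where the granularity dominates the bill.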
What happens to the data stored in HDFS when the cluster is stopped? The moment the cluster stops, that data disappears, because these services use ephemeral storage. Luckily, there is a solution: use the object storage provided by Amazon S3 and Google Cloud Storage as persistent storage. This way, you can deploy multiple clusters that access and share the same data. With this idea in mind, Amazon and Google have integrated these storage services with EMR and Dataproc. They provide a Hadoop library that gives transparent, optimized access to the storage service through the s3:// or gs:// prefix. This means that inside an EMR or Dataproc cluster, Spark can read and write directly to/from the object storage service. Hadoop has native access to Amazon S3 with the correct configuration, but EMR offers an optimized version called EMRFS to improve performance. Amazon S3 and Google Cloud Storage are also managed services, which means you can upload your files and the provider handles availability, replication, access and capacity. You don't have to worry about any of that; in fact, there is no capacity limit, it scales up automatically. With these services you pay for consumed capacity, requests (read, write, list, etc.) and network traffic leaving the provider's network.
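This pattern is worth making concrete. Below is a toy simulation in plain Python (no cloud APIs involved; the dictionaries merely stand in for HDFS and the object store) showing why results written to s3:// or gs:// survive the cluster that produced them:

```python
class ObjectStore:
    """Stands in for S3 / Cloud Storage: it outlives any cluster."""
    def __init__(self):
        self._objects = {}
    def write(self, path, data):
        self._objects[path] = data
    def read(self, path):
        return self._objects[path]

class Cluster:
    """Stands in for an EMR / Dataproc cluster with ephemeral HDFS."""
    def __init__(self, store):
        self.store = store
        self.hdfs = {}                      # ephemeral: gone when the cluster stops
    def run_job(self, out_path, result):
        self.hdfs["/tmp/scratch"] = "intermediate data"
        self.store.write(out_path, result)  # persist results outside the cluster
    def stop(self):
        self.hdfs.clear()                   # stopping the cluster wipes HDFS

store = ObjectStore()
c1 = Cluster(store)
c1.run_job("gs://bucket/results", "word counts")
c1.stop()                                   # HDFS contents are lost here...

c2 = Cluster(store)                         # ...but a new cluster still sees them
print(c2.store.read("gs://bucket/results"))  # → word counts
```

Intermediate data can live in HDFS for speed; anything you want to keep goes to the object store, which is exactly what lets several short-lived clusters share a dataset.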
So, we have an elastic cluster that we can run only for the time we need and then destroy, and elastic storage that persists the data with high availability, and all of it is done from the command line. This is really helpful.
Accessing the results: serving database
There is only one remaining layer to explore: the serving and storage layer. The kind of storage service needed depends on the data and on what is going to be done with it, so this part is less detailed than the previous ones.
In the previous layer we already talked about Amazon S3 and Google Cloud Storage as object storage services to save data as files.
If your use case fits better with a NoSQL database, Amazon DynamoDB and Google Cloud Bigtable are the managed services you are looking for.
Also, you can use a managed relational database to save your data, with Amazon RDS and Google Cloud SQL.
Finally, for better analytics performance, you have Amazon Redshift and Google BigQuery. These products are not strictly equivalent. Amazon Redshift is a distributed relational MPP (massively parallel processing) database built from PostgreSQL, with columnar storage to improve performance. In Redshift, performance is determined by how many nodes you use, and you pay for that number of nodes. BigQuery is a fully managed columnar data warehouse into which you upload your data to analyze it with SQL. Google uses as many resources as it needs to give you the best performance, so you pay for the amount of data processed by each query, in addition to the storage capacity used. Google says that BigQuery can scan petabytes of data in less than a minute. Impressive.
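The two pricing models reward different workloads, which a back-of-the-envelope sketch makes clear (every number below is a made-up placeholder, not a current list price):

```python
def redshift_monthly_cost(nodes, price_per_node_hour, hours=730):
    """Redshift-style: you pay for provisioned nodes, busy or idle."""
    return nodes * price_per_node_hour * hours

def bigquery_monthly_cost(tb_scanned, price_per_tb, tb_stored, storage_price_per_tb):
    """BigQuery-style: you pay per TB processed by queries, plus storage."""
    return tb_scanned * price_per_tb + tb_stored * storage_price_per_tb

# Hypothetical month: a 4-node cluster vs 50 TB scanned over 2 TB stored.
print(redshift_monthly_cost(4, 0.25))           # → 730.0
print(bigquery_monthly_cost(50, 5.0, 2, 20.0))  # → 290.0
```

With occasional queries the per-query model wins; with a cluster that is saturated around the clock, paying for fixed nodes can come out cheaper. The right choice depends on your query pattern, not just the raw prices.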
Google Bonus Level: Google Cloud Dataflow
Google has another stream processing framework that is scalable and managed: Cloud Dataflow. Cloud Dataflow is composed of two main parts: an open SDK used to develop the transformations that will process the data (defining a pipeline), and the managed service that executes these pipelines.
The SDK brings a new model that combines batch and stream processing (especially powerful) in a unified language. The SDK is open source and is now part of the Apache Beam project, which aims to provide an SDK in which you write a pipeline once and can then launch it on a variety of runners such as the Dataflow service, Spark or Flink, providing a top-level abstraction and unification across different frameworks.
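That "write the pipeline once, choose the runner later" idea can be caricatured in a few lines of plain Python. This is not the Beam API (whose real primitives are `Pipeline`, `PCollection` and `PTransform`); it is only a toy sketch of the abstraction:

```python
class Pipeline:
    """Toy stand-in for a Beam-style pipeline: a declared chain of transforms."""
    def __init__(self):
        self.transforms = []
    def apply(self, fn):
        self.transforms.append(fn)
        return self                    # allow chaining

class LocalRunner:
    """One of several interchangeable runners (think Dataflow, Spark, Flink)."""
    def run(self, pipeline, data):
        for fn in pipeline.transforms:
            data = fn(data)
        return data

# The pipeline is declared once, independently of any runner...
p = Pipeline()
p.apply(lambda xs: (x.lower() for x in xs)) \
 .apply(lambda xs: (x for x in xs if x.startswith("s"))) \
 .apply(list)

# ...and only the choice of runner decides where it executes.
print(LocalRunner().run(p, ["Spark", "Flink", "Storm"]))  # → ['spark', 'storm']
```

In real Beam, swapping `LocalRunner` for the Dataflow, Spark or Flink runner is a configuration change; the pipeline definition stays the same.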
The Dataflow service is based on the MillWheel and FlumeJava papers for stream processing and executes the pipeline in a managed service. You can set the number and kind of nodes that will process the pipeline (just like Dataproc), or you can let Google scale it dynamically with the workload. It takes less time to provision (around a minute) than Dataproc and performs very well. It is also integrated with Google services like Pub/Sub, Cloud Storage and BigQuery, making it easy to interoperate.
Google and Amazon offer different services that allow you to deploy a Big Data architecture in a fully managed way, without needing to install or configure anything. In this post we have shown the managed alternatives to open source products, which can help you move between environments (on-premise, GCP, AWS) in a more straightforward way.
However, there are some disadvantages to using managed services. The first is that they are more expensive than the open source products (which are free; you only pay for the servers where they are installed). The second is that products like the publish-subscribe services are similar to Kafka, but not identical, so using Kafka itself can have some advantages. In contrast to managed services, when you control the installation you have more ability to fine-tune your platform and improve its performance. Finally, some enterprise data is critical and you have to control it at every stage of the process; it's not as simple as letting the cloud provider handle it without knowing what's happening inside.
I got started with Big Data using Spark and the Hadoop ecosystem, working with cloud providers. Now, as a Big Data Developer, I work with IaaS, Terraform, Ansible, Scala, Spark… I am a vocational Computer Scientist and I enjoy learning new technologies and discovering new things.