Java Archives - Kai Waehner

When to choose Redpanda instead of Apache Kafka?

Kai Waehner — Wed, 16 Nov 2022 03:19:39 +0000

Data streaming emerged as a new software category. It complements traditional middleware, data warehouse, and data lakes. Apache Kafka became the de facto standard. New players enter the market because of Kafka’s success. One of those is Redpanda, a lightweight Kafka-compatible C++ implementation. This blog post explores the differences between Apache Kafka and Redpanda, when to choose which framework, and how the Kafka ecosystem, licensing, and community adoption impact a proper evaluation.

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives of using Apache Kafka (and related products, including Confluent) or Redpanda. I talk to enterprises across the globe every week. Below, I summarize common misunderstandings or missing knowledge about both technologies. I hope it helps you to make the right decision. Either choose to run open-source Apache Kafka, one of the various commercial Kafka offerings or cloud services, or Redpanda. All are great options with pros and cons…

Data streaming: A new software category

Data-driven applications are the new black. As part of this, data streaming is a new software category. If you don’t understand yet how it differs from other data management platforms like a data warehouse or data lake, check out the following blog series:

And if you wonder how Apache Kafka differs from other middleware, check out how Kafka fits into comparison with ETL, ESB, and iPaas.

Apache Kafka: The de facto standard for data streaming

Apache Kafka became the de facto standard for data streaming similar to Amazon S3 is the de facto standard for object storage. Kafka is used across industries for many use cases.

The adoption curve of Apache Kafka

The growth of the Apache Kafka community in the last years is impressive:

>100,000 organizations using Apache Kafka
>41,000 Kafka Meetup attendees
>32,000 Stack Overflow Questions
>12,000 Jiras for Apache Kafka
>31,000 Open Job Listings Request Kafka Skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Source: Sonatype

The numbers grow exponentially. That’s no surprise to me as the adoption pattern and maturity curve for Kafka are similar in most companies:

Start with one or few use cases (that prove the business value quickly)
Deploy the first applications to production and operate them 24/7
Tap into the data streams from many domains, business units, and technologies
Move to a strategic central nervous system with a decentralized data hub

Kafka use cases by business value across industries

The main reason for the incredible growth of Kafka’s adoption curve is the variety of potential use cases for data streaming. The potential is almost endless. Kafka’s characteristics of combing low latency, scalability, reliability, and true decoupling establish benefits across all industries and use cases:

Search my blog for your favorite industry to find plenty of case studies and architectures. Or to get started, read about use cases for Apache Kafka across industries.

The emergence of many Kafka vendors

The market for data streaming is enormous. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Most vendors use Kafka or implement its protocol because Kafka has become the de facto standard for data streaming.

Learn more about the various data streaming vendors in the following blog posts:

To be clear: An increasing number of Kafka vendors is a great thing! It proves the creation of a new software category. Competition pushes innovation. The market share is big enough for many vendors. And I am 100% convinced that we are still in a very early stage of the data streaming hype cycle…

After a lengthy introduction to set the context, let’s now review a new entrant into the Kafka market: Redpanda…

Introducing Redpanda: Kafka-compatible data streaming

Redpanda is a data streaming platform. Its website explains its positioning in the market and product strategy as follows (to differentiate it from Apache Kafka):

No Java: A JVM-free and ZooKeeper-free infrastructure.
Designed in C++: Designed for a better performance than Apache Kafka.
A single-binary architecture: No dependencies to other libraries or nodes.
Self-managing and self-healing: A simple but scalable architecture for on-premise and cloud deployments.
Kafka-compatible: Out-of-the-box support for the Kafka protocol with existing applications, tools, and integrations.

This sounds great. You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with “real Apache Kafka”.

How to choose the proper “Kafka” implementation for your project?

A recommendation that some people find surprising: Qualify out first! That’s much easier. Similarly, like I explained when NOT to use Apache Kafka.

As part of the evaluation, the question is if Kafka is the proper protocol for you. And for Kafka, pick different offerings and begin with the comparison.

Start your evaluation with the business case requirements and define your most critical needs like uptime SLAs, disaster recovery strategy, enterprise support, operations tooling, self-managed vs. fully-managed cloud service, capabilities like messaging vs. data ingestion vs. data integration vs. applications, and so on. Based on your use cases and requirements, you can start qualifying out vendors like Confluent, Redpanda, Cloudera, Red Hat / IBM, Amazon MSK, Amazon Kinesis, Google Pub Sub, and others to create a shortlist.

The following sections compare the open-source project Apache Kafka versus the re-implementation of the Kafka protocol of Redpanda. You can use these criteria (and information from other blogs, articles, videos, and so on) to evaluate your options.

Similarities between Redpanda and Apache Kafka

The high-level value propositions are the same in Redpanda and Apache Kafka:

Data streaming to process data in real-time at scale continuously
Decouple applications and domains with a distributed storage layer
Integrate with various data sources and data sinks
Leverage stream processing to correlate data and take action in real-time
Self-managed operations or consuming a fully-managed cloud offering

However, the devil is in the details and facts. Don’t trust marketing, but look deeper into the various products and cloud services.

Deployment options: Self-managed vs. cloud service

Data streaming is required everywhere. While most companies across industries have a cloud-first strategy, some workloads must stay at the edge for different reasons: Cost, latency, or security requirements. My blog about use cases for Apache Kafka at the edge is still one of the most read articles I have written in recent years.

Besides operating Redpanda by yourself, you can buy Redpanda as a product and deploy it in your environment. Instead of just self-hosting Redpanda, you can deploy it as a data plane in your environment using Kubernetes (supported by the vendor’s external control plane) or leverage a cloud service (fully managed by the vendor).

The different deployment options for Redpanda are great. Pick what you need. This is very similar to Confluent’s deployment options for Apache Kafka. Some other Kafka vendors only provide either self-managed (e.g., Cloudera) or fully managed (e.g., Amazon MSK Serverless) deployment options.

What I miss from Redpanda: No official documentation about SLAs of the cloud service and enterprise support. I hope they do better than Amazon MSK (excluding Kafka support from their cloud offerings). I am sure you will get that information if you reach out to the Redpanda team, who will probably soon incorporate some information into their website.

Bring your own Cluster (BYOC)

There is a third option besides self-managing a data streaming cluster and leveraging a fully managed cloud service: Bring your own Cluster (BYOC). This alternative allows end users to deploy a solution partially managed by the vendor in your own infrastructure (like your data center or your cloud VPC).

Here is Redpanda’s marketing slogan: “Redpanda clusters hosted on your cloud, fully managed by Redpanda, so that your data never leaves your environment!”

This sounds very appealing in theory. Unfortunately, it creates more questions and problems than it solves:

How does the vendor access your data center or VPC?
Who decides how and when to scale a cluster?
When to act on issues? How and when do you roll a cluster to incorporate bug fixes or version upgrades?
What about cost management? What is the total cost of ownership? How much value does the vendor solution bring?
How do you guarantee SLAs? Who has to guarantee them, you or the vendor?
For regulated industries, how are security controls and compliance supported? How are you sure about what the vendor does in an environment you ostensibly control? How much harder will a bespoke third-party risk assessment be if you aren’t using pure SaaS?

For these reasons, cloud vendors only host managed services in the cloud vendor’s environment. Look at Amazon MSK, Azure Event Hubs, Google Pub Sub, Confluent Cloud, etc. All fully managed cloud services are only in the VPC of the vendor for the above reasons.

There are only two options: Either you hand over the responsibility to a SaaS offering or control it yourself. Everything in the middle is still your responsibility in the end.

Community vs. commercial offerings

The sales approach of Redpanda looks almost identical to how Confluent sells data streaming. A free community edition is available, even for production usage. The enterprise edition adds enterprise features like tiered storage, automatic data balancing, or 24/7 enterprise support.

No surprise here. And a good strategy, as data streaming is required everywhere for different users and buyers.

Technical differences between Apache Kafka and Redpanda

There are plenty of technical and non-functional differences between Apache Kafka products and Redpanda. Keep in mind that Redpanda is NOT Kafka. Redpanda uses the Kafka protocol. This is a small but critical difference. Let’s explore these details in the following sections.

Apache Kafka vs. Kafka protocol compatibility

Redpanda is NOT an Apache Kafka distribution like Confluent Platform, Cloudera, or Red Hat. Instead, Redpanda re-implements the Kafka protocol to provide API compatibility. Being Kafka-compatible is not the same as using Apache Kafka under the hood, even if it sounds great in theory.

Two other examples of Kafka-compatible offerings:

Azure Event Hubs: A Kafka-compatible SaaS cloud service offering from Microsoft Azure. The service itself works and performs well. However, its Kafka compatibility has many limitations. Microsoft lists a lot of them on its website. Some limitations of the cloud service are the consequence of a different implementation under the hood, like limited retention time and message sizes.
Apache Pulsar: An open-source framework competing with Kafka. The feature set overlaps a lot. Unfortunately, Pulsar often only has good marketing for advanced features to compete with Kafka or to differentiate. And one example is its Kafka mapper to be compatible with the Kafka protocol. Contrary to Azure Event Hubs as a serious implementation (with some limitations), Pulsar’s compatibility wrapper provides a basic implementation that is compatible with only minor parts of the Kafka protocol. So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn’t seriously consider this for migrating your running Kafka infrastructure to Pulsar.

We have seen compatible products for open-source frameworks in the past. Re-implementations are usually far away from being complete and perfect. For instance, MongoDB compared the official open source protocol to its competitor Amazon DocumentDB to pinpoint the fact that DocumentDB only passes ~33% of the MongoDB integration test chain.

In summary, it is totally fine to use these non-Kafka solutions like Azure Event Hubs, Apache Pulsar, or Redpanda for a new project if they fulfill your requirements better than Apache Kafka. But keep in mind that it is not Kafka. There is no guarantee that additional components from the Kafka ecosystem (like Kafka Connect, Kafka Streams, REST Proxy, and Schema Registry) behave the same when integrated with a non-Kafka solution that only uses the Kafka protocol with its own implementation.

How good is Redpanda’s Kafka protocol compatibility?

Frankly, I don’t know. Probably and hopefully, Redpanda has better Kafka compatibility than Pulsar. The whole product is based on this value proposition. Hence, we can assume that the Redpanda team spends plenty of time on compatibility. Redpanda has NOT achieved 100% API compatibility yet.

Time will tell when we see more case studies from enterprises across industries that migrated some Apache Kafka projects to Redpanda and successfully operated the infrastructure for a few years. Why wait a few years to see? Well, I compare it to what I see from people starting with Amazon MSK. It is pretty easy to get started. However, after a few months, the first issues happen. Users find out that Amazon MSK is not a fully-managed product and does not provide serious Kafka SLAs. Hence, I see too many teams starting with Amazon MSK and then migrating to Confluent Cloud after some months.

But let’s be clear: If you run an application against Apache Kafka and migrate to a re-implementation supporting the Kafka protocol, you should NOT expect 100% the same behavior as with Kafka!

Some underlying behavior will differ even if the API is 100% compatible. This is sometimes a benefit. For instance, Redpanda focuses on performance optimization with C++. This is only possible in some workloads because of the re-implementation. C++ is superior compared to Java and the JVM for some performance and memory scenarios.

Redpanda = Apache Kafka – Kafka Connect – Kafka Streams

Apache Kafka includes Kafka Connect for data integration and Kafka Streams for stream processing.

Like most Kafka-compatible projects, Redpanda does exclude these critical pieces from its offering. Hence, even 100 percent protocol compatibility would not mean a product re-implements everything in the Apache Kafka project.

Lower latency vs. benchmarketing

Always think about your performance requirements before starting a project. If necessary, do a proof of concept (POC) with Apache Kafka, Apache Pulsar, and Redpanda. I bet that in 99% of scenarios, all three of them will show a good enough performance for your use case.

Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics. And performance is typically just one of many evaluation dimensions.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (whether a vendor, independent consultant or researcher conducts them).

My colleague Jack Vanlightly explained this concept of benchmarketing with excellent diagrams:

Source: Jack Vanlightly

Here is one concrete example you will find in one of Redpanda’s benchmarks: Kafka was not built for very high throughput producers, and this is what Redpanda is exploiting when they claim that Kafka’s throughput is inferior to Redpanda. Ask yourself this question: Of 1GB/s use cases, who would create that throughput with just 4 producers? Benchmarketing at its finest.

Hence, once again, start with your business requirements. Then choose the right tool for the job. Benchmarks are always built for winning against others. Nobody will publish a benchmark where the competition wins.

Soft real-time vs. hard real-time

When we speak about real-time in the IT world, we mean end-to-end data processing pipelines that need at least a few milliseconds. This is called soft real-time. And this is where Apache Kafka, Apache Pulsar, Redpanda, Azure Event Hubs, Apache Flink, Amazon Kinesis, and similar platforms fit into. None of these can do hard real time.

Hard real-time requires a deterministic network with zero latency and no spikes. Typical scenarios include embedded systems, field buses, and PLCs in manufacturing, cars, robots, securities trading, etc. Time-Sensitive Networking (TSN) is the right keyword if you want more research.

I wrote a dedicated blog post about why data streaming is NOT hard real-time. Hence, don’t try to use Kafka or Redpanda for these use cases. That’s OT (operational technology), not IT (information technology). OT is plain C or Rust on embedded software.

No ZooKeeper with Redpanda vs. no ZooKeeper with Kafka

Besides being implemented in C++ instead of using the JVM, the second big differentiator of Redpanda is no need for ZooKeeper and two complex distributed systems… Well, with Apache Kafka 3.3, this differentiator is gone. Kafka is now production-ready without ZooKeeper! KIP-500 was a multi-year journey and an operation at Kafka’s heart.

To be fair, it will still take some time until the new ZooKeeper-less architecture goes into production. Also, today, it is only supported by new Kafka clusters. However, migration scenarios with zero downtime and without data loss will be supported in 2023, too. But that’s how a severe release cycle works for a mature software product: Step-by-step implementation and battle-testing instead of starting with marketing and selling of alpha and beta features.

ZooKeeper-less data streaming with Kafka is not just a massive benefit for the scalability and reliability of Kafka but also makes operations much more straightforward, similar to ZooKeeper-less Redpanda.

By the way, this was one of the major arguments why I did not see the value of Apache Pulsar. The latter requires not just two but three distributed systems: Pulsar broker, ZooKeeper, and BookKeeper. That’s nonsense and unnecessary complexity for virtually all projects and use cases.

Lightweight Redpanda + heavyweight ecosystem = middleweight data streaming?

Redpanda is very lightweight and efficient because of its C++ implementation. This can help in limited compute environments like edge hardware. As an additional consequence, Redpanda has fewer latency spikes than Apache Kafka. That are significant arguments for Redpanda for some use cases!

However, you need to look at the complete end-to-end data pipeline. If you use Redpanda as a message queue, you get these benefits compared to the JVM-based Kafka engine. You might then pick a message queue like RabbitMQ or NATs instead. I don’t start this discussion here as I focus on the much more powerful and advanced data streaming use cases.

Even in edge use cases where you deploy a single Kafka broker, the hardware, like an industrial computer (IPC), usually provides at least 4GB or 8GB of memory. That is sufficient for deploying the whole data streaming platform around Kafka and other technologies.

Data streaming is more than messaging or data ingestion

My fundamental question is, what is the benefit of a C++ implementation of the data hub if all the surrounding systems are built with JVM technology or even worse and slow technologies like Python?

Kafka-compatible tools like Redpanda integrate well with the Kafka ecosystem, as they use the same protocol. Hence, tools like Kafka Connect, Kafka Streams, KSQL, Apache Flink, Faust, and all other components from the Kafka ecosystem work with Redpanda. You will find such an example for almost every existing Kafka tool on the Redpanda blog.

However, these combinations kill almost all the benefits of having a C++ layer in the middle. All integration and processing components would also need to be as efficient as Redpanda and use C++ (or Go or Rust) under the hood. These tools do not exist today (likely, as they are not needed by many people). And here is an additional drawback: The debugging, testing, and monitoring infrastructure must combine C++, Python, and JVM platforms if you combine tools like Java-based Kafka Connect and Python-based Faust with C++-based Redpanda. So, I don’t get the value proposition here.

Data replication across clusters

Having more than one Kafka cluster is the norm, not an exception. Use cases like disaster recovery, aggregation, data sovereignty in different countries, or migration from on-premise to the cloud require multiple data streaming clusters.

Replication across clusters is part of open-source Apache Kafka. MirrorMaker 2 (based on Kafka Connect) supports these use cases. More advanced (proprietary) tools from vendors like Confluent Replicator or Cluster Linking make these use cases more effortless and reliable.

Data streaming with the Kafka ecosystem is perfect as the foundation of a decentralized data mesh:

How do you build these use cases with Redpanda?

It is the same story as for data integration and stream processing: How much does it help to have a very lightweight and performant core if all other components rely on “3rd party” code bases and infrastructure? In the case of data replication, Redpanda uses Kafka’s Mirrormaker.

And make sure to compare MirrorMaker to Confluent Cluster Linking – the latter uses the Kafka protocol for replications and does not need additional infrastructure, operations, offset sync, etc.

Non-functional differences between Apache Kafka and Redpanda

Technical evaluations are dominant when talking about Redpanda vs. Apache Kafka. However, the non-functional differences are as crucial before making the strategic decision to choose the data streaming platform for your next project.

Licensing, adoption curve and the total cost of ownership (TCO) are critical for the success of establishing a data streaming platform.

Open source (Kafka) vs. source available (Redpanda)

As the name says, Apache Kafka is under the very permissive Apache license 2.0. Everyone, including cloud providers, can use the framework for building internal applications, commercial products, and cloud services. Committers and contributions are spread across various companies and individuals.

Redpanda is released under the more restrictive Source Available License (BSL). The intention is to deter cloud providers from offering Redpanda’s work as a service. For most companies, this is fine, but it limits broader adoption across different communities and vendors. The likelihood of external contributors, committers, or even other vendors picking the technology is much smaller than in Apache projects like Kafka.

This has a significant impact on the (future) adoption curve…

Maturity, community and ecosystem

The introduction of this article showed the impressive adoption of Kafka. Just keep in mind: Redpanda is NOT Apache Kafka! It just supports the Kafka protocol.

Redpanda is a brand-new product and implementation. Operations are different. The behavior of the engine is different. Experts are not available. Job offerings do not exist. And so on.

Kafka is significantly better documented, has a tremendously larger community of experts, and has a vast array of supporting tooling that makes operations more straightforward.

There are many local and online Kafka training options, including online courses, books, meetups, and conferences. You won’t find much for Redpanda beyond the content of the vendor behind it.

And don’t trust marketing! That’s true for every vendor, of course. If you read a great feature list on the Redpanda website, double-check if the feature truly exists and in what shape it is. Example: RBAC (role-based access control) is available for Redpanda. The devil lies in the details. Quote from the Redpanda RBAC documentation: “This page describes RBAC in Redpanda Console and therefore manages access only for Console users but not clients that interact via the Kafka API. To restrict Kafka API access, you need to use Kafka ACLs.” There are plenty of similar examples today. Just try to use the Redpanda cloud service. You will find many things that are more alpha than beta today. Make sure not to fall into the same myths around the marketing of product features as some users did with Apache Pulsar a few years ago.

The total cost of ownership and business value

When you define your project’s business requirements and SLAs, ask yourself how much downtime or data loss is acceptable. The RTO (recovery time objective) and RPO (recovery point objective) impact a data streaming platform’s architecture and overall process to ensure business continuity, even in the case of a disaster.

The TCO is not just about the cost of a product or cloud service. Full-time engineers need to operate and integrate the data streaming platform. Expensive project leads, architects, and developers build applications.

Project risk includes the maturity of the product and the expertise you can bring in for consulting and 24/7 support.

Similar to benchmarketing regarding latency, vendors use the same strategy for TCO calculations! Here is one concrete example you always hear from Redpanda: “C++ does enable more efficient use of CPU resources.”

This statement is correct. However, the problem with that statement is that Kafka is rarely CPU-bound and much more IO-bound. Redpanda has the same network and disk requirements as Kafka, which means Redpanda has limited differences from Kafka in terms of TCO regarding infrastructure.

When to choose Redpanda instead of Apache Kafka?

You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with the “real Apache Kafka” and related products or cloud offerings. Read articles and blogs, watch videos, search for case studies in your industry, talk to different competitive vendors, and build your proof of concept or pilot project. Qualifying out products is much easier than evaluating plenty of offerings.

When to seriously consider Redpanda?

You need C++ infrastructure because your ops team cannot handle and analyze JVM logs – but be aware that this is only the messaging core, not the data integration, data processing, or other capabilities of the Kafka ecosystem
The slight performance differences matter to you – and you still don’t need hard real-time
Simple, lightweight development on your laptop and in automated test environments – but you should then also run Redpanda in production (using different implementations of an API for TEST and PROD is a risky anti-pattern)

You should evaluate Redpanda against Apache Kafka distributions and cloud services in these cases.

This post explored the trade-offs Redpanda has from a technical and non-functional perspective. If you need an enterprise-grade solution or fully-managed cloud service, a broad ecosystem (connectors, data processing capabilities, etc.), and if 10ms latency is good enough and a few p99 spikes are okay, then I don’t see many reasons why you would take the risk of adopting Redpanda instead of an actual Apache Kafka product or cloud service.

The future will tell us if Redpanda is a severe competitor…

I didn’t even cover the fact that a startup always has challenges finding great case studies, especially with big enterprises like fortune 500 companies. The first great logos are always the hardest to find. Sometimes, startups never get there. In other cases, a truly competitive technology and product are created. Such a journey takes years. Let’s revisit this blog post in one, two, and five years to see the evolution of Redpanda (and Apache Kafka).

What are your thoughts? When do you consider using Redpanda instead of Apache Kafka? Are you using Redpanda already? Why and for what use cases? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to choose Redpanda instead of Apache Kafka? appeared first on Kai Waehner.

Streaming Machine Learning with Kafka-native Model Deployment

Kai Waehner — Tue, 27 Oct 2020 15:07:50 +0000

Apache Kafka became the de facto standard for event streaming across the globe and industries. Machine Learning (ML) includes model training on historical data and model deployment for scoring and predictions. While training is mostly batch, scoring usually requires real-time capabilities at scale and reliability. Apache Kafka plays a key role in modern machine learning infrastructures. The next-generation architecture leverages a Kafka-native streaming model server instead of RPC (HTTP/gRPC) calls:

This blog post explores the architectures and trade-offs between three options for model deployment with Kafka: Embedded model into the Kafka application, model server and RPC, model server, and Kafka-native communication.

Kafka and Machine Learning Architecture

Model deployment is usually completely separated from model training (from the process and the technology perspective). The model training is often executed in elastic cloud infrastructure or data lakes such as Hadoop or Spark. Model scoring (i.e., doing the predictions) is usually a mission-critical workload with different uptime SLAs and latency requirements:

Learn more about this architecture and the relation to modern ML approaches such as Hybrid ML architectures or AutoML in the blog post “Using Apache Kafka to Drive Cutting-Edge Machine Learning“.

Two alternatives for model deployment in Kafka infrastructures: The model can either be embedded into the Kafka application, or it can be deployed into a separate model server. The blog post “Model Server and Model Deployment Options in a Kafka Infrastructure” covers the use cases and architectures in detail and explores some code examples.

The following explores both options’ trade-offs and introduces a third option: A Kafka-native streaming model server.

RPC between Kafka App and Model Server

The analytic model is deployed into a model server. The streaming Kafka application does a request-response call to send the input data to the model and to receive the prediction in return:

Almost every ML product or framework provides a model server. This includes open-source frameworks such as TensorFlow and the related model server TF Serving, but also proprietory tools such as SAS, AutoML vendors such as DataRobot, and cloud ML services from all major cloud providers such as AWS, Azure, GCP.

Pros and Cons of a Model Server with RPC

Trade-offs using Kafka in conjunction with an RPC-based model server and HTTP/gRPC:

Simple integration with existing technologies and organizational processes
Easiest to understand if you come from a non-streaming world
Tight coupling of the availability, scalability, and latency/throughput between application and model server
Separation of concerns (e.g. Python model + Java streaming app)
Limited scalability and robustness
Later migration to real streaming is also possible
Model management built-in for different models, versioning, and A/B testing
Model monitoring built-in

Example: TensorFlow Serving with GRPC and Kafka Streams

An example of this approach is available on Github: “TensorFlow Serving + gRPC + Java + Kafka Streams“. In this case, the TensorFlow model is exported and deployed into a TensorFlow Serving infrastructure.

Embedded Model into Kafka Application

An analytic model is just a binary, i.e., a file stored in memory or persistent storage.

The data type differs and depends on the ML framework. But it does not matter if it is Java (e.g., with H2O), Protobuf (e.g., with TensorFlow), or proprietary (with many commercial ML tools).

Any model can be embedded into a Kafka application (if the ML solution provides programmatic APIs – but almost every tool does):

The Kafka application for embedding the model can either be a Kafka-native stream processing engine such as Kafka Streams or ksqlDB, or a “regular” Kafka application using any Kafka client such as Java, Scala, Python, Go, C, C++, etc.

Pros and Cons of Embedding an Analytic Model into a Kafka Application

Trade-offs of embedding analytic models into a Kafka application:

Best latency as local inference instead of remote call
No coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
Offline inference (devices, edge processing, etc.)
No side-effects (e.g., in case of failure), all covered by Kafka processing (e.g., exactly once)
No built-in model management and monitoring

Example: Kafka Python Application with Embedded TensorFlow Model

A robust and scalable example of the embedded model approach is presented in Github project “Streaming Machine Learning with Kafka, MQTT, and TensorFlow for 100000 Connected Cars“. This demo uses Python for both model training and model deployment in separate, scalable containers in a Kubernetes infrastructure.

Several other (more simple) demos to try out this approach are available here: “Machine Learning + Kafka Streams Examples“. The examples use TensorFlow, H2O.ai, and DeepLearning4J (DL4J) in conjunction with Kafka Streams and ksqlDB.

Kafka-native Streaming Model Server

A Kafka-native streaming model server combines some characteristics from both approaches discussed above. It enables the separation of concerns by providing a model server with all the expected features. But the model server does not use RPC communication via HTTP/gRPC and all the drawbacks this creates for a streaming architecture. Instead, the model server communicates via the native Kafka protocol and Kafka topics with the client application:

Pros and Cons of a Kafka-native Streaming Model Server

Trade-offs of a Kafka-native streaming model server:

Good latency via Kafka-native streaming communication
Some coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
Some side-effects (e.g., in case of failure), but most potential issues covered by Kafka processing (e.g., decoupling and persistence via Kafka topics)
Model management built-in for different models, versioning, and A/B testing
Model monitoring built-in
Separation of concerns (e.g. Python model + Java streaming app)
Scalability and robustness of the model server not necessarily Kafka-like (because the underlying implementation is often not Kafka-native yet)

A Kafka-native streaming model server provides many advantages of a streaming architecture and the features of a model server. Just be aware that a Kafka-native interface does NOT mean that the model server itself is implemented with Kafka under the hood. Hence, test your scalability, robustness, and latency requirements to decide if an embedded model might be a better approach.

Example: Streaming Model Deployment with Kafka and Seldon

Seldon is a robust and scalable open-source model server. It allows us to manage, serve, and scale models in any language or framework on Kubernetes.

In mid of 2020, Seldon added a key differentiator compared to many other model servers on the market: Seldon added support for Kafka. Hence, Seldon combines the advantages of a separate model server with the streaming paradigm of Kafka:

The Jupyter notebook demonstrates this example using scikit-learn, the NLP framework spaCy, Seldon, and Kafka-native integration with producer and consumer applications. Check out the blog post “Real-Time Machine Learning at Scale using SpaCy, Kafka & Seldon Core” for more details.

All Model Deployment Options have Trade-Offs in Streaming Machine Learning Architectures

This blog post covered three alternatives for model deployment in a Kafka infrastructure: A model server with RPC, embedding models into the Kafka application, and a Kafka-native model server. All three have their trade-offs. Know them, and evaluate the right choice for your project. The good news is that it is also pretty straightforward to change from one approach to another one.

UPDATE May 2021: Dataiku also provides a native Kafka interface in the meantime (including support for Schema Registry). Great to see different model servers adopting this architecture.

If you want to learn more details about Kafka-native model deployment, check out the following video recording and slide deck from Kafka Summit:

The talk does not cover the “streaming model server” approach (because no model server provided a Kafka-native interface in 2019). But you can still learn a lot about the different architectures and best practices.

If you want to learn more about “Streaming Machine Learning with Kafka – without another Data Lake“, check out the following video recording and slide deck. It explores a simplified architecture and the advantages of Tiered Storage for Kafka:

What are your experiences with Kafka and model deployment? What are your use cases? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Streaming Machine Learning with Kafka-native Model Deployment appeared first on Kai Waehner.

IoT Architectures for Digital Twin with Apache Kafka

Kai Waehner — Wed, 25 Mar 2020 15:47:30 +0000

A digital twin is a virtual representation of something else. This can be a physical thing, process or service. This post covers the benefits and IoT architectures of a Digital Twin in various industries and its relation to Apache Kafka, IoT frameworks and Machine Learning. Kafka is often used as central event streaming platform to build a scalable and reliable digital twin and digital thread for real time streaming sensor data.

I already blogged about this topic recently in detail: Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT). Hence that post covers the relation to Event Streaming and why people choose Apache Kafka to build an open, scalable and reliable digital twin infrastructure.

This article here extends the discussion about building an open and scalable digital twin infrastructure:

Digital Twin vs. Digital Thread
Relation between Event Streaming, Digital Twin and AI / Machine Learning
IoT Architectures for a Digital Twin with Apache Kafka and other IoT Platforms
Extensive slide deck and video recording

Key Take-Aways for Building a Digital Twin

Key Take-Aways:

A digital twin merges the real world (often physical things) and the digital world
Apache Kafka enables an open, scalable and reliable infrastructure for a Digital Twin
Event Streaming complements IoT platforms and other backend applications / databases.
Machine Learning (ML) and statistical models are used in most digital twin architectures to do simulations, predictions and recommendations.

Digital Thread vs. Digital Twin

The term ‘Digital Twin’ usually means the copy of a single asset. In the real world, many digital twins exist. The term ‘Digital Thread’ spans the entire life cycle of one or more digital twins. Eurostep has a great graphic explaining this:

When we talk about ‘Digital Twin’ use cases, we almost always mean a ‘Digital Thread’.

Honestly, the same is true in my material. Both terms overlap, but ‘Digital Twin’ is the “agreed buzzword”. It is important to understand the relation and definition of both terms, though.

Use Cases for Digital Twin and Digital Thread

Use cases exist in many industries. Think about some examples:

Downtime reduction
Inventory management
Fleet management
What-if simulations
Operational planning
Servitization
Product development
Healthcare
Customer experience

The slides and lecture (Youtube video) go into more detail discussing four use cases from different industries:

Virtual Singapore: A Digital Twin of the Smart City
Smart Infrastructure: Digital Solutions for Entire Building Lifecycle
Connected Car Infrastructure
Twinning the Human Body to Enhance Medical Care

The key message here is that digital twins are not just for automation industry. Instead, many industries and projects can add business value and innovation by building a digital twin.

Relation between Event Streaming, Digital Twin and AI / Machine Learning

Digital Twin respectively Digital Thread and AI / Machine Learning (ML) are complementary concepts. You need to apply ML to do accurate predictions using a digital twin.

Digital Twin and AI

Melda Ulusoy from MathWorks shows in a Youtube video how different Digital Twin implementations leverage statistical methods and analytic models:

Examples include physics-based modeling to simulate what-if scenarios and data-driven modeling to estimate the RUL (Remaining Useful Life).

Digital Twin and Machine Learning both have the following in common:

Continuous learning, monitoring and acting
(Good) data is key for success
The more data the better
Real time, scalability and reliability are key requirements

Digital Twin, Machine Learning and Event Streaming with Apache Kafka

Real time, scalability and reliability are key requirements to build a digital twin infrastructure. This makes clear how Event Streaming and Apache Kafka fit into this discussion. I won’t cover what Kafka is or relation between Kafka and Machine Learning in detail here because there are so many other blog posts and videos about it. To recap, let’s take a look at a common Kafka ML architecture providing openness, real time processing, scalability and reliability for model training, deployment / scoring and monitoring:

To get more details about Kafka + Machine learning, start with the blog post “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka” and google for slides, demos, videos and more blog posts.

Characteristics of Digital Twin Technology

The following five characteristics describe common Digital Twin implementations:

Connectivity
- Physical assets, enterprise software, customers
- Bidirectional communication to ingest, command and control
Homogenization
- Decoupling and standardization
- Virtualization of information
- Shared with multiple agents, unconstrained by physical location or time
- Lower cost and easier testing, development and predictions
Reprogrammable and smart
- Adjust and improve characteristics and develop new version of a product
Digital traces
- Go back in time and analyse historical events to diagnose problems
Modularity
- Design and customization of products and production modules
- Tweak modules of models and machines

There are plenty of options to implement these characteristics. Let’s take a look at some IoT platforms and how Event Streaming and Apache Kafka fit into the discussion.

IoT Platforms, Frameworks, Standards and Cloud Services

Plenty of IoT solutions are available on the market. IoT Analytics Research talks about over 600 IoT Platforms in 2019. All have their “right to exist” In most cases, some of these tools are combined with each other. There is no need or good reason to choose just one single solution.

Let’s take a quick look at some offerings and their trade-offs.

Proprietary IoT Platforms

Sophisticated integration for related IIoT protocols (like Siemens S7, Modbus, etc.) and standards (like OPC-UA)
Not a single product (plenty of acquisitions, OEMs and different code bases are typically the foundation)
Typically very expensive
Proprietary (just open interfaces)
Often limited scalability
Examples: Siemens MindSphere, Cisco Kinetic, GE Digital and Predix

IoT Offerings from Cloud Providers

Sophisticated tools for IoT management (devices, shadowing, …)
Good integration with other cloud services (storage, analytics, …)
Vendor lock-in
No key focus on hybrid and edge (but some on premises products)
Limited scalability
Often high cost (beyond ’hello world’)
Examples: All major cloud providers have IoT services, including AWS, GCP, Azure and Alibaba

Standards-based / Open Source IoT Platforms

Open and standards-based (e.g. MQTT)
Open source / open core business model
Infrastructure-independent
Different vendors contribute and compete behind the core technologies (competition means innovation)
Sometimes less mature or non-existent connectivity (especially to legacy and proprietary protocols)
Examples: Open source frameworks like Eclipse IoT, Apache PLC4X or Node-RED and standards like MQTT and related vendors like HiveMQ
Trade-off: Solid offering for one standard (e.g. HiveMQ for MQTT) or diversity but not for mission-critical scale (e.g. Node-RED)

IoT Architectures for a Digital Twin / Digital Thread with Apache Kafka and other IoT Platforms

So, we learned that there are hundreds of IoT solutions available. Consequently, how does Apache Kafka fit into this discussion?

As discussed in the other blog post and in the below slides / video recording: There is huge demand for an open, scalable and reliable infrastructure for Digital Twins. This is where Kafka comes into play to provide a mission-critical event streaming platform for real time messaging, integration and processing.

Kafka and the 5 Characteristics of a Digital Twin

Let’s take a look at a few architectures in the following. Keep in mind the five characteristics of Digital Twins discussed above and its relation to Kafka:

Connectivity – Kafka Connect provides connectivity as scale in real time to IoT interfaces, big data solutions and cloud services. The Kafka ecosystem is complementary, NOT competitive to other Middleware and IoT Platforms.
Homogenization – Real decoupling between clients (i.e. producers and consumers) is one of the key strengths of Kafka. Schema management and enforcement leveraging different technologies (JSON Schema, Avro, Profobuf, etc.) enables data awareness and standardization.
Reprogrammable and smart – Kafka is the de facto standard for microservice architectures for exactly this reason: Separation of concerns and domain-driven design (DDD). Deploy new decoupled applications and do versioning, A/B testing, canarying.
Digital traces – Kafka is a distributed commit log. Events are appended, stored as long as you want (potentially forever with retention time = -1) and immutable. Seriously, what other technology could be used better to build a digital trace for a digital twin?
Modularity – The Kafka infrastructure itself is modular and scalable. This includes components like Kafka brokers, Connect, Schema Registry, REST Proxy and client applications in different languages like Java, Scala, Python, Go, .NET, C++ and others. With this modularity, you can easily build the right Digital Twin architecture your your edge, hybrid or global scenarios and also combine the Kafka components with any other IoT solutions.

Each of the following IoT architectures for a Digital Twin has its pros and cons. Depending on your overall enterprise architecture, project situation and many other aspects, pick and choose the right one:

Scenario 1: Digital Twin Monolith

An IoT Platform is good enough for integration and building the digital twin. No need to use another database or integrate with the rest of the enterprise.

Scenario 2: Digital Twin as External Database

An IoT Platform is used for integration with the IoT endpoints. The Digital Twin data is stored in an external database. This can be something like MongoDB, Elastic, InfluxDB or a Cloud Storage. The database could be used just for storage and for additional tasks like processing, dashboards and analytics.

A combination with yet another product is also very common. For instance, a Business Intelligence (BI) tool like Tableau, Qlik or Power BI can use the SQL interface of a database for interactive queries and reports.

Scenario 3: Kafka as Backbone for the Digital Twin and the Rest of the Enterprise

The IoT Platform is used for integration with the IoT endpoints. Kafka is the central event streaming platform to provide decoupling between the other components. As a result, the central layer is open, scalable and reliable. The database is used for the digital twin (storage, dashboards, analytics). Other applications also consume parts of the data from Kafka (some real time, some batch, some request-response communication).

Scenario 4: Kafka as IoT Platform

Kafka is the central event streaming platform to provide a mission-critical real time infrastructure and the integration layer to the IoT endpoints and other applications. The digital twin is implemented in its own solution. In this example, it does not use a database like in the examples above, but a Cloud IoT Service like Azure Digital Twins.

Scenario 5: Kafka as IoT Platform

Kafka is used to implement the digital twin. No other components or databases are involved. Other consumers consume the raw data and the digital twin data.

Like all the other architectures, this has pros and cons. The main question in this approach is if Kafka can really replace a database and how you can query the data. First if all, Kafka can be used as database (check out the detailed discussion in the linked blog post), but it will not replace other databases like Oracle, MongoDB or Elasticsearch.

Having said this, I have already seen several deployments of Kafka for Digital Twin infrastructures in automation, aviation, and even banking industry.

Especially with “Tiered Storage” in mind (a Kafka feature currently discussed in a KIP-405 and already implemented by Confluent), Kafka gets more and more powerful for long-term storage.

Slides and Video Recording – IoT Architectures for a Digital Twin with Apache Kafka

This section provides a slide deck and video recording to discuss Digital Twin use cases, technologies and architectures in much more detail.

The agenda for the deck and lecture:

Digital Twin – Merging the Physical and the Digital World
Real World Challenges
IoT Platforms
Apache Kafka as Event Streaming Solution for IoT
Spoilt for Choice for a Digital Twin
Global IoT Architectures
A Digital Twin for 100000 Connected Cars

Slides

Here is the long version of the slides (with more content than the slides used for the video recording):

Video Recording

The video recording covers a “lightweight version” of the above slides:

The post IoT Architectures for Digital Twin with Apache Kafka appeared first on Kai Waehner.

Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing

Kai Waehner — Mon, 02 Sep 2019 09:35:03 +0000

Data integration and processing is a huge challenge in Industrial IoT (IIoT, aka Industry 4.0 or Automation Industry) due to monolithic systems and proprietary protocols. Apache Kafka, its ecosystem (Kafka Connect, KSQL) and Apache PLC4X are a great open source choice to implement this IIoT integration end to end in a scalable, reliable and flexible way.

This blog post covers a high level overview about the challenges and a good, flexible architecture to solve the problems. At the end, I share a video recording and the corresponding slide deck. These provide many more details and insights.

Challenges in IIoT / Industry 4.0

Here are some of the key challenges in IIoT / Industry 4.0:

IoT != IIoT: Automation industry does not use MQTT or other standards, but is slow, insecure, not scalable and proprietary.
Product Lifecycles are very long (tens of years), no simple changes or upgrades
IIoT usually uses incompatible protocols, typically proprietary and just built for one specific vendor.
Automation industry uses proprietary and expensive monoliths which are not scalable and not extendible.
Machines and PLCs are insecure by nature with no authentication, no authorization, no encryption.

This is still state of the art in automation industry. This is no surprise with such long product life cycles, but still very concerning.

Evolution of Convergence between IT and Automation Industry

Today, everybody talks about cloud, big data analytics, machine learning and real time processing at scale. The convergence between IT and Automation Industry is coming, as the analyst report from IoT research company IOT Analytics shows:

There is huge demand to build an open, flexible, scalable platform. Many opportunities from business and technical perspective:

Cost reduction
Flexibility
Standards-based
Scalability
Extendibility
Security
Infrastructure-independent

So, how to get from legacy technologies and proprietary IIoT protocols to cloud, big data, machine learning, real time processing? How to build a reliable, scalable and flexible architecture and infrastructure?

Apache Kafka and Apache PLC4X for End-to-End IIoT Integration

I assume you already know it: Apache Kafka is the De-facto Standard for Real-Time Event Streaming. It provides

Open Source (Apache 2.0 License)
Global-scale
Real-time
Persistent Storage
Stream Processing

If you need more details about Apache Kafka, check out the Kafka website, the extensive Confluent documentation or some free video recordings and slides from any Kafka Summit to learn about the technology and use cases.

The only very important thing I want to point out is that Apache Kafka includes Kafka Connect and Kafka Streams:

Kafka Connect enables reliable and scalable integration of Kafka with other systems. Kafka Streams allows to write standard Java apps and microservices to continuously process your data in real-time with a lightweight stream processing API. And finally, KSQL enables Stream Processing using SQL-like Semantics.

Apache PLC4X for PLC Integration (Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, etc.)

Apache PLC4X is less established on the market than Apache Kafka. It also “just covers a niche” (a big one, of course) compared to Kafka, which is used in any industry for many different use cases. However, PLC4X is a very interesting top level Apache project for automation industry.

The Goal is to open up PLC interfaces from IIoT world to the outside world. PCL4X allows vertical integration and to write software independent of PLCs using JDBC-like adapters for various protocols like Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, OPC-UA, Emerson, Profinet, BACnet, Ethernet.

PLC4X provides a Kafka Connect connector. Therefore, you can leverage the benefits of Apache Kafka (high availability, high throughput, high scalability reliability, real time processing) to deploy PLC4X integration pipelines. With this, you can build one single architecture and infrastructure for

legacy IIoT connectivity using PLC4X and Kafka Connect
data processing using Kafka Streams / KSQL
integration with the rest of the enterprise using Kafka Connect and any other sink (database, big data analytics, machine learning, ERP, CRM, cloud services, custom business applications, etc.)

As Kafka decouples the producers from the consumers, you can consume the IIoT machine sensor data from any application – some might be real time, some might be batch, and some might be request-response communication for human interaction on a web or mobile app.

Apache PLC4X vs. OPC-UA

A little bit off-topic: How to choose between Apache PLC4X (open source framework for IIoT) and OPC-UA (open standard for IIoT). In short, both are different things and can also be complementary. Here is a comparison:

OPC-UA

Open standard
All the pros and cons of an open standard (works with different vendors; slow adoption; inflexible, etc.)
Often poorly implemented by the vendors
Requires app server on top of PLC
Every device has to be retrofitted with the ability to speak a new protocol and use a common client to speak with these devices
Often over-engineering for just reading the data
Activating OPC-UA support on existing PLCs greatly increases the load on the PLCs
With licensing cost for every machine

Apache PLC4X

Open source framework (Apache 2.0 license)
Provides unified API by implementing drivers for communicating with most industrial controllers in the protocols they natively understand
No need to modify existing hardware
No increased load on the PLCs
No need to pay for licenses to activate OPC-UA support
Drivers being implemented from the specs or by reverse engineering protocols in order to be fully Apache 2.0 licensed
PLC4X adapter for OPC-UA available -> Both can be used together!

As you see, both have their pros and cons. To me, and this is clearly my subjective opinion, PLC4X provides a great alternatives with high flexibility and low footprint.

Confluent and IoT Platform Solutions

Many IoT Platform Solutions are available on the market. This includes products like Siemens MindSphere or Cisco Kinetic, and cloud services from the major cloud providers like AWS, GCP or Azure. And you have Kafka + PLC4X as you just learned above. Often, this is not a “neither … nor” decision:

You can either use

just Kafka and PLC4X for lightweight and flexible IIoT integration based on a scalable, reliable and open event streaming platform
just a IoT Platform Solution if the pros of such a specific product (dedicated for a specific vendor protocol, nice GUI, etc.) outperform the cons (like high cost, proprietary and inflexible solution)
both together where you use the IoT Platform Solution to integrate with the PLCs and then send the data to Kafka to integrate with the rest of the enterprise (with all the benefits and added value Kafka brings)
both together where you use Kafka and PLC4X for PLC integration and one of the consumers is the IoT Platform Solution (while other consumers can also get the data from Kafka – fully decoupled from the IoT Platform Solution)

All alternatives have their pros and cons. There is no single solution which fits every use case! Therefore, no surprise that most IoT Solution Platforms provide Kafka source and sink connectors.

Apache Kafka and Apache PLC4X – Slides / Video Recording / Github Code Example

If you got curious about more details and insights, please check out my video recording and slide deck.

Slide Deck – Apache Kafka and PLC4X:

Video Recording – Apache Kafka and PLC4X:

Github Code Example – Apache Kafka and PLC4X:

We are also building a nice and simple demo on Github these days:

Kafka-native end-to-end IIoT Data Integration and Processing with Kafka Connect, KSQL and Apache PLC4X

PLC4X gets most exciting if you try it out by yourself and connect to your machines or tools. So, check out the example and adjust it to connect to your infrastructure.

Feedback and Questions?

Please let me know your feedback and questions about Kafka, its ecosystem and PLC4X for IIoT integration. Let’s also connect on LinkedIn to discuss interesting IIoT use cases and technologies in the future.

The post Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing appeared first on Kai Waehner.

Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow

Kai Waehner — Mon, 09 Jul 2018 01:13:45 +0000

Machine Learning / Deep Learning models can be used in different ways to do predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams or KSQL). You could e.g. use the TensorFlow for Java API. This allows best latency and independence of external services. Several examples can be found in my Github project: Model Inference within Kafka Streams Microservices using TensorFlow, H2O.ai, Deeplearning4j (DL4J).

However, direct deployment of models is not always a feasible approach. Sometimes it makes sense or is needed to deploy a model in another serving infrastructure like TensorFlow Serving for TensorFlow models. Model Inference is then done via RPC / Request Response communication. Organisational or technical reasons might force this approach. Or you might want to leverage the built-in features for managing and versioning different models in the model server.

So you combine stream processing with RPC / Request-Response paradigm. The architecture looks like the following:

Pros of an external model serving infrastructure like TensorFlow Serving:

Simple integration with existing technologies and organizational processes
Easier to understand if you come from non-streaming world
Later migration to real streaming is also possible
Model management built-in for different models and versioning

Cons:

Worse latency as remote call instead of local inference
No offline inference (devices, edge processing, etc.)
Coupling the availability, scalability, and latency / throughput of your Kafka Streams application with the SLAs of the RPC interface
Side-effects (e.g. in case of failure) not covered by Kafka processing (e.g. Exactly Once)

Combination of Stream Processing and Model Server using Apache Kafka, Kafka Streams and TensorFlow Serving

I created the Github Java project “TensorFlow Serving + gRPC + Java + Kafka Streams” to demo how to do model inference with Apache Kafka, Kafka Streams and a TensorFlow model deployed using TensorFlow Serving. The concepts are very similar for other ML frameworks and Cloud Providers, e.g. you could also use Google Cloud ML Engine for TensorFlow (which uses TensorFlow Serving under the hood) or Apache MXNet and AWS model server.

Most ML servers for model serving are also extendible to serve other types of models and data, e.g. you could also deploy non-TensorFlow models to TensorFlow Serving. Many ML servers are available as cloud service and for local deployment.

TensorFlow Serving

Let’s discuss TensorFlow Serving quickly. It can be used to host your trained analytic models. Like with most model servers, you can do inference via request-response paradigm. gRPC and REST / HTTP are the two common technologies and concepts used.

The blog post “How to deploy TensorFlow models to production using TF Serving” is a great explanation of how to export and deploy trained TensorFlow models to a TensorFlow Serving infrastructure. You can either deploy your own infrastructure anywhere or leverage a cloud service like Google Cloud ML Engine. A SavedModel is TensorFlow’s recommended format for saving models, and it is the required format for deploying trained TensorFlow models using TensorFlow Serving or deploying on Goodle Cloud ML Engine.

The core architecture is described in detail in TensorFlow Serving’s architecture overview:

This architecture allows deployement and managend of different models and versions of these models including additional features like A/B testing. In the following demo, we just deploy one single TensorFlow model for Image Recognition (based on the famous Inception neural network).

Demo: Mixing Stream Processing with RPC: TensorFlow Serving + Kafka Streams

Disclaimer: The following is a shortened version of the steps to do. For full example including source code and scripts, please go to my Github project “TensorFlow Serving + gRPC + Java + Kafka Streams“.

Things to do

Install and start a ML Serving Engine
Deploy prebuilt TensorFlow Model
Create Kafka Cluster
Implement Kafka Streams application
Deploy Kafka Streams application (e.g. locally on laptop or to a Kubernetes cluster)
Generate streaming data to test the combination of Kafka Streams and TensorFlow Serving

Step 1: Create a TensorFlow model and export it to ‘SavedModel’ format

I simply added an existing pretrained Image Recognition model built with TensorFlow. You just need to export a model using TensorFlow’s API and then use the exported folder. TensorFlow uses Protobuf to store the model graph and adds variables for the weights of the neural network.

Google ML Engine shows how to create a simple TensorFlow model for predictions of census using the “ML Engine getting started guide“. In a second step, you can build a more advanced example for image recognition using Transfer Learning folling the guide “Image Classification using Flowers dataset“.

You can also combine cloud and local services, e.g. build the analytic model with Google ML Engine and then deploy it locally using TensorFlow Serving as we do.

Step 2: Install and start TensorFlow Serving server + deploy model

Different options are available. Installing TensforFlow Serving on a Mac is still a pain in mid of 2018. apt-get works much easier on Linux operating systems. Unforunately there is nothing like a ‘brew’ command or simple zip file you can use on Mac. Alternatives:

You can build the project and compile everything using Bazel build system – which literaly takes forever (on my laptop), i.e. many hours.
Install and run TensorFlow Serving via a Docker container . This also requires building the project. In addition, documentation is not very good and outdated.
Preferred option for beginners => Use a prebuilt Docker container with TensorFlow Serving. I used an example from Thamme Gowda. Kudos to him for building a project which not just contains the TensorFlow Serving Docker image, but also shows an example of how to do gRPC communication between a Java application and TensorFlow Serving.

If you want to your own model, read the guide “Deploy TensorFlow model to TensorFlow serving“. Or to use a cloud service, e.g. take a look at “Getting Started with Google ML Engine“.

Step 3: Create Kafka Cluster and Kafka topics

Create a local Kafka environment (Apache Kafka broker + Zookeeper). The easiest way is the open source Confluent CLI – which is also part of Confluent Open Source and Confluent Enteprise Platform. Just type “confluent start kafka“.

You can also create a cluster using Kafka as a Service. Best option is Confluent Cloud – Apache Kafka as a Service. You can choose between Confluent Cloud Professional for “playing around” or Confluent Cloud Enterprise on AWS, GCP or Azure for mission-critical deployments including 99.95% SLA and very large scale up to 2 GBbyte/second throughput. The third option is to connect to your existing Kafka cluster on premise or in cloud (note that you need to change the broker URL and port in the Kafka Streams Java code before building the project).

Next create the two Kafka topics for this example (‘ImageInputTopic’ for URLs to the image and ‘ImageOutputTopic’ for the prediction result):

Step 4 Build and deploy Kafka Streams app + send test messages

The Kafka Streams microservice (i.e. Java class) “Kafka Streams TensorFlow Serving gRPC Example” is the Kafka Streams Java client. The microservice uses gRPC and Protobuf for request-response communication with the TensorFlow Serving server to do model inference to predict the contant of the image. Note that the Java client does not need any TensorFlow APIs, but just gRPC interfaces.

This example executes a Java main method, i.e. it starts a local Java process running the Kafka Streams microservice. It waits continuously for new events arriving at ‘ImageInputTopic’ to do a model inference (via gRCP call to TensorFlow Serving) and then sending the prediction to ‘ImageOutputTopic’ – all in real time within milliseconds.

In the same way, you could deploy this Kafka Streams microservice anywhere – including Kubernetes (e.g. on premise OpenShift cluster or Google Kubernetes Engine), Mesosphere, Amazon ECS or even in a Java EE app – and scale it up and down dynamically.

Now send messages, e.g. with kafkacat, and use kafka-console-consumer to consume the predictions.

Once again, if you want to see source code and scripts, then please go to my Github project “TensorFlow Serving + gRPC + Java + Kafka Streams“.

The post Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow appeared first on Kai Waehner.

Video Recording – Apache Kafka as Event-Driven Open Source Streaming Platform (Voxxed Zurich 2018)

Kai Waehner — Tue, 13 Mar 2018 07:25:43 +0000

I spoke at Voxxed Zurich 2018 about Apache Kafka as Event-Driven Open Source Streaming Platform. The talk includes an intro to Apache Kafka and its open source ecosystem (Kafka Streams, Connect, KSQL, Schema Registry, etc.). Just want to share the video recording of my talk.

Abstract

This session introduces Apache Kafka, an event-driven open source streaming platform. Apache Kafka goes far beyond scalable, high volume messaging. In addition, you can leverage Kafka Connect for integration and the Kafka Streams API for building lightweight stream processing microservices in autonomous teams. The open source Confluent Platform adds further components such as a KSQL, Schema Registry, REST Proxy, Clients for different programming languages and Connectors for different technologies and databases. Live Demos included.

Video Recording

The post Video Recording – Apache Kafka as Event-Driven Open Source Streaming Platform (Voxxed Zurich 2018) appeared first on Kai Waehner.

Apache Kafka + Machine Learning => Confluent Blog Post and Github Project

Kai Waehner — Fri, 27 Oct 2017 08:24:28 +0000

I am happy that my first official Confluent blog post was published and want to link to it from by blog:

How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka

The post explains in detail how you can leverage Apache Kafka and its Streams API to deploy analytic models to a lightweight, but scalable, mission-critical streaming appilcation.

Github Examples for Apache Kafka + Machine Learning

If you want to take a look directly at the source code, go to my Github project about Kafka + Machine Learning. It contains several examples how to combine Kafka Streams with frameworks like TensorFlow, H2O or DeepLearning4J.

The post Apache Kafka + Machine Learning => Confluent Blog Post and Github Project appeared first on Kai Waehner.

Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams (Slides from JavaOne 2017)

Kai Waehner — Wed, 04 Oct 2017 16:59:04 +0000

Early October… Like every year in October, it is time for JavaOne and Oracle Open World in San Francisco… I am glad to be back at this huge event again. My talk at JavaOne 2017 was all about deployment of analytic models to scalable production systems leveraging Apache Kafka and Kafka Streams. Let’s first look at the abstract. After that I attach the slides and refer to further material around this topic.

Abstract “Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams”

Intelligent real time applications are a game changer in any industry. Deep Learning is one of the hottest buzzwords in this area. New technologies like GPUs combined with elastic cloud infrastructure enable the sophisticated usage of artificial neural networks to add business value in real world scenarios. Tech giants use it e.g. for image recognition and speech translation. This session discusses some real-world scenarios from different industries to explain when and how traditional companies can leverage deep learning in real time applications.

This session shows how to deploy Deep Learning models into real time applications to do predictions on new events. Apache Kafka will be used to inter analytic models in a highly scalable and performant way.

The first part introduces the use cases and concepts behind Deep Learning. It discusses how to build Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Autoencoders leveraging open source frameworks like TensorFlow, DeepLearning4J or H2O.

The second part shows how to deploy the built analytic models to real time applications leveraging Apache Kafka as streaming platform and Apache Kafka’s Streams API to embed the intelligent business logic into any external application or microservice.

Key Takeaways for the Audience: Kafka Streams + Deep Learning

Here are the takeaways of this talk:

Focus of this talk is to discuss and show how to productionize analytic models built by data scientists – the key challenge in most companies.
Deep Learning allows to build different neural networks to solve complex classification and regression scenarios and can add business value in any industry
Deep Learning is used to build analytics models using open source frameworks like TensorFlow, DeepLearning4J or H2O.ai.
Apache Kafka’s Streams API allows to embed the intelligent business logic into any application or microservice
Apache Kafka’s Streams API leverages these Deep Learning Models (without Redeveloping) to act on new events in real time

Slides and Further Material around Apache Kafka and Machine Learning

Here are the slides of my talk:

Some further material around Apache Kafka, Kafka Streams and Machine Learning:

I will post more examples and use cases around Apache Kafka and Machine Learning in the upcoming months… Stay tuned!

The post Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams (Slides from JavaOne 2017) appeared first on Kai Waehner.

Case Study: From a Monolith to Cloud, Containers, Microservices

Kai Waehner — Fri, 24 Feb 2017 15:14:12 +0000

The following shows a case study about successfully moving from a very complex monolith system to a cloud-native architecture. The architecture leverages containers and Microservices. This solve issues such as high efforts for extending the system, and a very slow deployment process. The old system included a few huge Java applications and a complex integration middleware deployment.

The new architecture allows flexible development, deployment and operations of business and integration services. Besides, it is vendor-agnostic so that you can leverage on-premise hardware, different public cloud infrastructures, and cloud-native PaaS platforms.

The session will describe the challenges of the existing monolith system, the step-by-step procedure to move to the new cloud-native Microservices architecture. It also explains why containers such as Docker play a key role in this scenario.

A live demo shows how container solutions such as Docker, PaaS cloud platforms such as CloudFoundry, cluster managers such as Kubernetes or Mesos, and different programming languages are used to implement, deploy and scale cloud-native Microservices in a vendor-agnostic way.

Key Takeaways

Key takeaways for the audience:

– Best practices for moving to a cloud-native architecture

– How to leverage microservices and containers for flexible development, deployment and operations

– How to solve challenges in real world projects

– Understand key technologies, which are recommended

– How to stay vendor-agnostic

– See a live demo of how cloud-native applications respectively services differ from monolith applications regarding development and runtime

Slides and Video from Microservices Meetup Mumbai

Here are the slides and video recording. Presented in February 2017 at Microservices Meetup Mumbai, India.

The post Case Study: From a Monolith to Cloud, Containers, Microservices appeared first on Kai Waehner.

Comparison of Open Source IoT Integration Frameworks

Kai Waehner — Thu, 03 Nov 2016 11:20:09 +0000

In November 2016, I attended Devoxx conference in Casablanca. Around 1500 developers participated. A great event with many awesome speakers and sessions. Hot topics this year besides Java: Open Source Frameworks, Microservices (of course!), Internet of Things (including IoT Integration), Blockchain, Serverless Architectures.

I had three talks:

How to Apply Machine Learning to Real Time Processing
Comparison of Open Source IoT Integration Frameworks
Tools in Action – Live Demo of Open Source Project Flogo

In addition, I was interviewed by the Voxxed team about Big Data, Machine Learning and Internet of Things. The video will be posted on Voxxed website in the next weeks.

You can find several slides and video recordings about my big data / machine learning topic on my blog / website. Therefore, I will focus on the open source IoT frameworks in this post.

Open Source Integration Frameworks for the Internet of Things

Internet of Things (IoT) and edge integration are getting more important than ever before due to the massively growing number of connected devices year by year. In this talk, I showed open source frameworks built to develop very lightweight microservices. These can be deployed on small devices or in cloud native containers / serverless architectures with very low resource consumption to wire together all different kinds of hardware devices, APIs and online services.

The focus of this session was to discuss open source process engines such as Eclipse Kura (in conjunction with Apache Camel), Node-RED or Flogo. These offer a framework plus zero-code environment with Web IDE for building and deploying IoT integration and data processing directly onto IoT gateways and connected devices. For that, they leverage IoT standards such as MQTT, WebSockets or CoaP, but also other interfaces such as Twitter feeds or REST services.

The session also discussed the relation to other components in a IoT architecture including:

Open source dataflow pipelines (like Apache NiFi / MiNiFy, StreamSets or Cask Hydrator),
Open source stream processing frameworks (such as Apache Storm, Flink, Spark Streaming or Samza)
Cloud IoT platforms (such as AWS IoT, Cloud IoT Cloud Platform or IBM’s open source serverless framework OpenWhisk).

Slide Deck: Apache Nifi vs. StreamSets vs. Eclipse Kura vs. Node-RED vs. Flogo vs. Others

Here is the slide deck about different alternatives for IoT integration:

Video Recording: Integration Frameworks for the Internet of Things

And here is the video recording where I walk through the above slide deck and also show live demos of Node-Red and Flogo:

As always, I appreciate any comments or feedback…

The post Comparison of Open Source IoT Integration Frameworks appeared first on Kai Waehner.