Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink
https://www.kai-waehner.de/blog/2025/01/18/fully-managed-saas-vs-partially-managed-paas-cloud-services-for-data-streaming-with-kafka-and-flink/ (Sat, 18 Jan 2025)

The cloud revolution has reshaped how businesses deploy and manage data streaming with solutions like Apache Kafka and Flink. Distinctions between SaaS and PaaS models significantly impact scalability, cost, and operational complexity. Bring Your Own Cloud (BYOC) expands the options, giving businesses greater flexibility in cloud deployment. Misconceptions around terms like “serverless” highlight the need for deeper analysis to avoid marketing pitfalls. This blog explores deployment options, enabling informed decisions tailored to your data streaming needs.

The cloud revolution has transformed how businesses deploy, scale, and manage data streaming solutions. While Software-as-a-Service (SaaS) and Platform-as-a-Service (PaaS) cloud models are often used interchangeably in marketing, their distinctions have significant implications for operational efficiency, cost, and scalability. In the context of data streaming around Apache Kafka and Flink, understanding these differences and recognizing common misconceptions—such as the overuse of the term “serverless”—can help you make an informed decision. Additionally, the emergence of Bring Your Own Cloud (BYOC) offers yet another option, providing organizations with enhanced control and flexibility in their cloud environments.

SaaS vs PaaS Cloud Service for Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

If you’re still grappling with the fundamentals of stream processing, this article is a must-read: “Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink“.

What is SaaS in Data Streaming?

SaaS data streaming solutions are fully managed services where the provider handles all aspects of deployment, maintenance, scaling, and updates. SaaS offerings are designed for ease of use, providing a serverless experience where developers focus solely on building applications rather than managing infrastructure.

Characteristics of SaaS in Data Streaming

  1. Serverless Architecture: Resources scale automatically based on demand. True SaaS solutions eliminate the need to provision or manage servers.
  2. Low Operational Overhead: The provider manages hardware, software, and runtime configurations, including upgrades and security patches.
  3. Pay-As-You-Go Pricing: Consumption-based pricing aligns costs directly with usage, reducing waste during low-demand periods.
  4. Rapid Deployment: SaaS enables users to start processing streams within minutes, accelerating time-to-value.

Examples of SaaS in Data Streaming:

  • Confluent Cloud: A fully managed Kafka platform offering serverless scaling, multi-tenancy, and a broad feature set for both stateless and stateful processing.
  • Amazon Kinesis Data Analytics: A managed service for real-time analytics with automatic scaling.

What is PaaS in Data Streaming?

PaaS offerings sit between fully managed SaaS and infrastructure-as-a-service (IaaS). These solutions provide a platform to deploy and manage applications but still require significant user involvement for infrastructure management.

Characteristics of PaaS in Data Streaming

  1. Partial Management: The provider offers tools and frameworks, but users must manage servers, clusters, and scaling policies.
  2. Manual Configuration: Deployment involves provisioning VMs or containers, tuning parameters, and monitoring resource usage.
  3. Complex Scaling: Scaling is not always automatic; users may need to adjust resource allocation based on workload changes.
  4. Higher Overhead: PaaS requires more expertise and operational involvement, making it less accessible to teams without dedicated DevOps resources.

PaaS offerings in data streaming, while simplifying some infrastructure tasks, still require significant user involvement compared to fully serverless SaaS solutions. Below are three common examples, along with their benefits and pain points compared to serverless SaaS:

  • Apache Flink (Self-Managed on Kubernetes Cloud Service like EKS, AKS or GKE)
    • Benefits: Full control over deployment and infrastructure customization.
    • Pain Points: High operational overhead for managing Kubernetes clusters, manual scaling, and complex resource tuning.
  • Amazon Managed Service for Apache Flink (Amazon MSF)
    • Benefits: Simplifies infrastructure management and integrates with some other AWS services.
    • Pain Points: Users still handle job configuration, scaling optimization, and monitoring, making it less user-friendly than serverless SaaS solutions.
  • Amazon MSK (Managed Streaming for Apache Kafka)
    • Benefits: Eases Kafka cluster maintenance and integrates with the AWS ecosystem.
    • Pain Points: Requires users to design and manage producers/consumers, manually configure scaling, and handle monitoring responsibilities. MSK also excludes the Kafka engine itself from support: operational issues within Kafka are not covered.

SaaS vs. PaaS: Key Differences

SaaS and PaaS differ in the level of management and user responsibility, with SaaS offering fully managed services for simplicity and PaaS requiring more user involvement for customization and control.

Feature                 | SaaS                              | PaaS
Infrastructure          | Fully managed by the provider     | Partially managed; user controls clusters
Scaling                 | Automatic and serverless          | Manual or semi-automatic scaling
Deployment Speed        | Immediate, ready to use           | Slower; requires configuration
Operational Complexity  | Minimal                           | Moderate to high
Cost Model              | Consumption-based, no idle costs  | May incur idle resource costs

The big benefit of PaaS over SaaS is greater flexibility and control, allowing users to customize the platform, integrate with specific infrastructure, and optimize configurations to meet unique business or technical requirements. This level of control is often essential for organizations with strict compliance, security, or data sovereignty requirements.

SaaS is NOT Always Better than PaaS!

Be careful: The limitations and pain points of PaaS do NOT mean that SaaS is always better.

A concrete example: Amazon MSK Serverless simplifies Apache Kafka operations with automated scaling and infrastructure management but comes with significant limitations, including the lack of fully-managed connectors, advanced data governance tools, and native multi-language client support.

Amazon MSK also excludes Kafka engine support, leading to potential operational risks and cost unpredictability, especially when integrating with additional AWS services for a complete data streaming pipeline. I explored these challenges in more detail in my article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“.

Bring Your Own Cloud (BYOC) as Alternative to PaaS

BYOC (Bring Your Own Cloud) offers a middle ground between fully managed SaaS and self-managed PaaS solutions, allowing organizations to host applications in their own VPCs.

BYOC provides enhanced control, security, and compliance while reducing operational complexity. This makes BYOC a strong alternative to PaaS for companies with strict regulatory or cost requirements.

As an example, here are the options of Confluent for deploying the data streaming platform: Serverless Confluent Cloud, Self-managed Confluent Platform (some consider this a PaaS if you leverage Confluent’s Kubernetes Operator and other automation / DevOps tooling) and WarpStream as BYOC offering:

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

While BYOC complements SaaS and PaaS, it can be a better choice when fully managed solutions don’t align with specific business needs. I wrote a detailed article about this topic: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

“Serverless” Claims: Don’t Trust the Marketing

Many cloud data streaming solutions are marketed as “serverless,” but this term is often misused. A truly serverless solution should:

  1. Abstract Infrastructure: Users should never need to worry about provisioning, upgrading, or cluster sizing.
  2. Scale Transparently: Resources should scale up or down automatically based on workload.
  3. Eliminate Idle Costs: There should be no cost for unused capacity.

However, many products marketed as serverless still require some degree of infrastructure management or provisioning, making them closer to PaaS. For example:

  • A so-called “serverless” PaaS solution may still require setting initial cluster sizes or monitoring node health.
  • Some products charge for pre-provisioned capacity, regardless of actual usage.

Do Your Own Research

When evaluating data streaming solutions, dive into the technical documentation and ask pointed questions:

  • Does the solution truly abstract infrastructure management?
  • Are scaling policies automatic, or do they require manual configuration?
  • Is there a minimum cost even during idle periods?

By scrutinizing these factors, you can avoid falling for misleading “serverless” claims and choose a solution that genuinely meets your needs.

Choosing the Right Model for Your Data Streaming Business: SaaS, PaaS, or BYOC

When adopting a data streaming platform, selecting the right model is crucial for aligning technology with your business strategy:

  • Use SaaS (Software as a Service) if you prioritize ease of use, rapid deployment, and operational simplicity. SaaS is ideal for teams looking to focus entirely on application development without worrying about infrastructure.
  • Use PaaS (Platform as a Service) if you require deep customization, control over resource allocation, or have unique workloads that SaaS offerings cannot address.
  • Use BYOC (Bring Your Own Cloud) if your organization demands full control over its data but sees benefits in fully managed services. BYOC enables you to run the data plane within your cloud VPC, ensuring compliance, security, and architectural flexibility while leveraging SaaS functionality for the control plane.

In the rapidly evolving world of data streaming around Apache Kafka and Flink, SaaS data streaming platforms like Confluent Cloud provide the best of both worlds: the advanced features of tools like Apache Kafka and Flink, combined with the simplicity of a fully managed serverless experience. Whether you’re handling stateless stream processing or complex stateful analytics, SaaS ensures you’re scaling efficiently without operational headaches.

What deployment option do you use today for Kafka and Flink? Any changes planned in the future? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL?
https://www.kai-waehner.de/blog/2025/01/14/apache-flink-overkill-for-simple-stateless-stream-processing/ (Tue, 14 Jan 2025)

Discover when Apache Flink is the right tool for your stream processing needs. Explore its role in stateful and stateless processing, the advantages of serverless Flink SaaS solutions like Confluent Cloud, and how it supports advanced analytics and real-time data integration together with Apache Kafka. Dive into the trade-offs, deployment options, and strategies for leveraging Flink effectively across cloud, on-premise, and edge environments, and when to use Kafka Streams or Single Message Transforms (SMT) within Kafka Connect for ETL instead of Flink.

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transforms (SMT) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.
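The following Kafka Streams sketch illustrates such a stateless filter. The topic names, the threshold of 1000, and the assumption that the message value is a plain numeric string are hypothetical choices for illustration; a real pipeline would typically use JSON or Avro payloads with an appropriate Serde.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class HighValueTransactionFilter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "high-value-transaction-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stateless processing: each event is evaluated on its own, no state store required.
        KStream<String, String> transactions = builder.stream("transactions");
        transactions
                .filter((key, value) -> Double.parseDouble(value) > 1000.0) // keep "large" transactions only
                .to("high-value-transactions");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the stream processing application cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because Kafka Streams is just a library embedded in a regular Java application, this runs against any Kafka-compatible cluster without additional server infrastructure, which is exactly why a full Flink deployment is often unnecessary for tasks like this.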

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints and simpler deployment models, such as Kafka Streams, often suffice.

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, FlinkSQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.

The conversation shifts dramatically with Serverless Flink Cloud offerings, such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack how Serverless Flink makes a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads.  Elastic Scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can apply ETL functions such as transformation, deduplication, or field masking:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.
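To illustrate what that transition can look like, here is a minimal, hedged Flink Table API sketch (Java) of a stateful, windowed aggregation: revenue per merchant in one-minute tumbling windows. The transactions topic, field names, and connector options are assumptions for illustration and must be adapted to your environment; on a SaaS offering, the same logic could be submitted as a Flink SQL statement without any cluster setup.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WindowedRevenuePerMerchant {

    public static void main(String[] args) {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Kafka-backed table; topic, schema, and connector options
        // are illustrative and depend on the concrete environment.
        tableEnv.executeSql(
                "CREATE TABLE transactions (" +
                "  merchant STRING," +
                "  amount DOUBLE," +
                "  ts TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'transactions'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // Stateful, windowed aggregation: Flink maintains state per merchant and
        // per one-minute tumbling window and emits the aggregated revenue.
        tableEnv.executeSql(
                "SELECT merchant," +
                "       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start," +
                "       SUM(amount) AS revenue" +
                " FROM transactions" +
                " GROUP BY merchant, TUMBLE(ts, INTERVAL '1' MINUTE)")
                .print();
    }
}
```

The same statements work whether Flink runs self-managed, on a PaaS, or as a fully managed SaaS; only the operational responsibility for the underlying cluster changes.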

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing in Flink enables true decoupling within the enterprise architecture (as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect). The benefits are separation of concerns with a domain-driven design (DDD) and improved data governance. And Flink provides interfaces for Java, Python, and SQL, offering something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability.

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apache Kafka Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect unless you’re already invested in a SaaS such as Confluent Cloud where Flink is also the appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment—cloud, on-premise, or edge—and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

The Data Streaming Landscape 2025
https://www.kai-waehner.de/blog/2024/12/04/the-data-streaming-landscape-2025/ (Wed, 04 Dec 2024)

Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture. With real-time data processing transforming industries, the ecosystem of tools, platforms, and services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The data streaming landscape of 2025 categorizes solutions by their adoption and completeness as fully-featured data streaming platforms, as well as their deployment models, which range from self-managed setups to BYOC (Bring Your Own Cloud) and fully managed cloud services like PaaS and SaaS. While Apache Kafka remains the backbone of this ecosystem, the landscape also includes stream processing engines like Apache Flink and competitive technologies such as Pulsar and Redpanda that are built on the Kafka protocol.

This blog also explores the latest market trends and provides an outlook for 2025 and beyond, highlighting potential new entrants and evolving use cases. By the end, you’ll gain a clear understanding of the data streaming platform landscape and its trajectory in the years to come.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free ebook about data streaming use cases and industry-specific success stories.

Data Streaming in 2025: The Rise of a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming continuously process data from any data source. This approach increases business value by increasing revenue, reducing cost, reducing risk, or improving the customer experience. And the event-driven approach ensures a future-ready architecture.

Even top researchers and advisory firms such as Forrester, Gartner, and IDG recognize data streaming as a new software category. In December 2023, Forrester released “The Forrester Wave™: Streaming Data Platforms, Q4 2023,” highlighting Microsoft, Google, and Confluent as leaders, followed by Oracle, Amazon, and Cloudera.

Data Streaming is NOT just another data integration tool. Plenty of software categories and related data platforms exist to process and analyze data. I explored how data streaming differs in a few dedicated blog series.

The Business Value of Data Streaming

A new software category opens use cases and adds business value across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products.

Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2025

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. Several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Various SaaS startups have emerged in this category in the last few years.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

It all began with the open-source framework Apache Kafka, and today, the Kafka protocol is widely adopted across various implementations, including proprietary ones. What truly matters now is leveraging the capabilities of a complete data streaming platform—one that is fully compatible with the Kafka protocol. This includes built-in features like connectors, stream processing, security, data governance, and the elimination of self-management, reducing risks and operational effort.

The Kafka Protocol is the De Facto Standard of Data Streaming

Most software vendors use Kafka (or its protocol) at the core of their data streaming platforms. Apache Kafka has become the de facto standard for data streaming.

Apache Kafka is the de facto standard API for data streaming in 2024

Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is a favorite marketing technique to “prove” differentiators against the real Apache Kafka. Some vendors also present misleading cost-efficiency comparisons by excluding critical cloud costs such as data transfer or storage, giving an incomplete picture of the true expenses.

Apache Kafka vs. Data Streaming Platform

Many still use Kafka merely as a dumb ingestion pipeline, overlooking its potential to power sophisticated, real-time data streaming use cases. One reason is that Kafka alone lacks the full capabilities of a comprehensive data streaming platform.

A complete solution requires more than “just” Kafka. Apache Flink is becoming the de facto standard for stream processing. Data integration capabilities (connectors, clients, APIs), data governance, security, and critical 24/7 SLAs and support are important for many data streaming projects.

The Data Streaming Landscape 2025 summarizes the current status of relevant products and cloud services, focusing on deployment models and the adoption/completeness of the data streaming platforms.

Data Streaming Vendors and Categories for the 2025 Landscape

The data streaming landscape changed this year. As most solutions evolve, I do not distinguish anymore between Kafka, non-Kafka, and stream processing as categories. Instead, I look at the adoption and completeness to assess the maturity of a data streaming solution from an open-source framework to a complete platform.

The deployment models also changed in the 2025 landscape. Instead of categorizing it into Self Managed, Partially Managed, and Fully Managed, I sort as follows: Self Managed, Bring Your Own Cloud (BYOC), and Cloud. The Cloud category is separated into PaaS (Platform as a Service) and SaaS (Software as a Service) to indicate that many Kafka cloud offerings are still NOT fully managed!

Please note: Intentionally, this data streaming landscape is not a complete list of frameworks, cloud services, or vendors. It is also not official research. There is no statistical evidence. You might miss your favorite technology in this diagram; if so, I did not come across it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time-series databases, machine learning engines, observability platforms, or purpose-built IoT solutions. These are usually complementary, often connected out of the box via a Kafka connector, or even built on top of a data streaming platform (invisible for the end user).

Adoption and Completeness of Data Streaming (X-Axis)

Data streaming is adopted more and more across all industries. The concept is not new. In “The Past, Present and Future of Stream Processing“, I explored how the data streaming journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Does anyone still remember (or even still use) Apache Storm? 🙂

Today, most enterprises are realizing the value of data streaming for both analytical and operational use cases across industries. The cloud has brought a transformative shift, enabling businesses to start streaming and processing data with just a click, using fully managed SaaS solutions and intuitive UIs. Complete data streaming platforms now offer many built-in features that users previously had to develop themselves, including connectors, encryption, access control, governance, data sharing, and more.

Capabilities of a Complete Data Streaming Platform

Data streaming vendors are on the way to building a complete Data Streaming Platform (DSP). Capabilities include:

  • Messaging (“Streaming”): Transfer messages in real-time and persist for durability, decoupling, and slow consumers (near real-time, batch, API, file).
  • Data Integration: Connect to any legacy and cloud-native sources and sinks.
  • Stream Processing: Correlate events with stateless and stateful transformation or business logic.
  • Data Governance:  Ensure security, observability, data sovereignty, and compliance.
  • Developer Tooling: Enable flexibility for different personas such as software engineers, data scientists, and business analysts by providing different APIs (such as Java, Python, SQL, REST/HTTP), graphical user interfaces, and dashboards.
  • Operations Tooling and SaaS: Ease infrastructure management on-premise, or take over the entire operations burden in the public cloud with serverless offerings.
  • Uptime SLAs and Support: Provide the required guarantees and expertise for critical use cases.

Evolution from Open Source Adoption to a Data Streaming Organization

Modern data streaming is not just about adopting a product; it’s about transforming the way organizations operate and derive value from their data. Hence, the adoption goes beyond product features:

  • From open source and self-operations to enterprise-grade products and SaaS.
  • From human scale to automated, serverless elasticity with consumption-based pricing.
  • From dumb ingestion pipes to complex data pipelines and business applications.
  • From analytical workloads to critical transactional (and analytical) use cases.
  • From a single data streaming cluster to a powerful edge, hybrid, and multi-cloud architecture, including integration, migration, aggregation, and disaster recovery scenarios.
  • From wild adoption across business units with uncontrolled growth using various frameworks, cloud services, and integration tools to a center of excellence (CoE) with a strategic approach with standards, best practices, and knowledge sharing in an internal community.
  • From effortful and complex human management to enterprise-wide data governance, automation, and self-service APIs.

Data Streaming Deployment Models: Self-Managed vs. BYOC vs. Cloud (Y-Axis)

Different data streaming categories exist regarding the deployment model:

  • Self-Managed: Operate nodes like Kafka Broker, Kafka Connect, and Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC. Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions.
  • Bring Your Own Cloud (BYOC): Allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors. The data plane is still customer-managed, but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what cloud-native object storage and other magic code of the BYOC control plane service take over.
  • Cloud (PaaS or SaaS): Both PaaS and SaaS solutions operate within the cloud provider’s VPC. Fully managed SaaS for data streaming takes overall operational responsibilities, including scaling, failover handling, upgrades, and performance tuning, allowing users to focus solely on integration and business logic. In contrast, partially managed PaaS reduces the operational burden by automating certain tasks like rolling upgrades and rebalancing, but still requires some level of user involvement in managing the infrastructure. Fully Managed SaaS typically provides critical SLAs for support and uptime while partially managed PaaS cannot provide such guarantees.

Most organizations prefer SaaS for data streaming when business and technical requirements allow, as it minimizes operational complexity and maximizes scalability. Other deployment models are chosen when specific constraints or needs require them.

The Evolution of BYOC Kafka Cloud Services

Cloud and On-Premise deployment options are typically well understood, but BYOC (Bring Your Own Cloud) often requires more explanation due to its unique operating model and varying implementations across vendors.

In last year’s data streaming landscape 2024, I wrote the following about BYOC for Kafka:

I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.

This statement made sense because BYOC vendors at that time required access to the customer VPC and offered a shared support model. While this is still true for some BYOC solutions, my mind changed with the innovation of BYOC by one emerging vendor: WarpStream.

WarpStream’s BYOC Operating Model with Stateless Agents in the Customer VPC

WarpStream published a new operating model for BYOC: The customer only deploys stateless agents in its VPC and provides an object storage bucket to store the data. The control plane and metadata store are fully managed by the vendor as SaaS and the vendor takes over all the complexity.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

With this innovation, BYOC is now a worthwhile third deployment option besides a self-managed and fully managed data streaming platform. It brings several benefits:

  • No access is needed by the BYOC cloud vendor to the customer VPC. The data plane (i.e., the “brokers” in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of the BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

When to use BYOC?

WarpStream introduced an innovative share-nothing operating model that makes BYOC practical, secure, and cost-efficient. With that being said, I still recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons! When it comes to simplicity and ease of operation, nothing beats a fully managed cloud service.

And please keep in mind that NOT every BYOC cloud service provides these characteristics and benefits. Make sure to make a proper evaluation of your favorite solutions. For more details, look at my blog post: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

Changes in the Data Streaming Landscape from 2024 to 2025

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and cloud offerings that I see in the field, with adoption by the broader open-source and cloud community.

I already discussed the important conceptual changes in the data streaming landscape:

  • Deployment Model: From self-managed, partially managed, and fully managed to self-managed, BYOC and cloud.
  • Streaming Categories: From different streaming categories to a single category for all data streaming platforms sorted by adoption and completeness.

Additionally, as every year, I modified the list of solutions compared to the data streaming landscape 2024 published one year ago.

Additions to the Data Streaming Landscape 2025

The following data streaming services were added:

  • Alibaba (Cloud): Confluent Data Streaming Service on Alibaba Cloud is an OEM partnership between Alibaba Cloud and Confluent to offer a fully managed SaaS in Mainland China. The service was announced at the end of 2021 and sees growing traction in Asia. Alibaba is the contractor and first-level support for the end user.
  • Google Managed Service for Kafka (Cloud): Google announced this Kafka PaaS recently. The strategy looks very similar to Amazon’s MSK. Even the abbreviation is the same: MSK. I explored when (not) to choose Google’s Kafka cloud service after the announcement. The service is still in preview, but available to a huge customer base already.
  • Oracle Streaming with Apache Kafka (Cloud): A partially managed Apache Kafka PaaS on Oracle Cloud Infrastructure (OCI). The service is in early access, but available to a huge customer base already.
  • WarpStream (BYOC): WarpStream was acquired by Confluent. It is still included with its logo as Confluent continues to keep the brand and solution separated (at least for now).

Removals from the Data Streaming Landscape 2025

There are NO REMOVALS this year, BUT I was close to removing two technologies:

  • Apache Pulsar and StreamNative: I was close to removing Apache Pulsar as I see zero traction in the market. Around 2020, Pulsar had some traction but focused too much on Kafka FUD instead of building a vibrant community. While Kafka simplified its architecture (ZooKeeper removal), Pulsar still includes three distributed systems (ZooKeeper or alternatives like etcd, BookKeeper, and Pulsar Broker). It also pivots to the Kafka protocol trying to get some more traction again. But it seems to be too late.
  • ksqlDB (formerly called KSQL): The product is feature complete. While it is still supported by Confluent, it will not get any new features. ksqlDB is still a great piece of software for some (!) Kafka-native stream processing projects but might be removed in the future. Confluent co-founder and Kafka co-creator Jay Kreps commented on X (formerly Twitter): “Confluent went through a set of experiments in this area. What we learned is that for *platform* layers you want a clean separation. We learned this the hard way: our source available stream processing layer KSQL, lost to open-source Apache Flink. We pivoted to Flink.”

Vendor Overview for Data Streaming Platforms

All vendors of the landscape do some kind of data streaming. However, the offerings differ a lot in adoption, completeness, and vision. And many solutions are not available everywhere but only in one public cloud or only as self-managed. For detailed product information and experiences, the vendor websites and other blogs/conference talks are the best and most up-to-date resources. The following is just a summary to get an overview.

Before we do the deep dive, here again, the entire data streaming landscape for 2025:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Self-Managed Data Streaming with Open Source and Proprietary Products

Self Managed Data Streaming Platforms including Kafka Flink Spark IBM Confluent Cloudera

The following list describes the open-source frameworks and proprietary products for self-managed data streaming deployments (in order of adoption and completeness):

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but different architecture. Kafka is a single distributed cluster – after removing the ZooKeeper dependency in 2022. Pulsar is three (!) distributed clusters: Pulsar brokers, ZooKeeper, and BookKeeper. Pulsar vs. Kafka explored the differences. And Kafka catches up to some missing features like Queues for Kafka.
  • StreamNative: The primary vendor behind Apache Pulsar. Not much market traction.
  • ksqlDB (usually called KSQL, even after Confluent’s rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
  • Redpanda: Implements the Kafka protocol with C++. Trying out different market strategies to define Redpanda as an alternative to a Kafka-native offering. Still in the early stage in the maturity curve. Adding tons of (immature) product features in parallel to find the right market fit in a growing Kafka market. Recently acquired Benthos to provide connectivity to data sources and sinks (similar to Kafka Connect).
  • Ververica: Well-known Flink company. Acquired by Chinese tech giant Alibaba in 2019. Not much traction in Europe and the US. Sweet spot: Flink in Mainland China.
  • Apache Flink: Becoming the de facto standard for stream processing. Open-source implementation. Provides advanced features including a powerful scalable compute engine, freedom of choice for developers between SQL, Java, and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads.
  • Spark Streaming: The streaming part of the open-source big data processing framework Apache Spark. The enormous installed base of Spark clusters in enterprises broadens adoption thanks to solutions from Cloudera, Databricks, and the cloud service providers. Sweet spot: Analytics in (micro)batches with data stored at rest in the data lake/lakehouse.
  • Apache Kafka: The de facto standard for data streaming. Open-source implementation with a vast community. Almost all vendors rely on (parts of) this project. Often underestimated: Kafka includes data integration and stream processing capabilities with Kafka Connect and Kafka Streams, making even the open-source Apache download already more powerful than many other data streaming frameworks and products.
  • IBM / Red Hat AMQ Streams: Provides Kafka as self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel. Sweet spot: Existing IBM customers.
  • Cloudera: Provides Kafka, Flink, and other open-source data and analytics frameworks as a self-managed offering. The main strategy is offering one product with a vast combination of many open-source frameworks that can be deployed on any infrastructure. Sweet spot: Analytics.
  • Confluent Platform: Focuses on a complete data streaming platform including Kafka and Flink, and various advanced data streaming capabilities for operations, integration, governance, and security. Sweet spot: Unifying operational and analytical workloads, and combination with the fully managed cloud service.

Data Streaming with Bring Your Own Cloud (BYOC)

Bring Your Own Cloud BYOC Data Streaming Platforms Redpanda Databricks WarpStream

BYOC is an emerging category and is mainly used for specific challenges such as strict data security and compliance requirements. The following vendors provide dedicated BYOC offerings for data streaming (in order of adoption and completeness):

  • WarpStream (Confluent): A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. Innovated the BYOC model to enable secure and cost-effective data streaming for workloads that don’t have strong latency requirements.
  • Redpanda: The first BYOC offering on the market for data streaming. The biggest concern is the shared responsibility model of this solution because the vendor requires access to the customer VPC for operations and support. This is against the key principles of BYOC regarding security and compliance and why organizations (have to) look for BYOC instead of SaaS solutions.
  • Databricks: Cloud-based data platform that provides a collaborative environment for data engineering, data science, and machine learning, built on top of Apache Spark. Data Streaming is enabled by Spark Streaming and focuses mainly on analytical workloads that are optimized from batch to near real-time.

Partially Managed Data Streaming Cloud Platforms (PaaS)

PaaS Cloud Streaming Platforms like Red Hat Amazon MSK Google Oracle Aiven

Here is an overview of relevant PaaS data streaming cloud services (in order of adoption and completeness):

  • Google Cloud Managed Service for Apache Kafka (MSK): Initially branded as Google Managed Kafka for BigQuery (likely for a better marketing push), the service enables data ingestion into lakehouses on GCP such as Google BigQuery.
  • Amazon Managed Service for Apache Flink (MSF): A partially managed service by AWS that allows customers to transform and analyze streaming data in real-time with Apache Flink. It still provides some (costly) gaps for auto-scaling and is not truly serverless. Supports all Flink interfaces, i.e., SQL, Java, and Python. And non-Kafka connectors, too. Only available on AWS.
  • Oracle OCI Streaming with Apache Kafka: The service is still in early access, but available to a huge customer base already on Oracle’s cloud infrastructure.
  • Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for use cases beyond data ingestion for batch analytics.
  • Instaclustr: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies.
  • IBM / Red Hat AMQ Streams: Provides Kafka as a partially managed cloud offering on OpenShift (through Red Hat). Sweet spot: Existing IBM customers.
  • Amazon Managed Service for Apache Kafka (MSK): AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. MSK is only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc. MSK is likely the largest partially managed Kafka service across all clouds. It evolved with new features like support for Kafka Connect and Tiered Storage. But lacks connectivity outside the AWS ecosystem and a data governance narrative.

Fully Managed Data Streaming Cloud Services (SaaS)

Fully Managed SaaS Data Streaming Platform including Confluent Alibaba Microsoft Event Hubs

Here is an overview of relevant SaaS data streaming cloud services (in order of adoption and completeness):

  • Decodable: A relatively new cloud service for Apache Flink in the early stage. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • StreamNative Cloud: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still at a very early stage of maturity; it is unclear whether it will ever gain meaningful adoption.
  • Ververica: Stream processing as a service powered by Apache Flink on all major cloud providers. Huge potential if it is combined with existing Kafka infrastructures in enterprises. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • Redpanda Cloud: Redpanda provides its data streaming as a serverless offering. Not much information is available on the website about this part of the vendor’s product portfolio.
  • Amazon MSK Serverless: Different functionalities and limitations than Amazon MSK. MSK Serverless still does not get much traction because of its limitations. Both MSK offerings exclude Kafka support in their SLAs (please read the terms and conditions).
  • Google Cloud DataFlow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform but is more comparable to Amazon Kinesis or Google Cloud PubSub. The limited compatibility with the Kafka protocol and the high cost of the service for higher throughput are the two major blockers that come up regularly in conversations.
  • Confluent Cloud: A complete data streaming platform including Kafka and Flink as a fully managed offering. In addition to deep integration, the platform provides connectivity, security, data governance, self-service portal, disaster recovery tooling, and much more to be the most complete DSP on all major public clouds.

Potential for the Data Streaming Landscape 2026

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2025:

  • New startups around the Kafka protocol emerge. The list of new frameworks and cloud services is growing every quarter. A few names I saw in some social network posts (but not much beyond in the real world): AutoMQ, S2, Astradot, Bufstream, Responsive, tansu, Tektite, Upstash. While some focus on the messaging/streaming part, others focus on a particular piece such as building database capabilities.
  • Streaming databases like Materialize or RisingWave might become a new software category. My feeling: Very early stage of the hype cycle. We will see in 2025 if and where this technology gets more broadly adopted and what the use cases are. It is hard to answer how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Timeplus, Tinybird, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find the right business problems for them.
  • Stream Processing SaaS startups emerge: Quix and Bytewax provide stream processing with Python. Quix now also offers a hosted offering based on Kafka Streams; as does Responsive. DeltaStream provides Apache Flink as SaaS. And many more startups emerge these days. Let’s see which of these gets traction in the market with an innovative product and business model.
  • Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, and others only provide a proprietary implementation.

Although they are not aiming to be full data streaming platforms (and thus are not part of the platform landscape), other complementary tools are gaining momentum in the data streaming ecosystem. For instance, Conduktor is developing a proxy for Kafka clusters, and Lenses, though relatively quiet since its acquisition by Celonis, has recently introduced updates to its user-friendly management and developer tools. These tools address gaps that some data streaming platforms leave unfilled.

Data Streaming: A Journey, Not a Sprint

Data streaming isn’t a sprint—it’s a journey! Adopting event-driven architectures with technologies like Apache Kafka or Apache Flink requires rethinking how applications are designed, developed, deployed, and monitored. Modern data strategies involve legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud environments.

The data streaming landscape in 2025 highlights the emergence of a new software category, though it’s still in its early stages. Building such a category takes time. In discussions with customers, partners, and the community, a common sentiment emerges: “We understand the value but are just starting with the first data streaming pipelines and have a long-term plan to implement advanced stream processing and build a strategic data streaming organization.”

The Forrester Wave: Streaming Data Platforms, Q4 2023, and other industry reports from Gartner and IDG signal that this category is progressing through the hype cycle.

Last but not least, check out my Top Data Streaming Trends for 2025 to understand how the data streaming landscape fits into emerging trends:

  1. Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a foundational platform in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka protocol over the framework, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize tools, governance, and collaboration.

Which are your favorite open-source frameworks or cloud services for data streaming? What are your most relevant and exciting trends around Apache Kafka and Flink in 2025 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. Make sure to download my free ebook about data streaming use cases and industry examples.

The post The Data Streaming Landscape 2025 appeared first on Kai Waehner.

]]>
Open Standards for Data Lineage: OpenLineage for Batch AND Streaming https://www.kai-waehner.de/blog/2024/05/13/open-standards-for-data-lineage-openlineage-for-batch-and-streaming/ Mon, 13 May 2024 05:20:11 +0000 https://www.kai-waehner.de/?p=6364 One of the greatest wishes of companies is end-to-end visibility in their operational and analytical workflows. Where does data come from? Where does it go? Whom am I giving access to? How can I track data quality issues? The capability to follow the data flow to answer these questions is called data lineage. This blog post explores market trends, efforts to provide an open standard with OpenLineage, and how data governance solutions from vendors such as IBM, Google, Confluent and Collibra help fulfil the enterprise-wide data governance needs of most companies, including data streaming technologies such as Apache Kafka and Flink.

The post Open Standards for Data Lineage: OpenLineage for Batch AND Streaming appeared first on Kai Waehner.

]]>
One of the greatest wishes of companies is end-to-end visibility in their operational and analytical workflows. Where does data come from? Where does it go? Whom am I giving access to? How can I track data quality issues? The capability to follow the data flow to answer these questions is called data lineage. This blog post explores market trends, efforts to provide an open standard with OpenLineage, and how data governance solutions from vendors such as IBM, Google, Confluent and Collibra help fulfil the enterprise-wide data governance needs of most companies, including data streaming technologies such as Apache Kafka and Flink.

Data Lineage for Data Streaming with OpenLineage Apache Kafka and Flink

What is Data Governance?

Data governance refers to the overall management of the availability, usability, integrity, and security of data used in an organization. It involves establishing processes, roles, policies, standards, and metrics to ensure that data is properly managed throughout its lifecycle. Data governance aims to ensure that data is accurate, consistent, secure, and compliant with regulatory requirements and organizational policies. It encompasses activities such as data quality management, data security, metadata management, and compliance with data-related regulations and standards.

What is the Business Value of Data Governance?

The business value of data governance is significant and multifaceted:

  1. Improved Data Quality: Data governance ensures that data is accurate, consistent, and reliable, leading to better decision-making, reduced errors, and improved operational efficiency.
  2. Enhanced Regulatory Compliance: By establishing policies and procedures for data management and ensuring compliance with regulations such as GDPR, HIPAA, and CCPA, data governance helps mitigate risks associated with non-compliance, including penalties and reputational damage.
  3. Increased Trust and Confidence: Effective data governance instills trust and confidence in data among stakeholders. It leads to greater adoption of data-driven decision-making and improved collaboration across departments.
  4. Cost Reduction: By reducing data redundancy, eliminating data inconsistencies, and optimizing data storage and maintenance processes, data governance helps organizations minimize costs associated with data management and compliance.
  5. Better Risk Management: Data governance enables organizations to identify, assess, and mitigate risks associated with data management, security, privacy, and compliance, reducing the likelihood and impact of data-related incidents.
  6. Support for Business Initiatives: Data governance provides a foundation for strategic initiatives such as digital transformation, data analytics, and AI/ML projects by ensuring that data is available, accessible, and reliable for analysis and decision-making.
  7. Competitive Advantage: Organizations with robust data governance practices can leverage data more effectively to gain insights, innovate, and respond to market changes quickly, giving them a competitive edge in their industry.

Overall, data governance contributes to improved data quality, compliance, trust, cost efficiency, risk management, and competitiveness, ultimately driving better business outcomes and value creation.

What is Data Lineage?

Data lineage refers to the ability to trace the complete lifecycle of data, from its origin through every transformation and movement across different systems and processes. It provides a detailed understanding of how data is created, modified, and consumed within an organization’s data ecosystem, including information about its source, transformations applied, and destinations.

Data Lineage is an essential component of Data Governance: Understanding data lineage helps organizations ensure data quality, compliance with regulations, and adherence to internal policies by providing visibility into data flows and transformations.

Data Lineage is NOT Event Tracing!

Event tracing and data lineage are different concepts that serve distinct purposes in the realm of data management:

Data Lineage:

  • Data lineage refers to the ability to track and visualize the complete lifecycle of data, from its origin through every transformation and movement across different systems and processes.
  • It provides a detailed understanding of how data is created, modified, and consumed within an organization’s data ecosystem, including information about its source, transformations applied, and destinations.
  • Data lineage focuses on the flow of data and metadata, helping organizations ensure data quality, compliance, and trustworthiness by providing visibility into data flows and transformations.

Event Tracing:

  • Event tracing, also known as distributed tracing, is a technique used in distributed systems to monitor and debug the flow of individual requests or events as they traverse through various components and services.
  • It involves instrumenting applications to generate trace data, which contains information about the path and timing of events as they propagate across different nodes and services.
  • Event tracing is primarily used for performance monitoring, troubleshooting, and root cause analysis in complex distributed systems, helping organizations identify bottlenecks, latency issues, and errors in request processing.

In summary, data lineage focuses on the lifecycle of data within an organization’s data ecosystem, while event tracing is more concerned with monitoring the flow of individual events or requests through distributed systems for troubleshooting and performance analysis.

Here is an example in payments processing: Data lineage would track the path of payment data from initiation to settlement, detailing each step and transformation it undergoes. Meanwhile, event tracing would monitor individual events within the payment system in real-time, capturing the sequence and outcome of actions, such as authentication checks and transaction approvals.

What is the Standard ‘OpenLineage’?

OpenLineage is an open-source project that aims to standardize metadata management for data lineage. It provides a framework for capturing, storing, and sharing metadata related to the lineage of data as it moves through various stages of processing within an organization’s data infrastructure. By providing a common format and APIs for expressing and accessing lineage information, OpenLineage enables interoperability between different data processing systems and tools, facilitating data governance, compliance, and data quality efforts.

Data Model of OpenLineage for Kafka, Flink, Iceberg, Object Storage
Source: OpenLineage (presented at Kafka Summit London 2024)

OpenLineage is an open platform for the collection and analysis of data lineage. It includes an open standard for lineage data collection, integration libraries for the most common tools, and a metadata repository/reference implementation (Marquez). Many frameworks and tools already support producers/consumers:

OpenLineage Supported Producers and Consumers including Flink, dbt, Airflow, Spark, Marquez, atlan, manta, Snowflake, etc.
Source: OpenLineage (presented at Kafka Summit London 2024)
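
To make the standard more tangible, here is a minimal sketch of what emitting an OpenLineage run event over plain HTTP could look like. It assumes a locally running Marquez instance (the reference implementation) on its default port; the job, run, and dataset names are purely illustrative, and the field layout follows the public OpenLineage spec, so check the current spec and client libraries before relying on it:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class LineageEventEmitter {
        public static void main(String[] args) throws Exception {
            // Minimal illustrative RunEvent; the OpenLineage spec defines further (partly
            // required) fields such as schemaURL, so align this with the spec version you target.
            String runEvent = """
                {
                  "eventType": "COMPLETE",
                  "eventTime": "2024-05-13T10:15:00.000Z",
                  "run":  { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
                  "job":  { "namespace": "payments", "name": "enrich-transactions" },
                  "inputs":  [ { "namespace": "kafka://prod", "name": "payments.raw" } ],
                  "outputs": [ { "namespace": "kafka://prod", "name": "payments.enriched" } ],
                  "producer": "https://example.com/my-flink-job"
                }""";

            // Assumption: Marquez listens on localhost:5000 and accepts OpenLineage events
            // on its lineage endpoint.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:5000/api/v1/lineage"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(runEvent))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Lineage backend responded with HTTP " + response.statusCode());
        }
    }

In practice, you would rarely hand-craft these events: the OpenLineage integrations for Flink, Spark, Airflow, dbt and others emit them automatically, which is exactly the value of a shared standard.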

Data streaming involves the real-time processing and movement of data through a distributed messaging platform. This enables organizations to efficiently ingest, process, and analyze large volumes of data from various sources. By decoupling data producers and consumers, a data streaming platform provides a scalable and fault-tolerant solution for building real-time data pipelines to support use cases such as real-time analytics, event-driven architectures, and data integration.

The de facto standard for data streaming is Apache Kafka, used by over 100,000 organizations. Kafka is not just used for big data; it also provides support for transactional workloads.

Data Governance Differences with Data Streaming Compared to Data Lake and Data Warehouse?

Implementing data governance and lineage with data streaming presents several differences and challenges compared to data lakes and data warehouses:

  1. Real-Time Nature: Data streaming involves the processing of data in real-time when it is generated, whereas data lakes and data warehouses typically deal with batch processing of historical data. This real-time nature of data streaming requires governance processes and controls that can operate at the speed of streaming data ingestion, processing, and analysis.
  2. Dynamic Data Flow: Data streaming environments are characterized by dynamic and continuous data flows, with data being ingested, processed, and analyzed in near-real-time. This dynamic nature requires data governance mechanisms that can adapt to changing data sources, schemas, and processing pipelines in real-time, ensuring that governance policies are applied consistently across the entire streaming data ecosystem.
  3. Granular Data Lineage: In data streaming, data lineage needs to be tracked at a more granular level compared to data lakes and data warehouses. This is because streaming data often undergoes multiple transformations and enrichments as it moves through streaming pipelines. In some cases, the lineage of each individual data record must be traced to ensure data quality, compliance, and accountability.
  4. Immediate Actionability: Data streaming environments often require immediate actionability of data governance policies and controls to address issues such as data quality issues, security breaches, or compliance violations in real-time. This necessitates the automation of governance processes and the integration of governance controls directly into streaming data processing pipelines, enabling timely detection, notification, and remediation of governance issues.
  • Scalability and Resilience: Data streaming platforms like Apache Kafka and Apache Flink are designed for scalability and resilience to handle both high volumes of data and transactional workloads with critical SLAs. The platform must ensure continuous stream processing even in the face of failures or disruptions. Data governance mechanisms in streaming environments need to be similarly scalable and resilient to keep pace with the scale and speed of streaming data processing, ensuring consistent governance enforcement across distributed and resilient streaming infrastructure.
  6. Metadata Management Challenges: Data streaming introduces unique challenges for metadata management, as metadata needs to be captured and managed in real-time to provide visibility into streaming data pipelines, schema evolution, and data lineage. This requires specialized tools and techniques for capturing, storing, and querying metadata in streaming environments, enabling stakeholders to understand and analyze the streaming data ecosystem effectively.

In summary, implementing data governance with data streaming requires addressing the unique challenges posed by the real-time nature, dynamic data flow, granular data lineage, immediate actionability, scalability, resilience, and metadata management requirements of streaming data environments. This involves adopting specialized governance processes, controls, tools, and techniques tailored to the characteristics and requirements of data streaming platforms like Apache Kafka and Apache Flink.

Schemas and Data Contracts for Streaming Data

Schemas and data contracts are the foundation of data governance for streaming data. Confluent Schema Registry is available on GitHub. It became the de facto standard for ensuring data quality and governance in Kafka projects across all industries – not just for Confluent projects, but also in the broader community leveraging open source technologies. Schema Registry is available under the Confluent Community License that allows deployment in production scenarios with no licensing costs.
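
As a minimal sketch of how such a data contract is enforced on the producer side, assuming a local Kafka broker and Schema Registry on their default ports, the Confluent Avro serializer, and an illustrative payments topic and Payment schema (adjust names, serializers, and versions to your environment):

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PaymentProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            // The Avro serializer registers/validates the record schema against Schema Registry,
            // enforcing the data contract for every produced message.
            props.put("value.serializer",
                      "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081"); // assumption: default port

            Schema schema = new Schema.Parser().parse("""
                { "type": "record", "name": "Payment", "namespace": "example",
                  "fields": [ { "name": "id", "type": "string" },
                              { "name": "amount", "type": "double" } ] }""");
            GenericRecord payment = new GenericData.Record(schema);
            payment.put("id", "tx-4711");
            payment.put("amount", 42.0);

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("payments", "tx-4711", payment));
            }
        }
    }

A message whose schema is incompatible with the registered, versioned contract is rejected at produce time instead of silently breaking downstream consumers.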

Confluent Schema Registry for good Data Quality and Governance using Apache Kafka
Source: Confluent

For more details, check out my article “Policy Enforcement and Data Quality for Apache Kafka with Schema Registry“. And here are two great case studies for financial services companies leveraging schemas and group-wide API contracts across the organization for data governance:

Confluent Schema Registry at ING Bank for Data Governance and Quality
Source: ING Bank

Confluent Cloud is an excellent example of a data governance solution for data streaming. The fully-managed data streaming platform provides capabilities such as

  • Data Catalog
  • Data Lineage
  • Stream Sharing
  • Data Portal

The Data Portal combines the capabilities in an intuitive user interface to discover, explore, access and use streaming data products:

Confluent Cloud Data Portal for Data Governance
Source: Confluent

All information and functionality are available in the UI for humans and as an API for integration scenarios.

If you want to learn more about data streaming and data governance in a fun way, check out the free comic ebook “The Data Streaming Revolution: The Force of Kafka + Flink Awakens“:

Comic Ebook - The Data Streaming Revolution: The Force of Kafka + Flink Awakens
Source: Confluent

Data Lineage for Streaming Data

Data lineage is a core foundation of data governance, so data streaming projects require good data lineage for visibility and governance. Today’s market mainly provides two options: custom projects or buying a commercial product/cloud service. But the market is evolving: open standards for data lineage are emerging and are integrating data streaming into their implementations.

Let’s explore an example of a commercial solution and an open standard for streaming data lineage:

  • Cloud service: Data Lineage as part of Confluent Cloud
  • Open standard: OpenLineage’s integration with Apache Flink and Marquez

To move forward with updates to critical applications or answer questions on important subjects like data regulation and compliance, teams need an easy means of comprehending the big picture journey of data in motion. Confluent Cloud provides a solution deeply integrated with Kafka and Flink as part of the fully managed SaaS offering.

Stream Lineage in Confluent Cloud
Source: Confluent

Stream lineage provides a graphical UI of event streams and data relationships with both a bird’s eye view and drill-down magnification for answering questions like:

  • Where did data come from?
  • Where is it going?
  • Where, when, and how was it transformed?

Answers to questions like these allow developers to trust the data they’ve found, and gain the visibility needed to make sure their changes won’t cause any negative or unexpected downstream impact. Developers can learn and decide quickly with live metrics and metadata inspection embedded directly within lineage graphs.

The Confluent documentation goes into much more detail, including examples, tutorials, free cloud credits, etc. Most of the above description is also copied from there.

In recent months, stream processing has gained the particular focus of the OpenLineage community, as described in a dedicated talk at Kafka Summit 2024 in London.

Many useful stream processing features have been completed or started in OpenLineage’s implementation, including:

  • A seamless OpenLineage and Apache Flink integration
  • Support for streaming jobs in data catalogs like Marquez, manta, atlan
  • Progress on a built-in lineage API within the Flink codebase

Here is a screenshot from the live demo of the Kafka Summit talk that shows data lineage across Kafka Topics, Flink applications, and other databases with the reference implementation of OpenLineage (Marquez):

Data Lineage for Stream Processing - OpenLineage Integration with Marquez, Kafka and Flink
Source: OpenLineage (presented at Kafka Summit London 2024)

The OpenLineage Flink integration is in the early stage with limitations, like no support for Flink SQL or Table API yet. But this is an important initiative. Cross-platform lineage enables a holistic overview of data flow and its dependencies within organizations. This must include stream processing (which often runs the most critical workloads in an enterprise).

The Need for Enterprise-Wide Data Governance and Data Lineage

Data Governance, including Data Lineage, is an enterprise-wide challenge. OpenLineage is an excellent approach for an open standard to integrate with various data platforms like data streaming platforms, data lakes, data warehouses, lakehouses, and any other business application.

However, we are still early on this journey. Most companies (have to) build custom solutions today for enterprise-wide governance and lineage of data across various platforms. Short term, most companies leverage purpose-built data governance and lineage features from cloud products like Confluent, Databricks and Snowflake. This makes sense as it creates visibility in the data flows and improves data quality.

Enterprise-wide data governance needs to integrate with all the different data platforms. Today, most companies have built their own solutions – if they have anything at all (most don’t yet)… Dedicated enterprise governance suites like Collibra or Microsoft Purview get adopted more and more to solve these challenges. And software/cloud vendors like Confluent integrate their purpose-built data lineage and governance into these platforms, either just via open APIs or via direct and certified integrations.

Balancing Standardization and Innovation with Open Standards and Cloud Services

OpenLineage is a great community initiative to standardize the integration between data platforms and data governance. Hopefully, vendors will adopt such open standards in the future. Today, it is an early stage and you will probably integrate via open APIs or certified (proprietary) connectors.

Balancing standardization and innovation is always a trade-off: Finding the right balance between standardization and innovation entails simplicity, flexibility, and diligent review processes, with a focus on addressing real-world pain points and fostering community-driven extensions.

If you want to learn more about open standards for data governance, please watch this expert panel for data lineage where Accenture and Confluent welcomed experts from OpenLineage, Collibra, Google, IBM / Manta, IBM / Egeria, Atlan, and Confluent (actually me).

Data Lineage Panel with Confluent IBM Manta Collibra Google OpenLineage Accenture

How do you implement data governance and lineage? Do you already leverage OpenLineage or other standards? Or are you investing in commercial products? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Open Standards for Data Lineage: OpenLineage for Batch AND Streaming appeared first on Kai Waehner.

]]>
Pulsar vs Kafka – Comparison and Myths Explored https://www.kai-waehner.de/blog/2020/06/09/apache-kafka-versus-apache-pulsar-event-streaming-comparison-features-myths-explored/ Tue, 09 Jun 2020 15:32:29 +0000 https://www.kai-waehner.de/?p=2338 Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and…

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.

]]>
Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and non-technical criteria to find the best tool for your business problem.

My discussions are usually around Apache Kafka and its ecosystem as I work for Confluent. The only questions I got about Pulsar in the last years came from Pulsar committers and contributors. They asked me deep technical questions so as to be able to explain where Kafka sucks and why Pulsar is the much better option. Discussions about this topic on platforms like Reddit are typically very opinionated, often inaccurate, and brutal. The following is my point of view based on years of experience with open source streaming platforms.

Tech comparisons are the new black: Kafka vs. Middleware, Event Streaming and API Platforms

Tech comparisons are meant to guide people to choose the right solution and architecture for their business problem. There is no all-rounder, and there should be no bias. Choose the right tool for the problem.

However, technical comparisons are almost always biased. Even if the author does not work for a vendor and is an “independent” consultant, he or she is still likely to have a biased opinion from past experiences and knowledge, whether purposely or unknowingly. Still, comparisons from different perspectives are useful, and we’ve seen Apache Pulsar discussed in a few places on the internet, so I wanted to share my personal views of how Kafka and Pulsar compare. I work for Confluent, the leading experts behind Apache Kafka and its ecosystem, so keep that in mind, but the aim of this post is not to provide opinion, it’s to weigh up facts rather than myths.

Technical comparisons of open source frameworks and commercial software products happen all the time. I did several comparisons in the past on my blog or other platforms like InfoQ, including a Comparison of integration frameworks, Choosing the right ESB for your integration needs, Kafka vs. ETL / ESB / MQ, Kafka vs. Mainframe and Apache Kafka and API Management / API Gateway. All these comparisons were done because customers wanted to understand when to use which tool.

For Pulsar vs. Kafka, the situation is a little bit different.

Why compare Pulsar and Kafka?

Talking to prospects or customers, I rarely get asked about Pulsar. To be fair, this increased slightly in the last months. I guess the question comes up in every ~15th or ~20th meeting due to the overlapping feature set and use cases. However, this seems to be mostly due to a few posts on the internet that claim Pulsar is in some ways better than Kafka. There is no fact-checking and very little material, if any, for the opposing view.

I have not talked to a single organization that seriously considered deploying Pulsar in production, although I know there are a large number of users out there in the world who need a distributed messaging technology like Kafka or Pulsar. But I also think that the claims about Pulsar’s reference users are not particularly accurate.

For example, their flagship user is Tencent, a large Chinese tech company, but Tencent is a huge Kafka user, whereas Pulsar’s use is limited to just one project. Tencent processes ten trillion messages per day (in digits: 10,000,000,000,000) with Kafka. As it turns out, Tencent uses Kafka 1000x more than Pulsar (ten trillion msg/day vs. tens of billions of msg/day). The Tencent team discussed their Kafka deployment in more detail: How Tencent PCG Uses Apache Kafka to Handle 10 Trillion+ Messages Per Day.

Comparison of two competitive open source frameworks

Apache Kafka and Apache Pulsar are two exciting and competing technologies. Therefore, it makes a lot of sense to compare them. Period.

Both Apache Kafka and Apache Pulsar have very similar feature sets. I recommend that you evaluate both frameworks for available features, maturity, market adoption, open source tools and projects, training material, availability of local meetups, videos, blog posts, etc. Reference use cases from your industry or business problems help make the right decision.

Confluent published such a comparison of “Kafka vs. Pulsar vs. RabbitMQ: Performance, Architecture, and Features Compared“. I was involved in creating this comparison. So we have that comparison already…

What is this blog post here about then?

I want to explore the myths from some ‘Kafka vs. Pulsar’ arguments which I see regularly in blog posts and forum discussions. Afterwards, I will give a more comprehensive comparison beyond just technical aspects because most Pulsar discussions focus purely on tech features.

Apache Kafka vs Apache Pulsar Comparison and Myths Explored

Kafka vs Pulsar – Technology myths explored

The following discusses some myths I have come across. I agree with some of them, but also counter some others with hard facts. Of course, different opinions can exist for some of these statements. Again, this is totally fine. The following is my point of view.

Myth 1: “Pulsar has differentiating built-in features compared to Kafka”?

True.

If you compare Apache Kafka to Apache Pulsar, features like its tiered architecture, queuing, and multi-tenancy are mentioned as differentiators.

But:

Kafka has many differentiating features, too:

  • Half as many servers to run
  • Data saved to disk only once
  • Data cached in memory only once
  • Battle-tested replication protocol
  • Zero copy performance
  • Transactions
  • Built-in stream processing
  • Long term storage
  • In the works: ZooKeeper removal (KIP-500), which makes Kafka even simpler to operate and deploy than Pulsar (which has a four-component architecture of Pulsar, ZooKeeper, BookKeeper, and RocksDB), apart from making Kafka more scalable, more resilient, etc.
  • In the works: Tiered Storage (KIP-405), which makes Kafka more elastic and cost-efficient.

Also ask yourself: Should you really compare just the open source frameworks or products and vendors with their complete offering?

It is easy to add new features if you don’t have to provide mission-critical support for it. Don’t just evaluate features in a checklist, but also evaluate how they are battle-tested in production scenarios. How many “differentiating features” are low-quality and implemented quickly vs. high-quality implementations?

For instance: It took a few years to implement and battle-test Kafka Streams as a Kafka-native stream processing engine. Do you really want to compare this to Pulsar Functions? The latter is a feature to add user-defined functions (UDF), without any relation to “real stream processing”. Or is this more like Single Message Transformations (SMT), a core feature of Kafka Connect? Just be sure to a) compare apples to apples (instead of apples to oranges) and b) don’t forget to think about the maturity of a feature. The more powerful and critical, the more mature it should be…

The Kafka community spends a large amount of effort on improving the core project and its ecosystem. Confluent alone has over 200 full time engineers working on the Kafka project, additional community components, commercial products and the SaaS offering on major cloud providers.

Myth 2: “Pulsar has a few very big users like Tencent in China”?

True.

But: Tencent actually uses Kafka more than Pulsar. The billing department, which uses Pulsar, is only a small fraction at Tencent, whereas a large portion of the core business is using Kafka, and they have a Global-Kafka like architecture that combines 1000+ brokers into a single logical cluster.

Always be cautious with open source projects. Check out the success at “normal companies”. Just because a tech giant uses it, does not mean it will work for your company well. How many Fortune 2000 companies shared their success stories around Pulsar in the past?

Look for proof points beyond tech giants!

Proof points beyond the tech giants are helpful to get insights and lessons learned from other people, not from the software vendors. The Kafka website gives many examples of mission-critical deployments. Even more impressive: At the past Kafka Summit conferences in San Francisco, New York and London, every year various enterprises from different industries presented their use cases and success stories, including Fortune 2000 companies, mid-size enterprises and startups.

Just to give you one specific example in the Kafka world: Various different implementations exist for replication of data in real time between separate Kafka clusters, including MirrorMaker 1 (part of the Apache Kafka project), MirrorMaker 2 (part of the Apache Kafka project), Confluent Replicator (built by Confluent and only available as part of Confluent Platform or Confluent Cloud), uReplicator (open sourced by Uber), Mirus (open sourced by Salesforce), Brooklin (open sourced by LinkedIn).

In practice, only two options are reasonable if you don’t want to maintain and improve the code by yourself: MirrorMaker 2 (very new, not mature yet, but a great option mid and long term) and Confluent Replicator (battle-tested in many mission-critical deployments, but not open source). All the other options work, too. But who maintains the projects? Who solves bugs and security issues? Who do you call when you have a problem in production? Deployment in production for mission-critical deployments is different from evaluating and trying out an open source project.

Myth 3: “Pulsar provides message queuing and event streaming in a single solution”?

Partly.

Message queues are used for point-to-point communication. They provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time.

Pulsar has only limited support for message queuing, and limited support for event streaming. If it wants to compete in either area, it still has a long way to go for two reasons:

1) Pulsar has only limited support for message queuing because it misses popular messaging features like message XA transactions, routing, message filtering, etc. that are commonly used with messaging systems like IBM MQ, RabbitMQ, and ActiveMQ. Pulsar’s “adapters” for messaging systems are similarly limited. While they may look nice on paper, they are less useful in practice.

2) Pulsar has only limited support for event streaming. For example, it does not support exactly-once delivery and processing semantics, which disqualifies it for most use cases in practice – you would never implement, say, a payment processing system with Pulsar as it may cause duplicate payments, or lose payments. It also lacks functionality to perform stream processing with features like joins, aggregations, windowing, fault-tolerant state management, and event-time based processing. Pulsar’s “topics” functionality is also different to Kafka’s, and suffers from BookKeeper’s origins, as it was conceived and designed in 2008 as a write ahead log for Hadoop’s HDFS namenode, with only short-lived data storage in mind.

Side note: Pulsar’s “Kafka adapter”, like its messaging siblings, is similarly limited. While it may look nice on paper, it is less useful in practice because it supports only a small subset of Kafka functionality.

Like Pulsar, Kafka has only limited support for message queuing.

In Kafka, different workarounds can be used to realize “real queuing” behavior. If you want to use separate message queues instead of shared Kafka topics for:

  • Security? => Use Kafka’s ACLs (and optional tools like Confluent’s role-based access control aka RBAC).
  • Semantics (i.e. separate applications)? => Use Kafka’s consumer groups.
  • Load balancing? => Use Kafka’s partitions.

I typically ask customers what exactly they want to do with queuing. Often, Kafka provides out-of-the-box solutions for use cases which simply require thinking of the solution in new terms. Also, the number of high throughput use cases that need queuing is relatively small.
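
To illustrate the consumer group workaround from the list above with a minimal sketch: broker address, topic name, and group id are placeholders. Run several instances of this worker with the same group.id and Kafka spreads the topic's partitions across them, which is the queue-like load balancing most teams are actually after:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OrderWorker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.put("group.id", "order-workers");           // all worker instances share this group
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Each partition is consumed by exactly one member of the group,
                        // so work is load-balanced across the running worker instances.
                        System.out.printf("worker got %s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }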

Having explained all these workarounds and limitations of Pulsar and Kafka for messaging, let’s be clear: Neither Kafka nor Pulsar provides a “real messaging solution”.

If you really need a messaging solution, shouldn’t you better choose a “real messaging framework” like RabbitMQ or NATS for a messaging problem anyway?

There is no ‘yes or no’ answer to this. I see many customers replacing existing messaging systems like IBM MQ with Kafka (for scalability and cost reasons). Know the options, their trade-offs, and do an evaluation to solve your problem the best way…

Myth 4: “Pulsar provides stream processing”?

False.

Or to be fair: It depends on your definition of stream processing. Is it only rudimentary features, or full-fledged stream processing?

In one sentence, I typically explain stream processing as continuous consumption, processing, and aggregation of events from different data sources. In real time. At scale. And, of course, in a fault-tolerant manner, including (and especially) for any stateful processing operations.

Event Streaming for Continuous Analysis of Data while it is Hot

Pulsar provides only rudimentary functionality for stream processing, using its Pulsar Functions interface. This is suited for simple callbacks, but it isn’t a true stream processing offering like you get with Kafka Streams or ksqlDB for building streaming applications that include stateful information, sliding windows, and other stream processing concepts. Use cases exist in every industry. For instance, check out the Kafka Streams website for examples from the New York Times, Pinterest, Trivago, Zalando, and others.
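
For comparison, here is a minimal Kafka Streams sketch of the kind of stateful, windowed processing described above. Topic name, window size, and broker address are illustrative placeholders; the per-window counts are kept in a fault-tolerant state store managed by Kafka Streams:

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class PageViewCounter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-counter");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> views = builder.stream("pageviews"); // events keyed by user id
            views.groupByKey()
                 .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))      // 5-minute tumbling windows
                 .count()                                                // stateful, fault-tolerant count
                 .toStream()
                 .foreach((windowedUser, count) ->
                         System.out.println(windowedUser + " -> " + count));

            new KafkaStreams(builder.build(), props).start();
        }
    }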

Streaming analytics examples with Pulsar typically use Pulsar in conjunction with another “proper” stream processing framework like Apache Spark or Apache Flink, which of course means you now need to operate even more pieces of distributed infrastructure and to understand their complex interactions.

Myth 5: “Pulsar provides exactly-once semantics like Kafka”?

False.

Pulsar provides a deduplication feature that ensures that a message will not be stored in the Pulsar broker twice, but nothing prevents a consumer from reading this message multiple times. This is insufficient for any form of stream processing use case where both input and output are from Pulsar.

Also, unlike with Kafka’s Transactions feature, it is not possible to accurately tie committed messages to the state recorded inside a stream processor.

Exactly-Once Semantics (EOS) are available since Kafka 0.11 (released three years ago) and used in many production deployments. Kafka’s EOS supports the whole Kafka ecosystem, including Kafka Connect, Kafka Streams, ksqlDB and clients like Java, C, C++, Go or Python. Kafka Summit had several talks about Kafka’s EOS functionality, including this great intro for everybody, with slides and video recording.
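
As a minimal sketch of Kafka's transactions API, which is the building block behind EOS: the broker address, transactional.id, and topic names are illustrative, and downstream consumers would additionally set isolation.level=read_committed so they never see aborted writes:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TransactionalPaymentForwarder {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");      // assumption: local broker
            props.put("transactional.id", "payment-forwarder-1");  // enables idempotent, transactional writes
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                try {
                    producer.send(new ProducerRecord<>("payments.valid", "tx-1", "{\"amount\": 42.0}"));
                    producer.send(new ProducerRecord<>("payments.audit", "tx-1", "forwarded"));
                    producer.commitTransaction();   // both writes become visible atomically
                } catch (Exception e) {
                    producer.abortTransaction();    // neither write is visible to read_committed consumers
                    throw e;
                }
            }
        }
    }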

Myth 6: “Pulsar’s performance is much better than Kafka’s”?

False.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (no matter if a vendor, independent consultant, or researcher conducts them).

For example, there is one benchmark published by GIGAOM, which compares the latency and performance of Kafka versus Pulsar. But this benchmark  deliberately slowed Kafka down by forcing it to synchronize-to-disk on every single message by setting the Kafka config ‘flush.messages = 1’ (this makes every request cause an fsync). The benchmark also forces the Kafka Consumer to acknowledge synchronously while the Pulsar consumer acknowledges asynchronously. Unsurprisingly, this benchmark setup makes Pulsar the seemingly clear “winner”. But this benchmark does not mention or explain this significant configuration difference in the setup and measurements. This is what some people call apples-to-oranges comparison.

Pulsar’s architecture actually requires higher network utilization (due to the Pulsar broker tier which acts as a proxy in front of BookKeeper bookies) as well as twice the I/O (as BookKeeper writes data to a write ahead log as well as to the main segment).

Confluent did some benchmarks, too. More an apple-to-apple comparison. Not surprisingly, the results were different. But should you really care about these benchmark fights from software vendors?

Think about your performance requirements. Do a proof of concept (POC) with Kafka and Pulsar, if you must. I bet that in 99% of scenarios, both will show acceptable performance for your use case. Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics anyway, and typically performance is just one of many evaluation dimensions.

Myth 7: “Pulsar is easier to operate than Kafka”?

False.

Both Kafka and Pulsar are hard to operate if you don’t use additional tooling.

Kafka includes two distributed systems: Kafka itself and Apache ZooKeeper.

But: Pulsar includes three distributed systems and an additional storage technology: Pulsar, ZooKeeper, and Apache BookKeeper. Like Pulsar, BookKeeper uses ZooKeeper, too.  And lastly, RocksDB is used for certain  storage tasks. This means that Pulsar has a significantly higher complexity to understand, tweak, and tune than Kafka. Additionally, Pulsar also has more configuration parameters than Kafka.

Kafka is firmly going in the opposite direction and is removing ZooKeeper (see KIP-500) so that you have just one distributed system to deploy, operate, scale and monitor:

Apache Kafka ZooKeeper Removal KIP 500

ZooKeeper is Kafka’s biggest scalability bottleneck and comes with operational challenges — This is true for Kafka but even more so for Pulsar!

One of the key issues of my customers is how to run ZooKeeper in mission-critical deployments at scale. Therefore I am really looking forward to Kafka’s simplified architecture, where you will deploy Kafka brokers only. This also establishes a unified security model, as ZooKeeper’s security no longer needs to be separately configured. This is a huge benefit, especially for larger organizations and regulated industries. Compliance and information security departments will thank you for this simplified architecture.

Operations is NOT just about Architecture!

Kafka is significantly better documented, has a tremendously larger community of experts, and a vast array of supporting tooling that make operations easier.

Additionally, there are many options for local and online Kafka training, including online courses, books, meetups, and conferences. You won’t find much for Pulsar, unfortunately.

Myth 8: “An architecture with three tiers is better than two tiers”?

It depends.

Personally, I am skeptical that Pulsar’s three tier architecture (using Pulsar brokers, ZooKeeper and BookKeeper) is an advantage for most projects. It is a trade-off!

Twitter described their move away from BookKeeper + DistributedLog (the latter a system very similar to Pulsar, with comparable architecture and design) just over a year ago, citing the advantages of Kafka’s single-tier architecture, such as cost efficiency and better performance, over a two-tier architecture that decouples storage and serving.

Like Pulsar, DistributedLog is built on top of BookKeeper and adds streaming-like functionality with an architecture and concepts similar to Pulsar (e.g., using decoupled storage and serving tiers). DistributedLog was originally a standalone project but eventually became a sub-project of BookKeeper, though nowadays it appears to be no longer actively developed (only a few commits in the past 12 months). The main reasons Twitter cited for switching to Kafka were (1) significant cost savings and performance gains and (2) Kafka’s huge community and adoption. For example, they concluded: “For single consumer use cases, we saw a 68% resource savings, and for fanout cases with multiple consumers, we saw a 75% resource savings.”

There are benefits from a three-tier architecture to build a scalable infrastructure. But the extra layer also increases network utilization by (at least) 33%, and data held in Pulsar’s brokers must additionally be cached in both layers for equivalent performance, and also written to disk twice because the storage format of BookKeeper is not based on a log.

On the cloud, where most Kafka deployments are being run, the best backing storage tier is in fact not a niche technology like BookKeeper, but a widely used and battle-tested object store like AWS S3 or GCP GCS.

Tiered Storage in Confluent Platform, which is backed by the likes of AWS S3 and GCP GCS, provides the same benefits without Pulsar’s extra layer of BookKeeper and the resulting extra network transfer cost and latency that this architecture incurs. It took Confluent two years to build and make Tiered Storage for Kafka generally available, including global 24/7 support for your most mission-critical data. Tiered Storage is not available yet for open source Apache Kafka, but Confluent is working with the rest of the Kafka community (including some major tech companies like Uber) on KIP-405 to add Tiered Storage to Kafka with different storage options.

There are always pros and cons for both architectures. Personally, I think that 95% of projects do not need a complex three-tier architecture. And where they make sense, it is to add the advantages of external, price-efficient storage. You should care about 24/7 service level agreements (SLA), scalability, and throughput. Plus integration into your ecosystem as well as security, management tooling, and support. If your requirements require a three-tier architecture, then of course give it a go!

Sub-Myth: “Pulsar is better for lagging consumers because of its caching layer and storage layer”?

False.

The main problem with lagging consumers is that they exhaust the page cache, i.e., recent messages are already cached. Reads from older segments replace these, reducing the performance of consumers reading from the head of the log.

Pulsar’s architecture is actually worse in this regard. It retains the same issue around cache-flushing, but now the reads must do an extra network hop and extra I/O rather than just reading from the local media.

Myth 9: “Kafka does not scale as well as Pulsar”?

False.

This is one of the key arguments by the Pulsar community. As I said before, this always depends on the chosen benchmark. For example, I have seen tests with equivalent computing resources where Kafka did significantly better at high throughputs than Pulsar. Here is a “Pulsar vs. Kafka benchmark” where Kafka is much faster than Pulsar:

Pulsar vs Kafka Benchmark

Scalability is not a problem for most use cases. You can easily scale up Kafka to process several gigabytes per second, as you can see in a demo to “Scale Apache Kafka to 10+ GB Per Second in Confluent Cloud“:

Scalability 10 GB per second with Confluent Cloud

Honestly speaking, less than 1% of users should be worried about this discussion at all. If you have requirements like Netflix (processing Petabytes per day) or LinkedIn (processing trillions of messages), let’s talk about and discuss the best architecture, hardware, and configuration for such a deployment. For anybody else, don’t be worried.

Sub-Myth: “Kafka’s current approach means it can only store ~ 500K partitions per cluster”?

True.

Kafka today does not yet have the best architecture for large-scale deployments with hundreds of thousands of Kafka Topics and Partitions.

But: Pulsar, too, does not allow for unlimited scale. It just has different limits.

Kafka’s partition limit is imposed by Zookeeper. Removing Zookeeper from Kafka through the work in KIP-500 removes this upper bound.

As a side note:

The right design of your architecture is critical for success!

Most of the customers I have seen in trouble with Kafka partition counts and scalability are because they designed their architecture and applications in the wrong way (they’d run into the same issues if they were using Pulsar)!

Kafka is an event streaming platform, and not the next IBM MQ. If you try to recreate your favorite MQ solution and architecture with Kafka, you will likely fail. I have seen several customers failing here and then succeeding by re-architecting their setup with our help.

Chances are very high that you will not have any issues with partition numbers and scalability, even today with Kafka’s usage of ZooKeeper, if you design your use case right and understand Kafka’s basic concepts. This experience of customers is a common theme for any technology, like Kafka, that introduces a new technology level and paradigm well beyond what was done before (a prime example is the adoption hurdles faced by companies when they first began to move their use cases to the cloud).

Sub-Myth: “Pulsar supports a practically infinite number of partitions”?

False.

BookKeeper has the same 1-file-per-ledger limitation Kafka has, but there are multiple ledgers in one partition. Pulsar’s broker layer groups partitions into bundles, but its storage layer, BookKeeper, stores data in segments, with many segments for each partition.

As with Kafka, the metadata for these segments is stored in ZooKeeper, which imposes a limit on the total number that can be stored. Kafka is removing this dependency, thus allowing it to scale significantly further. I am really looking forward to seeing KIP-500 implemented by ~ the end of 2020. “Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency” walks you through the implementation details and planned timelines.

Sub-Myth: “Kafka scaling needs to be defined when creating a Kafka Topic”?

Partly true.

If more scalability is needed, Kafka topics can either be over-partitioned (i.e., you configure a topic with more partitions than you initially need for a use case; see Streams and Tables in Apache Kafka: Topics, Partitions, and Storage Fundamentals), or they can be re-configured to use more partitions if there are requirements to scale in the future. This is not perfect, but a consequence of how distributed event streaming works (and why it scales much better than traditional messaging systems like IBM MQ).

Best practices for creating topics and procedures for changing topic configurations during production are available. So no worries!
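
A minimal sketch of over-partitioning with Kafka's AdminClient, where the topic name, partition count, and replication factor are illustrative placeholders; the point is simply to provision more partitions than today's consumer count requires so the consumer group can grow later:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicProvisioner {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 30 partitions even if 6 consumers are enough today: consumers can be added
                // later without repartitioning and without losing per-key message order.
                NewTopic orders = new NewTopic("orders", 30, (short) 3);
                admin.createTopics(List.of(orders)).all().get();
            }
        }
    }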

But: Pulsar topics have this restriction, too!

Write throughput is based on the number of partitions allocated in a Pulsar topic in the exact same way it is in a Kafka topic, so Pulsar topics must be over-provisioned for exactly the same reasons. That’s because, for each partition, only a single ledger (of the partition’s potentially many ledgers) is writable at the same time. Also, increasing the number of partitions dynamically impacts message ordering just like it does in Kafka (i.e. the message order is lost).

Both Kafka and Pulsar scale like crazy. This is sufficient for almost all use cases!

If you need even more extreme scale, I think a ZooKeeper-free implementation is the best choice. KIP-500 is thus the most anticipated Kafka change I see in the community and in Confluent’s customer base.

Myth 10: “Pulsar recovers from machine failure instantly but Kafka has to reload data”?

True and false.

Killing a Pulsar broker is indeed seamless, but (in contrast to a Kafka broker) the Pulsar broker doesn’t store any data but is only a proxy fronting the actual storage layer, which is BookKeeper. So highlighting that a Pulsar broker failure can easily be resolved is a marketing distraction, because actually one must talk about what happens when a BookKeeper node (a “bookie”) fails.

Killing and restarting a BookKeeper bookie requires the same redistribution of data seen in Kafka’s case. This is the nature of distributed systems, with concepts like replication and partitions.

Elastic Kafka is here already!

Elasticity is important. Confluent’s founder Jay Kreps has recently blogged about this topic: Elastic Apache Kafka Clusters in Confluent Cloud. In a SaaS cloud service like Confluent Cloud, the end user shouldn’t have to care at all about machine failure. 24/7 uptime is expected and should be guaranteed with 99.xx SLAs. Consumption-based pricing (i.e., pay as you go) means you do not have to worry about issues like broker management, sizing broker nodes, expanding or shrinking clusters, etc. under the hood at all.

Self-managed Kafka clusters also need similar capabilities. Tiered Storage for Kafka is huge because most of the data is not stored on the broker anymore to allow almost instant recovery from failures. In conjunction with tools like Self-Balancing Kafka (a Confluent feature coming in Q3 and discussed in the above link blog post), users don’t have to worry about elasticity in their self-managed clusters at all.

Unfortunately, if you are looking for such a modern offering for Pulsar, there is none available.

Myth 11: “Pulsar has better Inter-Cluster (Geo) Replication than Kafka”?

False.

Every distributed system has to solve problems like the CAP theorem and quorum in distributed computing. The quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.

Kafka requires ZooKeeper to solve the quorum problem. Even after KIP-500 and ZooKeeper removal, the universal laws of real-world physics are still the same: There are latency issues deploying a distributed system over regions like the US East, Central and West or even globally. That’s because the speed of light, though very high, does have a limit.

Various deployment options exist to work around this problem, including real time replication tools like Apache Kafka’s MirrorMaker 2, Confluent’s Replicator or Confluent’s Multi-Region-Clusters. Check out “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” for various different deployment options and best practices:

Global Event Streaming and Replication between Cloud and Data Center

There is no single pattern or implementation to provide global replication AND zero downtime + zero data loss! For the most critical applications, Confluent’s Multi-Region-Clusters allows RTO=0 and RPO=0 (i.e. zero downtime and zero data loss) with automatic disaster recovery and client fail-over even if a complete data center or cloud region goes down.
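
As a minimal sketch of the MirrorMaker 2 option mentioned above: cluster aliases, bootstrap servers, and the topic pattern are placeholders, and a properties file along these lines is passed to Kafka's connect-mirror-maker script to replicate topics from one cluster to another:

    # connect-mirror-maker.properties (sketch; aliases and hosts are placeholders)
    clusters = primary, backup
    primary.bootstrap.servers = kafka-us-east:9092
    backup.bootstrap.servers  = kafka-us-west:9092

    # replicate everything from 'primary' to 'backup'
    primary->backup.enabled = true
    primary->backup.topics  = .*

    # do not replicate in the other direction
    backup->primary.enabled = false

Active-active setups, offset translation, and failover procedures need more thought than this snippet suggests; that is exactly what the deployment patterns referenced above discuss.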

Here, Pulsar’s architecture requires even more complexity than a “basic” Pulsar deployment. That’s because, for geo-replication, Pulsar requires an additional “global” ZooKeeper cluster, which makes Pulsar inappropriate for geo-distribution over large distances. There is a workaround, but the problems around the CAP theorem and physics do not go away.

No matter if you use Kafka or Pulsar, you need a battle-tested design to fight the laws of physics in your global deployments!

Myth 12: “Pulsar is compatible with Kafka’s interface and API”?

Partially True.

Pulsar provides a very basic converter that is compatible with only minor parts of the Kafka v2.0 protocol.

So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn’t seriously consider it for migrating a running Kafka infrastructure to Pulsar. I doubt anyone will take the risk…

We have seen “Kafka compatibility” claims in other examples such as the much more mature Azure Event Hubs service. Check out the limiting factors of their Kafka API, and be surprised! No support for core Kafka features like transactions (and thus exactly-once semantics), compression, or log compaction.

As it is not Kafka under the hood, also expect further divergence and unexpected behavior when you connect your existing Kafka applications to such a “compatible” setup. This applies no matter if it is Azure Event Hubs, Pulsar, or any other wrapper.

Kafka vs. Pulsar – Comprehensive Comparison

The previous sections explored various technology myths found in many other blog posts. I hope I brought some clarity to these discussions.

Now, let’s not forget to take a look beyond the technical details of Kafka and Pulsar. Non-functional aspects are as important when choosing a technology.

I will cover three critical aspects in the following: Market traction, enterprise support and cloud offerings.

Market Traction of Apache Kafka and Apache Pulsar

Taking a look at Google Trends for the last five years confirms my personal experience: interest in Apache Pulsar is very limited compared to Apache Kafka:

Apache Kafka versus Apache Pulsar Comparison and Trends

The picture looks very similar when you take a look at Stack Overflow and similar platforms, at the number and size of supporting vendors, at the open ecosystem (tool integrations, wrapper frameworks like Spring Kafka), and at similar indicators of technology trends.

Job openings are another very good indicator of technology adoption. Few job openings for Pulsar means few companies are using it. Search in your favorite job search engine: if you search globally, you will find fewer than 100 job openings for Pulsar, but thousands of jobs for Kafka. Additionally, most of the listings mentioning Pulsar say something like “looking for experience with Kafka, Pulsar, Kinesis or similar technologies”.

In most cases, these characteristics are much more relevant for the success of your next project than the subtle technical differences. The key goal is to solve your business problem, isn’t it?

So with the lack of adoption, why is Pulsar coming up in conversations at all? One reason is that independent consulting companies, research analysts, and bloggers (including me) need to talk about new cutting-edge technologies to keep their audience interested… And to be honest, it makes a good story.

Enterprise Support for Kafka and Pulsar

There is enterprise support for Kafka and Pulsar!

However, the situation is not what you might expect. Here are the vendors you can call and ask for a meeting to discuss the potential next steps for working together on your Pulsar journey:

  • Streamlio (now acquired by Splunk), the former company behind Apache Pulsar. Splunk has not yet announced a Pulsar strategy for supporting people working on their own Pulsar-based projects. Splunk is well-known for its widely-adopted analytics platform. That’s their core business (~ $1.8B in 2019). The only thing people complain about with Splunk is the pricing. Splunk is a heavy Kafka user under the hood and now incorporates Pulsar into their Splunk Data Stream Processor (DSP). It is very doubtful that Splunk will jump on the open source bandwagon to support your next standalone Pulsar project (but a broader-scope DSP might be coming, of course). The future will show us…
  • StreamNative, founded by one of the original developers of Apache Pulsar, provides an event streaming platform based on Pulsar. At the time of writing this in June 2020, StreamNative has 13 (!) employees on LinkedIn. I am not sure if this is the right scale to support your next mission-critical deployment in 2020 but they do offer it.
  • TIBCO announced support for Pulsar in December 2019. Their core strategy has moved from integration to analytics over the last few years. TIBCO’s middleware customers are migrating away in high numbers. Their middleware team had to make some desperate strategic decisions: supporting other platforms despite having zero contributions to, or experience with, the projects. You are right, this might be a myth. But a fact is that TIBCO does the same for Kafka. And here is a nice piece of trivia: TIBCO provides Kafka and ZooKeeper to you on Windows! Something nobody else does – because others know that this is not stable and creates inconsistencies all the time. But hey, TIBCO can support you now with Kafka and Pulsar. Why evaluate these two frameworks if one single vendor allows you to use both? Even on Windows; with .exe download and .bat scripts for starting the server components:

TIBCO Kafka Support on Windows

The number of vendors supporting Kafka grows every quarter!

In the meantime, Kafka has achieved incredibly broad market adoption. The best proof of this is that the biggest software vendors provide support and tools around it: IBM, Oracle, Amazon, Microsoft and many other software companies support Kafka and build integration capabilities and their own products around it.

The latest “wake-up call” for me was at Oracle OpenWorld 2019 in San Francisco, where I attended a roadmap session from the Oracle product manager for GoldenGate (Oracle’s well-known, great, but also very expensive CDC tool). Most of the talk focused on opening up GoldenGate to make it the data integration platform for everything. Half the talk was about event streaming, Kafka, and how GoldenGate will provide integration with different databases / data lakes and Kafka in both directions.

Fully-Managed Cloud Offerings for Kafka and Pulsar

Let’s take a look at the cloud offerings available for Kafka and Pulsar.

There is a cloud service available for Apache Pulsar. It has a very innovative name:

Kafkaesque.

No kidding. Check the link… [Update: On ~June 17th, they rebranded the service: KAFKAESQUE is now KESQUE – probably they realized how embarrassing the name was.]

Maybe you also check out the various cloud offerings for Apache Kafka to find out which offering fits you better:

  • Confluent Cloud (SaaS) is a fully-managed service providing consumption-based pricing, 24/7 SLAs and elastic, serverless characteristics for Apache Kafka and its ecosystem (e.g. Schema Registry, Kafka Connect connectors and ksqlDB for stream processing).
  • Amazon MSK (PaaS) provisions ZooKeeper and Kafka brokers so that the end user can operate them, fix bugs, do rolling upgrades, etc. One important fact everybody should be aware of: AWS excludes Kafka issues from its 99.95% SLA and support!
  • Azure Event Hubs (SaaS) provides a Kafka endpoint (with a proprietary implementation under the hood) to interact with Kafka applications (see the configuration sketch after this list). It is very scalable and performant. As it is not really Kafka, but just an emulation, it misses several core features of Kafka like exactly-once semantics, log compaction, and compression. Not to mention the surrounding capabilities like Kafka Connect and Kafka Streams.
  • Big Blue (IBM) and Big Red (Oracle) have cloud offerings around Kafka and its APIs. I have no idea if anyone is using them or how good they are. I have never seen them in the wild myself.
  • Plenty of smaller players like Aiven, CloudKarafka, Instaclustr, and others.
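
To illustrate what the “Kafka endpoint” approach of Azure Event Hubs (mentioned in the list above) means in practice, here is a minimal sketch of a producer configuration. The namespace and connection string are placeholders, and the exact settings should be verified against the vendor documentation; the point is that only the configuration changes, not the application code:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventHubsKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Event Hubs exposes its Kafka endpoint on port 9093 of the namespace (placeholder name).
        props.put("bootstrap.servers", "my-namespace.servicebus.windows.net:9093");
        // Authentication uses SASL PLAIN with the Event Hubs connection string as the password.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"$ConnectionString\" password=\"<your-connection-string>\";");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The application code stays plain Kafka; only the configuration points to Event Hubs.
            producer.send(new ProducerRecord<>("orders", "order-4711", "{\"amount\": 42}"));
        }
    }
}
```

Keep in mind the limitations discussed above: even though the client code stays plain Kafka, features like transactions, compression, or log compaction may not behave the same behind such an emulated endpoint.
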

As you can see, the current cloud offerings show relatively clearly what the market adoption of Kafka and Pulsar looks like.

Conclusion – Apache Kafka or Apache Pulsar?

TL;DR: Pulsar is still a long way from Kafka’s level of maturity in terms of being proven for high scale use cases and building a community.

You should also question whether Pulsar is actually better.

Evaluate Kafka and Pulsar if you are going the purely open source way. In your evaluation, include the technical feature set, maturity, vendors, developer community, and other relevant factors. Which one fits your situation best?

If you need an enterprise solution that covers much more than what both of these two open source systems offer, Kafka is the only option: Choose a Kafka-based offering from one of the various vendors or a suitable cloud offering. Pulsar, unfortunately, is not ready for this today and the foreseeable future.

How do you think about Apache Kafka vs. Apache Pulsar? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.

]]>
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies? https://www.kai-waehner.de/blog/2020/05/25/api-management-gateway-apache-kafka-comparison-mulesoft-kong-apigee/ Mon, 25 May 2020 10:42:10 +0000 https://www.kai-waehner.de/?p=2308 Event Streaming with Apache Kafka and API Management / API Gateway solutions (Apigee, Mulesoft Anypoint, Kong, TIBCO Mashery,…

The post Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies? appeared first on Kai Waehner.

]]>
Event Streaming with Apache Kafka and API Management / API Gateway solutions (Apigee, Mulesoft Anypoint, Kong, TIBCO Mashery, etc.) are complementary, not competitive! Read this blog post to understand the relation between these two components in your enterprise architecture.

API Management has been relevant for many years already. I talked about “A New Front for SOA: Open API and API Management as Game Changer” in 2014 when SOAP Web Services and Service-Oriented Architectures (SOA) were cutting-edge technologies and concepts. Exposing APIs and monetization were still in their infancy at that time. EDI / EDIFACT and similar complex technologies were used for B2B communication. B2C communication was just starting with smartphones and mobile apps. Internal billing was done with estimates and Excel sheets instead of automated and accurate information systems.

Let’s start this blog post with an overview of the current market situation. Use cases and the relation between event streaming with Apache Kafka and API Management with tools like Mulesoft Anypoint Platform are discussed afterwards. The last part of the post explores the future of API Management for streaming technologies (and how you can even solve this use case today already).

Market Situation – One Middleware Tool to Solve All your Problems?

Microservices became the new black in enterprise architectures. APIs provide functions to other applications or end users. Even if your architecture uses a pattern other than microservices, like SOA (Service-Oriented Architecture) or client-server communication, APIs are used between the different applications and end users.

Apache Kafka plays a key role in modern architectures to build open, scalable, flexible and decoupled real time applications. API Management complements Kafka by providing a way to implement and govern the full life cycle of the APIs. This blog post explores how event streaming with Apache Kafka and API Management (including API Gateway and Service Mesh technologies) complement each other, and why they are still not always a perfect match.

In the middleware market, every software vendor is the best one and puts itself into the middle of the enterprise architecture; at least if you trust marketing graphics. No matter which vendor’s website you visit, you will see something similar to this:

Middleware API Management

 

Middleware, Event Streaming and API Management Vendors

Here are some examples of global middleware vendors providing software to glue together applications and to provide APIs:

  • Universal Players offer various products. Vendors like Red Hat / IBM, Oracle, Software AG, TIBCO even offer different overlapping and competing solutions. For instance, IBM has 10+ products for integration middleware (not included are the rebranded product names).
  • Cloud Providers like AWS, GCP, Azure and Alibaba provide a vast number of services for gluing together applications and services.
  • Some companies focus just on Messaging, for instance Solace or Synadia (the company behind nats.io).
  • Event Streaming Platforms like Confluent or Streamlio (the company behind Pulsar; acquired by Splunk recently) are relatively new on the market (compared to the above categories), but are getting more and more traction these days.
  • API Management solutions like Mulesoft, Apigee or Kong focus on the creation, life cycle management and monetization of APIs.
  • New startups focus on specific niches or cutting edge technologies, like solo.io providing an API Gateway on top of Envoy Proxy Service Mesh.

MQ, ETL, ESB, Kafka, API Management – When to use which Tool(s)?

Obviously this market situation creates an important question: When to use which tool(s)? How do they overlap with each other? When are they complementary?

I covered the discussion about traditional middleware and Kafka already in detail. Check out “Event streaming with Apache Kafka vs. traditional middleware using MQ, ETL, ESB“.

It is also relatively easy to explain the relation between traditional middleware and API Management: Build a SOAP or REST based application (aka web service) and put an API Gateway or API Management tool in front of it to manage its lifecycle and monetize it.

Important pointer here: Some platforms like Mulesoft provide an ESB and API Management. You can use just one of them, or both together. Just make sure to compare the right things (to Kafka). For Mule ESB (vs. Kafka), check out the above link. For Mulesoft’s Anypoint Platform for API Management (vs. Kafka), read the below content… Read both if you need integration middleware and API Management.

How do Apache Kafka and API Management relate to each other? This question is harder to answer because both solve very different problems based on different technologies. Let’s discuss this topic in more detail in the following.

Use Cases for Event Streaming and Apache Kafka

First of all, it is very important to understand what ‘Event Streaming’ is and why this is different from the “traditional API approach” providing REST or SOAP web services.

Apache Kafka is used in all Industries and Verticals

Some use cases can also be done with other technologies, but the implementation is easier and the architecture simpler with Kafka. That is true for integration layers and microservice architectures – and all the use cases around them like real time monitoring or customer 360.

Some other use cases cannot be done easily with other technologies because others don’t provide the combination of messaging + storage + processing in one single platform in a scalable, reliable and fault tolerant way – which is e.g. required to build a connected car infrastructure or sensor processing and analytics at scale in real time.

In the early era of Apache Kafka, many companies just used it for data ingestion into Hadoop or another data lake. The significant difference today – and this is what I would define as innovative – is that companies use Apache Kafka as an Event Streaming Platform to build mission-critical infrastructures and core operations platforms.

To be fair, Kafka is not the best solution for every problem. If you need point-to-point messaging, use something like RabbitMQ or IBM MQ. If you need to transfer large files, evaluate the market for MFT (Managed File Transfer) products. And… If you need to manage and monetize APIs, then evaluate API Management solutions.

Kafka’s Ecosystem to Build Mission-Critical and Scalable Platforms and Real Time Applications

Apache Kafka is more than just data ingestion or messaging. Apache Kafka (which includes Kafka Connect and Kafka Streams) and its open ecosystem (Schema Registry, ksqlDB, etc.) established a complete event streaming platform for many innovative use cases.

Here are some examples:

Event Streaming Platform with Apache Kafka - Value per Use Case

An interesting trend can be seen here: more and more Kafka deployments are mission-critical, focusing on business transactions. These deployments cannot be down for an hour because the company behind them would be in huge trouble.

Many more use cases from companies in almost all existing verticals and industries can be found at the Kafka Summit website. Videos and slides from all past talks are available for free. This includes success stories from tech giants, traditional companies and cutting edge startups.

Why Event Streaming with Apache Kafka?

Kafka has a few unique characteristics:

  • Combination of messaging, storage, integration and processing of data
  • Event-based architecture for real time processing, supporting modern design patterns like Event Sourcing and CQRS
  • Built for high availability, high throughput and cloud-native DevOps and CI/CD integration
  • Open source with a huge community and ecosystem

For these and other reasons, Kafka became the de facto standard for microservice architectures and many other application infrastructures. Many of these use cases cannot be built with traditional middleware due to limitations in scalability, inflexible architectures, or simply the high cost of building a highly available deployment.

So, what is the relation between event streaming with Kafka and API Management? Let’s explore this in the next section.

What is an API and its Relation to Event Streaming?

Event Streaming is changing from the ground up how applications are built. More scalable, more reliable, decoupled, real time. In many new innovative use cases, there is no way around using event streaming instead of web services and traditional APIs.

This brings up several questions. Why do we still need to create and manage APIs? Does it make sense to put an API on top of streaming data? What technology and interface should this API use?

Let’s cover the basics first…

API (Application Programming Interface)

An API (application programming interface) is a computing interface which defines interactions between multiple software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc.

API Technologies

From a technical perspective, most people and products mean REST (HTTP) or SOAP (XML) web services when talking about APIs. Most API Gateway and API Management tools only support these technologies.

These two technologies have been established in most companies for many years and are very mature. Some people prefer the one, some the other. Some people don’t like either one but have to use them because REST and SOAP web services are the de facto standard in enterprises today.

In fact, many other API technologies are available. Many of these other APIs do not use synchronous request-response patterns, but asynchronous communication.

Examples: WebSocket, MQTT, Server-Sent Events (SSE), or the Kafka protocol (the underlying wire protocol implemented in Kafka). So why are more and more technologies emerging?

Synchronous Request-Response vs. Asynchronous Event Streaming

Two very different communication paradigms exist: Request-response and event streaming.

Request-Response communication has the following characteristics:

  • Low latency
  • Typically synchronous
  • Point-to-point
  • “Bespoke API”
  • e.g. HTTP, SOAP, gRPC

Event streams are based on these concepts:

  • Messaging / Pub Sub (sending data from A to B and C)
  • Continuous data processing (filtering, transformations, aggregations, business logic)
  • Often asynchronous
  • Event-driven, supporting patterns like Event Sourcing and CQRS
  • General-purpose events
  • e.g. Apache Kafka

Both approaches have their trade-offs. Most architectures need request-response and event streams! Read the great article from Gregor Hohpe (author of the famous Enterprise Integration Patterns) from 2004: “Starbucks Does Not Use Two-Phase Commit“. This article explains very well why both synchronous and asynchronous communication make sense (together).

REST and SOAP web services typically use synchronous communication. This is not the full story; you could, for example, also use JMS-based SOAP communication, but the reality in most cases is synchronous request-response. Event streaming is asynchronous, but you can implement request-reply patterns, too.
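
To make the event streaming side concrete, here is a minimal sketch of an asynchronous Kafka producer in Java (broker address, topic name, and payload are made up). Instead of blocking on a response like a typical REST call, the producer publishes the event and continues; a callback is invoked once the event is durably written:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> event =
                new ProducerRecord<>("payments", "customer-42", "{\"amount\": 99.90}");

            // Asynchronous send: the callback is invoked later; the caller does not block
            // waiting for any specific consumer to process the event.
            producer.send(event, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Event stored in %s-%d at offset %d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```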

Event Streaming instead of REST / SOAP Web Services

So what are the most important reasons why event streaming with technologies like Apache Kafka is often used for new projects instead of REST / SOAP web services?

REST / SOAP web services do not provide characteristics to build a scalable, reliable real time infrastructure for a high throughput of events. Period!

The other big advantage of Kafka is that it decouples microservices from each other. Kafka’s storage and the asynchronous (i.e. decoupled) communication keep every microservice independent of the others. Microservice A does not need to know Microservice B, but they can still communicate with each other, even if one of them is down while the other one is producing data. There can still be a contract (a term used a lot in API Management) between the producers and consumers, for instance using the Confluent Schema Registry.
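
A minimal sketch of the consuming side illustrates this decoupling (topic and group names are made up). The downstream service only knows the topic and the agreed event format; it does not know the producer, and if it was offline for a while it simply catches up from the stored log:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FraudCheckService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each microservice uses its own consumer group and tracks its own offsets.
        props.put("group.id", "fraud-check");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The only coupling to the producer is the topic and the event format (the contract).
                    System.out.printf("Checking payment %s: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```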

One thing to point out here is that most API Management solutions and API Gateways today don’t support event streams but only web service APIs, unfortunately.

But let’s go one step back first and understand what API Management actually is.

What is API Management?

API management is the process of creating and publishing web application programming interfaces (APIs), enforcing their usage policies, controlling access, nurturing the subscriber community, collecting and analyzing usage statistics, and reporting on performance. API Management components provide mechanisms and tools to support the developer and subscriber community.

Gartner’s Magic Quadrant 2019 for Full Life Cycle API Management shows the various vendors in this market:

Gartner Magic Quadrant 2020 for Full Life Cycle API Management

Use Cases for API Management

API Management can be used for different scenarios:

  • Open API: Developer portal and API Gateway
  • Partner Gateway: Access control for well-known external parties
  • Mobile App Gateway: Access control for apps deployed externally
  • Cloud Integration Gateway: Governance and mediation control for SaaS
  • Internal Governance: Manage, monetize and bill internal services and applications

Various API business models are possible, as John Musser explained very well back in 2013:

API Business Models

What has changed since 2013? Not that much! The main idea is the same: APIs are provided for the public, external partners or internal teams. However, technically speaking, more and more of these interfaces need to use a technology for streaming data at scale in real time. REST APIs are not ideal, or sometimes not even possible at all, due to their limitations regarding scalability.

No matter if the API Management solution supports just REST / SOAP web services or modern streaming technologies, the API development workflow looks like this:

API Development Workflow

While API Management solutions vary, components that provide the following functionalities are typically found in products:

API Gateway

A server that acts as an API front-end, receives API requests, enforces throttling and security policies, passes requests to the back-end service and then passes the response back to the requester. A gateway often includes a transformation engine to orchestrate and modify the requests and responses on the fly. A gateway can also provide functionality such as collecting analytics data and providing caching. The gateway can provide functionality to support authentication, authorization, security, audit and regulatory compliance.

API Life Cycle Management and Publishing Tools

A collection of tools that API providers use to define APIs, for instance using the OpenAPI or RAML specifications, generate API documentation, manage access and usage policies for APIs, test and debug the execution of APIs, including security testing and automated generation of tests and test suites, deploy APIs into production, staging, and quality assurance environments, and coordinate the overall API lifecycle.

Developer Portal / API Store

Community site, typically branded by an API provider, that encapsulates in a single convenient source the information and functionality API users need, including documentation, tutorials, sample code, software development kits, an interactive API console and sandbox to trial APIs, the ability to subscribe to the APIs and manage subscription keys such as OAuth2 Client ID and Client Secret, and the ability to obtain support from the API provider and the user community.

Reporting and Analytics

Functionality to monitor API usage and load (overall hits, completed transactions, number of data objects returned, amount of compute time and other internal resources consumed, volume of data transferred). This can include real-time monitoring of the API with alerts being raised directly or via a higher-level network management system, for instance, if the load on an API has become too great, as well as functionality to analyze historical data, such as transaction logs, to detect usage trends. Functionality can also be provided to create synthetic transactions that can be used to test the performance and behavior of API endpoints. The information gathered by the reporting and analytics functionality can be used by the API provider to optimize the API offering within an organization’s overall continuous improvement process and for defining software Service-Level Agreements for APIs.

Monetization and Billing

Functionality to support charging for access to commercial APIs. This functionality can include support for setting up pricing rules, based on usage, load and functionality, issuing invoices and collecting payments including multiple types of credit card payments.

As you can see: An API Management solution has some exciting features to build and operate APIs! So what is the relation to Kafka? As discussed earlier, many innovative use cases require a scalable, reliable event streaming platform. That’s what Kafka is.

Kafka and API Management – Friends, Enemies or Frenemies?

To be very clear:

  • Apache Kafka does not provide out-of-the-box capabilities of an API Management solution.
  • API Management solutions do not provide event streaming capabilities to continuously send, process, store and handle millions of events in real time (aka stream processing / streaming analytics).

Therefore, the combination of Kafka and an API Management solution makes a lot of sense in many scenarios. It is NOT a competitive situation (like many people think – or are “taught” by some vendors).

Unique API Management Features

Some of the unique features of API Management products are:

  • API Developer Portal and Publishing Tools
  • API Life Cycle Management
  • Billing and Monetization

These components can be provided as standalone services or products (e.g. from a cloud provider) or within a complete platform (like Mulesoft Anypoint Platform).

Domain-Driven Design (DDD), Decoupling and Anti-Patterns

Some features from API Management tools overlap with other solutions. You should question whether API Management is the right spot for doing this. This is not a ‘yes or no’ discussion. But I think in many cases, the API Management solution should not be used for tasks where other platforms provide better capabilities regarding scalability, tooling, reliability, performance, and other characteristics.

A clear separation of concerns is important for a simple and flexible enterprise architecture. Don’t couple things too tightly. This was a key issue of ESB deployments in the past. Don’t make the same mistake with API Management. It is not a surprise, and should be a warning, that several vendors even built their API Management product on top of their ESB to couple things together.

Martin Fowler taught us several years ago “not to recreate ESB Anti Patterns with Kafka“. Keep this in mind for your API strategy, too! My article “Microservices, Apache Kafka, and Domain-Driven Design” should also help you understand how important separation of concerns and decoupling are for your enterprise architecture. This is true for Kafka, APIs and other business applications.

Overlapping Features between Kafka and API Management

Kafka provides a messaging and storage solution for event-based processing as its core. In addition, Kafka Connect (for integration) and Kafka Streams (for stream processing) are part of the open source project.

API Management exists for completely different use cases as discussed in detail in the above section: To create, publish, manage and monetize APIs.

Nevertheless, some overlapping features exist between Kafka and API Gateways and API Management solutions. Here are some examples:

  • Protocol conversion: One consumer or client requires JSON while the other one can only process Avro, Protobuf or XML.
  • ETL (Extract Transform Load): Transformations, filtering, sorting and similar tasks.
  • Connectivity: Integration with back-end systems like databases, data warehouses, data lakes, messaging systems, business applications.
  • API Gateway: Routing, public endpoints, single entry point, access control, encryption, throttling, etc. are common features. This can either be configured / implemented by a dedicated API Gateway (like Amazon API Gateway) or with a Kafka-based platform (like Confluent Platform providing features such as RBAC, Rest Proxy, etc).

Who should solve these overlapping tasks? The Event Streaming Platform or the API Management solution? Well, each vendor will tell you that they can do it the best way. Think about your architecture and requirements. What makes most sense? As so often: It depends!

If you want to build a scalable, reliable integration pipeline, Kafka is probably the better choice. If you need to provide a flexible Gateway interface for REST web services with routing configurations, a dedicated API Gateway is probably the best choice. Try to keep the architecture as simple as possible.
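
As a small illustration of the ETL-style overlap (topic names and the payload format are assumptions), here is a minimal Kafka Streams sketch that filters and transforms events on their way from one topic to another. The same kind of task could be configured in an API Gateway or ESB, but here it scales out with the Kafka cluster:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderEtlPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders-raw");

        orders
            // Filtering: drop test orders (assumption about the payload format).
            .filter((key, value) -> !value.contains("\"test\":true"))
            // Transformation: normalize the payload on the fly.
            .mapValues(value -> value.trim())
            .to("orders-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```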

Let’s now take a look at an architecture to understand how Kafka and API Management solutions play together very well.

Microservices, API Management (Mulesoft Anypoint) and Event Streaming (Kafka)

The following example shows a microservices architecture leveraging Event Streaming and API Management. It uses a combination of Confluent Platform for the event-based nervous system and Mulesoft Anypoint Platform for API Management and integration with some legacy applications:

Kafka and API Management in the Enterprise Architecture with Confluent and Mulesoft

There are different options to combine Kafka and Event Streaming with API Management solutions:

  • Event Streaming is used to process data continuously at scale in real time
  • Event Streaming is used to directly integrate with various data sources and data sinks (databases, messaging systems, business applications, etc.)
  • The heart of many companies is Event Streaming, gluing together streaming applications with batch, request-response and other platforms.
  • API Management is used to provide an API interface (including lifecycle management, monetization, etc.) on top of Kafka applications, e.g. using services via Confluent REST Proxy, the REST API of Confluent Cloud to provision a new Kafka cluster, or the REST API running on top of a custom Kafka Streams / ksqlDB application or microservice using Interactive Queries (see the sketch after this list).
  • Kafka is used as backend infrastructure. A proxy or business application is used in between Kafka and business applications. API Management is not directly used with Kafka interfaces, but one layer higher on top of the applications which use Kafka under the hood.
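
As a rough sketch of the REST Proxy option referenced in the list above (the proxy URL, topic, and payload are made up; check the Confluent REST Proxy documentation for the exact endpoints and content types of your version), an API managed by an API Management solution could simply forward an HTTP request like this to produce an event into Kafka:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduceExample {
    public static void main(String[] args) throws Exception {
        // JSON payload in the Kafka REST Proxy v2 format: a list of records.
        String body = "{\"records\":[{\"key\":\"customer-42\",\"value\":{\"amount\":99.90}}]}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8082/topics/payments"))
            .header("Content-Type", "application/vnd.kafka.json.v2+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The response contains the partition and offset of the produced record(s).
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```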

Most enterprise architectures require event streaming, request-response and API management. I hope if you read this far in this blog post, you agree and now understand why Apache Kafka and API Management platforms are complementary, not competitive.

But it is also clear that event streaming and today’s API Management tools don’t fit together perfectly because in many cases it does not make sense to put a REST or SOAP API on top of event streaming data.

The Missing Killer Feature: Native Kafka Integration in API Management and API Gateway

The last section explored options for how Kafka and API Management can work together very well.

In an ideal world, an API could be put directly on top of the Kafka protocol. In the real world, almost all API Management products today only support REST / SOAP web services. This means you (have to) build a web service on top of event streaming to provide the API Management capabilities.

Envoy proxy, one of the established proxies for building a service mesh, actually supports the Kafka protocol natively. This works on the TCP level; there is no need to use HTTP REST APIs. This is huge from a scalability and performance perspective. HTTP / synchronous request-response is an anti-pattern for streaming data and will not work if large scale is required for the streaming application. Check out “Service Mesh and Cloud-Native Microservices with Apache Kafka, Kubernetes and Envoy, Istio, Linkerd” for more details on this topic.

Unfortunately, examples like Envoy’s support for the Kafka protocol are very rare today. What if you get native Kafka support in your API Management solution?

Streaming-based API Management for Cross Companies Communication

API Management using REST or SOAP web services is not appropriate for streaming data and large scale use cases. Therefore, more and more enterprises build streaming applications. How strange is it that almost all of these enterprises use the anti-pattern of providing a request-response based REST API on top of the streaming services for API Management?

Support for the Kafka protocol would be very helpful to make API Management even more complementary than it is today. Think about the huge opportunities if you could build life cycle management and monetization / billing on top of a streaming Kafka service.

A great example of a Kafka-native API is HERE Technologies, a company majority-owned by a consortium of German automotive companies (namely Audi, BMW, and Daimler) providing mapping and location services. Their real-time APIs recommend using the native Kafka interface (as all their backend services run on Kafka for the reasons discussed in this post) instead of an optional HTTP wrapper endpoint.

Cross-Company Streaming Replication

Even without proper support for event streaming in most API Management tools, I have seen many customers doing Kafka-native real time communication at scale between different business units or projects. Check out “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” to understand various different options.

Here is the most exciting use case: Streaming replication between different enterprises:

Cross-Company Kafka Integration - Special Case of Hybrid Integration

Different tools enable streaming replication between business units, regions or companies:

  • MirrorMaker 1
  • MirrorMaker 2
  • Confluent Replicator
  • uReplicator (Uber)
  • Mirus (Salesforce)
  • Brooklin (LinkedIn)
  • Custom Replication

If you want to rely on a mature and battle-tested product, then Confluent Replicator is the way to go today in 2020 for real time streaming replication. MirrorMaker 1 should never be an option. MirrorMaker 2 will be a great option in a few quarters, but today it is very new and probably not the best choice for a mission-critical project yet. All other options are only recommended if you want to dive deep into the respective projects.

Tools like Confluent Schema Registry provide governance for the “streaming API interface”. Technologies like Avro, Protobuf or JSON Schema are used to define the “API contract” and process large data volumes efficiently and in real time.
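
As a rough sketch of how such a contract is enforced on the producer side (broker and Schema Registry URLs, topic, and schema are placeholders, and the Confluent Avro serializer is assumed to be on the classpath), the serializer registers the schema in the Schema Registry and validates every record against it:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ContractAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer checks every record against the registered schema (the "API contract").
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord payment = new GenericData.Record(schema);
        payment.put("id", "payment-4711");
        payment.put("amount", 99.90);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "customer-42", payment));
        }
    }
}
```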

Event Streaming Internally and REST API to the Outside World?

A cross-company streaming architecture has one key drawback: Information security and politics are your biggest enemy! 🙂 But I have seen customers running this setup in production with a partner company. So it is doable, and even without API Management in the middle, you can leverage event streaming at scale with your partner. Think about use cases like airline ticketing, retail transactions or financial services.

Why would you build everything in real time at scale internally, but only provide a non-scalable synchronous HTTP interface to the outside world? And your external partners are asking themselves exactly the same question…

API Management for event streaming would make this easier from a security and monetization / billing perspective. I hope this feature will be implemented soon by various API Management software vendors.

The Future – Streaming-based API Management for Apache Kafka?

Most architectures require request-response based communication (typically REST) and event streaming (typically Kafka). API Management helps make applications accessible, no matter if the heart of the infrastructure is event-based or point-to-point communication.

I think (and hope) the future will provide streaming-based API Management solutions for Apache Kafka. Envoy’s support for the Kafka protocol is a first example. A few other frameworks also provide some “first hacks” already.

I hope this blog post helped you understanding the relation between Event Streaming with Apache Kafka and API Management solutions such as Kong or Mulesoft Anypoint Platform. They are complementary, not competitive.

How do you think about API Management in conjunction with event streaming and Apache Kafka? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

 

 

The post Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies? appeared first on Kai Waehner.

]]>
Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video https://www.kai-waehner.de/blog/2019/03/07/apache-kafka-middleware-mq-etl-esb-comparison/ Thu, 07 Mar 2019 15:45:15 +0000 http://www.kai-waehner.de/blog/?p=1423 This post shares a slide deck and video recording of the differences between an event-driven streaming platform like Apache Kafka and middleware like Message Queues (MQ), Extract-Transform-Load (ETL) and Enterprise Service Bus (ESB).

The post Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video appeared first on Kai Waehner.

]]>
Learn the differences between an event-driven streaming platform like Apache Kafka and middleware like Message Queues (MQ), Extract-Transform-Load (ETL) and Enterprise Service Bus (ESB). Including best practices and anti-patterns, but also how these concepts and tools complement each other in an enterprise architecture.

This blog post shares my slide deck and video recording. I discuss the differences between Apache Kafka as Event Streaming Platform and integration middleware. Learn if they are friends, enemies or frenemies.

Problems of Legacy Middleware

Extract-Transform-Load (ETL) is still a widely-used pattern to move data between different systems via batch processing. Due to its challenges in today’s world, where real time is the new standard, an Enterprise Service Bus (ESB) is used in many enterprises as the integration backbone between any kind of microservice, legacy application or cloud service to move data via SOAP / REST web services or other technologies. Stream processing is often added as its own component in the enterprise architecture for the correlation of different events to implement contextual rules and stateful analytics. Using all these components introduces challenges and complexities in development and operations.

Legacy Middleware (MQ, ESB, ETL, etc)

Apache Kafka – An Open Source Event Streaming Platform

This session discusses how teams in different industries solve these challenges by building a native event streaming platform from the ground up instead of using ETL and ESB tools in their architecture. This makes it possible to build and deploy independent, mission-critical, real time streaming applications and microservices. The architecture leverages distributed processing and fault tolerance with fast failover, no-downtime rolling deployments and the ability to reprocess events, so you can recalculate output when your code changes. Integration and stream processing are still key functionality but can be realized in real time natively instead of using additional ETL, ESB or stream processing tools.
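
One way to picture the “reprocess events” capability mentioned above: because Kafka stores the events, a consumer can rewind to the beginning of a topic and recalculate its output after a code change. A minimal sketch, with made-up broker, topic, and group names:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReprocessingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing-v2");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("invoices"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Rewind to the beginning of the log: the new code reprocesses all stored events.
                    consumer.seekToBeginning(partitions);
                }
            });

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // The new business logic recalculates its results from the full event history.
                    System.out.printf("Recalculating invoice %s%n", record.key());
                }
            }
        }
    }
}
```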

A concrete example architecture shows how to use the widely-adopted open source framework Apache Kafka to build a complete, mission-critical, scalable, highly performant streaming platform. Messaging, integration and stream processing are all built on top of the same strong foundation of Kafka; deployed on premise, in the cloud or in hybrid environments. In addition, the open source Confluent projects, built on top of Apache Kafka, add features like a Schema Registry, additional clients for programming languages like Go or C, and many pre-built connectors for various technologies.

Apache Kafka Middleware

Slides: Apache Kafka vs. Integration Middleware

Here is the slide deck:

Video Recording: Kafka vs. MQ / ETL / ESB – Friends, Enemies or Frenemies?

Here is the video recording where I walk you through the above slide deck:

Article: Apache Kafka vs. Enterprise Service Bus (ESB)

I also published a detailed blog post on Confluent blog about this topic in 2018:

Apache Kafka vs. Enterprise Service Bus (ESB)

Talk and Slides from Kafka Summit London 2019

The slides and video recording from Kafka Summit London 2019 (which are similar to above) are also available for free.

Why Apache Kafka instead of Traditional Middleware?

If you don’t want to spend a lot of time on the slides and recording, here is a short summary of the differences of Apache Kafka compared to traditional middleware:

Why Apache Kafka instead of Integration Middleware

Questions and Discussion…

Please share your thoughts, too!

Does your infrastructure see similar architectures? Do you face similar challenges? Do you like the concepts behind an Event Streaming Platform (aka Apache Kafka)? How do you combine legacy middleware with Kafka? What’s your strategy to integrate the modern and the old (technology) world? Is Kafka part of that architecture?

Please let me know either via a comment or via LinkedIn, Twitter, email, etc. I am curious about other opinions and experiences (and people who disagree with my presentation).

The post Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video appeared first on Kai Waehner.

]]>
Big Data Spain: Talk about KSQL – The Streaming SQL Engine for Apache Kafka https://www.kai-waehner.de/blog/2018/11/15/big-data-spain-ksql-the-streaming-sql-engine-apache-kafka/ Thu, 15 Nov 2018 05:35:24 +0000 http://www.kai-waehner.de/blog/?p=1382 KSQL - The Open Source Streaming SQL Engine for Apache Kafka => Slides from my talk at Big Data Spain 2018 are online. Check it out!

The post Big Data Spain: Talk about KSQL – The Streaming SQL Engine for Apache Kafka appeared first on Kai Waehner.

]]>
In November 2018, I was back in Madrid to speak at Big Data Spain, a great event all about big data, analytics and machine learning, and one of the largest tech conferences in Spain. A perfect event to talk about KSQL – The Streaming SQL Engine for Apache Kafka.

Big Data Spain is held in Kinepolis, a big cinema. One of my favorite locations for a tech conference – for speakers and audience.

All talks at Big Data Spain are recorded. Video recording and slides below.

KSQL – The Open Source SQL Streaming Engine for Apache Kafka

My talk was an update about KSQL. The slide deck describes various use cases for KSQL. I also included some advanced topics such as User Defined Functions (UDF). Here is the abstract:

The rapidly expanding world of stream processing can be daunting, with new concepts such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks to master.
KSQL is an open-source, Apache 2.0 licensed streaming SQL engine on top of Apache Kafka which aims to simplify all this and make stream processing available to everyone. Even though it is simple to use, KSQL is built for mission-critical and scalable production deployments (using Kafka Streams under the hood).
Benefits of using KSQL include: No coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem. This session introduces the concepts and architecture of KSQL. Use cases such as Streaming ETL, Real Time Stream Monitoring or Anomaly Detection are discussed. A live demo shows how to setup and use KSQL quickly and easily on top of your Kafka ecosystem.

Key takeaways:

– KSQL includes access to the rich Apache Kafka ecosystem and is suitable for various use cases, including Streaming ETL, Real Time Stream Monitoring and Anomaly Detection

– KSQL allows you to realize stream processing without coding and without an additional analytics cluster

Slide Deck: KSQL Introduction

Here is the slide deck:

 

Video Recording: Intro to KSQL

Here is the video recording from my talk:

The post Big Data Spain: Talk about KSQL – The Streaming SQL Engine for Apache Kafka appeared first on Kai Waehner.

]]>
Apache Kafka vs. ESB / ETL / MQ https://www.kai-waehner.de/blog/2018/07/18/apache-kafka-vs-esb-etl-mq/ Wed, 18 Jul 2018 18:33:56 +0000 http://www.kai-waehner.de/blog/?p=1320 Read why enterprises leverage the open source ecosystem of Apache Kafka for successful integration of different legacy and modern applications instead of ESB, ETL or MQ.

The post Apache Kafka vs. ESB / ETL / MQ appeared first on Kai Waehner.

]]>
Apache Kafka and Enterprise Service Bus (ESB) are complementary, not competitive!

In the meantime, Apache Kafka is much more than messaging. It has evolved into a streaming platform including Kafka Connect, Kafka Streams, KSQL and many other open source components. Kafka leverages events as a core principle. You think in data flows of events and process the data while it is in motion. Many concepts, such as event sourcing, or design patterns such as Enterprise Integration Patterns (EIPs), are based on event-driven architecture.

Kafka is unique because it combines messaging, storage, and processing of events all in one platform. It does this in a distributed architecture using a distributed commit log and topics divided into multiple partitions.
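
As a small illustration of that last point (broker address, topic name, and sizing are made up), every Kafka topic is created with a number of partitions and a replication factor; the partitions enable parallel processing, and the replication provides fault tolerance:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrderTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions allow up to 6 consumers of one group to read in parallel;
            // replication factor 3 keeps the data available if a broker fails.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}
```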

ETL and ESB have excellent tooling, including graphical mappings for doing complex integration with legacy systems and technologies such as SOAP, EDIFACT, SAP BAPI, COBOL, etc. (Trust me, you don’t want to write the code for this.)

Therefore, existing MQ and ESB solutions, which already integrate with your legacy world, are not competitive to Apache Kafka. Rather, they are complementary!

Read more details about this question in my Confluent blog post:

Apache Kafka® vs. Enterprise Service Bus (ESB) | Confluent

As always, I appreciate any feedback, comments or criticism.

The post Apache Kafka vs. ESB / ETL / MQ appeared first on Kai Waehner.

]]>
Apache Kafka Streams + Machine Learning (Spark, TensorFlow, H2O.ai) https://www.kai-waehner.de/blog/2017/05/23/apache-kafka-streams-machine-learning-spark-tensorflow-h2o-ai/ Tue, 23 May 2017 17:11:40 +0000 http://www.kai-waehner.de/blog/?p=1178 Apache Kafka Streams to build Real Time Streaming Microservices. Apply Machine Learning / Deep Learning using Spark, TensorFlow, H2O.ai, etc. to add AI. Embed Kafka Streams into Java App, Docker, Kubernetes, Mesos, anything else.

The post Apache Kafka Streams + Machine Learning (Spark, TensorFlow, H2O.ai) appeared first on Kai Waehner.

]]>
I started at Confluent in May 2017 to work as Technology Evangelist focusing on topics around the open source framework Apache Kafka. I think Machine Learning is one of the hottest buzzwords these days as it can add huge business value in any industry. Therefore, you will see various other posts from me around Apache Kafka (messaging), Kafka Connect (integration), Kafka Streams (stream processing), and Confluent’s additional open source add-ons on top of Kafka (Schema Registry, Replicator, Auto Balancer, etc.). I will explain how to leverage all this for machine learning and other big data technologies in real world production scenarios.

Read this if you wonder why I am so excited about moving (back) to open source for messaging, integration and stream processing in the big data world.

In the following blog post, I want to share my first slide deck from a conference talk representing Confluent: A software architecture user group in Leipzig, Germany organized a 2-day event to discuss big data in practice.

Apache Kafka Streams + Machine Learning / Deep Learning

This is the abstract of the slide deck:

Big Data and Machine Learning are key for innovation in many industries today. Large amounts of historical data are stored and analyzed in Hadoop, Spark or other clusters to find patterns and insights, e.g. for predictive maintenance, fraud detection or cross-selling.

This first part of the session explains how to build analytic models with R, Python and Scala leveraging open source machine learning / deep learning frameworks like Apache Spark, TensorFlow or H2O.ai.

The second part discusses how to leverage these analytic models in your own real time streaming applications or microservices. It explains how to leverage the Apache Kafka cluster and Kafka Streams instead of building your own stream processing cluster. The session focuses on live demos and teaches lessons learned for executing analytic models in a highly scalable and performant way.
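
A rough sketch of this idea is shown below. The FraudModel class and its predict method are hypothetical stand-ins for a model exported from TensorFlow, H2O.ai, or a similar framework, and the topic names are made up; the point is that the trained model is loaded once and then applied to every event inside a Kafka Streams topology, without a separate stream processing or model-serving cluster:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class FraudScoringApp {
    public static void main(String[] args) {
        // Hypothetical wrapper around a model trained offline (e.g. with TensorFlow or H2O.ai).
        FraudModel model = FraudModel.load("/models/fraud-v1");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-scoring");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("payments")
            // Apply the analytic model to every event in real time, inside the streaming application.
            .mapValues(payment -> String.valueOf(model.predict(payment)))
            .to("fraud-scores");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    /** Hypothetical model interface; replace with the API of your ML framework of choice. */
    interface FraudModel {
        static FraudModel load(String path) { return payment -> 0.0; }
        double predict(String payment);
    }
}
```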

The last part explains how Apache Kafka can help to move from a manual build and deployment of analytic models to continuous online model improvement in real time.

Slide Deck: How to Build Analytic Models and Deployment to Real Time Processing

Here is the slide deck:

More blog posts with more details and specific code examples will follow in the coming weeks. I will also do a web recording for this slide deck and post it on YouTube.

The post Apache Kafka Streams + Machine Learning (Spark, TensorFlow, H2O.ai) appeared first on Kai Waehner.

]]>