event streaming Archives - Kai Waehner
https://www.kai-waehner.de/blog/tag/event-streaming/

How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time
https://www.kai-waehner.de/blog/2025/04/14/how-apache-kafka-and-flink-power-event-driven-agentic-ai-in-real-time/ (Mon, 14 Apr 2025)

Agentic AI marks a major evolution in artificial intelligence—shifting from passive analytics to autonomous, goal-driven systems capable of planning and executing complex tasks in real time. To function effectively, these intelligent agents require immediate access to consistent, trustworthy data. Traditional batch processing architectures fall short of this need, introducing delays, data staleness, and rigid workflows. This blog post explores why event-driven architecture (EDA)—powered by Apache Kafka and Apache Flink—is essential for building scalable, reliable, and adaptive AI systems. It introduces key concepts such as Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol, which are redefining interoperability and context management in multi-agent environments. Real-world use cases from finance, healthcare, manufacturing, and more illustrate how Kafka and Flink provide the real-time backbone needed for production-grade Agentic AI. The post also highlights why popular frameworks like LangChain and LlamaIndex must be complemented by robust streaming infrastructure to support stateful, event-driven AI at scale.

Artificial Intelligence is evolving beyond passive analytics and reactive automation. Agentic AI represents a new wave of autonomous, goal-driven AI systems that can think, plan, and execute complex workflows without human intervention. However, for these AI agents to be effective, they must operate on real-time, consistent, and trustworthy data—a challenge that traditional batch processing architectures simply cannot meet. This is where Data Streaming with Apache Kafka and Apache Flink, coupled with an event-driven architecture (EDA), form the backbone of Agentic AI. By enabling real-time and continuous decision-making, EDA ensures that AI systems can act instantly and reliably in dynamic, high-speed environments. Emerging standards like the Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol are now complementing this foundation, providing structured, interoperable layers for managing context and coordination across intelligent agents—making AI not just event-driven, but also context-aware and collaborative.

Event-Driven Agentic AI with Data Streaming using Apache Kafka and Flink

In this post, I will explore:

  • How Agentic AI works and why it needs real-time data
  • Why event-driven architectures are the best choice for AI automation
  • Key use cases across industries
  • How Kafka and Flink provide the necessary data consistency and real-time intelligence for AI-driven decision-making
  • The role of MCP, A2A, and frameworks like LangChain and LlamaIndex in enabling scalable, context-aware, and collaborative AI systems

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

What is Agentic AI?

Agentic AI refers to AI systems that exhibit autonomous, goal-driven decision-making and execution. Unlike traditional automation tools that follow rigid workflows, Agentic AI can:

  • Understand and interpret natural language instructions
  • Set objectives, create strategies, and prioritize actions
  • Adapt to changing conditions and make real-time decisions
  • Execute multi-step tasks with minimal human supervision
  • Integrate with multiple operational and analytical systems and data sources to complete workflows

Here is an example AI Agent dependency graph from Sean Falconer’s article “Event-Driven AI: Building a Research Assistant with Kafka and Flink“:

Example AI Agent Dependency Graph
Source: Sean Falconer

Instead of merely analyzing data, Agentic AI acts on data, making it invaluable for operational and transactional use cases—far beyond traditional analytics.

However, without real-time, high-integrity data, these systems cannot function effectively. If AI is working with stale, incomplete, or inconsistent information, its decisions become unreliable and even counterproductive. This is where Kafka, Flink, and event-driven architectures become indispensable.

Why Batch Processing Fails for Agentic AI

Traditional AI and analytics systems have relied heavily on batch processing, where data is collected, stored, and processed in predefined intervals. This approach may work for generating historical reports or training machine learning models offline, but it completely breaks down when applied to operational and transactional AI use cases—which are at the core of Agentic AI.

Why Batch Processing Fails for Agentic AI

I recently explored the Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming). Here is why batch processing is fundamentally incompatible with Agentic AI, and the real-world challenges this mismatch creates:

1. Delayed Decision-Making Slows AI Reactions

Agentic AI systems are designed to autonomously respond to real-time changes in the environment, whether it’s optimizing a telecommunications network, detecting fraud in banking, or dynamically adjusting supply chains.

In a batch-driven system, data is processed hours or even days later, making AI responses obsolete before they even reach the decision-making phase. For example:

  • Fraud detection: If a bank processes transactions in nightly batches, fraudulent activities may go unnoticed for hours, leading to financial losses.
  • E-commerce recommendations: If a retailer updates product recommendations only once per day, it fails to capture real-time shifts in customer behavior.
  • Network optimization: If a telecom company analyzes network traffic in batch mode, it cannot prevent congestion or outages before they affect users.

Agentic AI requires instantaneous decision-making based on streaming data, not delayed insights from batch reports.

2. Data Staleness Creates Inaccurate AI Decisions

AI agents must act on fresh, real-world data, but batch processing inherently means working with outdated information. If an AI agent is making decisions based on yesterday’s or last hour’s data, those decisions are no longer reliable.

Consider a self-healing IT infrastructure that uses AI to detect and mitigate outages. If logs and system metrics are processed in batch mode, the AI agent will be acting on old incident reports, missing live system failures that need immediate attention.

In contrast, an event-driven system powered by Kafka and Flink ensures that AI agents receive live system logs as they occur, allowing for proactive self-healing before customers are impacted.

3. High Latency Kills Operational AI

In industries like finance, healthcare, and manufacturing, even a few seconds of delay can lead to severe consequences. Batch processing introduces significant latency, making real-time automation impossible.

For example:

  • Healthcare monitoring: A real-time AI system should detect abnormal heart rates from a patient’s wearable device and alert doctors immediately. If health data is only processed in hourly batches, a critical deterioration could be missed, leading to life-threatening situations.
  • Automated trading in finance: AI-driven trading systems must respond to market fluctuations within milliseconds. Batch-based analysis would mean losing high-value trading opportunities to faster competitors.

Agentic AI must operate on a live data stream, where every event is processed instantly, allowing decisions to be made in real-time, not retrospectively.

4. Rigid Workflows Increase Complexity and Costs

Batch processing forces businesses to predefine rigid workflows that do not adapt well to changing conditions. In a batch-driven world:

  • Data must be manually scheduled for ingestion.
  • Systems must wait for the entire dataset to be processed before making decisions.
  • Business logic is hard-coded, requiring expensive engineering effort to update workflows.

Agentic AI, on the other hand, is designed for continuous, adaptive decision-making. By leveraging an event-driven architecture, AI agents listen to streams of real-time data, dynamically adjusting workflows on the fly instead of relying on predefined batch jobs.

This flexibility is especially critical in industries with rapidly changing conditions, such as supply chain logistics, cybersecurity, and IoT-based smart cities.

5. Batch Processing Cannot Support Continuous Learning

A key advantage of Agentic AI is its ability to learn from past experiences and self-improve over time. However, this is only possible if AI models are continuously updated with real-time feedback loops.

Batch-driven architectures limit AI’s ability to learn because:

  • Models are retrained infrequently, leading to outdated insights.
  • Feedback loops are slow, preventing AI from adjusting strategies in real time.
  • Drift in data patterns is not immediately detected, causing AI performance degradation.

For instance, in customer service chatbots, an AI-powered agent should adapt to customer sentiment in real time. If a chatbot is trained on stale customer interactions from last month, it won’t understand emerging trends or newly common issues.

By contrast, a real-time data streaming architecture ensures that AI agents continuously receive live customer interactions, retrain in real time, and evolve dynamically.

Agentic AI Requires an Event-Driven Architecture

Agentic AI must act in real time and integrate operational and analytical information. Whether it’s an AI-driven fraud detection system, an autonomous network optimization agent, or a customer service chatbot, acting on outdated information is not an option.

The Event-Driven Approach

An Event-Driven Architecture (EDA) enables continuous processing of real-time data streams, ensuring that AI agents always have the latest information available. By decoupling applications and processing events asynchronously, EDA allows AI to respond dynamically to changes in the environment without being constrained by rigid workflows.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

AI can also be seamlessly integrated into existing business processes by leveraging an EDA, bridging modern and legacy technologies without requiring a complete system overhaul. Not every data source may be real-time, but EDA ensures data consistency across all consumers—if an application processes data, it sees exactly what every other application sees. This guarantees synchronized decision-making, even in hybrid environments combining historical data with real-time event streams.
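To make this consistency guarantee concrete, here is a minimal, hedged sketch in Java using the standard kafka-clients library. The topic name, broker address, and consumer group IDs are illustrative assumptions, not details from the original post. Two applications with different group IDs each receive the complete event stream from the same topic, so an AI agent and a legacy reporting service work on exactly the same data:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsistentConsumers {

    // Each consumer group receives its own copy of the full event stream,
    // so every downstream application sees exactly the same data.
    static KafkaConsumer<String, String> createConsumer(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        String topic = "customer-events"; // hypothetical topic name

        try (KafkaConsumer<String, String> agent = createConsumer("ai-agent");
             KafkaConsumer<String, String> reporting = createConsumer("reporting-app")) {

            agent.subscribe(List.of(topic));
            reporting.subscribe(List.of(topic));

            // One poll per consumer just to illustrate; a real service polls in a loop.
            for (ConsumerRecord<String, String> event : agent.poll(Duration.ofSeconds(1))) {
                System.out.println("AI agent saw: " + event.value());
            }
            for (ConsumerRecord<String, String> event : reporting.poll(Duration.ofSeconds(1))) {
                System.out.println("Reporting app saw: " + event.value());
            }
        }
    }
}
```

Because the event log is durable and replayable, a newly added consumer (or a recovering one) can rebuild its view from history, which is what keeps hybrid environments with legacy systems synchronized.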

Why Apache Kafka is Essential for Agentic AI

For AI to be truly autonomous and effective, it must operate in real time, adapt to changing conditions, and ensure consistency across all applications. An Event-Driven Architecture (EDA) built with Apache Kafka provides the foundation for this by enabling:

  • Immediate Responsiveness → AI agents receive and act on events as they occur.
  • High Scalability → Components are decoupled and can scale independently.
  • Fault Tolerance → AI processes continue running even if some services fail.
  • Improved Data Consistency → Ensures AI agents are working with accurate, real-time data.

To build truly autonomous AI systems, organizations need a real-time data infrastructure that can process, analyze, and act on events as they happen.

Building Event-Driven Multi-Agents with Data Streaming using Apache Kafka and Flink
Source: Sean Falconer

Apache Kafka: The Real-Time Data Streaming Backbone

Apache Kafka provides a scalable, event-driven messaging infrastructure that ensures AI agents receive a constant, real-time stream of events. By acting as a central nervous system, Kafka enables the following (a minimal producer sketch follows the list):

  • Decoupled AI components that communicate through event streams.
  • Efficient data ingestion from multiple sources (IoT devices, applications, databases).
  • Guaranteed event delivery with fault tolerance and durability.
  • High-throughput processing to support real-time AI workloads.
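As a minimal sketch of the ingestion side (Java with the standard kafka-clients producer; the topic name, broker address, and JSON payload are assumptions for illustration), an application or device gateway publishes events that downstream AI agents consume:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SensorEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Durability settings: wait for all in-sync replicas and enable idempotence
        // so events are neither lost nor duplicated on retries.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical device ID and payload; in practice this would come from
            // an IoT gateway, an application, or a CDC connector.
            String deviceId = "sensor-42";
            String payload = "{\"deviceId\":\"sensor-42\",\"temperature\":78.3}";

            producer.send(new ProducerRecord<>("machine-telemetry", deviceId, payload),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("Published to %s-%d@%d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes any pending records
    }
}
```

Keying events by device ID (or customer ID, card number, and so on) keeps related events ordered per key, which downstream agents rely on for correct, deterministic decisions.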

Apache Flink complements Kafka by providing stateful stream processing for AI-driven workflows. With Flink, AI agents can (see the code sketch after this list):

  • Analyze real-time data streams for anomaly detection, predictions, and decision-making.
  • Perform complex event processing to detect patterns and trigger automated responses.
  • Continuously learn and adapt based on evolving real-time data.
  • Orchestrate multi-agent workflows dynamically.
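The following is a minimal, hedged sketch of such a Flink job in Java using the DataStream API and the Kafka connector. The topic name, consumer group, broker address, threshold, and the naive JSON parsing are illustrative assumptions; a production job would use proper deserialization, a real anomaly model or CEP pattern, and a KafkaSink to publish alerts back to Kafka for other agents:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TelemetryAnomalyJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously read raw telemetry events from Kafka (topic and broker are assumptions).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("machine-telemetry")
                .setGroupId("anomaly-detector")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> telemetry =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "machine-telemetry");

        // Naive threshold rule as a placeholder for a real model or CEP pattern:
        // flag any event whose (hypothetical) temperature field exceeds 90.
        DataStream<String> anomalies = telemetry.filter(event -> extractTemperature(event) > 90.0);

        // In a real pipeline these alerts would be written to another Kafka topic
        // so downstream AI agents can react to them.
        anomalies.print();

        env.execute("Telemetry Anomaly Detection");
    }

    // Crude field extraction to keep the sketch dependency-free; production code
    // would use a proper deserializer or Flink's JSON formats.
    private static double extractTemperature(String json) {
        int idx = json.indexOf("\"temperature\":");
        if (idx < 0) return 0.0;
        String rest = json.substring(idx + "\"temperature\":".length());
        int end = 0;
        while (end < rest.length() && (Character.isDigit(rest.charAt(end)) || rest.charAt(end) == '.')) {
            end++;
        }
        return end == 0 ? 0.0 : Double.parseDouble(rest.substring(0, end));
    }
}
```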

Across industries, Agentic AI is redefining how businesses and governments operate. By leveraging event-driven architectures and real-time data streaming, organizations can unlock the full potential of AI-driven automation, improving efficiency, reducing costs, and delivering better experiences.

Here are key use cases across different industries:

Financial Services: Real-Time Fraud Detection and Risk Management

Traditional fraud detection systems rely on batch processing, leading to delayed responses and financial losses.

Agentic AI enables real-time transaction monitoring, detecting anomalies as they occur and blocking fraudulent activities instantly.

AI agents continuously learn from evolving fraud patterns, reducing false positives and improving security. In risk management, AI analyzes market trends, adjusts investment strategies, and automates compliance processes to ensure financial institutions stay ahead of threats and regulatory requirements.
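As a hedged illustration of what such a stateful check can look like in Flink (Java DataStream API), the fragment below assumes a stream of (cardId, amount) tuples already parsed from a payments topic; the window size and spending threshold are invented for the example and are not a real fraud rule:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FraudRuleSketch {

    // Assumes `transactions` is a stream of (cardId, amount) tuples,
    // e.g. parsed from a hypothetical "payments" Kafka topic.
    public static DataStream<Tuple2<String, Double>> suspiciousSpend(
            DataStream<Tuple2<String, Double>> transactions) {

        return transactions
                // Partition the stream by card so windows and state are per card.
                .keyBy(value -> value.f0)
                // Sum the spend per card over a short tumbling window.
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                // Illustrative rule: flag cards spending more than 10,000 within a minute.
                .filter(value -> value.f1 > 10_000.0);
    }
}
```

Richer patterns, such as velocity checks across merchants or model-based scoring, would typically use a KeyedProcessFunction with state and timers or Flink CEP, with the resulting alerts written back to Kafka for other agents and services to act on.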

Telecommunications: Autonomous Network Optimization

Telecom networks require constant tuning to maintain service quality, but traditional network management is reactive and expensive.

Agentic AI can proactively monitor network traffic, predict congestion, and automatically reconfigure network resources in real time. AI-powered agents optimize bandwidth allocation, detect outages before they impact customers, and enable self-healing networks, reducing operational costs and improving service reliability.

Retail: AI-Powered Personalization and Dynamic Pricing

Retailers struggle with static recommendation engines that fail to capture real-time customer intent.

Agentic AI analyzes customer interactions, adjusts recommendations dynamically, and personalizes promotions based on live purchasing behavior. AI-driven pricing strategies adapt to supply chain fluctuations, competitor pricing, and demand changes in real time, maximizing revenue while maintaining customer satisfaction.

AI agents also enhance logistics by optimizing inventory management and reducing stock shortages.

Healthcare: Real-Time Patient Monitoring and Predictive Care

Hospitals and healthcare providers require real-time insights to deliver proactive care, but batch processing delays critical decisions.

Agentic AI continuously streams patient vitals from medical devices to detect early signs of deterioration and trigger instant alerts to medical staff. AI-driven predictive analytics optimize hospital resource allocation, improve diagnosis accuracy, and enable remote patient monitoring, reducing emergency incidents and improving patient outcomes.

Gaming: Dynamic Content Generation and Adaptive AI Opponents

Modern games need to provide immersive, evolving experiences, but static game mechanics limit engagement.

Agentic AI enables real-time adaptation of gameplay to generate dynamic environments and personalize challenges based on a player’s behavior. AI-driven opponents can learn and adapt to individual playstyles, keeping games engaging over time. AI agents also manage server performance, detect cheating, and optimize in-game economies for a better gaming experience.

Manufacturing & Automotive: Smart Factories and Autonomous Systems

Manufacturing relies on precision and efficiency, yet traditional production lines struggle with downtime and defects.

Agentic AI monitors production processes in real time to detect quality issues early and adjust machine parameters autonomously. This directly improves Overall Equipment Effectiveness (OEE) by reducing downtime, minimizing defects, and optimizing machine performance to ensure higher productivity and operational efficiency.

In automotive, AI-driven agents analyze real-time sensor data from self-driving cars to make instant navigation decisions, predict maintenance needs, and optimize fleet operations for logistics companies.

Public Sector: AI-Powered Smart Cities and Citizen Services

Governments face challenges in managing infrastructure, public safety, and citizen services efficiently.

Agentic AI can optimize traffic flow by analyzing real-time data from sensors and adjusting signals dynamically. AI-powered public safety systems detect threats from surveillance data and dispatch emergency services instantly. AI-driven chatbots handle citizen inquiries, automate document processing, and improve response times for government services.

The Business Value of Real-Time AI using Autonomous Agents

By leveraging Kafka and Flink in an event-driven AI architecture, organizations can achieve:

  • Better Decision-Making → AI operates on fresh, accurate data.
  • Faster Time-to-Action → AI agents respond to events immediately.
  • Reduced Costs → Less reliance on expensive batch processing and manual intervention by humans.
  • Greater Scalability → AI systems can handle massive workloads in real time.
  • Vendor Independence → Kafka and Flink support open standards and hybrid/multi-cloud deployments, preventing vendor lock-in.

Why LangChain, LlamaIndex, and Similar Frameworks Are Not Enough for Agentic AI in Production

Frameworks like LangChain, LlamaIndex, and others have gained popularity for making it easy to prototype AI agents by chaining prompts, tools, and external APIs. They provide useful abstractions for reasoning steps, retrieval-augmented generation (RAG), and basic tool use—ideal for experimentation and lightweight applications.

However, when building agentic AI for operational, business-critical environments, these frameworks fall short on several fronts:

  • Many frameworks like LangChain are inherently synchronous and follow a request-response model, which limits their ability to handle real-time, event-driven inputs at scale. In contrast, LlamaIndex takes an event-driven approach, using a message broker—including support for Apache Kafka—for inter-agent communication.
  • Debugging, observability, and reproducibility are weak—there’s often no persistent, structured record of agent decisions or tool interactions.
  • State is ephemeral and in-memory, making long-running tasks, retries, or rollback logic difficult to implement reliably.
  • Most Agentic AI frameworks lack support for distributed, fault-tolerant execution and scalable orchestration, which are essential for production systems.

That said, frameworks like LangChain and LlamaIndex can still play a valuable, complementary role when integrated into an event-driven architecture. For example, an agent might use LangChain for planning or decision logic within a single task, while Apache Kafka and Apache Flink handle the real-time flow of events, coordination between agents, persistence, and system-level guarantees.
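To illustrate this division of labor, here is a minimal, hedged sketch in Java with kafka-clients. The topic names and group ID are assumptions, and planWithAgent is a hypothetical placeholder for a LangChain- or LlamaIndex-based planning step (for example, a separate Python agent service called over HTTP or gRPC). The framework handles the reasoning for a single task, while Kafka provides durability, ordering, replay, and fan-out to other agents:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class EventDrivenAgentWorker {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "research-agent");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(List.of("agent-tasks")); // hypothetical topic of task events

            while (true) {
                for (ConsumerRecord<String, String> task : consumer.poll(Duration.ofMillis(500))) {
                    // The agent framework does the reasoning for this single task;
                    // Kafka durably records both the task and its outcome as events.
                    String result = planWithAgent(task.value());
                    producer.send(new ProducerRecord<>("agent-results", task.key(), result));
                }
            }
        }
    }

    // Hypothetical placeholder: in practice this would call the agent service
    // and return its response payload.
    private static String planWithAgent(String taskPayload) {
        return "{\"task\":" + taskPayload + ",\"plan\":\"...\"}";
    }
}
```

Because every task and result lands in a topic, the system gains the traceability, replayability, and fault tolerance that the frameworks alone do not provide.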

LangChain and similar toolkits help define how an agent thinks. But to run that thinking at scale, in real time, and with full traceability, you need a robust data streaming foundation. That’s where Kafka and Flink come in.

Model Context Protocol (MCP) and Agent-to-Agent (A2A) for Scalable, Composable Agentic AI Architectures

Model Context Protocol (MCP) is one of the hottest topics in AI right now. Coined by Anthropic, with early support emerging from OpenAI, Google, and other leading AI infrastructure providers, MCP is rapidly becoming a foundational layer for managing context in agentic systems. MCP enables systems to define, manage, and exchange structured context windows—making AI interactions consistent, portable, and state-aware across tools, sessions, and environments.

Google’s recently announced Agent-to-Agent (A2A) protocol adds further momentum to this movement, setting the groundwork for standardized interaction across autonomous agents. These advancements signal a new era of AI interoperability and composability.

Together with Kafka and Flink, MCP and protocols like A2A help bridge the gap between stateless LLM calls and stateful, event-driven agent architectures. Naturally, event-driven architecture is the perfect foundation for all this. The key now is to build enough product functionality and keep pushing the boundaries of innovation.

A dedicated blog post is coming soon to explore how MCP and A2A connect data streaming and request-response APIs in modern AI systems.

Agentic AI is poised to revolutionize industries by enabling fully autonomous, goal-driven AI systems that perceive, decide, and act continuously. But to function reliably in dynamic, production-grade environments, these agents require real-time, event-driven architectures—not outdated, batch-oriented pipelines.

Apache Kafka and Apache Flink form the foundation of this shift. Kafka ensures agents receive reliable, ordered event streams, while Flink provides stateful, low-latency stream processing for real-time reactions and long-lived context management. This architecture enables AI agents to process structured events as they happen, react to changes in the environment, and coordinate with other services or agents through durable, replayable data flows.

If your organization is serious about AI, the path forward is clear:

Move from batch to real-time, from passive analytics to autonomous action, and from isolated prompts to event-driven, context-aware agents—enabled by Kafka and Flink.

As a next step, learn more about “Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink“.

Let’s connect on LinkedIn and discuss how to implement these ideas in your organization. Stay informed about new developments by subscribing to my newsletter. And make sure to download my free book about data streaming use cases.

The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming)
https://www.kai-waehner.de/blog/2025/04/07/the-importance-of-focus-why-software-vendors-should-specialize-instead-of-doing-everything-example-data-streaming/ (Mon, 07 Apr 2025)

As real-time technologies reshape IT architectures, software vendors face a critical decision: specialize deeply in one domain or build a broad, general-purpose stack. This blog examines why a focused approach—particularly in the world of data streaming—delivers greater innovation, scalability, and reliability. It compares leading platforms and strategies, from specialized providers like Confluent to generalist cloud ecosystems, and highlights the operational risks of fragmented tools. With data streaming emerging as its own software category, enterprises need clarity, consistency, and deep expertise. In this post, we argue that specialization—not breadth—is what powers mission-critical, real-time applications at global scale.

As technology landscapes evolve, software vendors must decide whether to specialize in a core area or offer a broad suite of services. Some companies take a highly focused approach, investing deeply in a specific technology, while others attempt to cover multiple use cases by integrating various tools and frameworks. Both strategies have trade-offs, but history has shown that specialization leads to deeper innovation, better performance, and stronger customer trust. This blog explores why focus matters in the context of data streaming software, the challenges of trying to do everything, and how companies that prioritize one thing—data streaming—can build best-in-class solutions that work everywhere.

The Importance of Focus for Software and Cloud Vendors - Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including customer stories across all industries.

Specialization vs. Generalization: Why Data Streaming Requires a Focused Approach

Data streaming enables real-time processing of continuous data flows, allowing businesses to act instantly rather than relying on batch updates. This shift from traditional databases and APIs to event-driven architectures has become essential for modern IT landscapes.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Data streaming is no longer just a technique—it is a new software category. The 2023 Forrester Wave for Streaming Data Platforms confirms its role as a core component of scalable, real-time architectures. Technologies like Apache Kafka and Apache Flink have become industry standards. They power cloud, hybrid, and on-premise environments for real-time data movement and analytics.

Businesses increasingly adopt streaming-first architectures, focusing on:

  • Hybrid and multi-cloud streaming for real-time edge-to-cloud integration
  • AI-driven analytics powered by continuous optimization and inference using machine learning models
  • Streaming data contracts to ensure governance and reliability across the entire data pipeline
  • Converging operational and analytical workloads to replace inefficient batch processing and Lambda architecture with multiple data pipelines

The Data Streaming Landscape

As data streaming becomes a core part of modern IT, businesses must choose the right approach: adopt a purpose-built data streaming platform or piece together multiple tools with limitations. Event-driven architectures demand scalability, low latency, cost efficiency, and strict SLAs to ensure real-time data processing meets business needs.

Some solutions may be “good enough” for specific use cases, but they often lack the performance, reliability, and flexibility required for large-scale, mission-critical applications.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

The Data Streaming Landscape highlights the differences—while some vendors provide basic capabilities, others offer a complete Data Streaming Platform (DSP) designed to handle complex, high-throughput workloads with enterprise-grade security, governance, and real-time analytics. Choosing the right platform is essential for staying competitive in an increasingly data-driven world.

The Challenge of Doing Everything

Many software vendors and cloud providers attempt to build a comprehensive technology stack, covering everything from data lakes and AI to real-time data streaming. While this offers customers flexibility, it often leads to overlapping services, inconsistent long-term investment, and complexity in adoption.

A few examples follow, from the perspective of data streaming solutions:

Amazon AWS: Multiple Data Streaming Services, Multiple Choices

AWS has built the most extensive cloud ecosystem, offering services for nearly every aspect of modern IT, including data lakes, AI, analytics, and real-time data streaming. While this breadth provides flexibility, it also leads to overlapping services, evolving strategies, and complexity in decision-making, leaving customers frequently uncertain about which solution to choose.

Amazon provides several options for real-time data streaming and event processing, each with different capabilities:

  • Amazon SQS (Simple Queue Service): One of AWS’s oldest and most widely adopted messaging services. It’s reliable for basic decoupling and asynchronous workloads, but it lacks native support for real-time stream processing, ordering, replayability, and event-time semantics.
  • Amazon Kinesis Data Streams: A managed service for real-time data ingestion and simple event processing, but lacks the full event streaming capabilities of a complete data streaming platform.
  • Amazon MSK (Managed Streaming for Apache Kafka): A partially managed Kafka service that mainly focuses on Kafka infrastructure management. It leaves customers to handle critical operational support (MSK does NOT provide SLAs or support for Kafka itself) and misses capabilities such as stream processing, schema management, and governance.
  • AWS Glue Streaming ETL: A stream processing service built for data transformations but not designed for high-throughput, real-time event streaming.
  • Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): AWS’s attempt to offer a fully managed Apache Flink service for real-time event processing, competing directly with open-source Flink offerings.

Each of these services targets different real-time use cases, but they lack a unified, end-to-end data streaming platform. Customers must decide which combination of AWS services to use, increasing integration complexity, operational overhead, and costs.

Strategy Shift and Rebranding with Multiple Product Portfolios

AWS has introduced, rebranded, and developed its real-time streaming services over time:

  • Kinesis Data Analytics was originally AWS’s solution for stream processing but was later rebranded as Amazon Managed Service for Apache Flink, acknowledging Flink’s dominance in modern stream processing.
  • MSK Serverless was introduced to simplify Kafka adoption but also introduces various additional product limitations and cost challenges.
  • AWS Glue Streaming ETL overlaps with Flink’s capabilities, adding confusion about the best choice for real-time data transformations.

As AWS expands its cloud-native services, customers must navigate a complex mix of technologies—often requiring third-party solutions to fill gaps—while assessing whether AWS’s flexible but fragmented approach meets their real-time data streaming needs or if a specialized, fully integrated platform is a better fit.

Google Cloud: Multiple Approaches to Streaming Analytics

Google Cloud is known for its powerful analytics and AI/ML tools, but its strategy in real-time stream processing has been inconsistent.

Customers looking for stream processing in Google Cloud must now choose between several competing services:

  • Google Managed Service for Apache Kafka (“Google MSK”): a managed Kafka offering that is still very early in its maturity curve and has many limitations.
  • Google Dataflow (built on Apache Beam)
  • Google Pub/Sub (event messaging)
  • Apache Flink on Dataproc (a managed service)

While each of these services has its use cases, they introduce complexity for customers who must decide which option is best for their workloads.

BigQuery Flink was introduced to extend Google’s analytics capabilities into real-time processing but was discontinued while still in preview.

Microsoft Azure: Shifting Strategies in Data Streaming

Microsoft Azure has taken multiple approaches to real-time data streaming and analytics, with an evolving strategy that integrates various tools and services.

  • Azure Event Hubs has been a core event streaming service within Azure, designed for high-throughput data ingestion. It supports the Apache Kafka protocol (only through Kafka version 3.0, so its feature set lags considerably), making it a flexible choice for some real-time workloads. However, it primarily focuses on event ingestion rather than event storage, data processing, and integration, which are additional capabilities of a complete data streaming platform.
  • Azure Stream Analytics was introduced as a serverless stream processing solution, allowing customers to analyze data in motion. Despite its capabilities, its adoption has remained limited, particularly as enterprises seek more scalable, open-source alternatives like Apache Flink.
  • Microsoft Fabric is now positioned as an all-in-one data platform, integrating business intelligence, data engineering, real-time streaming, and AI. While this brings together multiple analytics tools, it also shifts the focus away from dedicated, specialized solutions like Stream Analytics.

While Microsoft Fabric aims to simplify enterprise data infrastructure, its broad scope means that customers must adapt to yet another new platform rather than continuing to rely on long-standing, specialized services. The combination of Azure Event Hubs, Stream Analytics, and Fabric presents multiple options for stream processing, but also introduces complexity, limitations and increased cost for a combined solution.

Microsoft’s approach highlights the challenge of balancing broad platform integration with long-term stability in real-time streaming technologies. Organizations using Azure must evaluate whether their streaming workloads require deep, specialized solutions or can fit within a broader, integrated analytics ecosystem.

I wrote an entire blog series to demystify what Microsoft Fabric really is.

Instaclustr: Too Many Technologies, Not Enough Depth

Instaclustr has positioned itself as a managed platform provider for a wide array of open-source technologies, including Apache Cassandra, Apache Kafka, Apache Spark, Apache ZooKeeper, OpenSearch, PostgreSQL, Redis, and more. While this broad portfolio offers customers choices, it reflects a horizontal expansion strategy that lacks deep specialization in any one domain.

For organizations seeking help with real-time data streaming, Instaclustr’s Kafka offering may appear to be a viable managed service. However, unlike purpose-built data streaming platforms, Instaclustr’s Kafka solution is just one of many services, with limited investment in stream processing, schema governance, or advanced event-driven architectures.

Because Instaclustr splits its engineering and support resources across so many technologies, customers often face challenges in:

  • Getting deep technical expertise for Kafka-specific issues
  • Relying on long-term roadmaps and support for evolving Kafka features
  • Building integrated event streaming pipelines that require more than basic Kafka infrastructure

This generalist model may be appealing for companies looking for low-cost, basic managed services—but it falls short when mission-critical workloads demand real-time reliability, zero data loss, SLAs, and advanced stream processing capabilities. Without a singular focus, platforms like Instaclustr risk becoming jacks-of-all-trades but masters of none—especially in the demanding world of real-time data streaming.

Cloudera: A Broad Portfolio Without a Clear Focus

Cloudera has adopted a distinct strategy by incorporating various open-source frameworks into its platform, including:

  • Apache Kafka (event streaming)
  • Apache Flink (stream processing)
  • Apache Iceberg (data lake table format)
  • Apache Hadoop (big data storage and batch processing)
  • Apache Hive (SQL querying)
  • Apache Spark (batch and near real-time processing and analytics)
  • Apache NiFi (data flow management)
  • Apache HBase (NoSQL database)
  • Apache Impala (real-time SQL engine)
  • Apache Pulsar (event streaming, via a partnership with StreamNative)

While this provides flexibility, it also introduces significant complexity:

  • Customers must determine which tools to use for specific workloads.
  • Integration between different components is not always seamless.
  • The broad scope makes it difficult to maintain deep expertise in each area.

Rather than focusing on one core area, Cloudera’s strategy appears to be adding whatever is trending in open source, which can create challenges in long-term support and roadmap clarity.

Splunk: Repeated Attempts at Data Streaming

Splunk, known for log analytics, has tried multiple times to enter the data streaming market:

  • Initially, Splunk built a proprietary streaming solution that never gained widespread adoption.
  • Later, Splunk acquired Streamlio to leverage Apache Pulsar as its streaming backbone. This Pulsar-based strategy ultimately failed, leaving Splunk without a clear real-time streaming offering.

Splunk’s challenges highlight a key lesson: successful data streaming requires long-term investment and specialization, not just acquisitions or technology integrations.

Why a Focused Approach Works Better for Data Streaming

Some vendors take a more specialized approach, focusing on one core capability and doing it better than anyone else. For data streaming, Confluent became the leader in this space by focusing relentlessly on the vision of a complete data streaming platform.

Confluent: Focused on Data Streaming, Built for Everywhere

At Confluent, the focus is clear: real-time data streaming. Unlike many other vendors and cloud providers that offer fragmented or overlapping services, Confluent specializes in one thing and ensures it works everywhere:

  • Cloud: Deploy across AWS, Azure, and Google Cloud with deep native integrations.
  • On-Premise: Enterprise-grade deployments with full control over infrastructure.
  • Edge Computing: Real-time streaming at the edge for IoT, manufacturing, and remote environments.
  • Hybrid Cloud: Seamless data streaming across edge, on-prem, and cloud environments.
  • Multi-Region: Built-in disaster recovery and globally distributed architectures.

More Than Just “The Kafka Company”

While Confluent is often recognized as “the Kafka company,” it has grown far beyond that. Today, Confluent is a complete data streaming platform, combining Apache Kafka for event streaming, Apache Flink for stream processing, and many additional components for data integration, governance and security to power critical workloads.

However, Confluent remains laser-focused on data streaming—it does NOT compete with BI, AI model training, search platforms, or databases. Instead, it integrates and partners with best-in-class solutions in these domains to ensure businesses can seamlessly connect, process, and analyze real-time data within their broader IT ecosystem.

The Right Data Streaming Platform for Every Use Case

Confluent is not just one product—it matches the specific needs, SLAs, and cost considerations of different streaming workloads:

  • Fully Managed Cloud (SaaS)
    • Dedicated and multi-tenant Enterprise Clusters: Low latency, strict SLAs for mission-critical workloads.
    • Freight Clusters: Optimized for high-volume, relaxed latency requirements.
  • Bring Your Own Cloud (BYOC)
    • WarpStream: Bring Your Own Cloud for flexibility and cost efficiency.
  • Self-Managed
    • Confluent Platform: Deploy anywhere—customer cloud VPC, on-premise, at the edge, or across multi-region environments.

Confluent is built for organizations that require more than just “some” data streaming—it is for businesses that need a scalable, reliable, and deeply integrated event-driven architecture. Whether operating in a cloud, hybrid, or on-premise environment, Confluent ensures real-time data can be moved, processed, and analyzed seamlessly across the enterprise.

By focusing only on data streaming, Confluent ensures seamless integration with best-in-class solutions across both operational and analytical workloads. Instead of competing across multiple domains, Confluent partners with industry leaders to provide a best-of-breed architecture that avoids the trade-offs of an all-in-one compromise.

Deep Integrations Across Key Ecosystems

A purpose-built data streaming platform plays well with cloud providers and other data platforms. A few examples:

  • Cloud Providers (AWS, Azure, Google Cloud): While all major cloud providers offer some data streaming capabilities, Confluent takes a different approach by deeply integrating into their ecosystems. Confluent’s managed services can be:
    • Consumed via cloud credits through the cloud provider marketplace
    • Integrated natively into cloud provider’s security and networking services
    • Fully-managed out-of-the-box connectivity to cloud provider services like object storage, lakehouses, and databases
  • MongoDB: A leader in NoSQL and operational workloads, MongoDB integrates with Confluent via Kafka-based change data capture (CDC), enabling real-time event streaming between transactional databases and event-driven applications.
  • Databricks: A powerhouse in AI and analytics, Databricks integrates bi-directionally with Confluent via Kafka and Apache Spark, or object storage and the open table format from Iceberg / Delta Lake via Tableflow. This enables businesses to stream data for AI model training in Databricks and perform real-time model inference directly within the streaming platform.

Rather than attempting to own the entire data stack, Confluent specializes in data streaming and integrates seamlessly with the best cloud, AI, and database solutions.

Beyond the Leader: Specialized Vendors Shaping Data Streaming

Confluent is not alone in recognizing the power of focus. A handful of other vendors have also chosen to specialize in data streaming—each with their own vision, strengths, and approaches.

WarpStream, recently acquired by Confluent, is a Kafka-compatible infrastructure solution designed for Bring Your Own Cloud (BYOC) environments. It re-architects Kafka by running the protocol directly on cloud object storage like Amazon S3, removing the need for traditional brokers or persistent compute. This model dramatically reduces operational complexity and cost—especially for high-ingest, elastic workloads. While WarpStream is now part of the Confluent portfolio, it remains a distinct offering focused on lightweight, cost-efficient Kafka infrastructure.

StreamNative is the commercial steward of Apache Pulsar, aiming to provide a unified messaging and streaming platform. Built for multi-tenancy and geo-replication, it offers some architectural differentiators, particularly in use cases where separation of compute and storage is a must. However, adoption remains niche, and the surrounding ecosystem still lacks maturity and standardization.

Redpanda positions itself as a Kafka-compatible alternative with a focus on performance, especially in low-latency and resource-constrained environments. Its C++ foundation and single-binary architecture make it appealing for edge and latency-sensitive workloads. Yet, Redpanda still needs to mature in areas like stream processing, integrations, and ecosystem support to serve as a true platform.

AutoMQ re-architects Apache Kafka for the cloud by separating compute and storage using object storage like S3. It aims to simplify operations and reduce costs for high-throughput workloads. Though fully Kafka-compatible, AutoMQ concentrates on infrastructure optimization and currently lacks broader platform capabilities like governance, processing, or hybrid deployment support.

Bufstream is experimenting with lightweight approaches to real-time data movement using modern developer tooling and APIs. While promising in niche developer-first scenarios, it has yet to demonstrate scalability, production maturity, or a robust ecosystem around complex stream processing and governance.

Ververica focuses on stream processing with Apache Flink. It offers Ververica Platform to manage Flink deployments at scale, especially on Kubernetes. While it brings deep expertise in Flink operations, it does not provide a full data streaming platform and must be paired with other components, like Kafka for ingestion and delivery.

Great Ideas Are Born From Market Pressure

Each of these companies brings interesting ideas to the space. But building and scaling a complete, enterprise-grade data streaming platform is no small feat. It requires not just infrastructure, but capabilities for processing, governance, security, global scale, and integrations across complex environments.

That’s where Confluent continues to lead—by combining deep technical expertise, a relentless focus on one problem space, and the ability to deliver a full platform experience across cloud, on-prem, and hybrid deployments.

In the long run, the data streaming market will reward not just technical innovation, but consistency, trust, and end-to-end excellence. For now, the message is clear: specialization matters—but execution matters even more. Let’s see where the others go.

How Customers Benefit from Specialization

A well-defined focus provides several advantages for customers, ensuring they get the right tool for each job without the complexity of navigating overlapping services.

  • Clarity in technology selection: No need to evaluate multiple competing services; purpose-built solutions ensure the right tool for each use case.
  • Deep technical investment: Continuous innovation focused on solving specific challenges rather than spreading resources thin.
  • Predictable long-term roadmap: Stability and reliability with no sudden service retirements or shifting priorities.
  • Better performance and reliability: Architectures optimized for the right workloads, backed by deep experience in the software category.
  • Seamless ecosystem integration: Works natively with leading cloud providers and other data platforms for a best-of-breed approach.
  • Deployment flexibility: Not bound to a single environment like one cloud provider; businesses can run workloads on-premise, in any cloud, at the edge, or across hybrid environments.

Rather than adopting a broad but shallow set of solutions, businesses can achieve stronger outcomes by choosing vendors that specialize in one core competency and deliver it everywhere.

Why Deep Expertise Matters: Supporting 24/7, Mission-Critical Data Streaming

For mission-critical workloads—where downtime, data loss, and compliance failures are not an option—deep expertise is not just an advantage, it is a necessity.

Data streaming is a high-performance, real-time infrastructure that requires continuous reliability, strict SLAs, and rapid response to critical issues. When something goes wrong at the core of an event-driven architecture—whether in Apache Kafka, Apache Flink, or the surrounding ecosystem—only specialized vendors with proven expertise can ensure immediate, effective solutions.

The Challenge with Generalist Cloud Services

Many cloud providers offer some level of data streaming, but their approach is different from a dedicated data streaming platform. Take Amazon MSK as an example:

  • Amazon MSK provides managed Kafka clusters, but does NOT offer Kafka support itself. If an issue arises deep within Kafka, customers are responsible for troubleshooting it—or must find external experts to resolve the problem.
  • The terms and conditions of Amazon MSK explicitly exclude Kafka support, meaning that, for mission-critical applications requiring uptime guarantees, compliance, and regulatory alignment, MSK is not a viable choice.
  • This lack of core Kafka support poses a serious risk for enterprises relying on event streaming for financial transactions, real-time analytics, AI inference, fraud detection, and other high-stakes applications.

For companies that cannot afford failure, a data streaming vendor with direct expertise in the underlying technology is essential.

Why Specialized Vendors Are Essential for Mission-Critical Workloads

A complete data streaming platform is much more than a hosted Kafka cluster or a managed Flink service. Specialized vendors like Confluent offer end-to-end operational expertise, covering:

  • 24/7 Critical Support: Direct access to Kafka and Flink experts, ensuring immediate troubleshooting for core-level issues.
  • Guaranteed SLAs: Strict uptime commitments, ensuring that mission-critical applications are always running.
  • No Data Loss Architecture: Built-in replication, failover, and durability to prevent business-critical data loss.
  • Security & Compliance: Encryption, access control, and governance features designed for regulated industries.
  • Professional Services & Advisory: Best practices, architecture reviews, and operational guidance tailored for real-time streaming at scale.

This level of deep, continuous investment in operational excellence separates a general-purpose cloud service from a true data streaming platform.

The Power of Specialization: Deep Expertise Beats Broad Offerings

Software vendors will continue expanding their offerings, integrating new technologies, and launching new services. However, focus remains a key differentiator in delivering best-in-class solutions, especially for operational systems with critical SLAs—where low latency, 24/7 uptime, no data loss, and real-time reliability are non-negotiable.

For companies investing in strategic data architectures, choosing a vendor with deep expertise in one core technology—rather than one that spreads across multiple domains—ensures stability, predictable performance, and long-term success.

In a rapidly evolving technology landscape, clarity, specialization, and seamless integration are the foundations of lasting innovation. Businesses that prioritize proven, mission-critical solutions will be better equipped to handle the demands of real-time, event-driven architectures at scale.

How do you see the world of software? Is it better to specialize or to become an all-rounder? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink
https://www.kai-waehner.de/blog/2025/01/18/fully-managed-saas-vs-partially-managed-paas-cloud-services-for-data-streaming-with-kafka-and-flink/ (Sat, 18 Jan 2025)

The cloud revolution has reshaped how businesses deploy and manage data streaming with solutions like Apache Kafka and Flink. Distinctions between SaaS and PaaS models significantly impact scalability, cost, and operational complexity. Bring Your Own Cloud (BYOC) expands the options, giving businesses greater flexibility in cloud deployment. Misconceptions around terms like “serverless” highlight the need for deeper analysis to avoid marketing pitfalls. This blog explores deployment options, enabling informed decisions tailored to your data streaming needs.

The cloud revolution has transformed how businesses deploy, scale, and manage data streaming solutions. While Software-as-a-Service (SaaS) and Platform-as-a-Service (PaaS) cloud models are often used interchangeably in marketing, their distinctions have significant implications for operational efficiency, cost, and scalability. In the context of data streaming around Apache Kafka and Flink, understanding these differences and recognizing common misconceptions—such as the overuse of the term “serverless”—can help you make an informed decision. Additionally, the emergence of Bring Your Own Cloud (BYOC) offers yet another option, providing organizations with enhanced control and flexibility in their cloud environments.

SaaS vs PaaS Cloud Service for Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

If you’re still grappling with the fundamentals of stream processing, this article is a must-read: “Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink“.

What is SaaS in Data Streaming?

SaaS data streaming solutions are fully managed services where the provider handles all aspects of deployment, maintenance, scaling, and updates. SaaS offerings are designed for ease of use, providing a serverless experience where developers focus solely on building applications rather than managing infrastructure.

Characteristics of SaaS in Data Streaming

  1. Serverless Architecture: Resources scale automatically based on demand. True SaaS solutions eliminate the need to provision or manage servers.
  2. Low Operational Overhead: The provider manages hardware, software, and runtime configurations, including upgrades and security patches.
  3. Pay-As-You-Go Pricing: Consumption-based pricing aligns costs directly with usage, reducing waste during low-demand periods.
  4. Rapid Deployment: SaaS enables users to start processing streams within minutes, accelerating time-to-value.

Examples of SaaS in Data Streaming:

  • Confluent Cloud: A fully managed Kafka platform offering serverless scaling, multi-tenancy, and a broad feature set for both stateless and stateful processing.
  • Amazon Kinesis Data Analytics: A managed service for real-time analytics with automatic scaling.

What is PaaS in Data Streaming?

PaaS offerings sit between fully managed SaaS and infrastructure-as-a-service (IaaS). These solutions provide a platform to deploy and manage applications but still require significant user involvement for infrastructure management.

Characteristics of PaaS in Data Streaming

  1. Partial Management: The provider offers tools and frameworks, but users must manage servers, clusters, and scaling policies.
  2. Manual Configuration: Deployment involves provisioning VMs or containers, tuning parameters, and monitoring resource usage.
  3. Complex Scaling: Scaling is not always automatic; users may need to adjust resource allocation based on workload changes.
  4. Higher Overhead: PaaS requires more expertise and operational involvement, making it less accessible to teams without dedicated DevOps resources.

PaaS offerings in data streaming, while simplifying some infrastructure tasks, still require significant user involvement compared to fully serverless SaaS solutions. Below are three common examples, along with their benefits and pain points compared to serverless SaaS:

  • Apache Flink (Self-Managed on Kubernetes Cloud Service like EKS, AKS or GKE)
    • Benefits: Full control over deployment and infrastructure customization.
    • Pain Points: High operational overhead for managing Kubernetes clusters, manual scaling, and complex resource tuning.
  • Amazon Managed Service for Apache Flink (Amazon MSF)
    • Benefits: Simplifies infrastructure management and integrates with some other AWS services.
    • Pain Points: Users still handle job configuration, scaling optimization, and monitoring, making it less user-friendly than serverless SaaS solutions.
  • Amazon MSK (Managed Streaming for Apache Kafka)
    • Benefits: Eases Kafka cluster maintenance and integrates with the AWS ecosystem.
    • Pain Points: Requires users to design and manage producers/consumers, manually configure scaling, and handle monitoring responsibilities. MSK also excludes support for Kafka itself, so any operational issues within the Kafka layer of the infrastructure are the customer’s responsibility.

SaaS vs. PaaS: Key Differences

SaaS and PaaS differ in the level of management and user responsibility, with SaaS offering fully managed services for simplicity and PaaS requiring more user involvement for customization and control.

  • Infrastructure: SaaS is fully managed by the provider; PaaS is partially managed, with the user controlling clusters.
  • Scaling: SaaS scales automatically and serverless; PaaS requires manual or semi-automatic scaling.
  • Deployment Speed: SaaS is immediate and ready to use; PaaS is slower and requires configuration.
  • Operational Complexity: SaaS is minimal; PaaS is moderate to high.
  • Cost Model: SaaS is consumption-based with no idle costs; PaaS may incur idle resource costs.

The big benefit of PaaS over SaaS is greater flexibility and control, allowing users to customize the platform, integrate with specific infrastructure, and optimize configurations to meet unique business or technical requirements. This level of control is often essential for organizations with strict compliance, security, or data sovereignty requirements.

SaaS is NOT Always Better than PaaS!

Be careful: The limitations and pain points of PaaS do NOT mean that SaaS is always better.

A concrete example: Amazon MSK Serverless simplifies Apache Kafka operations with automated scaling and infrastructure management but comes with significant limitations, including the lack of fully-managed connectors, advanced data governance tools, and native multi-language client support.

Amazon MSK also excludes Kafka engine support, leading to potential operational risks and cost unpredictability, especially when integrating with additional AWS services for a complete data streaming pipeline. I explored these challenges in more detail in my article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“.

Bring Your Own Cloud (BYOC) as Alternative to PaaS

BYOC (Bring Your Own Cloud) offers a middle ground between fully managed SaaS and self-managed PaaS solutions, allowing organizations to host applications in their own VPCs.

BYOC provides enhanced control, security, and compliance while reducing operational complexity. This makes BYOC a strong alternative to PaaS for companies with strict regulatory or cost requirements.

As an example, here are Confluent’s options for deploying the data streaming platform: serverless Confluent Cloud, self-managed Confluent Platform (some consider this a PaaS if you leverage Confluent’s Kubernetes Operator and other automation/DevOps tooling), and WarpStream as a BYOC offering:

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

While BYOC complements SaaS and PaaS, it can be a better choice when fully managed solutions don’t align with specific business needs. I wrote a detailed article about this topic: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

“Serverless” Claims: Don’t Trust the Marketing

Many cloud data streaming solutions are marketed as “serverless,” but this term is often misused. A truly serverless solution should:

  1. Abstract Infrastructure: Users should never need to worry about provisioning, upgrading, or cluster sizing.
  2. Scale Transparently: Resources should scale up or down automatically based on workload.
  3. Eliminate Idle Costs: There should be no cost for unused capacity.

However, many products marketed as serverless still require some degree of infrastructure management or provisioning, making them closer to PaaS. For example:

  • A so-called “serverless” PaaS solution may still require setting initial cluster sizes or monitoring node health.
  • Some products charge for pre-provisioned capacity, regardless of actual usage.

Do Your Own Research

When evaluating data streaming solutions, dive into the technical documentation and ask pointed questions:

  • Does the solution truly abstract infrastructure management?
  • Are scaling policies automatic, or do they require manual configuration?
  • Is there a minimum cost even during idle periods?

By scrutinizing these factors, you can avoid falling for misleading “serverless” claims and choose a solution that genuinely meets your needs.

Choosing the Right Model for Your Data Streaming Business: SaaS, PaaS, or BYOC

When adopting a data streaming platform, selecting the right model is crucial for aligning technology with your business strategy:

  • Use SaaS (Software as a Service) if you prioritize ease of use, rapid deployment, and operational simplicity. SaaS is ideal for teams looking to focus entirely on application development without worrying about infrastructure.
  • Use PaaS (Platform as a Service) if you require deep customization, control over resource allocation, or have unique workloads that SaaS offerings cannot address.
  • Use BYOC (Bring Your Own Cloud) if your organization demands full control over its data but sees benefits in fully managed services. BYOC enables you to run the data plane within your cloud VPC, ensuring compliance, security, and architectural flexibility while leveraging SaaS functionality for the control plane.

In the rapidly evolving world of data streaming around Apache Kafka and Flink, SaaS data streaming platforms like Confluent Cloud provide the best of both worlds: the advanced features of tools like Apache Kafka and Flink, combined with the simplicity of a fully managed serverless experience. Whether you’re handling stateless stream processing or complex stateful analytics, SaaS ensures you’re scaling efficiently without operational headaches.

What deployment option do you use today for Kafka and Flink? Any changes planned in the future? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink appeared first on Kai Waehner.

]]>
How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) https://www.kai-waehner.de/blog/2024/10/12/how-microsoft-fabric-lakehouse-complements-data-streaming-apache-kafka-flink-et-al/ Sat, 12 Oct 2024 06:58:00 +0000 https://www.kai-waehner.de/?p=6870 In today's data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. This blog post explores how these technologies complement each other in enterprise architecture.

The post How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) appeared first on Kai Waehner.

]]>
In today’s data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing, offering quick insights and seamless system integration. They are ideal for applications that require immediate responses and handle transactional workloads. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. By leveraging both data streaming and lakehouse systems, businesses can effectively meet both short-term and long-term data needs. This blog post explores how these technologies complement each other in enterprise architecture.

Lakehouse and Data Streaming - Competitor or Complementary - Kafka Flink Confluent Microsoft Fabric Snowflake Databricks

This is part two of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

  1. What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks
  2. How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)
  3. When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

Data at Rest (Lakehouse) vs. Data in Motion (Data Streaming)

Data streaming technologies like Apache Kafka and Apache Flink enable continuous data processing while the data is in motion in an event-driven architecture. Data streaming enables immediate insights and seamless integration of data across systems. Kafka provides a robust real-time messaging and persistence platform, while Flink excels in low-latency stream processing, making them ideal for dynamic, stateful applications. A data streaming platform supports operational/transactional and analytical use cases.

Data lakes and data warehouses store data at rest before processing the data. The platforms are optimized for batch processing and long-term analytics, including AI/ML use cases such as model training. Some components provide near real-time capabilities, e.g. data ingestion or dashboards. Data lakes offer scalable, flexible storage for raw data, and data warehouses provide structured, high-performance environments for business intelligence and reporting, complementing the real-time capabilities of streaming technologies. Most leading data platforms provide a unified combination of data lake and data warehouse called lakehouse. Lakehouses are almost exclusively used for analytical workloads as they typically lack the SLAs and tight latency required for operational/transactional use cases.

Data Streaming with Apache Kafka Flink and Lakehouse with Microsoft Fabric Databricks Snowflake Apache Iceberg

Data streaming and lakehouses are complementary, with some overlaps but different sweet spots. If you want to learn more, check out these articles:

I also created a short ten minute video explaining the above concepts:

Let’s explore why data streaming and a lakehouse like Microsoft Fabric are complementary (with a few overlaps). I explained in the first blog of this series what Microsoft Fabric is. To understand the differences, it is important to understand what a data streaming platform really is.

There is a lot of confusion in the market. For instance, some folks still compare Apache Kafka to a message broker like RabbitMQ or IBM MQ. I mainly focus on Apache Kafka and Apache Flink as these are the de facto standards for data streaming across industries. Before talking about technologies and solutions, we need to start with the concept of an event-driven architecture as the foundation of data streaming.

Event-driven Architecture for Operational and Analytical Workloads

In today’s digital world, getting real-time data quickly is more important than ever. Traditional methods that process data in batches or via request-response APIs often cannot keep up when you need immediate insights.

Event-driven architecture offers a different approach by focusing on handling events – like transactions or user actions – as they happen. One of the key benefits of an event-driven architecture is its ability to decouple systems, meaning that different parts of a system can work independently. This makes it easier to scale and adapt to changes. An event-driven architecture excels in handling both operational and analytical workloads.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink
Source: Confluent

For operational tasks, the event-driven architecture enables real-time data processing, automating processes, enhancing customer experiences, and boosting efficiency. In e-commerce, for example, an event-driven system can instantly update inventory, trigger marketing campaigns, and detect fraud.

On the analytical side, the event-driven architecture allows organizations to derive insights from data as it flows, enabling real-time analytics and trend identification without the delays of batch processing. This is invaluable in sectors like finance and healthcare, where timely insights are crucial.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Building an event-driven architecture with data streaming technologies like Apache Kafka and Apache Flink enhances its potential. These platforms provide the infrastructure for high-throughput, low-latency data streams, enabling scalable and resilient event-driven systems.

Apache Kafka – The De Facto Standard for Event-driven Messaging and Integration

Apache Kafka has become the go-to platform for event-driven messaging and integration, transforming how organizations manage data in motion. Developed by LinkedIn and open-sourced, Kafka is a distributed streaming platform adept at handling real-time data feeds. More than 150,000 organizations use Kafka today.

Kafka’s architecture is based on a distributed commit log, ensuring data durability and consistency. It decouples data producers and consumers, allowing for flexible and scalable data architectures. Producers publish data to topics, and consumers subscribe independently, facilitating system evolution.

Apache Kafka's Commit Log for Real Time and Batch Data Streaming and Persistence Layer
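
To make this decoupling concrete, here is a minimal sketch using the standard Apache Kafka Java clients. The broker address (localhost:9092), the “orders” topic, and the “fraud-detection” consumer group are placeholder assumptions for illustration, not details from any platform discussed here.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DecoupledProducerConsumer {

    public static void main(String[] args) {
        // Producer: publishes events to the "orders" topic and knows nothing about consumers.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"amount\": 99.90}"));
        }

        // Consumer: subscribes independently; other consumer groups can read the same events
        // at their own pace, which is what enables true decoupling and system evolution.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "fraud-detection");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r -> System.out.printf("key=%s value=%s%n", r.key(), r.value()));
        }
    }
}
```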

Beyond messaging, Kafka serves as a robust integration platform, connecting diverse systems and enabling seamless data flow. Its ecosystem of connectors allows integration with databases, cloud services, and legacy systems. This helps organizations modernize their data infrastructure step by step.

Kafka’s stream processing capabilities, through Kafka Streams and integration with Apache Flink, further enhance the value of the streaming data pipelines. Kafka Streams allows real-time data processing within Kafka to enable complex transformations and enrichments, driving data-driven innovation.
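
As a rough illustration of such a continuous transformation, the following sketch uses the Kafka Streams DSL to filter and lightly enrich a stream of payment events. The topic names and the naive JSON string handling are simplified assumptions; a production topology would use proper serdes and schemas.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PaymentEnrichmentApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read raw payment events, drop test transactions, tag the rest as validated,
        // and write the result back to Kafka -- a simple stateless transformation that
        // runs continuously inside the application.
        KStream<String, String> payments = builder.stream("payments");
        payments
                .filter((key, value) -> value != null && !value.contains("\"test\":true"))
                .mapValues(value -> value.replace("}", ",\"validated\":true}"))
                .to("payments-validated");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```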

Apache Flink stands out as the leading framework for stream processing. It offers a versatile platform for streaming ETL and stateful business applications. Flink processes data streams with low latency and high throughput, suitable for diverse use cases.

Apache Flink for Real-Time Business Applications and Analytics
Source: Apache

Flink provides a unified programming model for batch and stream processing that allows developers to use the same API for real-time transactional or analytical batch tasks. This flexibility is a significant advantage as it enables varied data processing without separate tools.

A key feature of Flink is its stateful stream processing. This is crucial for maintaining state across events in real-time applications. Flink’s state management ensures accurate processing in complex scenarios. In contrast to many other stream processing solutions, Flink can do stateful processing even at an extreme scale (i.e., with a throughput of gigabytes per second).

Flink’s event time processing capabilities handle out-of-order or delayed events and ensure consistent results. Developers can define windows and triggers based on event timestamps, accommodating late-arriving data.
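
The sketch below illustrates these event-time concepts with the Flink DataStream API (roughly in the Flink 1.x style): bounded out-of-orderness watermarks plus a tumbling event-time window. The sensor tuples, timestamps, lateness tolerance, and window size are arbitrary assumptions to keep the example self-contained.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class EventTimeWindowJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each element: (sensorId, value, eventTimestampMillis). In a real job this would
        // come from Kafka; fromElements keeps the sketch self-contained.
        env.fromElements(
                Tuple3.of("sensor-1", 10.0, 1_700_000_000_000L),
                Tuple3.of("sensor-1", 12.0, 1_700_000_030_000L))
            // Watermarks tolerate events arriving up to 30 seconds out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Double, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((event, ts) -> event.f2))
            .keyBy(event -> event.f0)
            // Tumbling 1-minute windows based on event time, not processing time.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum(1)
            .print();

        env.execute("event-time-window-job");
    }
}
```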

Apache Flink supports multiple programming languages, including Java, Python, and SQL, offering developers the flexibility to use their preferred language for building stream processing applications. This is a key differentiator to other stream processing engines, such as Kafka Streams or KSQL.

The integration of Flink with Apache Kafka enhances its capabilities. Kafka serves as a reliable data source for Flink to enable seamless real-time data ingestion and processing. With Kafka’s persistent commit log, you can travel back in time and replay historical data in guaranteed ordering for analytical use cases. This combination supports high-volume, low-latency data pipelines, unlocking transactional real-time scenarios and batch analytics.
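
A minimal sketch of this integration, assuming the current Flink Kafka connector (KafkaSource) and a hypothetical topic, shows how starting from the earliest offset replays the retained history of a topic in per-partition order, for example to backfill an analytical job:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaReplayJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Reading from the earliest offset replays the full retained history of the topic,
        // leveraging Kafka's persistent commit log for "time travel" into past events.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("payments-validated")
                .setGroupId("flink-replay-job")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "payments-source")
           .print();

        env.execute("kafka-replay-job");
    }
}
```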

In summary, Apache Flink’s robust stream processing, combined with Apache Kafka, offers a powerful solution for organizations seeking to leverage real-time data. Whether for operational tasks, real-time analytics, or complex event processing (CEP), Flink provides the necessary flexibility and performance for data-driven innovation.

In the growing landscape of data management, it’s crucial to understand the complementary roles of Microsoft Fabric and data streaming technologies. While some may perceive these technologies as competitors, they actually serve distinct yet interconnected purposes that enhance an organization’s data strategy. And keep in mind that Microsoft Fabric is not just an offering for Azure cloud. Hybrid edge scenarios in the IoT space are perfect for Microsoft Fabric and data streaming together.

Microsoft Fabric’s Streaming Ingestion: A Common Feature Among Lakehouses

Microsoft Fabric, like other modern lakehouse platforms such as Snowflake and Databricks, offers streaming ingestion capabilities. This feature is essential for handling near real-time data flows. It allows organizations to capture and process data as it arrives in the lakehouse. However, it’s important to distinguish between operational and analytical workloads when considering the role of streaming ingestion.

Operational workloads benefit from the immediacy of streaming data, enabling real-time decision-making and process automation. In contrast, analytical workloads often require data to be stored at rest for in-depth analysis and reporting. Microsoft Fabric’s architecture focuses on streaming ingestion into robust storage solutions for analytical purposes.

The Fabric Real Time Intelligence Hub: Understanding Its Capabilities and Limitations

The integration of streaming ingestion into Microsoft Fabric is part of its Real Time Intelligence Hub, which aims to provide a comprehensive platform for managing real-time data. However, beyond the marketing and buzz around Fabric Real Time Intelligence Hub, it’s important to note that it doesn’t operate in true real-time.

Instead, Fabric’s “Real Time Intelligence Hub” uses Spark Streaming jobs to manipulate data, which can introduce some latency. And the infrastructure is not meant for critical SLAs that are required by operational / transactional systems. Additionally, the ingestion process is throttled when using Power BI and other batch analytics tools via an API gateway with a Kafka client.

Microsoft tends to introduce new names for Fabric products or features that are actually just a rebranding of existing services. If you come across new terms such as “eventhouse” or the “event streams feature in the Microsoft Fabric Real-Time Intelligence”, make sure to evaluate whether this is really a new component or just Fabric marketing.

Therefore, despite some overlapping with a data streaming platform, the collaboration between Microsoft Fabric and data streaming vendors like Confluent (Kafka, Flink) underscores the complementary nature of these platforms. By leveraging the strengths of both, organizations can build a robust data infrastructure that supports real-time operations and comprehensive analytics.

In conclusion, Microsoft Fabric and data streaming technologies such as Kafka and Flink are not competitors but complementary tools that, when used together, can significantly enhance an organization’s ability to manage and analyze data. By understanding the distinct roles each plays, businesses can create a more agile and responsive data strategy that meets both operational and analytical needs.

In the modern enterprise architecture landscape, data streaming and lakehouse platforms are pivotal in creating a robust and flexible data ecosystem. Data streaming technologies enable continuous data ingestion and processing for operational and analytical use cases.

Lakehouse platforms, like Microsoft Fabric, Snowflake and Databricks, provide a unified architecture that combines the best of data lakes and data warehouses, offering scalable storage and advanced analytics capabilities.

Together, these technologies empower businesses to handle both operational and analytical workloads efficiently, breaking down data silos and fostering a data-driven culture. By integrating data streaming with lakehouse architectures, enterprises can achieve seamless data flow and comprehensive insights across their operations.

Reverse ETL is an Anti-Pattern

Reverse ETL is the process of moving data from a data store at rest back into operational systems. It is often considered an anti-pattern in modern data architecture. This approach can lead to data inconsistencies, increased complexity, and higher maintenance costs, as it essentially reverses the natural flow of data in motion. Do NOT store data in Microsoft Fabric Lakehouse just to reverse it later into other operational systems!

Data at Rest and Reverse ETL

Instead of relying on reverse ETL, organizations should focus on building real-time data pipelines that enable direct integration between data sources and operational systems. By leveraging an event-driven architecture and data streaming technologies, businesses can ensure that data is consistently updated and available where it’s needed most. This approach not only simplifies data management, but also enhances the accuracy and timeliness of insights.

Apache Iceberg as the De Facto Standard for an Open Table Format – Store Once, Analyze with any Tool

Apache Iceberg has emerged as the de facto standard for an open table format. It offers the opportunity to store data once in an object store like Amazon S3 and analyze it with various tools. With its ability to handle large-scale datasets and support ACID transactions, Iceberg provides a reliable and efficient way to manage data in a lakehouse environment.

Apache Iceberg Open Table Format for Data Lake Lakehouse Streaming wtih Kafka Flink Databricks Snowflake AWS GCP Azure

Organizations can use their preferred analytics and processing tools without being locked into a specific vendor. This flexibility is crucial for businesses looking to maximize their data investments and adapt to changing technological landscapes. By adopting Apache Iceberg together with data streaming, enterprises can ensure data consistency and accessibility across all business units to drive better data quality, insights, and decision-making.

Shift Left Architecture to Support Operational and Analytical Workloads with an Event-driven Architecture

Traditionally, many organizations use data streaming with Kafka as a dumb pipeline to ingest all raw data into a data lake. The consequences are high compute cost for multiple (re-)processing of the raw data, inconsistencies across business units, and slow time to market for new applications.

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch - Ingestion via Kafka into Microsoft Fabric

The Shift Left Architecture is a forward-thinking approach that integrates operational and analytical workloads within an event-driven architecture. By shifting data processing closer to the source, this architecture enables real-time data ingestion and analysis, improving data quality for lakehouse ingestion, reducing latency and improving responsiveness.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse - Processing with Kafka and Flink before Microsoft Fabric

Event-driven architectures, powered by technologies like Apache Kafka and Flink, facilitate the seamless flow and processing of data across systems. Shift Left ensures that both operational and analytical needs are met. This approach not only enhances the agility of data-driven applications, but also supports continuous improvement and innovation. By adopting a Shift Left Architecture, organizations can streamline their data processes, improve efficiency, and gain a competitive edge in the market.

Example: Confluent (Data Streaming) + Microsoft Fabric (Lakehouse) + Snowflake (Another Lakehouse)

An example of integrating data streaming and lakehouse technologies is the combination of Confluent, Microsoft Fabric, and Snowflake.

Confluent, built on Apache Kafka and Flink, provides a robust platform for real-time data streaming, enabling organizations to integrate with operational and analytical workloads.

Microsoft Fabric and Snowflake, both lakehouse platforms, offer scalable storage and advanced analytics capabilities that allow businesses to perform in-depth analysis and reporting on historical data, near real-time analytics, and AI model training.

Shift Left Architecture with Confluent Kafka Snowflake and Microsoft Fabric for Data Streaming and Lakehouse

Apache Iceberg enables storing data once and connects any analytical engine to the data, including lakehouses such as Microsoft Fabric or Snowflake, and unified batch and streaming frameworks such as Apache Flink. Iceberg improves the overall data quality for data sharing, reduces storage cost and enables a much faster rollout of new analytical applications.

By leveraging Confluent for data streaming and integrating it with Microsoft Fabric and Snowflake, organizations can create a comprehensive data architecture that supports both real-time operations and long-term analytics. This synergy not only enhances data accessibility and consistency but also empowers businesses to make data-driven decisions with confidence.

In conclusion, the synergy between Microsoft Fabric and data streaming technologies like Apache Kafka and Apache Flink creates a powerful combination for modern data management. While Microsoft Fabric excels in providing robust analytics and storage capabilities, data streaming platforms offer real-time data processing and integration, ensuring that businesses can respond swiftly to operational demands.

By leveraging both technologies together, organizations can build a comprehensive data architecture that supports both immediate and long-term needs, enhancing their ability to make informed, data-driven decisions. This complementary relationship not only breaks down data silos, but also fosters a more agile and responsive data strategy. As businesses continue to navigate the complexities of data management, understanding and using the strengths of both Microsoft Fabric and data streaming with data streaming vendors like Confluent will be key to achieving a competitive edge.

The Shift Left Architecture, when paired with Apache Iceberg’s open table format, simplifies the integration of data streaming with one or more lakehouses. This combination enhances data quality for all data consumers and significantly reduces overall storage costs.

In part three of this blog series, I will dig deeper into the data streaming alternatives. When to choose open source frameworks such as Apache Kafka and Flink, a leading data streaming platform such as Confluent, or a native Azure service like Event Hubs. Primer: The trade-offs are huge. Do a proper evaluation BEFORE choosing your data streaming solution.

How do you see the combination of a lakehouse like Microsoft Fabric with data streaming? Do you already use both together? And what is your strategy for other data lakes and data warehouses you already have in your enterprise architecture, such as Databricks or Snowflake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) appeared first on Kai Waehner.

]]>
Open Standards for Data Lineage: OpenLineage for Batch AND Streaming https://www.kai-waehner.de/blog/2024/05/13/open-standards-for-data-lineage-openlineage-for-batch-and-streaming/ Mon, 13 May 2024 05:20:11 +0000 https://www.kai-waehner.de/?p=6364 One of the greatest wishes of companies is end-to-end visibility in their operational and analytical workflows. Where does data come from? Where does it go? To whom am I giving access to? How can I track data quality issues? The capability to follow the data flow to answer these questions is called data lineage. This blog post explores market trends, efforts to provide an open standard with OpenLineage, and how data governance solutions from vendors such as IBM, Google, Confluent and Collibra help fulfil the enterprise-wide data governance needs of most companies, including data streaming technologies such as Apache Kafka and Flink.

The post Open Standards for Data Lineage: OpenLineage for Batch AND Streaming appeared first on Kai Waehner.

]]>
One of the greatest wishes of companies is end-to-end visibility in their operational and analytical workflows. Where does data come from? Where does it go? To whom am I giving access to? How can I track data quality issues? The capability to follow the data flow to answer these questions is called data lineage. This blog post explores market trends, efforts to provide an open standard with OpenLineage, and how data governance solutions from vendors such as IBM, Google, Confluent and Collibra help fulfil the enterprise-wide data governance needs of most companies, including data streaming technologies such as Apache Kafka and Flink.

Data Lineage for Data Streaming with OpenLineage Apache Kafka and Flink

What is Data Governance?

Data governance refers to the overall management of the availability, usability, integrity, and security of data used in an organization. It involves establishing processes, roles, policies, standards, and metrics to ensure that data is properly managed throughout its lifecycle. Data governance aims to ensure that data is accurate, consistent, secure, and compliant with regulatory requirements and organizational policies. It encompasses activities such as data quality management, data security, metadata management, and compliance with data-related regulations and standards.

What is the Business Value of Data Governance?

The business value of data governance is significant and multifaceted:

  1. Improved Data Quality: Data governance ensures that data is accurate, consistent, and reliable, leading to better decision-making, reduced errors, and improved operational efficiency.
  2. Enhanced Regulatory Compliance: By establishing policies and procedures for data management and ensuring compliance with regulations such as GDPR, HIPAA, and CCPA, data governance helps mitigate risks associated with non-compliance, including penalties and reputational damage.
  3. Increased Trust and Confidence: Effective data governance instills trust and confidence in data among stakeholders. It leads to greater adoption of data-driven decision-making and improved collaboration across departments.
  4. Cost Reduction: By reducing data redundancy, eliminating data inconsistencies, and optimizing data storage and maintenance processes, data governance helps organizations minimize costs associated with data management and compliance.
  5. Better Risk Management: Data governance enables organizations to identify, assess, and mitigate risks associated with data management, security, privacy, and compliance, reducing the likelihood and impact of data-related incidents.
  6. Support for Business Initiatives: Data governance provides a foundation for strategic initiatives such as digital transformation, data analytics, and AI/ML projects by ensuring that data is available, accessible, and reliable for analysis and decision-making.
  7. Competitive Advantage: Organizations with robust data governance practices can leverage data more effectively to gain insights, innovate, and respond to market changes quickly, giving them a competitive edge in their industry.

Overall, data governance contributes to improved data quality, compliance, trust, cost efficiency, risk management, and competitiveness, ultimately driving better business outcomes and value creation.

What is Data Lineage?

Data lineage refers to the ability to trace the complete lifecycle of data, from its origin through every transformation and movement across different systems and processes. It provides a detailed understanding of how data is created, modified, and consumed within an organization’s data ecosystem, including information about its source, transformations applied, and destinations.

Data Lineage is an essential component of Data Governance: Understanding data lineage helps organizations ensure data quality, compliance with regulations, and adherence to internal policies by providing visibility into data flows and transformations.

Data Lineage is NOT Event Tracing!

Event tracing and data lineage are different concepts that serve distinct purposes in the realm of data management:

Data Lineage:

  • Data lineage refers to the ability to track and visualize the complete lifecycle of data, from its origin through every transformation and movement across different systems and processes.
  • It provides a detailed understanding of how data is created, modified, and consumed within an organization’s data ecosystem, including information about its source, transformations applied, and destinations.
  • Data lineage focuses on the flow of data and metadata, helping organizations ensure data quality, compliance, and trustworthiness by providing visibility into data flows and transformations.

Event Tracing:

  • Event tracing, also known as distributed tracing, is a technique used in distributed systems to monitor and debug the flow of individual requests or events as they traverse through various components and services.
  • It involves instrumenting applications to generate trace data, which contains information about the path and timing of events as they propagate across different nodes and services.
  • Event tracing is primarily used for performance monitoring, troubleshooting, and root cause analysis in complex distributed systems, helping organizations identify bottlenecks, latency issues, and errors in request processing.

In summary, data lineage focuses on the lifecycle of data within an organization’s data ecosystem, while event tracing is more concerned with monitoring the flow of individual events or requests through distributed systems for troubleshooting and performance analysis.

Here is an example in payments processing: Data lineage would track the path of payment data from initiation to settlement, detailing each step and transformation it undergoes. Meanwhile, event tracing would monitor individual events within the payment system in real-time, capturing the sequence and outcome of actions, such as authentication checks and transaction approvals.

What is the Standard ‘OpenLineage’?

OpenLineage is an open-source project that aims to standardize metadata management for data lineage. It provides a framework for capturing, storing, and sharing metadata related to the lineage of data as it moves through various stages of processing within an organization’s data infrastructure. By providing a common format and APIs for expressing and accessing lineage information, OpenLineage enables interoperability between different data processing systems and tools, facilitating data governance, compliance, and data quality efforts.

Data Model of OpenLineage for Kafka, Flink, Iceberg, Object Storage
Source: OpenLineage (presented at Kafka Summit London 2024)

OpenLineage is an open platform for the collection and analysis of data lineage. It includes an open standard for lineage data collection, integration libraries for the most common tools, and a metadata repository/reference implementation (Marquez). Many frameworks and tools already provide producer and consumer integrations:

OpenLineage Supported Producers and Consumers including Flink, dbt, Airflow, Spark, Marquez, atlan, manta, Snowflake, etc.
Source: OpenLineage (presented at Kafka Summit London 2024)

Data streaming involves the real-time processing and movement of data through a distributed messaging platform. This enables organizations to efficiently ingest, process, and analyze large volumes of data from various sources. By decoupling data producers and consumers, a data streaming platform provides a scalable and fault-tolerant solution for building real-time data pipelines to support use cases such as real-time analytics, event-driven architectures, and data integration.

The de facto standard for data streaming is Apache Kafka, used by over 100,000 organizations. Kafka is not just used for big data analytics; it also supports transactional workloads.

Data Governance Differences with Data Streaming Compared to Data Lake and Data Warehouse?

Implementing data governance and lineage with data streaming presents several differences and challenges compared to data lakes and data warehouses:

  1. Real-Time Nature: Data streaming involves the processing of data in real-time when it is generated, whereas data lakes and data warehouses typically deal with batch processing of historical data. This real-time nature of data streaming requires governance processes and controls that can operate at the speed of streaming data ingestion, processing, and analysis.
  2. Dynamic Data Flow: Data streaming environments are characterized by dynamic and continuous data flows, with data being ingested, processed, and analyzed in near-real-time. This dynamic nature requires data governance mechanisms that can adapt to changing data sources, schemas, and processing pipelines in real-time, ensuring that governance policies are applied consistently across the entire streaming data ecosystem.
  3. Granular Data Lineage: In data streaming, data lineage needs to be tracked at a more granular level compared to data lakes and data warehouses. This is because streaming data often undergoes multiple transformations and enrichments as it moves through streaming pipelines. In some cases, the lineage of each individual data record must be traced to ensure data quality, compliance, and accountability.
  4. Immediate Actionability: Data streaming environments often require immediate actionability of data governance policies and controls to address issues such as data quality issues, security breaches, or compliance violations in real-time. This necessitates the automation of governance processes and the integration of governance controls directly into streaming data processing pipelines, enabling timely detection, notification, and remediation of governance issues.
  5. Scalability and Resilience: Data streaming platforms like Apache Kafka and Apache Flink are designed for scalability and resilience to handle both high volumes of data and transactional workloads with critical SLAs. The platform must ensure continuous stream processing even in the face of failures or disruptions. Data governance mechanisms in streaming environments need to be similarly scalable and resilient to keep pace with the scale and speed of streaming data processing, ensuring consistent governance enforcement across distributed and resilient streaming infrastructure.
  6. Metadata Management Challenges: Data streaming introduces unique challenges for metadata management, as metadata needs to be captured and managed in real-time to provide visibility into streaming data pipelines, schema evolution, and data lineage. This requires specialized tools and techniques for capturing, storing, and querying metadata in streaming environments, enabling stakeholders to understand and analyze the streaming data ecosystem effectively.

In summary, implementing data governance with data streaming requires addressing the unique challenges posed by the real-time nature, dynamic data flow, granular data lineage, immediate actionability, scalability, resilience, and metadata management requirements of streaming data environments. This involves adopting specialized governance processes, controls, tools, and techniques tailored to the characteristics and requirements of data streaming platforms like Apache Kafka and Apache Flink.

Schemas and Data Contracts for Streaming Data

Schemas and data contracts are the foundation of data governance for streaming data. Confluent Schema Registry is available on GitHub. It became the de facto standard for ensuring data quality and governance in Kafka projects across all industries, not just in Confluent projects but also in the broader community leveraging open source technologies. Schema Registry is available under the Confluent Community License that allows deployment in production scenarios with no licensing costs.

Confluent Schema Registry for good Data Quality and Governance using Apache Kafka
Source: Confluent
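
As a rough illustration of how a schema acts as a data contract, the sketch below configures a Kafka producer with Confluent’s Avro serializer and a Schema Registry endpoint; the serializer registers the schema and rejects records that do not conform to it. The topic, schema, broker address, and registry URL are placeholder assumptions.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroProducerWithSchemaRegistry {

    // Hypothetical Avro schema acting as the data contract for the "payments" topic.
    private static final String PAYMENT_SCHEMA = "{"
            + "\"type\":\"record\",\"name\":\"Payment\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer looks up / registers the schema in Schema Registry and
        // rejects records that do not match it, enforcing the contract at produce time.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(PAYMENT_SCHEMA);
        GenericRecord payment = new GenericData.Record(schema);
        payment.put("id", "tx-1001");
        payment.put("amount", 99.90);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "tx-1001", payment));
        }
    }
}
```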

For more details, check out my article “Policy Enforcement and Data Quality for Apache Kafka with Schema Registry“. And here are two great case studies for financial services companies leveraging schemas and group-wide API contracts across the organization for data governance:

Confluent Schema Registry at ING Bank for Data Governance and Quality
Source: ING Bank

Confluent Cloud is an excellent example of a data governance solution for data streaming. The fully-managed data streaming platform provides capabilities such as

  • Data Catalog
  • Data Lineage
  • Stream Sharing
  • Data Portal

The Data Portal combines the capabilities in an intuitive user interface to discover, explore, access and use streaming data products:

Confluent Cloud Data Portal for Data Governance
Source: Confluent

All information and functionality are available in the UI for humans and as an API for integration scenarios.

If you want to learn more about data streaming and data governance in a fun way, check out the free comic ebook “The Data Streaming Revolution: The Force of Kafka + Flink Awakens“:

Comic Ebook - The Data Streaming Revolution: The Force of Kafka + Flink Awakens
Source: Confluent

Data Lineage for Streaming Data

As a core foundation of data governance, data streaming projects require good data lineage for visibility and control. Today’s market mainly provides two options: building a custom project or buying a commercial product/cloud service. But the market is evolving. Open standards for data lineage are emerging and integrating data streaming into their implementations.

Let’s explore an example of a commercial solution and an open standard for streaming data lineage:

  • Cloud service: Data Lineage as part of Confluent Cloud
  • Open standard: OpenLineage’s integration with Apache Flink and Marquez

To move forward with updates to critical applications or answer questions on important subjects like data regulation and compliance, teams need an easy means of comprehending the big picture journey of data in motion. Confluent Cloud provides a solution deeply integrated with Kafka and Flink as part of the fully managed SaaS offering.

Stream Lineage in Confluent Cloud
Source: Confluent

Stream lineage provides a graphical UI of event streams and data relationships with both a bird’s eye view and drill-down magnification for answering questions like:

  • Where did data come from?
  • Where is it going?
  • Where, when, and how was it transformed?

Answers to questions like these allow developers to trust the data they’ve found, and gain the visibility needed to make sure their changes won’t cause any negative or unexpected downstream impact. Developers can learn and decide quickly with live metrics and metadata inspection embedded directly within lineage graphs.

The Confluent documentation goes into much more detail, including examples, tutorials, free cloud credits, etc. Most of the above description is also copied from there.

In recent months, stream processing has gained particular focus within the OpenLineage community, as described in a dedicated talk at Kafka Summit 2024 in London.

Many useful features for stream processing have been completed or begun in OpenLineage’s implementation, including:

  • A seamless OpenLineage and Apache Flink integration
  • Support for streaming jobs in data catalogs like Marquez, manta, atlan
  • Progress on a built-in lineage API within the Flink codebase

Here is a screenshot from the live demo of the Kafka Summit talk that shows data lineage across Kafka Topics, Flink applications, and other databases with the reference implementation of OpenLineage (Marquez):

Data Lineage for Stream Processing - OpenLineage Integration with Marquez, Kafka and Flink
Source: OpenLineage (presented at Kafka Summit London 2024)

The OpenLineage Flink integration is at an early stage and has limitations, such as no support for Flink SQL or the Table API yet. But this is an important initiative. Cross-platform lineage enables a holistic overview of data flow and its dependencies within organizations. This must include stream processing (which often runs the most critical workloads in an enterprise).

The Need for Enterprise-Wide Data Governance and Data Lineage

Data Governance, including Data Lineage, is an enterprise-wide challenge. OpenLineage is an excellent approach for an open standard to integrate with various data platforms, such as data streaming platforms, data lakes, data warehouses, lakehouses, and any other business applications.

However, we are still early on this journey. Most companies (have to) build custom solutions today for enterprise-wide governance and lineage of data across various platforms. Short term, most companies leverage purpose-built data governance and lineage features from cloud products like Confluent, Databricks and Snowflake. This makes sense as it creates visibility in the data flows and improves data quality.

Enterprise-wide data governance needs to integrate with all the different data platforms. Today, most companies have built their own solutions – if they have anything at all (most don’t yet)… Dedicated enterprise governance suites like Collibra or Microsoft Purview are adopted more and more to solve these challenges. And software/cloud vendors like Confluent integrate their purpose-built data lineage and governance into these platforms, either via open APIs or via direct and certified integrations.

Balancing Standardization and Innovation with Open Standards and Cloud Services

OpenLineage is a great community initiative to standardize the integration between data platforms and data governance. Hopefully, vendors will adopt such open standards in the future. Today, it is still early days, and you will probably integrate via open APIs or certified (proprietary) connectors.

Balancing standardization and innovation is always a trade-off: Finding the right balance between standardization and innovation entails simplicity, flexibility, and diligent review processes, with a focus on addressing real-world pain points and fostering community-driven extensions.

If you want to learn more about open standards for data governance, please watch this expert panel on data lineage, where Accenture and Confluent welcomed experts from OpenLineage, Collibra, Google, IBM / Manta, IBM / Egeria, Atlan, and Confluent (actually me).

Data Lineage Panel with Confluent IBM Mantra Collibra Google OpenLineage Accenture

How do you implement data governance and lineage? Do you already leverage OpenLineage or other standards? Or are you investing in commercial products? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Open Standards for Data Lineage: OpenLineage for Batch AND Streaming appeared first on Kai Waehner.

]]>
When NOT to Use Apache Kafka? (Lightboard Video) https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/ Tue, 26 Mar 2024 06:45:11 +0000 https://www.kai-waehner.de/?p=6262 Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video to watch an explanation about when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support for various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management  – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings have emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to gain market share by leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

More than 150,000 organizations use Apache Kafka today. The Kafka protocol is the de facto standard for many open source frameworks, commercial products and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
Data Streaming is not a Race, it is a Journey! https://www.kai-waehner.de/blog/2023/04/13/data-streaming-is-not-a-race-it-is-a-journey/ Thu, 13 Apr 2023 07:15:31 +0000 https://www.kai-waehner.de/?p=5352 Data Streaming is not a race, it is a journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
Data Streaming is not a race, it is a Journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. Legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud setups are the norm, not an exception. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

Data Streaming is not a Race it is a Journey

Data Streaming is a Journey, not a Race!

Confluent’s maturity model is used across thousands of customers to analyze the status quo, deploy real-time infrastructure and applications, and plan for a strategic event-driven architecture to ensure success and flexibility in the future of the multi-year data streaming journey:

Data Streaming Maturity Model
Source: Confluent

The following sections show success stories from various companies across industries that moved through the data streaming journey. Each journey looks different. Each company has different technologies, vendors, strategies, and legal requirements.

Before we begin, I must stress that these journeys are NOT just successful because of the newest technologies like Apache Kafka, Kubernetes, and public cloud providers like AWS, Azure, and GCP. Success comes with a combination of technologies and expertise.

Consulting – Expertise from Software Vendors and System Integrators

All the below success stories combined open source technologies, software vendors, cloud providers, internal business and technical experts, system integrators, and consultants of software vendors.

Long story short: Technology and expertise are required to make your data streaming journey successful.

We not only sell data streaming products and cloud services but also offer advice and essential support. Note that the bundle numbers (1 to 5) in the following diagram are related to the above data streaming maturity curve:

Professional Services along the Data Streaming Journey
Source: Confluent

Other vendors have similar strategies to support you. The same is true for the system integrators. Learn together. Bring people from different companies into the room to solve your business problems.

With this background, let’s look at the fantastic data streaming journeys we heard about at past Kafka Summits, Data in Motion events, Confluent blog posts, or other similar public knowledge-sharing alternatives.

Customer Journeys across Industries powered by Apache Kafka

Apache Kafka is the DE FACTO standard for data streaming. However, each data streaming journey looks different. Don’t underestimate how much you can learn from other industries. These companies might have different legal or compliance requirements, but the overall IT challenges are often very similar.

We cover stories from various industries, including banks, retailers, insurers, manufacturers, healthcare organizations, energy providers, and software companies.

The following data streaming journeys are explored in the below sections:

  1. AO.com: Real-Time Clickstream Analytics
  2. Nordstrom: Event-driven Analytics Platform
  3. Migros: End-to-End Supply Chain with IoT
  4. NordLB: Bank-wide Data Streaming for Core Banking
  5. Raiffeisen Bank International: Strategic Real-Time Data Integration Platform Across 13 Countries
  6. Allianz: Legacy Modernization and Cloud-Native Innovation
  7. Optum (UnitedHealth Group): Self-Service Data Streaming
  8. Intel: Real-Time Cyber Intelligence at Scale with Kafka and Splunk
  9. Bayer: Transition from On-Premise to Multi-Cloud
  10. Tesla: Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)
  11. Siemens: Integration between On-Premise SAP and Salesforce CRM in the Cloud
  12. 50Hertz: Transition from Proprietary SCADA to Cloud-Native Industrial IoT

The stories are grouped by industry but otherwise follow no specific order. If you read them, you will see that each is different, yet you can learn a lot from all of them, no matter what industry your business is in.

For more information about data streaming in a specific industry, just search my blog for case studies and architectures. Recent blog posts focus on the state of data streaming in manufacturing, in financial services, and so on.

AO.com – Real-Time Clickstream Analytics

AO.com is an electrical retailer. The hyper-personalized online retail experience turns each customer visit into a one-on-one marketing opportunity. The technical implementation correlates historical customer data with real-time digital signals.

Years ago, in the Apache Hadoop and Apache Spark era, this kind of clickstream analytics was the typical “Hello World” example. AO.com does real-time clickstream analytics powered by data streaming.

AO.com's journey started with relatively simple data streaming pipelines leveraging the Schema Registry for decoupling and API contracts. Over time, treating data streaming as a platform added more business value, and operations shifted to the fully managed Confluent Cloud so the team could focus on implementing business applications instead of operating infrastructure.
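
To make the Schema Registry contract idea concrete, here is a minimal, hypothetical Java sketch of a producer that serializes clickstream events with Avro and the Confluent Schema Registry. The topic name, schema fields, and endpoint URLs are illustrative assumptions and not AO.com's actual implementation; it assumes the Confluent Avro serializer is available on the classpath.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickstreamProducer {

    // Hypothetical Avro schema acting as the API contract for page view events.
    private static final String PAGE_VIEW_SCHEMA =
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
      + "{\"name\":\"userId\",\"type\":\"string\"},"
      + "{\"name\":\"url\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                 // placeholder broker address
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");        // registers and validates the schema
        props.put("schema.registry.url", "http://localhost:8081");        // placeholder Schema Registry URL

        Schema schema = new Schema.Parser().parse(PAGE_VIEW_SCHEMA);
        GenericRecord pageView = new GenericData.Record(schema);
        pageView.put("userId", "user-42");
        pageView.put("url", "/products/washing-machines");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Consumers rely on the registered schema, not the producer's code, to read this event.
            producer.send(new ProducerRecord<>("clickstream.pageviews", "user-42", pageView));
        }
    }
}
```

Because consumers read the schema from the registry instead of relying on the producer's code, both sides stay decoupled as long as schema evolution follows the agreed compatibility rules.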

Apache Kafka at AO for Retail and Customer 360 Omnichannel Clickstream Analytics
Source: AO.com

Nordstrom – Event-driven Analytics Platform

Nordstrom is an American luxury department store chain. Today, more than 50% of revenue comes from online sales. They built the Nordstrom Analytical Platform (NAP) as the heart of their event-driven architecture. Nordstrom can use a single event for analytical, functional, operational, and model-building purposes.

As Nordstrom said at Kafka Summit: “If it’s not in NAP, it didn’t happen.”

Nordstrom's journey started in 1901 with the first stores. The last 25 years brought the retailer into the digital world and online retail. NAP has been a core component for real-time analytics since 2017. The future is all about automated decision-making in real time.

Nordstrom Analytics Platform with Apache Kafka and Data Lake for Online Retail
Source: Nordstrom

Migros – End-to-End Supply Chain with IoT

Migros is Switzerland’s largest retail company, the largest supermarket chain, and the largest employer. They leverage data streaming with Confluent to distribute master data across many systems.

Migros' supply chain is optimized with a single data streaming pipeline (including replaying entire days of events). For instance, real-time transportation information is visualized with MQTT and Kafka. Afterward, more advanced business logic was implemented, like forecasting truck arrival times and planning or rescheduling truck tours.
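
Replaying whole days of events works because Kafka is durable storage with replayable offsets, not a fire-and-forget message pipe. The sketch below uses the standard Java consumer API to rewind to a timestamp; the topic name, consumer group, and date are hypothetical placeholders rather than details of Migros' actual setup.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplayWholeDay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "transport-replay");         // hypothetical consumer group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        long startOfDay = Instant.parse("2023-01-01T00:00:00Z").toEpochMilli();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic holding transportation events.
            List<TopicPartition> partitions = consumer.partitionsFor("truck-telemetry").stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .collect(Collectors.toList());
            consumer.assign(partitions);

            // Find the earliest offset at or after the chosen timestamp for every partition...
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(
                partitions.stream().collect(Collectors.toMap(tp -> tp, tp -> startOfDay)));
            offsets.forEach((tp, offset) -> {
                if (offset != null) {
                    consumer.seek(tp, offset.offset()); // ...and rewind the consumer to that point.
                }
            });

            // From here on, the application re-consumes the whole day of events.
            consumer.poll(Duration.ofSeconds(1)).forEach(record -> System.out.println(record.value()));
        }
    }
}
```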

The data streaming journey started with a single Kafka cluster and grew into various independent Kafka clusters and a strategic Kafka-powered enterprise integration platform.

History of Apache Kafka for Retail and Logistics at Migros
Source: Migros

NordLB – Bank-wide Data Streaming for Core Banking

Norddeutsche Landesbank (NordLB) is one of the largest commercial banks in Germany. They implemented an enterprise-wide transformation. The new Confluent-powered core banking platform enables event-based and truly decoupled stream processing. Use cases include improved real-time analytics, fraud detection, and customer retention.

Unfortunately, NordLB’s slide is only available in German. But I guess you can still follow their data streaming journey moving over the years from on-premise big data batch processing with Hadoop to real-time data streaming and analytics in the cloud with Kafka and Confluent:

Enterprise Wide Apache Kafka Adoption at NordLB for Banking and Financial Services
Source: Norddeutsche Landesbank

Raiffeisen Bank International – Strategic Real-Time Data Integration Platform across 13 Countries

Raiffeisen Bank International (RBI) operates as a corporate and investment bank in Austria and as a universal bank in Central and Eastern Europe (CEE).

RBI’s bank transformation across 13 countries includes various components:

  • Bank-wide transformation program
  • Realtime Integration Center of Excellence (“RICE”)
  • Central platform and reference architecture for self-service re-use
  • Event-driven integration platform (fully-managed cloud and on-premise)
  • Group-wide API contracts (=schemas) and data governance

While I don't have a nice diagram of RBI's data streaming journey over the past few years, I can show you one of the most impressive migration stories. The context is horrible: RBI had to move from its Ukrainian data centers to the public cloud after the Russian invasion started in early 2022. From an IT perspective, the journey was super impressive, as the migration happened without downtime or data loss.

RBI’s CTO recapped:

“We did this migration in three months, because we didn’t have a choice.”

Migration from On Premise to Cloud in Ukraine at Raiffeisen Bank International
Source: McKinsey

Allianz – Legacy Modernization and Cloud-Native Innovation

Allianz is a European multinational financial services company headquartered in Munich, Germany. Its core businesses are insurance and asset management. The company is one of the largest insurers and financial services groups.

A large organization like Allianz does not have just one enterprise architecture. This section has two independent stories of two separate Allianz business units.

One team of Allianz has built a so-called Core Insurance Service Layer (CISL). The Kafka-powered enterprise architecture is flexible and evolutionary. Why? Because it needs to integrate with hundreds of old and new applications. Some are running on the mainframe or use a batch file transfer. Others are cloud-native, running in containers or as a SaaS application in the public cloud.

Data streaming ensures the decoupling of applications via events and the reliable integration of different backend applications, APIs, and communication paradigms. Legacy and new applications not only connect but can also run in parallel (old V1 and new V2) before the legacy component is migrated away from and shut down.

Allianz Insurance Core Integration Layer with Data Streaming and Kafka
Source: Allianz

Allianz Direct is a separate Kafka deployment. This business unit started with a greenfield approach to build insurance as a service in the public cloud. The cloud-native platform is elastic, scalable, and open. Hence, one platform can be used across countries with different legal, compliance, and data governance requirements.

This data streaming journey is best described by looking at a quote from Allianz Direct’s CTO:

Data Streaming is a Cloud Journey at Allianz Direct
Source: Linkedin (November 2022)

Optum – Self-Service Data Streaming

Optum is an American pharmacy benefit manager and healthcare provider (a UnitedHealth Group subsidiary). Optum started with a single data source connecting to Kafka. Today, they provide Kafka as a Service within the UnitedHealth Group. The service is centrally managed and used by over 200 internal application teams.

The benefit of this data streaming approach: a repeatable, scalable, and cost-efficient way to standardize data. Optum leverages data streaming for many use cases, from mainframe integration via change data capture (CDC) to modern data processing and analytics tools.

IT Modernization with Apache Kafka and Data Streaming in Healthcare at Optum
Source: Optum

Intel – Real-Time Cyber Intelligence at Scale with Kafka and Splunk

Intel Corporation (commonly known as Intel) is an American multinational corporation and technology company headquartered in Santa Clara, California. It is one of the world’s largest semiconductor chip manufacturers by revenue.

Intel’s Cyber Intelligence Platform leverages the entire Kafka-native ecosystem, including Kafka Connect, Kafka Streams, Multi-Region Clusters (MRC), and more…

Their Kafka maturity curve shows how Intel started with a few data pipelines, for instance, getting data from various sources into Splunk for situational awareness and threat intelligence use cases. Later, Intel added Kafka-native stream processing for streaming ETL (to pre-process, filter, and aggregate data) instead of ingesting raw data (with high $$$ bills) and for advanced analytics by combining streaming analytics with Machine Learning.
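
The streaming ETL pattern described here (pre-processing and filtering in-stream before data reaches an expensive index such as Splunk) maps naturally to Kafka Streams. The following Java sketch is a minimal illustration under assumed topic names, an assumed application id, and a simplistic severity-based filter rule; it is not Intel's actual pipeline.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SecurityEventEtl {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cyber-streaming-etl");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rawEvents = builder.stream("raw-security-events"); // hypothetical topic

        // Drop low-severity noise in-stream so only relevant events reach the (expensive) SIEM index.
        rawEvents
            .filter((host, jsonEvent) -> jsonEvent.contains("\"severity\":\"HIGH\""))
            .to("filtered-security-events");                                       // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```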

Cybersecurity Intelligence Platform with Data Streaming Apache Kafka Confluent and Splunk at Intel
Source: Intel

Brent Conran (Vice President and Chief Information Security Officer, Intel) described the benefits of data streaming:

“Kafka enables us to deploy more advanced techniques in-stream, such as machine learning models that analyze data and produce new insights. This helps us reduce the mean time to detect and respond.”

Bayer AG – Transition from On-Premise to Multi-Cloud

Bayer AG is a German multinational pharmaceutical and biotechnology company and one of the largest pharmaceutical companies in the world. They adopted a cloud-first strategy and started a multi-year transition to the cloud.

While Bayer AG has various data streaming projects powered by Kafka, my favorite is the success story of their Monsanto acquisition. The Kafka-based cross-data center DataHub was created to facilitate migration and to drive a shift to real-time stream processing.

Instead of using many words, let’s look at Bayer’s impressive data streaming journey from on-premise to multi-cloud, connecting various cloud-native Kafka and non-Kafka technologies over the years.

Hybrid Multi Cloud Data Streaming Journey at Bayer in Healthcare and Pharma
Source: Bayer AG

Tesla – Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)

Tesla is a carmaker, utility company, insurance provider, and more. The company processes trillions of messages per day for IoT use cases.

Tesla’s data streaming journey is fascinating because it focuses on migrating from a message broker to stream processing. The initial reasons for Kafka were the scalability and reliability of processing high volumes of IoT sensor data.

But as you can see, more use cases were added quickly, such as Kafka Connect for data integration.

Tesla Automotive Journey from RabbitMQ to Kafka Connect
Source: Tesla

Tesla’s blog post details its advanced stream processing use cases. I often call streaming analytics the “secret sauce” (as data is only valuable if you correlate it in real-time instead of just ingesting it into a batch data warehouse or data lake).

Siemens – Integration between On-Premise SAP and Salesforce CRM in the Cloud

Siemens is a German multinational conglomerate corporation and the largest industrial manufacturing company in Europe.

One strategic data-streaming use case allowed Siemens to move “from batch to faster” data processing. For instance, Siemens connected its SAP PMD to Kafka on-premise. The SAP infrastructure is very complex, with 80% “customized ERP”. They improved the business processes and integration workflow from daily or weekly batches to real-time communication by integrating SAP's proprietary IDoc messages into the event streaming platform.

Siemens later migrated from self-managed on-premise Kafka to Confluent Cloud via Confluent Replicator. Integrating Salesforce CRM via Kafka Connect was the first step of Siemens’ cloud strategy. As usual, more and more projects and applications join the data streaming journey as it is super easy to tap into the event stream and connect it to your favorite tools, APIs, and SaaS products.

Siemens Apache Kafka and Data Streaming Journey to the Cloud across the Supply Chain
Source: Siemens

A few important notes on this migration:

  • Confluent Cloud was chosen because it enables focusing on business logic and handing over complex operations to the experts, including mission-critical SLAs and support.
  • Migration from one Kafka cluster to another (in this case, from open source to Confluent Cloud) is possible with zero downtime and no data loss. Confluent Replicator and Cluster Linking, in conjunction with the expertise and skills of consultants, make such a migration simple and safe.

50Hertz – Transition from Proprietary SCADA to Cloud-Native Industrial IoT

50Hertz is a transmission system operator for electricity in Germany. They built a cloud-native 24/7 SCADA system with Confluent. The solution is developed and tested in the public cloud but deployed in safety-critical, air-gapped edge environments. A unidirectional hardware gateway helps replicate data from the edge into the cloud safely and securely.

This data streaming journey is one of my favorites as it combines so many challenges well known in any OT/IT environment across industries like manufacturing, energy and utilities, automotive, healthcare, etc.:

  • From monolithic and proprietary systems to open architectures and APIs
  • From batch with mainframe, files, and old Windows servers to real-time data processing and analytics
  • From on-premise data centers (and air-gapped production facilities) to a hybrid edge and cloud strategy, often with a cloud-first focus for new use cases.

The following graphic is fantastic. It shows the data streaming journey from left to right, including the bridge in the middle (as such a project is not a big bang but usually takes years, at a minimum):

50Hertz Journey with Data Streaming for OT IT Modernization
Source: 50Hertz

Technical Evolution during the Data Streaming Journey

The different data streaming journeys showed you a few things: It is not a big bang. Old and new technology and processes need to work together. Only technology AND expertise drive successful projects.

Therefore, I want to close with a few more thoughts from a technical perspective:

The data streaming journey never ends. Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications. We are just at the beginning.

What does your data streaming journey look like? Where are you right now, and what are your plans for the next few years? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

Apache Kafka is NOT real real-time data streaming! https://www.kai-waehner.de/blog/2022/11/29/apache-kafka-is-not-real-real-time-data-streaming/ Tue, 29 Nov 2022 15:01:04 +0000 https://www.kai-waehner.de/?p=5024 Real-time data beats slow data. It is that easy! But what is real-time? The term always needs to…

The post Apache Kafka is NOT real real-time data streaming! appeared first on Kai Waehner.

Real-time data beats slow data. It is that easy! But what is real-time? The term always needs to be defined when discussing a use case. Apache Kafka is the de facto standard for real-time data streaming. Kafka is good enough for almost all real-time scenarios. But dedicated proprietary software is required for niche use cases. Kafka is NOT the right choice if you need microsecond latency! This blog post explores the architecture of NASDAQ that combines critical stock exchange trading with low-latency streaming analytics.

Low Latency Real Time with Apache Kafka

What is real-time data streaming?

Apache Kafka is the de facto standard for data streaming. However, every business has a different understanding of real-time data. And Kafka cannot solve every real-time problem.

Hard real-time means a deterministic system with guaranteed response times and zero latency spikes. That's a requirement for embedded systems using programming languages like C, C++, or Rust to implement safety-critical software like flight control systems or collaborative robots (cobots). Apache Kafka is not the right technology for safety-critical latency requirements.

Soft real-time is data processing in a non-deterministic network with potential latency spikes. Data is processed in near real-time. That can be microseconds, milliseconds, seconds, or slower.

Real-time categories for data streaming with Apache Kafka

I typically see three kinds of real-time use cases. But even here, Apache Kafka does not fit into each category:

  • Critical real-time: Limited set of use cases that require data processing in microseconds. Famous use case: Trading markets in financial services. This is NOT Kafka.
  • Low-latency real-time: Fast data processing is required in tens or hundreds of milliseconds to enable specific use cases. Examples: Sensor analytics in manufacturing, end-to-end data correlation in ride-hailing between mobile apps and backends, and fraud detection in instant payments. This is Kafka.
  • Near real-time: Fast data processing improves the business process but is not mandatory. For instance, data ingestion (streaming ETL) into a data warehouse is better in seconds than a batch process that runs every night. This is Kafka.

Please note that this article focuses on Apache Kafka as it is the de facto standard for data streaming. However, the same is true for many complementary or competitive technologies, like Spark Streaming, Apache Flink, Apache Pulsar, or Redpanda.

Let’s look at a concrete example of the financial services industry…

NASDAQ market feeds vs. real-time data

Ruchir Vani, the Director of Software Engineering at Nasdaq, presented at “Current 2022 – the Next Generation of Kafka Summit” in Austin, Texas: Improving the Reliability of Market Data Subscription Feeds.

The Nasdaq Stock Market (National Association of Securities Dealers Automated Quotations Stock Market) is an American stock exchange based in New York City. It is ranked second on the list of stock exchanges by market capitalization of shares traded, behind the New York Stock Exchange. The exchange platform is owned by Nasdaq, Inc. While most people only know the stock exchange, it is just the tip of the iceberg.

NASDAQ Exchange - The Tip of the Iceberg
Source: Nasdaq, Inc.

Nasdaq Cloud Data Service powered by Apache Kafka

The Nasdaq Cloud Data Service has democratized access to financial data for companies, researchers, and educators. These downstream consumers have different requirements and SLAs.

The core engine for processing the market data feeds requires sub-15 microsecond latency. This is NOT Kafka but dedicated (expensive) proprietary software. Consumers need to be co-located and use optimized applications to leverage data at that speed.

Market Data Feeds - Critical Real-Time
Source: Nasdaq, Inc.

NASDAQ, Inc. wanted to capture more market share by providing additional services on top of the critical market feed. They built a service on top called Nasdaq Data Link Streaming. Kafka powers this service:

NASDAQ Data Link Streaming with Apache Kafka
Source: Nasdaq, Inc.

The following architecture shows the combination. Critical real-time workloads run in the Nasdaq Data Centers. The data feeds are replicated to the public cloud for further processing and analytics. Market traders need to co-locate with the critical real-time engine. Other internal and external subscribers (like research and education consumers) consume from the cloud with low latency, in near real-time, or even in batches from the historical data store:

Architecture of Nasdaq Inc Kafka Data Streaming
Source: Nasdaq, Inc.

Real real-time is proprietary (expensive) technology and is rarely needed

Real-time data beats slow data. Apache Kafka and similar technologies like Apache Flink, Spark Streaming, Apache Pulsar, Redpanda, Amazon Kinesis, Google Pub/Sub, RabbitMQ, and so on enable low-latency real-time messaging or data streaming.

Apache Kafka became the de facto standard for data streaming because Kafka is good enough for almost all use cases. Most use cases do not even care if end-to-end processing takes 10ms, 100ms, or even 500ms (as downstream applications are not built for that speed anyway). Niche scenarios require dedicated technology. Kafka is NOT the right choice if you need microsecond latency! The NASDAQ example showed how critical proprietary technology and low-latency data streaming work very well together.

If you want to see more use cases, read my post about low-latency data streaming with Apache Kafka and cloud-native 5G infrastructure.

What kind of real-time do you need in your projects? When do you need critical real-time? If you “just” need low latency, what use case is Kafka not good enough for? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka is NOT real real-time data streaming! appeared first on Kai Waehner.

Real-Time Supply Chain with Apache Kafka in the Food and Retail Industry https://www.kai-waehner.de/blog/2022/02/25/real-time-supply-chain-with-apache-kafka-in-food-retail-industry/ Fri, 25 Feb 2022 08:13:14 +0000 https://www.kai-waehner.de/?p=4309 This blog post explores real-world deployments across the end-to-end supply chain powered by data streaming with Apache Kafka to improve business processes with real-time services. The examples include manufacturing, logistics, stores, delivery, restaurants, and other parts of the business. Case studies include Walmart, Albertsons, Instacart, Domino's Pizza, Migros, and more.

The post Real-Time Supply Chain with Apache Kafka in the Food and Retail Industry appeared first on Kai Waehner.

The supply chain in the food and retail industry is complex, error-prone, and slow. This blog post explores real-world deployments across the end-to-end supply chain powered by data streaming with Apache Kafka to improve business processes with real-time services. The examples include manufacturing, logistics, stores, delivery, restaurants, and other parts of the business. Case studies include Walmart, Albertsons, Instacart, Domino’s Pizza, Migros, and more.

The supply chain in the food and retail industry

The food industry is a complex, global network of diverse businesses that supplies most of the food consumed by the world’s population. It is far beyond the following simplified visualization 🙂

The Food and Retail Industry Simplified

The term food industry covers a series of industrial activities directed at the production, distribution, processing, conversion, preparation, preservation, transport, certification, and packaging of foodstuffs.

Today’s food industry has become highly diversified, with manufacturing ranging from small, traditional, family-run activities that are highly labor-intensive to large, capital-intensive, and highly mechanized industrial processes.

This post explores several case studies with use cases and architectures to improve supply chain and business processes with real-time capabilities.

Most of the following companies leverage Kafka for various use cases. They sometimes overlap and compete. I structured the next sections to choose specific examples from each company. This way, you see a complete real-time supply chain flow using data streaming examples from different industries related to the food and retail business.

Why do the food and retail supply chains need real-time data streaming using Apache Kafka?

Before I start with the real-world examples, let’s discuss why I thought this was the right time for this post to collect a few case studies.

In February 2022, I was in Florida for a week for customer meetings. Let me report on my inconvenient experience as a frequent traveler and customer. Coincidentally, all this happened within one weekend.

A horrible weekend travel experience powered by manual processes and batch systems

  • Issue #1 (hotel): Although I had canceled a hotel night a week earlier, I still got upgrade-offer emails from the batch system of the hotel's booking engine. Conversely, I did not get promotions for upgrading my new booking at the same hotel chain.
  • Issue #2 (clothing store): The Point-of-sale (POS) was down because of a power outage and no connectivity to the internet. The POS could not process payments without the internet. People left without buying items. I needed the new clothes (as my hotel also did not provide laundry service because of the limited workforce).
  • Issue #3 (restaurant): I went to a steak house to get dinner. Their POS was down, too, because of the same power outage. The waiter could not take orders. The separate software in the kitchen could not receive manual orders from the waiter either.
  • Issue #4 (restaurant): After 30 minutes of waiting, I ordered and had dinner. However, the dessert I chose was not available. The waiter explained that supply chain issues made it unavailable, but the restaurant never updated the paper menu or the online PDF.
  • Issue #5 (clothing store): After dinner, I went back to the other store to buy my clothes. It worked. The salesperson asked me for my email address to give a 15 percent discount. I wonder why they don't see my loyalty information already, maybe via a location-based service. Or at least, I would expect to pay and log in via my mobile app.
  • Issue #6 (hotel): Back in the hotel, I checked out. The hotel did not add my loyalty bonus (a discount on food in the restaurant) as the system only supports batch integration with other applications. Also, the receipt could not display loyalty information and the correct receipt value. I had to accept this, even though it is not even legal under German tax law.

This story shows why the digital transformation is crucial across all parts of the food, retail, and travel industry. I have already covered use cases and architecture for data streaming with Apache Kafka in the Airline, Aviation, and Travel Industry. Hence, this post focuses more on food and retail. But the concepts do not differ across these industries.

As you will see in the following sections, optimizing business processes, automating tasks, and improving customer experience is not complicated with the right technologies.

Digital transformation with automated business processes and real-time services across the supply chain

With this story in mind, I thought it was a good time to share various real-world examples across the food and retail supply chain. Some companies have already started with innovation. They digitalized their business processes and built innovative new real-time services.

The Real-Time Food and Retail Supply Chain powered by Apache Kafka

Food processing machinery for IoT analytics at Baader

BAADER is a worldwide manufacturer of innovative machinery for the food processing industry. They run an IoT-based and data-driven food value chain on Confluent Cloud.

The Kafka-based infrastructure provides a single source of truth across factories and regions along the food value chain. Business-critical operations are available 24/7 for tracking, calculations, alerts, etc.

Food Supply Chain at Baader with Apache Kafka and Confluent Cloud

The event streaming platform runs on Confluent Cloud. Hence, Baader can focus on building innovative business applications. The serverless Kafka infrastructure provides mission-critical SLAs and consumption-based pricing for all required capabilities: messaging, storage, data integration, and data processing.

MQTT provides connectivity to machines and GPS data from vehicles at the edge. Kafka Connect connectors integrate MQTT and other IT systems, such as Elasticsearch, MongoDB, and AWS S3. ksqlDB processes the data in motion continuously.

Check out my blog series about Kafka and MQTT for other related IoT use cases and examples.

Logistics and track & trace across the supply chain at Migros

Migros is Switzerland’s largest retail company, largest supermarket chain, and largest employer. They leverage MQTT and Kafka for real-time visualization and processing of logistics and transportation information.

As Migros explored in a Swiss Kafka meetup, they optimized their supply chain with a single data streaming pipeline, not just for real-time purposes but also for use cases that require replaying all historical events of the whole day.

The goal was to provide “one global optimum instead of four local optima”. Specific logistics use cases include forecasting the truck arrival time and rescheduling truck tours.

Optimized inventory management and replenishment at Walmart

Walmart operates Kafka across its end-to-end supply chain at a massive scale. The VP of Walmart Cloud says, “Walmart is a $500 billion in revenue company, so every second is worth millions of dollars. Having Confluent as our partner has been invaluable. Kafka and Confluent are the backbone of our digital omnichannel transformation and success at Walmart.”

The real-time supply chain includes Walmart’s inventory system. As part of that infrastructure, Walmart built a real-time replenishment system:

Walmart Replenishment System and Real Time Inventory

Here are a few notes from Walmart’s Kafka Summit talk covering this topic.

  • Caters to millions of online and walk-in customers
  • Ensures optimal availability of the needed assortment and timely delivery for online fulfillment
  • 4+ billion messages in 3 hours generate an order plan for the entire network of Walmart stores with great accuracy for ordering decisions made daily
  • Apache Kafka as the data hub and for real-time processing
  • Apache Spark for micro-batches

The business value is enormous from a cost and operations perspective: cycle time reduction, accuracy, speed, reduced complexity, elasticity, scalability, improved resiliency, and reduced cost.

And to be clear: This is just one of many Kafka-powered use cases at Walmart! Check out other Kafka Summit presentations for different use cases and architectures.

Omnichannel order management in restaurants at Domino’s Pizza

Domino’s Pizza is a multinational pizza restaurant chain with ~17,000 stores. They successfully transformed from a traditional pizza company to a data-driven e-commerce organization. They focus on a data-first approach and a relentless customer experience.

Domino’s Pizza provides real-time operation views to franchise owners. This includes KPIs, such as order volume by channel and store efficiency metrics across different ordering channels.

Domino’s Pizza mentions the following capabilities and benefits of their real-time data hub to optimize their supply chain:

  • Improve store operational real-time analytics
  • Support global expansion goals via legacy IT modernization
  • Implement more personalized marketing campaigns
  • Real-time single pane of glass

Location-based services and upselling at the point-of-sale at Royal Caribbean

Royal Caribbean is a cruise line. It operates the four largest passenger ships in the world. As of January 2021, the line operates twenty-four ships and has six additional ships on order.

Royal Caribbean implemented one of Kafka’s most famous use cases at the edge. Each cruise ship has a Kafka cluster running locally for use cases, such as payment processing, loyalty information, customer recommendations, etc.:

Swimming Retail Stores at Royal Caribbean with Apache Kafka

Edge computing on the ship is crucial for an efficient supply chain and excellent customer experience at Royal Caribbean:

  • Bad and costly connectivity to the internet
  • The requirement to do edge computing in real-time for a seamless customer experience and increased revenue
  • Aggregation of all the cruise trips in the cloud for analytics and reporting to improve the customer experience, upsell opportunities, and many other business processes

Hence, a Kafka cluster on each ship enables local processing and reliable mission-critical workloads. The Kafka storage guarantees durability, no data loss, and guaranteed ordering of events – even though they are processed later. Only very critical data is sent directly to the cloud (if there is connectivity at all). All other information is replicated to the central Kafka cluster in the cloud when the ship arrives in a harbor for a few hours. A stable internet connection and high bandwidth are available before leaving for the next trip again.

Learn more about use cases and architecture for Kafka at the edge in its dedicated blog post.

Recommendations and discounts using the loyalty platform at Albertsons

Albertsons is the second-largest American grocery company with 2200+ stores and 290,000+ employees. They also operate pharmacies in many stores.

Albertsons operates Kafka powered by Confluent as the real-time data hub for various use cases, including:

  • Inventory updates from 2200 stores to the cloud services in near real-time to maintain visibility into the health of the stores and to ensure well-managed out-of-stock and substitution recommendations for digital customers.
  • Distributing offers and customer clips to each store in near real-time and improving the store checkout experience.
  • Feeding data in real-time to analytics engines for supply chain order forecasts, warehouse order management, delivery, labor forecasting and management, demand planning, and inventory management.
  • Ingesting transactions in near real-time to the data lake for batch workloads to generate dashboards for associates and business, training data models, and hyper-personalization.

As an example, here is how Albertsons data streaming architecture empowers their real-time loyalty platform:

Albertsons Loyalty Use Case in Grocery Stores powered by Apache Kafka

Learn more about Albertsons’ innovative real-time data hub by watching the on-demand webinar I did with two of their leading architects recently.

Payment processing and fraud detection at Grab

Grab is a mobility service in Asia, similar to Uber and Lyft in the United States or FREE NOW (formerly mytaxi) in Europe. The Kafka ecosystem heavily powers all of them. Like FREE NOW, Grab leverages serverless Confluent Cloud to focus on business logic.

As part of this, Grab has built GrabDefence. This internal service provides a mission-critical payment and fraud detection platform. The platform leverages Kafka Streams and Machine Learning for stateful stream processing and real-time fraud detection at scale.

GrabDefence - Fraud Detection with Kafka Streams and Machine Learning in Grab Mobility Service

Grab performs billions of fraud and safety detections daily across millions of transactions. Companies in Southeast Asia lose 1.6% to fraud. Therefore, Grab's data science and engineering team built a platform to search for anomalous and suspicious transactions and identify high-risk individuals.

Here is one fraud example: An individual who pretends to be both the driver and passenger and makes cashless payments to get promotions.
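
Detecting suspicious payment velocity like this is a classic stateful stream processing task. The Java sketch below shows the general Kafka Streams pattern of a windowed count per account with a simple threshold; topic names, window size, and threshold are hypothetical illustrations, not GrabDefence's actual logic, which combines stream processing with machine learning models.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PaymentVelocityCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-velocity-check"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Count payments per account in 5-minute windows (stateful stream processing)...
        builder.<String, String>stream("payments")                                // hypothetical topic, keyed by account id
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .count()
            .toStream()
            // ...and flag accounts above a simple threshold as candidates for deeper (ML-based) scoring.
            .filter((windowedAccountId, count) -> count > 20)
            .map((windowedAccountId, count) ->
                KeyValue.pair(windowedAccountId.key(), "suspicious payment velocity: " + count))
            .to("fraud-candidates");                                              // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```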

Delivery and pickup at Instacart

Instacart is a grocery delivery and pickup service in the United States and Canada. The service orders groceries from participating retailers, with the shopping being done by a personal shopper.

Instacart requires a platform that enables elastic scale and fast, agile internal adoption of real-time data processing. Hence, Confluent Cloud was chosen to focus on business logic and roll out new features with fast time-to-market. As Instacart said, during the Covid pandemic, they had to “handle ten years’ worth of growth in six weeks”. Very impressive and only possible with cloud-native and serverless data streaming.

Learn more in the recent Kafka Summit interview between Confluent founder Jun Rao and Instacart.

Real-time data streaming with Kafka beats slow data everywhere in the supply chain

This post showed various real-world deployments where real-time information increases operational efficiency, reduces the cost, or improves the customer experience.

Whether your business domain cares about manufacturing, logistics, stores, delivery, or customer experience and loyalty, data streaming with the Apache Kafka ecosystem provides the capability to be innovative at any scale. And in the cloud, serverless Kafka makes it even easier to focus on time to market and business value.

How do you optimize your supply chain? Do you leverage data streaming? Is Apache Kafka the tool of choice for real-time data processing? Or what data streaming technologies do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Real-Time Supply Chain with Apache Kafka in the Food and Retail Industry appeared first on Kai Waehner.

When NOT to use Apache Kafka? https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/ Tue, 04 Jan 2022 07:24:59 +0000 https://www.kai-waehner.de/?p=3194 Apache Kafka is the de facto standard for event streaming to process data in motion. This blog post explores when NOT to use Apache Kafka. What use cases are not a good fit for Kafka? What limitations does Kafka have? How to qualify Kafka out as it is not the right tool for the job?

The post When NOT to use Apache Kafka? appeared first on Kai Waehner.

Apache Kafka is the de facto standard for event streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.

When not to use Apache Kafka

Let's begin with understanding why Kafka comes up everywhere these days. This clarifies the huge market demand for event streaming but also shows that there is no silver bullet solving all problems. Kafka is NOT the silver bullet for a connected world, but a crucial component!

The world gets more and more connected. Vast volumes of data are generated and need to be correlated in real-time to increase revenue, reduce costs, and reduce risks. I could pick almost any industry. Some are faster. Others are slower. But the connected world is coming everywhere. Think about manufacturing, smart cities, gaming, retail, banking, insurance, and so on. If you look at my past blogs, you can find relevant Kafka use cases for any industry.

I picked two market trends that show this insane growth of data and the creation of innovation and new cutting-edge use cases (and why Kafka’s adoption is insane across industries, too).

Connected Cars – Insane volume of telemetry data and aftersales

Here is the “Global Opportunity Analysis and Industry Forecast, 2020–2027” by Allied Market Research:

Connected Car Market Statistics – 2027

The Connected Car market includes a much wider variety of use cases and industries than most people think. A few examples: Network infrastructure and connectivity, safety, entertainment, retail, aftermarket, vehicle insurance, 3rd party data usage (e.g., smart city),  and so much more.

Gaming – Billions of players and massive revenues

The gaming industry is already bigger than all other media categories combined, and this is still just the beginning of a new era – as Bitkraft depicts:

The growing Gaming market

Millions of new players join the gaming community every month across the globe. Connectivity and cheap smartphones are sold in less wealthy countries. New business models like “play to earn” change how the next generation of gamers plays a game. More scalable and low latency technologies like 5G enable new use cases. Blockchain and NFT (Non-Fungible Token) are changing the monetization and collection market forever.

These market trends across industries clarify why the need for real-time data processing increases significantly quarter by quarter. Apache Kafka established itself as the de facto standard for processing analytical and transactional data streams at scale. However, it is crucial to understand when (not) to use Apache Kafka and its ecosystem in your projects.

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • an event streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework for streaming ETL.
  • a data processing framework for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that's true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Case studies for Apache Kafka in a connected world

This section shows a few examples of fantastic success stories where Kafka is combined with other technologies because it makes sense and solves the business problem. The focus here is case studies that need more than just Kafka for the end-to-end data flow.

No matter whether you follow my blog, Kafka Summit conferences, online platforms like Medium or DZone, or any other tech-related news, you will find plenty of success stories around real-time data streaming with Apache Kafka for high volumes of analytics and transactional data from connected cars, IoT edge devices, or gaming apps on smartphones.

A few examples across industries and use cases:

  • Audi: Connected car platform rolled out across regions and cloud providers
  • BMW: Smart factories for the optimization of the supply chain and logistics
  • SolarPower: Complete solar energy solutions and services across the globe
  • Royal Caribbean: Entertainment on cruise ships with disconnected edge services and hybrid cloud aggregation
  • Disney+ Hotstar: Interactive media content and gaming/betting for millions of fans on their smartphone
  • The list goes on and on and on.

So what is the problem with all these great IoT success stories? Well, there is no problem. But some clarification is needed to explain when to use event streaming with the Apache Kafka ecosystem and where it is usually complemented by other solutions.

When to use Apache Kafka?

Before we discuss when NOT to use Kafka, let's understand where to use it. That makes it much clearer how and when to complement it with other technologies if needed.

I will add real-world examples to each section. In my experience, this makes it much easier to understand the added value.

Kafka consumes and processes high volumes of IoT and mobile data in real-time

Processing massive volumes of data in real-time is one of the critical capabilities of Kafka.

Tesla is not just a car maker. Tesla is a tech company writing a lot of innovative and cutting-edge software. They provide an energy infrastructure for cars with their Tesla Superchargers, solar energy production at their Gigafactories, and much more. Processing and analyzing the data from their vehicles, smart grids, and factories and integrating with the rest of the IT backend services in real-time is a crucial piece of their success.

Tesla has built a Kafka-based data platform infrastructure "to support millions of devices and trillions of data points per day". Tesla showed an exciting history and evolution of their Kafka usage at a Kafka Summit in 2019.

Keep in mind that Kafka is much more than just messaging. I repeat this in almost every blog post as too many people still don’t get it. Kafka is a distributed storage layer that truly decouples producers and consumers. Additionally, Kafka-native processing tools like Kafka Streams and ksqlDB enable real-time processing.

Kafka correlates IoT data with transactional data from the MES and ERP systems

Data integration in real-time at scale is relevant for analytics and the usage of transactional systems like an ERP or MES system. Kafka Connect and non-Kafka middleware complement the core of event streaming for this task.

BMW operates mission-critical Kafka workloads across the edge (i.e., in the smart factories) and public cloud. Kafka enables decoupling, transparency, and innovation. The products and expertise from Confluent add stability. The latter is vital for success in manufacturing. Each minute of downtime costs a fortune. Read my related article “Apache Kafka as Data Historian – an IIoT / Industry 4.0 Real-Time Data Lake” to understand how Kafka improves the Overall Equipment Effectiveness (OEE) in manufacturing.

BMW optimizes its supply chain management in real-time. The solution provides information about the right stock in place, both physically and in transactional systems like BMW’s ERP powered by SAP. “Just in time, just in sequence” is crucial for many critical applications. The integration between Kafka and SAP is required for almost 50% of customers I talk to in this space. Beyond the integration, many next-generation transactional ERP and MES platforms are powered by Kafka, too.

Kafka integrates with all the non-IoT IT in the enterprise at the edge and hybrid or multi-cloud

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Learn about several scenarios that may require multi-cluster solutions and see real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

The true decoupling between different interfaces is a unique advantage of Kafka vs. other messaging platforms such as IBM MQ, RabbitMQ, or MQTT brokers. I also explored this in detail in my article about Domain-driven Design (DDD) with Kafka.

Infrastructure modernization and hybrid cloud architectures with Apache Kafka are typical across industries.

One of my favorite examples is the success story from Unity. The company provides a real-time 3D development platform focusing on gaming and getting into other industries like manufacturing with their Augmented Reality (AR) / Virtual Reality (VR) features.

The data-driven company already had content installed 33 billion times in 2019, reaching 3 billion devices worldwide. Unity operates one of the largest monetization networks in the world. They migrated this platform from self-managed Kafka to fully-managed Confluent Cloud. The cutover was executed by the project team without downtime or data loss. Read Unity's post on the Confluent Blog: "How Unity uses Confluent for real-time event streaming at scale".

Kafka is the scalable real-time backend for mobility services and gaming/betting platforms

Many gaming and mobility services leverage event streaming as the backbone of their infrastructure. Use cases include the processing of telemetry data, location-based services, payments, fraud detection, user/player retention, loyalty platform, and so much more. Almost all innovative applications in this sector require real-time data streaming at scale.

A few examples:

  • Mobility services: Uber, Lyft, FREE NOW, Grab, Otonomo, Here Technologies, …
  • Gaming services: Disney+ Hotstar, Sony Playstation, Tencent, Big Fish Games, …
  • Betting services: William Hill, Sky Betting, …

Just look at the job portals of any mobility or gaming service. Not everybody is talking about their Kafka usage in public. But almost everyone is looking for Kafka experts to develop and operate their platform.

These use cases are just as critical as a payment process in a core banking platform. Regulatory compliance and zero data loss are crucial. Multi-Region Clusters (i.e., a Kafka cluster stretched across regions like US East, Central, and West) enable high availability with zero downtime and no data loss even in the case of a disaster.

Multi Region Kafka Cluster in Gaming for Automated Disaster Recovery

Vehicles, machines, or IoT devices embed a single Kafka broker

The edge is here to stay and grow. Some use cases require the deployment of a Kafka cluster or single broker outside a data center. Reasons for operating a Kafka infrastructure at the edge include low latency, cost efficiency, cybersecurity, or no internet connectivity.

Examples for Kafka at the edge:

  • Disconnected edge in logistics to store logs, sensor data, and images while offline (e.g., a truck on the street or a drone flying around a ship) until a good internet connection is available in the distribution center
  • Vehicle-to-Everything (V2X) communication in a small local data center like AWS Outposts (via a gateway like MQTT in the case of a large area, a considerable number of vehicles, or a lousy network, or via a direct Kafka client connection for a few hundred machines, e.g., in a smart factory)
  • Offline mobility services like integrating the car infrastructure with gaming, maps, or a recommendation engine with locally processed partner services (e.g., "the next McDonald's is 10 miles ahead, here is a coupon").

The cruise line Royal Caribbean is a great success story for this scenario. It operates the four largest passenger ships in the world. As of January 2021, the line operates twenty-four ships and has six additional ships on order.

Royal Caribbean implemented one of Kafka's most famous use cases at the edge. Each cruise ship has a Kafka cluster running locally for use cases such as payment processing, loyalty information, customer recommendations, etc.:

Swimming Retail Stores at Royal Caribbean with Apache Kafka

I covered this example and other Kafka edge deployments in various blogs. I talked about use cases for Kafka at the edge, showed architectures for Kafka at the edge, and explored low latency 5G deployments powered by Kafka.

When NOT to use Apache Kafka?

Finally, we are coming to the section everybody was looking for, right? However, it is crucial first to understand when to use Kafka. Now, it is easy to explain when NOT to use Kafka.

For this section, let's assume that we talk about production scenarios, not some ugly (?) workarounds to connect Kafka to something directly for a proof of concept; there is always a quick and dirty option to test something – and that's fine for that goal. But things change when you need to scale and roll out your infrastructure globally, comply with the law, and guarantee no data loss for transactional workloads.

With this in mind, it is relatively easy to qualify out Kafka as an option for some use cases and problems:

Kafka is NOT hard real-time

The definition of the term “real-time” is difficult. It is often a marketing term. Real-time programs must guarantee a response within specified time constraints.

Kafka – and all other frameworks, products, and cloud services used in this context – is only soft real-time and built for the IT world. Many OT and IoT applications require hard real-time with zero latency spikes.

Apache Kafka is NOT hard real time for cars and robots

Soft real-time is used for applications such as

  • Point-to-point messaging between IT applications
  • Data ingestion from various data sources into one or more data sinks
  • Data processing and data correlation (often called event streaming or event stream processing)

If your application requires sub-millisecond latency, Kafka is not the right technology. For instance, high-frequency trading is usually implemented with purpose-built proprietary commercial solutions.

Always keep in mind: The lowest latency would be to not use a messaging system at all and just use shared memory. In a race to the lowest latency, Kafka will lose every time. However, for the audit log, transaction log, or persistence engine parts of an exchange, no data loss becomes more important than latency, and that is where Kafka wins.

Most real-time use cases “only” require data processing in the millisecond to second range. In that case, Kafka is a perfect solution. Many FinTechs, such as Robinhood, rely on Kafka for mission-critical transactional workloads, even financial trading. Multi-access edge computing (MEC) is another excellent example of low-latency data streaming with Apache Kafka and cloud-native 5G infrastructure.
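
The latency-versus-durability trade-off is largely a matter of client configuration. The Java sketch below contrasts a latency-oriented and a no-data-loss-oriented producer setup; the concrete values are illustrative choices for this example, not a recommendation for any specific workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerTuning {

    // Latency-oriented: send immediately, accept weaker delivery guarantees.
    static Properties lowLatencyConfig() {
        Properties props = new Properties();
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");   // do not wait to fill batches
        props.put(ProducerConfig.ACKS_CONFIG, "1");        // leader-only acknowledgement
        return props;
    }

    // Durability-oriented: no data loss and no duplicates on retry, at the cost of some latency.
    static Properties noDataLossConfig() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "all");                  // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");   // prevents duplicates on retry
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        return props;
    }
}
```

Neither configuration gets Kafka anywhere near microsecond latency, but the second one is exactly what mission-critical transactional workloads care about.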

Kafka is NOT deterministic for embedded and safety-critical systems

This one is pretty straightforward and related to the above section. Kafka is not a deterministic system. Safety-critical applications cannot use it for a car engine control system, a medical system such as a heart pacemaker, or an industrial process controller.

A few examples where Kafka CANNOT be used for:

  • Safety-critical data processing in the car or vehicle. That's Autosar / MISRA C / Assembler and similar technologies.
  • CAN Bus communication between ECUs.
  • Robotics. That’s C / C++ or similar low-level languages combined with frameworks such as Industrial ROS (Robot Operating System).
  • Safety-critical machine learning / deep learning (e.g., for autonomous driving)
  • Vehicle-to-Vehicle (V2V) communication. That’s 5G sidelink without an intermediary like Kafka.

My post “Apache Kafka is NOT Hard Real-Time BUT Used Everywhere in Automotive and Industrial IoT” explores this discussion in more detail.

TL;DR: Safety-related data processing must be implemented with dedicated low-level programming languages and solutions. That’s not Kafka! The same is true for any other IT software, too. Hence, don’t replace Kafka with IBM MQ, Flink, Spark, Snowflake, or any other similar IT software.

Kafka is NOT built for bad networks

Kafka requires good, stable network connectivity between the Kafka clients and the Kafka brokers. Hence, if the network is unstable and clients need to reconnect to the brokers all the time, operations are challenging, and SLAs are hard to reach.

There are some exceptions, but the basic rule of thumb is that other technologies are built specifically to solve the problem of bad networks. MQTT is the most prominent example. Hence, Kafka and MQTT are friends, not enemies. The combination is super powerful and used a lot across industries. For that reason, I wrote a whole blog series about Kafka and MQTT.

We built a connected car infrastructure that processes 100,000 data streams for real-time predictions using MQTT, Kafka, and TensorFlow in a Kappa architecture.

Kafka does NOT provide connectivity to tens of thousands of client applications

Another specific point to qualify Kafka out as an integration solution is that Kafka cannot connect to tens of thousands of clients. If you need to build a connected car infrastructure or gaming platform for mobile players, the clients (i.e., cars or smartphones) will not directly connect to Kafka.

A dedicated proxy such as an HTTP gateway or MQTT broker is the right intermediary between thousands of clients and Kafka for real-time backend processing and the integration with further data sinks such as a data lake, data warehouse, or custom real-time applications.

Where are the limits of Kafka client connections? As so often, this is hard to say. I have seen customers connect their shop floor in the plant via .NET and Java Kafka clients directly to the cloud where the Kafka cluster is running. Direct hybrid connections usually work well if the number of machines, PLCs, IoT gateways, and IoT devices is in the hundreds. For higher numbers of client applications, you need to evaluate whether you a) need a proxy in the middle or b) deploy “edge computing” with or without Kafka at the edge for lower latency and cost-efficient workloads.

When to MAYBE use Apache Kafka?

The last section covered scenarios where it is relatively easy to qualify Kafka out because it simply cannot provide the required capabilities. The following topics are less clear-cut: whether Kafka is a good choice depends on several factors.

Kafka does (usually) NOT replace another database

Apache Kafka is a database. It provides ACID guarantees and is used in hundreds of companies for mission-critical deployments. However, in most cases, Kafka does not compete with other databases. Kafka is an event streaming platform for messaging, storage, processing, and integration at scale in real-time with zero downtime or data loss.

Kafka is often used as a central streaming integration layer with these characteristics. Other databases can build materialized views for their specific use cases like real-time time-series analytics, near real-time ingestion into a text search infrastructure, or long-term storage in a data lake.

In summary, when you get asked if Kafka can replace a database, then there are several answers to consider:

  • Kafka can store data forever in a durable and highly available manner, providing ACID guarantees
  • Further options to query historical data are available in Kafka
  • Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more potent than ever before for data processing and event-based long-term storage
  • Stateful applications can be built leveraging Kafka clients (microservices, business applications) with no other external database (see the sketch after this list)
  • Not a replacement for existing databases, data warehouses, or data lakes like MySQL, MongoDB, Elasticsearch, Hadoop, Snowflake, Google BigQuery, etc.
  • Other databases and Kafka complement each other; the right solution has to be selected for a problem; often, purpose-built materialized views are created and updated in real-time from the central event-based infrastructure
  • Different options are available for bi-directional pull and push-based integration between Kafka and databases to complement each other
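
To illustrate the “stateful application without an external database” point from the list above, here is a minimal Kafka Streams sketch (topic names and the store name are assumptions): it maintains a continuously updated materialized view of payment counts per customer inside the application and serves lookups via interactive queries instead of an external database.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

import java.util.Properties;

public class PaymentsMaterializedView {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        // Build a materialized view (state store) of payment counts per customer.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.as("payments-per-customer"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Wait until the instance is running before querying the state store.
        while (streams.state() != KafkaStreams.State.RUNNING) {
            Thread.sleep(100);
        }

        // Query the view directly from the application - no external database involved.
        ReadOnlyKeyValueStore<String, Long> view = streams.store(
                StoreQueryParameters.fromNameAndType("payments-per-customer",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println("Payments for customer-42: " + view.get("customer-42"));
    }
}
```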

My blog post “Can Apache Kafka replace a database, data warehouse, or data lake?” discusses the usage of Kafka as a database in much more detail.

Kafka does (usually) NOT process large messages

Kafka was not built for large messages. Period.

Nevertheless, more and more projects send and process 1 MB, 10 MB, and even much bigger files and other large payloads via Kafka. One reason is that Kafka was designed for high volume and throughput, which is also needed to move large messages. A very common example is the ingestion and processing of large files from legacy systems with Kafka before ingesting the processed data into a data warehouse.

However, not all large messages should be processed with Kafka. Often you should use the right storage system and just leverage Kafka for the orchestration. Reference-based messaging (i.e. storing the file in another storage system and sending the link and metadata) is often the better design pattern:

Apache Kafka for large message payloads and files - Alternatives and Trade-offs

Know the different design patterns and choose the right technology for your problem.
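
A minimal sketch of the reference-based (claim-check) pattern is shown below. The uploadToObjectStore() helper, file path, and topic name are hypothetical stand-ins for your actual storage API (S3, HDFS, a file share, etc.): the large file goes to the storage system, and only a small reference event travels through Kafka.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class ClaimCheckProducer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        Path largeFile = Path.of("/data/export/legacy-batch.csv"); // example input file

        // 1. Store the large payload in a dedicated storage system (hypothetical helper).
        String objectUri = uploadToObjectStore(largeFile);

        // 2. Send only a small reference event (URI + metadata) through Kafka.
        String referenceEvent = "{\"uri\":\"" + objectUri + "\",\"sizeBytes\":" + Files.size(largeFile) + "}";
        producer.send(new ProducerRecord<>("file-references", largeFile.getFileName().toString(), referenceEvent));
        producer.flush();
    }

    // Placeholder: replace with your object store SDK (e.g., S3, GCS, HDFS).
    private static String uploadToObjectStore(Path file) {
        return "s3://my-bucket/" + file.getFileName(); // assumption for illustration
    }
}
```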

For more details and use cases about handling large files with Kafka, check out this blog post: “Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files)”.

Kafka is (usually) NOT the IoT gateway for the last-mile integration of industrial protocols…

The last-mile integration with IoT interfaces and mobile apps is a tricky space. As discussed above, Kafka cannot connect to thousands of Kafka clients. However, many IoT and mobile applications only require tens or hundreds of connections. In that case, a Kafka-native connection is straightforward using one of the various Kafka clients available for almost any programming language on the planet.

If a TCP-level connection with a Kafka client makes little sense or is not possible, a very prevalent workaround is the REST Proxy as the intermediary between the clients and the Kafka cluster. The clients communicate via synchronous HTTP(S) with the streaming platform.

Use cases for HTTP and REST APIs with Apache Kafka include the control plane (= management), the data plane (= produce and consume messages), and automation / DevOps tasks.
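
For the data plane, producing a record via HTTP could look like the following sketch using Java’s built-in HttpClient against a Confluent REST Proxy-style endpoint. Host, port, and topic are assumptions; check the documentation of your REST Proxy version for the exact content types and payload format.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProducerExample {

    public static void main(String[] args) throws Exception {
        // One JSON record wrapped in the REST Proxy v2 "records" envelope.
        String body = "{\"records\":[{\"value\":{\"deviceId\":\"sensor-17\",\"temperature\":21.5}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://rest-proxy:8082/topics/sensor-events"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```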

Unfortunately, many IoT projects require much more complex integrations. I am not just talking about a relatively straightforward integration via an MQTT or OPC-UA connector. Challenges in Industrial IoT projects include:

  • The automation industry often does not use open standards; systems are slow, insecure, not scalable, and proprietary.
  • Product Lifecycles are very long (tens of years), with no simple changes or upgrades.
  • IIoT usually uses incompatible protocols, typically proprietary and built for one specific vendor.
  • Proprietary and expensive monoliths that are not scalable and not extensible.

Therefore, many IoT projects complement Kafka with a purpose-built IoT platform. Most IoT products and cloud services are proprietary but provide open interfaces and architectures. The open-source space is small in this industry. A great alternative (for some use cases) is Apache PLC4X. The framework integrates with many proprietary legacy protocols, such as Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, etc. PLC4X also provides a Kafka Connect connector for native and scalable Kafka integration.
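
As a rough idea of what PLC4X integration looks like in code, here is a hedged sketch based on the older PLC4X Java API; the connection string, field address, and method names are assumptions that differ between PLC4X versions and PLC models. In practice, the PLC4X Kafka Connect connector handles the Kafka side for you.

```java
import org.apache.plc4x.java.PlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;

import java.util.concurrent.TimeUnit;

public class PlcReadSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical Siemens S7 PLC address; adjust driver, host, and field syntax to your setup.
        try (PlcConnection connection = new PlcDriverManager().getConnection("s7://192.168.0.10")) {
            PlcReadRequest request = connection.readRequestBuilder()
                    .addItem("motor-current", "%DB1.DBW100:INT") // tag name + PLC field address
                    .build();

            PlcReadResponse response = request.execute().get(5, TimeUnit.SECONDS);
            System.out.println("motor-current = " + response.getInteger("motor-current"));
            // From here, a Kafka producer (or the PLC4X Kafka Connect connector) would forward the value.
        }
    }
}
```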

A modern data historian is open and flexible. The foundation of many strategic IoT modernization projects across the shop floor and hybrid cloud is powered by event streaming:

Apache Kafka as Data Historian in Industrial IoT (IIoT)

Kafka is NOT a blockchain (but relevant for web3, crypto trading, NFT, off-chain, sidechain, oracles)

Kafka is a distributed commit log. The concepts and foundations are very similar to a blockchain. I explored this in more detail in my post “Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation“.

A blockchain should be used ONLY if different untrusted parties need to collaborate. For most enterprise projects, a blockchain only adds unnecessary complexity. A distributed commit log (= Kafka) or a tamper-proof distributed ledger (= enhanced Kafka) is sufficient.

Having said this, I see more and more companies using Kafka within their crypto trading platforms, market exchanges, and NFT trading marketplaces.

To be clear: Kafka is NOT the blockchain on these platforms. The blockchain is a cryptocurrency like Bitcoin or a platform providing smart contracts like Ethereum, where people build new distributed applications (dApps) like NFTs for the gaming or art industry. Kafka is the streaming platform to connect these blockchains with oracles (= the non-blockchain apps) like the CRM, data lake, data warehouse, and so on:

Apache Kafka and Blockchain - DLT - Use Cases and Architectures

TokenAnalyst is an excellent example that leverages Kafka to integrate blockchain data from Bitcoin and Ethereum with their analytics tools. Kafka Streams provides a stateful streaming application to prevent using invalid blocks in downstream aggregate calculations. For example, TokenAnalyst developed a block confirmer component that resolves reorganization scenarios by temporarily holding back blocks and only propagating them once a threshold number of confirmations (i.e., children mined on top of that block) is reached.
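
The following Kafka Streams sketch illustrates the general idea of such a block confirmer (not TokenAnalyst’s actual implementation): it keeps a confirmation counter per block hash in a state store and only forwards a block downstream once a threshold of confirmations has been observed. Topic names, the store name, and the threshold are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class BlockConfirmerSketch {

    static final int CONFIRMATION_THRESHOLD = 6;

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // State store holding the number of confirmations observed per block hash.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("block-confirmations"),
                Serdes.String(), Serdes.Integer()));

        builder.stream("raw-blocks", Consumed.with(Serdes.String(), Serdes.String()))
               .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                   private KeyValueStore<String, Integer> store;

                   @Override
                   @SuppressWarnings("unchecked")
                   public void init(ProcessorContext context) {
                       store = (KeyValueStore<String, Integer>) context.getStateStore("block-confirmations");
                   }

                   @Override
                   public KeyValue<String, String> transform(String blockHash, String block) {
                       Integer seen = store.get(blockHash);
                       int confirmations = (seen == null) ? 1 : seen + 1;
                       store.put(blockHash, confirmations);
                       // Hold the block back until enough confirmations arrived; otherwise forward nothing.
                       return confirmations >= CONFIRMATION_THRESHOLD ? KeyValue.pair(blockHash, block) : null;
                   }

                   @Override
                   public void close() { }
               }, "block-confirmations")
               .to("confirmed-blocks", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```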

In some advanced use cases, Kafka is used to implement a sidechain or off-chain platform, as the original blockchain does not scale well enough (blockchain data is known as on-chain data). Bitcoin is not the only blockchain that processes just single-digit (!) transactions per second. Most modern blockchain solutions cannot scale even close to the workloads Kafka processes in real-time.

From DAOs to blue-chip companies, measuring the health of blockchain infrastructure and IoT components is still necessary, even in a distributed network, to avoid downtime, secure the infrastructure, and make the blockchain data accessible. Kafka provides an agentless and scalable way to present that data to the parties involved and to make sure the relevant data is exposed to the right teams before a node is lost. This is relevant for cutting-edge Web3 IoT projects like Helium, or simpler closed distributed ledgers (DLT) like R3 Corda.

My recent post about live commerce powered by event streaming and Kafka transforming the retail metaverse shows how the retail and gaming industry connects virtual and physical things. The retail business process and customer communication happen in real-time; no matter if you want to sell clothes, a smartphone, or a blockchain-based NFT token for your collectible or video game.

TL;DR: Kafka is NOT…

… a replacement for your favorite database or data warehouse.

… hard real-time for safety-critical embedded workloads.

… a proxy for thousands of clients in bad networks.

… an API Management solution.

… an IoT gateway.

… a blockchain.

It is easy to qualify Kafka out for some use cases and requirements.

However, analytical and transactional workloads across all industries use Kafka. It is the de-facto standard for event streaming everywhere. Hence, Kafka is often combined with other technologies and platforms.

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
