Artificial Intelligence Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/artificial-intelligence/

How Penske Logistics Transforms Fleet Intelligence with Data Streaming and AI
https://www.kai-waehner.de/blog/2025/06/02/how-penske-logistics-transforms-fleet-intelligence-with-data-streaming-and-ai/
Mon, 02 Jun 2025 04:44:37 +0000

Real-time visibility has become essential in logistics. As supply chains grow more complex, providers must shift from delayed, batch-based systems to event-driven architectures. Data Streaming technologies like Apache Kafka and Apache Flink enable this shift by allowing continuous processing of data from telematics, inventory systems, and customer interactions. Penske Logistics is leading the way—using Confluent’s platform to stream and process 190 million IoT messages daily. This powers predictive maintenance, faster roadside assistance, and higher fleet uptime. The result: smarter operations, improved service, and a scalable foundation for the future of logistics.

Real-time visibility is no longer a competitive advantage in logistics—it’s a business necessity. As global supply chains become more complex and customer expectations rise, logistics providers must respond with agility and precision. That means shifting away from static, delayed data pipelines toward event-driven architectures built around real-time data.

Technologies like Apache Kafka and Apache Flink are at the heart of this transformation. They allow logistics companies to capture, process, and act on streaming data as it’s generated—from vehicle sensors and telematics systems to inventory platforms and customer applications. This enables new use cases in predictive maintenance, live fleet tracking, customer service automation, and much more.

A growing number of companies across the supply chain are embracing this model. Whether it’s real-time shipment tracking, automated compliance reporting, or AI-driven optimization, the ability to stream, process, and route data instantly is proving vital.

One standout example is Penske Logistics—a transportation leader using Confluent’s data streaming platform (DSP) to transform how it operates and delivers value to customers.

How Penske Logistics Transforms Fleet Intelligence with Kafka and AI

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Why Real-Time Data Matters in Logistics and Transportation

Transportation and logistics operate on tighter margins and stricter timelines than almost any other sector. Delays ripple through supply chains, disrupting manufacturing schedules, customer deliveries, and retail inventories. Traditional data integration methods—batch ETL, manual syncing, and siloed systems—simply can’t meet the demands of today’s global logistics networks.

Data streaming enables organizations in the logistics and transportation industry to ingest and process information in real time, while the data is still valuable. Vehicle diagnostics, route updates, inventory changes, and customer interactions can all be captured and acted upon in real time. This leads to faster decisions, more responsive services, and smarter operations.

Real-time data also lays the foundation for advanced use cases in automation and AI, where outcomes depend on immediate context and up-to-date information. And for logistics providers, it unlocks a powerful competitive edge.

Apache Kafka serves as the backbone for real-time messaging—connecting thousands of data producers and consumers across enterprise systems. Apache Flink adds stateful stream processing to the mix, enabling continuous pattern recognition, enrichment, and complex business logic in real time.
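
For illustration, here is a minimal sketch of the producer side: a vehicle telemetry event published to a Kafka topic with the confluent-kafka Python client. The broker address, topic name, and event fields are assumptions, not taken from any specific deployment.

```python
# Minimal sketch: publish a vehicle telemetry event to Kafka (assumed topic/broker names).
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumption: local broker

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

event = {
    "vehicle_id": "TRUCK-0042",          # hypothetical identifiers and fields
    "engine_temp_c": 104.2,
    "brake_pressure_kpa": 512.0,
    "ts": int(time.time() * 1000),
}

# Key by vehicle so all events for one truck land in the same partition (per-vehicle ordering).
producer.produce(
    topic="vehicle.telemetry",           # assumed topic name
    key=event["vehicle_id"],
    value=json.dumps(event),
    on_delivery=delivery_report,
)
producer.flush()
```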

Event-driven Architecture with Data Streaming in Logistics and Transportation using Apache Kafka and Flink

In the logistics industry, this event-driven architecture supports use cases such as:

  • Continuous monitoring of vehicle health and sensor data
  • Proactive maintenance scheduling
  • Real-time fleet tracking and route optimization
  • Integration of telematics, ERP, WMS, and customer systems
  • Instant alerts for service delays or disruptions
  • Predictive analytics for capacity and demand forecasting
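
To make the first two use cases more concrete, here is a minimal consumer-side sketch that watches the same (assumed) telemetry topic and flags readings above a simple threshold. A real fleet platform would apply far richer diagnostics; the topic name, consumer group, and threshold are illustrative assumptions.

```python
# Minimal sketch: continuous vehicle-health monitoring from a Kafka topic.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "vehicle-health-monitor",   # assumed consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["vehicle.telemetry"])    # assumed topic name

ENGINE_TEMP_LIMIT_C = 110.0  # hypothetical threshold for a maintenance alert

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        reading = json.loads(msg.value())
        if reading["engine_temp_c"] > ENGINE_TEMP_LIMIT_C:
            # In a real system this would publish an alert event or open a work order.
            print(f"ALERT {reading['vehicle_id']}: engine temp {reading['engine_temp_c']} C")
finally:
    consumer.close()
```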

This isn’t just theory. Leading logistics organizations are deploying these capabilities at scale.

Data Streaming Success Stories Across the Logistics and Transportation Industry

Many transportation and logistics firms are already using Kafka-based architectures to modernize their operations. A few examples:

  • LKW Walter relies on data streaming to optimize its full truck load (FTL) freight exchanges and enable digital freight matching.
  • Uber Freight leverages real-time telematics, pricing models, and dynamic load assignment across its digital logistics platform.
  • Instacart uses event-driven systems to coordinate live order delivery, matching customer demand with available delivery slots.
  • Maersk incorporates streaming data from containers and ports to enhance shipping visibility and supply chain planning.

These examples show the diversity of value that real-time data brings—across first mile, middle mile, and last mile operations.

An increasing number of companies are using data streaming as the event-driven control tower for their supply chains. It’s not only about real-time insights—it’s also about ensuring consistent data across real-time messaging, HTTP APIs, and batch systems. Learn more in this article: A Real-Time Supply Chain Control Tower powered by Kafka.

Supply Chain Control Tower powered by Data Streaming with Apache Kafka

Penske Logistics: A Leader in Transportation, Fleet Services, and Supply Chain Innovation

Penske Transportation Solutions is one of North America’s most recognizable logistics brands. It provides commercial truck leasing, rental, and fleet maintenance services, operating a fleet of over 400,000 vehicles. Its logistics arm offers freight management, supply chain optimization, and warehousing for enterprise customers.

Penske Logistics
Source: Penske Logistics

But Penske is more than a fleet and logistics company. It’s a data-driven operation where technology plays a central role in service delivery. From vehicle telematics to customer support, Penske is leveraging data streaming and AI to meet growing demands for reliability, transparency, and speed.

Penske’s Data Streaming Success Story

Penske shared its data streaming journey at the Confluent Data in Motion Tour. Sarvant Singh, Vice President of Data and Emerging Solutions at Penske, explains the company’s motivation clearly: “We’re an information-intense business. A lot of information is getting exchanged between our customers, associates, and partners. In our business, vehicle uptime and supply chain visibility are critical.”

This focus on uptime is what drove Penske to adopt a real-time data streaming platform, powered by Confluent. Today, Penske ingests and processes around 190 million IoT messages every day from its vehicles.

Each truck contains hundreds of sensors (and thousands of sub-sensors) that monitor everything from engine performance to braking systems. With this volume of data, traditional architectures fell short. Penske turned to Confluent Cloud to leverage Apache Kafka at scale as a fully managed, elastic SaaS, eliminating the operational burden and unlocking true real-time capabilities.

By streaming sensor data through Confluent and into a proactive diagnostics engine, Penske can now predict when a vehicle may fail—before the problem arises. Maintenance can be scheduled in advance, roadside breakdowns avoided, and customer deliveries kept on track.

This approach has already prevented over 90,000 potential roadside incidents. The business impact is enormous, saving time, money, and reputation.

Other real-time use cases include:

  • Diagnosing issues instantly to dispatch roadside assistance faster
  • Triggering preventive maintenance alerts to avoid unscheduled downtime
  • Automating compliance for IFTA reporting using telematics data
  • Streamlining repair workflows through integration with electronic DVIRs (Driver Vehicle Inspection Reports)

Why Confluent for Apache Kafka?

Managing Kafka in-house was never the goal for Penske. After initially working with a different provider, they transitioned to Confluent Cloud to avoid the complexity and cost of maintaining open-source Kafka themselves.

“We’re not going to put mission-critical applications on an open source tech,” Singh noted. “Enterprise-grade applications require enterprise level support—and Confluent’s business value has been clear.”

Key reasons for choosing Confluent include:

  • The ability to scale rapidly without manual rebalancing
  • Enterprise tooling, including stream governance and connectors
  • Seamless integration with AI and analytics engines
  • Reduced time to market and improved uptime

Data Streaming and AI in Action at Penske

Penske’s investment in AI began in 2015, long before it became a mainstream trend. Early use cases included Erica, a virtual assistant that helps customers manage vehicle reservations. Today, AI is being used to reduce repair times, predict failures, and improve customer service experiences.

By combining real-time data with machine learning, Penske can offer more reliable services and automate decisions that previously required human intervention. AI-enabled diagnostics, proactive maintenance, and conversational assistants are already delivering measurable benefits.

The company is also exploring the role of generative AI. Singh highlighted the potential of technologies like ChatGPT for enterprise applications—but also stressed the importance of controls: “Configuration for risk tolerance is going to be the key. Traceability, explainability, and anomaly detection must be built in.”

Fleet Intelligence in Action: Measurable Business Value Through Data Streaming

For a company operating hundreds of thousands of vehicles, the stakes are high. Penske’s real-time architecture has improved uptime, accelerated response times, and empowered technicians and drivers with better tools.

The business outcomes are clear:

  • Fewer breakdowns and delays
  • Faster resolution of vehicle issues
  • Streamlined operations and reporting
  • Better customer and driver experience
  • Scalable infrastructure for new services, including electric vehicle fleets

With 165,000 vehicles already connected to Confluent and more being added as EV adoption grows, Penske is just getting started.

The Road Ahead: Agentic AI and the Next Evolution of Event-Driven Architecture Powered By Apache Kafka

The future of logistics will be defined by intelligent, real-time systems that coordinate not just vehicles, but entire networks. As Penske scales its edge computing and expands its use of remote sensing and autonomous technologies, the role of data streaming will only increase.

Agentic AI—systems that act autonomously based on real-time context—will require seamless integration of telematics, edge analytics, and cloud intelligence. This demands a resilient, flexible event-driven foundation. I explored the general idea in a dedicated article: How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time.

Agentic AI with Apache Kafka as Event Broker Combined with MCP and A2A Protocol

Penske’s journey shows that real-time data streaming is not only possible—it’s practical, scalable, and deeply transformative. The combination of a data streaming platform, sensor analytics, and AI allows the company to turn every vehicle into a smart, connected node in a global supply chain.

For logistics providers seeking to modernize, the path is clear. It starts with streaming data—and the possibilities grow from there. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker
https://www.kai-waehner.de/blog/2025/05/26/agentic-ai-with-the-agent2agent-protocol-a2a-and-mcp-using-apache-kafka-as-event-broker/
Mon, 26 May 2025 05:32:01 +0000

Agentic AI is emerging as a powerful pattern for building autonomous, intelligent, and collaborative systems. To move beyond isolated models and task-based automation, enterprises need a scalable integration architecture that supports real-time interaction, coordination, and decision-making across agents and services. This blog explores how the combination of Apache Kafka, Model Context Protocol (MCP), and Google’s Agent2Agent (A2A) protocol forms the foundation for Agentic AI in production. By replacing point-to-point APIs with event-driven communication as the integration layer, enterprises can achieve decoupling, flexibility, and observability—unlocking the full potential of AI agents in modern enterprise environments.

Agentic AI is gaining traction as a design pattern for building more intelligent, autonomous, and collaborative systems. Unlike traditional task-based automation, agentic AI involves intelligent agents that operate independently, make contextual decisions, and collaborate with other agents or systems—across domains, departments, and even enterprises.

In the enterprise world, agentic AI is more than just a technical concept. It represents a shift in how systems interact, learn, and evolve. But unlocking its full potential requires more than AI models and point-to-point APIs—it demands the right integration backbone.

That’s where Apache Kafka as an event broker for true decoupling comes into play, together with two emerging AI standards: Google’s Agent2Agent (A2A) Protocol and Anthropic’s Model Context Protocol (MCP), in an enterprise architecture for Agentic AI.

Agentic AI with Apache Kafka as Event Broker Combined with MCP and A2A Protocol

Inspired by my colleague Sean Falconer’s blog post, Why Google’s Agent2Agent Protocol Needs Apache Kafka, this blog post explores Agentic AI adoption in enterprises and how an event-driven architecture with Apache Kafka fits into the AI architecture.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various AI examples across industries.

Business Value of Agentic AI in the Enterprise

For enterprises, the promise of agentic AI is compelling:

  • Smarter automation through self-directed, context-aware agents
  • Improved customer experience with faster and more personalized responses
  • Operational efficiency by connecting internal and external systems more intelligently
  • Scalable B2B interactions that span suppliers, partners, and digital ecosystems

But none of this works if systems are coupled by brittle point-to-point APIs, slow batch jobs, or disconnected data pipelines. Autonomous agents need continuous, real-time access to events, shared state, and a common communication fabric that scales across use cases.

Model Context Protocol (MCP) + Agent2Agent (A2A): New Standards for Agentic AI

The Model Context Protocol (MCP), introduced by Anthropic, offers a standardized, model-agnostic interface for context exchange between AI agents and external systems. Whether the interaction is streaming, batch, or API-based, MCP abstracts how agents retrieve inputs, send outputs, and trigger actions across services. This enables real-time coordination between models and tools—improving autonomy, reusability, and interoperability in distributed AI systems.

Model Context Protocol MCP by Anthropic
Source: Anthropic

Google’s Agent2Agent (A2A) protocol complements this by defining how autonomous software agents can interact with one another in a standard way. A2A enables scalable agent-to-agent collaboration—where agents discover each other, share state, and delegate tasks without predefined integrations. It’s foundational for building open, multi-agent ecosystems that work across departments, companies, and platforms.

Agent2Agent A2A Protocol by Google and MCP
Source: Google

Why Apache Kafka Is a Better Fit Than an API (HTTP/REST) for A2A and MCP

Most enterprises today use HTTP-based APIs to connect services—ideal for simple, synchronous request-response interactions.

In contrast, Apache Kafka is a distributed event streaming platform designed for asynchronous, high-throughput, and loosely coupled communication—making it a much better fit for multi-agent (A2A) and agentic AI architectures.

API-Based Integration              | Kafka-Based Integration
-----------------------------------|------------------------------------------------
Synchronous, blocking              | Asynchronous, event-driven
Point-to-point coupling            | Loose coupling with pub/sub topics
Hard to scale to many agents       | Supports multiple consumers natively
No shared memory                   | Kafka retains and replays event history
Limited observability              | Full traceability with schema registry & DLQs

Kafka serves as the decoupling layer. It becomes the place where agents publish their state, subscribe to updates, and communicate changes—independently and asynchronously. This enables multi-agent coordination, resilience, and extensibility.

MCP + Kafka = Open, Flexible Communication

As the adoption of Agentic AI accelerates, there’s a growing need for scalable communication between AI agents, services, and operational systems. The Model Context Protocol (MCP) is emerging as a standard to structure these interactions—defining how agents access tools, send inputs, and receive results. But a protocol alone doesn’t solve the challenges of integration, scaling, or observability.

This is where Apache Kafka comes in.

By combining MCP with Kafka, agents can interact through a Kafka topic—fully decoupled, asynchronous, and in real time. Instead of direct, synchronous calls between agents and services, all communication happens through Kafka topics, using structured events based on the MCP format.
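
A minimal sketch of that idea: an agent publishes a structured request event to a Kafka topic instead of calling a tool endpoint directly. The topic names and envelope fields below are assumptions for illustration and do not represent the official MCP wire format.

```python
# Minimal sketch: an agent emits an MCP-style request as an event instead of a direct RPC.
# Envelope fields and topic names are assumptions for illustration only.
import json
import uuid
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

request_event = {
    "envelope": "mcp-request",                # illustrative, not the official MCP schema
    "request_id": str(uuid.uuid4()),
    "agent": "fraud-detection-agent",         # hypothetical agent name
    "tool": "customer_lookup",                # hypothetical tool the agent wants to invoke
    "arguments": {"customer_id": "C-1001"},
    "reply_topic": "agent.responses",         # where the serving side should publish results
}

producer.produce(
    topic="agent.requests",                   # assumed topic name
    key=request_event["request_id"],
    value=json.dumps(request_event),
)
producer.flush()
# A tool-serving consumer (not shown) would process the request asynchronously and
# publish its result to the reply topic, fully decoupled from the requesting agent.
```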

This model supports a wide range of implementations and tech stacks. For instance:

  • A Python-based AI agent deployed in a SaaS environment
  • A Spring Boot Java microservice running inside a transactional core system
  • A Flink application deployed at the edge performing low-latency stream processing
  • An API gateway translating HTTP requests into MCP-compliant Kafka events

Regardless of where or how an agent is implemented, it can participate in the same event-driven system. Kafka ensures durability, replayability, and scalability. MCP provides the semantic structure for requests and responses.

Agentic AI with Apache Kafka as Event Broker

The result is a highly flexible, loosely coupled architecture for Agentic AI—one that supports real-time processing, cross-system coordination, and long-term observability. This combination is already being explored in early enterprise projects and will be a key building block for agent-based systems moving into production.

Stream Processing as the Agent’s Companion

Stream processing technologies like Apache Flink or Kafka Streams allow agents to:

  • Filter, join, and enrich events in motion
  • Maintain stateful context for decisions (e.g., real-time credit risk)
  • Trigger new downstream actions based on complex event patterns
  • Apply AI directly within the stream processing logic, enabling real-time inference and contextual decision-making with embedded models or external calls to a model server, vector database, or any other AI platform

Agents don’t need to manage all logic themselves. The data streaming platform can pre-process information, enforce policies, and even trigger fallback or compensating workflows—making agents simpler and more focused.
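
The sketch below illustrates the last item of the list above, applying a model directly inside the stream processing logic, using PyFlink with a Python UDF as a stand-in for real model inference. It relies on the built-in datagen and print connectors so it runs standalone; the field names and scoring rule are assumptions.

```python
# Minimal PyFlink sketch: score events inside the stream with a Python UDF and
# keep only high-risk events. Runs standalone via datagen/print connectors.
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Synthetic source standing in for a Kafka topic of transaction events.
t_env.execute_sql("""
    CREATE TABLE transactions (
        account_id INT,
        amount DOUBLE
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5',
        'fields.account_id.min' = '1',
        'fields.account_id.max' = '10',
        'fields.amount.min' = '1',
        'fields.amount.max' = '10000'
    )
""")

@udf(result_type=DataTypes.DOUBLE())
def risk_score(amount: float) -> float:
    # Placeholder for embedded model inference or a call to a remote model server.
    return min(amount / 10000.0, 1.0)

t_env.create_temporary_function("risk_score", risk_score)

t_env.execute_sql("""
    CREATE TABLE scored (account_id INT, amount DOUBLE, score DOUBLE)
    WITH ('connector' = 'print')
""")

# Only high-risk events flow downstream; everything else is filtered out in the stream.
# The job runs continuously until cancelled.
t_env.execute_sql("""
    INSERT INTO scored
    SELECT account_id, amount, risk_score(amount)
    FROM transactions
    WHERE risk_score(amount) > 0.8
""").wait()
```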

Technology Flexibility for Agentic AI Design with Data Contracts

One of the biggest advantages of a Kafka-based, event-driven, and decoupled backend for agentic systems is that agents can be implemented in any stack:

  • Languages: Python, Java, Go, etc.
  • Environments: Containers, serverless, JVM apps, SaaS tools
  • Communication styles: Event streaming, REST APIs, scheduled jobs

The Kafka topic is the stable data contract for quality and policy enforcement. Agents can evolve independently, be deployed incrementally, and interoperate without tight dependencies.
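
As a small sketch of the data contract idea, the snippet below registers an Avro schema for a topic's value subject in Confluent Schema Registry. The subject name, fields, and registry URL are assumptions.

```python
# Minimal sketch: register an Avro schema as the data contract for a topic's values.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

schema_str = """
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

client = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed registry URL
schema_id = client.register_schema("orders.v1-value", Schema(schema_str, "AVRO"))
print(f"Registered schema id {schema_id} for subject orders.v1-value")
# Producers and consumers using Avro serializers are now validated against this contract,
# and incompatible schema changes are rejected by the registry's compatibility checks.
```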

Microservices, Data Products, and Reusability – Agentic AI Is Just One Piece of the Puzzle

To be effective, Agentic AI needs to connect seamlessly with existing operational systems and business workflows.

Kafka topics enable the creation of reusable data products that serve multiple consumers—AI agents, dashboards, services, or external partners. This aligns perfectly with data mesh and microservice principles, where ownership, scalability, and interoperability are key.

Agent2Agent Protocol (A2A) and MCP via Apache Kafka as Event Broker for Truly Decoupled Agentic AI

A single stream of enriched order events might be consumed via a single data product by:

  • A fraud detection agent
  • A real-time alerting system
  • An agent triggering SAP workflow updates
  • A lakehouse for reporting and batch analytics

This one-to-many model is the opposite of traditional REST designs and crucial for enabling agentic orchestration at scale.
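
A brief sketch of the one-to-many pattern: two consumers read the same (assumed) enriched-order topic under different consumer group ids, so each receives the full stream independently and no fan-out logic is needed on the producer side.

```python
# Minimal sketch: one topic, multiple independent consumer groups.
import json
from confluent_kafka import Consumer

def build_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,                  # distinct group => independent offsets
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders.enriched"])    # assumed data product topic
    return consumer

fraud_consumer = build_consumer("fraud-detection-agent")
alerting_consumer = build_consumer("real-time-alerting")

# Each group tracks its own position, so the fraud agent and the alerting system
# both see every event without any coordination in the producer.
for name, consumer in (("fraud", fraud_consumer), ("alerting", alerting_consumer)):
    msg = consumer.poll(5.0)
    if msg is not None and not msg.error():
        print(name, json.loads(msg.value()))
    consumer.close()
```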

Agentic AI Needs Integration with Core Enterprise Systems

Agentic AI is not a standalone trend—it’s becoming an integral part of broader enterprise AI strategies. While this post focuses on architectural foundations like Kafka, MCP, and A2A, it’s important to recognize how this infrastructure complements the evolution of major AI platforms.

Leading vendors such as Databricks, Snowflake, and others are building scalable foundations for machine learning, analytics, and generative AI. These platforms often handle model training and serving. But to bring agentic capabilities into production—especially for real-time, autonomous workflows—they must connect with operational, transactional systems and other agents at runtime. (See also: Confluent + Databricks blog series | Apache Kafka + Snowflake blog series)

This is where Kafka as the event broker becomes essential: it links these analytical backends with AI agents, transactional systems, and streaming pipelines across the enterprise.

At the same time, enterprise application vendors are embedding AI assistants and agents directly into their platforms:

  • SAP Joule / Business AI – Embedded AI for finance, supply chain, and operations
  • Salesforce Einstein / Copilot Studio – Generative AI for CRM and sales automation
  • ServiceNow Now Assist – Predictive automation across IT and employee services
  • Oracle Fusion AI / OCI – ML for ERP, HCM, and procurement
  • Microsoft Copilot – Integrated AI across Dynamics and Power Platform
  • IBM watsonx, Adobe Sensei, Infor Coleman AI – Governed, domain-specific AI agents

Each of these solutions benefits from the same architectural foundation: real-time data access, decoupled integration, and standardized agent communication.

Whether deployed internally or sourced from vendors, agents need reliable event-driven infrastructure to coordinate with each other and with backend systems. Apache Kafka provides this core integration layer—supporting a consistent, scalable, and open foundation for agentic AI across the enterprise.

Agentic AI Requires Decoupling – Apache Kafka Supports A2A and MCP as an Event Broker

To deliver on the promise of agentic AI, enterprises must move beyond point-to-point APIs and batch integrations. They need a shared, event-driven foundation that enables agents (and other enterprise software) to work independently and together—with shared context, consistent data, and scalable interactions.

Apache Kafka provides exactly that. Combined with MCP and A2A for standardized Agentic AI communication, Kafka unlocks the flexibility, resilience, and openness needed for next-generation enterprise AI.

It’s not about picking one agent platform—it’s about giving every agent the same, reliable interface to the rest of the world. Kafka is that interface.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various AI examples across industries.

Databricks and Confluent in the World of Enterprise Software (with SAP as Example)
https://www.kai-waehner.de/blog/2025/05/12/databricks-and-confluent-in-the-world-of-enterprise-software-with-sap-as-example/
Mon, 12 May 2025 11:26:54 +0000

Enterprise data lives in complex ecosystems—SAP, Oracle, Salesforce, ServiceNow, IBM Mainframes, and more. This article explores how Confluent and Databricks integrate with SAP to bridge operational and analytical workloads in real time. It outlines architectural patterns, trade-offs, and use cases like supply chain optimization, predictive maintenance, and financial reporting, showing how modern data streaming unlocks agility, reuse, and AI-readiness across even the most SAP-centric environments.

Modern enterprises rely heavily on operational systems like SAP ERP, Oracle, Salesforce, ServiceNow and mainframes to power critical business processes. But unlocking real-time insights and enabling AI at scale requires bridging these systems with modern analytics platforms like Databricks. This blog explores how Confluent’s data streaming platform enables seamless integration between SAP, Databricks, and other systems to support real-time decision-making, AI-driven automation, and agentic AI use cases. It also shows how Confluent delivers the real-time backbone needed to build event-driven, future-proof enterprise architectures—supporting everything from inventory optimization and supply chain intelligence to embedded copilots and autonomous agents.

Enterprise Application Integration with Confluent and Databricks for Oracle, SAP, Salesforce, ServiceNow et al

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to other operational and analytical platforms like SAP and Databricks.

Most Enterprise Data Is Operational

Enterprise software systems generate a constant stream of operational data across a wide range of domains. This includes orders and inventory from SAP ERP systems, often extended with real-time production data from SAP MES. Oracle databases capture transactional data critical to core business operations, while MongoDB contributes operational data—frequently used as a CDC source or, in some cases, as a sink for analytical queries. Customer interactions are tracked in platforms like Salesforce CRM, and financial or account-related events often originate from IBM mainframes. 

Together, these systems form the backbone of enterprise data, requiring seamless integration for real-time intelligence and business agility. This data is often not immediately available for analytics or AI unless it’s integrated into downstream systems.

Confluent is built to ingest and process this kind of operational data in real time. Databricks can then consume it for AI and machine learning, dashboards, or reports. Together, SAP, Confluent and Databricks create a real-time architecture for enterprise decision-making.

SAP Product Landscape for Operational and Analytical Workloads

SAP plays a foundational role in the enterprise data landscape—not just as a source of business data, but as the system of record for core operational processes across finance, supply chain, HR, and manufacturing.

At a high level, the SAP product portfolio today has three categories: SAP Business AI, SAP Business Data Cloud (BDC), and SAP Business Applications powered by SAP Business Technology Platform (BTP).

SAP Product Portfolio Categories
Source: SAP

To support both operational and analytical needs, SAP offers a portfolio of platforms and tools, while also partnering with best-in-class technologies like Databricks and Confluent.

Operational Workloads (Transactional Systems):

  • SAP S/4HANA – Modern ERP for core business operations
  • SAP ECC – Legacy ERP platform still widely deployed
  • SAP CRM / SCM / SRM – Domain-specific business systems
  • SAP Business One / Business ByDesign – ERP solutions for mid-market and subsidiaries

Analytical Workloads (Data & Analytics Platforms):

  • SAP Datasphere – Unified data fabric to integrate, catalog, and govern SAP and non-SAP data
  • SAP Analytics Cloud (SAC) – Visualization, reporting, and predictive analytics
  • SAP BW/4HANA – Data warehousing and modeling for SAP-centric analytics

SAP Business Data Cloud (BDC)

SAP Business Data Cloud (BDC) is a strategic initiative within SAP Business Technology Platform (BTP) that brings together SAP’s data and analytics capabilities into a unified cloud-native experience. It includes:

  • SAP Datasphere as the data fabric layer, enabling seamless integration of SAP and third-party data
  • SAP Analytics Cloud (SAC) for consuming governed data via dashboards and reports
  • SAP’s partnership with Databricks to allow SAP data to be analyzed alongside non-SAP sources in a lakehouse architecture
  • Real-time integration scenarios enabled through Confluent and Apache Kafka, bringing operational data in motion directly into SAP and Databricks environments

Together, this ecosystem supports real-time, AI-powered, and governed analytics across operational and analytical workloads—making SAP data more accessible, trustworthy, and actionable within modern cloud data architectures.

SAP Databricks OEM: Limited Scope, Full Control by SAP

SAP recently announced an OEM partnership with Databricks, embedding parts of Databricks’ serverless infrastructure into the SAP ecosystem. While this move enables tighter integration and simplified access to AI workloads within SAP, it comes with significant trade-offs. The OEM model is narrowly scoped, optimized primarily for ML and GenAI scenarios on SAP data, and lacks the openness and flexibility of native Databricks.

This integration is not intended for full-scale data engineering. Core capabilities such as workflows, streaming, Delta Live Tables, and external data connections (e.g., Snowflake, S3, MS SQL) are missing. The architecture is based on data at rest and does not embrace event-driven patterns. Compute options are limited to serverless only, with no infrastructure control. Pricing is complex and opaque, with customers often needing to license Databricks separately to unlock full capabilities.

Critically, SAP controls the entire data integration layer through its BDC Data Products, reinforcing a vendor lock-in model. While this may benefit SAP-centric organizations focused on embedded AI, it restricts broader interoperability and long-term architectural flexibility. In contrast, native Databricks, i.e., outside of SAP, offers a fully open, scalable platform with rich data engineering features across diverse environments.

Whichever Databricks option you prefer, this is where Confluent adds value—offering a truly event-driven, decoupled architecture that complements both SAP Datasphere and Databricks, whether used within or outside the SAP OEM framework.

Confluent and SAP Integration

Confluent provides native and third-party connectors to integrate with SAP systems to enable continuous, low-latency data flow across business applications.

SAP ERP Confluent Data Streaming Integration Access Patterns
Source: Confluent

This powers modern, event-driven use cases that go beyond traditional batch-based integrations:

  • Low-latency access to SAP transactional data
  • Integration with other operational source systems like Salesforce, Oracle, IBM Mainframe, MongoDB, or IoT platforms
  • Synchronization between SAP Datasphere and other data warehouse and analytics platforms such as Snowflake, Google BigQuery, or Databricks
  • Decoupling of applications for modular architecture
  • Data consistency across real-time, batch and request-response APIs
  • Hybrid integration across any edge, on-premise or multi-cloud environments

SAP Datasphere and Confluent

To expand its role in the modern data stack, SAP introduced SAP Datasphere—a cloud-native data management solution designed to extend SAP’s reach into analytics and data integration. Datasphere aims to simplify access to SAP and non-SAP data across hybrid environments.

SAP Datasphere simplifies data access within the SAP ecosystem, but it has key drawbacks when compared to open platforms like Databricks, Snowflake, or Google BigQuery:

  • Closed Ecosystem: Optimized for SAP, but lacks flexibility for non-SAP integrations.
  • No Event Streaming: Focused on data at rest, with limited support for real-time processing or streaming architectures.
  • No Native Stream Processing: Relies on batch methods, adding latency and complexity for hybrid or real-time use cases.

Confluent alleviates these drawbacks and supports this strategy through bi-directional integration with SAP Datasphere. This enables real-time streaming of SAP data into Datasphere and back out to operational or analytical consumers via Apache Kafka. It allows organizations to enrich SAP data, apply real-time processing, and ensure it reaches the right systems in the right format—without waiting for overnight batch jobs or rigid ETL pipelines.

Confluent for Agentic AI with SAP Joule and Databricks

SAP is laying the foundation for agentic AI architectures with a vision centered around Joule—its generative AI copilot—and a tightly integrated data stack that includes SAP Databricks (via OEM), SAP Business Data Cloud (BDC), and a unified knowledge graph. On top of this foundation, SAP is building specialized AI agents for use cases such as customer 360, creditworthiness analysis, supply chain intelligence, and more.

SAP ERP with Business Technology Platform BTP and Joule for Agentic AI in the Cloud
Source: SAP

The architecture combines:

  • SAP Joule as the interface layer for generative insights and decision support
  • SAP’s foundational models and domain-specific knowledge graph
  • SAP BDC and SAP Databricks as the data and ML/AI backbone
  • Data from both SAP systems (ERP, CRM, HR, logistics) and non-SAP systems (e.g. clickstream, IoT, partner data, social media) from its partnership with Confluent

But here’s the catch: What happens when agents need to communicate with one another to deliver a workflow? Such agentic systems require continuous, contextual, and event-driven data exchange—not just point-to-point API calls and nightly batch jobs.

This is where Confluent’s data streaming platform comes in as critical infrastructure.

Agentic AI with Apache Kafka as Event Broker

Confluent provides the real-time data streaming platform that connects the operational world of SAP with the analytical and AI-driven world of Databricks, enabling the continuous movement, enrichment, and sharing of data across all layers of the stack.

Agentic AI with Confluent as Event Broker for Databricks SAP and Oracle

The above is a conceptual view of the architecture. The AI agents on the left side could be built with SAP Joule, Databricks, or any “outside” GenAI framework.

The data streaming platform helps connect the AI agents with the rest of the enterprise architecture, within SAP and Databricks but also beyond:

  • Real-time data integration from non-SAP systems (e.g., mobile apps, IoT devices, mainframes, web logs) into SAP and Databricks
  • True decoupling of services and agents via an event-driven architecture (EDA), replacing brittle RPC or point-to-point API calls
  • Event replay and auditability—critical for traceable AI systems operating in regulated environments
  • Streaming pipelines for feature engineering and inference: stream-based model triggering with low-latency SLAs
  • Support for bi-directional flows: e.g., operational triggers in SAP can be enriched by AI agents running in Databricks and pushed back into SAP via Kafka events

Without Confluent, SAP’s agentic architecture risks becoming a patchwork of stateless services bound by fragile REST endpoints—lacking the real-time responsiveness, observability, and scalability required to truly support next-generation AI orchestration.

Confluent turns the SAP + Databricks vision into a living, breathing ecosystem—where context flows continuously, agents act autonomously, and enterprises can build future-proof AI systems that scale.

Data Streaming Use Cases Across SAP Product Suites

With Confluent, organizations can support a wide range of use cases across SAP product suites, including:

  1. Real-Time Inventory Visibility: Live updates of stock levels across warehouses and stores by streaming material movements from SAP ERP and SAP EWM, enabling faster order fulfillment and reduced stockouts.
  2. Dynamic Pricing and Promotions: Stream sales orders and product availability in real time to trigger pricing adjustments or dynamic discounting via integration with SAP ERP and external commerce platforms.
  3. AI-Powered Supply Chain Optimization: Combine data from SAP ERP, SAP Ariba, and external logistics platforms to power ML models that predict delays, optimize routes, and automate replenishment.
  4. Shop Floor Event Processing: Stream sensor and machine data alongside order data from SAP MES, enabling real-time production monitoring, alerting, and throughput optimization.
  5. Employee Lifecycle Automation: Stream employee events (e.g., onboarding, role changes) from SAP SuccessFactors to downstream IT systems (e.g., Active Directory, badge systems), improving HR operations and compliance.
  6. Order-to-Cash Acceleration: Connect order intake (via web portals or Salesforce) to SAP ERP in real time, enabling faster order validation, invoicing, and cash flow.
  7. Procure-to-Pay Automation: Integrate procurement events from SAP Ariba and supplier portals with ERP and financial systems to streamline approvals and monitor supplier performance continuously.
  8. Customer 360 and CRM Synchronization: Synchronize customer master data and transactions between SAP ERP, SAP CX, and third-party CRMs like Salesforce to enable unified customer views.
  9. Real-Time Financial Reporting: Stream financial transactions from SAP S/4HANA into cloud-based lakehouses or BI tools for near-instant reporting and compliance dashboards.
  10. Cross-System Data Consistency: Ensure consistent master data and business events across SAP and non-SAP environments by treating SAP as a real-time event source—not just a system of record.

Example Use Case and Architecture with SAP, Databricks and Confluent

Consider a manufacturing company using SAP ERP for inventory management and Databricks for predictive maintenance. The combination of SAP Datasphere and Confluent enables seamless data integration from SAP systems, while the addition of Databricks supports advanced AI/ML applications—turning operational data into real-time, predictive insights.

With Confluent as the real-time backbone:

  • Machine telemetry (via MQTT or OPC-UA) and ERP events (e.g., stock levels, work orders) are streamed in real time.
  • Apache Flink enriches and filters the event streams—adding context like equipment metadata or location.
  • Tableflow publishes clean, structured data to Databricks as Delta tables for analytics and ML processing.
  • A predictive model hosted in Databricks detects potential equipment failure before it happens; a Flink application calls the remote model with low latency.
  • The resulting prediction is streamed back to Kafka, triggering an automated work order in SAP via event integration.

Enterprise Architecture with Confluent and SAP and Databricks for Analytics and AI

This bi-directional, event-driven pattern illustrates how Confluent enables seamless, real-time collaboration across SAP, Databricks, and IoT systems—supporting both operational and analytical use cases with a shared architecture.
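
As a minimal sketch of the closing step in that loop, the snippet below consumes failure predictions and publishes a work-order event that an SAP-side integration (for example, a connector) is assumed to pick up. Topic names, fields, and the decision threshold are illustrative assumptions, not any specific implementation.

```python
# Minimal sketch: turn a streamed failure prediction into a work-order event.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "maintenance-orchestrator",      # assumed consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["maintenance.predictions"])  # assumed topic written by the ML pipeline
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        prediction = json.loads(msg.value())
        if prediction["failure_probability"] > 0.9:   # hypothetical decision threshold
            work_order = {
                "equipment_id": prediction["equipment_id"],
                "reason": prediction["predicted_failure_mode"],
                "priority": "HIGH",
            }
            # The event lands on a topic that the SAP-side integration subscribes to.
            producer.produce("sap.work-orders", key=work_order["equipment_id"],
                             value=json.dumps(work_order))
            producer.flush()
finally:
    consumer.close()
```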

Going Beyond SAP with Data Streaming

This pattern applies to other enterprise systems:

  • Salesforce: Stream customer interactions for real-time personalization through Salesforce Data Cloud
  • Oracle: Capture transactions via CDC (Change Data Capture)
  • ServiceNow: Monitor incidents and automate operational responses
  • Mainframe: Offload events from legacy applications without rewriting code
  • MongoDB: Sync operational data in real time to support responsive apps
  • Snowflake: Stream enriched operational data into Snowflake for near real-time analytics, dashboards, and data sharing across teams and partners
  • OpenAI (or other GenAI platforms): Feed real-time context into LLMs for AI-assisted recommendations or automation
  • “You name it”: Confluent’s prebuilt connectors and open APIs enable event-driven integration with virtually any enterprise system

Confluent provides the backbone for streaming data across all of these platforms—securely, reliably, and in real time.

Strategic Value for the Enterprise of Event-based Real-Time Integration with Data Streaming

Enterprise software platforms are essential. But they are often closed, slow to change, and not designed for analytics or AI.

Confluent provides real-time access to operational data from platforms like SAP. SAP Datasphere and Databricks enable analytics and AI on that data. Together, they support modern, event-driven architectures.

  • Use Confluent for real-time data streaming from SAP and other core systems
  • Use SAP Datasphere and Databricks to build analytics, reports, and AI on that data
  • Use Tableflow to connect the two platforms seamlessly

This modern approach to data integration delivers tangible business value, especially in complex enterprise environments. It enables real-time decision-making by allowing business logic to operate on live data instead of outdated reports. Data products become reusable assets, as a single stream can serve multiple teams and tools simultaneously. By reducing the need for batch layers and redundant processing, the total cost of ownership (TCO) is significantly lowered. The architecture is also future-proof, making it easy to integrate new systems, onboard additional consumers, and scale workflows as business needs evolve.

Beyond SAP: Enabling Agentic AI Across the Enterprise

The same architectural discussion applies across the enterprise software landscape. As vendors embed AI more deeply into their platforms, the effectiveness of these systems increasingly depends on real-time data access, continuous context propagation, and seamless interoperability.

Without an event-driven foundation, AI agents remain limited—trapped in siloed workflows and brittle API chains. Confluent provides the scalable, reliable backbone needed to enable true agentic AI in complex enterprise environments.

Examples of AI solutions driving this evolution include:

  • SAP Joule / Business AI – Context-aware agents and embedded AI across ERP, finance, and supply chain
  • Salesforce Einstein / Copilot Studio – Generative AI for CRM, service, and marketing automation built on top of Salesforce Data Cloud
  • ServiceNow Now Assist – Intelligent workflows and predictive automation in ITSM and Ops
  • Oracle Fusion AI / OCI AI Services – Embedded machine learning in ERP, HCM, and SCM
  • Microsoft Copilot (Dynamics / Power Platform) – AI copilots across business and low-code apps
  • Workday AI – Smart recommendations for finance, workforce, and HR planning
  • Adobe Sensei GenAI – GenAI for content creation and digital experience optimization
  • IBM watsonx – Governed AI foundation for enterprise use cases and data products
  • Infor Coleman AI – Industry-specific AI for supply chain and manufacturing systems
  • All the “traditional” cloud providers and data platforms such as Snowflake with Cortex, Microsoft Azure Fabric, AWS SageMaker, AWS Bedrock, and GCP Vertex AI

Each of these platforms benefits from a streaming-first architecture that enables real-time decisions, reusable data, and smarter automation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to other operational and analytical platforms like SAP and Databricks.

Shift Left Architecture for AI and Analytics with Confluent and Databricks
https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/
Fri, 09 May 2025 06:03:07 +0000

Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning

Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.
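
To illustrate the validate-or-dead-letter idea in isolation, here is a minimal Python sketch that checks each record against a JSON Schema and routes violations to a DLQ topic. In the architecture described above this logic would typically run inside Flink with Schema Registry; all topic and field names here are assumptions.

```python
# Minimal sketch of contract enforcement: validate each record against a JSON Schema
# and route violations to a dead letter queue topic.
import json
from confluent_kafka import Consumer, Producer
from jsonschema import Draft7Validator

ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "amount"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
}
validator = Draft7Validator(ORDER_SCHEMA)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-enforcer",           # assumed consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.raw"])             # assumed input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        errors = list(validator.iter_errors(record))
        # Valid records continue through the pipeline; violations go to the DLQ.
        target = "orders.validated" if not errors else "orders.dlq"
        producer.produce(target, value=json.dumps(record))
        producer.flush()
finally:
    consumer.close()
```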

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.
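
A small PyFlink sketch of such a rolling metric: a per-key average over tumbling processing-time windows, using the built-in datagen and print connectors so it runs without external systems. Field names and the window size are assumptions.

```python
# Minimal PyFlink sketch: rolling per-key averages over tumbling processing-time windows.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Synthetic source standing in for a Kafka topic of sensor readings.
t_env.execute_sql("""
    CREATE TABLE readings (
        vehicle_id INT,
        engine_temp DOUBLE,
        proc_time AS PROCTIME()
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '10',
        'fields.vehicle_id.min' = '1',
        'fields.vehicle_id.max' = '5',
        'fields.engine_temp.min' = '60',
        'fields.engine_temp.max' = '130'
    )
""")

t_env.execute_sql("""
    CREATE TABLE avg_temps (
        vehicle_id INT,
        window_start TIMESTAMP(3),
        avg_temp DOUBLE
    ) WITH ('connector' = 'print')
""")

# One result row per vehicle per 10-second window: a continuously updated metric
# available long before any batch job would run. The job runs until cancelled.
t_env.execute_sql("""
    INSERT INTO avg_temps
    SELECT vehicle_id,
           TUMBLE_START(proc_time, INTERVAL '10' SECOND) AS window_start,
           AVG(engine_temp) AS avg_temp
    FROM readings
    GROUP BY vehicle_id, TUMBLE(proc_time, INTERVAL '10' SECOND)
""").wait()
```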

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.

Streaming ETL with Apache Flink SQL

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP and others)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateways, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A shift left architecture only with Databricks is possible, too. A Databricks consultant took my Shift Left slide and adjusted it that way:

Shift Left Architecture with Databricks and Delta Lake

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads stay (and are meant to stay) within the platform — but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools – a clear anti-pattern in enterprise architecture – just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

The post Shift Left Architecture for AI and Analytics with Confluent and Databricks appeared first on Kai Waehner.

]]>
The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company) https://www.kai-waehner.de/blog/2025/05/02/the-past-present-and-future-of-confluent-the-kafka-company-and-databricks-the-spark-company/ Fri, 02 May 2025 07:10:42 +0000 https://www.kai-waehner.de/?p=7755 Confluent and Databricks have redefined modern data architectures, growing beyond their Kafka and Spark roots. Confluent drives real-time operational workloads; Databricks powers analytical and AI-driven applications. As operational and analytical boundaries blur, native integrations like Tableflow and Delta Lake unify streaming and batch processing across hybrid and multi-cloud environments. This blog explores the platforms’ evolution and how, together, they enable enterprises to build scalable, data-driven architectures. The Michelin success story shows how combining real-time data and AI unlocks innovation and resilience.

The post The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company) appeared first on Kai Waehner.

]]>
Confluent and Databricks are two of the most influential platforms in modern data architectures. Both have roots in open source. Both focus on enabling organizations to work with data at scale. And both have expanded their mission well beyond their original scope.

Confluent and Databricks are often described as serving different parts of the data architecture—real-time vs. batch, operational vs. analytical, data streaming vs. artificial intelligence (AI). But the lines are not always clear. Confluent can run batch workloads and embed AI. Databricks can handle (near) real-time pipelines. With Flink, Confluent supports both operational and analytical processing. Databricks can run operational workloads, too—if latency, availability, and delivery guarantees meet the project’s requirements. 

This blog explores where these platforms came from, where they are now, how they complement each other in modern enterprise architectures—and why their roles are future-proof in a data- and AI-driven world.

Data Streaming and Lakehouse - Comparison of Confluent with Apache Kafka and Flink and Databricks with Spark

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

Operational vs. Analytical Workloads

Confluent and Databricks were designed for different workloads, but the boundaries are not always strict.

Confluent was built for operational workloads—moving and processing data in real time as it flows through systems. This includes use cases like real-time payments, fraud detection, system monitoring, and streaming pipelines.

Databricks focuses on analytical workloads—enabling large-scale data processing, machine learning, and business intelligence.

That said, there is no clear black and white separation. Confluent, especially with the addition of Apache Flink, can support analytical processing on streaming data. Databricks can handle operational workloads too, provided the SLAs—such as latency, uptime, and delivery guarantees—are sufficient for the use case.

With Tableflow and Delta Lake, both platforms can now be natively connected, allowing real-time operational data to flow into analytical environments, and AI insights to flow back into real-time systems—effectively bridging operational and analytical workloads in a unified architecture.

From Apache Kafka and Spark to (Hybrid) Cloud Platforms

Confluent and Databricks both have strong open source roots—Kafka and Spark, respectively—but have taken different branding paths.

Confluent: From Apache Kafka to a Data Streaming Platform (DSP)

Confluent is well known as “The Kafka Company.” It was founded by the original creators of Apache Kafka over ten years ago. Kafka is now widely adopted for event streaming in over 150,000 organizations worldwide. Confluent operates tens of thousands of clusters with Confluent Cloud across all major cloud providers, as well as in customers’ data centers and edge locations.

But Confluent has become much more than just Kafka. It offers a complete data streaming platform (DSP):

Confluent Data Streaming Platform (DSP) Powered by Apache Kafka and Flink
Source: Confluent

This includes:

  • Apache Kafka as the core messaging and persistence layer
  • Data integration via Kafka Connect for databases and business applications, a REST/HTTP proxy for request-response APIs, and clients for all relevant programming languages
  • Stream processing via Apache Flink and Kafka Streams (read more about the past, present and future of stream processing)
  • Tableflow for native integration with lakehouses that support the open table format standard via Delta Lake and Apache Iceberg
  • 24/7 SLAs, security, data governance, disaster recovery – for the most critical workloads companies run
  • Deployment options: Everywhere (not just cloud) – SaaS, on-prem, edge, hybrid, stretched across data centers, multi-cloud, BYOC (bring your own cloud)

Databricks: From Apache Spark to a Data Intelligence Platform

Databricks has followed a similar evolution. Known initially as “The Spark Company,” it is the original force behind Apache Spark. But Databricks no longer emphasizes Spark in its branding. Spark is still there under the hood, but it’s no longer the dominant story.

Today, it positions itself as the Data Intelligence Platform, focused on AI and analytics:

Databricks Data Intelligence Platform and Lakehouse
Source: Databricks

Key components include:

  • Fully cloud-native deployment model—Databricks is now a cloud-only platform providing BYOC and Serverless products
  • Delta Lake and Unity Catalog for table format standardization and governance
  • Model development and AI/ML tools
  • Data warehouse workloads
  • Tools for data scientists and data engineers

Together, Confluent and Databricks meet a wide range of enterprise needs and often complement each other in shared customer environments from the edge to multi-cloud data replication and analytics.

Real-Time vs. Batch Processing

A major point of comparison between Confluent and Databricks lies in how they handle data processing—real-time versus batch—and how they increasingly converge through shared formats and integrations.

Data Processing and Data Sharing “In Motion” vs. “At Rest”

A key difference between the platforms lies in how they process and share data.

Confluent focuses on data in motion—real-time streams that can be filtered, transformed, and shared across systems as they happen.

Databricks focuses on data at rest—data that has landed in a lakehouse, where it can be queried, aggregated, and used for analysis and modeling.

Data Streaming versus Lakehouse

Both platforms offer native capabilities for data sharing. Confluent provides Stream Sharing, which enables secure, real-time sharing of Kafka topics across organizations and environments. Databricks offers Delta Sharing, an open protocol for sharing data from Delta Lake tables with internal and external consumers.

In many enterprise architectures, the two vendors work together. Kafka and Flink handle continuous real-time processing for operational workloads and data ingestion into the lakehouse. Databricks handles AI workloads (model training and some of the model inference), business intelligence (BI), and reporting. Both do data integration: ETL on the Confluent side and ELT on the Databricks side.

Many organizations still use Databricks’ Apache Spark Structured Streaming to connect Kafka and Databricks. That’s a valid pattern, especially for teams with Spark expertise.

Flink is available as a serverless offering in Confluent Cloud that can scale down to zero when idle, yet remains highly scalable—even for complex stateful workloads. It supports multiple languages, including Python, Java, and SQL. 

For self-managed environments, Kafka Streams offers a lightweight alternative to running Flink in a self-managed Confluent Platform. But be aware that Kafka Streams is limited to Java and operates as a client library embedded directly within the application. Read my dedicated article to learn about the trade-offs between Apache Flink and Kafka Streams.

Stream and Batch Data Processing with Kafka Streams, Apache Flink and Spark

In short: use what works. If Spark Structured Streaming is already in place and meets your needs, keep it. But for new use cases, Apache Flink or Kafka Streams might be the better choice for stream processing workloads. Either way, make sure to understand the concepts and value of stateless and stateful stream processing before building new pipelines.

Confluent Tableflow: Unify Operational and Analytic Workloads with Open Table Formats (such as Apache Iceberg and Delta Lake)

Databricks is actively investing in Delta Lake and Unity Catalog to structure, govern, and secure data for analytical applications. The acquisition of Tabular—founded by the original creators of Apache Iceberg—demonstrates Databricks’ commitment to supporting open standards.

Confluent’s Tableflow materializes Kafka streams into Apache Iceberg or Delta Lake tables—automatically, reliably, and efficiently. This native integration between Confluent and Databricks is faster, simpler, and more cost-effective than using a Spark connector or other ETL tools.

Tableflow reads the Kafka segments, checks the schema against Schema Registry, and creates Parquet files and table metadata.

Confluent Tableflow Architecture to Integrate Apache Kafka with Iceberg and Delta Lake for Databricks
Source: Confluent

Native stream processing with Apache Flink also plays a growing role. It enables unified real-time and batch stream processing in a single engine. Flink’s ability to “shift left” data processing (closer to the source) supports early validation, enrichment, and transformation. This simplifies the architecture and reduces the need for always-on Spark clusters, which can drive up cost.

These developments highlight how Databricks and Confluent address different but complementary layers of the data ecosystem.

Confluent + Databricks = A Strategic Partnership for Future-Proof AI Architectures

Confluent and Databricks are not competing platforms—they’re complementary. While they serve different core purposes, there are areas where their capabilities overlap. In those cases, it’s less about which is better and more about which fits best for your architecture, team expertise, SLA or latency requirements. The real value comes from understanding how they work together and where you can confidently choose the platform that serves your use case most effectively.

Confluent and Databricks recently deepened their partnership with Tableflow integration with Delta Lake and Unity Catalog. This integration makes real-time Kafka data available inside Databricks as Delta tables. It reduces the need for custom pipelines and enables fast access to trusted operational data.

The architecture supports AI end to end—from ingesting real-time operational data to training and deploying models—all with built-in governance and flexibility. Importantly, data can originate from anywhere: mainframes, on-premise databases, ERP systems, IoT and edge environments or SaaS cloud applications.

With this setup, you can:

  • Feed data from 100+ Confluent sources (Mainframe, Oracle, SAP, Salesforce, IoT, HTTP/REST applications, and so on) into Delta Lake
  • Use Databricks for AI model development and business intelligence
  • Push models back into Kafka and Flink for real-time model inference with critical, operational SLAs and latency

Both directions will be supported. Governance and security metadata flows alongside the data.

Confluent and Databricks Partnership and Bidirectional Integration for AI and Analytics
Source: Confluent

Michelin: Real-Time Data Streaming and AI Innovation with Confluent and Databricks

A great example of how Confluent and Databricks complement each other in practice is Michelin’s digital transformation journey. As one of the world’s largest tire manufacturers, Michelin set out to become a data-first and digital enterprise. To achieve this, the company needed a foundation for real-time operational data movement and a scalable analytical platform to unlock business insights and drive AI initiatives.

Confluent @ Michelin: Real-Time Data Streaming Pipelines

Confluent Cloud plays a critical role at Michelin by powering real-time data pipelines across their global operations. Migrating from self-managed Kafka to Confluent Cloud on Microsoft Azure enabled Michelin to reduce operational complexity by 35%, meet strict 99.99% SLAs, and speed up time to market by up to nine months. Real-time inventory management, order orchestration, and event-driven supply chain processes are now possible thanks to a fully managed data streaming platform.

Databricks @ Michelin: Centralized Lakehouse

Meanwhile, Databricks empowers Michelin to democratize data access across the organization. By building a centralized lakehouse architecture, Michelin enabled business users and IT teams to independently access, analyze, and develop their own analytical use cases—from predicting stock outages to reducing carbon emissions in logistics. With Databricks’ lakehouse capabilities, they scaled to support hundreds of use cases without central bottlenecks, fostering a vibrant community of innovators across the enterprise.

The synergy between Confluent and Databricks at Michelin is clear:

  • Confluent moves operational data in real time, ensuring fresh, trusted information flows across systems (including Databricks).
  • Databricks transforms data into actionable insights, using powerful AI, machine learning, and analytics capabilities.

Confluent + Databricks @ Michelin = Cloud-Native Data-Driven Enterprise

Together, Confluent and Databricks allow Michelin to shift from batch-driven, siloed legacy systems to a cloud-native, real-time, data-driven enterprise—paving the road toward higher agility, efficiency, and customer satisfaction.

As Yves Caseau, Group Chief Digital & Information Officer at Michelin, summarized: “Confluent plays an integral role in accelerating our journey to becoming a data-first and digital business.”

And as Joris Nurit, Head of Data Transformation, added: “Databricks enables our business users to better serve themselves and empowers IT teams to be autonomous.”

The Michelin success story perfectly illustrates how Confluent and Databricks, when used together, bridge operational and analytical workloads to unlock the full value of real-time, AI-powered enterprise architectures.

Confluent and Databricks: Better Together!

Confluent and Databricks are both leaders in different – but connected – layers of the modern data stack.

If you want real-time, event-driven data pipelines, Confluent is the right platform. If you want powerful analytics, AI, and ML, Databricks is a great fit.

Together, they allow enterprises to bridge operational and analytical workloads—and to power AI systems with live, trusted data.

In the next post, I will explore how Confluent’s Data Streaming Platform compares to the Databricks Data Intelligence Platform for data integration and processing.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

The post The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company) appeared first on Kai Waehner.

]]>
Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink https://www.kai-waehner.de/blog/2025/04/28/fraud-detection-in-mobility-services-ride-hailing-food-delivery-with-data-streaming-using-apache-kafka-and-flink/ Mon, 28 Apr 2025 06:29:25 +0000 https://www.kai-waehner.de/?p=7516 Mobility services like Uber, Grab, and FREE NOW (Lyft) rely on real-time data to power seamless trips, deliveries, and payments. But this real-time nature also opens the door to sophisticated fraud schemes—ranging from GPS spoofing to payment abuse and fake accounts. Traditional fraud detection methods fall short in speed and adaptability. By using Apache Kafka and Apache Flink, leading mobility platforms now detect and block fraud as it happens, protecting their revenue, users, and trust. This blog explores how real-time data streaming is transforming fraud prevention across the mobility industry.

The post Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink appeared first on Kai Waehner.

]]>
Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. However, these approaches are no longer effective given the speed and complexity of modern mobile apps, their real-time user experiences, and increasingly sophisticated fraud schemes.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata
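To make the ingestion side concrete, here is a minimal, hedged sketch using the confluent-kafka Python client; the topic name, fields, and broker address are illustrative assumptions.

```python
# Produce a single GPS telematics event to Kafka.
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

gps_event = {
    "driver_id": "driver-42",
    "lat": 52.5200,
    "lon": 13.4050,
    "speed_kmh": 48.0,
    "ts": int(time.time() * 1000),
}

# Key by driver_id so all events of a driver land in the same partition
# and are processed in order by downstream consumers.
producer.produce(
    topic="gps-events",
    key=gps_event["driver_id"],
    value=json.dumps(gps_event),
)
producer.flush()
```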

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink: Real-Time Event Correlation and AI for Fraud Detection

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI:

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns.
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.
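As a hedged illustration of the GPS anomaly detection mentioned above, the following Python sketch compares consecutive positions per driver and flags physically implausible jumps. In production, this logic would typically run as a stateful Flink job; the topic and field names are assumptions.

```python
# Flag GPS jumps that imply an implausible travel speed between events.
import json
import math
from confluent_kafka import Consumer

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two coordinates in kilometers
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "gps-anomaly-detector",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["gps-events"])

last_position = {}  # driver_id -> (lat, lon, ts_millis)
MAX_PLAUSIBLE_KMH = 200.0

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    prev = last_position.get(event["driver_id"])
    if prev:
        hours = max((event["ts"] - prev[2]) / 3_600_000, 1e-9)
        speed = haversine_km(prev[0], prev[1], event["lat"], event["lon"]) / hours
        if speed > MAX_PLAUSIBLE_KMH:
            print(f"GPS anomaly for {event['driver_id']}: {speed:.0f} km/h implied")
    last_position[event["driver_id"]] = (event["lat"], event["lon"], event["ts"])
```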

With Kafka and Flink, fraud detection can shift from reactive to proactive to stop fraudulent transactions before they are completed.

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post, as well as a separate case study about “Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge“.

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI are used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real-time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.
Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions.
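The following Python sketch illustrates the fare check in steps 2 through 4 in simplified form. FREE NOW’s actual implementation uses Kafka Streams in Java; the tariff values and threshold below are invented for illustration only.

```python
# Hypothetical fare model and deviation check, not FREE NOW's real logic.
def expected_fare(distance_km: float, duration_min: float) -> float:
    base, per_km, per_min = 3.50, 1.20, 0.35  # made-up tariff parameters
    return base + per_km * distance_km + per_min * duration_min

def is_suspicious(actual_fare: float, distance_km: float, duration_min: float,
                  tolerance: float = 0.25) -> bool:
    predicted = expected_fare(distance_km, duration_min)
    # Flag trips whose fare deviates more than 25% from the prediction
    return abs(actual_fare - predicted) > tolerance * predicted
```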

By implementing real-time fraud detection with Kafka and Flink, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. Fraud accounts for 1.6 percent of total revenue loss in the region.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Presentation with Kafka and AI ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.
GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores.
  4. High-risk transactions are blocked instantly.
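A hedged Flink SQL sketch of the aggregation in step 3, assuming a TableEnvironment `t_env` with a registered `ride_events` stream (schema and threshold are invented): count distinct accounts per device fingerprint over a sliding window and surface devices shared by suspiciously many accounts.

```python
# Devices used by more than 3 accounts within a sliding 1-hour window
# are candidates for fake driver/passenger collusion.
risk = t_env.sql_query("""
    SELECT
        device_fingerprint,
        HOP_END(ts, INTERVAL '10' MINUTE, INTERVAL '1' HOUR) AS window_end,
        COUNT(DISTINCT account_id) AS account_count
    FROM ride_events
    GROUP BY device_fingerprint,
             HOP(ts, INTERVAL '10' MINUTE, INTERVAL '1' HOUR)
    HAVING COUNT(DISTINCT account_id) > 3
""")
```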

With GrabDefence built with data streaming, Grab reduced fraud rates to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Low latency is not required in this scenario, and humans are in the loop of the business process. Hence, Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.
Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions that work 24/7, even at extreme scale for millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

The post Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink appeared first on Kai Waehner.

]]>
How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time https://www.kai-waehner.de/blog/2025/04/14/how-apache-kafka-and-flink-power-event-driven-agentic-ai-in-real-time/ Mon, 14 Apr 2025 09:09:10 +0000 https://www.kai-waehner.de/?p=7265 Agentic AI marks a major evolution in artificial intelligence—shifting from passive analytics to autonomous, goal-driven systems capable of planning and executing complex tasks in real time. To function effectively, these intelligent agents require immediate access to consistent, trustworthy data. Traditional batch processing architectures fall short of this need, introducing delays, data staleness, and rigid workflows. This blog post explores why event-driven architecture (EDA)—powered by Apache Kafka and Apache Flink—is essential for building scalable, reliable, and adaptive AI systems. It introduces key concepts such as Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol, which are redefining interoperability and context management in multi-agent environments. Real-world use cases from finance, healthcare, manufacturing, and more illustrate how Kafka and Flink provide the real-time backbone needed for production-grade Agentic AI. The post also highlights why popular frameworks like LangChain and LlamaIndex must be complemented by robust streaming infrastructure to support stateful, event-driven AI at scale.

The post How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time appeared first on Kai Waehner.

]]>
Artificial Intelligence is evolving beyond passive analytics and reactive automation. Agentic AI represents a new wave of autonomous, goal-driven AI systems that can think, plan, and execute complex workflows without human intervention. However, for these AI agents to be effective, they must operate on real-time, consistent, and trustworthy data—a challenge that traditional batch processing architectures simply cannot meet. This is where Data Streaming with Apache Kafka and Apache Flink, coupled with an event-driven architecture (EDA), form the backbone of Agentic AI. By enabling real-time and continuous decision-making, EDA ensures that AI systems can act instantly and reliably in dynamic, high-speed environments. Emerging standards like the Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol are now complementing this foundation, providing structured, interoperable layers for managing context and coordination across intelligent agents—making AI not just event-driven, but also context-aware and collaborative.

Event-Driven Agentic AI with Data Streaming using Apache Kafka and Flink

In this post, I will explore:

  • How Agentic AI works and why it needs real-time data
  • Why event-driven architectures are the best choice for AI automation
  • Key use cases across industries
  • How Kafka and Flink provide the necessary data consistency and real-time intelligence for AI-driven decision-making
  • The role of MCP, A2A, and frameworks like LangChain and LlamaIndex in enabling scalable, context-aware, and collaborative AI systems

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

What is Agentic AI?

Agentic AI refers to AI systems that exhibit autonomous, goal-driven decision-making and execution. Unlike traditional automation tools that follow rigid workflows, Agentic AI can:

  • Understand and interpret natural language instructions
  • Set objectives, create strategies, and prioritize actions
  • Adapt to changing conditions and make real-time decisions
  • Execute multi-step tasks with minimal human supervision
  • Integrate with multiple operational and analytical systems and data sources to complete workflows

Here is an example AI Agent dependency graph from Sean Falconer’s article “Event-Driven AI: Building a Research Assistant with Kafka and Flink“:

Example AI Agent Dependency Graph
Source: Sean Falconer

Instead of merely analyzing data, Agentic AI acts on data, making it invaluable for operational and transactional use cases—far beyond traditional analytics.

However, without real-time, high-integrity data, these systems cannot function effectively. If AI is working with stale, incomplete, or inconsistent information, its decisions become unreliable and even counterproductive. This is where Kafka, Flink, and event-driven architectures become indispensable.

Why Batch Processing Fails for Agentic AI

Traditional AI and analytics systems have relied heavily on batch processing, where data is collected, stored, and processed in predefined intervals. This approach may work for generating historical reports or training machine learning models offline, but it completely breaks down when applied to operational and transactional AI use cases—which are at the core of Agentic AI.

Why Batch Processing Fails for Agentic AI

I recently explored the Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming). And here’s why batch processing is fundamentally incompatible with Agentic AI and the real-world challenges it creates:

1. Delayed Decision-Making Slows AI Reactions

Agentic AI systems are designed to autonomously respond to real-time changes in the environment, whether it’s optimizing a telecommunications network, detecting fraud in banking, or dynamically adjusting supply chains.

In a batch-driven system, data is processed hours or even days later, making AI responses obsolete before they even reach the decision-making phase. For example:

  • Fraud detection: If a bank processes transactions in nightly batches, fraudulent activities may go unnoticed for hours, leading to financial losses.
  • E-commerce recommendations: If a retailer updates product recommendations only once per day, it fails to capture real-time shifts in customer behavior.
  • Network optimization: If a telecom company analyzes network traffic in batch mode, it cannot prevent congestion or outages before they affect users.

Agentic AI requires instantaneous decision-making based on streaming data, not delayed insights from batch reports.

2. Data Staleness Creates Inaccurate AI Decisions

AI agents must act on fresh, real-world data, but batch processing inherently means working with outdated information. If an AI agent is making decisions based on yesterday’s or last hour’s data, those decisions are no longer reliable.

Consider a self-healing IT infrastructure that uses AI to detect and mitigate outages. If logs and system metrics are processed in batch mode, the AI agent will be acting on old incident reports, missing live system failures that need immediate attention.

In contrast, an event-driven system powered by Kafka and Flink ensures that AI agents receive live system logs as they occur, allowing for proactive self-healing before customers are impacted.

3. High Latency Kills Operational AI

In industries like finance, healthcare, and manufacturing, even a few seconds of delay can lead to severe consequences. Batch processing introduces significant latency, making real-time automation impossible.

For example:

  • Healthcare monitoring: A real-time AI system should detect abnormal heart rates from a patient’s wearable device and alert doctors immediately. If health data is only processed in hourly batches, a critical deterioration could be missed, leading to life-threatening situations.
  • Automated trading in finance: AI-driven trading systems must respond to market fluctuations within milliseconds. Batch-based analysis would mean losing high-value trading opportunities to faster competitors.

Agentic AI must operate on a live data stream, where every event is processed instantly, allowing decisions to be made in real-time, not retrospectively.

4. Rigid Workflows Increase Complexity and Costs

Batch processing forces businesses to predefine rigid workflows that do not adapt well to changing conditions. In a batch-driven world:

  • Data must be manually scheduled for ingestion.
  • Systems must wait for the entire dataset to be processed before making decisions.
  • Business logic is hard-coded, requiring expensive engineering effort to update workflows.

Agentic AI, on the other hand, is designed for continuous, adaptive decision-making. By leveraging an event-driven architecture, AI agents listen to streams of real-time data, dynamically adjusting workflows on the fly instead of relying on predefined batch jobs.

This flexibility is especially critical in industries with rapidly changing conditions, such as supply chain logistics, cybersecurity, and IoT-based smart cities.

5. Batch Processing Cannot Support Continuous Learning

A key advantage of Agentic AI is its ability to learn from past experiences and self-improve over time. However, this is only possible if AI models are continuously updated with real-time feedback loops.

Batch-driven architectures limit AI’s ability to learn because:

  • Models are retrained infrequently, leading to outdated insights.
  • Feedback loops are slow, preventing AI from adjusting strategies in real time.
  • Drift in data patterns is not immediately detected, causing AI performance degradation.

For instance, in customer service chatbots, an AI-powered agent should adapt to customer sentiment in real time. If a chatbot is trained on stale customer interactions from last month, it won’t understand emerging trends or newly common issues.

By contrast, a real-time data streaming architecture ensures that AI agents continuously receive live customer interactions, retrain in real time, and evolve dynamically.

Agentic AI Requires an Event-Driven Architecture

Agentic AI must act in real time and integrate operational and analytical information. Whether it’s an AI-driven fraud detection system, an autonomous network optimization agent, or a customer service chatbot, acting on outdated information is not an option.

The Event-Driven Approach

An Event-Driven Architecture (EDA) enables continuous processing of real-time data streams, ensuring that AI agents always have the latest information available. By decoupling applications and processing events asynchronously, EDA allows AI to respond dynamically to changes in the environment without being constrained by rigid workflows.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

AI can also be seamlessly integrated into existing business processes leveraging an EDA, bridging modern and legacy technologies without requiring a complete system overhaul. Not every data source may be real-time, but EDA ensures data consistency across all consumers—if an application processes data, it sees exactly what every other application sees. This guarantees synchronized decision-making, even in hybrid environments combining historical data with real-time event streams.

Why Apache Kafka is Essential for Agentic AI

For AI to be truly autonomous and effective, it must operate in real time, adapt to changing conditions, and ensure consistency across all applications. An Event-Driven Architecture (EDA) built with Apache Kafka provides the foundation for this by enabling:

  • Immediate Responsiveness → AI agents receive and act on events as they occur.
  • High Scalability → Components are decoupled and can scale independently.
  • Fault Tolerance → AI processes continue running even if some services fail.
  • Improved Data Consistency → Ensures AI agents are working with accurate, real-time data.

To build truly autonomous AI systems, organizations need a real-time data infrastructure that can process, analyze, and act on events as they happen.

Building Event-Driven Multi-Agents with Data Streaming using Apache Kafka and Flink
Source: Sean Falconer

Apache Kafka: The Real-Time Data Streaming Backbone

Apache Kafka provides a scalable, event-driven messaging infrastructure that ensures AI agents receive a constant, real-time stream of events. By acting as a central nervous system, Kafka enables:

  • Decoupled AI components that communicate through event streams.
  • Efficient data ingestion from multiple sources (IoT devices, applications, databases).
  • Guaranteed event delivery with fault tolerance and durability.
  • High-throughput processing to support real-time AI workloads.

Apache Flink: Stateful Stream Processing for AI-Driven Workflows

Apache Flink complements Kafka by providing stateful stream processing for AI-driven workflows. With Flink, AI agents can:

  • Analyze real-time data streams for anomaly detection, predictions, and decision-making.
  • Perform complex event processing to detect patterns and trigger automated responses.
  • Continuously learn and adapt based on evolving real-time data.
  • Orchestrate multi-agent workflows dynamically.
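To make this concrete, here is a minimal, hedged sketch of an event-driven agent loop with the confluent-kafka Python client: consume observation events, let a placeholder model decide, and publish actions back to Kafka. The topic names and decision logic are illustrative, not a prescribed design.

```python
# Event-driven agent loop: observe -> decide -> act, all via Kafka events.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "agent-1",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["observations"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

def decide(event: dict) -> dict | None:
    # Placeholder for model inference (e.g., a call to an LLM or a
    # deployed ML model). Returns an action event or None.
    if event.get("severity", 0) > 0.8:
        return {"action": "escalate", "source_event": event.get("id")}
    return None

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    action = decide(json.loads(msg.value()))
    if action:
        # Publishing the action as an event keeps agents decoupled:
        # any downstream system can react without direct integration.
        producer.produce("agent-actions", value=json.dumps(action))
        producer.poll(0)  # serve delivery callbacks
```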

Across industries, Agentic AI is redefining how businesses and governments operate. By leveraging event-driven architectures and real-time data streaming, organizations can unlock the full potential of AI-driven automation, improving efficiency, reducing costs, and delivering better experiences.

Here are key use cases across different industries:

Financial Services: Real-Time Fraud Detection and Risk Management

Traditional fraud detection systems rely on batch processing, leading to delayed responses and financial losses.

Agentic AI enables real-time transaction monitoring, detecting anomalies as they occur and blocking fraudulent activities instantly.

AI agents continuously learn from evolving fraud patterns, reducing false positives and improving security. In risk management, AI analyzes market trends, adjusts investment strategies, and automates compliance processes to ensure financial institutions stay ahead of threats and regulatory requirements.

Telecommunications: Autonomous Network Optimization

Telecom networks require constant tuning to maintain service quality, but traditional network management is reactive and expensive.

Agentic AI can proactively monitor network traffic, predict congestion, and automatically reconfigure network resources in real time. AI-powered agents optimize bandwidth allocation, detect outages before they impact customers, and enable self-healing networks, reducing operational costs and improving service reliability.

Retail: AI-Powered Personalization and Dynamic Pricing

Retailers struggle with static recommendation engines that fail to capture real-time customer intent.

Agentic AI analyzes customer interactions, adjusts recommendations dynamically, and personalizes promotions based on live purchasing behavior. AI-driven pricing strategies adapt to supply chain fluctuations, competitor pricing, and demand changes in real time, maximizing revenue while maintaining customer satisfaction.

AI agents also enhance logistics by optimizing inventory management and reducing stock shortages.

Healthcare: Real-Time Patient Monitoring and Predictive Care

Hospitals and healthcare providers require real-time insights to deliver proactive care, but batch processing delays critical decisions.

Agentic AI continuously streams patient vitals from medical devices to detect early signs of deterioration and trigger instant alerts to medical staff. AI-driven predictive analytics optimize hospital resource allocation, improve diagnosis accuracy, and enable remote patient monitoring, reducing emergency incidents and improving patient outcomes.

Gaming: Dynamic Content Generation and Adaptive AI Opponents

Modern games need to provide immersive, evolving experiences, but static game mechanics limit engagement.

Agentic AI enables real-time adaptation of gameplay, generating dynamic environments and personalizing challenges based on a player’s behavior. AI-driven opponents can learn and adapt to individual playstyles, keeping games engaging over time. AI agents also manage server performance, detect cheating, and optimize in-game economies for a better gaming experience.

Manufacturing & Automotive: Smart Factories and Autonomous Systems

Manufacturing relies on precision and efficiency, yet traditional production lines struggle with downtime and defects.

Agentic AI monitors production processes in real time to detect quality issues early and adjust machine parameters autonomously. This directly improves Overall Equipment Effectiveness (OEE) by reducing downtime, minimizing defects, and optimizing machine performance, ensuring higher productivity and operational efficiency.

In automotive, AI-driven agents analyze real-time sensor data from self-driving cars to make instant navigation decisions, predict maintenance needs, and optimize fleet operations for logistics companies.

Public Sector: AI-Powered Smart Cities and Citizen Services

Governments face challenges in managing infrastructure, public safety, and citizen services efficiently.

Agentic AI can optimize traffic flow by analyzing real-time data from sensors and adjusting signals dynamically. AI-powered public safety systems detect threats from surveillance data and dispatch emergency services instantly. AI-driven chatbots handle citizen inquiries, automate document processing, and improve response times for government services.

The Business Value of Real-Time AI using Autonomous Agents

By leveraging Kafka and Flink in an event-driven AI architecture, organizations can achieve:

  • Better Decision-Making → AI operates on fresh, accurate data.
  • Faster Time-to-Action → AI agents respond to events immediately.
  • Reduced Costs → Less reliance on expensive batch processing and manual intervention by humans.
  • Greater Scalability → AI systems can handle massive workloads in real time.
  • Vendor Independence → Kafka and Flink support open standards and hybrid/multi-cloud deployments, preventing vendor lock-in.

Why LangChain, LlamaIndex, and Similar Frameworks Are Not Enough for Agentic AI in Production

Frameworks like LangChain, LlamaIndex, and others have gained popularity for making it easy to prototype AI agents by chaining prompts, tools, and external APIs. They provide useful abstractions for reasoning steps, retrieval-augmented generation (RAG), and basic tool use—ideal for experimentation and lightweight applications.

However, when building agentic AI for operational, business-critical environments, these frameworks fall short on several fronts:

  • Many frameworks like LangChain are inherently synchronous and follow a request-response model, which limits their ability to handle real-time, event-driven inputs at scale. In contrast, LlamaIndex takes an event-driven approach, using a message broker—including support for Apache Kafka—for inter-agent communication.
  • Debugging, observability, and reproducibility are weak—there’s often no persistent, structured record of agent decisions or tool interactions.
  • State is ephemeral and in-memory, making long-running tasks, retries, or rollback logic difficult to implement reliably.
  • Most Agentic AI frameworks lack support for distributed, fault-tolerant execution and scalable orchestration, which are essential for production systems.

That said, frameworks like LangChain and LlamaIndex can still play a valuable, complementary role when integrated into an event-driven architecture. For example, an agent might use LangChain for planning or decision logic within a single task, while Apache Kafka and Apache Flink handle the real-time flow of events, coordination between agents, persistence, and system-level guarantees.

LangChain and similar toolkits help define how an agent thinks. But to run that thinking at scale, in real time, and with full traceability, you need a robust data streaming foundation. That’s where Kafka and Flink come in.

Model Context Protocol (MCP) and Agent-to-Agent (A2A) for Scalable, Composable Agentic AI Architectures

Model Context Protocol (MCP) is one of the hottest topics in AI right now. Coined by Anthropic, with early support emerging from OpenAI, Google, and other leading AI infrastructure providers, MCP is rapidly becoming a foundational layer for managing context in agentic systems. MCP enables systems to define, manage, and exchange structured context windows—making AI interactions consistent, portable, and state-aware across tools, sessions, and environments.

Google’s recently announced Agent-to-Agent (A2A) protocol adds further momentum to this movement, setting the groundwork for standardized interaction across autonomous agents. These advancements signal a new era of AI interoperability and composability.

Together with Kafka and Flink, MCP and protocols like A2A help bridge the gap between stateless LLM calls and stateful, event-driven agent architectures. Naturally, event-driven architecture is the perfect foundation for all this. The key now is to build enough product functionality and keep pushing the boundaries of innovation.
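
While the specifications are still evolving, the core idea of structured, portable context maps naturally onto Kafka topics. The following Python sketch is purely illustrative: the envelope fields are my own invention, not the MCP or A2A wire format, but they show how agents could publish versioned, replayable context records.

```python
# Illustrative only: a structured, versioned context envelope exchanged
# between agents over Kafka. Field names are hypothetical, NOT the MCP schema.
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class ContextEnvelope:
    agent_id: str        # which agent produced this context
    task_id: str         # correlates all steps of one goal
    context: dict        # structured state the next agent needs
    schema_version: str = "1.0"
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    produced_at: float = field(default_factory=time.time)

envelope = ContextEnvelope(
    agent_id="route-planner",
    task_id="order-4711",
    context={"eta_minutes": 42, "reroute": True},
)
# Serialized onto a Kafka topic, every envelope becomes a durable record
# of what each agent knew and decided at each step of the task.
payload = json.dumps(asdict(envelope)).encode("utf-8")
```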

A dedicated blog post is coming soon to explore how MCP and A2A connect data streaming and request-response APIs in modern AI systems.

Agentic AI is poised to revolutionize industries by enabling fully autonomous, goal-driven AI systems that perceive, decide, and act continuously. But to function reliably in dynamic, production-grade environments, these agents require real-time, event-driven architectures—not outdated, batch-oriented pipelines.

Apache Kafka and Apache Flink form the foundation of this shift. Kafka ensures agents receive reliable, ordered event streams, while Flink provides stateful, low-latency stream processing for real-time reactions and long-lived context management. This architecture enables AI agents to process structured events as they happen, react to changes in the environment, and coordinate with other services or agents through durable, replayable data flows.

If your organization is serious about AI, the path forward is clear:

Move from batch to real-time, from passive analytics to autonomous action, and from isolated prompts to event-driven, context-aware agents—enabled by Kafka and Flink.

As a next step, learn more about "Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink".

Let’s connect on LinkedIn and discuss how to implement these ideas in your organization. Stay informed about new developments by subscribing to my newsletter. And make sure to download my free book about data streaming use cases.

The post How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time appeared first on Kai Waehner.

]]>
CIO Summit: The State of AI and Why Data Streaming is Key for Success https://www.kai-waehner.de/blog/2025/03/13/cio-summit-the-state-of-ai-and-why-data-streaming-is-key-for-success/ Thu, 13 Mar 2025 07:31:33 +0000 https://www.kai-waehner.de/?p=7582 The CIO Summit in Amsterdam provided a valuable perspective on the state of AI adoption across industries. While enthusiasm for AI remains high, organizations are grappling with the challenge of turning potential into tangible business outcomes. Key discussions centered on distinguishing hype from real value, the importance of high-quality and real-time data, and the role of automation in preparing businesses for AI integration. A recurring theme was that AI is not a standalone solution—it must be supported by a strong data foundation, clear ROI objectives, and a strategic approach. As AI continues to evolve toward more autonomous, agentic systems, data streaming will play a critical role in ensuring AI models remain relevant, context-aware, and actionable in real time.

The post CIO Summit: The State of AI and Why Data Streaming is Key for Success appeared first on Kai Waehner.

]]>
This week, I had the privilege of engaging in insightful conversations at the CIO Summit organized by GDS Group in Amsterdam, Netherlands. The event brought together technology leaders from across Europe and industries such as financial services, manufacturing, energy, gaming, telco, and more. The focus? AI – but with a much-needed reality check. While the potential of AI is undeniable, the hype often outpaces real-world value. Discussions at the summit revolved around how enterprises can move beyond experimentation and truly integrate AI to drive business success.

Learnings from the CIO Summit in Amsterdam by GDS Group

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, industry success stories, and business value.

Key Learnings on the State of AI

The CIO Summit in Amsterdam provided a reality check on AI adoption across industries. While excitement around AI is high, success depends on moving beyond the hype and focusing on real business value. Conversations with technology leaders revealed critical insights about AI’s maturity, challenges, and the key factors driving meaningful impact. Here are the most important takeaways.

AI is Still in Its Early Stages – Beware of the Buzz vs. Value

The AI landscape is evolving rapidly, but many organizations are still in the exploratory phase. Executives recognize the enormous promise of AI but also see challenges in implementation, scaling, and achieving meaningful ROI.

The key takeaway? AI is not a silver bullet. Companies that treat it as just another trendy technology risk wasting resources on hype-driven projects that fail to deliver tangible outcomes.

Generative AI vs. Predictive AI – Understanding the Differences

There was a lot of discussion about Generative AI (GenAI) vs. Predictive AI, two dominant categories that serve very different purposes:

  • Predictive AI analyzes historical and real-time data to forecast trends, detect anomalies, and automate decision-making (e.g., fraud detection, supply chain optimization, predictive maintenance).
  • Generative AI creates new content based on trained data (e.g., text, images, or code), enabling applications like automated customer service, software development, and marketing content generation.

While GenAI has captured headlines, Predictive AI remains the backbone of AI-driven automation in enterprises. CIOs must carefully evaluate where each approach adds real business value.

Good Data Quality is Non-Negotiable

A critical takeaway: AI is only as good as the data that fuels it. Poor data quality leads to inaccurate AI models, bad predictions, and failed implementations.

To build trustworthy and effective AI solutions, organizations need:

✅ Accurate, complete, and well-governed data

✅ Real-time and historical data integration

✅ Continuous data validation and monitoring

Context Matters – AI Needs Real-Time Decision-Making

Many AI use cases rely on real-time decision-making. A machine learning model trained on historical data is useful, but without real-time context, it quickly becomes outdated.

For example, fraud detection systems need to analyze real-time transactions while comparing them to historical behavioral patterns. Similarly, AI-powered supply chain optimization depends on up-to-the-minute logistics data rather than just past trends.

The conclusion? Real-time data streaming is essential to unlocking AI’s full potential.
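
To illustrate the fraud-detection pattern, here is a toy Python sketch of scoring live transactions against each customer's historical behavior. All names and thresholds are made up, and in a real deployment the per-customer state would live in Flink's fault-tolerant state backend rather than a local dictionary.

```python
# Toy sketch: score each live transaction against the customer's history.
# In production, this per-key state lives in Flink, not a Python dict.
from collections import defaultdict

history = defaultdict(lambda: {"count": 0, "mean": 0.0})

def is_suspicious(customer_id: str, amount: float, threshold: float = 5.0) -> bool:
    stats = history[customer_id]
    # Flag transactions far above the customer's running average,
    # once we have seen enough history to trust the baseline.
    suspicious = stats["count"] >= 10 and amount > threshold * stats["mean"]
    # Update the running mean incrementally with each new event.
    stats["count"] += 1
    stats["mean"] += (amount - stats["mean"]) / stats["count"]
    return suspicious

# Example: a stream of (customer, amount) events, e.g. consumed from Kafka.
for customer, amount in [("alice", 20.0)] * 12 + [("alice", 900.0)]:
    if is_suspicious(customer, amount):
        print(f"ALERT: {customer} spent {amount}")
```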

Automate First, Then Apply AI

One common theme among successful AI adopters: Optimize business processes before adding AI.

Organizations that try to retrofit AI onto inefficient, manual processes often struggle with adoption and ROI. Instead, the best approach is:

1⃣ Automate and optimize workflows using real-time data

2⃣ Apply AI to enhance automation and improve decision-making

By taking this approach, companies ensure that AI is applied where it actually makes a difference.

ROI Matters – AI Must Drive Business Value

CIOs are under pressure to deliver business-driven, NOT tech-driven AI projects. AI initiatives that lack a clear ROI roadmap often stall after pilot phases.

Two early success stories for Generative AI stand out:

  • Customer support – AI chatbots and virtual assistants enhance response times and improve customer experience.
  • Software engineering – AI-powered code generation boosts developer productivity and reduces time to market.

The lesson? Start with AI applications that deliver clear, measurable business impact before expanding into more experimental areas.

Data Streaming and AI – The Perfect Match

At the heart of AI’s success is data streaming. Why? Because modern AI requires a continuous flow of fresh, real-time data to make accurate predictions and generate meaningful insights.

Data streaming not only powers AI with real-time insights but also ensures that AI-driven decisions directly translate into measurable business value:

Business Value of Data Streaming with Apache Kafka and Flink in the free Confluent eBook

Here’s how data streaming powers both Predictive and Generative AI:

Predictive AI + Data Streaming

Predictive AI thrives on timely, high-quality data. Real-time data streaming enables AI models to process and react to events as they happen. Examples include:

✔ Fraud detection: AI analyzes real-time transactions to detect suspicious activity before fraud occurs.

✔ Predictive maintenance: Streaming IoT sensor data allows AI to predict equipment failures before they happen.

✔ Supply chain optimization: AI dynamically adjusts logistics routes based on real-time disruptions.

Here is an example from Capital One about real-time fraud detection and prevention, saving an average of $150 in prevented fraud per customer per year:

Predictive AI for Fraud Detection and Prevention at Capital One Bank with Data Streaming
Source: Confluent

Generative AI + Data Streaming

Generative AI also benefits from real-time data. Instead of relying on static datasets, streaming data enhances GenAI applications by incorporating the latest information:

✔ AI-powered customer support: Chatbots analyze live customer interactions to generate more relevant responses.

✔ AI-driven marketing content: GenAI adapts promotional messaging in real-time based on customer engagement signals.

✔ Software development acceleration: AI assistants provide real-time code suggestions as developers write code.

In short, without real-time data, AI is limited to outdated insights.

Here is an example of GenAI with data streaming in the travel industry from Expedia, where 60% of travelers self-service in chat, saving over 40% of variable agent costs:

Generative AI at Expedia in Travel for Customer Service with Chatbots, GenAI and Data Streaming
Source: Confluent

The Future of AI: Agentic AI and the Role of Data Streaming

As AI evolves, we are moving toward Agentic AI – systems that autonomously take actions, learn from feedback, and adapt in real time.

For example:

✅ AI-driven cybersecurity systems that detect and respond to threats instantly

✅ Autonomous supply chains that dynamically adjust based on demand shifts

✅ Intelligent business operations where AI continuously optimizes workflows

But Agentic AI can only work if it has access to real-time operational AND analytical data. That’s why data streaming is becoming a critical foundation for the next wave of AI innovation.

The Path to AI Success

The CIO Summit reinforced one key message: AI is here to stay, but its success depends on strategy, data quality, and business value – not just hype.

Organizations that:

✅ Focus on AI applications with clear business ROI

✅ Automate before applying AI

✅ Prioritize real-time data streaming

… will be best positioned to drive AI success at scale.

As AI moves towards autonomous decision-making (Agentic AI), data streaming will become even more critical. The ability to process and act on real-time data will separate AI leaders from laggards.

Now the real question: Where is your AI strategy headed? Let’s discuss!

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book focusing on data streaming use cases, industry stories and business value.

The post CIO Summit: The State of AI and Why Data Streaming is Key for Success appeared first on Kai Waehner.

]]>
How Data Streaming and AI Help Telcos to Innovate: Top 5 Trends from MWC 2025 https://www.kai-waehner.de/blog/2025/03/07/how-data-streaming-and-ai-help-telcos-to-innovate-top-5-trends-from-mwc-2025/ Fri, 07 Mar 2025 06:44:11 +0000 https://www.kai-waehner.de/?p=7545 As the telecom and tech industries rapidly evolve, real-time data streaming is emerging as the backbone of digital transformation. For MWC 2025, McKinsey outlined five key trends defining the future: IT excellence, sustainability, 6G, generative AI, and AI-driven software development. This blog explores how data streaming powers each of these trends, enabling real-time observability, AI-driven automation, energy efficiency, ultra-low latency networks, and faster software innovation. From Dish Wireless’ cloud-native 5G network to Verizon’s edge AI deployments, leading companies are leveraging event-driven architectures to gain a competitive advantage. Whether you’re tackling network automation, sustainability challenges, or AI monetization, data streaming is the strategic enabler for 2025 and beyond. Read on to explore the latest use cases, industry insights, and how to future-proof your telecom strategy.

The post How Data Streaming and AI Help Telcos to Innovate: Top 5 Trends from MWC 2025 appeared first on Kai Waehner.

]]>
The telecommunications and technology industries are at a pivotal moment. As innovation accelerates, businesses must leverage cutting-edge technologies to stay ahead. For MWC 2025, McKinsey highlighted five crucial themes shaping the future: IT excellence in telecom, sustainability challenges, the evolution of 6G, the rise of generative AI, and AI-driven software development.

MWC (Mobile World Congress) 2025 serves as the global stage where industry leaders, telecom operators, and technology pioneers converge to discuss the next wave of connectivity and digital transformation. As organizations gear up for a data-driven future, real-time data streaming emerges as the critical enabler of efficiency, agility, and value creation.

This blog explores each of McKinsey’s key themes from MWC 2025 and how data streaming helps businesses innovate and gain a competitive advantage in the hyper-connected world ahead.

How Apache Kafka, Flink and AI Help Telecom Providers - Top 5 Trends from MWC 2025

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

1. IT Excellence: Driving Telecom Innovation and Cost Efficiency

Telecom operators are under immense pressure to monetize massive infrastructure investments while maintaining cost efficiency. McKinsey’s benchmarking study shows that leading telecom tech players spend less on IT while achieving superior cost efficiency and innovation. Successful operators integrate business and IT transformations holistically, optimizing cloud strategies, IT architectures, and AI-driven processes.

How Data Streaming Powers IT Excellence

  • Real-Time IT Monitoring: Streaming data pipelines provide continuous observability into IT performance, reducing downtime and optimizing infrastructure costs.
  • Automated Network Operations: Event-driven architectures allow operators to dynamically allocate resources, minimizing network congestion and improving service quality.
  • Cloud-Native AI Models: By continuously feeding AI models with live data, telecom leaders ensure optimal network performance and predictive maintenance.

🔹 Business Impact: Faster time-to-market, lower IT costs, and improved network reliability.

A great example of this transformation is Dish Wireless, which built a fully cloud-native, software-driven 5G network powered by Apache Kafka. By leveraging real-time data streaming, Dish ensures low-latency, scalable, and event-driven operations, allowing it to optimize network performance, automate infrastructure management, and provide next-generation connectivity for enterprise applications.

Dish’s data-first approach demonstrates how streaming technologies are redefining telecom infrastructure and unlocking new business models.

📌 Read more about how Apache Kafka powers Dish Wireless’ 5G infrastructure or watch the following webinar with Dish:

Confluent and Dish about Cloud-Native 5G Infrastructure and Apache Kafka


2. Tackling Telecom Emissions: A Sustainable Future

The telecom industry faces increasing regulatory pressure and consumer expectations to decarbonize operations. While many companies have reduced Scope 1 (direct emissions) and Scope 2 (energy consumption) emissions, the real challenge lies in Scope 3 emissions from supply chains. McKinsey’s research suggests that 60% of an integrated operator’s emissions can be reduced for less than $100 per ton of CO₂.

How Data Streaming Supports Sustainability Efforts

  • Energy Optimization in Real Time: Streaming analytics continuously monitor energy usage across network infrastructure, automatically adjusting power consumption (see the sketch after this list).
  • Carbon Footprint Tracking: Data pipelines aggregate real-time emissions data, enabling operators to meet sustainability goals efficiently.
  • Predictive Maintenance for Energy Efficiency: AI-driven insights help optimize network hardware lifespan, reducing waste and unnecessary energy consumption.

🔹 Business Impact: Reduced carbon footprint, cost savings on energy consumption, and regulatory compliance.
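
As a minimal illustration of the first point above, the following Python sketch aggregates energy readings into tumbling windows per site and flags budget violations. The window size, site name, and budget are hypothetical; Flink's window operators provide the same logic with fault tolerance at scale.

```python
# Hedged sketch: per-site tumbling-window energy monitoring. Window size,
# site names, and budget are made up for illustration.
from collections import defaultdict

WINDOW_SECONDS = 60
windows = defaultdict(list)  # (site, window_start) -> kW readings

def on_reading(site: str, kw: float, ts: float, budget_kw: float = 500.0):
    # Assign the reading to its tumbling window by truncating the timestamp.
    window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
    readings = windows[(site, window_start)]
    readings.append(kw)
    if sum(readings) / len(readings) > budget_kw:
        # In a real pipeline this alert would be an event on a Kafka topic
        # that triggers an automated power-management action.
        print(f"{site}: window {window_start} exceeds {budget_kw} kW budget")

on_reading("cell-tower-17", 620.0, ts=1_700_000_000.0)
```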

Data Streaming with Apache Kafka and Flink for ESG and Sustainability

Beyond telecom, data streaming is transforming sustainability efforts across industries. For example, in manufacturing and real estate, companies like Ampeers Energy and PAUL Tech AG use Apache Kafka and Flink to optimize energy distribution, reduce emissions, and improve ESG ratings.

These real-time data platforms analyze IoT sensor data, weather forecasts, and energy consumption patterns to automate decision-making and lower energy waste. Similarly, EverySens leverages streaming data to decarbonize freight transport, eliminating hundreds of thousands of unnecessary truck rides each year. These use cases demonstrate how data-driven sustainability strategies can be scaled across sectors to achieve meaningful environmental impact.

📌 Read more about how data streaming with Kafka and Flink power ESG transformations.

3. Shaping the Future of 6G: Beyond Connectivity

6G is expected to revolutionize industries by enabling ultra-low latency, ubiquitous connectivity, and AI-driven network optimization. However, the transition from 5G to 6G requires overcoming legacy infrastructure challenges and developing multi-capability platforms that go beyond simple connectivity.

How Data Streaming Powers the 6G Revolution

  • Network Sensing and Intelligent Routing: Streaming architectures process real-time network telemetry, enabling adaptive, self-optimizing networks.
  • AI-Enhanced Edge Computing: Real-time analytics ensure minimal latency for mission-critical applications such as autonomous vehicles and smart cities.
  • Cross-Sector Data Monetization: Operators can leverage streaming data to offer network-as-a-service (NaaS) solutions, opening new revenue streams.

🔹 Business Impact: New monetization opportunities, improved network efficiency, and enhanced customer experience.

Use Cases for 5G and Data Streaming with Apache Kafka
Source: Dish Wireless

As the 6G era approaches, real-time data streaming is already proving its value in 5G deployments, unlocking low-latency, high-bandwidth use cases.

A great example is Verizon’s Mobile Edge Computing (MEC) initiative, which uses data streaming and AI-powered analytics to support real-time applications like autonomous drone monitoring, vehicle-to-everything (V2X) communication, and predictive maintenance in industrial settings. By processing data at the network edge, telcos minimize latency and optimize bandwidth—capabilities that will be even more critical in 6G.

With cloud-native, event-driven architectures, data streaming enables telcos to evolve from traditional connectivity providers to technology leaders. As 6G advances, expect faster network automation, more sophisticated AI integration, and deeper partnerships between telecom operators and enterprise customers.

📌 Read more about how data streaming is shaping the future of telco.

4. Generative AI: A Profitability Game-Changer for Telcos

McKinsey highlights generative AI’s potential to boost telco profitability by up to 10% in annual EBITDA through automation, hyper-personalization, and AI-driven customer engagement. Leading telcos are already leveraging AI to improve customer service, marketing, and network operations.

How Data Streaming Enhances Gen AI in Telecom

  • Real-Time Customer Insights: AI-powered recommendation engines deliver personalized offers and dynamic pricing in milliseconds.
  • Automated Call Center Operations: Real-time transcription and sentiment analysis improve chatbot accuracy and agent productivity.
  • Proactive Network Management: AI models trained on continuous streaming data predict and prevent network failures before they occur.

🔹 Business Impact: Higher customer satisfaction, reduced operational costs, and increased revenue per user.

As telecom providers integrate Generative AI (GenAI) into their business models, real-time data streaming is a foundational technology that enables efficient AI inference and model retraining. One compelling example is the GenAI Demo with Kafka, Flink, LangChain, and OpenAI, which illustrates how streaming architectures power AI-driven sales and customer interactions.

Stream Processing with Apache Flink SQL UDF and GenAI with OpenAI LLM

This demo showcases how real-time CRM data from Salesforce is enriched with web and LinkedIn data via streaming ETL using Apache Flink. Then, AI models process this context using LangChain and OpenAI, generating personalized, context-specific sales recommendations—a workflow that can be extended to telecom call centers and customer engagement platforms.

Expedia’s success story further highlights how GenAI combined with data streaming improves customer interactions. Facing a massive surge in support requests during COVID-19, Expedia automated responses with AI-driven chatbots, significantly reducing agent workloads. After integrating Apache Kafka with AI models, 60% of travelers began self-servicing their inquiries, resulting in over 40% cost savings in customer support operations.

Expedia GenAI in the Travel Industry with Data Streaming Kafka and Machine Learning AI
Source: Confluent

For telecom providers, similar AI-driven automation can optimize call centers, personalized customer offers, fraud detection, and even predictive maintenance for network infrastructure. Data streaming ensures that AI models continuously learn from fresh data, making GenAI solutions more accurate, responsive, and cost-effective.

5. AI-Driven Software Development: Faster, Smarter, Better

AI is fundamentally transforming software development, accelerating the product development lifecycle (PDLC) and improving product quality. AI-assisted coding, automated testing, and real-time feedback loops are enabling companies to deliver customer-centric solutions at unprecedented speed.

How Data Streaming Transforms AI-Driven Software Development

  • Continuous Feedback and Iteration: Streaming analytics provide instant feedback from user behavior, enabling faster iterations and bug fixes.
  • Automated Code Quality Checks: AI-driven continuous integration (CI/CD) pipelines validate new code in real-time, ensuring seamless software deployments.
  • Live Performance Monitoring: Streaming data enables real-time anomaly detection, ensuring optimal application performance.

🔹 Business Impact: Faster time-to-market, higher software reliability, and reduced development costs.

For telecom providers, AI-driven software development is key to maintaining a reliable, scalable, and secure network infrastructure while rolling out new customer-facing services at speed. Data streaming accelerates software development by enabling real-time feedback loops, automated testing, and AI-powered observability—bringing the industry closer to a true “Shift Left” approach.

The Shift Left Architecture in software development means moving testing, security, and quality assurance earlier in the development lifecycle, reducing costly errors and vulnerabilities late in production. Data streaming enables this shift by continuously feeding AI-driven CI/CD pipelines with real-time insights, allowing developers to detect issues earlier, optimize network performance, and iterate faster on new services.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

A relevant AI-powered automation example comes from the GenAI for Development vs. Visual Coding article, which discusses how automation is shifting from traditional code-based development to AI-assisted software engineering. Instead of manual coding, AI-driven workflows help telcos streamline DevOps, automate CI/CD pipelines, and enhance software quality in real time.

For telecom providers, this transformation means proactive issue detection, faster rollouts of network upgrades, and automated AI-driven security monitoring—all powered by real-time data streaming and a Shift Left mindset.

Data Streaming as the Ultimate Competitive Advantage for Telcos

Across all five of McKinsey’s key trends, real-time data streaming is the backbone of transformation. Whether optimizing IT efficiency, reducing emissions, unlocking 6G’s potential, enabling generative AI and Agentic AI, or accelerating software development, streaming technologies provide the agility and intelligence businesses need to win in 2025 and beyond.

The path forward isn’t just about adopting AI or cloud-native infrastructure—it’s about embracing real-time, event-driven architectures to drive innovation at scale.

As organizations take bold steps to lead the future, those who harness the power of data streaming will emerge as the industry’s true pioneers.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book about data streaming use cases.

The post How Data Streaming and AI Help Telcos to Innovate: Top 5 Trends from MWC 2025 appeared first on Kai Waehner.

]]>
Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink https://www.kai-waehner.de/blog/2025/02/23/online-model-training-and-model-drift-in-machine-learning-with-apache-kafka-and-flink/ Sun, 23 Feb 2025 05:08:20 +0000 https://www.kai-waehner.de/?p=4971 The rise of real-time AI and machine learning is reshaping the competitive landscape. Traditional batch-trained models struggle with model drift, leading to inaccurate predictions and missed opportunities. Platforms like Apache Kafka and Apache Flink enable continuous model training and real-time inference, ensuring up-to-date, high-accuracy predictions.

This blog explores TikTok’s groundbreaking AI architecture, its use of data streaming for real-time recommendations, and how businesses can leverage Kafka and Flink to modernize their ML pipelines. I also examine how data streaming complements platforms like Databricks, Snowflake, and Microsoft Fabric to create scalable, adaptive AI systems.

The post Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink appeared first on Kai Waehner.

]]>
The landscape of artificial intelligence (AI) and machine learning (ML) is transforming rapidly. Online model training and model drift management are becoming essential for businesses to maintain a competitive edge. Data streaming with Apache Kafka and Apache Flink plays a crucial role in this evolution, enabling real-time updates and seamless integration into modern data infrastructures. This blog explores the challenges of model drift, investigates TikTok’s groundbreaking architecture, and highlights the business value and complementary nature of data streaming with other platforms.

Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Understanding Model Drift: The Achilles’ Heel of Static Models

Real-time model inference with a data streaming platform using Apache Kafka and Flink is a powerful solution for delivering fast and accurate predictions, as detailed in my model inference blog post, but it’s not enough to sustain long-term model accuracy.

Machine learning models degrade in accuracy over time due to shifts in data or concepts—a phenomenon known as model drift.

Model Drift in AI Machine Learning Over Time without Real Time Data Streaming

This can take several forms:

  1. Concept Drift: Changing relationships between input and output variables, such as shifting user behavior.
  2. Data Drift: Variations in data distribution, e.g., demographic shifts.
  3. Upstream Data Changes: Pipeline modifications, e.g., new logging formats or unavailable sources.

Unchecked, model drift leads to poor predictions and missed opportunities. Addressing it requires continuous updates, which online machine learning enables through data streaming platforms like Kafka and Flink.
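
As a minimal sketch of what online training looks like in code, the following example updates a scikit-learn SGDClassifier incrementally per event instead of retraining in batches. The features, labels, and event loop are synthetic stand-ins for records consumed from a Kafka topic.

```python
# Minimal sketch of online model training: the model takes one incremental
# gradient step per labeled event, instead of periodic batch retrains.
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all classes must be declared for partial_fit

def on_event(features: np.ndarray, label: int):
    # Each event nudges the model, so it keeps tracking concept drift
    # without waiting for the next nightly batch job.
    model.partial_fit(features.reshape(1, -1), [label], classes=classes)

rng = np.random.default_rng(42)
for _ in range(1000):
    x = rng.normal(size=3)
    y = int(x.sum() > 0)  # toy label; real labels arrive on the stream
    on_event(x, y)

print(model.predict(rng.normal(size=(1, 3))))
```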

TikTok’s recommendation system, detailed in ByteDance’s whitepaper, leverages a cutting-edge, real-time machine learning architecture powered by data streaming technologies like Kafka and Flink. It delivers personalized content at scale by seamlessly integrating user behavior data, dynamic feature processing, and online model updates, driving exceptional user engagement and platform efficiency.

What is ByteDance and TikTok?

ByteDance, TikTok’s parent company, is a Chinese technology giant renowned for its innovative use of AI and real-time ML. TikTok, its most famous product, has redefined user engagement through hyper-personalized video recommendations. TikTok employs real-time online machine learning, ensuring recommendations are dynamic, accurate, and engaging.

Why TikTok Outshines Competitors

While other social video platforms also leverage advanced machine learning for recommendations, TikTok’s architecture distinguishes itself by prioritizing real-time adaptability and hyper-personalization, ensuring it can respond to user behavior faster and more effectively than its competitors.

  • User Engagement: TikTok’s recommendation engine adapts in real-time, delivering hyper-relevant content that increases user retention.
  • Scalability: Unlike many platforms relying on periodic retraining, TikTok continuously updates its models, handling massive data streams with ease.
  • Speed: Real-time processing reduces latency in adapting to user behavior, a stark contrast to Facebook or YouTube’s delayed batch processes.

TikTok’s real-time recommendation system is built on a robust streaming data architecture:

ByteDance TikTok Real Time AI ML Recommender System powered by Apache Kafka and Flink
Source: ByteDance

Data Ingestion:

  • User interactions like views, likes, and shares are streamed in real-time via Kafka.
  • Kafka ensures reliable collection and distribution of high-velocity event data.

Feature Engineering:

  • Flink processes raw data streams, performing real-time feature extraction and enrichment.
  • Techniques like point-in-time lookups prevent training-inference skew, ensuring the same features are used in both phases.

Online Model Training:

  • Lightweight models are continuously updated with fresh data.
  • This approach mitigates model drift, ensuring predictions stay relevant and accurate.

Real-Time Inference:

  • Updated models are deployed immediately to serve predictions.
  • TikTok’s architecture ensures latency is minimal, with recommendations delivered almost instantly.

This dynamic infrastructure has made TikTok a leader in real-time AI, setting a benchmark for others.

Apache Kafka and Flink are indispensable for organizations embracing online ML.

Data Streaming Ecosystem for AI Machine Learning with Apache Kafka and Flink

Data streaming addresses key challenges:

  • Training-Inference Data Skew: By streaming real-time features into models, Flink ensures consistency between training and inference data (see the point-in-time lookup sketch after this list).
  • Multi-Model Governance: Kafka and Flink enable data integration across small models for enrichment and large models for complex decision-making, ensuring governance and modularity.
  • Scalability and Efficiency: Data streaming pipelines handle massive volumes with low latency, enabling real-time decision-making.
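
For the skew point, here is a generic Python sketch of a point-in-time feature lookup, not any vendor's implementation: training joins each event with the feature value that was current at the event's timestamp, which is exactly what inference saw at that moment.

```python
# Generic sketch of a point-in-time feature lookup. Training must join each
# event with the feature value that was current AT the event's timestamp;
# joining against the latest value leaks the future and causes skew.
import bisect

# Feature history per key: sorted (timestamp, value) pairs, e.g. built
# from a compacted Kafka topic of feature updates.
feature_history = {
    "user-1": [(100, 0.2), (200, 0.5), (300, 0.9)],
}

def feature_as_of(key: str, ts: int) -> float:
    history = feature_history[key]
    # Find the latest update at or before ts -- never a future value.
    i = bisect.bisect_right(history, (ts, float("inf"))) - 1
    if i < 0:
        raise LookupError(f"no feature for {key} before t={ts}")
    return history[i][1]

assert feature_as_of("user-1", 250) == 0.5  # not the later value 0.9
```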

Complementing Other Data Platforms: Streaming Meets Analytics

Data streaming complements platforms like Databricks, Snowflake, and Microsoft Fabric, creating a seamless ecosystem for AI/ML workflows:

  • Databricks: While Databricks excels in large-scale batch processing and AI model training, Kafka adds real-time data ingestion and pre-processing capabilities.
  • Snowflake: Zero-ETL integration with Kafka and Flink allows for real-time analytics alongside Snowflake’s strong data warehousing and AI features.
  • Microsoft Fabric: Fabric’s AI-powered analytics gain agility from Kafka’s event-driven architecture, ensuring near-instant data availability.

Shift Left Architecture with Apache Iceberg as Open Table Format for Data Sharing

The Shift Left Architecture emphasizes moving from traditional batch processing and lakehouse-centric approaches to real-time data products, empowering businesses to act on data faster and with greater agility. Learn more about this transformative approach in my Shift Left Architecture blog post.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Meanwhile, Apache Iceberg, an open table format for lakehouses and streaming, ensures seamless data sharing across real-time and batch workflows by providing a unified view of data. Dive deeper into its capabilities in my Apache Iceberg blog post.

The Shift Left Architecture for Modern Data Architectures

This complementary relationship enables businesses to leverage best-in-class tools without trade-offs, providing both real-time and batch capabilities. Learn more in my comparison blog series "Data Streaming with Kafka and Flink vs. Snowflake" and "Microsoft Fabric and Apache Kafka".

The adoption of real-time ML with Kafka and Flink drives tangible business outcomes:

  1. Enhanced User Engagement: Personalized recommendations lead to improved customer retention.
  2. Faster Time to Market: Real-time data pipelines reduce the lead time for deploying ML solutions.
  3. Improved ROI: Real-time adaptability ensures models deliver consistent business value.
  4. Freedom of Choice: Kafka acts as the backbone, enabling seamless integration with diverse tools and platforms.

This translates to a flexible, scalable, and high-performing ML infrastructure capable of handling evolving business demands.

Online machine learning with Apache Kafka and Flink is the future of adaptive, real-time AI. TikTok’s success story is a testament to the power of dynamic AI/ML systems in driving engagement and staying competitive. By complementing platforms like Snowflake, Databricks, and Microsoft Fabric, data streaming enables a holistic, future-proof data strategy.

Organizations must embrace these technologies to unlock faster time to market, unparalleled user experiences, and sustained business growth.

Let’s connect on LinkedIn and discuss how to implement these ideas in your organization. Stay informed about new developments by subscribing to my newsletter. And make sure to download my free book about data streaming use cases.

The post Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink appeared first on Kai Waehner.

]]>