Apache Kafka
Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day, designed for high-throughput, fault-tolerant real-time data pipelines and streaming applications.
Supported Versions and Architectures
- Versions: Kafka 2.0 to 2.5 (built on Scala 2.12)
- Architectures: Single-node or cluster
Supported Data Types
| Category | Data Types |
| --- | --- |
| Boolean | BOOLEAN |
| Integer | SHORT, INTEGER, LONG |
| Floating Point | FLOAT, DOUBLE |
| Numeric | NUMBER |
| String | CHAR, VARCHAR, STRING, TEXT |
| Binary | BINARY |
| Composite | ARRAY, MAP, OBJECT |
| Date/Time | TIME, DATE, DATETIME, TIMESTAMP |
| UUID | UUID |
Data Structure Modes
Kafka supports two data structure modes to meet different business requirements:
- Standard Structure (Default)
- Original Structure
Standard Structure (Default)
Purpose: Handles the complete set of DML operations (INSERT, UPDATE, DELETE) in a standardized event format for CDC scenarios.
Use Case: CDC log queues where relational database changes are streamed through Kafka to downstream systems.
Data Format:
```json
{
  "ts": 1727097087513,
  "op": "DML:UPDATE",
  "opTs": 1727097087512,
  "table": "table_name",
  "before": {},
  "after": {}
}
```
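For illustration, the sketch below consumes Standard Structure events with the kafka-python client and prints the change each one describes. The broker address, topic name, and consumer group ID are placeholders rather than connector defaults.

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "cdc_log_topic",                                  # hypothetical topic name
    bootstrap_servers="localhost:9092",               # placeholder broker address
    group_id="cdc-demo-group",                        # placeholder consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    event = record.value                              # Standard Structure event, as shown above
    op = event["op"]                                  # e.g. "DML:INSERT", "DML:UPDATE", "DML:DELETE"
    if op.endswith("DELETE"):
        print("delete from", event["table"], "row:", event["before"])
    else:
        print("upsert into", event["table"], "row:", event["after"])
```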
Original Structure
Purpose: Native Kafka message handling with full control over the message structure, supporting append-only operations.
Use Case: Unstructured data transformation and homogeneous data migration scenarios.
Data Format:
```json
{
  "partition": 3,
  "timestamp": 1638349200000,
  "headers": {"headerKey1": "headerValue1"},
  "key": "user123",
  "value": {
    "id": 1,
    "name": "John Doe",
    "action": "login"
  }
}
```
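As a rough sketch of the Original Structure, the snippet below uses the kafka-python producer to publish a message with the same key, headers, value, and explicit partition as the sample above. The broker address and topic name are placeholders.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # placeholder broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send(
    "user_events",                                    # hypothetical topic name
    key="user123",
    value={"id": 1, "name": "John Doe", "action": "login"},
    headers=[("headerKey1", b"headerValue1")],        # header values must be bytes
    partition=3,                                      # explicit partition, matching the sample
)
producer.flush()
```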
Sync Modes
- Full Only: Reads from the beginning and stops at the current position
- Full + Incremental: Reads all historical data then continues with real-time streaming
- Incremental Only: Starts from the current position or a specified timestamp (see the sketch after this list)
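For the Incremental Only mode, one way to start from a specified timestamp is to resolve that timestamp to per-partition offsets and seek to them before polling. The following kafka-python sketch assumes a hypothetical topic, broker address, and start time.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",               # placeholder broker address
    enable_auto_commit=False,
)

topic = "cdc_log_topic"                               # hypothetical topic name
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
consumer.assign(partitions)

start_ms = 1727097087000                              # hypothetical start timestamp (milliseconds)
offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})

for tp, found in offsets.items():
    if found is not None:                             # None if no message exists at or after the timestamp
        consumer.seek(tp, found.offset)

for record in consumer:
    print(record.partition, record.offset, record.value)
```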
Limitations
- Authentication: Currently supports only authentication-free Kafka instances
- Data Types: Source data types must be compatible with target system requirements
- Delivery Semantics: At-least-once delivery may produce duplicates; ensure writes on the target side are idempotent (see the sketch after this list)
- Consumer Groups: Each consumption thread uses a different consumer group ID
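Because at-least-once delivery can replay events, the target side should tolerate duplicates. The sketch below illustrates the idea with an in-memory table keyed by primary key; the table, key, and event are purely illustrative.

```python
# An in-memory dict stands in for a real target table keyed by primary key.
target_table = {}

def apply_event(event):
    """Apply a Standard Structure event so that replaying it is harmless."""
    if event["op"].endswith("DELETE"):
        target_table.pop(event["before"]["id"], None)  # deleting an absent row is a no-op
    else:
        row = event["after"]
        target_table[row["id"]] = row                  # upsert: replaying the same event changes nothing

event = {"op": "DML:UPDATE", "table": "users",
         "before": {"id": 1}, "after": {"id": 1, "name": "John Doe"}}
apply_event(event)
apply_event(event)                                     # duplicate delivery leaves the table unchanged
print(target_table)                                    # {1: {'id': 1, 'name': 'John Doe'}}
```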