Getting Started

Data Flow Overview

The platform follows an event-driven architecture where data flows through multiple stages:

┌─────────────┐     ┌──────────┐     ┌──────────────┐     ┌─────────────┐
│  Producer   │────▶│  Kafka   │────▶│   Consumer   │────▶│  Sentiment  │
│   Service   │     │ (Topics) │     │   Service    │     │   Service   │
└─────────────┘     └──────────┘     └──────────────┘     └─────────────┘
                         │                   │                     │
                         │                   │                     │
                         │                   ▼                     │
                         │            ┌─────────────┐              │
                         │            │ PostgreSQL  │              │
                         │            │  + Redis    │              │
                         │            └─────────────┘              │
                         │                   │                     │
                         ▼                   ▼                     │
                     ┌──────────┐     ┌──────────────┐              │
                     │   API    │◀────│ Kafka Topic  │◀─────────────┘
                    │ Service  │     │(processed...)│
                    └──────────┘     └──────────────┘

                         ▼ (SSE)
                    ┌──────────┐
                    │Dashboard │
                    └──────────┘

Stage-by-Stage Description

1. Producer → Kafka (raw-comments)

What happens:
  • Producer generates mock comments every 100ms-10s
  • Comments include text, source (Twitter/Instagram/etc), timestamp
  • 5% are intentional duplicates to test deduplication
  • Published to raw-comments Kafka topic
Why:
  • Kafka acts as a buffer between generation and processing
  • Ensures no message loss even if the consumer is down
  • Allows multiple consumer groups to process the same data
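
A minimal sketch of the producer's generator, assuming illustrative field names (commentId, text, source); only the 100ms-10s interval and the 5% duplicate rate come from the description above:

```typescript
import { randomUUID } from "node:crypto";

const SOURCES = ["twitter", "instagram", "facebook", "tiktok"] as const;

interface RawComment {
  commentId: string;
  text: string;
  source: (typeof SOURCES)[number];
  timestamp: string;
}

let lastComment: RawComment | null = null;

function nextComment(): RawComment {
  // ~5% of the time, re-emit the previous comment to exercise deduplication.
  if (lastComment && Math.random() < 0.05) return lastComment;
  const comment: RawComment = {
    commentId: randomUUID(),
    text: `mock comment ${Date.now()}`,
    source: SOURCES[Math.floor(Math.random() * SOURCES.length)],
    timestamp: new Date().toISOString(),
  };
  lastComment = comment;
  return comment;
}

// Random publish interval between 100 ms and 10 s.
function nextDelayMs(): number {
  return 100 + Math.floor(Math.random() * 9900);
}
```

Each generated comment would then be sent with a Kafka producer client to the raw-comments topic, keyed by commentId.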

2. Consumer Receives Raw Comment

What happens:
  • Consumer subscribes to raw-comments topic
  • Receives comment message
  • Begins processing pipeline
Why:
  • Kafka consumer groups enable horizontal scaling
  • Each message is delivered to one consumer within the group; delivery is at-least-once, which is why the deduplication step follows
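
The scaling property comes from partitioning: each partition is owned by exactly one consumer in a group, and a message's key decides its partition. A toy sketch of key-based assignment (Kafka itself hashes keys with murmur2; the FNV-1a hash here is just a stand-in):

```typescript
// Toy illustration of Kafka-style key partitioning: the same key always
// lands on the same partition, so exactly one consumer in the group sees it.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime
  }
  return (hash >>> 0) % numPartitions;
}
```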

3. Deduplication Check

What happens:
  • Check Redis using commentId as key (processed:{commentId})
  • If found in Redis → comment already processed, discard and log
  • If not found → mark as new and continue processing
Why:
  • Prevents processing the same comment multiple times
  • Redis persists across service restarts (3-hour TTL)
  • Uses commentId to detect duplicate submissions
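
In Redis, a SET with the NX and EX options performs the check and the mark atomically. This in-memory sketch mimics those semantics (the real pipeline checks here and marks after the database write in step 5; combining the two is a simplification):

```typescript
// In-memory stand-in for Redis `SET processed:{id} 1 NX EX 10800`:
// returns true if the key was newly set (first sighting),
// false if it already existed (duplicate).
const store = new Map<string, number>(); // key -> expiry timestamp (ms)

function markIfNew(commentId: string, ttlSeconds = 3 * 60 * 60): boolean {
  const key = `processed:${commentId}`;
  const now = Date.now();
  const expiry = store.get(key);
  if (expiry !== undefined && expiry > now) return false; // duplicate, discard
  store.set(key, now + ttlSeconds * 1000); // 3-hour TTL
  return true; // new comment, continue processing
}
```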

4. Sentiment Analysis (gRPC)

What happens:
  • Hash comment text (SHA256 of lowercase, trimmed text)
  • Check LRU cache (100 entries) for cached sentiment result
  • If cache hit → use cached tag, skip gRPC call
  • If cache miss → call Sentiment service via gRPC
  • Sentiment service classifies: positive/negative/neutral/unrelated
  • Uses keyword matching (e.g., "amazing" → positive)
  • Cache the result for future identical text
  • 1 in 32 requests randomly fail to simulate errors
Why:
  • LRU cache avoids redundant analysis for identical text
  • gRPC is faster than REST for service-to-service calls
  • Failure simulation tests the retry mechanism
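
The hash-then-cache step can be sketched with Node's crypto module and a Map-based LRU. The 100-entry limit and lowercase/trim normalization come from the description above; function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the lowercased, trimmed text, so "  Amazing! " and "amazing!"
// share one cache entry.
function textHash(text: string): string {
  return createHash("sha256").update(text.trim().toLowerCase()).digest("hex");
}

// Minimal LRU: a Map preserves insertion order, so deleting and re-setting
// a key moves it to the "most recently used" end.
const CACHE_MAX = 100;
const cache = new Map<string, string>(); // textHash -> sentiment tag

function cachedTag(hash: string): string | undefined {
  const tag = cache.get(hash);
  if (tag !== undefined) {
    cache.delete(hash);
    cache.set(hash, tag); // refresh recency on hit
  }
  return tag;
}

function storeTag(hash: string, tag: string): void {
  if (cache.size >= CACHE_MAX) {
    // Evict the least recently used entry (first key in insertion order).
    cache.delete(cache.keys().next().value!);
  }
  cache.set(hash, tag);
}
```

On a miss, the consumer would call the Sentiment service over gRPC and store the returned tag with storeTag.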

5. Save to Database

What happens:
  • Consumer stores comment in PostgreSQL
  • Includes: text, textHash, tag, source, timestamps, consumerId, retry count
  • Marks commentId as processed in Redis for deduplication
Why:
  • PostgreSQL for persistent storage
  • Enables querying by tag, source, date range
  • Tracks retry attempts and consumer instance for monitoring
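
The stored row might look like the sketch below. Column names are assumptions based on the fields listed above, and the $1..$n placeholders follow node-postgres style; parameterized queries keep the insert safe from SQL injection:

```typescript
interface ProcessedComment {
  commentId: string;
  text: string;
  textHash: string;
  tag: "positive" | "negative" | "neutral" | "unrelated";
  source: string;
  createdAt: string;
  processedAt: string;
  consumerId: string;
  retryCount: number;
}

// Build a parameterized INSERT for the comments table (name assumed).
function buildInsert(c: ProcessedComment): { sql: string; values: unknown[] } {
  const columns = [
    "comment_id", "text", "text_hash", "tag", "source",
    "created_at", "processed_at", "consumer_id", "retry_count",
  ];
  const placeholders = columns.map((_, i) => `$${i + 1}`).join(", ");
  return {
    sql: `INSERT INTO comments (${columns.join(", ")}) VALUES (${placeholders})`,
    values: [
      c.commentId, c.text, c.textHash, c.tag, c.source,
      c.createdAt, c.processedAt, c.consumerId, c.retryCount,
    ],
  };
}
```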

6. Publish to processed-comments

What happens:
  • Consumer publishes enriched comment to Kafka
  • Topic: processed-comments
  • Includes sentiment tag and processing metadata
Why:
  • Decouples consumer from API
  • Allows other services to consume processed data
  • Event sourcing pattern

7. API Streams Processed Comments

What happens:
  • API service subscribes to processed-comments
  • Receives processed comment from Kafka
  • Comment is already in database (saved by Consumer)
  • Broadcasts comment via SSE to connected dashboard clients
Why:
  • API doesn't need direct database writes
  • Real-time updates to dashboard via SSE
  • Decoupled from consumer service
  • Multiple API instances can stream same data
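
An SSE event is just text on a long-lived HTTP response: optional id: and event: fields, one or more data: lines, then a blank line. A sketch of serializing a comment event (function name is illustrative):

```typescript
// Serialize one Server-Sent Events frame. Per the SSE format, a frame is
// optional "id:" / "event:" fields plus "data:" lines, ended by a blank line.
function sseFrame(eventName: string, payload: unknown, id?: string): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${eventName}`);
  // JSON.stringify never emits raw newlines, so one data: line suffices.
  lines.push(`data: ${JSON.stringify(payload)}`);
  return lines.join("\n") + "\n\n";
}
```

The API would write each frame to every connected client's response stream as processed comments arrive from Kafka.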

8. Dashboard Receives Updates

What happens:
  • Browser maintains SSE connection to API
  • Receives comment events from the stream
  • Validates comment data with Zod schema
  • Inserts into TanStack DB collection (auto-persists to localStorage)
  • Components using useLiveQuery automatically re-render
  • Statistics computed on-the-fly from the collection
Why:
  • SSE is simpler than WebSockets for one-way updates
  • TanStack DB provides reactive queries without manual state management
  • localStorage persistence enables offline support
  • Computing statistics from collection ensures data consistency
  • No manual polling needed
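
Before a comment enters the collection it is validated. The dashboard uses a Zod schema, but the check amounts to something like this hand-rolled sketch (field names are assumptions):

```typescript
interface DashboardComment {
  commentId: string;
  text: string;
  tag: string;
  source: string;
  timestamp: string;
}

const TAGS = new Set(["positive", "negative", "neutral", "unrelated"]);

// Runtime shape check standing in for a Zod schema's safeParse:
// returns the typed comment on success, null on any malformed event.
function parseComment(raw: unknown): DashboardComment | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  for (const k of ["commentId", "text", "tag", "source", "timestamp"]) {
    if (typeof r[k] !== "string") return null;
  }
  if (!TAGS.has(r.tag as string)) return null;
  return r as unknown as DashboardComment;
}
```

Rejecting malformed events at the boundary means everything inside the TanStack DB collection is already well-typed.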

Error Handling Flow

When sentiment analysis fails:

Consumer → Sentiment (fails)
        ↓
Retry with exponential backoff (1s, 2s, 4s, 8s, 16s)
        ↓
After 5 attempts → publish to dead-letter-queue
        ↓
Manual investigation required
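
The retry schedule doubles each delay, i.e. exponential backoff with a 1s base. A minimal sketch of the decision (names are illustrative):

```typescript
// Attempts 0..4 retry after 1s, 2s, 4s, 8s, 16s; after the 5th failure
// the message is routed to the dead-letter queue instead.
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 1000;

function nextAction(attempt: number): { retryInMs: number } | { deadLetter: true } {
  if (attempt >= MAX_ATTEMPTS) return { deadLetter: true };
  return { retryInMs: BASE_DELAY_MS * 2 ** attempt };
}
```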

Quick Setup

# Install dependencies
pnpm install
 
# Start all services
pnpm docker:up
 
# View logs
pnpm docker:logs
 
# Access dashboard
open http://localhost:3000

Verify Data Flow

  1. Check Producer - Logs should show "Published comment..."
  2. Check Consumer - Logs show "Processing comment..." and "Saved to database"
  3. Check API - Visit http://localhost:3001/api/statistics
  4. Check Dashboard - See real-time updates at http://localhost:3000

Next Steps