Getting Started

Data Flow Overview

The platform follows an event-driven architecture where data flows through multiple stages:

┌─────────────┐     ┌──────────┐     ┌──────────────┐     ┌─────────────┐
│  Producer   │────▶│  Kafka   │────▶│   Consumer   │────▶│  Sentiment  │
│   Service   │     │ (Topics) │     │   Service    │     │   Service   │
└─────────────┘     └──────────┘     └──────────────┘     └─────────────┘
                         │                   │                     │
                         │                   │                     │
                         │                   ▼                     │
                         │            ┌─────────────┐              │
                         │            │ PostgreSQL  │              │
                         │            │  + Redis    │              │
                         │            └─────────────┘              │
                         │                   │                     │
                         ▼                   ▼                     │
                     ┌──────────┐     ┌──────────────┐              │
                     │   API    │◀────│ Kafka Topic  │◀─────────────┘
                    │ Service  │     │(processed...)│
                    └──────────┘     └──────────────┘

                         ▼ (SSE)
                    ┌──────────┐
                    │Dashboard │
                    └──────────┘

Stage-by-Stage Description

1. Producer → Kafka (raw-comments)

What happens:
  • Producer generates mock comments every 100ms-10s
  • Comments include text, source (Twitter/Instagram/etc), timestamp
  • 5% are intentional duplicates to test deduplication
  • Published to raw-comments Kafka topic
Why:
  • Kafka acts as a buffer between generation and processing
  • Ensures no message loss even if the consumer is down
  • Allows multiple consumer groups to process the same data
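
A minimal sketch of the producer's generator, assuming illustrative field names (commentId, text, source); only the 100ms-10s interval and the 5% duplicate rate come from the description above:

```typescript
import { randomUUID } from "node:crypto";

const SOURCES = ["twitter", "instagram", "facebook", "tiktok"] as const;

interface RawComment {
  commentId: string;
  text: string;
  source: (typeof SOURCES)[number];
  timestamp: string;
}

let lastComment: RawComment | null = null;

function nextComment(): RawComment {
  // ~5% of the time, re-emit the previous comment to exercise deduplication.
  if (lastComment && Math.random() < 0.05) return lastComment;
  const comment: RawComment = {
    commentId: randomUUID(),
    text: `mock comment ${Date.now()}`,
    source: SOURCES[Math.floor(Math.random() * SOURCES.length)],
    timestamp: new Date().toISOString(),
  };
  lastComment = comment;
  return comment;
}

// Random publish interval between 100 ms and 10 s.
function nextDelayMs(): number {
  return 100 + Math.floor(Math.random() * 9900);
}
```

Each generated comment would then be sent with a Kafka producer client to the raw-comments topic, keyed by commentId.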

2. Consumer Receives Raw Comment

What happens:
  • Consumer subscribes to raw-comments topic
  • Receives comment message
  • Begins processing pipeline
Why:
  • Kafka consumer groups enable horizontal scaling
  • Each message is delivered to one consumer within the group; delivery is at-least-once, which is why the deduplication step follows
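
The scaling property comes from partitioning: each partition is owned by exactly one consumer in a group, and a message's key decides its partition. A toy sketch of key-based assignment (Kafka itself hashes keys with murmur2; the FNV-1a hash here is just a stand-in):

```typescript
// Toy illustration of Kafka-style key partitioning: the same key always
// lands on the same partition, so exactly one consumer in the group sees it.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime
  }
  return (hash >>> 0) % numPartitions;
}
```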

3. Deduplication Check

What happens:
  • Check Redis using commentId as key (processed:{commentId})
  • If found in Redis → comment already processed, discard and log
  • If not found → mark as new and continue processing
Why:
  • Prevents processing the same comment multiple times
  • Redis persists across service restarts (3-hour TTL)
  • Uses commentId to detect duplicate submissions
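
In Redis, a SET with the NX and EX options performs the check and the mark atomically. This in-memory sketch mimics those semantics (the real pipeline checks here and marks after the database write in step 5; combining the two is a simplification):

```typescript
// In-memory stand-in for Redis `SET processed:{id} 1 NX EX 10800`:
// returns true if the key was newly set (first sighting),
// false if it already existed (duplicate).
const store = new Map<string, number>(); // key -> expiry timestamp (ms)

function markIfNew(commentId: string, ttlSeconds = 3 * 60 * 60): boolean {
  const key = `processed:${commentId}`;
  const now = Date.now();
  const expiry = store.get(key);
  if (expiry !== undefined && expiry > now) return false; // duplicate, discard
  store.set(key, now + ttlSeconds * 1000); // 3-hour TTL
  return true; // new comment, continue processing
}
```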

4. Sentiment Analysis (gRPC)

What happens:
  • Hash comment text (SHA256 of lowercase, trimmed text)
  • Check LRU cache (100 entries) for cached sentiment result
  • If cache hit → use cached tag, skip gRPC call
  • If cache miss → call Sentiment service via gRPC
  • Sentiment service classifies: positive/negative/neutral/unrelated
  • Uses keyword matching (e.g., "amazing" → positive)
  • Cache the result for future identical text
  • 1 in 32 requests randomly fail to simulate errors
Why:
  • LRU cache avoids redundant analysis for identical text
  • gRPC is faster than REST for service-to-service calls
  • Failure simulation tests the retry mechanism
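
The hash-then-cache step can be sketched with Node's crypto module and a Map-based LRU. The 100-entry limit and lowercase/trim normalization come from the description above; function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the lowercased, trimmed text, so "  Amazing! " and "amazing!"
// share one cache entry.
function textHash(text: string): string {
  return createHash("sha256").update(text.trim().toLowerCase()).digest("hex");
}

// Minimal LRU: a Map preserves insertion order, so deleting and re-setting
// a key moves it to the "most recently used" end.
const CACHE_MAX = 100;
const cache = new Map<string, string>(); // textHash -> sentiment tag

function cachedTag(hash: string): string | undefined {
  const tag = cache.get(hash);
  if (tag !== undefined) {
    cache.delete(hash);
    cache.set(hash, tag); // refresh recency on hit
  }
  return tag;
}

function storeTag(hash: string, tag: string): void {
  if (cache.size >= CACHE_MAX) {
    // Evict the least recently used entry (first key in insertion order).
    cache.delete(cache.keys().next().value!);
  }
  cache.set(hash, tag);
}
```

On a miss, the consumer would call the Sentiment service over gRPC and store the returned tag with storeTag.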

5. Save to Database

What happens:
  • Consumer stores comment in PostgreSQL
  • Includes: text, textHash, tag, source, timestamps, consumerId, retry count
  • Marks commentId as processed in Redis for deduplication
Why:
  • PostgreSQL for persistent storage
  • Enables querying by tag, source, date range
  • Tracks retry attempts and consumer instance for monitoring
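
The stored row might look like the sketch below. Column names are assumptions based on the fields listed above, and the $1..$n placeholders follow node-postgres style; parameterized queries keep the insert safe from SQL injection:

```typescript
interface ProcessedComment {
  commentId: string;
  text: string;
  textHash: string;
  tag: "positive" | "negative" | "neutral" | "unrelated";
  source: string;
  createdAt: string;
  processedAt: string;
  consumerId: string;
  retryCount: number;
}

// Build a parameterized INSERT for the comments table (name assumed).
function buildInsert(c: ProcessedComment): { sql: string; values: unknown[] } {
  const columns = [
    "comment_id", "text", "text_hash", "tag", "source",
    "created_at", "processed_at", "consumer_id", "retry_count",
  ];
  const placeholders = columns.map((_, i) => `$${i + 1}`).join(", ");
  return {
    sql: `INSERT INTO comments (${columns.join(", ")}) VALUES (${placeholders})`,
    values: [
      c.commentId, c.text, c.textHash, c.tag, c.source,
      c.createdAt, c.processedAt, c.consumerId, c.retryCount,
    ],
  };
}
```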

6. Publish to processed-comments

What happens:
  • Consumer publishes enriched comment to Kafka
  • Topic: processed-comments
  • Includes sentiment tag and processing metadata
Why:
  • Decouples consumer from API
  • Allows other services to consume processed data
  • Event sourcing pattern

7. API Streams Processed Comments

What happens:
  • API service subscribes to processed-comments
  • Receives processed comment from Kafka
  • Comment is already in database (saved by Consumer)
  • Broadcasts comment via SSE to connected dashboard clients
Why:
  • API doesn't need direct database writes
  • Real-time updates to dashboard via SSE
  • Decoupled from consumer service
  • Multiple API instances can stream same data
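
An SSE event is just text on a long-lived HTTP response: optional id: and event: fields, one or more data: lines, then a blank line. A sketch of serializing a comment event (function name is illustrative):

```typescript
// Serialize one Server-Sent Events frame. Per the SSE format, a frame is
// optional "id:" / "event:" fields plus "data:" lines, ended by a blank line.
function sseFrame(eventName: string, payload: unknown, id?: string): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${eventName}`);
  // JSON.stringify never emits raw newlines, so one data: line suffices.
  lines.push(`data: ${JSON.stringify(payload)}`);
  return lines.join("\n") + "\n\n";
}
```

The API would write each frame to every connected client's response stream as processed comments arrive from Kafka.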

8. Dashboard Receives Updates

What happens:
  • Browser maintains SSE connection to API
  • Receives comment events from the stream
  • Validates comment data with Zod schema
  • Inserts into TanStack DB collection (auto-persists to localStorage)
  • Components using useLiveQuery automatically re-render
  • Statistics computed on-the-fly from the collection
Why:
  • SSE is simpler than WebSockets for one-way updates
  • TanStack DB provides reactive queries without manual state management
  • localStorage persistence enables offline support
  • Computing statistics from collection ensures data consistency
  • No manual polling needed
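
Before a comment enters the collection it is validated. The dashboard uses a Zod schema, but the check amounts to something like this hand-rolled sketch (field names are assumptions):

```typescript
interface DashboardComment {
  commentId: string;
  text: string;
  tag: string;
  source: string;
  timestamp: string;
}

const TAGS = new Set(["positive", "negative", "neutral", "unrelated"]);

// Runtime shape check standing in for a Zod schema's safeParse:
// returns the typed comment on success, null on any malformed event.
function parseComment(raw: unknown): DashboardComment | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  for (const k of ["commentId", "text", "tag", "source", "timestamp"]) {
    if (typeof r[k] !== "string") return null;
  }
  if (!TAGS.has(r.tag as string)) return null;
  return r as unknown as DashboardComment;
}
```

Rejecting malformed events at the boundary means everything inside the TanStack DB collection is already well-typed.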

Error Handling Flow

When sentiment analysis fails:

Consumer → Sentiment (fails)
        ↓
Retry with exponential backoff (1s, 2s, 4s, 8s, 16s)
        ↓
After 5 attempts → publish to dead-letter-queue
        ↓
Manual investigation required
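
The retry schedule doubles each delay, i.e. exponential backoff with a 1s base. A minimal sketch of the decision (names are illustrative):

```typescript
// Attempts 0..4 retry after 1s, 2s, 4s, 8s, 16s; after the 5th failure
// the message is routed to the dead-letter queue instead.
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 1000;

function nextAction(attempt: number): { retryInMs: number } | { deadLetter: true } {
  if (attempt >= MAX_ATTEMPTS) return { deadLetter: true };
  return { retryInMs: BASE_DELAY_MS * 2 ** attempt };
}
```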

Quick Setup

# Install dependencies
pnpm install
 
# Start all services
pnpm docker:up
 
# View logs
pnpm docker:logs
 
# Access dashboard
open http://localhost:3000

Verify Data Flow

  1. Check Producer - Logs should show "Published comment..."
  2. Check Consumer - Logs show "Processing comment..." and "Saved to database"
  3. Check API - Visit http://localhost:3001/api/statistics
  4. Check Dashboard - See real-time updates at http://localhost:3000

Next Steps