Realtime Data Pipeline

A real-time data pipeline that helps analytics and AI teams consume up-to-date PostgreSQL data in BigQuery at massive scale.

Categories

Data

About the Project

This project involved designing and building a real-time data pipeline that synchronizes golden data from PostgreSQL to BigQuery. The system combines an initial large-scale batch migration with a streaming pipeline based on Change Data Capture (CDC), which keeps BigQuery consistent with the source in near real time. The pipeline was designed to handle billions of records (~9 TB of data) and supports downstream analytics and chatbot use cases that require low-latency, up-to-date data.
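One common difficulty when a batch snapshot is combined with a CDC stream is that change events can overlap with the snapshot or be delivered more than once. The project's actual merge logic is not described here, so the following is only a minimal Go sketch of one way to handle it, assuming each change event carries a monotonically increasing log position. The names `Event`, `Table`, `LoadSnapshot`, and `Apply` are hypothetical, and an in-memory map stands in for the target table.

```go
package main

import "sort"

// Op is the kind of row change carried by a CDC event.
type Op int

const (
	Upsert Op = iota // insert or update
	Delete
)

// Event is a hypothetical, simplified change record. LSN stands in for
// the source log position and is assumed to increase monotonically.
type Event struct {
	LSN   uint64
	Op    Op
	Key   string
	Value string
}

// row keeps the last applied log position, plus a tombstone flag so that
// stale events arriving after a delete can still be rejected.
type row struct {
	value   string
	lsn     uint64
	deleted bool
}

// Table is an in-memory stand-in for the target table (an assumption
// made for illustration only).
type Table struct {
	rows map[string]row
}

// NewTable returns an empty table.
func NewTable() *Table {
	return &Table{rows: make(map[string]row)}
}

// LoadSnapshot seeds the table from the batch migration. snapshotLSN is
// the log position at which the snapshot was taken.
func (t *Table) LoadSnapshot(data map[string]string, snapshotLSN uint64) {
	for k, v := range data {
		t.rows[k] = row{value: v, lsn: snapshotLSN}
	}
}

// Apply applies one change event. It returns false when the event is
// stale or a duplicate, which makes replays of the stream harmless.
func (t *Table) Apply(e Event) bool {
	if cur, ok := t.rows[e.Key]; ok && e.LSN <= cur.lsn {
		return false
	}
	t.rows[e.Key] = row{value: e.Value, lsn: e.LSN, deleted: e.Op == Delete}
	return true
}

// Get returns the current value for key, if the row exists and is live.
func (t *Table) Get(key string) (string, bool) {
	r, ok := t.rows[key]
	if !ok || r.deleted {
		return "", false
	}
	return r.value, true
}

// Keys returns the keys of all live rows in sorted order.
func (t *Table) Keys() []string {
	keys := make([]string, 0, len(t.rows))
	for k, r := range t.rows {
		if !r.deleted {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}
```

In this sketch, a caller would load the snapshot once with its log position and then feed every streamed event through `Apply`. Events at or below the stored position for a key are skipped, so replaying part of the stream after a restart leaves the table unchanged.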

Technologies Used

Golang, PostgreSQL, CDC, Kafka, BigQuery, Bash Script

Key Outcomes

  • Successfully migrated billions of records (~9TB) from PostgreSQL to BigQuery
  • Built a CDC-based streaming pipeline for near-real-time data synchronization
  • Enabled real-time analytics and chatbot workloads with up-to-date data
  • Designed a scalable and fault-tolerant data ingestion architecture