Realtime Data Pipeline
A real-time data pipeline that helps analytics and AI teams consume up-to-date PostgreSQL data in BigQuery at massive scale.
Categories
Data
About the Project
This project involved designing and building a real-time data pipeline to synchronize golden (source-of-truth) data from PostgreSQL to BigQuery. The system combines an initial large-scale batch migration with a Change Data Capture (CDC) streaming pipeline to maintain near-real-time consistency between the two stores. The pipeline was designed to handle billions of records (~9 TB of data) and supports downstream analytics and chatbot use cases that require low-latency, up-to-date data.
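To illustrate the streaming half of the architecture, here is a minimal sketch of a CDC consumer in Go: it reads Debezium-style change events from a Kafka topic and streams the after-images into BigQuery. The envelope format, broker address, topic, project, dataset, and table names are all illustrative assumptions, not the project's actual configuration.

```go
// Hypothetical sketch: consume CDC change events from Kafka and stream the
// "after" row images into BigQuery via the streaming insert API.
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/bigquery"
	"github.com/segmentio/kafka-go"
)

// changeEvent models a Debezium-style CDC message (assumed envelope format).
type changeEvent struct {
	Op    string                 `json:"op"`    // "c" create, "u" update, "d" delete
	After map[string]interface{} `json:"after"` // row image after the change
}

// bqRow adapts a decoded row image to BigQuery's ValueSaver interface.
type bqRow struct{ fields map[string]interface{} }

func (r bqRow) Save() (map[string]bigquery.Value, string, error) {
	out := make(map[string]bigquery.Value, len(r.fields))
	for k, v := range r.fields {
		out[k] = v
	}
	return out, "", nil // empty insertID: BigQuery skips best-effort de-duplication
}

func main() {
	ctx := context.Background()

	// Kafka consumer for the CDC topic (broker/topic names are placeholders).
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "pg-to-bq-sync",
		Topic:   "pg.public.orders",
	})
	defer reader.Close()

	// BigQuery streaming inserter for the target table (names are placeholders).
	client, err := bigquery.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatalf("bigquery client: %v", err)
	}
	defer client.Close()
	inserter := client.Dataset("analytics").Table("orders").Inserter()

	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatalf("read kafka message: %v", err)
		}

		var ev changeEvent
		if err := json.Unmarshal(msg.Value, &ev); err != nil {
			log.Printf("skip malformed event at offset %d: %v", msg.Offset, err)
			continue
		}
		if ev.Op == "d" || ev.After == nil {
			continue // deletes would need separate handling (e.g. soft-delete flag)
		}

		if err := inserter.Put(ctx, bqRow{fields: ev.After}); err != nil {
			log.Printf("bigquery insert failed at offset %d: %v", msg.Offset, err)
		}
	}
}
```

In a production setup, inserts would typically be batched and offsets committed only after a successful write, but the loop above captures the core Kafka-to-BigQuery flow.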
Technologies Used
Golang, PostgreSQL, CDC, Kafka, BigQuery, Bash Script
Key Outcomes
- Successfully migrated billions of records (~9 TB) from PostgreSQL to BigQuery
- Built a CDC-based streaming pipeline for near-real-time data synchronization
- Enabled real-time analytics and chatbot workloads with up-to-date data
- Designed a scalable, fault-tolerant data ingestion architecture