Daniel Egbo

Machine Learning & AI Engineer

Building production-grade LLM applications, multi-agent RAG infrastructure, and high-throughput ML pipelines. PhD researcher applying advanced scientific computing to large astronomy datasets at UCT/SAAO.

⚡ Specializing in applied ML, vector databases, and scalable AI infrastructure.

Cape Town, South Africa (Open to Remote / Relocation)

Google Scholar | LinkedIn | GitHub | Twitter

6+ Publications

15+ Projects

4 Awards & Grants

5+ Certifications

About Me

I am a Machine Learning Engineer and Astronomy PhD Candidate at the University of Cape Town and the South African Astronomical Observatory. My research focuses on active radio-emitting stars, a domain that requires building pipelines to process and cross-match massive multi-wavelength datasets from MeerKAT, Gaia, and eROSITA.

Alongside my academic research, I engineer production-grade machine learning systems, LLM applications, and robust data pipelines. My work includes architecting multi-agent RAG systems, fine-tuning domain-specific models, and building automated data workflows. I treat engineering with the same analytical rigor required by big-data astronomy, ensuring that systems are scalable, mathematically sound, and optimized for performance.

I actively collaborate on both scientific computing initiatives and applied AI projects, bringing a unique blend of deep analytical research and practical software engineering to teams building next-generation intelligent systems.

Featured Projects

⭐ Featured Project (GenAI / Agents)

Multi-Agent RAG Applications (GenAI Agents)

Designed and orchestrated a multi-agent retrieval-augmented generation (RAG) infrastructure using LangGraph to handle complex, document-based scientific QA. The architecture manages stateful routing, automated query reformulation, self-correction, and citation verification across isolated, specialized LLM agents. With this modular agentic framework, the system achieved a 30% reduction in hallucination rates and a 25% improvement in response accuracy when evaluated on dense scientific literature.

Tech Stack: LangGraph, LangChain, LiteLLM, Qdrant, Milvus, Airflow, Python
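The control flow described above (retrieve, generate, verify citations, reformulate on failure) can be illustrated with a plain-Python sketch. This is not the project's code and it deliberately does not use the LangGraph API; the corpus, the stub agents, and every function name here are hypothetical stand-ins, with simple string operations in place of LLM calls.

```python
# Toy stand-in for a stateful retrieve -> generate -> verify -> reformulate loop.
# All names and data are hypothetical; real agents would call an LLM and a vector store.

CORPUS = {
    "doc1": "MeerKAT is a radio telescope in South Africa",
    "doc2": "Gaia measures stellar positions and parallaxes",
}


def retrieve(state):
    # Keyword overlap stands in for vector search.
    terms = set(state["query"].lower().split())
    state["docs"] = [
        doc_id for doc_id, text in CORPUS.items()
        if terms & set(text.lower().split())
    ]
    return state


def generate(state):
    # Stub for an LLM answer that cites every retrieved document.
    if state["docs"]:
        state["answer"] = "Answer based on " + ", ".join(state["docs"])
    else:
        state["answer"] = "No supporting documents found."
    state["citations"] = list(state["docs"])
    return state


def verify(state):
    # Citation check: at least one citation, and each one must exist in the corpus.
    cites = state["citations"]
    state["verified"] = bool(cites) and all(c in CORPUS for c in cites)
    return state


def reformulate(state):
    # Stub query rewrite: strip punctuation that blocks keyword matching.
    state["query"] = "".join(
        ch for ch in state["query"] if ch.isalnum() or ch.isspace()
    )
    return state


def route(state):
    # Routing decision based on the shared state.
    if state["verified"]:
        return "done"
    if state["attempts"] >= state["max_attempts"]:
        return "give_up"
    return "reformulate"


def run(query, max_attempts=2):
    state = {"query": query, "attempts": 0, "max_attempts": max_attempts}
    while True:
        state = verify(generate(retrieve(state)))
        state["attempts"] += 1
        step = route(state)
        if step != "reformulate":
            state["status"] = step
            return state
        state = reformulate(state)


if __name__ == "__main__":
    result = run("MeerKAT?")
    print(result["status"], result["attempts"], result["citations"])
```

In a graph framework, each function would typically become a node and `route` a conditional edge; the attempt cap keeps the self-correction loop from running indefinitely.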

Medical Speech Recognition Pipeline

Built an end-to-end automated fine-tuning and inference pipeline for OpenAI's Whisper ASR model on clinical dictation datasets. The project focused on optimizing the model to handle specialized medical nomenclature, complex radiology reports, and heavily accented speech across more than 200 hours of recorded audio. The custom training pipeline reduced the Word Error Rate (WER) by 15% over out-of-the-box baselines, providing a reliable transcription layer for domain-specific medical terminology.

Tech Stack: PyTorch, Hugging Face, Whisper ASR, Python
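For reference, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal standard-library illustration, not the project's evaluation code; the function names and example sentences are hypothetical. The description above does not say whether the 15% figure is relative or absolute, so `relative_reduction` shows the relative reading only as an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, tuned_wer: float) -> float:
    """Fractional WER improvement relative to the baseline (assumed reading)."""
    return (baseline_wer - tuned_wer) / baseline_wer


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the patient has pneumonia", "the patient has ammonia"))
```

Production evaluations usually normalize casing and punctuation before scoring, which this sketch leaves out.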

Big Data Spatial Cross-Matching: SARAO MeerKAT & Gaia DR3

Engineered a large-scale spatial data processing pipeline to cross-match more than 443,000 radio sources from the MeerKAT Galactic Plane Survey against Gaia DR3’s 1.8-billion-object catalog. To separate genuine astrophysical matches from chance alignments in highly crowded fields, I developed a statistical validation pipeline that uses Monte Carlo simulations to quantify match reliability. The project identified 629 high-confidence candidate stellar counterparts, the largest radio-optical cross-match sample of its kind to date.

Tech Stack: Astropy, TAP, TOPCAT, Monte Carlo simulation, Python
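One common way to estimate chance alignments is to displace the radio positions by an offset much larger than the match radius, repeat the cross-match many times, and compare the resulting match counts with the real one. The sketch below illustrates that idea under stated assumptions: it is not the project's pipeline, it uses a flat-sky approximation with plain coordinate tuples instead of Astropy, and the function names, the fixed-offset scheme, and the reliability formula are illustrative choices.

```python
import math
import random


def count_matches(radio, optical, radius):
    """Count radio positions with at least one optical position within `radius`.

    Flat-sky approximation: positions are (x, y) tuples in the same units
    as `radius`. Brute force, so only suitable for small toy catalogs.
    """
    r2 = radius * radius
    return sum(
        any((rx - ox) ** 2 + (ry - oy) ** 2 <= r2 for ox, oy in optical)
        for rx, ry in radio
    )


def chance_match_rate(radio, optical, radius, shift, n_trials=200, seed=0):
    """Mean match count after moving every radio position by `shift`
    in a random direction, which breaks any true associations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        shifted = []
        for rx, ry in radio:
            theta = rng.uniform(0.0, 2.0 * math.pi)
            shifted.append(
                (rx + shift * math.cos(theta), ry + shift * math.sin(theta))
            )
        total += count_matches(shifted, optical, radius)
    return total / n_trials


def reliability(real_matches, chance_matches):
    """Estimated fraction of real matches that are not chance alignments."""
    if real_matches == 0:
        return 0.0
    return max(0.0, 1.0 - chance_matches / real_matches)


if __name__ == "__main__":
    optical = [(0.0, 0.0), (1.0, 1.0)]
    radio = [(0.0005, 0.0), (0.5, 0.5)]
    real = count_matches(radio, optical, radius=0.002)
    chance = chance_match_rate(radio, optical, radius=0.002, shift=0.05)
    print(real, chance, reliability(real, chance))
```

On real catalogs the separations would be computed on the sphere (for example with Astropy's `SkyCoord.match_to_catalog_sky`), and the offset would be chosen to preserve the local source density of the field.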

View All Projects | Data Science Portfolio

Technical Skills

ML & AI Engineering

  • Frameworks & Core ML: PyTorch · Scikit-learn · XGBoost · LightGBM · Hugging Face
  • Generative AI & Multi-Agent Systems: LangGraph · LangChain · LLM Orchestration · Model Evaluation · LiteLLM · Agentic Workflows
  • Vector Databases & Retrieval: Qdrant · Milvus · Pinecone · OpenSearch · Semantic Search · Hybrid Search
  • Fine-Tuning & Inference: Unsloth (PEFT/LoRA) · Inference Optimization · OpenAI APIs

Data Engineering & Cloud Infrastructure

  • Data Orchestration & Pipelines: Apache Airflow · Prefect · Kestra · dbt · ETL/ELT Pipelines
  • Databases & Big Data: SQL · BigQuery · DuckDB · PostgreSQL · Entity Resolution · Scalable Architecture
  • Cloud & Infrastructure: AWS · GCP · S3 · GCS · MinIO · Docker (Containerization) · Docker Compose · Kubernetes

Scientific Computing & Analytics

  • Statistical Modeling: Monte Carlo Simulations · Time-Series Analysis · Bayesian Inference · Hypothesis Testing · Predictive Analytics
  • High-Performance Compute: Parallel Processing · Slurm
  • Astrophysics Stack: Astropy · Specutils · Astroquery · LSDB · TOPCAT · TAP · ADQL · DS9 · CARTA

Selected Publications

View all publications & presentations

Honors and Awards

Professional Training & Certifications

NVIDIA Deep Learning Institute

  • Building Conversational AI Applications (2025)
  • Accelerating End-to-End Data Science Workflows (2024)
  • Getting Started with Deep Learning (2024)
  • Generative AI with Diffusion Models (2024)
  • Building RAG Agents with LLMs (2024)
  • Disaster Risk Monitoring Using Satellite Imagery (2024)

Data Science & ML

  • MLOps Zoomcamp - DataTalks.Club (2025)
  • Machine Learning Zoomcamp - DataTalks.Club (2023)
  • Applied Data Science II: ML & Statistical Analysis - WorldQuant University (2023)
  • Applied Data Science I: Scientific Computing & Python - WorldQuant University (2023)

Summer Programs

  • Oxford Machine Learning Summer School (2023)
  • COSPAR X-Vision School: X-ray Astronomy (2023)
  • ESCAPE Summer School: Data Science for Astronomy (2021)
  • ZTF Summer School: Time-domain Astronomy (2021)
  • GROWTH Astronomy School: Time-domain Astronomy (2020)