About
I'm Vitor Sousa, a Data Scientist & AI Engineer at Wellhub in Vizela, northern Portugal. I work across the full ML lifecycle — from research and experimentation to shipping and monitoring production systems. My deepest work right now lives at the intersection of reinforcement learning and language model alignment: understanding how policy optimization actually works, implementing the algorithms from scratch, and writing about what I find along the way.
Background
I studied Information Systems Engineering at the University of Minho, where I first got pulled into machine learning. What started as academic curiosity quickly turned into something deeper — the gap between understanding an algorithm on paper and making it work on real data was humbling, and closing that gap became the thing I cared about most.
At Wellhub (formerly Gympass) I work on the Care Engagement System, building contextual bandits for personalized nudges, Kubeflow pipelines for training and serving, and evaluation frameworks that keep deployed models honest. Production engineering taught me something textbooks don't — a model is only as good as the system around it.
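To make the contextual-bandit pattern concrete: a minimal epsilon-greedy bandit with a per-arm linear reward model might look like the sketch below. This is purely illustrative — the class name, the ridge-style update, and the epsilon value are my own placeholder choices, not Wellhub's production system.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Toy contextual bandit: one linear reward model per arm (nudge variant).

    Illustrative sketch only. Each arm keeps ridge-regression state
    (A = X^T X + I, b = X^T y) and estimates reward as theta @ context.
    """

    def __init__(self, n_arms, n_features, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.A = [np.eye(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def select(self, context):
        if self.rng.random() < self.epsilon:            # explore uniformly
            return int(self.rng.integers(len(self.A)))
        estimates = [np.linalg.solve(A, b) @ context    # exploit: pick the arm
                     for A, b in zip(self.A, self.b)]   # with highest estimate
        return int(np.argmax(estimates))

    def update(self, arm, context, reward):
        # Accumulate sufficient statistics for the chosen arm's linear model
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context
```

The interesting production questions start after this point: logging propensities for off-policy evaluation, handling delayed rewards, and retraining cadence.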
What I'm building toward is research-level depth in RL for LLM alignment. Not surface familiarity — the kind where you re-derive PPO's clipped objective from first principles, explain why GRPO eliminates the value network, and implement the full training loop without reaching for a library. I'm doing this systematically: one domain per quarter, from-scratch implementations, and writing that forces real understanding.
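For reference, the clipped objective in question, in the standard notation (with $r_t$ the probability ratio between the new and old policy and $\hat{A}_t$ the advantage estimate):

```latex
L^{\text{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

GRPO keeps this clipped surrogate but computes $\hat{A}$ by normalizing each sampled completion's reward against the mean and standard deviation of its group of samples for the same prompt — which is exactly why no learned value network is needed as a baseline.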
What I write about
My writing focuses on three areas: reinforcement learning for LLM alignment — a series covering PPO, GRPO, and GDPO in depth; contextual bandits from theory to production — a five-part series from decision frameworks through neural bandits to deployment; and systematic approaches to LLM evaluation. Everything I publish tries to bridge the gap between the paper and the pull request.
Current focus
Right now I'm in a deliberate foundations phase — working through Prince's Understanding Deep Learning, building transformer components from scratch (attention, positional encoding, full forward pass) with tests at every layer, and strengthening the math underpinnings: linear algebra, calculus, and probability at the derivation level.
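The attention piece of that exercise reduces to a few lines. A minimal sketch in NumPy (my own naming and shapes, not the code from my repo):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # convex mix of values
```

A test worth writing at this layer: with sharply peaked queries (e.g. `Q = K = 10 * I`), the softmax approaches one-hot and the output approaches `V` itself.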
The next stage shifts to reinforcement learning proper — Sutton & Barto cover to cover, from-scratch implementations of policy gradient methods and PPO. This converges on a flagship project: a research-quality RLVR + GRPO implementation that trains small language models on math reasoning, with proper benchmarks, ablations, and a technical write-up series documenting the full process.
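The group-relative advantage at the heart of GRPO is small enough to sketch here — this is my reading of the published algorithm, not the planned implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled completion's reward
    against its own group's mean and std, replacing a learned value baseline.

    rewards: (n_groups, group_size) -- one row per prompt, one column per
    sampled completion. eps guards against zero-variance groups.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)
```

With a verifiable reward (RLVR), each completion's reward is just a correctness check on the math answer, so the whole advantage computation needs no extra trained model.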