Collaboration with Alexandre Candeias, Ivo Silva, and José Marcelino.
The problem
High-end fashion has a return rate problem that generic sizing can’t solve. At Farfetch — a marketplace connecting 4M+ active customers across 190 countries with 1,200+ luxury brands — every brand has its own sizing standards, and every customer has their own fit preferences. A Gucci 42 is not a Balenciaga 42. A customer who buys “true to size” in sneakers might size up in boots.
The existing approach (SFNet) treated size prediction as a static classification problem. It worked, but left significant accuracy and coverage on the table. The business impact was direct: inaccurate size recommendations drove returns, which drove logistics costs, which drove customer churn.
What we built
Tailor reframed size recommendation as a sequential learning problem. Instead of treating each purchase in isolation, we modeled the full history of a customer’s interactions — purchases, Add2Bag events, return reasons — as a temporal sequence that encodes evolving size preferences.
Two model variants:
SSP-LSTM — a Long Short-Term Memory encoder that processes the user-event sequence (product characteristics, size chosen, outcome) and projects it into a shared embedding space with the target product. The LSTM captures how a customer’s sizing decisions evolve over time — a returned item teaches the model something different than a kept item.
SSP-Attention — replaces the LSTM with a self-attention mechanism over the event sequence. More expressive for capturing long-range dependencies (a return from 6 months ago is still informative), and parallelizable during training. Higher accuracy, slightly higher inference cost.
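To make the two variants concrete, here is a minimal PyTorch sketch of the shared-embedding idea: encode the event sequence (with either an LSTM or self-attention), project the target product into the same hidden space, and score size positions. All layer sizes, names, and the two-layer transformer are my illustrative assumptions, not the production architecture.

```python
import torch
import torch.nn as nn

class SSPEncoder(nn.Module):
    """Sketch: encode a user's event sequence and a target product into a
    shared space, then classify over size positions. Shapes are illustrative."""

    def __init__(self, event_dim=32, product_dim=16, hidden=64,
                 n_positions=9, use_attention=False):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            layer = nn.TransformerEncoderLayer(
                d_model=event_dim, nhead=4, batch_first=True)
            self.seq_encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.to_hidden = nn.Linear(event_dim, hidden)
        else:
            self.lstm = nn.LSTM(event_dim, hidden, batch_first=True)
        self.product_proj = nn.Linear(product_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, n_positions)

    def forward(self, events, product):
        # events: (batch, seq_len, event_dim); product: (batch, product_dim)
        if self.use_attention:
            # pool attended events into one sequence summary
            h = self.to_hidden(self.seq_encoder(events).mean(dim=1))
        else:
            # last hidden state summarizes the sequence
            _, (h_n, _) = self.lstm(events)
            h = h_n[-1]
        p = self.product_proj(product)
        # logits over size positions for this user-product pair
        return self.classifier(torch.cat([h, p], dim=-1))
```

Swapping `use_attention=True` is the SSP-Attention variant: same interface, different sequence encoder — which is what makes the LSTM-vs-attention comparison clean.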
Both models predict a size position rather than an absolute size label (S/M/L/38/40/42). This was a key design decision: it reduces high-cardinality classification to a tractable problem and generalizes across brand-specific sizing systems. A “size position” represents where a customer sits on a brand’s size spectrum, decoupled from the label itself.
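A toy illustration of the decoupling (the size charts below are hypothetical, not real brand data): the model predicts an ordinal position, and the brand's own chart translates it back into a label at display time.

```python
# Hypothetical brand size charts: different labels, same ordinal structure.
BRAND_SIZE_CHARTS = {
    "brand_a": ["38", "40", "42", "44", "46"],
    "brand_b": ["XS", "S", "M", "L", "XL"],
}

def to_position(brand: str, label: str) -> int:
    """Absolute size label -> ordinal position within the brand's chart."""
    return BRAND_SIZE_CHARTS[brand].index(label)

def from_position(brand: str, position: int) -> str:
    """Predicted size position -> brand-specific label shown to the customer."""
    return BRAND_SIZE_CHARTS[brand][position]
```

The classifier only ever sees positions, so one output head covers every labeling scheme instead of one class per brand-specific label.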
The signals that made it work
The biggest technical insight was integrating Add2Bag interactions as implicit signals alongside explicit purchase and return data. Add2Bag events capture a customer’s size intent even when they don’t complete a purchase — massively expanding the signal surface.
This single change increased user coverage by 24.5% compared to using order data alone. More users with enough signal to generate a recommendation means fewer cold-start failures.
Return reasons provided explicit negative signal. When a customer returns an item because it “runs small,” that’s direct supervision for the model. We encoded return reasons as structured features in the event sequence, giving the model explicit fit-direction feedback.
Feature engineering and ablation studies
We ran ablation studies to isolate the impact of each signal type:
- Add2Bag integration: +24.5% user coverage
- Return reason encoding: measurable accuracy lift on repeat customers
- Temporal ordering vs bag-of-events: sequential models consistently outperformed unordered baselines, confirming that the order of sizing decisions carries information
Feature engineering included product-level embeddings (brand, category, material), user-level aggregates (historical size distribution, return rate), and interaction-level features (time since last purchase, event type).
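As a small sketch of the user-level aggregates (my simplification; the production features were richer), a user's history reduces to a size-position distribution and a return rate:

```python
from collections import Counter

def user_aggregates(history):
    """Illustrative user-level aggregates from purchase history.
    `history` is a list of (size_position, was_returned) pairs."""
    sizes = Counter(pos for pos, _ in history)
    total = len(history)
    return {
        # how often the user lands on each size position
        "size_distribution": {s: c / total for s, c in sizes.items()},
        # fraction of purchases sent back
        "return_rate": sum(1 for _, r in history if r) / total,
    }
```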
Results
The best model (SSP-Attention) outperformed SFNet by 45.7% on top-1 accuracy — the percentage of times the model’s first recommendation is the correct size.
In production deployment with AB testing and causal impact analysis:
- 7% reduction in return rates for men’s shoes
- 3.6% reduction in return rates for women’s shoes
- Both statistically significant and sustained over the measurement period
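For intuition on what "statistically significant" means at this scale (the counts below are invented, not Farfetch's; the actual analysis also used causal impact methods), a two-proportion z-test on control-vs-treatment return counts is the simplest check:

```python
from math import sqrt, erf

def two_proportion_z(returns_a, n_a, returns_b, n_b):
    """Two-sided two-proportion z-test: is the return-rate difference
    between control (a) and treatment (b) statistically significant?"""
    p_a, p_b = returns_a / n_a, returns_b / n_b
    pooled = (returns_a + returns_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # standard-normal two-sided p-value via the error function
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: a ~7% relative reduction is detectable at this volume.
z, p = two_proportion_z(returns_a=3000, n_a=20000, returns_b=2790, n_b=20000)
```

At marketplace order volumes, even single-digit relative reductions clear conventional significance thresholds, which is why the 3.6% figure can be as trustworthy as the 7% one.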
These numbers translate directly to reduced logistics costs, improved customer satisfaction scores, and higher retention.
Real-time serving
A size recommendation is useless if it arrives after the customer has already added an item to cart. We evaluated inference latency rigorously:
- SSP-LSTM: well under the 15ms production threshold
- SSP-Attention: slightly higher compute cost due to the attention mechanism, but still within real-time constraints through careful optimization (batch inference, model quantization, serving on Vespa.ai)
The system supported both batch predictions (pre-computing recommendations for known user-product pairs) and real-time inference (generating recommendations on-the-fly for new browsing sessions).
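The dual serving path can be sketched as a cache-then-fallback pattern (names and the in-memory cache are my stand-ins; production served from Vespa.ai): look up the pre-computed batch prediction, fall back to on-the-fly inference, and track latency against the budget.

```python
import time

LATENCY_BUDGET_MS = 15.0  # real-time threshold from the section above

def recommend(user_id, product_id, batch_cache, model_fn):
    """Serve a pre-computed prediction when one exists, else run the model.
    Returns (recommendation, latency_ms) so the budget can be monitored."""
    start = time.perf_counter()
    key = (user_id, product_id)
    if key in batch_cache:
        result = batch_cache[key]          # warm path: batch pre-compute
    else:
        result = model_fn(user_id, product_id)  # cold path: real-time inference
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms
```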
Links
- Paper: Tailor — Size Recommendations for High-End Fashion Marketplaces (arXiv) — Accepted at FashionXRecsys 2023, 17th ACM Conference on Recommender Systems
- Vespa.ai case study on Farfetch recommendations
Tailor Fit Advice
A parallel project I led solo from POC to production: predicting whether a product runs smaller or larger than expected, using Bayesian models on transactional data. This was a different framing — instead of predicting “buy size 42,” it predicted “this product runs small; consider sizing up.”
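One natural Bayesian framing of this (my illustration, not the exact production model): treat fit-related returns for a product as Beta-Binomial evidence that it runs small, with a prior that pulls low-volume products toward "true to size". The prior and threshold values below are arbitrary.

```python
def fit_advice(runs_small, runs_large, alpha=2.0, beta=2.0, threshold=0.65):
    """Posterior mean of P(runs small | fit-related returns) under a
    Beta(alpha, beta) prior; advise sizing up/down past a threshold."""
    n = runs_small + runs_large
    p_small = (runs_small + alpha) / (n + alpha + beta)
    if p_small > threshold:
        return "runs small; consider sizing up"
    if p_small < 1 - threshold:
        return "runs large; consider sizing down"
    return "true to size"
```

With only a handful of returns the posterior stays near 0.5 and the advice defaults to "true to size", which is the behavior you want before enough transactional evidence accumulates.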
Deployed on Databricks with Airflow-orchestrated pipelines, with continuous model updates from real-time feedback. The fit advice was surfaced directly on product pages, giving customers confidence before they add to cart.
Results: 7% return reduction for men’s shoes, 3.5% for women’s shoes — complementary to the size recommendation models.
What this taught me
Farfetch was where I learned what production ML actually demands. Three lessons that shaped everything I’ve built since:
Signal design matters more than model architecture. The jump from SFNet to SSP wasn’t primarily about LSTMs vs simpler models — it was about incorporating Add2Bag signals and return reasons. The 24.5% coverage increase came from better data, not a bigger model.
Latency is a feature constraint, not an afterthought. A model that’s 2% more accurate but 10x slower is worse in production. We spent as much time on serving optimization as on model development.
Offline metrics lie gently. The AB tests and causal impact analyses told a different (and more honest) story than offline accuracy alone. The 7% return reduction was real; the offline accuracy improvement was directional but needed production validation to be trusted.
Tech stack
Python · PyTorch · PySpark · SQL · Databricks · Apache Airflow · Google BigQuery · Vespa.ai · Docker · Terraform