Scraping the Unscrapable: High-Performance Government Tender Extraction in Rust
The Challenge
Clients needed structured, up-to-date tender data from Etimad Tenders (Saudi Arabia) and Bury Council (UK). Both portals were fully client-side rendered, protected against automated access, and prone to changing their DOM structure and pagination logic without notice. Speed mattered too: scraping had to be fast enough for daily business use.
Key Constraints
- Portals were fully client-side rendered, with no static HTML to parse.
- Aggressive bot protection on government portals.
- DOM structure and pagination changed frequently without notice.
- Data needed to be production-ready in CSV/JSON for downstream business use.
Why Rust? Most scraping is done in Python. Rust was chosen deliberately for its memory safety, zero-cost abstractions, and native async and multithreading support, which are hard to match in Python at scale, where the global interpreter lock limits CPU-bound multithreading.
Why Chrome DevTools Protocol? CDP gives direct programmatic control over a real Chromium browser instance — meaning JavaScript executes fully, dynamic content loads, and the scraper behaves like a real user. This bypassed the rendering limitations of lightweight HTTP scrapers.
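To make the CDP idea concrete: each browser command is a JSON message carrying an `id`, a `method`, and `params`, sent over a WebSocket to the browser's debugging endpoint. The sketch below only builds two such messages (`Page.navigate` and `Runtime.evaluate`) with the standard library. The helper names are hypothetical, the WebSocket transport is left out, and this is an illustration rather than the project's actual code.

```rust
/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a CDP `Page.navigate` command (hypothetical helper).
fn navigate_command(id: u64, url: &str) -> String {
    format!(
        r#"{{"id":{},"method":"Page.navigate","params":{{"url":"{}"}}}}"#,
        id,
        json_escape(url)
    )
}

/// Build a CDP `Runtime.evaluate` command that runs JavaScript in the page
/// and asks for the result by value (hypothetical helper).
fn evaluate_command(id: u64, expression: &str) -> String {
    format!(
        r#"{{"id":{},"method":"Runtime.evaluate","params":{{"expression":"{}","returnByValue":true}}}}"#,
        id,
        json_escape(expression)
    )
}
```

For example, `navigate_command(1, "https://example.com")` produces `{"id":1,"method":"Page.navigate","params":{"url":"https://example.com"}}`. The browser replies with a message carrying the same `id`, which is how responses are matched to requests.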
What Was Built
- A CDP-based browser automation engine in Rust that controlled headless Chromium
- Intelligent pagination handling that adapted to DOM structure changes
- Async + multithreaded pipelines that processed multiple pages simultaneously
- Extractors for titles, deadlines, and attached documents across both portals
- Export pipelines delivering clean CSV and JSON datasets ready for downstream use
- Similar scrapers for platforms like LinkedIn, TripAdvisor, and Instagram
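One common way to make pagination tolerant of DOM changes is to keep an ordered list of candidate selectors and use the first one that still matches. The sketch below illustrates that idea under stated assumptions: the `Page` trait, `StaticPage` stand-in, and the selectors in the usage note are all hypothetical, and a real implementation would answer `count_matches` by querying the live page over CDP. It is not taken from the project itself.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over a rendered page.
trait Page {
    /// Number of elements currently matching `selector`.
    fn count_matches(&self, selector: &str) -> usize;
}

/// Return the first candidate selector that matches at least one element,
/// or `None` if the page layout no longer matches any known variant.
fn resolve_selector<'a>(page: &dyn Page, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|sel| page.count_matches(sel) > 0)
}

/// In-memory stand-in for a page, used only to exercise the logic.
struct StaticPage {
    matches: HashMap<String, usize>,
}

impl Page for StaticPage {
    fn count_matches(&self, selector: &str) -> usize {
        self.matches.get(selector).copied().unwrap_or(0)
    }
}
```

Called with candidates such as `["button.next-page", "a[rel='next']"]`, the function returns whichever appears first in the list and is present on the page. A `None` result is a useful signal that the portal has changed again and the candidate list needs updating.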
Engineering Tradeoffs
- Rust (pros): Memory safety, zero-cost abstractions, native async/multithreading; ~50% faster scraping.
- Rust (cons): Steeper learning curve and longer initial development time compared to Python.
- CDP (pros): Full JavaScript execution, behaves like a real user, bypasses bot protection effectively.
- CDP (cons): Heavier resource usage; requires running a full Chromium instance.
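The multithreading benefit can be illustrated with a small worker pool in which several threads pull page numbers from a shared queue and send results back over a channel. This is a simplified sketch that uses only standard-library threads, not an async runtime, and `scrape_pages` and its `fetch` parameter are hypothetical names. In a real scraper, `fetch` would drive a browser tab, which is assumed here rather than shown.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Process `pages` on `workers` threads and return results sorted by page.
fn scrape_pages<F>(pages: Vec<u32>, workers: usize, fetch: F) -> Vec<(u32, String)>
where
    F: Fn(u32) -> String + Send + Sync + 'static,
{
    // Shared work queue: each worker claims the next unprocessed page.
    let queue = Arc::new(Mutex::new(pages.into_iter()));
    let fetch = Arc::new(fetch);
    let (tx, rx) = mpsc::channel();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let fetch = Arc::clone(&fetch);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one page number.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(page) => tx.send((page, (*fetch)(page))).unwrap(),
                    None => break,
                }
            })
        })
        .collect();

    // Drop the original sender so the receiver ends once workers finish.
    drop(tx);

    let mut results: Vec<(u32, String)> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Workers finish in arbitrary order, so restore page order.
    results.sort_by_key(|(page, _)| *page);
    results
}
```

Because results arrive in completion order, the final sort is what makes the output deterministic. The worker count would typically be tuned against how many browser tabs the machine can sustain.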
Impact & Outcome
Scraping speed improved by ~50% over baseline using async + multithreaded pipelines.
Successfully extracted tender data from two heavily protected government portals across two countries.
Delivered structured CSV and JSON datasets ready for immediate downstream business use.
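As an illustration of the export step, the sketch below serializes tender records to CSV using only the standard library. The `Tender` struct, its field names, the column order, and the `;` separator for attached documents are all assumptions made for the example. Production code would more likely rely on an established crate such as `csv` or `serde_json`.

```rust
/// Hypothetical record shape based on the extracted fields
/// (title, deadline, attached documents).
struct Tender {
    title: String,
    deadline: String,
    documents: Vec<String>,
}

/// Quote a CSV field if it contains a comma, quote, or line break,
/// doubling any embedded quotes.
fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Render tenders as CSV text with a header row.
fn to_csv(tenders: &[Tender]) -> String {
    let mut out = String::from("title,deadline,documents\n");
    for tender in tenders {
        // Assumption: multiple documents are joined into one cell with ';'.
        let docs = tender.documents.join(";");
        out.push_str(&format!(
            "{},{},{}\n",
            csv_field(&tender.title),
            csv_field(&tender.deadline),
            csv_field(&docs)
        ));
    }
    out
}
```

Quoting matters here because tender titles often contain commas, and an unquoted comma would shift every later column in that row.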