Web Scraping & Data Pipelines · 2025

Resilient Web Content Extractor

Pulls clean markdown from JS-heavy pages, via CLI or HTTP.

Node.js · Playwright · CLI · HTTP server

The problem

Existing extraction tools choked on modern, client-rendered sites, returning empty shells or broken markup, which made downstream content processing unreliable.

What I built

I built the extractor on a headless browser (Playwright) that fully renders each page before extraction, then converts the rendered HTML to clean markdown. The same core is wrapped in a CLI for one-off pulls and an HTTP server for programmatic use, so it slots into both scripts and larger pipelines.
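
The render-then-convert core could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the names `renderPage`, `toMarkdown`, and `extract` are hypothetical, waiting for network idle is only one possible signal that a client-rendered page has settled, and the regex converter is a deliberately naive stand-in for a proper HTML-to-markdown step.

```javascript
// Hypothetical sketch: function names and the wait strategy are assumptions,
// not taken from the project itself.

// Render a URL in headless Chromium and return the final DOM as HTML.
// Playwright is loaded lazily so the converter can be used without it.
async function renderPage(url) {
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    // Assumption: treat "no network activity" as "client-side rendering is done".
    await page.goto(url, { waitUntil: 'networkidle' });
    return await page.content();
  } finally {
    // Always release the browser, even if navigation throws.
    await browser.close();
  }
}

// Naive HTML-to-markdown pass covering headings, links, and paragraphs only.
function toMarkdown(html) {
  return html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, '')
    // <h1>..<h6> become '#'-prefixed lines.
    .replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_, level, text) => `\n\n${'#'.repeat(Number(level))} ${text.trim()}\n\n`)
    // <a href="...">text</a> becomes [text](href).
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, '[$2]($1)')
    // Closing paragraph tags become blank lines.
    .replace(/<\/p>/gi, '\n\n')
    // Strip every remaining tag, then tidy whitespace.
    .replace(/<[^>]+>/g, '')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

// Shared core: URL in, markdown out. The renderer is injectable so the
// pipeline can be exercised without launching a browser.
async function extract(url, render = renderPage) {
  return toMarkdown(await render(url));
}
```

In a layout like this, a CLI entry point and an HTTP handler would both call the same `extract(url)`, which keeps the two interfaces consistent. Injecting the renderer also lets the conversion step be tested against fixed HTML.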

The outcome

The extractor gave every downstream system a single dependable way to turn a URL into usable, structured text.