When I inherited the MarTech stack during the LumApps consolidation, the task was ostensibly simple: document every place a Pardot form, tracking script, or embedded asset appeared across the web properties so the marketing team could plan their HubSpot transition. Standard audit work.
Standard tools didn't work. Screaming Frog couldn't get past the cookie consent wall. Sitebulb's JS rendering choked on the lazy-loaded Pardot injections. The WordPress admin inventory was out of date by thousands of pages. What I ended up building was a Playwright crawler that auto-accepted OneTrust, waited for Pardot's script to inject its forms, and walked the DOM for every tracking identifier. 3,489 pages later I had a complete inventory: 5,066 form instances resolving to 511 unique forms, mapped against 5 languages and ready for migration planning.
Why standard crawlers fail on MarTech audits
Most SEO crawlers treat the DOM as static. They fetch the HTML, parse it, extract links, move on. That works fine for 80% of audits. But MarTech content doesn't live in the initial HTML — it lives in what happens after:
- Consent walls. OneTrust, TrustArc, and Cookiebot inject their banners via JS after page load. Until the visitor clicks 'accept,' most analytics and MarTech scripts are blocked. A crawler that doesn't interact with the consent dialog never sees the Pardot forms you're trying to inventory.
- Lazy-loaded scripts. Pardot, HubSpot, and Marketo all defer their form injection to DOMContentLoaded or later. If the crawler doesn't wait, the forms aren't in the DOM yet.
- iframe embeds. Some MarTech forms are injected into same-origin iframes. A crawler that only queries the top document misses them.
- A/B tests and personalization. If Optimizely or VWO is running, different page variants show different forms. You need to observe the default state consistently.
Tools like Screaming Frog can technically execute JS, but their sandbox doesn't handle the consent interaction reliably, and you can't script complex wait conditions. The output is a partial picture that looks complete.
Playwright as the right-shaped hammer
Playwright gives you three things no traditional crawler does:
- Scriptable browser automation with real Chromium. The page behaves the same way a visitor's browser behaves.
- Precise wait conditions. page.waitForSelector and page.waitForResponse let you block until the thing you're auditing actually exists.
- Network interception. You can observe every request the page makes, including the ones you care about (Pardot, HubSpot, GA, GTM).
Which means you can write a crawler that says, in effect: for each URL in the sitemap, open the page, click through the cookie banner, wait for Pardot's form injection script to fire, and then record every form ID, iframe source, and tracking pixel present in the DOM.
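As a sketch of the network-interception side, here's what observing those requests can look like. classifyMarTechRequest is my own helper, and the hostname patterns are illustrative rather than exhaustive; extend them for your own stack.

```javascript
// Map a request URL to the MarTech platform it belongs to, or null.
// Hostname patterns are illustrative examples, not a complete list.
function classifyMarTechRequest(url) {
  const patterns = [
    ['pardot', /pardot\.com/],
    ['hubspot', /hsforms\.net|hs-scripts\.com|hubspot\.com/],
    ['ga', /google-analytics\.com|analytics\.google\.com/],
    ['gtm', /googletagmanager\.com/],
  ];
  for (const [platform, re] of patterns) {
    if (re.test(url)) return platform;
  }
  return null;
}

// Wiring into Playwright (assumes an existing page object and a hits array):
//   page.on('request', req => {
//     const platform = classifyMarTechRequest(req.url());
//     if (platform) hits.push({ page: page.url(), platform, url: req.url() });
//   });
```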
The script, minus the boring parts
The core loop launches Chromium with a real viewport and locale, iterates over the sitemap, dismisses the OneTrust banner by clicking #onetrust-accept-btn-handler, waits for window.pardot to be defined or for an iframe[src*='pardot'] to appear in the DOM, then extracts every Pardot iframe's src and id attributes along with tracking pixels from img[src*='pardot']. It also records the detected page language (document.documentElement.lang) for each result and stores the output as a JSON object keyed by URL.
In production you want concurrency (Playwright's browserContext plus a worker pool), retry logic for flaky pages, and rate limiting so you don't get banned from your own site. Five concurrent contexts with a 500ms delay between requests is my usual baseline.
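A minimal version of that pool (runPool is my name; a production version would also hand each lane its own browser context and retry failures instead of just recording them):

```javascript
// Run `worker` over `items` with bounded concurrency and a per-task delay.
async function runPool(items, worker, { concurrency = 5, delayMs = 500 } = {}) {
  const queue = [...items];
  const results = [];
  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  // Each lane pulls from the shared queue until it is empty.
  async function lane() {
    while (queue.length) {
      const item = queue.shift();
      try {
        results.push({ item, value: await worker(item) });
      } catch (err) {
        results.push({ item, error: String(err) });
      }
      await sleep(delayMs); // rate limit per lane
    }
  }
  await Promise.all(Array.from({ length: concurrency }, lane));
  return results;
}
```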
Turning the raw output into a migration plan
A JSON file with 5,066 form records is not a deliverable. Here's what I built on top:
- Dedupe by form ID. Pardot embed IDs are the source of truth, not URLs. 5,066 instances collapsed to 511 unique forms after dedup.
- Map form ID to marketing owner. Most Pardot forms are named after the campaign they belong to. Join to the Pardot form metadata (exportable via API) to get the owner email and campaign status.
- Flag orphans. Forms that appear in the DOM but don't exist in the Pardot admin console are dead embeds. Prime candidates for removal.
- Language breakdown. Group form IDs by detected page language. On LumApps this revealed that three of the five languages had incomplete form coverage — an editorial gap the marketing team didn't know existed.
The final deliverable was a single-table pivot: rows are form IDs, columns are languages, cells are page counts. One glance told the marketing team which forms needed translation, which were dead, and which were duplicated across regions.
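A sketch of the dedupe-and-pivot step over the crawler's per-URL JSON output. buildPivot and the record shape are my naming, and I key on the iframe id attribute, which assumes your embeds carry stable IDs.

```javascript
// Collapse per-page form instances into a formId x language pivot of page counts.
// `inventory` is the crawler output: { [url]: { lang, iframes: [{ src, id }] } }.
function buildPivot(inventory) {
  const pivot = {}; // formId -> { [lang]: pageCount }
  for (const page of Object.values(inventory)) {
    if (!page.iframes) continue; // skip errored pages
    // Dedupe within a page: the same form can be embedded twice on one URL.
    const formIds = new Set(page.iframes.map(f => f.id).filter(Boolean));
    for (const id of formIds) {
      pivot[id] = pivot[id] || {};
      pivot[id][page.lang] = (pivot[id][page.lang] || 0) + 1;
    }
  }
  return pivot;
}
```

From there, rows are Object.keys(pivot), columns are the union of languages, and an empty cell is exactly the coverage gap the marketing team cares about.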
Gotchas worth knowing
- Cookie consent geography. OneTrust's banner varies by visitor region. Set context.geolocation and context.permissions explicitly so the crawler sees a consistent experience across runs.
- Infinite scroll pages. Some campaign landing pages use infinite scroll. A page.evaluate that scrolls to document.body.scrollHeight followed by a timeout gets you the full DOM.
- Rate limiting. Even when you're crawling your own site, Cloudflare or your WAF may flag burst traffic; the five-context, 500ms-delay baseline is as much about staying unblocked as about politeness.
- Iframe lifecycle. Pardot forms are cross-origin iframes. You can't query their internal DOM, but you can read src and id attributes from the parent document, which is all you need for inventory.
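The infinite-scroll workaround from the list above can be sketched as a page.evaluate body; the 400ms settle delay is a guess to tune against your lazy-loader.

```javascript
// Runs inside the page via page.evaluate: scroll until the page stops growing.
// settleMs gives the lazy-loader time to append content between scrolls.
async function scrollToBottom(settleMs = 400) {
  let previous = -1;
  while (document.body.scrollHeight !== previous) {
    previous = document.body.scrollHeight;
    window.scrollTo(0, previous);
    await new Promise(resolve => setTimeout(resolve, settleMs));
  }
}

// In the crawler: await page.evaluate(scrollToBottom, 400);
```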
When to reach for this
If you're auditing more than 200 pages of MarTech content, a checklist-style crawler won't cut it. The specific ROI of Playwright is that it gives you the same JS execution environment a real visitor has, which means the data you collect matches what Google and your tracking scripts actually see. The same pattern works for HubSpot, Marketo, or any platform that injects forms via JS or iframes — you just swap the selectors.
For the scoring model that sits downstream of this audit — how I use GSC data to decide which of those 511 forms to migrate, rewrite, or kill — see How I Scored 436 Pages for a Domain Migration. If you need this done on your stack and don't want to build the crawler from scratch, the CMS and MarTech Audit package is exactly this methodology, productized.