Playwright Scraping a Website

Previously I documented using Puppeteer to scrape a website, but that approach doesn't work well with Pipedream: it times out unexpectedly because of Pipedream's execution limits.

Therefore, a better solution I came up with is a Dockerized FastAPI service that uses Playwright to do the same thing. Packaging it as a Docker image means I can push it out to DigitalOcean as a serverless app (following that example, but choosing the appropriate Docker registry) for a fixed low price of $5 per month.

Requirements

  • Send a URL with the appropriate query parameters (see the FastAPI sketch after this list)
  • The target URL's page contains secured iframes that hold the content and requests/responses I need
  • Either find the appropriate iframe or check all the responses that come back on the target page (I picked the latter here)
  • Respond with the iframe's response (luckily it is already JSON, so pass it along)
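
For context, here is a minimal sketch of how the FastAPI wrapper around the scraper might look. The route name, query parameter, and 404 behavior are my assumptions rather than anything prescribed; scrape_price_url is defined in the complete code below.

from fastapi import FastAPI, HTTPException

app = FastAPI()


@app.get("/scrape")  # hypothetical route; adjust to taste
async def scrape(target_url: str):
    # scrape_price_url is defined in the complete code below
    result = await scrape_price_url(target_url)
    if result is None:
        # Nothing matched within the cache's TTL window
        raise HTTPException(status_code=404, detail="No matching response captured")
    return result  # already JSON, so FastAPI can serialize it directly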

Steps (all async)

  • Keep a global, expiring cache to hold captured responses
  • Create a unique identifier for this target URL
  • Attach the unique identifier to all requests via a known, honored extra request header ('X-Forwarded-Host' here)
  • Create the Playwright instance
  • Create a browser context that sends the identifier header with every request
  • Set up the page response interceptor
  • Wait for all requests and responses to reach a completed state (networkidle)
  • Close the Playwright instance
  • Return the cached response value (can be None)

Complete Code

import uuid

import cachetools
from icecream import ic
from playwright.async_api import async_playwright

# Expiring cache: captured responses only need to live long enough
# for the caller to pick them up.
response_cache = cachetools.TTLCache(maxsize=32, ttl=30)
ic.configureOutput(prefix='|> ')

# A commonly honored extra request header, used to stamp every request
# made by the browser context with our unique id.
header_identifier = 'X-Forwarded-Host'


async def intercept_response(current_response):
    global response_cache
    target_request = current_response.request
    # Two filters: only GET requests whose URL starts with the known prefix
    if target_request.method == "GET" and target_request.url.startswith(
            "https://<request within iframe to look for>"):
        # Read back the id we stamped on the request, then cache the JSON body
        request_id = ic(await target_request.header_value(header_identifier))
        response_cache[request_id] = ic(await current_response.json())
    return current_response


async def scrape_price_url(target_url: str):
    global response_cache
    request_id = ic(str(uuid.uuid4()))
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        # Every request from this context carries our id header
        context = await browser.new_context(extra_http_headers={header_identifier: request_id})
        page = await context.new_page()
        page.on("response", intercept_response)
        await page.goto(target_url)
        await page.wait_for_load_state("networkidle")
        await browser.close()
    # None if the interceptor never saw a matching response
    return response_cache.get(ic(request_id))
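
To try this outside of FastAPI, here is a quick way to run it from a script (the URL is just a stand-in):

import asyncio

if __name__ == "__main__":
    # Stand-in URL; replace with the real target page
    result = asyncio.run(scrape_price_url("https://example.com/target-page"))
    print(result)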

Breakdown: Page Interceptor

async def intercept_response(current_response):
    global response_cache
    target_request = current_response.request
    # Two filters: only GET requests whose URL starts with the known prefix
    if target_request.method == "GET" and target_request.url.startswith(
            "https://<request within iframe to look for>"):
        # Read back the id we stamped on the request, then cache the JSON body
        request_id = ic(await target_request.header_value(header_identifier))
        response_cache[request_id] = ic(await current_response.json())
    return current_response

The page interceptor is middleware and has to operate within its own pipeline of actions, which is why I'm using the global cache keyed by the unique request id to hand results back to the caller. Two filters are applied to each response: I'm only looking for GET requests whose URL starts with a particular known prefix.
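
If the target page is flaky, a slightly more defensive variant of the interceptor can help. This is just a sketch on top of the code above: it guards against the header somehow being stripped and against a body that isn't valid JSON, both of which are my assumptions about what could go wrong rather than problems I hit.

async def intercept_response(current_response):
    global response_cache
    target_request = current_response.request
    if target_request.method == "GET" and target_request.url.startswith(
            "https://<request within iframe to look for>"):
        # header_value returns None if the header is absent
        request_id = await target_request.header_value(header_identifier)
        if request_id is not None:
            try:
                response_cache[request_id] = await current_response.json()
            except Exception:
                # Body wasn't valid JSON; skip it rather than crash the handler
                ic("non-JSON body for", target_request.url)
    return current_response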

Breakdown: Playwright Scraper

async def scrape_price_url(target_url: str):
    global response_cache
    request_id = ic(str(uuid.uuid4()))
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        # Every request from this context carries our id header
        context = await browser.new_context(extra_http_headers={header_identifier: request_id})
        page = await context.new_page()
        page.on("response", intercept_response)
        await page.goto(target_url)
        await page.wait_for_load_state("networkidle")
        await browser.close()
    # None if the interceptor never saw a matching response
    return response_cache.get(ic(request_id))

Here we generate the id, attach it to every request via the browser context's extra headers, hook up the response interceptor, navigate to the page, wait for the network to go idle, and finally return whatever the interceptor cached (if anything).
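
One practical wrinkle worth planning for: networkidle can time out on chatty pages that never fully settle. Here is a sketch of how I'd hedge against that using Playwright's TimeoutError; the 15-second cap is an arbitrary choice on my part, not something the original code uses.

from playwright.async_api import TimeoutError as PlaywrightTimeoutError


async def scrape_price_url(target_url: str):
    global response_cache
    request_id = ic(str(uuid.uuid4()))
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        context = await browser.new_context(extra_http_headers={header_identifier: request_id})
        page = await context.new_page()
        page.on("response", intercept_response)
        try:
            await page.goto(target_url)
            # Chatty pages may never go fully idle; cap the wait
            await page.wait_for_load_state("networkidle", timeout=15_000)
        except PlaywrightTimeoutError:
            # The matching response may already be cached, so don't bail out
            pass
        finally:
            await browser.close()
    return response_cache.get(ic(request_id))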

It took some thinking and time to get this working right, so I hope it helps when you need to scrape iframes on a target URL. I would recommend this approach over parsing the iframes directly, as that approach ultimately led to frustration and inconsistent results.