Scrape ld+json with cURL + htmlq

Google Vacation Rentals has proposed a linked-data JSON (JSON-LD) format for setting up properties for search indexing across its platforms, such as Search and Maps.

The details of the format are documented here:

Vacation Rental Schema Markup | Google Search Central | Documentation | Google for Developers
Vacation listing structured data can help people find your vacation listings on Search. Learn about how to add markup and which fields to add, such as location, images, and the rating of your vacation property.

That means web pages for vacation properties will embed data in this format, like so:

 <script type="application/ld+json">
      {
        "@context": "https://schema.org",
        "@type": "VacationRental",
        "additionalType": "HolidayVillageRental",
        "brand": {
          "@type": "Brand",
          "name": "brandIdName"
        },
        ....

Knowing this, you can scrape a property's details from its page with just three pieces of information:

  • The URL of the page
  • Tag to look for <script type="application/ld+json">
  • Contents of the tag from the Google format
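
Before extracting anything, it can help to confirm that a page actually carries the tag. The following is a minimal sketch that uses only grep; count_ld_json_tags is a hypothetical helper name, and the pattern assumes the type attribute is written with double quotes, as in the sample above.

```shell
#!/usr/bin/env bash
# Sketch only: count_ld_json_tags is a hypothetical helper name.
# Reads HTML on stdin and prints how many opening
# <script type="application/ld+json"> tags it contains.
# Assumption: the type attribute uses double quotes.
count_ld_json_tags() {
  # -o prints each match on its own line, -i ignores case;
  # tr strips the padding some wc implementations add.
  grep -o -i '<script[^>]*type="application/ld+json"[^>]*>' | wc -l | tr -d ' '
}

# Usage (placeholder URL from this article):
# curl -s https://property_page_url.invalid | count_ld_json_tags
```

A count of 0 suggests the page does not embed structured data in this form, or that it injects it with JavaScript after load, which cURL will not see.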

Therefore, using cURL, htmlq, and jq together, you can write Bash one-liners that extract public structured data from a web page, like so:

curl -s https://property_page_url.invalid | htmlq 'script[type="application/ld+json"]' --text | jq -r '.image'
  • Grab the webpage contents with cURL
  • Pipe the output to htmlq
  • Extract the matching script content of <script type="application/ld+json">
  • Extract the image from the structured JSON data

Results

[
  "https://image_base_url.invalid/7f10ad979a4c4339b8fbd066ef4bc2b2",
  "https://image_base_url.invalid/77526a4214604591b201c7058533a420",
  ...
]
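
The one-liner above assumes the page has a single ld+json block whose top level is the rental object. Pages may carry several blocks, so a slightly more defensive variant can be sketched as below. It assumes htmlq and jq are installed; extract_rental_images is a hypothetical helper name, and the handling of array-valued blocks and single-string image fields is an assumption about what a page might contain, not something the Google format guarantees.

```shell
#!/usr/bin/env bash
# Sketch only: extract_rental_images is a hypothetical helper,
# not part of htmlq or jq.
# Reads HTML on stdin and prints one image URL per line.
extract_rental_images() {
  htmlq --text 'script[type="application/ld+json"]' |
    jq -r '
      # A block may hold a single object or an array of objects.
      (if type == "array" then .[] else . end)
      # Keep only vacation rental objects.
      | select(type == "object" and .["@type"] == "VacationRental")
      | .image
      # The image field may be one URL or a list of URLs.
      | (if type == "array" then .[] else . end)
      | select(type == "string")
    '
}

# Usage (placeholder URL from this article):
# curl -s https://property_page_url.invalid | extract_rental_images
```

Filtering on "@type" keeps unrelated blocks on the same page out of the results. If any block contains invalid JSON, jq will stop with a parse error, so this is a starting point rather than a hardened scraper.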