Scrape ld+json with cURL + htmlq

Robin Michael

09 Jul 2024 • 1 min read

Google Vacation Rentals has proposed a linked data JSON (JSON-LD) format to set up properties for search indexing across its various platforms: search engine, maps, etc.

The details of the format are in Google's vacation rental structured data documentation.

That means web pages for vacation properties will embed data in this format, like so:

 <script type="application/ld+json">
      {
        "@context": "https://schema.org",
        "@type": "VacationRental",
        "additionalType": "HolidayVillageRental",
        "brand": {
          "@type": "Brand",
          "name": "brandIdName"
        },
        ....

Knowing this, you can easily scrape the details from a site, as long as you know the following:

The URL of the page
The tag to look for: <script type="application/ld+json">
The contents of the tag, which follow the Google format

Therefore, using cURL, htmlq, and jq together, you can write bash one-liners that extract public structured data from a web page, like so:

curl -s https://property_page_url.invalid | htmlq 'script[type="application/ld+json"]' --text | jq -r '.image'

Grab the web page contents with cURL (-s silences the progress output)
Pipe the output to htmlq
Extract the text content of the matching <script type="application/ld+json"> tag
Pipe that text to jq and extract the image field from the structured JSON data
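
The one-liner assumes the page has a single ld+json block whose top level is an object. As a hedged sketch of the same steps for pages that are less tidy, the function below (extract_images is a hypothetical name, not part of any tool) also accepts blocks that are arrays, keeps only VacationRental entries, and prints one image URL per line. It assumes htmlq and jq are installed and that image is either a URL string or a list of URL strings.

```shell
# Sketch only: reads HTML on stdin, prints VacationRental image URLs.
# extract_images is a hypothetical helper name; htmlq and jq must be on PATH.
extract_images() {
  htmlq --text 'script[type="application/ld+json"]' |
    jq -r '
      # A block may hold one object or an array of objects.
      (if type == "array" then .[] else . end)
      | select(type == "object" and .["@type"] == "VacationRental")
      | .image
      # image may be a single URL or a list of URLs.
      | if type == "array" then .[] else . end
      # Skip anything that is not a plain string (e.g. a missing field).
      | select(type == "string")
    '
}

# Usage, with the placeholder URL from above:
# curl -fsSL https://property_page_url.invalid | extract_images
```

Here curl's -f makes HTTP errors fail instead of feeding an error page into htmlq, and -L follows redirects. Pages that nest their data differently (for example under @graph) would need a different jq filter.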

Results

[
  "https://image_base_url.invalid/7f10ad979a4c4339b8fbd066ef4bc2b2",
  "https://image_base_url.invalid/77526a4214604591b201c7058533a420",
  ...
]
