Scrape ld+json with cURL + htmlq
Google Vacation Rentals has proposed a linked-data JSON format (JSON-LD) for setting up properties for search indexing across its platforms: search engine, maps, etc.
Web pages for vacation properties therefore embed data in this format inside a script tag. An excerpt is shown below.
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "VacationRental",
"additionalType": "HolidayVillageRental",
"brand": {
"@type": "Brand",
"name": "brandIdName"
},
....
Knowing this, you can scrape the structured details from a page as long as you have the following:
- URL
- Tag to look for: <script type="application/ld+json">
- Contents of the tag, as defined by the Google format
Using cURL, htmlq, and jq together, you can write a bash one-liner that extracts public structured data from a web page, like so:
curl -s https://property_page_url.invalid | htmlq 'script[type="application/ld+json"]' --text | jq -r '.image'
- Grab the web page contents with cURL
- Pipe the output to htmlq
- Extract the content of the matching <script type="application/ld+json"> tag
- Extract the image field from the structured JSON data
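The steps above can be wrapped in a pair of small bash functions so the same pipeline can be reused for other fields. This is only a sketch: the names ldjson_field and scrape_ldjson are hypothetical helpers made up for illustration, and it assumes curl, htmlq, and jq are installed and that each ld+json block on the page is a JSON object.

```shell
# Hypothetical helper (not part of curl, htmlq, or jq).
# Reads HTML on stdin and applies a jq filter to every ld+json block found.
# $1 is a jq filter such as '.image' or '.name'.
ldjson_field() {
  local filter="$1"
  htmlq 'script[type="application/ld+json"]' --text | jq -r "$filter"
}

# Hypothetical helper: fetch a page, then extract a field from it.
# $1 is the page URL, $2 is the jq filter.
scrape_ldjson() {
  local url="$1" filter="$2"
  curl -s "$url" | ldjson_field "$filter"
}

# Example call (placeholder URL from the one-liner above):
#   scrape_ldjson https://property_page_url.invalid '.image'
```

Splitting the fetch from the extraction means ldjson_field can also be run against a saved HTML file, which is handy when experimenting with jq filters without refetching the page.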
Results
[
"https://image_base_url.invalid/7f10ad979a4c4339b8fbd066ef4bc2b2",
"https://image_base_url.invalid/77526a4214604591b201c7058533a420",
...
]
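Because image is an array, the result prints as a JSON list. If one URL per line is more convenient, for example to feed into another command, jq's array iterator can flatten it. A minimal sketch, assuming the ld+json text has already been extracted by htmlq; image_urls is a hypothetical helper name.

```shell
# Hypothetical helper: read ld+json text on stdin and print each image URL
# on its own line. The trailing `?` makes jq skip blocks where .image is
# missing or not iterable instead of raising an error.
image_urls() {
  jq -r '.image[]?'
}

# Example call (placeholder URL from the one-liner above):
#   curl -s https://property_page_url.invalid \
#     | htmlq 'script[type="application/ld+json"]' --text \
#     | image_urls
```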