Scrape ld+json with cURL + htmlq
Google Vacation Rentals has proposed a linked data JSON format (JSON-LD) to set up properties for search indexing across Google's platforms, such as Search and Maps.
The details of the format are documented here:
Vacation Rental Schema Markup | Google Search Central | Documentation | Google for Developers
Vacation listing structured data can help people find your vacation listings on Search. Learn about how to add markup and which fields to add, such as location, images, and the rating of your vacation property.
That means web pages for vacation properties embed data in this format, like so:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "VacationRental",
"additionalType": "HolidayVillageRental",
"brand": {
"@type": "Brand",
"name": "brandIdName"
},
....
Knowing this, you can scrape the property details from a page once you have the following:
- URL
- Tag to look for: <script type="application/ld+json">
- Contents of the tag, in the Google format
Therefore, using cURL, htmlq, and jq together, you can write a bash one-liner that extracts the public structured data from a web page, like so:
curl -s https://property_page_url.invalid | htmlq 'script[type="application/ld+json"]' --text | jq -r '.image'
- Grab the webpage contents with cURL
- Pipe the output to htmlq
- Extract the content of the matching <script type="application/ld+json"> tag
- Extract the image field from the structured JSON data with jq
Results
[
"https://image_base_url.invalid/7f10ad979a4c4339b8fbd066ef4bc2b2",
"https://image_base_url.invalid/77526a4214604591b201c7058533a420",
...
]
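If you run this against more than one page, the one-liner can be wrapped in a small function. The following is only a sketch under a few assumptions: the name ldjson_field and its two arguments are made up for illustration, and it assumes a page may carry several ld+json blocks, that a block may be a single object or an array of objects, and that a field such as image may be a single value or a list. It prints one value per line instead of a JSON array.

```shell
# Hypothetical helper (not part of any tool): print one value per line for a
# top-level field found in the ld+json blocks of a page.
# Usage: ldjson_field <url> [field]   (field defaults to "image")
ldjson_field() {
  local url="$1" field="${2:-image}"

  # -f fails on HTTP errors, -sS stays quiet but still reports errors,
  # -L follows redirects.
  curl -fsSL "$url" \
    | htmlq 'script[type="application/ld+json"]' --text \
    | jq -r --arg field "$field" '
        # A block may hold one object or an array of objects.
        (if type == "array" then .[] else . end)
        | objects
        # Skip blocks that do not carry the requested field.
        | .[$field] // empty
        # The field may be a single value or a list of values.
        | if type == "array" then .[] else . end
      '
}
```

For example, ldjson_field https://property_page_url.invalid image should list the image URLs line by line, which is easier to feed into other commands than the array form. Nested structures such as an @graph wrapper are not handled here and would need their own jq path.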