Web Scraping with lynx + cURL
I wanted to grab data from a web page and process the results; in this case the result is a CSV file dump.
Install the utilities used here
sudo apt-get install curl lynx cargo
cargo install xsv
Set up xsv with Cargo (make sure Cargo's bin directory is on your PATH)
export PATH="$HOME/.cargo/bin:$PATH"
Save the contents of the web page to a local HTML file
curl -s https://support.spatialkey.com/spatialkey-sample-csv-data/ -o sample.html
Convert the HTML to plain text for extraction (lynx -dump appends a numbered list of the page's links)
lynx -dump sample.html > sample.txt
Extract the URLs from the page text. lynx prefixes each link with a reference number, and cut -c 7- strips that prefix to leave the bare URL
grep -e '\.csv$' sample.txt | cut -c 7- > urls.txt
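The cut -c 7- approach depends on lynx's exact column layout. A sketch of a more position-independent alternative, using grep -oE to print only the matching URL (the sample text below imitates lynx's numbered References section; the file names and URLs are hypothetical):

```shell
# Fake lynx -dump output: a numbered "References" list of links.
cat > demo.txt <<'EOF'
References

   1. https://example.com/data/sales.csv
   2. https://example.com/about.html
  10. https://example.com/data/realestate.csv
EOF

# -o prints only the matched text, so indentation and the reference
# number no longer matter; the pattern keeps lines ending in .csv.
grep -oE 'https?://[^[:space:]]+\.csv$' demo.txt > demo-urls.txt
cat demo-urls.txt
```

This prints just the two .csv URLs, regardless of how wide lynx pads the reference numbers.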
Take the first URL from the list and fetch it with cURL, saving the data to a CSV file
head -n 1 urls.txt | xargs curl -so realestate.csv
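head -n 1 grabs only the first link. If you wanted every CSV on the page, a simple loop works; shown here as a dry run (echo prints the command instead of executing it, so nothing is downloaded) over a hypothetical URL list:

```shell
# Hypothetical list of CSV links, one per line.
cat > all-urls.txt <<'EOF'
https://example.com/data/sales.csv
https://example.com/data/realestate.csv
EOF

# curl -O saves each file under the URL's basename; -s keeps it quiet.
# Drop the leading echo to actually download.
while read -r url; do
  echo curl -sO "$url"
done < all-urls.txt | tee fetch-commands.txt
```

Each line of output is the curl invocation that would run for that URL.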
View the extracted and formatted CSV
xsv table realestate.csv
Full script
curl -s https://support.spatialkey.com/spatialkey-sample-csv-data/ -o sample.html
lynx -dump sample.html > sample.txt
grep -e '\.csv$' sample.txt | cut -c 7- > urls.txt
head -n 1 urls.txt | xargs curl -so realestate.csv
xsv table realestate.csv
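If one step fails (a network hiccup, a page layout change), the later steps happily run on stale or empty files. A sketch of the same pipeline saved as a script with basic error handling, where set -euo pipefail aborts on the first failed command (the script is written to a file here rather than executed, since it hits the network):

```shell
# Write the hardened version of the pipeline to scrape.sh.
cat > scrape.sh <<'EOF'
#!/usr/bin/env bash
# Abort on any error, unset variable, or failure inside a pipe.
set -euo pipefail

page='https://support.spatialkey.com/spatialkey-sample-csv-data/'

curl -s "$page" -o sample.html
lynx -dump sample.html > sample.txt
grep -e '\.csv$' sample.txt | cut -c 7- > urls.txt
head -n 1 urls.txt | xargs curl -so realestate.csv
xsv table realestate.csv
EOF
chmod +x scrape.sh
```

Run it with ./scrape.sh; any failing step stops the script instead of cascading.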