HtmlAgilityPack

Robin Michael

21 Dec 2018 • 1 min read

Sometimes you need to parse raw HTML (for example, scraping a site because there is no public API). In Python the excellent BeautifulSoup library handles all of this nicely, but in the .NET world I have found that HtmlAgilityPack is a good substitute.

In this case, an automatically generated web page produces an HTML listing of links from which data can be downloaded, for example:

<html>
    <head>
        <title>website.url - /Reports/Current/Latest/</title>
    </head>
    <body>
        <H1>website.url - /Reports/Current/Latest/</H1>
        <hr>
        <pre>
            <A HREF="/Reports/Current/">[To Parent Directory]</A>
            <br>
            <br> Wednesday, November 14, 2018  4:11 AM        &lt;dir&gt; 
            <A HREF="/Reports/Current/Latest/DUPLICATE/">DUPLICATE</A>
            <br> Wednesday, November 22, 2017  4:11 AM      3241230 
...

In particular, I need to get the link to a file, which is listed like this:

<A HREF="/Reports/Current/Latest/FileName.zip">FileName.zip</A>

So we need to create an HtmlWeb and load the page contents so they can be parsed:

var html = await new HtmlWeb().LoadFromWebAsync(url);

Now the HTML document is loaded and ready to be parsed

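// Note: SelectNodes returns null (not an empty collection) when nothing matches.
var hrefs = html.DocumentNode.SelectNodes("//a");
var details = hrefs.Select(z =>
{
	...
	return new
	{
		...
		// GetAttributeValue falls back to "" if an anchor has no href attribute.
		Url = $"{baseUrl}{z.GetAttributeValue("href", "")}"
	};
});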

The code above finds all the links by selecting every anchor in the document, then parses each anchor node to extract the link from its href attribute as well as its text node. The text gives us the file name, which encodes the creation timestamp, as we are unable to trust the timestamps within the files themselves.
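To make the steps concrete, here is a minimal sketch that pulls them together. It is not the exact code I run: `ListingParser` and `ExtractZipLinks` are hypothetical names made up for illustration, the `.zip` filter is an assumption based on the example link above, and it parses an HTML string with `LoadHtml` rather than fetching a page so it can be tried without a network call. It also builds the absolute URL with `System.Uri` instead of the string interpolation shown earlier, which is a swapped-in choice rather than a requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ListingParser
{
    // Hypothetical helper: returns (file name, absolute URL) pairs for every
    // anchor in the page whose href ends in ".zip".
    public static List<(string Name, string Url)> ExtractZipLinks(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<(string Name, string Url)>();
        }

        var root = new Uri(baseUrl);
        return anchors
            .Select(a => new
            {
                // The anchor text is the file name shown in the listing.
                Name = HtmlEntity.DeEntitize(a.InnerText).Trim(),
                Href = a.GetAttributeValue("href", "")
            })
            // Assumption: only .zip links are of interest.
            .Where(x => x.Href.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            // Resolve the site-relative href against the base URL.
            .Select(x => (Name: x.Name, Url: new Uri(root, x.Href).AbsoluteUri))
            .ToList();
    }
}
```

Calling `ListingParser.ExtractZipLinks(pageHtml, "https://website.url")` on a listing like the one above would skip the parent-directory and sub-directory links and keep only the file entries (the `https` scheme here is assumed). From there, the `Name` of each entry can be parsed for whatever timestamp format your file names use.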
