A 101 on how to use Screaming Frog SEO Spider + XPath to find all article outlinks on a section page


… without the menu, sidebar and footer links

Let's say you want to know which articles about Donald Trump are listed here:

https://www.blick.ch/dossiers/donald-trump/ and on all paginated pages like https://www.blick.ch/dossiers/donald-trump/page2/ …

You are only looking for the links to the articles in the list, not for the menu, sidebar or footer links.

Step 1: Find the related elements in the source code

Open the page in Google Chrome and inspect the code:

Use the inspect tool and hover over the element you want to inspect.

I want to check the links, so I hover over them with the tool. The corresponding HTML is highlighted in the Elements panel.

Now I'm looking for the link that belongs to the highlighted article. It must be somewhere nearby in the HTML (red arrow). If you can't find an a-tag, it is probably nested deeper. Expand the elements with the little arrows (marked with the blue arrow).

If you select a piece of the HTML code, the inspect tool colors the related part of the page:

Step 2: Get the XPath

More about XPath

XPath – Wikipedia (en.wikipedia.org)
The Complete Guide to Screaming Frog Custom Extraction with XPath & Regex (uproer.com): In this guide, I'll show you how to use Screaming Frog's Custom Extraction feature to scrape schema markup, HTML…

You can try copying the XPath straight from Chrome's inspect tool (right-click the element > Copy > Copy XPath).

In this case it's

//*[@id="content"]/div/main/div/div[4]/div[3]/a

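For me, having [4] or [any other number] in the XPath is most of the time an indicator that it is not useful: a positional path only fits this exact page layout and breaks as soon as an element is added or removed.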
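To illustrate why positional steps are fragile, here is a minimal sketch using Python's standard library. The markup, the class name `teaser` and the URLs are made up for the example, and `xml.etree.ElementTree` only supports a subset of XPath, so the href is read with `.get("href")` instead of `/@href`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup of a section page.
BEFORE = """
<main>
  <div class="teaser"><a href="/article-1">Article 1</a></div>
  <div class="teaser"><a href="/article-2">Article 2</a></div>
</main>
""".strip()

# The same page after a banner was added in front of the teasers.
AFTER = """
<main>
  <div class="banner"><a href="/ad">Ad</a></div>
  <div class="teaser"><a href="/article-1">Article 1</a></div>
  <div class="teaser"><a href="/article-2">Article 2</a></div>
</main>
""".strip()


def hrefs(markup, path):
    """Return the href of every element matched by `path`."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(path)]


POSITIONAL = "./div[1]/a"                # "the link in the first div"
BY_CLASS = ".//div[@class='teaser']/a"   # "the link in every teaser div"

positional_before = hrefs(BEFORE, POSITIONAL)
positional_after = hrefs(AFTER, POSITIONAL)
by_class_before = hrefs(BEFORE, BY_CLASS)
by_class_after = hrefs(AFTER, BY_CLASS)

print(positional_before, positional_after)
print(by_class_before, by_class_after)
```

After the layout change the positional path returns the banner link, while the class-based path still returns both article links.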

So let's create the XPath manually…

If you are looking for LinkURL in

<a href="LinkURL">Link Text</a>

The XPath is

//a/@href

It selects all a-tags and returns the value of their href attribute.

But we are looking for specific links, not all of them.

class="clickable"

could be an indicator to identify the links in the list.

The XPath to address this is

//a[@class="clickable"]/@href

which is looking for all a-tags with

class="clickable"

Another option could be to use the parent div-tag (yellow arrow) with

class="layout-item"

and then the child a-tag (blue arrow). You can use contains() if you don't want to match the full list of class names on that div:

//div[contains(@class,'layout-item')]/a/@href

So, summed up:

div[contains(@class,'layout-item')]

is getting the div with a class attribute containing "layout-item"

a

is getting the child a-tag

@href

is getting the content of the href attribute.

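The XPath starts with // (2 slashes), which matches an element at any depth in the document, and separates the following steps with / (1 slash), which goes exactly one level down in the hierarchy.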
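The difference between the two separators can be tried out with a small sketch in Python's standard library. The markup and URLs are invented for the example; in `xml.etree.ElementTree` the paths are written relative to the current element, so `./a` stands for the single-slash step and `.//a` for the double-slash step.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one link directly inside the div, one nested deeper.
SAMPLE = """
<div class="layout-item">
  <a href="/article-1">Article 1</a>
  <p><a href="/nested">Nested link</a></p>
</div>
""".strip()

item = ET.fromstring(SAMPLE)

# "/" step: only <a> elements that are direct children of the div.
direct_children = [a.get("href") for a in item.findall("./a")]

# "//" step: <a> elements at any depth below the div.
any_depth = [a.get("href") for a in item.findall(".//a")]

print(direct_children)
print(any_depth)
```

This is why div[...]/a only picks up links that sit directly inside the layout-item div.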

Step 3: XPath in Screaming Frog SEO Spider

Go to

Configuration > Custom > Extraction

and add the two XPath expressions from Step 2, e.g. like this:

Now run the crawl with an include filter that only matches the needed folder and all its paginated pages.

The .* at the end of the include pattern is the placeholder for anything that follows the folder, for example:

https://www.blick.ch/dossiers/donald-trump/page2 
https://www.blick.ch/dossiers/donald-trump/page3
https://www.blick.ch/dossiers/donald-trump/gugus
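If you want to check an include pattern before starting the crawl, a quick sketch like the following can help. The pattern itself is an assumption (the dossier folder followed by .*), the sport URL is a made-up counterexample, and `re.fullmatch` is used on the assumption that the include filter has to match the whole URL; Python's regex flavour may also differ in details from the one Screaming Frog uses.

```python
import re

# Assumed include pattern: the dossier folder followed by ".*".
INCLUDE_PATTERN = "https://www.blick.ch/dossiers/donald-trump/.*"

urls = [
    "https://www.blick.ch/dossiers/donald-trump/",
    "https://www.blick.ch/dossiers/donald-trump/page2",
    "https://www.blick.ch/dossiers/donald-trump/page3",
    "https://www.blick.ch/dossiers/donald-trump/gugus",
    "https://www.blick.ch/sport/",  # hypothetical URL outside the folder
]

# Keep only URLs that the pattern matches from start to end.
included = [url for url in urls if re.fullmatch(INCLUDE_PATTERN, url)]
excluded = [url for url in urls if url not in included]

for url in included:
    print("crawl:", url)
for url in excluded:
    print("skip: ", url)
```

Because .* also matches an empty string, the start page of the dossier is included as well.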

Now check the results in the Custom tab in Screaming Frog SEO Spider.

The second attempt, the a-tag with class clickable, seems to be wrong.

It lists menu items too:

The attempt with the layout-item div and its child a-tag looks good.

So this is the XPath to work with:

//div[contains(@class,'layout-item')]/a/@href
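The difference between the two attempts can be reproduced on a tiny, made-up page. This is only a sketch: the markup, the extra class names and the URLs are assumptions, and because `xml.etree.ElementTree` supports neither contains() nor /@href, the second expression is emulated with a substring check in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: menu links and article links share the class "clickable".
PAGE = """
<body>
  <nav>
    <a class="clickable" href="/news/">News</a>
    <a class="clickable" href="/sport/">Sport</a>
  </nav>
  <main>
    <div class="layout-item teaser">
      <a class="clickable" href="/article-1">Article 1</a>
    </div>
    <div class="layout-item teaser wide">
      <a class="clickable" href="/article-2">Article 2</a>
    </div>
  </main>
</body>
""".strip()

root = ET.fromstring(PAGE)

# Counterpart of //a[@class="clickable"]/@href
clickable = [a.get("href") for a in root.findall(".//a[@class='clickable']")]

# Emulation of //div[contains(@class,'layout-item')]/a/@href:
# every div whose class attribute contains the substring, then its child <a>.
layout_item = [
    a.get("href")
    for div in root.iter("div")
    if "layout-item" in div.get("class", "")
    for a in div.findall("a")
]

print(clickable)
print(layout_item)
```

On this sample the clickable variant also returns the two menu links, while the layout-item variant returns only the article links.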

Just run and collect the links 🙂



