#no more scraping and concatenating text from html pages
circumference-pie · 11 months ago
thank god for 小说名txt ("novel title" + txt) searches
geeksperhour · 6 years ago
via Screaming Frog
Google’s search engine results pages (SERPs) have changed a great deal over the last 10 years, with more and more data and information being pulled directly into the results pages themselves. Google search features are a regular occurrence on most SERPs nowadays, some of the most common being featured snippets (aka ‘position zero’), knowledge panels and related questions (aka ‘people also ask’). Data suggests that some features such as related questions may feature on nearly 90% of SERPs today – a huge increase over the last few years.
Understanding these features can be powerful for SEO. Reverse engineering why certain features appear for particular query types and analysing the data or text included in said features can help inform us in making optimisation decisions. With organic CTR seemingly on the decline, optimising for Google search features is more important than ever, to ensure content is as visible as it possibly can be to search users.
This guide runs through the process of gathering search feature data from the SERPs, to help scale your analysis and optimisation efforts. I’ll demonstrate how to scrape data from the SERPs using the Screaming Frog SEO Spider using XPath, and show just how easy it is to grab a load of relevant and useful data very quickly. This guide focuses on featured snippets and related questions specifically, but the principles remain the same for scraping other features too.
TL;DR
If you’re already an XPath and scraping expert and are just here for the syntax and data type to set up your extraction (perhaps you saw me eloquently explain the process at SEOCamp Paris or Pubcon Las Vegas this year!), here you go (spoiler alert for everyone else!) –
Featured snippet XPath syntax
Featured snippet page title (Text) – (//div[@class='ellip'])[1]/text()
Featured snippet text paragraph (Text) – (//span[@class="e24Kjd"])[1]
Featured snippet bullet point text (Text) – //ul[@class="i8Z77e"]/li
Featured snippet numbered list (Text) – //ol[@class="X5LH0c"]/li
Featured snippet table (Text) – //table//tr
Featured snippet URL (Inner HTML) – (//div[@class="xpdopen"]//a/@href)[2]
Featured snippet image source (Text) – //div[@class="rg_ilbg"]
Related questions XPath syntax
Related question 1 text (Text) – (//div[1]/g-accordion-expander/div/div)[1]
Related question 2 text (Text) – (//div[2]/g-accordion-expander/div/div)[1]
Related question 3 text (Text) – (//div[3]/g-accordion-expander/div/div)[1]
Related question 4 text (Text) – (//div[4]/g-accordion-expander/div/div)[1]
Related question snippet text for all 4 questions (Text) – //g-accordion-expander//span[@class="e24Kjd"]
Related question page titles for all 4 questions (Text) – //g-accordion-expander//div[@class="ellip"]
Related question page URLs for all 4 questions (Inner HTML) – //div[@class="feCgPc y yf"]//div[@class="rc"]//a/@href
You can also get this list in our accompanying Google doc. Back to our regularly scheduled programming for the rest of you…follow these steps to start scraping featured snippets and related questions!
1) Preparation
To get started, you’ll need to download and install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping. I’d also recommend our web scraping and data extraction guide as a useful bit of light reading, just to cover the basics of what we’re getting up to here.
2) Gather keyword data
Next you’ll need to find relevant keywords where featured snippets and / or related questions are showing in the SERPs. Most well-known SEO intelligence tools have functionality to filter keywords you rank for (or want to rank for) and where these features show, or you might have your own rank monitoring systems to help. Failing that, simply run a few searches of important and relevant keywords to look for yourself, or grab query data from Google Search Console. Wherever you get your keyword data from, if you have a lot of data and are looking to prune and prioritise your keywords, I’d advise the following –
Prioritise keywords where you have a decent ranking position already. Not only is this relevant to winning a featured snippet (almost all featured snippets are taken from pages ranking organically in the top 10 positions, usually top 5), but more generally if Google thinks your page is already relevant to the query, you’ll have a better chance of targeting all types of search features.
Certainly consider search volume (the higher the better, right?), but also try and determine the likelihood of a search feature driving clicks too. As with keyword intent in the main organic results, not all search features will drive a significant amount of additional traffic, even if you achieve ‘position zero’. Try to consider objectively the intent behind a particular query, and prioritise keywords which are more likely to drive additional clicks.
3) Create a Google search query URL
We’re going to be crawling Google search query URLs, so need to feed the SEO Spider a URL to crawl using the keyword data gathered. This can either be done in Excel using find and replace and the ‘CONCATENATE’ formula to change the list of keywords into a single URL string (replace word spaces with + symbol, select your Google of choice, then CONCATENATE the cells to create an unbroken string), or, you can simply paste your original list of keywords into this handy Google doc with formula included (please make a copy of the doc first).
At the end of the process you should have a list of Google search query URLs which look something like this –
https://ift.tt/2zJIJ6H https://ift.tt/2PDsGSC https://ift.tt/2zJD1RY https://ift.tt/2PCCkVB https://ift.tt/2zWkCln etc.
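If you would rather script this step than use a spreadsheet, the find-and-replace plus CONCATENATE approach can be sketched in a few lines of Python. This is only an illustration: `BASE_URL` and `build_search_urls` are names made up for this example, and the Google domain is an assumption you should swap for your own.

```python
from urllib.parse import quote_plus

# Assumed base URL for illustration; swap in your Google of choice.
BASE_URL = "https://www.google.co.uk/search?q="


def build_search_urls(keywords):
    """Turn a list of keywords into Google search query URLs."""
    urls = []
    for keyword in keywords:
        cleaned = keyword.strip()
        if not cleaned:
            continue  # skip blank rows pasted from a spreadsheet
        # quote_plus replaces spaces with '+' and percent-encodes
        # other reserved characters (for example a literal '+').
        urls.append(BASE_URL + quote_plus(cleaned))
    return urls


search_urls = build_search_urls(["seo", "what is seo", "c++ tutorial", ""])
for url in search_urls:
    print(url)
```

Using `quote_plus` rather than a plain space-to-plus swap also covers keywords containing characters such as `+` or `&`, which would otherwise break the query string.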
4) Configure the SEO Spider
Experienced SEO Spider users will know that our tool has a multitude of configuration options to help you gather the important data you need. Crawling Google search query URLs requires a few configurations to work. Within the menu you need to configure as follows –
Configuration > Spider > Rendering > JavaScript
Configuration > robots.txt > Settings > Ignore robots.txt
Configuration > User-Agent > Preset User Agents > Chrome
Configuration > Speed > Max Threads = 1 > Max URI/s = 0.5
These config options ensure that the SEO Spider can access the features and also not trigger a captcha by crawling too fast. Once you’ve set up this config I’d recommend saving it as a custom configuration which you can load up again in future.
5) Set up your extraction
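A Max URI/s of 0.5 means roughly one URL every two seconds, so large keyword lists take a while. As a rough back-of-the-envelope sketch (the function name is invented here, and JavaScript rendering time is ignored, so treat the result as a lower bound):

```python
def estimated_crawl_seconds(url_count, max_uri_per_second=0.5):
    """Rough lower bound on crawl time at a fixed request rate."""
    if max_uri_per_second <= 0:
        raise ValueError("max_uri_per_second must be positive")
    # At 0.5 URI/s each URL takes 1 / 0.5 = 2 seconds.
    return url_count / max_uri_per_second


seconds = estimated_crawl_seconds(500)
print(f"About {seconds / 60:.0f} minutes for 500 URLs")
```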
Next you need to tell the SEO spider what to extract. For this, go into the ‘Configuration’ menu and select ‘Custom’ and ‘Extraction’ –
You should then see a screen like this –
From the ‘Inactive’ drop down menu you need to select ‘XPath’. From the new dropdown which appears on the right hand side, you need to select the type of data you’re looking to extract. This will depend on what data you’re looking to extract from the search results (full list of XPath syntax and data types listed below), so let’s use the example of related questions –
The above screenshot shows the related questions showing for the search query ‘seo’ in the UK. Let’s say we wanted to know what related questions were showing for the query, to ensure we had content and a page which targeted and answered these questions. If Google thinks they are relevant to the original query, at the very least we should consider that for analysis and potentially for optimisation. In this example we simply want the text of the questions themselves, to help inform us from a content perspective.
Typically 4 related questions show for a particular query, and each of these 4 questions has its own XPath syntax –
Question 1 – (//div[1]/g-accordion-expander/div/div)[1]
Question 2 – (//div[2]/g-accordion-expander/div/div)[1]
Question 3 – (//div[3]/g-accordion-expander/div/div)[1]
Question 4 – (//div[4]/g-accordion-expander/div/div)[1]
To find the correct XPath syntax for your desired element, our web scraping guide can help, but we have a full list of the important ones at the end of this article!
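To see why each question needs its own positional expression, here is a minimal sketch using Python's standard-library ElementTree on toy markup. The HTML below is invented for illustration and is far simpler than a real SERP. ElementTree only supports a subset of XPath, so the outer `(...)[1]` is emulated by taking the first match, and paths start with `.//` rather than `//`.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed markup standing in for a related questions box.
# This is an assumption for illustration, not Google's real HTML.
SAMPLE = """
<html><body>
  <div id="paa">
    <div><g-accordion-expander><div><div>What is SEO?</div></div></g-accordion-expander></div>
    <div><g-accordion-expander><div><div>How do I start SEO?</div></div></g-accordion-expander></div>
  </div>
</body></html>
""".strip()


def nth_question(root, n):
    """Return the text of the n-th related question, or None."""
    # div[n] selects a div that is the n-th div child of its parent.
    matches = root.findall(f".//div[{n}]/g-accordion-expander/div/div")
    # Taking the first match mirrors the (...)[1] wrapper.
    return matches[0].text if matches else None


root = ET.fromstring(SAMPLE)
for position in (1, 2, 3):
    print(position, nth_question(root, position))
```

Position 3 returns `None` here because the toy markup only contains two questions, which is also what an empty extraction column would indicate in a real crawl.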
Once you’ve input your syntax, you can also rename the extraction fields to correspond to each extraction (Question 1, Question 2 etc.). For this particular extraction we want the text of the questions themselves, so need to select ‘Extract Text’ in the data type dropdown menu. You should have a screen something like this –
If you do, you’re almost there!
6) Crawl in list mode
For this task you need to use the SEO Spider in List Mode. In the menu go Mode > List. Next, return to your list of created Google search query URL strings and copy all URLs. Return to the SEO Spider, hit the ‘Upload’ button and then ‘Paste’. Your list of search query URLs should appear in the window –
Hit ‘OK’ and your crawl will begin.
7) Analyse your results
To see your extraction you need to navigate to the ‘Custom’ tab in the SEO Spider, and select the ‘Extraction’ filter. Here you should start to see your extraction rolling in. When complete, you should have a nifty looking screen like this –
You can see your search query and the four related questions appearing in the SERPs being pulled in alongside it. When complete you can export the data and match up your keywords to your pages, and start to analyse the data and optimise to target the relevant questions.
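Matching the exported rows back to keywords can also be scripted. The sketch below assumes a CSV export with an `Address` column and columns whose names start with `Question`; those column names and the sample rows are made up for illustration, so adjust them to whatever your export actually contains.

```python
import csv
import io
from urllib.parse import parse_qs, urlsplit

# Made-up sample export; real column names and values may differ.
SAMPLE_EXPORT = """Address,Question 1,Question 2
https://www.google.co.uk/search?q=what+is+seo,What does SEO stand for?,How does SEO work?
https://www.google.co.uk/search?q=seo+tools,Which SEO tool is best?,
"""


def questions_by_keyword(csv_text):
    """Map each search keyword to the questions extracted for it."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Recover the keyword from the q= parameter ('+' becomes a space).
        query = parse_qs(urlsplit(row["Address"]).query).get("q", [""])[0]
        questions = [
            value.strip()
            for column, value in row.items()
            if column.startswith("Question") and value and value.strip()
        ]
        result[query] = questions
    return result


grouped = questions_by_keyword(SAMPLE_EXPORT)
for keyword, questions in grouped.items():
    print(keyword, "->", questions)
```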
8) Full list of XPath syntax
As promised, we’ve done a lot of the heavy lifting and have a list of XPath syntax to extract various featured snippet and related question elements from the SERPs –
Featured snippet XPath syntax
Featured snippet page title (Text) – (//div[@class='ellip'])[1]/text()
Featured snippet text paragraph (Text) – (//span[@class="e24Kjd"])[1]
Featured snippet bullet point text (Text) – //ul[@class="i8Z77e"]/li
Featured snippet numbered list (Text) – //ol[@class="X5LH0c"]/li
Featured snippet table (Text) – //table//tr
Featured snippet URL (Inner HTML) – (//div[@class="xpdopen"]//a/@href)[2]
Featured snippet image source (Text) – //div[@class="rg_ilbg"]
Related questions XPath syntax
Related question 1 text (Text) – (//div[1]/g-accordion-expander/div/div)[1]
Related question 2 text (Text) – (//div[2]/g-accordion-expander/div/div)[1]
Related question 3 text (Text) – (//div[3]/g-accordion-expander/div/div)[1]
Related question 4 text (Text) – (//div[4]/g-accordion-expander/div/div)[1]
Related question snippet text for all 4 questions (Text) – //g-accordion-expander//span[@class="e24Kjd"]
Related question page titles for all 4 questions (Text) – //g-accordion-expander//div[@class="ellip"]
Related question page URLs for all 4 questions (Inner HTML) – //div[@class="feCgPc y yf"]//div[@class="rc"]//a/@href
We’ve also included them in our accompanying Google doc for ease.
Conclusion
Hopefully our guide has been useful and can set you on your way to extract all sorts of useful and relevant data from the search results. Let me know how you get on, and if you have any other nifty XPath tips and tricks, please comment below!
The post How to Scrape Google Search Features Using XPath appeared first on Screaming Frog.
imapplied · 7 years ago
Scraping ‘People Also Ask’ boxes for SEO and content research
People Also Ask (PAA) boxes have become an increasingly prevalent SERP feature since their introduction in 2016. In fact, recent data from Mozcast suggests that PAA features on around 30% of the queries they monitor.
This box, tied to Google’s machine learning algorithms, shows questions related to a user’s initial query. For example, if you input “how does a computer work” in google.co.uk, then you see the following:
Initially, the reception to this feature amongst the SEO community was somewhat muted – perhaps because many saw it as another feature designed to erode organic CTRs – but over the past two years more and more people have identified the value of leveraging this data to supplement their campaigns.
How to use the data
Here are three ways that this newly gathered data can be used:
1) Use the questions to build FAQ pages.
We have used this tactic on several sites and found that FAQs built from PAA data become an invaluable resource for customers.
2) Use the information to adjust the targeting of pages via page titles and headings, or even to create new pages.
Appealing topics related to your keywords can easily be found using PAA data, which can help build out the content of your site and give additional information to users.
3) Use People Also Ask data to find related questions which may trigger featured snippets.
By formatting the answers to these questions as lists or step-by-step guides, it is possible to be featured in snippets. This will enhance your SERP listing and may give your site more prominence and authority in your chosen subject.
Regardless of how you intend to use the PAA SERP feature, the first step will always be data collection. There are several ways that you can do this.
Manual collection
In scenarios where you’re only looking at a handful of keywords, manual collection is still a perfectly valid approach – just dump your queries in and see if the feature is triggered. If so, note down the applicable questions that are returned.
Using rank tracking software
Many rank trackers show which queries generate SERP features.  For example, in Moz if you log into an account and navigate to Rankings > SERP Features you can see the queries noted by the dual arrows icon.
Unfortunately, many of these programmes don’t show the generated questions which means once you have narrowed down your list, you will still have to check them.
Some programmes, such as AWR, will show you the questions in a custom report if you set search engine to ‘Google Universal’ and tick ‘Result type’ in your keyword rankings report. Be aware: it takes up to a day to process four thousand keywords or a week to run 20,000 — and that’s before running the report, which may take another hour!
If you have access to a tool with features similar to the above and have some time to play with, then my recommendation would be to use them. If you don’t have access or time isn’t on your side, then luckily there’s another option: step forward Screaming Frog!
Screaming Frog
One of the best and most underutilised Screaming Frog features is custom extraction. This allows you to take any piece of information from crawlable webpages and add to your Screaming Frog data pull.
At this point, it’s worth highlighting that this technically violates Google’s Terms & Conditions. It’s what your rank tracking software does on a daily or weekly basis, so I’m sure they will forgive us. We’re aiming to improve the quality of our content and websites after all.
That said, if you use your standard SF settings, you will trigger a captcha very quickly, and the extraction will not work. So, it is crucial that you throttle your speed and, if you want to be cautious, use a VPN or proxy server set to your desired location.
The following settings usually work well and can be adjusted in SF under Configuration > Speed:
It’s also worth changing your user-agent to Chrome via Configuration > User-Agent:
Next, you’ll need to compile the list of URLs you want to crawl. If you’ve leveraged a ranking tool, you may already know which queries you want to check. If not, don’t worry – you can still plug in thousands of queries without an issue.
We’re going to build our URLs using a spreadsheet program like Excel. We’re looking to create something like this:
In the first column, enter the version of Google you wish to use (e.g. https://www.google.co.uk) appended with the path and parameter for the search query (/search?q=).
The second column should contain your search queries. Given that these are going to form part of the URL, you’ll need to replace spaces with the ‘+’ symbol and escape any other special characters (see URL encoding). If you’re happy with your final keyword list, then the ‘Find & Replace’ function can make short work of this job. Alternatively, if you plan to frequently add new keywords to your list and want to avoid the hassle of repeating this task, you can create a working column which dynamically makes the necessary character swaps using the SUBSTITUTE function.
A third column is then required for any other parameters that you need, such as those for location. You can find a list of most of the parameters here (it’s a little out of date, but I have yet to find a better resource).
Finally, the URL components in each row need to be concatenated, for example using =CONCAT(A1:C1) for the first row, leaving you with a final string.
Example: https://www.google.co.uk/search?q=what+is+amazon+echo&ion=0&num=10
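The same three-column concatenation can be sketched in Python with the standard library, which handles the URL encoding for you. `build_url` is a hypothetical helper written for this example; the parameter values are the ones from the example URL above.

```python
from urllib.parse import urlencode


def build_url(google_domain, query, extra_params=None):
    """Join the Google domain, search query and extra parameters."""
    params = {"q": query}
    params.update(extra_params or {})
    # urlencode swaps spaces for '+' and escapes special characters.
    return f"{google_domain}/search?{urlencode(params)}"


example = build_url(
    "https://www.google.co.uk",
    "what is amazon echo",
    {"ion": 0, "num": 10},
)
print(example)
# https://www.google.co.uk/search?q=what+is+amazon+echo&ion=0&num=10
```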
The extraction can’t be run yet, though, as we haven’t told Screaming Frog what to find on the page!
The next thing you need is a little Chrome extension called Scraper. This tool allows you to select data on a webpage and find things that are similar, and – more importantly for our purposes – will attempt to write the XPath string for you.
Once the extension is installed, navigate to a Google SERP that features a ‘People Also Ask’ section. Highlight one of the questions, right-click and select ‘scrape similar’ from the dialogue box. If you are lucky, all the questions are selected on the right-hand side, and the XPath is written for you. This is rarely the case, though, so you will usually have to look in the source code.
Right-click on one of the questions and go to inspect. This should highlight the question and the relevant div class.
Then go back to your scraper tool and add the string to it, bearing in mind that:
You typically want to start your XPath with //
You likely then need the HTML tag (in this case the div)
Any classes, ids, etc. need to be formatted in square brackets and have an @ symbol at the start, with the value in straight (not curly) quotes. So in this case [@class="NWt7k"]
This means our full string should be: //div[@class="NWt7k"]
Add this to the scraper tool and hit ‘Scrape’ to make sure all the text you require is pulled and that no extra information is present. It may be worth doing this on a couple of different SERPs to make sure it works.
The relevant class changes every so often, so if you are using the string in this example and it is not working, then check the DOM node using inspect element and modify as required.
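If you want to sanity-check a class-based XPath string outside the browser, a small Python sketch like the one below can help. It uses the standard-library ElementTree, which supports only a subset of XPath and needs well-formed markup, so the sample HTML and questions here are invented for illustration. The class name is the one from this example and will likely differ on a live SERP.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a SERP fragment.
SAMPLE_SERP = """
<div id="serp">
  <div class="NWt7k">How does a computer work step by step?</div>
  <div class="other">Unrelated result</div>
  <div class="NWt7k">What are the <b>basic parts</b> of a computer?</div>
</div>
""".strip()


def extract_text(root, xpath):
    """Return the full text of every element matching the XPath."""
    # itertext() also collects text inside nested tags such as <b>.
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]


root = ET.fromstring(SAMPLE_SERP)
# ElementTree paths are relative, hence './/' instead of '//'.
questions = extract_text(root, ".//div[@class='NWt7k']")
for question in questions:
    print(question)
```

Note that the class value is wrapped in straight quotes; curly quotes copied from a web page will not work as XPath string delimiters.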
The XPath string then needs adding to Screaming Frog’s Custom Extraction Menu.  This can be found in Configuration > Custom > Extraction.
Be sure to name the extraction something sensible so you can easily find it later. On this occasion, we want to make sure ‘Extract Text’ is selected via the drop down on the right-hand side.
You can now import your URL list, run the crawl, and see the returned questions by looking in the ‘custom’ or ‘internal’ reports.
By hitting ‘Export’, we can get this valuable data into our spreadsheet program of choice for further analysis!
Other Google data that can be scraped
Similar logic can be applied to pull in data for other Google features, such as AMP results, AdWords Adverts, ‘searches related to…’ and any additional text-based information within the SERPs, or indeed any other information from your websites.
For example, we have used scraped AdWords data to help analyse if PPC campaigns are cannibalising potential SEO traffic. This tactic has proven to be fruitful in providing greater synergy between the two channels, allowing organic traffic to flourish while PPC budgets can be reinvested elsewhere. Ultimately this approach can maximise overall ROI and revenue for the client.
For further information, Screaming Frog offers a guide on web scraping.
Scraping Google SERPs for ‘People Also Ask’ and other features can be a great way to find out what information users are searching for. Adding this to your content or SEO strategy can make your pages more robust and – more importantly – more useful to your users.
First Found Here
from https://www.imapplied.co.za/seo/scraping-people-also-ask-boxes-for-seo-and-content-research/