GWT: Generate more complete crawl error report

Posted by Mike on Pro Webmasters See other posts from Pro Webmasters or by Mike
Published on 2013-02-11T21:44:27Z Indexed on 2013/10/29 10:21 UTC
Read the original article Hit count: 301

I'm a developer in charge of managing Webmasters and related issues (including correcting crawl errors) for dozens (hundreds, maybe?) of active sites and as part of my duties I create a report of every discrepancy, including all pages generating a 404 and all pages that link to those pages.

Currently within Webmaster Tools I'm able to download a csv file of all pages with a 404 response, but I'm then having to manually click on every single one of those links and copy the "linked from" field to paste into my spreadsheet. This is extremely tedious and seems unnecessary; I would expect the ability to download all that data at once. I'm ultimately looking for the end result of one csv file that has every url with a 404, but also has every url that links to each one of them.

Am I overlooking this functionality somewhere or does anyone have a good solution?

Edit 1 (2/11/2013):

Example of what the csv output looks like now:

URL,Response Code,News Error,Detected,Category
http://www.abcdef.com/123.php,404,,11/12/13,Not found
http://www.abcdef.com/456.php,404,,11/12/13,Not found

Which is great, but let's say 123.php has 5 pages that link to it. Now I have to duplicate that row in my spreadsheet 4 more times, then go into Webmasters, get all the url's that link to the page, and add that data to my spreadsheet.

The output I would prefer:

URL,Response Code,Linked From,News Error,Detected,Category
http://www.abcdef.com/123.php,404,http://www.ghijkl.com/naughtypage1.php,,11/12/13,Not found
http://www.abcdef.com/123.php,404,http://www.ghijkl.com/naughtypage2.php,,11/12/13,Not found
http://www.abcdef.com/123.php,404,http://www.ghijkl.com/naughtypage3.php,,11/12/13,Not found
http://www.abcdef.com/456.php,404,http://www.ghijkl.com/naughtypage1.php,,11/12/13,Not found
http://www.abcdef.com/456.php,404,http://www.ghijkl.com/naughtypage2.php,,11/12/13,Not found
http://www.abcdef.com/456.php,404,http://www.ghijkl.com/naughtypage3.php,,11/12/13,Not found

Note the (hypothetical) addition of a "Linked From" column, as well as the fact there are only 2 unique URL's now (like before) but all of the "Linked To" pages are shown in one report.

Edit 2 (2/12/2013):

To clarify, my question is less about detecting and correcting 404's, but more about generating a report of what Google has listed as errors. Oftentimes, these errors aren't even valid anymore but I still need documentation to show that Google detected a problem and that problem is now fixed. Many of the "linked from" url's I find are actually outdated, cached resources. For example, I'll frequently see that the linked-from url is the sitemap, which is actually an old sitemap cached by Google that points to an old page. Neither the sitemap or old page exist, but they still appear in my crawl error reports because they are cached resources.

© Pro Webmasters or respective owner

Related posts about google-webmaster-tools

Related posts about reporting