Looks like Google is breaking the rules

It would appear I am blocking Google from indexing my site, at least until they fix their indexing robots to behave as documented.

The documentation at http://www.google.com/bot.html, a URL thoughtfully provided in the user agent string, clearly says that Googlebot and all respectable search engine bots respect the directives in robots.txt and that only spammers ignore them. Well, Google obviously does ignore robots.txt, as can be seen from my logs.

access_log-20190526:66.249.71.126 - - [19/May/2019:20:59:40 +1200] "GET /badbehavedbot HTTP/1.1" 200 677 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
access_log-20190526:159.69.117.171 - - [25/May/2019:18:38:48 +1200] "GET /badbehavedbot HTTP/1.1" 200 678 "-" "Mozilla/5.0 (compatible; Seekport Crawler; http://seekport.com/)"
access_log-20190616:144.76.68.76 - - [12/Jun/2019:16:19:05 +1200] "GET /badbehavedbot HTTP/1.1" 200 676 "-" "serpstatbot/1.0 (advanced backlink tracking bot; http://serpstatbot.com/; abuse@serpstatbot.com)"

It does appear to be a real Google server that is ignoring the robots.txt file.

[root@vosprey2 httpd]# nslookup 66.249.71.126
126.71.249.66.in-addr.arpa	name = crawl-66-249-71-126.googlebot.com.
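
(For what it is worth, the documented way to confirm a crawler really is Googlebot is to follow that reverse lookup with a forward lookup of the returned name and check that it resolves back to the same address. A sketch of that second step, my illustration rather than output from my logs:)

# Forward-confirm the reverse DNS result: a genuine Googlebot host name
# should resolve straight back to the crawling address, 66.249.71.126;
# if it resolves elsewhere the reverse record is spoofed.
nslookup crawl-66-249-71-126.googlebot.com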

Why do I think it is ignoring it, you ask? Because my robots.txt file starts with

User-agent: *
Disallow: /badbehavedbot/
and many more

And it also specifically has a section

User-agent: Googlebot
Disallow: /badbehavedbot/
and many more

The reason this will be preventing Google from indexing my site is that the /badbehavedbot/ URL is a ‘honeytrap’ link hidden on all pages, including the main page. No user can ever click on it; it exists specifically to automatically blacklist any search engine that refuses to obey the robots.txt file and tries to follow it. Within seconds of Googlebot trying to index that link, the source IP address the robot is crawling from is permanently blocked by iptables rules from accessing my website.
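
For anyone curious what such automation could look like, a minimal sketch (the log path, chain name and trap string are my assumptions for the example; this is not my actual script):

#!/bin/bash
# Illustrative sketch only: watch the access log and permanently DROP
# any source address that requests the honeytrap URL. The log path,
# chain name and trap string are assumptions for this example.
LOG=/var/log/httpd/access_log
TRAP="GET /badbehavedbot"

# Keep the blocks in their own chain so they are easy to list and flush.
iptables -N BADBOTS 2>/dev/null
iptables -C INPUT -j BADBOTS 2>/dev/null || iptables -I INPUT -j BADBOTS

tail -F "$LOG" | while read -r line; do
    case "$line" in
        *"$TRAP"*)
            ip=${line%% *}    # client address is the first log field
            iptables -C BADBOTS -s "$ip" -j DROP 2>/dev/null \
                || iptables -A BADBOTS -s "$ip" -j DROP
            ;;
    esac
done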

The blacklisting is fully automatic and I don’t intend to change it. It only impacts badly behaved web crawlers, which, as the URL Google provides in its agent string clearly states, is behaviour only engaged in by spammers or nogoodniks; it appears Google now includes itself in that category.

Actually, looking at a few search results (yes, found by Google) on this issue, it seems to have been reported in 2017 that Googlebot suddenly started ignoring robots.txt. You can do your own searches, but it seems to be a common issue.

One suggestion made in the posts I found is that pages are being indexed because the bot followed a link to them from another site, which may be newly added behaviour. However, as this particular link simply does not refer to an existing page, I doubt anyone has linked to it, so that cannot be the reason for it happening here.

I suppose it is possible that following a URL on another site that references a page on my site makes them believe they can then try to index everything on my site, simply because the robots.txt file of the remote site that directed them here did not say they should not do so. That is the sort of fuzzy logic I will completely ignore; I will continue blacklisting IP addresses that do not observe my robots.txt file when crawling my site.

Another suggestion is that the crawler behaviour was changed to retrieve all pages first and only then refer to robots.txt to decide what not to index, rather than referring to robots.txt before attempting to retrieve pages. If that is the case they have decided to retrieve pages you do not want them to retrieve, which I am sure they would not do, so let’s discount that as a reason.

From my perspective, I am gradually blacklisting the IP addresses of all the Googlebot crawler servers (and other badly behaved bots) as new addresses trigger the blacklist rule; however, as I do not particularly care about page rankings, I can live with that. I will probably remove those specific blocks every few months to see if the behaviour has improved, happy in the knowledge that if it has not they will simply be blocked again automatically.
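
If the blocks were kept in a dedicated chain, as in the sketch above, clearing them out every few months would be trivial, for example (again assuming the illustrative BADBOTS chain name):

# Flush the blacklist chain to give the crawlers a clean slate;
# badly behaved ones will simply re-blacklist themselves.
iptables -F BADBOTS
# or do it monthly from cron:
# 0 3 1 * * root /sbin/iptables -F BADBOTS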

The important thing is that if there is a Disallow entry in the robots.txt file, it is there because you specifically do not want that page retrieved and indexed; it may contain sensitive or secure information. The correct behaviour is to immediately block the requesting IP address from accessing the website, so I will leave my automation rules in place to do so.
