Most webmasters and digital marketers have heard of Googlebot. If you have not, it helps to understand it before reading further, because this article covers a recent confirmation from Google's John Mueller about Googlebot and Nutch. We will first cover what Googlebot is and then move on to Nutch.
So, are you ready to learn about Googlebot, Nutch, and what Google confirmed? If so, read on. I will start from scratch, so try not to skip ahead.
Understand the concept of Googlebot first
Googlebot is Google's web crawler, or spider, which crawls your website and its content for indexing. Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index. Googlebot runs on many computers, so it can request and fetch pages far more quickly than you can with your web browser. In short, Googlebot is what Google uses to crawl the web.
Googlebot runs on web crawling software built by Google. This software lets Google scan, find, add, and index web pages.
Now the question is: what is that web crawling software, and is it Nutch? The next sections will help you understand this. Just keep reading.
Web Crawling Software – Nutch
The full name of Nutch is Apache Nutch. It is a powerful open source web crawler. Any webmaster, digital marketer, or SEO can use Nutch to crawl a site and build an index. Nutch is written entirely in the Java programming language, but the data it produces is stored in language-independent formats.
If Nutch is that effective, why does Google say that Googlebot does not use it? This is the recent update, which came out of a conversation between Britney Muller and John Mueller on Twitter.
Let’s talk about the update now.
Google confirms that Googlebot doesn’t use Nutch
A few days ago, Britney Muller asked on Twitter whether Nutch-1.7 is an official Googlebot crawler. Why did she ask? She had spotted someone using Apache Nutch with a Googlebot user-agent name when crawling a site.
John Mueller from Google confirmed that Googlebot does not use Nutch in its user-agent. The Apache Nutch project describes Nutch as a "highly extensible and scalable open source web crawler software project."
Replying to Britney Muller's question on Twitter, John said: "We don't use 'nutch' at all in any of the Googlebot user-agents we use for search or for the other uses of the shared infrastructure."
What should you do if you see Nutch in a Googlebot user-agent?
As you read above, John Mueller said that Google does not use Nutch in any Googlebot user-agent. So what should you do if you see Nutch alongside Googlebot in your logs? Whenever a user-agent combines Nutch with Googlebot, it is not the real Googlebot. You can safely block it if it is causing you any issues.
There is no need to use software that Google itself does not use. Find another crawler that can help you crawl and index your website and its content.
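The check described above can be sketched in a few lines of Python. This is only a minimal illustration, not an official Google method: the function name `claims_googlebot_with_nutch` and the sample Nutch-style user-agent string are assumptions made up for this example. It looks only at the user-agent text, which any client can set to whatever it likes.

```python
import re

# Case-insensitive patterns for the two tokens we care about.
GOOGLEBOT_PATTERN = re.compile(r"googlebot", re.IGNORECASE)
NUTCH_PATTERN = re.compile(r"nutch", re.IGNORECASE)


def claims_googlebot_with_nutch(user_agent: str) -> bool:
    """Return True if a user-agent mentions both Googlebot and Nutch.

    Hypothetical helper for illustration: per John Mueller's statement,
    no real Googlebot user-agent contains "nutch", so a match here means
    the request is not from the real Googlebot.
    """
    return bool(
        GOOGLEBOT_PATTERN.search(user_agent)
        and NUTCH_PATTERN.search(user_agent)
    )


if __name__ == "__main__":
    # Illustrative strings; the first is an assumed example of a spoofed agent.
    samples = [
        "Googlebot/Nutch-1.7",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ]
    for ua in samples:
        print(ua, "->", claims_googlebot_with_nutch(ua))
```

A result of `False` does not prove that a request comes from Google, since a user-agent without Nutch can still be spoofed. The helper only catches the specific case discussed here, and how you block the flagged requests depends on your own server setup.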
Let’s now dig into the history of Nutch below.
Digging into the history of Apache Nutch
Doug Cutting created Nutch together with Mike Cafarella. In 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project implemented a MapReduce facility and a distributed file system.
In 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.
In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl. The Nutch project once aimed to release a global, large-scale web search engine, but that is no longer a goal.
Here is a list of search engines built with Nutch
- Krugle – This search engine uses Nutch to crawl web pages for code, archives, and technically interesting content. Krugle search helps organizations find critical code patterns and application issues quickly.
- Common Crawl – Common Crawl is a non-profit organization that crawls the web and freely provides its archives and datasets to the public. It started using Nutch in 2014.
- Wikia Search – This was a short-lived search engine that launched in 2008 and closed down in 2009.
- Creative Commons Search – Creative Commons is an American non-profit organization devoted to educational access and to expanding the range of creative works available for others to use.
The bottom line
Apache Nutch is an open source web crawler, and a crawler using it with a Googlebot user-agent name raised the question of whether Googlebot runs on it. Google has now confirmed that Googlebot does not use Nutch, so you do not need to use it either. If you see Nutch alongside Googlebot in a user-agent, you know what to do: block it if it is causing you any trouble. I hope you found the article helpful.