Preventing Duplicate Content On WordPress Blogs
Avoid duplicate content penalties from search engines on your WordPress blog with robots.txt
A problem with the default configuration of the WordPress software is that it creates many pages with duplicate content, such as category and archive pages, tag pages, and subsequent pages of the blog’s home page (page 2, 3, 4, etc.). If you’ve been reading about search engine optimisation and duplicate content, you’ll probably know that search engines such as Google penalise pages for duplicate content.
You can prevent search engines from accessing most of your blog’s duplicate content pages with a robots.txt file placed in your blog’s root directory. With robots.txt you can specify which pages you don’t want search engine spiders to access. Here’s what a typical robots.txt file for a WordPress blog might look like.
User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /index.php?
Disallow: /page/
Disallow: /category/
Disallow: /tag/
The above example instructs all search engine spiders not to access the listed directories and pages.
The /wp-login.php, /wp-admin/, /wp-includes/ and /index.php? statements don’t serve to prevent duplicate content but are useful in preventing indexing of your WordPress system pages and your blog’s search results pages.
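If you would like to check rules like these before uploading them, one option is Python's standard-library robots.txt parser. The following is only a rough sketch: the domain example.com, the bot name ExampleBot, the sample paths and the helper names are all made up for illustration, and real search engine spiders may interpret some rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /index.php?
Disallow: /page/
Disallow: /category/
Disallow: /tag/
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot"):
    """Return True if the (hypothetical) bot may fetch the given path."""
    return parser.can_fetch(agent, "http://example.com" + path)


parser = build_parser(ROBOTS_TXT)

# Hypothetical sample paths: the home page, a post, and some blocked pages.
sample_paths = [
    "/",
    "/2009/05/sample-post/",
    "/wp-admin/",
    "/page/2/",
    "/category/news/",
    "/tag/seo/",
]

for path in sample_paths:
    status = "allowed" if is_allowed(parser, path) else "blocked"
    print(path, status)
```

With this parser, the home page and the sample post should come out as allowed, while the admin, paged, category and tag URLs should come out as blocked.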
You may have also noticed the daily/monthly/yearly archive pages haven’t been included in the example. The reason for this is that the URLs of these archive pages share the same structure as the post pages’ URLs, so disallowing them would prevent your post pages from being accessed as well. A way around this is to use the robots.txt file together with a WordPress plugin like the All-In-One SEO Pack, which utilises the robots meta tag to prevent the archive pages from being indexed. This plugin also includes many other useful search engine optimisation features.
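To see why blocking the date archives in robots.txt backfires, here is a small sketch using Python's standard-library parser. It assumes a date-based permalink structure where posts live under /year/month/post-name/; the Disallow: /2009/ rule, the domain, the bot name and the sample paths are hypothetical and only there to illustrate the prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: try to block the 2009 yearly archive by its URL prefix.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /2009/"])


def blocked(path):
    """Return True if the (hypothetical) bot may NOT fetch the given path."""
    return not parser.can_fetch("ExampleBot", "http://example.com" + path)


# A monthly archive page and a post published in that month (assumed URLs).
archive_blocked = blocked("/2009/05/")
post_blocked = blocked("/2009/05/sample-post/")

print("archive blocked:", archive_blocked)
print("post blocked:", post_blocked)
```

Because robots.txt rules match on the start of the path, the post is caught by the same rule as the archive page, which is why the archives are better handled with the robots meta tag.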
Google AdSense Publishers
This part only applies to Google AdSense publishers. The above robots.txt example will also block non-search-engine bots such as the Mediapartners-Google bot from accessing the disallowed pages, which can reduce the relevancy of your AdSense ads on those pages.
To avoid this, include the following lines at the top of your robots.txt. This will ensure the Mediapartners-Google bot is able to access all pages.
User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /index.php?
Disallow: /page/
Disallow: /category/
Disallow: /tag/
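As a quick sanity check of the two-group layout, the sketch below feeds it to Python's standard-library parser and compares what the AdSense bot and another bot may fetch. Note the assumptions: the trailing * after Mediapartners-Google is left out here because this particular parser treats it as a literal character, and the domain, the ExampleBot name and the sample path are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Two groups: an empty Disallow lets the AdSense bot fetch everything,
# while every other bot gets the restricted rule set.
# Assumption: the trailing "*" on the AdSense user-agent is omitted, since
# Python's parser would otherwise look for a literal asterisk in the name.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /index.php?
Disallow: /page/
Disallow: /category/
Disallow: /tag/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_fetch(agent, path):
    """Return True if the named bot may fetch the path on the sample domain."""
    return parser.can_fetch(agent, "http://example.com" + path)


# Hypothetical category page that is disallowed for ordinary spiders.
adsense_ok = can_fetch("Mediapartners-Google", "/category/news/")
other_ok = can_fetch("ExampleBot", "/category/news/")

print("Mediapartners-Google allowed:", adsense_ok)
print("ExampleBot allowed:", other_ok)
```

The AdSense bot should be allowed onto the category page while the other bot is not, which is the behaviour the extra lines are meant to produce.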
Comments
3 responses to “Preventing Duplicate Content On WordPress Blogs”
I have been looking into this subject today and have come across a lot of posts from 2007. This is the most up to date post I have been able to find (so far, the quest continues) but what you are saying makes a lot of sense.
I actually started out looking at using robots.txt to help with security issues (I guess blocking the wp- folders will help with that) and as usual I have ended up steering off course but still finding some useful information!
Thanks!
I have been doing a bit more digging and used a robots.txt validator tool to check your script. It did not like this part –
Disallow: /wp-*
The problem was caused by the *
I have decided to use the following instead (which excludes folders individually)
User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-includes
Disallow: /index.php?
Disallow: /page/
Disallow: /category/
Disallow: /tag/
I am going to wait for the robots.txt file to be downloaded by Googlebot and check the validation again from within my Google Webmaster Tools account.
Thanks for pointing that out, digitalizes. I have now updated the post and removed the * from the end of Disallow: /wp- as this isn’t required.