Simple Regular Expression To Reduce Unnecessary Files From Being Indexed By The Google Search Appliance
June 3rd, 2011
Duplicate Documents Create Noise and Consume Your License
Duplicate links lead to duplicate documents in the index, which:
- Can cost you more in licensing
- Produce duplicate search results, angering your user base
In a recent engagement, a global company’s CMS was producing and accepting URLs in both of these formats:
- http://www.mcplusa.com/company/about
- http://www.mcplusa.com/company/about.html
The Google Search Appliance treats these two URLs as separate documents. I reviewed the client’s requirements and found that every URL the site produced ended in either a ‘/’ or a file extension. This is fairly common among CMSs and other SEO-friendly publishing systems.
The Expression
The regular expression that I came up with was:
regex:http://[put your site here]/.*/$|[put your site here]/.*\.([a-zA-Z]{3,9})$
where [put your site here] is the host name of the content source.
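One way to sanity-check a pattern like this before putting it on the appliance is to run it against a handful of sample URLs. The sketch below is only an approximation in Python, not the appliance's own matcher: it assumes Python's re syntax behaves the same way for this particular pattern, drops the GSA-specific "regex:" prefix, and uses www.mcplusa.com as a stand-in for [put your site here]. The names SITE, FOLLOW and is_kept are made up for illustration.

```python
import re

# Assumption: a stand-in host for "[put your site here]".
SITE = "www.mcplusa.com"

# Same shape as the pattern above, without the GSA-specific "regex:" prefix.
# re.escape() keeps the dots in the host name literal.
host = re.escape(SITE)
FOLLOW = re.compile(rf"http://{host}/.*/$|{host}/.*\.([a-zA-Z]{{3,9}})$")


def is_kept(url: str) -> bool:
    """Return True if the URL matches the pattern anywhere (unanchored search)."""
    return FOLLOW.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://www.mcplusa.com/company/about",
        "http://www.mcplusa.com/company/about.html",
        "http://www.mcplusa.com/company/",
    ]
    for url in samples:
        print(f"{url} -> {'kept' if is_kept(url) else 'dropped'}")
```

Under those assumptions, the extensionless /company/about form falls outside the pattern, while /company/about.html and paths ending in ‘/’ stay in, so only one of the two duplicate URLs survives.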
The Take Away
We tested it out and, after applying the pattern, reduced the total number of documents by 20%. This was especially beneficial since the client was at about 480k documents on their current 500k license. The change took them well below the license limit and cleaned up the search interface.
*Plug* – If you have had your Google Search Appliance for a while, I would recommend considering our Health Check, where we review multiple configuration settings to see if your appliance is properly tuned.
Michael Cizmar
Managing Partner
http://www.twitter.com/michaelcizmar



