MC+A Stream

Our Blog and News Stream

Simple Regular Expression To Keep Unnecessary Files From Being Indexed By The Google Search Appliance

June 3rd, 2011

Duplicate Documents Create Noise and Consume Your License

Duplicate links cause the same content to be indexed more than once, which:

  1. Can cost you more in licensing, since every copy counts toward your document limit.
  2. Produces duplicate search results, thereby angering your user base.

In a recent engagement, a global company’s CMS was producing and accepting URLs in both of these formats:

  • http://www.mcplusa.com/company/about
  • http://www.mcplusa.com/company/about.html

The Google Search Appliance treats these two URLs as separate documents.  I reviewed the client’s requirements and found that every URL for real site content ended in either a ‘/’ or a file type.  This is fairly common among CMSs and other SEO-friendly publishing systems.

The Expression

The regular expression that I came up with was:

regex:http://[put your site here]/.*/$|[put your site here]/.*\.([a-zA-Z]{3,9})$

Here, [put your site here] stands for the site that serves as the content source.
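
As a rough sanity check of the logic, the pattern can be tried against the two URL formats from the example above. This is only a sketch: it uses Java's regex engine rather than the appliance's own, drops the "regex:" prefix, substitutes www.mcplusa.com for the placeholder (with the dots escaped), and assumes the pattern is applied as an unanchored search. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class CrawlPatternCheck {
    // Assumption: the example host from this post stands in for
    // "[put your site here]"; dots are escaped so they match literally.
    static final String SITE = "www\\.mcplusa\\.com";

    // Same shape as the expression above, without the "regex:" prefix.
    static final Pattern KEEP = Pattern.compile(
        "http://" + SITE + "/.*/$|" + SITE + "/.*\\.([a-zA-Z]{3,9})$");

    // Assumption: find() (unanchored search) approximates how the
    // pattern is applied to each URL.
    static boolean wouldKeep(String url) {
        return KEEP.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.mcplusa.com/company/about",
            "http://www.mcplusa.com/company/about.html",
            "http://www.mcplusa.com/company/"
        };
        for (String url : urls) {
            System.out.println((wouldKeep(url) ? "keep  " : "drop  ") + url);
        }
    }
}
```

Under these assumptions the extensionless /company/about is dropped, while /company/about.html and /company/ are kept, so only one copy of each page remains. It is worth running a handful of your own URLs through a check like this before changing the appliance configuration.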

The Take Away

We tested it out, and after applying the pattern the total number of documents dropped by 20%.  This was especially beneficial since the client was at about 480k documents on their current 500k license.  The change took them well below the license limit and cleaned up the search interface.

*Plug* – If you have had your Google Search Appliance for a while, I would recommend considering our Health Check, in which we review multiple configuration settings to see whether your appliance is properly tuned.

Michael Cizmar
Managing Partner

http://www.twitter.com/michaelcizmar

Parsing a URL-Encoded String for a Parameter (Match Between Two Words)

December 17th, 2009

I spent the last hour trying to find a regular expression to parse a URL that was URL-encoded, or more generally to match the text between two words.  Java was having a difficult time retrieving the value since the string was not parameter=something& but rather proxystylesheet%3Ddefault_frontend%26.  The expression is:

(?<=proxystylesheet%3D).*?(?=%26)

Here, proxystylesheet is the parameter you are looking for.  This won’t find the value if it is the last thing on the line (with no trailing %26), but in my case handling that was unnecessary.
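
For anyone who wants to try it, here is a minimal Java sketch of the expression in use. The class name, the helper method and the sample strings are made up for illustration. The second pattern is a hypothetical variant, not the one I used, that also accepts the end of the input as the right-hand boundary.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodedParam {
    // The expression from this post: everything after "proxystylesheet%3D"
    // (an encoded "=") up to, but not including, the next "%26" (an encoded "&").
    static final Pattern BETWEEN =
        Pattern.compile("(?<=proxystylesheet%3D).*?(?=%26)");

    // Hypothetical variant: also stops at the end of the input, so the
    // value is found even when it is the last parameter.
    static final Pattern BETWEEN_OR_END =
        Pattern.compile("(?<=proxystylesheet%3D).*?(?=%26|$)");

    // Returns the first match of the pattern, or empty if there is none.
    static Optional<String> extract(Pattern pattern, String encoded) {
        Matcher m = pattern.matcher(encoded);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample inputs; only proxystylesheet comes from the post.
        String middle = "q%3Dtest%26proxystylesheet%3Ddefault_frontend%26site%3Dexample";
        String last = "q%3Dtest%26proxystylesheet%3Ddefault_frontend";

        System.out.println(extract(BETWEEN, middle));
        System.out.println(extract(BETWEEN, last));
        System.out.println(extract(BETWEEN_OR_END, last));
    }
}
```

The lookbehind and lookahead keep the two boundary words out of the match itself, which is why group() returns only the value. Java requires the lookbehind to have a bounded length, which a fixed literal like this one satisfies.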