Setting up the Google Mini

When you first log in to the Google Mini, you are presented with a web-based interface for configuring the general parameters. These settings include the device IP address, DNS server, admin e-mail address and time zone. Nothing fancy here, although we did have to configure an internal DNS server due to some firewall routing issues.

Configuring your first Collection

As with most search products, the first task is to create a collection of what you want searched. The Google Mini supports one collection, while its larger brother, the Google Search Appliance, supports an unlimited number of collections. Collections can contain sub-collections (which I’ll explain a bit later).

Once you’ve created your first collection, the next step is to edit the collection parameters and set it up for indexing. We started with URLs to Crawl, which contains a few parameters: Starting URLs from which to crawl, Follow and Crawl certain URLs or parts thereof, and Do Not Crawl URLs matching certain patterns. This was probably where we spent 99% of our time configuring the Mini. The Mini allows for 100,000 documents/URLs to be stored in a collection, and AnandTech contains approximately 40,000 articles, news and blog entries.

When we first set up the Mini, we told it to start in each of the website’s sections (for example, http://www.anandtech.com/it/) and in the web news area. The Mini considers any unique URL string to be a unique document, which makes sense (but is a bit surprising the first time that you run an index).
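To see why the document count can outrun the article count, consider one article that is reachable under several URL strings. The sketch below is only an illustration, not a description of how the Mini works internally; the specific URLs and the query parameter names `i` and `p` are assumptions.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URLs: one article reached through different query strings.
# The parameter names "i" (article id) and "p" (page) are assumptions.
crawled_urls = [
    "http://www.anandtech.com/it/showdoc.aspx?i=1234",
    "http://www.anandtech.com/it/showdoc.aspx?i=1234&p=2",
    "http://www.anandtech.com/it/showdoc.aspx?p=2&i=1234",
    "http://www.anandtech.com/it/showdoc.aspx?i=1234&p=3",
]


def article_key(url):
    """Reduce a URL to (path, article id), ignoring all other parameters."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    return (parts.path.lower(), params.get("i"))


# A crawler that keys on the raw URL string sees four documents...
documents_seen = set(crawled_urls)
# ...while all four strings point at the same article.
articles_seen = {article_key(url) for url in crawled_urls}

print(len(documents_seen), "URL strings,", len(articles_seen), "article")
```

Multiply that effect across paging, sorting and print-view parameters, and a site with 40,000 entries can plausibly present far more than 100,000 distinct URLs to a crawler.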

After four hours of indexing, the Mini had managed to reach its document limit and we had to improvise. After several attempts at filtering out various URL patterns and restricting the crawling as much as we could, we ended up writing some code. We created a file to which a link to every article, news post and blog post that has been published on the site would be dumped. That file is cached for a few hours, as we update the index three times a week. We then configured the Mini to start at those URLs and restricted it only to URLs ending in showdoc.aspx, shownews.aspx and a few others. It worked: the next index was around 38,000 documents. A word to the wise: don’t let the Mini crawl your entire site without keeping a close eye on it.
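The approach can be sketched roughly as follows. This is not the code that we run on the site; it is a minimal illustration that assumes the list of published URLs is already available, and the function names `is_crawlable` and `build_seed_page` are made up for the example. "Ending in" is interpreted here as the URL path ending in the page name, so query strings are ignored when matching.

```python
import re
from html import escape
from urllib.parse import urlsplit

# Only paths ending in these page names are kept. The article names two of
# them; the "few others" are not listed, so none are guessed here.
ALLOWED_PATH = re.compile(r"/(showdoc|shownews)\.aspx$", re.IGNORECASE)


def is_crawlable(url):
    """True if the URL's path ends in one of the allowed page names."""
    return bool(ALLOWED_PATH.search(urlsplit(url).path))


def build_seed_page(urls):
    """Return a plain HTML page with one link per crawlable URL."""
    links = [
        '<a href="{0}">{0}</a>'.format(escape(url, quote=True))
        for url in urls
        if is_crawlable(url)
    ]
    return "<html><body>\n" + "<br>\n".join(links) + "\n</body></html>\n"


# Hypothetical URLs; a real list would come from the site's database.
sample_urls = [
    "http://www.anandtech.com/it/showdoc.aspx?i=1234&p=2",
    "http://www.anandtech.com/it/default.aspx",
]
print(build_seed_page(sample_urls))
```

Pointing the crawler at a single generated page like this keeps the starting set explicit, and the path filter stops it from wandering into listing and navigation pages.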

Sub-collections

Before you let the Google Mini go off and crawl to its heart’s content, consider creating some sub-collections if they are required. Sub-collections are simply small collections containing specific fragments of your site. For instance, on AnandTech, we have Articles, News, Blogs and FAQs as sub-collections. Each of these can be searched separately within the collection, which allows us to have targeted searches within the various sections of the web site.
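One way to picture sub-collections is as named groups of URL patterns. The sketch below is an illustration only: the sub-collection names are ours, but every pattern is an assumption (the `/blogs/` and `/faq/` paths in particular are hypothetical), and it does not reproduce the Mini's own configuration syntax.

```python
import re
from urllib.parse import urlsplit

# Sub-collection names come from the article; the patterns are assumptions.
SUBCOLLECTIONS = {
    "Articles": re.compile(r"/showdoc\.aspx$", re.IGNORECASE),
    "News": re.compile(r"/shownews\.aspx$", re.IGNORECASE),
    "Blogs": re.compile(r"^/blogs/", re.IGNORECASE),  # hypothetical path
    "FAQs": re.compile(r"^/faq/", re.IGNORECASE),     # hypothetical path
}


def subcollection_for(url):
    """Return the first sub-collection whose pattern matches the URL path."""
    path = urlsplit(url).path
    for name, pattern in SUBCOLLECTIONS.items():
        if pattern.search(path):
            return name
    return None


def restrict(urls, name):
    """Keep only the URLs that belong to the named sub-collection."""
    return [url for url in urls if subcollection_for(url) == name]
```

A targeted search is then just a search over the URLs that `restrict` keeps for a given section.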

KeyMatch/Synonyms

Like the google.com search, the Google Mini supports KeyMatches, which place links that you choose at the top of the search results for keywords that you enter in the Google Mini interface. Another useful feature is Synonyms, which allows you to enter synonyms for various search terms. We have created a few. Try typing “iram” into our search, and you’ll notice that it suggests “i-ram” as a possible search.
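Conceptually, both features behave like lookup tables keyed on query terms. The sketch below is a simplified illustration and not the Mini's implementation; the table contents, the function names and the promoted URL are hypothetical, with only the “iram” to “i-ram” pairing taken from our own setup.

```python
# Hypothetical tables; "iram" -> "i-ram" is the example from the article.
SYNONYMS = {"iram": "i-ram"}
# Placeholder URL, not a real page.
KEYMATCHES = {"i-ram": "http://www.example.com/i-ram-review"}


def suggest(query):
    """Return a suggested replacement query, or None if nothing changes."""
    words = query.lower().split()
    replaced = [SYNONYMS.get(word, word) for word in words]
    return " ".join(replaced) if replaced != words else None


def promoted_links(query):
    """Return the links to show above the regular results for this query."""
    words = query.lower().split()
    return [KEYMATCHES[word] for word in words if word in KEYMATCHES]


print(suggest("iram"))
print(promoted_links("i-ram review"))
```

Note that the suggestion is offered alongside the results rather than replacing the query, which matches the behavior described above.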

Look and feel integration

The last thing that we worked on was making the Mini look like it is part of AnandTech.com. There are two ways to go about this in the Mini admin. One is to use their built-in page layout helper, which allows you to wrap the search screens with a custom header and footer. The other way (which we prefer) is to use the XSLT Stylesheet editor and modify the stylesheet to meet your needs.

All in all, our integration went fairly smoothly, and the Mini has made it much easier to find content on AnandTech.com.

Screen Shots

[Screenshots: Settings, Collection, Output, Subcollections]


48 Comments


  • Calin - Tuesday, September 6, 2005

    Pentium III processors are still offered in those 1U servers. The reason is probably a low price for good performance and a lower thermal load than other competing Intel processors.
    Low thermal load helps a lot for dual-processor servers.
  • flatblastard - Tuesday, September 6, 2005

    Not only a p3 mobo, but PC133 ram labeled for DELL!?!? I guess Google is buying up all the old junk and putting it to good use.
  • bhtooefr - Tuesday, September 6, 2005

    Heck, I've got a Dell PowerEdge 350 (1U, single 850 P3, i440BX chipset) sitting in front of me, and the RAM's not even labelled Dell...
  • flatblastard - Tuesday, September 6, 2005

    making a fortune in the process....
  • brownba - Tuesday, September 6, 2005

    Ok, so I tested it with this query:
    google mini search server
    - it came back with 18700 useless results.
    I also tried the title of the article:
    anandtech search goes google
    - 712 useless results

    how long does it take to crawl?
  • glennpratt - Tuesday, September 6, 2005

    I'm guessing jason clark meant to reply to you.
  • Rock Hydra - Tuesday, September 6, 2005

    I like it. I tested it out and got the returns I was expecting. Very Google-y style. Now if they implemented something this well into the forums search....maybe another day.
  • TheInvincibleMustard - Tuesday, September 6, 2005

    So, naturally, I searched for iram ... returned zero results, but it did suggest i-ram as a possibility. So I clicked that link ...
    doh ... "I" is a very common word and so was not included in my search, meaning all I actually wound up searching for is "RAM" (of which there were several thousand entries, and not one of the top few was actually about the I-RAM product), so perhaps a bit more tweaking is in order ;-)

    Granted, though, the search did only take 0.02 seconds! :-D
  • dvinnen - Tuesday, September 6, 2005

    I was playing around with it too. Did the i-ram search also; the first article presented was an article from 1997 about memory terms (think EDO). The actual i-ram article was the fourth result presented. Hell, just a google search gives it as the 4th link in all the internets. Definitely could use some tweaking (give added weight based on the date of the article?) but looks to be a step up from the useless one built into SQL server you were using.
  • glennpratt - Tuesday, September 6, 2005

    Did you read my post?

    Did you click on the link that says 'i is a common word and was excluded' in the search? That would have given you a couple of choices on what to do to fix it.
