Herald Journal Columns
July 24, 2006, Herald Journal

Did you Google today?

By MARK OLLIG

Millions of us do! When I come across a term or subject that I need information about, I run a search, or query, in this search engine. No doubt someone has asked you how you found some specific piece of information on the Internet. When people ask me, I sometimes respond with “I just googled it.”

How does Google get so much information so quickly? Is it a super-fast carrier pigeon at work behind the scenes? Not at all; it’s just a matter of some powerful software and computer hardware at work.

Google operates over a network of thousands of computers and can therefore carry out what is called “rapid parallel processing.” Parallel processing is a method of computation in which many calculations are performed at the same time, which drastically speeds up overall data processing.
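For readers who like to see an idea in code, here is a minimal Python sketch of splitting work into pieces that are handled at the same time. It is only an illustration: it uses a few worker threads on one computer rather than thousands of machines, and the sample pages and the function names `count_words` and `count_all` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(page_text: str) -> int:
    """Stand-in for one unit of work: count the words on a single page."""
    return len(page_text.split())


# Hypothetical sample pages; a real search engine would handle vastly more.
pages = [
    "carrier pigeons are not involved",
    "parallel processing splits work across machines",
    "search engines index the web",
]


def count_all(page_texts: list[str], workers: int = 3) -> list[int]:
    # Each page is handed to a worker; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_words, page_texts))


if __name__ == "__main__":
    print(count_all(pages))
```

The point is the shape of the work, not the speed: each page is processed independently, so adding more workers (or more computers) lets more pages be handled at once.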

Google has three separate parts. The first is called “Googlebot.” This is a web crawler, a computer software program that browses the World Wide Web in a methodical, automated manner. Web crawlers are used to create a copy of every page they visit on the World Wide Web.

Next, the Google Indexer sorts every word on every page and stores this index of words in a very large database on computer servers. The content inside the index servers is comparable to the index in the back of a book; it tells which pages contain the words that match any particular search term.

Third is the “Query-Processor,” which compares your search query against the index and presents, on the results page, the documents that it considers the most applicable, using special software computations.
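The book-index comparison can be sketched in a few lines of Python. This is a toy version under simple assumptions, not Google’s actual software: the page addresses and text are invented, words are split on spaces only, and a page matches only if it contains every word in the query. The names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of pages that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical pages standing in for what a crawler might hand over.
pages = {
    "example.com/a": "Google crawls the web",
    "example.com/b": "The web is enormous",
    "example.com/c": "Google ranks pages",
}
index = build_index(pages)

if __name__ == "__main__":
    print(search(index, "the web"))
    print(search(index, "google web"))
```

Looking a word up in the index is quick because the hard work of reading every page was done ahead of time, just as a book’s index saves you from rereading the book.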

Googlebot: Google’s web crawler

Googlebot is Google’s web crawling “robot,” or “bot” for short. This bot is made up of many computers dedicated to running software that finds and retrieves pages on the web. Googlebot then hands these pages to the Google Indexer.

It’s easy to imagine Googlebot as a very smart and fast spider, running across the thin web-like connecting strands of cyberspace to find information, but in reality, Googlebot doesn’t navigate over the web at all. It functions much like your web browser: it sends a request to a web server for a web page, downloads the entire page, and then hands it off to Google’s Indexer.

Of course, this process of requesting and retrieving pages of information is completed much more quickly than you could manage with a regular web browser. The search results are returned to the user in a fraction of a second.

In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers or crowding out requests from human users, Googlebot by design makes requests of each individual web server more slowly than it’s actually capable of doing.
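One simple way a crawler can hold itself back is to remember when it last contacted each web server and wait a minimum time before contacting it again. The Python sketch below illustrates that idea only; the class name `PolitenessScheduler` and the two-second delay are assumptions for this example, not Googlebot’s real settings.

```python
from urllib.parse import urlparse


class PolitenessScheduler:
    """Space out requests to the same web server (hypothetical helper)."""

    def __init__(self, min_delay_seconds: float = 2.0):
        # Assumed delay; a real crawler would choose this per server.
        self.min_delay = min_delay_seconds
        self.next_allowed: dict[str, float] = {}

    def schedule(self, url: str, now: float) -> float:
        """Return the earliest time (in seconds) this URL may be fetched."""
        host = urlparse(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        # Reserve the slot so the next request to this host waits its turn.
        self.next_allowed[host] = start + self.min_delay
        return start


if __name__ == "__main__":
    scheduler = PolitenessScheduler(2.0)
    for url in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
        print(url, scheduler.schedule(url, now=0.0))
```

Requests to different servers can still go out at the same moment; only repeat visits to the same server are delayed.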

Googlebot finds pages in two ways: through an “Add URL” form located at www.google.com/addurl.html, and by finding links as its bots crawl the web. The “Meta-tags,” or “keywords,” that you include when creating your web pages also help search engines like Google index and rank your website.

Unfortunately, the “spammers” lurking across the web, unpopular and unloved by many, figured out how to create automated bots that bombarded the Add URL form with millions of Uniform Resource Locators, or “URLs,” directed to their profit-making websites.

Google tries to reject URLs submitted through its Add URL form that it suspects are trying to mislead users with strategies such as including hidden text or links on a page, padding a web page with irrelevant words, the old bait-and-switch routine, and other sneaky methods to redirect you to the spammers’ destination. Because of this, Google created a test for anyone using the Add URL form. The form displays some squiggly letters designed to fool automated “letter-guessing” bots. It asks you to enter the letters you see, something like an eye-chart test, to stop these “spam-bots.”

When Googlebot obtains a web page, it collects all the links appearing on that page and adds them to a holding queue for later crawling.
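Here is a small Python sketch of that step, using only the standard library. It is an illustration, not Googlebot’s code: the sample page, the addresses, and the names `LinkCollector`, `extract_links` and `crawl_queue` are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the target of every <a href="..."> link on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into full addresses.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page contents and holding queue.
sample_html = '<p><a href="/about.html">About</a> <a href="http://example.org/">Other</a></p>'
crawl_queue: deque[str] = deque()
crawl_queue.extend(extract_links("http://example.com/index.html", sample_html))

if __name__ == "__main__":
    print(list(crawl_queue))
```

Each address placed in the queue is a page to be requested later, and each of those pages may in turn add more links to the queue.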

Googlebot tends to encounter little spam because most web authors link only to what they consider to be high-quality web pages. By harvesting links from every page it encounters, Googlebot can rapidly put together a list of links that covers broad reaches of the web. This technique is known as “deep crawling,” and it allows Googlebot to search deep within individual websites. Because of their gigantic scale, deep crawls can reach almost every page out there on the web. The web is enormous, so this can take some time. Some pages may be crawled only once a month.

Googlebot must be programmed to send out simultaneous requests for thousands of pages. The results of these requests must be constantly examined and compared with URLs already in Google’s index queue. Any duplicates in the queue must be removed to prevent Googlebot from fetching the same page again. Googlebot must determine how often to revisit a page. To keep the index current, Google constantly “re-crawls” popular and frequently changing web pages.
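The bookkeeping described here can be sketched in Python as follows. This is a simplified illustration with made-up names (`Frontier`, `revisit_interval_days`): duplicates are caught by remembering every address already seen, and the revisit rule is a hypothetical formula in which pages that change more often are revisited sooner, with a monthly visit as the slowest case.

```python
from collections import deque


class Frontier:
    """A crawl queue that refuses addresses it has already seen."""

    def __init__(self):
        self.queue: deque[str] = deque()
        self.seen: set[str] = set()

    def add(self, url: str) -> bool:
        """Queue the URL unless it is a duplicate; report whether it was added."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def revisit_interval_days(changes_per_month: float) -> float:
    """Hypothetical rule: between one day and thirty days between visits."""
    if changes_per_month <= 0:
        return 30.0
    return max(1.0, min(30.0, 30.0 / changes_per_month))


if __name__ == "__main__":
    frontier = Frontier()
    for url in ["http://example.com/", "http://example.com/", "http://example.org/"]:
        print(url, "added" if frontier.add(url) else "duplicate")
    print(revisit_interval_days(3))
```

A page that changes about three times a month would be revisited roughly every ten days under this rule, while a page that never changes would be checked once a month.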

These re-crawls are known as “fresh crawls.”

Google ranking pages

“PageRank” is Google’s computing system for ranking web pages. A page with a higher page ranking is considered more significant, and is more likely to be listed above a page with a lower page ranking. The rankings are computed using more than a hundred factors, a method on which Google has a patent application.
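Google’s full ranking recipe is not public, but the link-counting idea behind PageRank has been published and can be sketched in a few lines of Python. Treat this as a simplified illustration only: it looks at links alone and ignores every other factor, the tiny three-page “web” is invented, and the damping value of 0.85 is the commonly quoted textbook figure, used here as an assumption.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified link-based ranking; every link target must also be a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no links shares its score with all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web: pages "a" and "b" both link to "c"; "c" links back to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(web).items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

In this toy web, page “c” comes out on top because two pages link to it, and page “b” lands at the bottom because nothing links to it. A link acts like a vote, and votes from highly ranked pages count for more.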

Here is a tip: When you want a definition of a word or term, type “define:” (include the colon at the end of define) in the search box, followed by the word you want defined, and then press the search button.

An example of this type of search is: “define: VoIP” (without the quotation marks). This search will present you with lots of results, many of them from the glossaries of websites belonging to businesses in the related industry.

Visit: www.google.com and search for something new today!