April 14th, 2008

Google Crawls HTML Forms to Index Deep, Invisible Web



Google aims to crawl and index every page on the Internet. But millions of web pages remain hidden behind Flash, JavaScript, dynamic pages, unlinked pages, and password-protected pages, so Google has started crawling HTML forms to index the unexplored Deep Web (also called the Hidden Web or Invisible Web), going where no search engine has gone before…

Google elaborates on their new adventure to crawl more…

Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
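To make the quoted description concrete, here is a minimal sketch of how candidate query URLs could be generated from a form's inputs. It is not Google's implementation: the function name `candidate_urls`, the example form, and the `max_queries` cap are all assumptions made for illustration, and it only covers forms submitted with GET.

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_urls(action, fields, max_queries=5):
    """Build up to max_queries GET URLs for a form (hypothetical helper).

    action: the form's action URL.
    fields: maps each input name to the values to try. For text boxes
            these would be words taken from the site; for select menus,
            check boxes, and radio buttons, the values found in the HTML.
    """
    names = list(fields)
    # Every combination of one value per input is a possible user query.
    combos = product(*(fields[name] for name in names))
    urls = []
    # Keep the number of queries small, as the quote describes.
    for combo in islice(combos, max_queries):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls


# Assumed example form: a text box "q" and a select menu "type".
urls = candidate_urls(
    "http://example.com/search",
    {"q": ["cars", "boats"], "type": ["new", "used"]},
    max_queries=3,
)
```

A crawler would then fetch each URL and keep only the result pages that are valid and contain content not already in the index. That filtering step is left out of this sketch.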

They say only a small number of ‘particularly useful sites’ will be checked like this, and Googlebot will always follow the robots.txt, nofollow, and noindex directives. This means that if you have content that you definitely do not want exposed on the World Wide Web, be sure to disallow it in robots.txt or mark it nofollow and noindex with meta tags, or else you could find it on Google search next time.
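As a rough way to check what a robots.txt rule blocks, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt contents, the `/private/` path, and the `is_allowed` helper are assumptions for illustration, and Python's parser may not match Googlebot's behaviour in every detail.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: keep all crawlers out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/private/search?q=x")
allowed = is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/public/page")
```

For a single page, the meta tag equivalent is `<meta name="robots" content="noindex, nofollow">` in the page's head.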

It is often claimed that the Deep or Invisible web is much larger than the indexed web, and now Google is determined to change that…




