All major search engines send out a small program called a ‘spider’ to index your pages. Some search engines use spiders to index your entire site; others index only one or two pages. A spider takes a ‘snapshot’ of your page and determines what the page is about by examining the text on the page, META tags, and various other page factors. Most directories, such as Looksmart, Zeal, and the Open Directory Project, also send out a spider. However, since directories are not search engines, the primary function of their spiders is to verify that your site is still up and running.
These robots leave traces of their visits in your server log files, just as human visitors do, so if you have access to your stats you will be able to spot them. The best hint of an indexing attempt is an access to the ‘robots.txt’ file in the root directory of your site.

If you don’t have a ‘robots.txt’ file, it is simply because one was never created; don’t worry, a spider will still crawl your site without it. Every search engine checks for this little text file, which tells the crawler where it may and may not go. It can also be used to allow or disallow specific robots if you find a particular spider to be nasty in nature. The main purpose of the file is to keep robots from indexing directories or files they aren’t supposed to, such as cgi directories, administration files, and so on. If you would like to create a ‘robots.txt’ file but don’t know where to start, visit the ‘official’ robots.txt site at www.robotstxt.org.
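As an illustration only (the directory names and the robot name below are placeholders, so adjust them to match your own site and your own logs), a simple ‘robots.txt’ might look like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /admin/

    User-agent: BadBot
    Disallow: /

The first block tells all well-behaved robots to stay out of the cgi-bin and admin directories; the second shuts one particular spider out of the entire site. Keep in mind that the file is only a request: a polite spider will obey it, while a rude one can simply ignore it. In an Apache-style access log, a spider’s visit to the file shows up as a line similar to this hypothetical example:

    66.249.66.1 - - [12/Oct/2004:06:25:14 -0500] "GET /robots.txt HTTP/1.1" 200 154 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"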
Unfortunately, there are also malicious spiders on the web, used for purposes other than search indexing. Some are designed to copy your entire website to the client’s hard drive; others harvest e-mail addresses to be used for sending unsolicited e-mail.