Posts Tagged ‘robots’

Robot Exclusions

Tuesday, October 21st, 2008

The robots.txt file is a plain-text document that belongs in the root of your domain. It contains instructions telling any robot that visits your site what it is and is not allowed to crawl. To communicate with a crawler, you need to use a specific syntax that it can understand. In its most basic form, the text might look something like this:

User-agent: *
Disallow: /

Both parts of the text are essential. The first part, User-agent:, tells crawlers which user agent, or crawler, you’re addressing. The asterisk (*) indicates that all crawlers are covered, but you can name a single crawler or even several crawlers instead. The second part, Disallow:, tells the crawler what it is not allowed to access. The slash (/) means everything on the site, since every path begins with a slash.
When you’re writing robots.txt, remember to include the colon (:) after the User-agent field name and after the Disallow field name. The colon separates the field name from the value the crawler should act on. You usually won’t want to tell all crawlers to ignore all directories. Instead, you can tell all crawlers to ignore your temporary directory by writing the text like this:

User-agent: *
Disallow: /tmp/
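
One way to check what a rule set like this actually blocks is to run it through a parser. The sketch below uses Python's standard urllib.robotparser module on the rules held in a string. The crawler name ExampleBot and the two sample paths are made up for illustration; they are not from any real site.

```python
from urllib.robotparser import RobotFileParser

# The wildcard rules from the example, kept in a string instead of a live file.
RULES = """\
User-agent: *
Disallow: /tmp/
"""


def build_parser(rules_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(RULES)

# "ExampleBot" is a hypothetical crawler name; the "*" group covers it.
tmp_allowed = parser.can_fetch("ExampleBot", "/tmp/cache.html")
index_allowed = parser.can_fetch("ExampleBot", "/index.html")

print("May fetch /tmp/cache.html:", tmp_allowed)
print("May fetch /index.html:", index_allowed)
```

Anything under /tmp/ is refused, while pages outside that directory remain open to the crawler.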

Or you can take it one step further and tell all crawlers to ignore multiple directories:

User-agent: *
Disallow: /temp/
Disallow: /users/
Disallow: /adm/listing.html

That piece of text tells the crawler to ignore temporary directories, private directories, and the web page (titled Listing) that contains links, so the crawler won’t be able to follow those links. One thing to keep in mind is that some crawlers read the robots.txt file from top to bottom, and as soon as they find a group of rules that applies to them, they stop reading and begin crawling your site. Crawlers that follow the current Robots Exclusion Protocol instead choose the group that matches their name most specifically, wherever it appears in the file, but you cannot count on every crawler doing so. So if you’re addressing multiple crawlers in your robots.txt file, you want to be careful how you write it. This is the wrong way:

User-agent: *
Disallow: /tmp/
User-agent: CrawlerName
Disallow: /temp/
Disallow: /adm/listing.html

This bit of text first tells all crawlers to ignore the temporary directory, so every crawler reading the file will match that rule. But you’ve also told a specific crawler (indicated by CrawlerName) to stay out of both the temporary directory and the Listing page. A crawler that stops at the first matching group never reaches the instructions meant for it. If you want to address multiple crawlers, begin by naming the crawlers you want to control. Only after they’ve been named should you leave your instructions for all crawlers. Written properly, the text from the preceding code should look like this:

User-agent: CrawlerName
Disallow: /temp/
Disallow: /adm/listing.html
User-agent: *
Disallow: /temp/
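
To see how a named group and the wildcard group divide the work, here is a small sketch using Python's standard urllib.robotparser. The rules are modelled on the example above, using the Listing page path from earlier in the post; OtherBot is a made-up name standing in for any crawler not listed. Note that Python's parser picks the group by crawler name rather than by position in the file, so it shows the intended result but does not reproduce a crawler that stops at the first match.

```python
from urllib.robotparser import RobotFileParser

# Named crawler first, wildcard group last (paths are illustrative).
RULES = """\
User-agent: CrawlerName
Disallow: /temp/
Disallow: /adm/listing.html

User-agent: *
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def can_crawl(agent, path):
    """Return True if the given crawler may fetch the given path."""
    return parser.can_fetch(agent, path)


# The named group applies to CrawlerName, so the Listing page is off limits.
named_listing = can_crawl("CrawlerName", "/adm/listing.html")
# "OtherBot" (hypothetical) falls back to the "*" group, which allows it.
other_listing = can_crawl("OtherBot", "/adm/listing.html")
# Both groups block the temporary directory.
other_temp = can_crawl("OtherBot", "/temp/old.html")

print(named_listing, other_listing, other_temp)
```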

Each search engine crawler goes by a different name, and you can find those names in your web server logs.
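
As a rough illustration of pulling crawler names out of a log, the sketch below counts user-agent strings in a few sample lines. It assumes the common "combined" access log layout, where the user agent is the last quoted field; your server may log in a different format. The sample lines, IP addresses, and ExampleBot are invented for the example.

```python
import re
from collections import Counter

# Made-up sample lines in the "combined" access log layout (an assumption).
SAMPLE_LOG = [
    '203.0.113.5 - - [21/Oct/2008:10:00:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [21/Oct/2008:10:00:05 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [21/Oct/2008:10:00:09 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# In this layout the user agent is the last double-quoted field on the line.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how many requests each user-agent string made."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


agent_counts = count_user_agents(SAMPLE_LOG)
for agent, hits in agent_counts.most_common():
    print(hits, agent)
```

The name to use in a User-agent: line is usually the short product token inside the string (Googlebot in the sample), not the whole string.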

What Are Robots and Crawlers?

Tuesday, October 14th, 2008

A robot, spider, or crawler is a piece of software that is programmed to “crawl” from one web page to another by following the links on those pages. It collects content (text, links) from web sites and saves it in a database that is indexed and ranked according to the search engine’s algorithm. The links found during a crawl will sometimes take the crawler to other pages on the same web site, and sometimes they will take it away from the site completely. The crawler keeps following links until every link on a page has been followed.
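
The link-following loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular search engine works: the three-page site lives in a dictionary so nothing is fetched over the network, and the page contents and the other.example address are invented.

```python
from collections import deque
from html.parser import HTMLParser

# A hypothetical three-page site held in memory so no network is needed.
PAGES = {
    "/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/">Home</a> <a href="http://other.example/">Elsewhere</a>',
    "/news.html": '<a href="/about.html">About</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Breadth-first walk: visit a page, queue its unseen links, repeat."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        collector = LinkCollector()
        collector.feed(PAGES.get(url, ""))  # off-site pages have no stored body
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/")
print(visited)
```

The seen set is what keeps the crawler from looping forever when pages link back to each other, and the last address visited shows a link leading away from the site.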

The crawler sends a request to the web server where the web site resides, asking for pages to review. The difference between what your browser sees and what the crawler sees is that the crawler views the pages through a completely text-based interface. No graphics or other types of media files are displayed; it is all text. If the site doesn’t eventually begin to cooperate with the crawler, the site is penalized for the failures and its search engine ranking will fall.

Entireweb Newsletter (6)

Wednesday, September 3rd, 2008

19. Use a robots.txt file to manage and control how search engine spiders index your site. You can allow and disallow spiders and choose which directories you want crawled and indexed. Bad bots and spam bots tend to ignore robots.txt, however, so to manage them properly and effectively you need to modify your .htaccess file. Visit http://www.robotstxt.org/wc/faq.html to learn more about the robots.txt file.
20. Do not attempt to present different content to search engines than what you show to your site visitors.

…to be continued on SEOmag blog…