views
Blocking Search Engines with robots.txt Files
Understand robots.txt files. A robots.txt file is a plain or ASCII text file that informs search engine spiders what they are allowed to access on your site. Files and folders listed in a robots.txt file may not be crawled and indexed by a search engine spiders. You may need a robots.txt file if: You want to block specific content from search engine spiders. You are developing a live site and are not prepared to have search engine spiders crawl and index the site You want to limit access to reputable bots.
Create and save and robots.txt file. To create the file, launch a plain text editor or a code editor. Save the file as: robots.txt. The file name must be all lowercase. Do not forget the “s.” When you save the file, choose the extension “'.txt”'. If you are using Word, select the “Plain Text” option.
Write a full-disallow robots.txt file. It is possible to block every reputable search engine spider from crawling and indexing your site with a “full-disallow” robots.txt. Write the following lines in your text file: User-agent: * Disallow: / Using a “full-disallow” robots.txt file is not strongly recommended. When a bot, such as Bingbot, reads this file, it will not index your site and the search engine will not display your website. User-agents: this is another term for search engine spiders, or robots *: the asterisk signifies that the code applies to all user-agents Disallow: /: the forward slash indicates that the entire site is off-limits to bots
Write a conditional-allow robots.txt file. Instead of blocking all bots, consider blocking specific spiders from certain areas of your site. Common conditional-allow commands include: Block a specific bot: replace the asterisks next to User-agent with googlebot, googlebot-news, googlebot-image, bingbot, or teoma. Block a directory and its contents: User-agent: * Disallow: /sample-directory/ Block a webpage: User-agent: * Disallow: /private_file.html Block an image: User-agent: googlebot-image Disallow: /images_mypicture.jpg Block all images: User-agent: googlebot-image Disallow: / Block a specific file format: User-agent: * Disallow: /p*.gif$
Encourage bots to index and crawl your site. Many people want to welcome, instead of block, search engine spiders because they want their entire site indexed. To accomplish this, you have three options. First, you can opt out of creating a robots.txt file—when the robot does not find a robots.txt file, it will continue to crawl and index your entire site. Second, you can create an empty robots.txt file—the robot will find the robots.txt file, recognize that it is empty, and continue to crawl and index your site. Lastly, you can write a full-allow robots.txt file. Use the code: User-agent: * Disallow: When a bot, such as googlebot, reads this file, it will feel free to visit your entire site. User-agents: this is another term for search engine spiders, or robots *: the asterisk signifies that the code applies to all user-agents Disallow: the blank disallow command indicates that all files and folders are accessible
Save the txt file to the root of your domain. After you have written the robots.txt file, save the changes. Upload the file to your site's root directory. For example, if your domain is www.yourdomain.com, place the robots.txt file at www.yourdomain.com/robots.txt.
Blocking Search Engines with Meta Tags
Understand HTML robots meta tags. The robots meta tag allows programmers to set parameters for bots, or search engine spiders. These tags are used to block bots from indexing and crawling an entire site or just parts of the site. You can also use these tags to block a specific search engine spider from indexing your content. These tags appear in the head of your HTML file. This method is commonly used by programmers that do not have access to a website's root directory.
Block bots from a single page. It is possible to block all bots from indexing a page and or from following a page's links. This tag is commonly used when a live site is under development. Once the site is complete, it is strongly recommended that you remove this tag. If you do not remove the tag, your page will not be indexed or searchable via search engines. You may block bots from indexing the page and from following any of the links: You may block all bots from indexing the page: You may block all bots from following the page's links:
Allow the bots to index a page, but not follow its links. If you allow the bots to index the page, the page will be indexed; if you prevent the spiders from following the links, the link path from this specific page to other pages will break. Insert the following line of code into your header:
Let the search engine spiders follow the links but not index the page. If you allow the bots to follow the links the link path from this specific page to other pages will remain in tact; if you restrict them from indexing the page, your web page will not appear in the index. Insert the following line of code into your header:
Block a single outgoing link. To hide a single link on a page, embed a rel tag within the link tag. You may wish to use this tag to block links on other pages that lead to the specific page you want to block. Insert Link to Blocked Page
Block a specific search engine spider. Instead of blocking all bots from your web page, you may wish to prevent one bot from crawling and indexing the page. To accomplish this, replace “'robot”' within the meta tag with the name of a specific bot. Examples include: googlebot, googlebot-news, googlebot-image, bingbot, and teoma.
Encourage bots to crawl and index your page. If you want to ensure that your page will be indexed and its links will be followed, you can insert a follow-allow meta “robot” tag into your header. Use the following code:
Comments
0 comment