Crawling content

Before end users can use the enterprise search functionality in Microsoft Office SharePoint Server 2007 to search for content, you must first crawl the content that you want to make available for users to query. For the purpose of this article, content is an item that can be crawled, such as a Web page, a Microsoft Office Word document, or a SharePoint site.

To crawl content, do the following:

  1. Create a content source     A content source defines the type of repository that contains the content you want to crawl, the start addresses from which to start crawling, the behavior to use when crawling, and the crawling schedule.

  2. Specify the credentials to use when crawling all URLs or a specific range of URLs     By default, the default content access account uses Windows domain user credentials to crawl the content repositories that are defined by content sources. Alternatively, you can use a crawl rule to specify different credentials for a URL or range of URLs. These credentials can be a client certificate, forms credentials, a cookie, or a different content access account.

  3. Configure proxy server settings for search     When you crawl content that is hosted outside your network, you will probably use a proxy server to reach the host server. In this case, it is important to verify the settings for the proxy server and configure them in Search Server 2008. To do this, on the Search Administration page, under Crawling, click Proxy and timeouts. Usually, you need to set this option only once.

  4. Start a full crawl    You can begin by crawling a small amount of content defined in a particular content source in order to test your configuration. After that small amount of content is crawled successfully, broaden the content that your content sources cover to build your index.

    For information about starting a full crawl, see Start a full crawl.

  5. View the crawl log     During the crawl, we recommend that you view the crawl log to check on the progress of the crawl. This enables you to confirm that the crawl is successful or to detect problems. Common problems are that authorization failed or that the host is unreachable. When you see problems in the crawl log, you can stop the crawl, adjust the settings on the Manage Content Sources, Manage Crawl Rules, and Manage Farm-Level Search Settings pages, and then try the crawl again.
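
The relationship in step 2 between the default content access account and crawl rules can be illustrated with a short sketch. This is not SharePoint code and does not call any SharePoint API. The names CrawlRule, credentials_for, and DEFAULT_ACCOUNT are hypothetical, the example.com URLs are placeholders, and the sketch assumes that the first matching rule wins and that rule paths use simple * wildcards.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List

# Hypothetical label for the account that is used when no crawl rule applies.
DEFAULT_ACCOUNT = "default content access account"


@dataclass
class CrawlRule:
    """Hypothetical model of a crawl rule: a URL pattern and the credentials to use."""
    url_pattern: str   # for example "http://partner.example.com/*"
    credentials: str   # for example "client certificate" or "forms credentials"


def credentials_for(url: str, rules: List[CrawlRule]) -> str:
    """Return the credentials of the first rule whose pattern matches the URL.

    Falls back to the default content access account when no rule matches.
    First-match ordering is an assumption made for this sketch.
    """
    for rule in rules:
        if fnmatchcase(url, rule.url_pattern):
            return rule.credentials
    return DEFAULT_ACCOUNT


if __name__ == "__main__":
    rules = [
        CrawlRule("http://partner.example.com/*", "forms credentials"),
        CrawlRule("https://secure.example.com/*", "client certificate"),
    ]
    for url in (
        "http://partner.example.com/docs/report.aspx",
        "http://intranet.example.com/sites/team",
    ):
        print(url, "->", credentials_for(url, rules))
```

A check like this can help you reason about which credentials a given start address will be crawled with before you start a full crawl. The crawl log remains the authoritative record of what actually happened during the crawl.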
