Plan to crawl content

Note: This is preliminary content for a preliminary software release. It might be incomplete and is subject to change.

In this article

About crawling content

Plan content sources

Plan crawling considerations for SSPs

Plan crawling considerations for server farms

Use the search features planning worksheet

Before you can use the enterprise search functionality in Microsoft Office SharePoint Server 2007 to search for content across your organization, you must decide what content to include in search and plan for crawling the content so that the content and its properties can be used in search queries.

Microsoft Office SharePoint Server 2007 uses content sources to crawl content in your site collections or on related external sites or business data applications so that relevant content and data appear in search results. Other search features filter or modify content after it has been crawled. Planning content sources well during your initial deployment lets you configure and manage crawled content across your organization by key subsets of content and data, by content and data external to your Office Server deployment, or by content and data external to your organization. You also plan crawl schedules, crawl rules, property management, and relevance settings for each content source.

About crawling content

Crawling is a process of indexing content, data, and metadata so that search queries can provide relevant search results. A content source is a starting point used by Office SharePoint Server 2007 to crawl content to make it available for search queries. Content sources are composed of one or more start addresses, which are URLs containing content or data that you want to include in searches in your organization. Content is included or excluded in a content index based on rules that are selected by the Shared Services Provider (SSP) administrator for search. After you crawl content and data in a content source, query servers process queries based on managed properties of search and the capability of the search service, and provide relevant search results. By default, all content on each Web application that uses the SSP is crawled in a single content source.

SSP administrators for search can create additional content sources for key subsets of content. They can select crawl rules for start addresses that apply to all content sources in the SSP based on what content and data at each start address is relevant to the organization, and configure settings for each content source.

With previous versions of Microsoft SharePoint Portal Server, administrators managed content indexes, which are the underlying collections of all content crawled by content sources. With Microsoft Office SharePoint Server 2007, this is no longer necessary. The single content index for each SSP is automatically created based on the settings selected for each content source, and content indexes are no longer displayed to administrators.

Plan content sources

The default content source for the SSP crawls the content on all Web applications that use the SSP. The start addresses for all Web applications in the SSP are automatically added to the content source, so that all content in the SSP is available to search after the first full crawl of the content source.

Your information architecture should also suggest additional content sources to create for each of your site collections in each of your Web applications. To manage and schedule crawls independently, you can create content sources that crawl a subset of content throughout the SSP. This is useful for crawling high-priority or quickly changing content more frequently without needing to crawl all content.

Examples of content you might want to plan additional content sources for include:

  • Content on file shares within your organization.

  • Exchange Server content.

  • Content on Lotus Notes servers.

  • Sites in the Site Directory.

  • Other content in your organization not found in SharePoint sites.

  • Content external to your SSP or external to your organization.

  • Business data stored in line of business applications.

Each content source can contain one or more start addresses that point to locations for any combination of these types of content. Whether you group content in a content source or create additional content sources depends largely upon administration considerations. Administrators often make changes that require a full update of a particular content source. Changes to crawl rules, the crawling or access account, or managed properties require a full update. To make administration easier, organize content sources in such a way that updating that content at the same time is convenient for administrators and their other planned administration tasks.

Content on file shares and servers outside your server farm such as mail servers, Web servers that do not contain SharePoint sites, or business data application servers should be organized by availability. If the servers containing content are available at the same time, you're more likely to successfully crawl all the content in the content source, with less need to run full updates later.

Beyond these considerations, to effectively crawl all the content needed within each site collection in your organization, use as few content sources as you can. Use the "Plan for crawling and querying search features" worksheet to record the decisions you make about content sources for your initial deployment.

Plan external content sources

External content refers to two types of content useful for people in your organization:

  • Content within a Web application that uses a different SSP that you want to crawl by using this SSP.

  • Internet or extranet content that is not created or controlled by people in your organization.

Typically, if content on a Web application is relevant enough to be included in a content source, that Web application should use the same SSP as the other Web applications whose start addresses are in the content source. In some cases, you might want to include a subset of content in your organization from a Web application that uses different shared services. If you can, avoid this situation by carefully planning your information architecture, SSPs, and site structure. If you must crawl content on a Web application that uses a different SSP, ensure that the relevant crawling account has read permission to the content, and try to group the start address in a content source with other content that is available at similar times or that is conceptually related.

A common scenario involves content outside the control of your organization that relates to the content on your SharePoint sites. You can add the start addresses for this content to an existing content source or create a new content source for external content. Because availability of external sites varies widely, it is helpful to add separate content sources for different external content. You can then update each set of external content on a crawl schedule that makes sense for the availability of each site.

Crawler impact rules are particularly important when crawling external content sources because crawling uses resources on the crawled servers. If your crawls request too much content or make requests too frequently, and so use too many resources or too much bandwidth, the administrators of those sites might limit your future access. You can also use crawl settings for each content source and crawl rules for the SSP to limit the impact on external servers.

Plan content sources for business data

Business data content sources require that the applications hosting the data are first registered in the Business Data Catalog, and that their properties are mapped to managed properties that are consistent with your search schema. Business data start addresses cannot be combined with start addresses for other content, so you must manage business data content sources separately.
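The rule that business data start addresses cannot share a content source with other content can be checked in a planning script. The following is a minimal sketch, not product code: the `StartAddress` and `ContentSource` classes, the `kind` labels, and all URLs (including the `bdc://` address) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StartAddress:
    url: str
    kind: str  # hypothetical labels: "sharepoint", "file_share", "web", "business_data"


@dataclass
class ContentSource:
    name: str
    start_addresses: list = field(default_factory=list)

    def add(self, address: StartAddress) -> None:
        """Add a start address, refusing to mix business data with other kinds."""
        kinds = {a.kind for a in self.start_addresses} | {address.kind}
        if "business_data" in kinds and len(kinds) > 1:
            raise ValueError(
                f"{self.name}: business data start addresses need their own content source"
            )
        self.start_addresses.append(address)


# Hypothetical planned content source that mixes SharePoint and file share content.
portal = ContentSource("Portal content")
portal.add(StartAddress("http://portal.example.com", "sharepoint"))
portal.add(StartAddress("file://fileserver/docs", "file_share"))

try:
    portal.add(StartAddress("bdc://crm/customers", "business_data"))
    mixed_allowed = True
except ValueError:
    mixed_allowed = False
```

Running a check like this against the planning worksheet can flag a mixed content source before anyone tries to configure it.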

Often, the people who plan for integration of business data into your site collections will not be the same people involved in the overall content planning process. Include business application administrators in your content planning teams so that they can advise you how to integrate their data into your other content and effectively present it on your site collections.

Plan crawl settings

For each content source, you can also select how extensively to crawl the start addresses in that content source. The options available in the properties for each content source are:

  • Crawl everything under the host name for each start address.

  • Crawl only the SharePoint site of each start address.

As with other content source decisions, the most important factors to consider when planning the crawl settings of content sources are the relevance of the information and the impact to performance. For best results:

  • Crawl only the SharePoint site if the content available on linked sites is not likely to be relevant, and the content on the site itself is relevant.

  • Crawl everything if the links on the start address tend to point to relevant content.
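To see how the two crawl settings differ in reach, the sketch below tests whether a URL falls inside the scope of a start address. It is an approximation under stated assumptions: the function name `in_scope` and the example URLs are hypothetical, and the "SharePoint site" is approximated by a URL path prefix, which is not necessarily how the product determines site boundaries.

```python
from urllib.parse import urlparse


def in_scope(url: str, start_address: str, crawl_everything_under_host: bool) -> bool:
    """Return True if url falls inside the crawl scope of start_address."""
    target, start = urlparse(url), urlparse(start_address)
    if target.netloc.lower() != start.netloc.lower():
        return False  # a different host is outside both scopes
    if crawl_everything_under_host:
        return True
    # Site-only scope: approximate the site by the path of the start address.
    site_path = start.path.rstrip("/") + "/"
    return (target.path.rstrip("/") + "/").startswith(site_path)


start = "http://portal.example.com/sites/hr"
hr_page = "http://portal.example.com/sites/hr/policies/leave.aspx"
finance_page = "http://portal.example.com/sites/finance/q1.aspx"

site_only = [in_scope(u, start, False) for u in (hr_page, finance_page)]
whole_host = [in_scope(u, start, True) for u in (hr_page, finance_page)]
```

With the site-only setting, only the HR page is in scope. With the host-wide setting, both pages are.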

Plan crawl schedules

Each content source can be independently updated based on a crawl schedule for that content source. Crawl schedules should be planned based on the availability, performance, and bandwidth considerations of both the servers running the search service and the servers hosting the crawled content.

For best results, plan crawl schedules based on the following considerations:

  • Group start addresses in content sources based on similar availability and with acceptable overall resource usage for the servers hosting the content.

  • Schedule incremental crawls for each content source during times when the servers hosting the content are available but the demand on the resources of those servers is low.

  • Stagger crawl schedules so that the load on the servers in your farm is distributed over time.

  • Schedule full crawls less frequently than incremental crawls.

  • Schedule administration changes that require a full crawl to occur shortly before the planned schedule for full crawls.

You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.
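One simple way to stagger crawl schedules is to spread start times evenly across a low-demand window. The sketch below illustrates that idea only. The function `stagger_start_times`, the content source names, and the window are hypothetical, and real schedules also depend on how long each crawl takes.

```python
from datetime import datetime


def stagger_start_times(source_names, window_start, window_end):
    """Spread crawl start times evenly across [window_start, window_end)."""
    if not source_names:
        return {}
    step = (window_end - window_start) / len(source_names)
    return {
        name: window_start + i * step
        for i, name in enumerate(source_names)
    }


# Hypothetical overnight window from 22:00 to 04:00.
window_start = datetime(2007, 1, 1, 22, 0)
window_end = datetime(2007, 1, 2, 4, 0)
schedule = stagger_start_times(
    ["Portal content", "File shares", "External sites"], window_start, window_end
)
```

Here the three content sources start two hours apart, so their crawls are less likely to compete for the same farm resources.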

Plan crawling considerations for SSPs

After you have identified content sources to plan for your initial deployment, consider SSP planning. The settings for crawling in the Shared Services Administration pages for the SSP affect the crawling of all content sources in the SSP. In most organizations, only one SSP is used to crawl and query all content, so these settings apply to all content sources in the organization.

During deployment, you will create SSPs and then create content sources for each SSP. During planning, it can actually help to plan content sources first. In the small number of cases where additional SSPs are needed, planning content sources can help identify the need for multiple SSPs. Planning content sources also helps you identify content that could benefit from crawl rules or new file types.

SSP settings that affect crawled content include:

  • Setting up the default content access account.

  • Configuring crawl rules for specific start addresses used by any of your content sources.

  • Including file types.

Plan default content access account

The default content access account is the account that is used by default when crawling content sources. This account is selected by the SSP administrator during post-setup configuration. The default content access account must have read access to all content that is crawled, or the content will not be crawled and will not be available during search queries. For individual sites in a content source, you can use crawl rules to use a different access account. The best practice is to select a default content access account that has the broadest access to most of your crawled content, and only use other access accounts when security considerations require separate accounts. For each content source you plan, identify the start addresses that cannot be accessed by the default content access account and plan to add access accounts for those start addresses. Administrators can configure additional access accounts in crawl rules for the relevant start addresses. For more information about the planning considerations for access accounts, see the following section about crawl rules.

Plan crawl rules

Crawl rules are used to limit content crawled by content sources to minimize use of server resources and network traffic, and to increase the relevance of search results. Crawl rules apply simultaneously to all content sources. You create crawl rules to exclude a specific site or location from crawling, to configure how a particular site is crawled, or to change the crawling account to be different from the default content access account.

Each crawl rule includes a URL or a set of URLs represented by wildcards, an inclusion or exclusion rule, and a crawling account.

You can use exclusion rules to avoid crawling irrelevant content. Often, most of the content for a particular site address is relevant, but a specific subsite or range of sites is not. By selecting a focused combination of start addresses and exclusion crawl rules, SSP administrators can maximize crawled content while minimizing the impact on crawling performance and the size of content databases. Exclusion rules are particularly useful when planning for start addresses for external content, where the impact on resource usage is not under the control of people in your organization.

You can use inclusion rules to include content for a specific URL or range of URLs, with options to change how that content is crawled. Any combination of the following three options is available for inclusion rules:

  • Follow links at the URL for the start address without crawling the content at that URL. This option is useful for sites where a page links to relevant content but the page containing the links holds irrelevant information.

  • Crawl complex URLs. This option crawls URLs that contain complex characters. Depending upon the site, these URLs might or might not include relevant content. Because complex URLs can often redirect to irrelevant sites, it is a good idea to only enable this option on sites where the content available from complex URLs is known to be relevant.

  • Crawl content in SharePoint sites as HTTP.

Regardless of whether a crawl rule includes or excludes content, administrators have the option of changing the crawling account for the rule. The default content access account is used unless another account is specified in a crawl rule. The main reason to use a different crawling account for a crawl rule is that the default content access account does not have access to all start addresses. For those start addresses, you can create a crawl rule and select an account that does have access.
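The pieces of a crawl rule described above (a URL or wildcard pattern, an inclusion or exclusion flag, and an optional crawling account) can be modeled in a few lines to test a planned rule set against sample URLs. This is a sketch under assumptions, not the product's matching logic: the names `CrawlRule`, `evaluate`, and `DEFAULT_ACCOUNT`, the account names, and the URLs are hypothetical. It also assumes that rules are checked in order with the first match winning, which the text above does not state. Python's `fnmatchcase` treats `?` and `[` as wildcard characters too, so it only approximates URL wildcards.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional

DEFAULT_ACCOUNT = "DOMAIN\\svc-crawl"  # hypothetical default content access account


@dataclass(frozen=True)
class CrawlRule:
    pattern: str                   # URL or wildcard pattern
    include: bool                  # True = inclusion rule, False = exclusion rule
    account: Optional[str] = None  # None = use the default content access account


def evaluate(url: str, rules: list) -> tuple:
    """Return (should_crawl, account) for url.

    Assumption: rules are checked in order and the first match wins;
    a URL that matches no rule is crawled with the default account.
    """
    for rule in rules:
        if fnmatchcase(url.lower(), rule.pattern.lower()):
            if not rule.include:
                return False, None
            return True, rule.account or DEFAULT_ACCOUNT
    return True, DEFAULT_ACCOUNT


rules = [
    CrawlRule("http://portal.example.com/archive/*", include=False),
    CrawlRule("http://partner.example.net/*", include=True, account="DOMAIN\\svc-partner"),
]

archived = evaluate("http://portal.example.com/archive/2003/old.aspx", rules)
partner = evaluate("http://partner.example.net/specs/a.htm", rules)
ordinary = evaluate("http://portal.example.com/sites/hr/", rules)
```

In this example the archive URL is excluded, the partner URL is crawled with its own account, and every other URL falls back to the default content access account.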

A good practice for the initial deployment is to use crawl rules to focus crawled content on the concepts and business processes that matter most to your organization, as identified in the information architecture. Because crawling content consumes resources and bandwidth, it is better to include a smaller amount of content that you know is relevant than a larger amount of content that might be irrelevant. After the initial deployment, you can review the query and crawl logs and adjust content sources and crawl rules to improve relevance and include more content.

Plan file-type inclusions

Content is only crawled if the relevant file extension is included in the file-type inclusions list. Several file types are included automatically during initial installation. When you plan for content sources in your initial deployment, it's a good idea to check whether any major content uses file types that are not included. If so, add those file types. If certain file types contain mostly irrelevant content, you can decide to delete the file-type inclusion for that extension, which will exclude file names that have that extension from crawls.

When you add file types, you must also ensure that you have an IFilter that can be used to crawl the file type. IFilters for several file types are available from third-party vendors, and, if necessary, software developers can create IFilters for new file types.
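The two checks described above (is the extension in the file-type inclusions list, and is an IFilter available for it) can be run against a sample of file names during planning. The following is a minimal sketch. The function `planning_gaps`, the sample file names, and both extension sets are hypothetical and do not reflect the product's actual default lists.

```python
import os


def planning_gaps(file_names, included_file_types, available_ifilters):
    """Return (extensions not in the inclusions list, extensions with no IFilter)."""
    extensions = {os.path.splitext(name)[1].lstrip(".").lower() for name in file_names}
    extensions.discard("")  # ignore names that have no extension
    not_included = sorted(e for e in extensions if e not in included_file_types)
    no_ifilter = sorted(e for e in extensions if e not in available_ifilters)
    return not_included, no_ifilter


# Hypothetical planning data.
sample_files = ["budget.xls", "handbook.doc", "contract.pdf", "floorplan.DWG", "README"]
included = {"doc", "xls", "ppt", "htm", "html", "txt"}
ifilters = {"doc", "xls", "ppt", "htm", "html", "txt", "pdf"}

not_included, no_ifilter = planning_gaps(sample_files, included, ifilters)
```

In this sample, `pdf` and `dwg` would need to be added to the inclusions list, and `dwg` would also need an IFilter before its content could be crawled.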

Plan crawling considerations for server farms

In addition to the settings that are configured at the SSP level, several settings managed by server farm administrators affect how content sources are crawled. Consider these settings while planning for crawling.

Farm-level settings that affect crawling include:

  • Farm-level search settings.

  • Crawler impact rules.

  • Farm services.

  • Shared services for multiple farm deployments.

Plan farm-level search settings

Farm-level search settings include the following settings:

  • Contact e-mail address

  • Proxy settings

  • Time-out settings

  • SSL settings

The contact e-mail address is the address of the person to contact about the impact created by crawling content sources. This address appears in logs for administrators of the servers containing start addresses, so that those administrators can contact someone if the impact of crawling on their performance and bandwidth is too high, or if other issues occur. The contact e-mail address should belong to a person or a well-monitored alias with the necessary expertise and availability to respond quickly to requests. Quick response time is important regardless of whether the crawled content is stored inside or outside the organization.

Proxy settings include the proxy server to use when crawling content. The proxy server to use depends upon the topology of your SharePoint deployment and the architecture of other servers in your organization. The time-out settings are used to limit the time that the search server waits while connecting to other services. The SSL settings determine whether the SSL certificate must exactly match in order to crawl content.

Plan crawler impact rules

You use crawler impact rules to manage the load on crawled servers. Crawler impact rules limit how often you request documents from a site while crawling, or how many documents you request at a time.

For content within your organization, you can coordinate with administrators of other sites to set crawler impact rules based on the performance and capacity of the servers. For most external sites, this coordination is not possible, so the best practice is to crawl too little rather than crawl too much and risk losing access to crawl the relevant content.

During initial deployment, set the crawler impact rules to make as small an impact on other servers as possible while still crawling enough content frequently enough to make the crawling worthwhile.

During operations, you can adjust crawler impact rules based on your experiences and data from crawl logs.
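To compare the two kinds of limits a crawler impact rule can impose (how many documents are requested at a time, or how often documents are requested), a rough estimate of crawl duration can help during planning. The sketch below is illustrative only. The `ImpactRule` class, its field names, the site names, and the timing figures are hypothetical, and the estimate ignores network and server variability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactRule:
    site: str
    simultaneous_requests: int = 0  # documents requested at a time; 0 = not used
    delay_seconds: float = 0.0      # pause after each single request; 0 = not used

    def __post_init__(self):
        # Exactly one of the two limits should be set.
        if (self.simultaneous_requests > 0) == (self.delay_seconds > 0):
            raise ValueError("set either simultaneous_requests or delay_seconds")


def estimated_crawl_seconds(rule, documents, seconds_per_document):
    """Rough estimate of how long one site takes to crawl under an impact rule."""
    if rule.delay_seconds > 0:
        # One document at a time, pausing after each request.
        return documents * (seconds_per_document + rule.delay_seconds)
    batches = -(-documents // rule.simultaneous_requests)  # ceiling division
    return batches * seconds_per_document


# Hypothetical rules: gentle on an external site, faster on an internal server.
external = ImpactRule("partner.example.net", delay_seconds=2.0)
internal = ImpactRule("fileserver", simultaneous_requests=8)

external_estimate = estimated_crawl_seconds(external, 1000, 0.5)
internal_estimate = estimated_crawl_seconds(internal, 1000, 0.5)
```

Under these assumed figures, the gentle external rule takes 2,500 seconds for 1,000 documents and the internal rule takes 62.5 seconds, which shows the trade-off between impact on the crawled server and crawl freshness.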

Plan for multiple server farm deployments

Larger organizations often plan deployments with multiple server farms based on security or architecture considerations. For example, an organization might use one farm for producing content and another farm for publishing content on the Internet. Other examples include a geographically distributed deployment with farms for each major subsidiary, or an additional farm for a confidential or sensitive project that must be kept distinct from other projects.

If you have more than one farm, you will have to plan for how shared services are configured across farms. On the Application Management page for Central Administration for each farm, in the Office SharePoint Server Shared Services section, you can select the option to grant or configure shared services between farms. You can configure each farm to use one of three options:

  • Do not participate in shared services between farms: Farms using this option do not participate in interfarm shared services, and rely upon the shared services of an SSP on the same farm. This is the typical configuration for a small or medium organization that has a small deployment using a single server farm.

  • Provide shared services to other farms: Farms providing services to other farms are designed to manage interfarm shared services in large organizations, and typically have a greater capacity than farms for smaller divisions or organizations.

  • Consume shared services from another farm: Farms that consume shared services are typically divisional farms running divisional portal sites or small-scale business applications. These farms can also have their own SSPs, so that when the central SSP is not available, they can use the services available on the local farm.
