How to reduce training characters when scanning a website

Last updated: February 4, 2026

When training your chatbot on a website, limiting the scan to the most useful data reduces character usage, speeds up training, and improves answer quality by focusing only on relevant content.

This guide explains how to optimize your scan using the Advanced settings in the website training modal (Training > Links > Add New Training: Website).


Why Less Training Can Be Better

ChatLab uses a RAG (Retrieval-Augmented Generation) architecture. Instead of memorizing all website content, the AI searches for the most relevant pieces of information at the time of each user question and uses them to generate an answer.

The Problem with Too Much Data

More data doesn't always mean better answers. Adding irrelevant or redundant content can confuse the retrieval process:

  • RAG scores chunks of content for relevance to the user's question. If your site contains many pages with repetitive, vague, or generic text (menus, footers, press releases, blog tag pages), they dilute the quality of search results.
  • When the chatbot retrieves multiple low-value chunks that share keywords but lack helpful context, it may generate inaccurate, overly general, or off-topic answers.
  • For example: if your footer contains links like "Home", "Contact", "Privacy", and "FAQ" on every page, the bot might answer "How can I contact you?" with a generic paragraph from the footer instead of an actual phone number or email address.
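To make the dilution problem concrete, here is a toy keyword-overlap score in Python. This is purely illustrative and not ChatLab's actual retrieval algorithm (real RAG systems use semantic embeddings and other signals), but it shows why chunks that merely share keywords with a question can compete with the genuinely useful chunk:

```python
import re

def overlap_score(question: str, chunk: str) -> float:
    """Fraction of the question's words that also appear in the chunk."""
    q = set(re.findall(r"\w+", question.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q)

question = "How can I contact you?"
footer_chunk = "Home | Contact | Privacy | FAQ"                  # generic footer
support_chunk = "You can contact our support team by email or phone."

print(overlap_score(question, footer_chunk))    # footer still scores > 0
print(overlap_score(question, support_chunk))   # support page scores higher
```

The support chunk wins here, but if the footer is scanned once per page, dozens of identical footer chunks end up in the index, all scoring above zero, and they can crowd the genuinely helpful chunk out of the retrieved context.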

High Signal, Low Noise = Better Answers

By scanning only useful content, you help the AI retrieve clear, specific, high-quality context for each question. Focus on:

  • Product descriptions
  • FAQ pages
  • Support articles
  • Policy documents
  • Key landing pages

Use Sitemap Instead of Full Website Scan

Recommended: Use the sitemap option whenever possible.

A full website scan follows all visible links on each page, including links from footers, menus, and sidebars, which often leads to scanning repetitive or low-value content. These sections typically appear on every page and rarely contain unique information useful for chat answers.

By using a sitemap (usually located at https://yourdomain.com/sitemap.xml), you:

  • Avoid crawling unnecessary pages (e.g., legal pages, social links, repeated blog tags)
  • Control exactly which URLs are scanned
  • Prevent accidental overuse of characters from menus, footers, and headers
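The advantage of a sitemap is that it is an explicit list of URLs rather than whatever the crawler happens to stumble into. The sketch below, using only Python's standard library and a made-up sitemap, shows the kind of URL list a sitemap-based scan starts from:

```python
# Illustrative sketch: extracting page URLs from a sitemap instead of
# following on-page links. The sitemap XML below is a made-up example.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/help/getting-started</loc></url>
  <url><loc>https://example.com/products/widget</loc></url>
  <url><loc>https://example.com/privacy</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)
```

Because the scan works from this fixed list, nothing outside it (footer links, tag archives, social profiles) is ever visited.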

Advanced Settings Overview

To access advanced settings, click Add New Training: Website in the Training tab, then expand the Advanced settings section. The settings are organized into four categories:

  • URL Filtering - Control which URLs to include or exclude
  • Query Parameters - Handle URL parameters
  • Content Filtering - Filter specific page elements
  • Scraping Behavior - Configure how content is extracted

URL Filtering


Include Only Specific URLs

Field: Advanced Settings > URL Filtering > Include only URLs that contain

Add semicolon-separated keywords to include only specific URLs in the scan.

Example: To scan only English help pages:

/en;/help

This scans only pages whose URL contains /en or /help, saving characters by skipping irrelevant sections.

Exclude Irrelevant URLs

Field: Advanced Settings > URL Filtering > Exclude URLs that contain

Add semicolon-separated keywords to skip sections like blogs, images, or news.

Example:

/news;/pictures;/press

This avoids scanning content-heavy but chatbot-irrelevant sections like news or image galleries.
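The two URL filters above can be pictured as a simple keyword test applied to every candidate URL. The sketch below is an assumption about the behavior (plain substring matching on semicolon-separated keywords), not ChatLab's exact implementation:

```python
# Illustrative sketch of include/exclude URL filtering with
# semicolon-separated keywords, assuming simple substring matching.

def should_scan(url: str, include: str = "", exclude: str = "") -> bool:
    inc = [k for k in include.split(";") if k]
    exc = [k for k in exclude.split(";") if k]
    if inc and not any(k in url for k in inc):
        return False                            # not on the include list
    return not any(k in url for k in exc)       # drop excluded sections

print(should_scan("https://example.com/en/help/billing", include="/en;/help"))
print(should_scan("https://example.com/news/2026-launch",
                  exclude="/news;/pictures;/press"))
```

Note that an empty include list means "include everything not excluded", so the two fields can be used independently or together.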


Query Parameters


Ignore All Query Parameters

Field: Advanced Settings > Query Parameters > Ignore all query parameters

Enable this checkbox to treat URLs with different query parameters as the same page. This prevents scanning duplicate content.

Ignore Specific Query Parameters

Field: Advanced Settings > Query Parameters > Ignore specific query parameters

Add semicolon-separated parameter names to ignore specific tracking or filtering parameters.

Example:

ref;source;campaign;utm_source

This treats page.html?ref=email and page.html?ref=social as the same URL.
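Conceptually, ignoring a parameter means normalizing each URL by stripping that parameter before deciding whether the page was already scanned. A sketch of that normalization with Python's standard library (the parameter list matches the example above; the code itself is illustrative):

```python
# Illustrative sketch: strip ignored query parameters so URL variants
# collapse to one canonical page.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED = {"ref", "source", "campaign", "utm_source"}

def normalize(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
    return urlunsplit(parts._replace(query=urlencode(kept)))

a = normalize("https://example.com/page.html?ref=email")
b = normalize("https://example.com/page.html?ref=social")
print(a == b)  # both variants collapse to the same URL
```

With the tracking parameter removed, both variants map to the same canonical URL, so the page is scanned (and its characters counted) only once.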


Content Filtering


Include Specific Elements

Field: Advanced Settings > Content Filtering > Include element IDs

Add semicolon-separated element identifiers to scan only specific parts of the page.

Example:

#main-content;.article-body

This scans only the main content area and article bodies, ignoring everything else.

Follow Only Links in Included Elements

Field: Advanced Settings > Content Filtering > Follow only links in included elements

This toggle appears when you specify Include element IDs. When enabled, the crawler will only follow links found within the included elements. This is useful when you want to limit crawling to specific sections of the website, such as article content areas, while ignoring navigation menus in headers or footers.

Default: Disabled (all links on the page are followed)

Exclude Page Elements

Field: Advanced Settings > Content Filtering > Exclude element IDs

Add semicolon-separated element identifiers to prevent scanning of headers, footers, and other layout elements.

Example:

header;footer;#navigation;.sidebar

This removes HTML blocks that are repeated on every page and generally contain no useful content for your chatbot.
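To show what element exclusion does to a page, here is a rough standard-library sketch that drops text inside excluded tags, IDs, and classes (the same selectors as the example above). ChatLab's real scraper is more sophisticated; this simplified parser also ignores edge cases like unclosed void tags:

```python
# Illustrative sketch: extract page text while skipping excluded elements.
from html.parser import HTMLParser

EXCLUDED_TAGS = {"header", "footer"}
EXCLUDED_IDS = {"navigation"}
EXCLUDED_CLASSES = {"sidebar"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside an excluded element
        self.chunks = []

    def _is_excluded(self, tag, attrs):
        attrs = dict(attrs)
        return (tag in EXCLUDED_TAGS
                or attrs.get("id") in EXCLUDED_IDS
                or set((attrs.get("class") or "").split()) & EXCLUDED_CLASSES)

    def handle_starttag(self, tag, attrs):
        if self.depth or self._is_excluded(tag, attrs):
            self.depth += 1     # track nesting inside excluded regions

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())

html = """<header>Home | Contact | Privacy</header>
<main>Our return policy lasts 30 days.</main>
<div class="sidebar">Recent posts</div>
<footer>© Example Inc.</footer>"""

parser = TextExtractor()
parser.feed(html)
print(parser.chunks)
```

Only the main content survives; the header, sidebar, and footer text that would otherwise repeat on every scanned page never reaches the training data.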

Follow Links in Excluded Elements

Field: Advanced Settings > Content Filtering > Follow links in excluded elements

This toggle appears when you specify Exclude element IDs. When enabled, the crawler will still follow links found within excluded elements (like navigation menus in headers) even though the content itself is excluded from training.

Default: Disabled (links in excluded elements are not followed)

Use case: Enable this when you want to exclude header/footer content from training but still need the crawler to discover pages through navigation links in those areas. This gives you the best of both worlds: clean training data without menu clutter, while still crawling your entire site.

Important: If you use full website scan and exclude elements like header without enabling "Follow links in excluded elements", the scanner may not discover deeper links (menus are often inside headers). In this case, either enable the toggle or use a sitemap for the safest and most efficient method.


Scraping Behavior


Scrape Documents

Field: Advanced Settings > Scraping Behavior > Scrape documents

When enabled, the system extracts and processes document content found on the website. Currently supports PDF files.

Exclude Image Links

Field: Advanced Settings > Scraping Behavior > Exclude image links

Enable this toggle to exclude all image links from the training dataset. By default, the system extracts every image link it finds, which can significantly increase character count; excluding them can substantially reduce it.

Scrape Hidden Contents

Field: Advanced Settings > Scraping Behavior > Scrape hidden contents

When enabled (default), the system extracts content from hidden HTML elements (elements with display:none, visibility:hidden, or similar CSS properties).

Default: Enabled

When to disable: Disable this option to skip hidden content and only scrape visible elements. This is useful when websites contain hidden content like:

  • Collapsed accordion sections
  • Mobile-only or desktop-only content
  • Hidden form fields
  • Developer comments or debug information

Disabling this can reduce character count and prevent irrelevant hidden content from entering your training data.

Delay Between Pages

Field: Advanced Settings > Scraping Behavior > Delay between pages

Add a delay (in seconds) between scanning each page to prevent overloading your server. Useful for websites with rate limiting.

Simulate Visit From Specific Country

Field: Advanced Settings > Scraping Behavior > Country code

If your site shows different content depending on visitor location, enter a 2-letter country code (e.g., US, PL, DE) to simulate scanning from that location.


Recommended Configuration for Most Websites

For most websites, the following configuration provides the best balance between comprehensive content and reduced character count:

  1. Use sitemap when available for precise URL control
  2. Exclude element IDs: header;footer;#navigation;.sidebar
  3. Enable "Exclude image links" to remove image URLs from training
  4. Disable "Scrape hidden contents" if your site has many hidden elements
  5. Consider "Follow links in excluded elements" if using full scan with excluded headers

This configuration typically reduces character count by 20-40% while keeping all valuable content.


Summary: Best Practices

  • Use sitemap instead of full scan for precise control
  • Include URLs to target only valuable content
  • Exclude URLs to skip irrelevant or heavy pages
  • Ignore query parameters to avoid duplicate content
  • Exclude element IDs to remove repeated page parts like headers and footers
  • Control link following with "Follow only links in included elements" or "Follow links in excluded elements" for fine-grained crawl control
  • Exclude image links to reduce character count significantly
  • Disable scrape hidden contents if your site has unnecessary hidden elements
  • Focus on useful content for your bot (FAQs, product pages, support articles)