Website Crawling

Learn how DocuChat discovers, filters, and ingests content from websites using its advanced web crawler and AI-powered page filtering.

DocuChat's web crawler can discover and extract content from entire websites. This guide explains how to use the website crawling features to add website content to your chatbots.

Adding Website URLs

To add website content, select the Website option when adding a new document. Enter one or more URLs (one per line) and DocuChat will extract the content from those pages.
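If you keep your URLs in a text file or spreadsheet, it can help to clean the list before pasting it in. The following is a minimal sketch, not part of DocuChat: the function name `parse_url_lines` and the rule of keeping only absolute http(s) URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse


def parse_url_lines(text: str) -> list[str]:
    """Split pasted text into a de-duplicated list of http(s) URLs, one per line."""
    urls: list[str] = []
    seen: set[str] = set()
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parsed = urlparse(candidate)
        # Assumption: only absolute http(s) URLs with a host name are useful as input.
        is_web_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        if is_web_url and candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls


pasted = """
https://example.com
https://example.com/pricing

https://example.com
not a url
"""
print(parse_url_lines(pasted))
```

The duplicate entry, the blank lines, and the non-URL line are dropped, leaving one URL per line ready to paste.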

Discovering Linked Pages

Enable the "Include all webpages linked from these websites" toggle to automatically discover pages linked from the URLs you provide. DocuChat crawls each seed URL and returns all internal links it finds.

How Discovery Works

  1. You enter a seed URL (e.g., https://example.com)
  2. DocuChat crawls that page and extracts all internal links
  3. The discovered pages are presented in a list for you to review
  4. You choose which pages to include before processing

Only internal links (same domain) are discovered. External links to other websites are excluded.
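To make the idea of internal-link discovery concrete, here is a rough sketch of the general technique using only Python's standard library. It is not DocuChat's crawler. The names `LinkCollector` and `internal_links` are hypothetical, and the sketch assumes "same domain" means an exact host match, which may differ from how DocuChat treats subdomains.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(seed_url: str, html: str) -> list[str]:
    """Return unique absolute links on the same host as seed_url, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    seed_host = urlparse(seed_url).netloc.lower()
    found: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links against the seed and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(seed_url, href))
        parsed = urlparse(absolute)
        # Assumption: "internal" means http(s) and exactly the same host.
        same_host = parsed.netloc.lower() == seed_host
        if parsed.scheme in ("http", "https") and same_host and absolute not in found:
            found.append(absolute)
    return found


sample_html = """
<a href="/pricing">Pricing</a>
<a href="https://example.com/docs#intro">Docs</a>
<a href="https://twitter.com/example">Twitter</a>
<a href="mailto:hi@example.com">Email</a>
"""
print(internal_links("https://example.com", sample_html))
```

In this sample, the relative link and the same-host link are kept, while the external link and the mailto link are excluded.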

Selecting Pages

After discovery, you have full control over which pages get processed:

  • Select/deselect individual pages using the checkboxes next to each page
  • Select or deselect all pages at once using the header checkbox
  • Search through discovered pages by name or URL to quickly find specific pages
  • Pages are displayed with readable names derived from link text when available, with the full URL shown below

Each batch is limited to 250 linked pages. If your website has more pages, you can process them in multiple batches.
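If you plan a large import ahead of time, the 250-page limit and the name-or-URL search can be mirrored in a small script. This is an illustrative sketch only: the helper names `search_pages` and `batches`, and the `name`/`url` dictionary keys, are assumptions rather than anything DocuChat exposes.

```python
from typing import Iterator

BATCH_LIMIT = 250  # per-batch cap on linked pages, as stated above


def search_pages(pages: list[dict], query: str) -> list[dict]:
    """Case-insensitive match against each page's name or URL."""
    needle = query.lower()
    return [p for p in pages if needle in p["name"].lower() or needle in p["url"].lower()]


def batches(urls: list[str], size: int = BATCH_LIMIT) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


urls = [f"https://example.com/page-{i}" for i in range(600)]
print([len(batch) for batch in batches(urls)])  # → [250, 250, 100]

pages = [
    {"name": "Pricing", "url": "https://example.com/pricing"},
    {"name": "Getting Started", "url": "https://example.com/docs/start"},
]
print(search_pages(pages, "docs"))
```

A site with 600 pages would therefore need three batches.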

AI-Powered Filtering

DocuChat offers an optional AI filter that can automatically identify and deselect pages that are unlikely to contain useful content for your chatbot.

Enabling the AI Filter

After pages are discovered, you can enable the AI Filter toggle. When enabled, the AI analyzes the discovered pages and deselects those it considers irrelevant, such as:

  • Login and registration pages
  • Terms of service, privacy policies, and cookie notices
  • Navigation and sitemap pages
  • Shopping carts and checkout pages
  • Social media profile links
  • Admin and settings pages

The AI filter is opt-in — it is not applied automatically. You always see all discovered pages first and can choose to apply the filter.
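To get a feel for the kinds of pages listed above, a plain rule-based stand-in can approximate the idea. This sketch is not DocuChat's AI filter and makes no claim about how it works: it swaps in simple URL pattern matching, and the patterns and the `preselect` function are hypothetical.

```python
import re

# Hypothetical URL patterns loosely matching the page types listed above.
LOW_VALUE_PATTERNS = [
    r"/(login|signin|sign-in|register|signup|sign-up)\b",
    r"/(terms|privacy|cookies?)\b",
    r"/sitemap\b",
    r"/(cart|checkout)\b",
    r"/(admin|settings)\b",
]
_LOW_VALUE = re.compile("|".join(LOW_VALUE_PATTERNS), re.IGNORECASE)


def preselect(urls: list[str]) -> dict[str, bool]:
    """Map each URL to True (keep selected) or False (deselect)."""
    return {url: _LOW_VALUE.search(url) is None for url in urls}


discovered = [
    "https://example.com/docs/getting-started",
    "https://example.com/login",
    "https://example.com/privacy-policy",
    "https://example.com/cartography",
]
for url, selected in preselect(discovered).items():
    print("[x]" if selected else "[ ]", url)
```

Pattern rules like these are brittle, which is one reason to treat any automatic filter as a suggestion and review the result yourself.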

Reviewing Filtered Results

After the AI filter runs:

  • Pages identified as irrelevant are automatically deselected
  • A summary shows how many pages were filtered out
  • You can toggle between the filtered and unfiltered view
  • You can manually re-select any page the AI deselected
  • You can manually deselect any page the AI kept

The AI filter is a starting point — you always have the final say on which pages to include.

Tips for Best Results

  • Start with your main pages: Enter your homepage or a key section page as the seed URL to discover the most relevant linked pages.
  • Use search: If hundreds of pages are discovered, use the search box to quickly find pages by keyword.
  • Review AI suggestions: The AI filter works well for most websites, but always review its suggestions — it may occasionally exclude pages you want to keep.
  • Process in batches: For very large websites, consider adding content section by section rather than crawling the entire site at once.
  • Sitemaps work too: You can enter a sitemap URL (e.g., https://example.com/sitemap.xml) as your seed URL to discover pages listed in the sitemap.
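To preview what a sitemap seed might yield, you can list the page URLs in a sitemap file yourself. The sketch below reads a standard sitemaps.org `urlset` document with Python's standard library. The function name `sitemap_urls` is an assumption, the sample XML is made up, and how DocuChat itself handles sitemaps (for example, sitemap index files) is not covered here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
print(sitemap_urls(sample))
```

Counting the entries this way also tells you in advance whether the site exceeds a single batch.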