
Sitemap Builder

This folder contains the sitemap crawler used by the combined web application in the repository root.

The crawler can still be used directly from Python, but the primary supported experience is now the shared Streamlit interface in the root project:

../app.py

Current Role In The Combined App

The root application uses this module to:

  • crawl a site from a submitted starting URL
  • discover internal URLs from HTML links and XML sitemaps
  • export a sitemap CSV
  • save crawl state and crawl logs for resume support
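As a rough illustration of what "internal" can mean here, the sketch below compares a candidate URL's host with the host of the starting URL. The function name is_internal and its parameters are hypothetical and are not taken from sitemap_builder.py; the actual module may apply additional rules (for example, scheme or path filters).

```python
from urllib.parse import urlsplit


def is_internal(url: str, start_url: str, include_subdomains: bool = False) -> bool:
    """Return True if `url` is on the same host as `start_url` (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    start_host = (urlsplit(start_url).hostname or "").lower()
    # URLs without a host (mailto:, relative fragments, etc.) are never internal.
    if not host or not start_host:
        return False
    if host == start_host:
        return True
    # Subdomain match: "blog.example.com" ends with ".example.com".
    # The leading dot prevents "notexample.com" from matching "example.com".
    return include_subdomains and host.endswith("." + start_host)
```

Matching on the hostname rather than on a raw string prefix avoids treating look-alike domains as part of the site.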

Output

The crawler writes:

  • a CSV file
  • a sidecar crawl state file ending in .crawlstate.json
  • a crawl log file ending in .crawl.log

The CSV contains these columns:

  • URL
  • Title
  • Canonical URL
  • Type
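For downstream processing, the CSV can be read with Python's standard csv module. The sketch below counts rows per value of the Type column. It assumes the file is UTF-8 encoded and that the header row uses exactly the column names listed above; the function name summarize_sitemap is illustrative only.

```python
import csv
from collections import Counter


def summarize_sitemap(csv_path: str) -> Counter:
    """Count rows per value of the Type column (illustrative helper)."""
    # newline="" lets the csv module handle line endings itself.
    # UTF-8 is an assumption about how the crawler writes the file.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row["Type"] for row in reader)
```

Calling summarize_sitemap("./sitemap.csv") returns a Counter keyed by whatever Type values the crawler wrote.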

Standalone CLI Usage

Interactive mode:

python3 sitemap_builder.py

Command line mode:

python3 sitemap_builder.py https://example.com -o ./sitemap.csv

On Windows:

python .\sitemap_builder.py https://example.com -o .\sitemap.csv

Useful Options

Example:

python3 sitemap_builder.py https://example.com --max-pages 20000 --delay 0.25 --include-subdomains

  • --max-pages: stop after the given number of visited pages. Default: 10000
  • --delay: wait between requests to reduce load on the site
  • --timeout: request timeout in seconds
  • --include-subdomains: crawl subdomains of the starting host
  • --include-documents: include document links such as PDF, CSV, DOC, DOCX, XLSX, and similar files
  • --workers: number of worker threads to use. Set to 1 to disable multithreading. Default: all CPUs visible to the current machine or container
  • --save-every: save progress after every N pages. Default: 25
  • --resume: resume from an existing state file
  • --fresh: ignore the existing state file and start over
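If the script needs to be launched from another Python program, the flags above can be assembled into an argument list. The sketch below covers only a subset of the documented options; build_command is a hypothetical helper, and the script path is assumed to be relative to the current working directory.

```python
import sys


def build_command(start_url, output=None, max_pages=None, delay=None,
                  include_subdomains=False, resume=False):
    """Assemble an argv list for sitemap_builder.py (hypothetical helper)."""
    # sys.executable reuses the interpreter that is running this code.
    cmd = [sys.executable, "sitemap_builder.py", start_url]
    if output is not None:
        cmd += ["-o", output]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if include_subdomains:
        cmd.append("--include-subdomains")
    if resume:
        cmd.append("--resume")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single shell string avoids quoting problems with URLs and paths.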

Discovery And Behavior

  • The crawler checks robots.txt for sitemap references and also tries /sitemap.xml
  • XML sitemap URLs are added to the crawl queue before page crawling begins
  • HTML pages store page title and canonical URL in the CSV when available
  • On Windows CLI runs, pressing P pauses the crawl, R resumes it, and Q saves progress and stops cleanly
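The sitemap discovery steps above can be sketched roughly as follows. This is not the crawler's actual code: the function names are hypothetical, the robots.txt and XML content are passed in as strings so that no network access is needed, and the real implementation may handle sitemap index files, compression, or malformed XML differently.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def sitemap_candidates(start_url: str, robots_txt: str) -> list[str]:
    """Collect sitemap URLs named in robots.txt, plus the /sitemap.xml fallback."""
    candidates = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own "https:" stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    fallback = urljoin(start_url, "/sitemap.xml")
    if fallback not in candidates:
        candidates.append(fallback)
    return candidates


def sitemap_locations(xml_text: str) -> list[str]:
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
```

In a crawler along these lines, each URL returned by sitemap_locations would be added to the queue before page crawling starts, which matches the ordering described in the list above.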

Recommendation

For normal use, run the root application or Docker container instead of calling this script directly. That is now the intended user interface for this repository.