Sitemap Builder
This folder contains the sitemap crawler used by the combined web application in the repository root.
The crawler can still be used directly from Python, but the primary supported experience is now the shared Streamlit interface in the root project:
../app.py
Current Role In The Combined App
The root application uses this module to:
- crawl a site from a submitted starting URL
- discover internal URLs from HTML links and XML sitemaps
- export a sitemap CSV
- save crawl state and crawl logs for resume support
Output
The crawler writes:
- a CSV file
- a sidecar crawl state file ending in .crawlstate.json
- a crawl log file ending in .crawl.log
The CSV contains these columns:
- URL
- Title
- Canonical URL
- Type
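As a rough illustration of consuming the export, the sketch below reads a sitemap CSV with Python's standard csv module and counts rows per Type value. It assumes the header row uses exactly the names URL, Title, Canonical URL, and Type. The helper name count_by_type and the sample rows (including the Type values "page" and "document") are hypothetical and are not taken from the crawler.

```python
import csv
import io
from collections import Counter

# Assumed header names, based on the column list in this README.
EXPECTED_COLUMNS = ["URL", "Title", "Canonical URL", "Type"]


def count_by_type(csv_file):
    """Count rows per value of the Type column in a sitemap CSV (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["Type"] for row in reader)


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "URL,Title,Canonical URL,Type\n"
    "https://example.com/,Home,https://example.com/,page\n"
    "https://example.com/a.pdf,,,document\n"
)
counts = count_by_type(sample)
```

For a real export, open the file with newline="" (as the csv module documentation recommends) and pass the file object to the same function. The file encoding is not stated here, so check it before relying on a particular one.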
Standalone CLI Usage
Interactive mode:
python3 sitemap_builder.py
Command line mode:
python3 sitemap_builder.py https://example.com -o ./sitemap.csv
On Windows:
python .\sitemap_builder.py https://example.com -o .\sitemap.csv
Useful Options
python3 sitemap_builder.py https://example.com --max-pages 20000 --delay 0.25 --include-subdomains
- --max-pages: stop after the given number of visited pages. Default: 10000
- --delay: wait between requests to reduce load on the site
- --timeout: request timeout in seconds
- --include-subdomains: crawl subdomains of the starting host
- --include-documents: include document links such as PDF, CSV, DOC, DOCX, XLSX, and similar files
- --workers: number of worker threads to use. Set 1 to disable multithreading. Default: all CPUs visible to the current machine or container
- --save-every: save progress after every N pages. Default: 25
- --resume: resume from an existing state file
- --fresh: ignore the existing state file and start over
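To make the interaction between --max-pages, --delay, and --save-every easier to picture, here is a minimal single-threaded sketch. It is not the crawler's actual implementation, which may use worker threads and a different state format. The names crawl_loop, fetch, and save_state are hypothetical, and treating the delay as seconds is an assumption.

```python
import time
from collections import deque


def crawl_loop(urls, fetch, save_state, max_pages=10000, delay=0.0,
               save_every=25, sleep=time.sleep):
    """Visit queued URLs in order, saving progress periodically (illustrative only)."""
    queue = deque(urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        fetch(url)
        visited.append(url)
        # Checkpoint after every save_every visited pages.
        if len(visited) % save_every == 0:
            save_state(list(visited), list(queue))
        # Assumption: the delay is a pause in seconds between requests.
        if delay:
            sleep(delay)
    # Final checkpoint so a later run could resume from the remaining queue.
    save_state(list(visited), list(queue))
    return visited


# Demo with stand-in callbacks: no network access and no real sleeping.
saves = []
pages = [f"https://example.com/page{i}" for i in range(60)]
visited = crawl_loop(
    pages,
    fetch=lambda url: None,
    save_state=lambda v, q: saves.append(len(v)),
    max_pages=50,
    delay=0.25,
    sleep=lambda seconds: None,
)
```

In this demo the loop stops at 50 pages even though 60 are queued, and progress is saved at 25 and 50 pages plus once more on exit.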
Discovery And Behavior
- The crawler checks robots.txt for sitemap references and also tries /sitemap.xml
- XML sitemap URLs are added to the crawl queue before page crawling begins
- HTML pages store page title and canonical URL in the CSV when available
- On Windows CLI runs, P pauses, R resumes, and Q stops cleanly and saves progress
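The robots.txt discovery step can be sketched roughly as follows. This is an illustration under assumptions, not the crawler's own code: the function name sitemap_candidates is hypothetical, and the robots.txt text is passed in as a string so the example runs without network access.

```python
from urllib.parse import urljoin


def sitemap_candidates(start_url, robots_txt):
    """Return sitemap URLs named in robots.txt, plus the /sitemap.xml fallback."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first colon only.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(urljoin(start_url, value.strip()))
    # Always try the conventional location as well.
    fallback = urljoin(start_url, "/sitemap.xml")
    if fallback not in found:
        found.append(fallback)
    return found


# Made-up robots.txt content, for illustration only.
robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap_index.xml
"""
candidates = sitemap_candidates("https://example.com/docs/", robots)
```

Each candidate would then be fetched and parsed as XML, and the URLs it lists added to the queue before page crawling starts.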
Recommendation
For normal use, run the root application or Docker container instead of calling this script directly. That is now the intended user interface for this repository.