first commit

2026-04-09 10:42:10 -07:00
commit ead872a0a5
19 changed files with 2783 additions and 0 deletions
@@ -0,0 +1,80 @@
+# Sitemap Builder
+
+This folder contains the sitemap crawler used by the combined web application in the repository root.
+
+The crawler can still be run directly from Python, but the primary supported interface is now the shared Streamlit app in the root project:
+
+```text
+../app.py
+```
+
+## Current Role In The Combined App
+
+The root application uses this module to:
+
+- crawl a site from a submitted starting URL
+- discover internal URLs from HTML links and XML sitemaps
+- export a sitemap CSV
+- save crawl state and crawl logs for resume support
+
+## Output
+
+The crawler writes:
+
+- a CSV file
+- a sidecar crawl state file ending in `.crawlstate.json`
+- a crawl log file ending in `.crawl.log`
+
+The CSV contains these columns:
+
+- `URL`
+- `Title`
+- `Canonical URL`
+- `Type`
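To illustrate how the exported CSV might be consumed downstream, here is a minimal Python sketch that tallies rows by the `Type` column. It assumes the file starts with a header row using exactly the four column names listed above. The helper name `count_by_type` and the sample rows (including the `page` and `document` values) are hypothetical and are not taken from the crawler.

```python
import csv
import io
from collections import Counter


def count_by_type(csv_file):
    """Count rows per value of the `Type` column in a sitemap CSV.

    Assumes a header row containing `URL`, `Title`, `Canonical URL`, `Type`.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row["Type"] for row in reader)


# Hypothetical sample data; the real `Type` values depend on the crawler.
sample = io.StringIO(
    "URL,Title,Canonical URL,Type\n"
    "https://example.com/,Home,https://example.com/,page\n"
    "https://example.com/a.pdf,,,document\n"
    "https://example.com/about,About,https://example.com/about,page\n"
)
counts = count_by_type(sample)
print(counts["page"], counts["document"])
```

For a file on disk, the same helper could be called on a handle opened with `open("sitemap.csv", newline="", encoding="utf-8")`; the encoding is an assumption, since the README does not state one.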
+
+## Standalone CLI Usage
+
+Interactive mode:
+
+```bash
+python3 sitemap_builder.py
+```
+
+Command line mode:
+
+```bash
+python3 sitemap_builder.py https://example.com -o ./sitemap.csv
+```
+
+On Windows:
+
+```powershell
+python .\sitemap_builder.py https://example.com -o .\sitemap.csv
+```
+
+## Useful Options
+
+```bash
+python3 sitemap_builder.py https://example.com --max-pages 20000 --delay 0.25 --include-subdomains
+```
+
+- `--max-pages`: stop after the given number of visited pages. Default: `10000`
+- `--delay`: wait between requests to reduce load on the site
+- `--timeout`: request timeout in seconds
+- `--include-subdomains`: crawl subdomains of the starting host
+- `--include-documents`: include document links such as PDF, CSV, DOC, DOCX, XLSX, and similar files
+- `--workers`: number of worker threads to use. Set to `1` to disable multithreading
+- `--save-every`: save progress after every N pages. Default: `25`
+- `--resume`: resume from an existing state file
+- `--fresh`: ignore the existing state file and start over
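When the script is driven from another Python program, the options above can be assembled into an argument list rather than a shell string. The sketch below is one way to do that; `build_command` is a hypothetical helper, not part of this module, and it covers only a subset of the documented flags.

```python
import sys


def build_command(url, output=None, max_pages=None, delay=None,
                  include_subdomains=False, resume=False):
    """Assemble an argument list for sitemap_builder.py (hypothetical helper)."""
    cmd = [sys.executable, "sitemap_builder.py", url]
    if output is not None:
        cmd += ["-o", output]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if include_subdomains:
        cmd.append("--include-subdomains")
    if resume:
        cmd.append("--resume")
    return cmd


cmd = build_command("https://example.com", output="./sitemap.csv",
                    max_pages=20000, delay=0.25, include_subdomains=True)
print(cmd[1:])
# To actually launch the crawl: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues on both POSIX and Windows, and `sys.executable` sidesteps the `python3` versus `python` difference shown in the usage examples.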
+
+## Discovery And Behavior
+
+- The crawler checks `robots.txt` for sitemap references and also tries `/sitemap.xml`
+- XML sitemap URLs are added to the crawl queue before page crawling begins
+- For HTML pages, the page title and canonical URL are written to the CSV when available
+- On Windows CLI runs, `P` pauses, `R` resumes, and `Q` stops cleanly and saves progress
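The sitemap discovery step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crawler's actual implementation: `sitemap_candidates` is a hypothetical name, the function takes the already-downloaded `robots.txt` text as input, and it performs no network requests.

```python
from urllib.parse import urljoin


def sitemap_candidates(start_url, robots_txt):
    """Return sitemap URLs named in robots.txt plus the /sitemap.xml fallback."""
    found = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own "://" stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Always try the conventional location on the starting host as well.
    fallback = urljoin(start_url, "/sitemap.xml")
    if fallback not in found:
        found.append(fallback)
    return found


robots = (
    "User-agent: *\n"
    "Disallow: /private\n"
    "Sitemap: https://example.com/sitemap_index.xml\n"
)
result = sitemap_candidates("https://example.com/blog/", robots)
print(result)
```

The returned list could then be fetched and parsed, with the URLs found inside added to the crawl queue before page crawling begins.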
+
+## Recommendation
+
+For normal use, run the root application or Docker container instead of calling this script directly. That is now the intended user interface for this repository.