
# Sitemap Builder

This folder contains the sitemap crawler used by the combined web application in the repository root.

The crawler can still be used directly from Python, but the primary supported experience is now the shared Streamlit interface in the root project:

```text
../app.py
```

## Current Role In The Combined App

The root application uses this module to:

- crawl a site from a submitted starting URL
- discover internal URLs from HTML links and XML sitemaps
- export a sitemap CSV
- save crawl state and crawl logs for resume support
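
As an illustration of what "internal URL" discovery can involve, here is a minimal Python sketch that resolves raw `href` values against the page they came from and keeps only same-site links. The function names `is_internal` and `resolve_links` are hypothetical and are not taken from `sitemap_builder.py`; the real crawler may normalise or filter URLs differently.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def is_internal(url, start_url, include_subdomains=False):
    """Return True if `url` is on the same host as `start_url`."""
    host = (urlparse(url).hostname or "").lower()
    start_host = (urlparse(start_url).hostname or "").lower()
    if not host or not start_host:
        return False
    if host == start_host:
        return True
    # Subdomains only count when explicitly enabled.
    return include_subdomains and host.endswith("." + start_host)


def resolve_links(page_url, hrefs, start_url, include_subdomains=False):
    """Resolve hrefs against the page URL and keep unique internal http(s) links."""
    kept = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: and similar
        if is_internal(absolute, start_url, include_subdomains) and absolute not in kept:
            kept.append(absolute)
    return kept
```

In this sketch, `include_subdomains` plays the same role as the `--include-subdomains` flag: a link to `blog.example.com` is dropped by default and kept only when the option is set.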

## Output

The crawler writes:

- a CSV file
- a sidecar crawl state file ending in `.crawlstate.json`
- a crawl log file ending in `.crawl.log`

The CSV contains these columns:

- `URL`
- `Title`
- `Canonical URL`
- `Type`
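
Because the column names are fixed, the CSV is easy to post-process. The sketch below counts rows per `Type` value and lists URLs whose canonical URL points elsewhere. It assumes the file is UTF-8 encoded and has a header row with exactly the column names listed above; the helper name `summarize_sitemap_csv` is hypothetical.

```python
import csv
from collections import Counter


def summarize_sitemap_csv(path):
    """Count rows per Type and collect URLs whose canonical URL differs."""
    type_counts = Counter()
    non_canonical = []
    # Assumption: UTF-8 encoding and a header row naming the four columns.
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            type_counts[row["Type"]] += 1
            canonical = (row["Canonical URL"] or "").strip()
            # An empty canonical value means none was recorded for the page.
            if canonical and canonical != row["URL"]:
                non_canonical.append(row["URL"])
    return type_counts, non_canonical
```

Calling `summarize_sitemap_csv("sitemap.csv")` returns a `Counter` keyed by whatever values the crawler wrote to the `Type` column, along with the list of non-canonical URLs.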

## Standalone CLI Usage

Interactive mode:

```bash
python3 sitemap_builder.py
```

Command line mode:

```bash
python3 sitemap_builder.py https://example.com -o ./sitemap.csv
```

On Windows:

```powershell
python .\sitemap_builder.py https://example.com -o .\sitemap.csv
```

## Useful Options

```bash
python3 sitemap_builder.py https://example.com --max-pages 20000 --delay 0.25 --include-subdomains
```

- `--max-pages`: stop after the given number of visited pages. Default: `10000`
- `--delay`: wait between requests to reduce load on the site
- `--timeout`: request timeout in seconds
- `--include-subdomains`: crawl subdomains of the starting host
- `--include-documents`: include document links such as PDF, CSV, DOC, DOCX, XLSX, and similar files
- `--workers`: number of worker threads to use. Set `1` to disable multithreading
- `--save-every`: save progress after every N pages. Default: `25`
- `--resume`: resume from an existing state file
- `--fresh`: ignore the existing state file and start over
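
If you want to launch the crawler from another Python script, one option is to assemble the argument list from the flags above and pass it to `subprocess.run`. The sketch below is only a convenience wrapper; `build_command` is a hypothetical helper, and rejecting `--resume` together with `--fresh` is an assumption of this sketch, not documented behavior of the script.

```python
import sys


def build_command(start_url, output=None, max_pages=None, delay=None,
                  workers=None, include_subdomains=False,
                  resume=False, fresh=False):
    """Build an argument list for sitemap_builder.py from the documented flags."""
    # Assumption: the two flags contradict each other, so refuse the combination.
    if resume and fresh:
        raise ValueError("choose either resume or fresh, not both")
    cmd = [sys.executable, "sitemap_builder.py", start_url]
    if output is not None:
        cmd += ["-o", str(output)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if include_subdomains:
        cmd.append("--include-subdomains")
    if resume:
        cmd.append("--resume")
    if fresh:
        cmd.append("--fresh")
    return cmd
```

The result can be run from this folder with `subprocess.run(build_command("https://example.com", output="sitemap.csv"), check=True)`. Passing a list instead of a shell string avoids quoting problems with URLs and paths that contain spaces.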

## Discovery And Behavior

- The crawler checks `robots.txt` for sitemap references and also tries `/sitemap.xml`
- XML sitemap URLs are added to the crawl queue before page crawling begins
- HTML pages store page title and canonical URL in the CSV when available
- During CLI runs on Windows, press `P` to pause, `R` to resume, or `Q` to stop cleanly and save progress
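
The sitemap discovery steps above can be sketched with the standard library alone. The functions below work on text that has already been fetched, so they make no network requests; `sitemap_candidates` and `sitemap_locs` are hypothetical names, and the real crawler may handle edge cases (sitemap indexes, compressed sitemaps, malformed XML) differently.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def sitemap_candidates(start_url, robots_txt):
    """Collect sitemap URLs named in robots.txt, plus the conventional /sitemap.xml."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    fallback = urljoin(start_url, "/sitemap.xml")
    if fallback not in found:
        found.append(fallback)
    return found


def sitemap_locs(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Compare local names so the sitemap XML namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locs.append(element.text.strip())
    return locs
```

Each URL returned by `sitemap_locs` would then be queued before page crawling starts. Note that `xml.etree.ElementTree` is not hardened against maliciously crafted XML, so treat this as a sketch for trusted or size-limited input.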

## Recommendation

For normal use, run the root application or Docker container instead of calling this script directly. That is now the intended user interface for this repository.