# Sitemap Builder

This folder contains the sitemap crawler used by the combined web application in the repository root.

The crawler can still be used directly from Python, but the primary supported experience is now the shared Streamlit interface in the root project:

```text
../app.py
```

## Current Role In The Combined App

The root application uses this module to:

- crawl a site from a submitted starting URL
- discover internal URLs from HTML links and XML sitemaps
- export a sitemap CSV
- save crawl state and crawl logs for resume support
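
As a rough illustration of what "internal" can mean when discovering URLs, the sketch below treats a URL as internal when its host matches the host of the starting URL, optionally allowing subdomains. The function name `is_internal_url` and its parameters are hypothetical and are not taken from `sitemap_builder.py`; the real crawler may normalize hosts differently (for example around `www.` prefixes).

```python
from urllib.parse import urlparse


def is_internal_url(url: str, start_url: str, include_subdomains: bool = False) -> bool:
    """Return True if `url` belongs to the same site as `start_url`.

    Hypothetical helper for illustration only; not the crawler's actual logic.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links

    host = (parsed.hostname or "").lower()
    start_host = (urlparse(start_url).hostname or "").lower()
    if not host or not start_host:
        return False

    if host == start_host:
        return True
    # Subdomain match: "blog.example.com" ends with ".example.com".
    return include_subdomains and host.endswith("." + start_host)


print(is_internal_url("https://example.com/about", "https://example.com"))
print(is_internal_url("https://blog.example.com/", "https://example.com"))
print(is_internal_url("https://blog.example.com/", "https://example.com",
                      include_subdomains=True))
```

Requiring the leading dot in the suffix check keeps look-alike hosts such as `notexample.com` from being treated as subdomains of `example.com`.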
## Output

The crawler writes:

- a CSV file
- a sidecar crawl state file ending in `.crawlstate.json`
- a crawl log file ending in `.crawl.log`

The CSV contains these columns:

- `URL`
- `Title`
- `Canonical URL`
- `Type`
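
A minimal sketch of reading the exported CSV from Python, assuming the file has a header row with exactly the four column names listed above. The sample rows, the `Type` values, and the helper name `count_by_type` are made up for illustration; the values the crawler actually writes may differ.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ["URL", "Title", "Canonical URL", "Type"]


def count_by_type(csv_file) -> Counter:
    """Count rows per `Type` value in a sitemap CSV opened as a text file."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["Type"] for row in reader)


# Invented sample data standing in for a real export.
sample = io.StringIO(
    "URL,Title,Canonical URL,Type\n"
    "https://example.com/,Home,https://example.com/,page\n"
    "https://example.com/a.pdf,,,document\n"
)
print(count_by_type(sample))
```

For a real export, pass a file opened with `open("sitemap.csv", newline="", encoding="utf-8")` instead of the in-memory sample.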
## Standalone CLI Usage

Interactive mode:

```bash
python3 sitemap_builder.py
```

Command line mode:

```bash
python3 sitemap_builder.py https://example.com -o ./sitemap.csv
```

On Windows:

```powershell
python .\sitemap_builder.py https://example.com -o .\sitemap.csv
```

## Useful Options

```bash
python3 sitemap_builder.py https://example.com --max-pages 20000 --delay 0.25 --include-subdomains
```

- `--max-pages`: stop after the given number of visited pages. Default: `10000`
- `--delay`: wait between requests to reduce load on the site
- `--timeout`: request timeout in seconds
- `--include-subdomains`: crawl subdomains of the starting host
- `--include-documents`: include document links such as PDF, CSV, DOC, DOCX, XLSX, and similar files
- `--workers`: number of worker threads to use. Set `1` to disable multithreading
- `--save-every`: save progress after every N pages. Default: `25`
- `--resume`: resume from an existing state file
- `--fresh`: ignore the existing state file and start over
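
When the script is driven from another Python program, the options above can be assembled into an argument list for `subprocess.run`. This is only a sketch: the helper `build_crawl_command` is hypothetical, it uses only flags documented in this README, and it assumes `sitemap_builder.py` is in the current working directory. Rejecting `--resume` together with `--fresh` is an assumption about intent, not documented behavior of the script.

```python
import subprocess
import sys


def build_crawl_command(start_url, output_csv, *, max_pages=None, delay=None,
                        workers=None, resume=False, fresh=False):
    """Assemble an argv list for sitemap_builder.py from documented flags."""
    if resume and fresh:
        # Assumption: resuming and starting over are treated as contradictory.
        raise ValueError("choose either resume or fresh, not both")
    cmd = [sys.executable, "sitemap_builder.py", start_url, "-o", output_csv]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if resume:
        cmd.append("--resume")
    if fresh:
        cmd.append("--fresh")
    return cmd


cmd = build_crawl_command("https://example.com", "./sitemap.csv",
                          max_pages=500, delay=0.5, workers=1, resume=True)
print(" ".join(cmd[1:]))
# To actually start the crawl, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.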
## Discovery And Behavior

- The crawler checks `robots.txt` for sitemap references and also tries `/sitemap.xml`
- XML sitemap URLs are added to the crawl queue before page crawling begins
- HTML pages store page title and canonical URL in the CSV when available
- On Windows CLI runs, `P` pauses, `R` resumes, and `Q` stops cleanly and saves progress
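
The sitemap discovery described in the first two points can be sketched with the standard library. This is an illustration, not the crawler's actual code: the function names are hypothetical, the `robots.txt` and sitemap contents are passed in as strings so the example makes no network requests, and nested sitemap index files are not followed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def sitemap_candidates(start_url: str, robots_txt: str) -> list:
    """Collect sitemap URLs named in robots.txt, plus the /sitemap.xml fallback."""
    found = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    fallback = urljoin(start_url, "/sitemap.xml")
    if fallback not in found:
        found.append(fallback)
    return found


def sitemap_locs(xml_text: str) -> list:
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


robots = "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap_index.xml\n"
print(sitemap_candidates("https://example.com/start", robots))

sitemap_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)
print(sitemap_locs(sitemap_xml))
```

In a crawl, the URLs returned by `sitemap_locs` would be queued before any HTML page is fetched, matching the order described above.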
## Recommendation

For normal use, run the root application or Docker container instead of calling this script directly. That is now the intended user interface for this repository.