# Sitemap Builder

This folder contains the sitemap crawler used by the combined web application in the repository root.

The crawler can still be used directly from Python, but the primary supported experience is now the shared Streamlit interface in the root project:

```text
../app.py
```

## Current Role In The Combined App

The root application uses this module to:

- crawl a site from a submitted starting URL
- discover internal URLs from HTML links and XML sitemaps
- export a sitemap CSV
- save crawl state and crawl logs for resume support

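To make "internal URLs from HTML links" concrete, here is a minimal sketch of link discovery using only the standard library. It is not the module's actual code: `LinkCollector`, `is_internal`, and `internal_links` are hypothetical names, and the rule that "internal" means the same host (or optionally a subdomain of it) is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_internal(url, start_host, include_subdomains=False):
    """Assumed rule: same host, or optionally a subdomain of that host."""
    host = (urlsplit(url).hostname or "").lower()
    if host == start_host:
        return True
    return include_subdomains and host.endswith("." + start_host)


def internal_links(page_url, html, include_subdomains=False):
    """Return the sorted, de-duplicated internal http(s) links found in html."""
    start_host = (urlsplit(page_url).hostname or "").lower()
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if is_internal(absolute, start_host, include_subdomains):
            found.add(absolute)
    return sorted(found)
```

Calling `internal_links("https://example.com/", html)` on a fetched page returns the absolute same-host links; passing `include_subdomains=True` mirrors the intent of the `--include-subdomains` option.
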
## Output

The crawler writes:

- a CSV file
- a sidecar crawl state file ending in `.crawlstate.json`
- a crawl log file ending in `.crawl.log`

The CSV contains these columns:

- `URL`
- `Title`
- `Canonical URL`
- `Type`

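As an illustration of consuming the exported CSV, the sketch below reads it with `csv.DictReader`. Only the four column names come from this README; the sample rows and the `Type` values (`page`, `document`) are made up for the example, and both helper functions are hypothetical.

```python
import csv
import io

# Hypothetical sample data; only the header names are taken from this README.
SAMPLE = """URL,Title,Canonical URL,Type
https://example.com/,Home,https://example.com/,page
https://example.com/old,Old,https://example.com/new,page
https://example.com/guide.pdf,,,document
"""


def urls_by_type(csv_file):
    """Group the URL column by the value of the Type column."""
    groups = {}
    for row in csv.DictReader(csv_file):
        groups.setdefault(row["Type"], []).append(row["URL"])
    return groups


def canonical_mismatches(csv_file):
    """Return URLs whose Canonical URL is set and differs from the URL."""
    return [
        row["URL"]
        for row in csv.DictReader(csv_file)
        if row["Canonical URL"] and row["Canonical URL"] != row["URL"]
    ]


if __name__ == "__main__":
    print(urls_by_type(io.StringIO(SAMPLE)))
    print(canonical_mismatches(io.StringIO(SAMPLE)))
```

For a real export, replace `io.StringIO(SAMPLE)` with `open("sitemap.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.
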
## Standalone CLI Usage

Interactive mode:

```bash
python3 sitemap_builder.py
```

Command line mode:

```bash
python3 sitemap_builder.py https://example.com -o ./sitemap.csv
```

On Windows:

```powershell
python .\sitemap_builder.py https://example.com -o .\sitemap.csv
```

## Useful Options

```bash
python3 sitemap_builder.py https://example.com --max-pages 20000 --delay 0.25 --include-subdomains
```

- `--max-pages`: stop after the given number of visited pages. Default: `10000`
- `--delay`: wait between requests to reduce load on the site
- `--timeout`: request timeout in seconds
- `--include-subdomains`: crawl subdomains of the starting host
- `--include-documents`: include document links such as PDF, CSV, DOC, DOCX, XLSX, and similar files
- `--workers`: number of worker threads to use. Set `1` to disable multithreading
- `--save-every`: save progress after every N pages. Default: `25`
- `--resume`: resume from an existing state file
- `--fresh`: ignore the existing state file and start over

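The interplay of `--save-every`, `--resume`, and `--fresh` can be pictured with the sketch below. It is an illustration under assumptions, not the script's implementation: the JSON layout (`visited`, `queue`), the function names, and the atomic-write approach are all hypothetical, and the real `.crawlstate.json` format may differ.

```python
import json
import os
import tempfile


def save_state(path, visited, queue):
    """Write state via a temp file so an interrupted save cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"visited": sorted(visited), "queue": list(queue)}, handle)
    os.replace(tmp_path, path)  # atomic rename over the old state file


def load_state(path, fresh=False):
    """Return (visited, queue); both empty when fresh or no state file exists."""
    if fresh or not os.path.exists(path):
        return set(), []
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return set(data["visited"]), list(data["queue"])


def crawl(queue, visited, state_path, save_every=25, max_pages=10000):
    """Visit queued URLs, checkpointing after every save_every visited pages."""
    queue = list(queue)
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)  # a real crawler would fetch and parse the page here
        if len(visited) % save_every == 0:
            save_state(state_path, visited, queue)
    save_state(state_path, visited, queue)  # final save on stop or completion
    return visited
```

A resumed run would call `load_state(path)` and pass the returned `visited` and `queue` back into `crawl`, while a fresh run would call `load_state(path, fresh=True)` and start from the submitted URL.
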
## Discovery And Behavior

- The crawler checks `robots.txt` for sitemap references and also tries `/sitemap.xml`
- XML sitemap URLs are added to the crawl queue before page crawling begins
- HTML pages store page title and canonical URL in the CSV when available
- On Windows CLI runs, `P` pauses, `R` resumes, and `Q` stops cleanly and saves progress

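The sitemap discovery steps above can be sketched with the standard library as follows. This is a rough approximation rather than the crawler's own code: `sitemap_candidates` and `sitemap_locations` are hypothetical names, and the sketch works on already-downloaded text instead of making requests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_candidates(site_root, robots_txt):
    """List sitemap URLs named in robots.txt, plus the default /sitemap.xml."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() requires Python 3.8+ and returns None when none are listed.
    candidates = list(parser.site_maps() or [])
    default = urljoin(site_root, "/sitemap.xml")
    if default not in candidates:
        candidates.append(default)
    return candidates


def sitemap_locations(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

Each URL returned by `sitemap_locations` would then be added to the crawl queue before page crawling begins. For a sitemap index, the `<loc>` values point at further sitemap files, which would need to be fetched and parsed in turn.
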
## Recommendation

For normal use, run the root application or Docker container instead of calling this script directly. That is now the intended user interface for this repository.