# WDW Sitemap And Import Tools
This repository combines two internal tools into one web application and one Docker image:
- `Sitemap Generator`
- `Page Importer`
The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.
## What It Does
### Sitemap Generator
- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later
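
As a rough illustration of the resume behavior described above, the sketch below checkpoints a breadth-first crawl to a JSON file after every page. The function names, the state file layout (`pending` and `seen`), and the `fetch_links` callback are assumptions made for this example; they are not taken from the code in `Sitemap Builder/`.

```python
import json
from collections import deque
from pathlib import Path


def load_state(path):
    """Return (pending, seen) from a state file, or empty state if none exists."""
    if not path.exists():
        return deque(), set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return deque(data["pending"]), set(data["seen"])


def save_state(path, pending, seen):
    """Write to a temporary file first, then rename it over the real state file."""
    tmp = path.with_name(path.name + ".tmp")
    payload = {"pending": list(pending), "seen": sorted(seen)}
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(path)


def crawl(start_url, state_path, fetch_links, max_pages=None):
    """Breadth-first crawl that saves its state after each processed page.

    fetch_links is a hypothetical callable: URL in, iterable of linked URLs out.
    """
    state_path = Path(state_path)
    pending, seen = load_state(state_path)
    if not pending and not seen:
        # No earlier run to resume, so start from the given URL.
        pending.append(start_url)
        seen.add(start_url)
    visited = 0
    while pending and (max_pages is None or visited < max_pages):
        url = pending.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
        visited += 1
        save_state(state_path, pending, seen)
    return sorted(seen)
```

Because the state is only written after a page has been fully processed, an interrupted run restarts from the last completed page rather than losing the page that was in flight.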
### Page Importer
- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file
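
The import flow above can be sketched in a few lines: read URLs from a CSV, then write one WXR `<item>` per scraped page. This is a minimal skeleton, not the importer's actual code. The `url` column name, the page dictionary keys, and the `draft` status are assumptions, and a production WXR file usually carries more fields (authors, dates, slugs) than shown here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespaces used by WordPress WXR 1.2 export files.
WP_NS = "http://wordpress.org/export/1.2/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("wp", WP_NS)
ET.register_namespace("content", CONTENT_NS)


def read_urls(csv_text, column="url"):
    """Return non-empty values from one CSV column (column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row.get(column) or "").strip()
            for row in reader
            if (row.get(column) or "").strip()]


def build_wxr(pages, site_title="Imported site"):
    """Build a minimal WXR document from dicts with title, url, and html keys."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, f"{{{WP_NS}}}wxr_version").text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        # ElementTree escapes the HTML as text instead of wrapping it in CDATA.
        ET.SubElement(item, f"{{{CONTENT_NS}}}encoded").text = page["html"]
        ET.SubElement(item, f"{{{WP_NS}}}post_type").text = "page"
        ET.SubElement(item, f"{{{WP_NS}}}status").text = "draft"
    return ET.tostring(rss, encoding="unicode")
```

Exporting pages as drafts is a choice made for this sketch so that imported content can be reviewed in WordPress before it is published.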
## Project Layout
- `app.py`: top-level Streamlit app with both tabs
- `requirements.txt`: shared Python dependencies for the combined app
- `Dockerfile`: single image for the combined tool
- `docker-compose.yml`: compose file that runs the published image
- `.gitea/workflows/docker-image.yml`: Gitea Actions workflow for Docker builds
- `Sitemap Builder/`: sitemap crawler logic
- `Page Importer/`: WordPress import logic
## Run Locally
### Linux or macOS
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
### Windows PowerShell
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```
Then open:
```text
http://localhost:8501
```
## Docker
Build the image:
```bash
docker build -t wdw-sitemap-and-importer .
```
Run the container:
```bash
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
```
Then open:
```text
http://localhost:8501
```
The named volume `wdw-tools-data`, mounted at `/data`, stores sitemap CSV files, crawl state files, and crawl logs, so sitemap jobs survive container restarts.
## Docker Compose
A ready-to-use compose file is included:
```text
docker-compose.yml
```
It pulls this image:
```text
git.websupport.work/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest
```
Start it with:
```bash
docker compose up -d
```
Then open:
```text
http://localhost:8501
```
## Gitea Automation
The workflow file is:
```text
.gitea/workflows/docker-image.yml
```
It runs on pushes to `main` and on manual workflow dispatch.
The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:
- `REGISTRY_URL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD`
Published tags:
- `${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:<commit-sha>`
- `${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest`
`REGISTRY_URL` should be the registry host only, for example:
```text
registry.example.com
```
or:
```text
gitea.example.com
```
Do not include `http://`, `https://`, or the repository path in `REGISTRY_URL`. The workflow derives the repository path from the Gitea repository name and converts it to lowercase for Docker compatibility.
If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.
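
The tagging and push rules above can be expressed as a short Python sketch. It only illustrates the rules as this README states them; the real logic lives in `.gitea/workflows/docker-image.yml`, and the function names here are hypothetical.

```python
REQUIRED_SECRETS = ("REGISTRY_URL", "REGISTRY_USERNAME", "REGISTRY_PASSWORD")


def normalize_registry_host(value):
    """Accept a bare registry host; reject a scheme or a repository path."""
    host = value.strip()
    if "://" in host:
        raise ValueError("REGISTRY_URL must not include http:// or https://")
    if "/" in host:
        raise ValueError("REGISTRY_URL must not include a repository path")
    return host


def image_tags(registry_url, repository, commit_sha):
    """Return the commit-SHA tag and the latest tag for a repository.

    repository is the owner/name pair, lowercased for Docker compatibility.
    """
    base = f"{normalize_registry_host(registry_url)}/{repository.lower()}"
    return [f"{base}:{commit_sha}", f"{base}:latest"]


def should_push(secrets):
    """Push only when all three registry secrets are present and non-empty."""
    return all(secrets.get(name) for name in REQUIRED_SECRETS)
```

For example, `image_tags("registry.example.com", "WDW_Internal_Tools/WDW-Sitemap-and-Scraper-Docker", "abc123")` yields two tags under `registry.example.com/wdw_internal_tools/wdw-sitemap-and-scraper-docker`, one ending in `:abc123` and one in `:latest`.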
## Notes
- Sitemap output files are written under `/data` in Docker.
- Sitemap Generator worker threads default to the number of CPUs visible inside the Docker container.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.
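
The note on worker threads does not say how the CPU count is obtained. One possible approach, shown purely as an assumption, is to prefer the process's CPU affinity mask on Linux and fall back to `os.cpu_count()` elsewhere. The helper names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count():
    """Return the number of CPUs this process may run on, never less than 1."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: counts the CPUs in this process's affinity mask.
        return max(1, len(os.sched_getaffinity(0)))
    # os.cpu_count() can return None when the count cannot be determined.
    return os.cpu_count() or 1


def run_in_workers(func, items, workers=None):
    """Apply func to each item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers or default_worker_count()) as pool:
        return list(pool.map(func, items))
```

With this approach, a CPU set such as `docker run --cpuset-cpus` is reflected in the affinity mask, while a CPU-time quota such as `--cpus` generally is not, so a quota-limited container may still report every host CPU.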