first commit
2026-04-09 10:42:10 -07:00
commit ead872a0a5
19 changed files with 2783 additions and 0 deletions
@@ -0,0 +1,110 @@
# WDW Sitemap And Import Tools
This repository combines two internal tools into one web application and one Docker image:
- `Sitemap Generator`
- `Page Importer`
The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.
## What It Does
### Sitemap Generator
- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later
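The discovery and export steps above can be sketched with the standard library alone. This is a minimal illustration, not the crawler's actual code: the names `links_from_html`, `urls_from_sitemap`, and `sitemap_csv`, the same-host filter, and the single `url` CSV column are all assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(page_url, html):
    """Return absolute, fragment-free, same-host URLs linked from a page."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Assumption: the crawl stays on the starting host.
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def urls_from_sitemap(xml_text):
    """Return every <loc> value from a urlset or sitemapindex document."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Strip the "{namespace}" prefix ElementTree puts on tag names.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def sitemap_csv(urls):
    """Render de-duplicated, sorted URLs as CSV text (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url"])
    for url in sorted(set(urls)):
        writer.writerow([url])
    return buffer.getvalue()
```

Merging the two sources is then a matter of concatenating both lists before calling `sitemap_csv`, which removes duplicates found by both routes.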
### Page Importer
- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file
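To give a sense of what the export step produces, here is a rough sketch of a WXR document built with `xml.etree.ElementTree`. The function name `build_wxr`, the `title`/`html` dictionary keys, and the choice to emit every page as a draft are assumptions for illustration. The importer's real output is not shown in this README and almost certainly carries more fields, such as authors, slugs, and dates.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordPress WXR 1.2 export files.
WP_NS = "http://wordpress.org/export/1.2/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("wp", WP_NS)
ET.register_namespace("content", CONTENT_NS)


def build_wxr(site_title, pages):
    """Return a minimal WXR document as a string.

    `pages` is a list of dicts with hypothetical keys "title" and "html".
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, f"{{{WP_NS}}}wxr_version").text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        # ElementTree escapes the markup; it parses back to the same string.
        ET.SubElement(item, f"{{{CONTENT_NS}}}encoded").text = page["html"]
        ET.SubElement(item, f"{{{WP_NS}}}post_type").text = "page"
        # Assumption: import as drafts so content is reviewed before publishing.
        ET.SubElement(item, f"{{{WP_NS}}}status").text = "draft"
    return ET.tostring(rss, encoding="unicode")
```

Real WXR exports usually wrap page bodies in CDATA sections. Escaped text, as produced here, is equivalent XML, but it is worth checking the result against a test WordPress site before relying on it.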
## Project Layout
- `app.py`: top-level Streamlit app with both tabs
- `requirements.txt`: shared Python dependencies for the combined app
- `Dockerfile`: single image for the combined tool
- `.gitea/workflows/docker-image.yml`: Gitea Actions workflow for Docker builds
- `Sitemap Builder/`: sitemap crawler logic
- `Page Importer/`: WordPress import logic
## Run Locally
### Linux or macOS
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
### Windows PowerShell
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```
Then open:
```text
http://localhost:8501
```
## Docker
Build the image:
```bash
docker build -t wdw-sitemap-and-importer .
```
Run the container:
```bash
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
```
Then open:
```text
http://localhost:8501
```
The named volume `wdw-tools-data`, mounted at `/data`, stores sitemap CSV files, crawl state files, and crawl logs, so sitemap jobs survive container restarts.
## Gitea Automation
The workflow file is:
```text
.gitea/workflows/docker-image.yml
```
It runs on pushes to `main` and on manual workflow dispatch.
The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:
- `GITEA_REGISTRY_URL`
- `GITEA_REGISTRY_USERNAME`
- `GITEA_REGISTRY_PASSWORD`
Published tags:
- `${REGISTRY}/wdw-sitemap-and-importer:<commit-sha>`
- `${REGISTRY}/wdw-sitemap-and-importer:latest`
If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.
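The build-always, push-only-with-secrets rule can be expressed as a small decision function. This is a sketch of the logic rather than the workflow itself: `plan_publish` is a hypothetical name, and the README does not say how `${REGISTRY}` is derived, so the example assumes it is the value of `GITEA_REGISTRY_URL` given as a bare host.

```python
REQUIRED_SECRETS = (
    "GITEA_REGISTRY_URL",
    "GITEA_REGISTRY_USERNAME",
    "GITEA_REGISTRY_PASSWORD",
)


def plan_publish(env, commit_sha, image="wdw-sitemap-and-importer"):
    """Return the tags to push, or an empty list when the push is skipped.

    `env` is any mapping of secret names to values, for example os.environ.
    """
    # Assumption: a missing or empty secret means "not configured".
    if not all(env.get(name) for name in REQUIRED_SECRETS):
        return []
    # Assumption: the registry value is a bare host like "registry.example.com".
    registry = env["GITEA_REGISTRY_URL"].rstrip("/")
    return [
        f"{registry}/{image}:{commit_sha}",
        f"{registry}/{image}:latest",
    ]
```

An empty result corresponds to the validation-only path, where the image is built but nothing is pushed.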
## Notes
- When running in Docker, sitemap output files are written under `/data`.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.
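
The resume behavior mentioned in these notes could work roughly as follows. This is a sketch under stated assumptions, not the crawler's actual format: the file naming scheme (a hash of the starting URL), the JSON layout with `pending` and `visited` lists, and the function names are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def state_path(data_dir, start_url):
    """Hypothetical naming scheme: one state file per starting URL."""
    digest = hashlib.sha256(start_url.encode("utf-8")).hexdigest()[:16]
    return Path(data_dir) / f"crawl-state-{digest}.json"


def load_state(data_dir, start_url):
    """Resume from a matching state file, or start a fresh crawl."""
    path = state_path(data_dir, start_url)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"start_url": start_url, "pending": [start_url], "visited": []}


def save_state(data_dir, state):
    """Write the state to a temporary file, then rename it into place."""
    path = state_path(data_dir, state["start_url"])
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    # Renaming avoids leaving a half-written file if the container stops.
    tmp.replace(path)
```

In Docker, `data_dir` would be `/data`, so a state file saved before a restart is found again by `load_state` when the same starting URL is submitted.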