WDW Sitemap And Import Tools
This repository combines two internal tools into one web application and one Docker image:
- Sitemap Generator
- Page Importer
The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.
What It Does
Sitemap Generator
- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later
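The discovery step listed above can be sketched with the Python standard library alone. This is an illustration, not the code in Sitemap Builder/: the names LinkCollector, links_from_html, and urls_from_sitemap are hypothetical, and the real crawler may filter or normalize URLs differently.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(base_url, html):
    """Return absolute http(s) URLs linked from a page, without fragments or duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in urls:
            urls.append(absolute)
    return urls


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> element in an XML sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        # Tags carry the sitemap namespace, e.g. "{http://...}loc".
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            found.append(element.text.strip())
    return found
```

A crawler would typically feed the results of both functions into one queue of pending URLs and skip anything already visited.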
Page Importer
- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file
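As a rough picture of the first and last steps above, the sketch below reads URLs from a CSV and writes a minimal WXR document. It makes several assumptions: the CSV column is named url, scraped pages arrive as dictionaries with url, title, and html keys, and every item is exported as a draft page. The functions read_urls and build_wxr are hypothetical, and the actual importer in Page Importer/ is likely to emit more fields than this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export files (WXR 1.2).
WP_NS = "http://wordpress.org/export/1.2/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("wp", WP_NS)
ET.register_namespace("content", CONTENT_NS)


def read_urls(csv_text, column="url"):
    """Return the non-empty values of one CSV column (column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row.get(column) or "").strip() for row in reader
            if (row.get(column) or "").strip()]


def build_wxr(pages):
    """Serialize scraped pages into a minimal WXR string."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, f"{{{WP_NS}}}wxr_version").text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        ET.SubElement(item, f"{{{CONTENT_NS}}}encoded").text = page["html"]
        # Assumed defaults for illustration: import everything as draft pages.
        ET.SubElement(item, f"{{{WP_NS}}}post_type").text = "page"
        ET.SubElement(item, f"{{{WP_NS}}}status").text = "draft"
    return ET.tostring(rss, encoding="unicode")
```

ElementTree escapes the HTML body on output, so the page markup survives a round trip through an XML parser.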
Project Layout
- app.py: top-level Streamlit app with both tabs
- requirements.txt: shared Python dependencies for the combined app
- Dockerfile: single image for the combined tool
- .gitea/workflows/docker-image.yml: Gitea Actions workflow for Docker builds
- Sitemap Builder/: sitemap crawler logic
- Page Importer/: WordPress import logic
Run Locally
Linux or macOS
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
Windows PowerShell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
Then open:
http://localhost:8501
Docker
Build the image:
docker build -t wdw-sitemap-and-importer .
Run the container:
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
Then open:
http://localhost:8501
The mounted /data volume stores sitemap CSV files, crawl state files, and crawl logs so sitemap jobs can survive container restarts.
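One way to make state files safe across restarts is to write them atomically, so a container stopped mid-write never leaves a half-written file on the volume. The sketch below shows that idea only. The functions save_state and load_state, the .state.json file name, and the visited/pending layout are all hypothetical and may not match what the crawler actually stores.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(data_dir, job_name, visited, pending):
    """Write crawl state as JSON, replacing any previous file in one step."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / f"{job_name}.state.json"
    payload = {"visited": sorted(visited), "pending": list(pending)}
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=data_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)
    return target


def load_state(data_dir, job_name):
    """Return the saved state, or None when no state file exists for this job."""
    target = Path(data_dir) / f"{job_name}.state.json"
    if not target.exists():
        return None
    with target.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Inside the container, data_dir would be /data. A crawl that finds a non-None state on startup can continue from the pending list instead of starting over.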
Docker Compose
A ready-to-use compose file is included:
docker-compose.yml
It pulls this image:
git.websupport.work/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest
Start it with:
docker compose up -d
Then open:
http://localhost:8501
Gitea Automation
The workflow file is:
.gitea/workflows/docker-image.yml
It runs on pushes to main and on manual workflow dispatch.
The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:
- REGISTRY_URL
- REGISTRY_USERNAME
- REGISTRY_PASSWORD
Published tags:
- ${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:<commit-sha>
- ${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest
REGISTRY_URL should be the registry host only, for example:
registry.example.com
or:
gitea.example.com
Do not include http://, https://, or the repository path in REGISTRY_URL. The workflow derives the repository path from the Gitea repository name and converts it to lowercase for Docker compatibility.
If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.
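The REGISTRY_URL rules and the two published tags can be expressed as a small function. This is an illustration of the naming scheme described above, written in Python for readability. It is not the workflow's own logic, and image_refs is a hypothetical name.

```python
def image_refs(registry_url, repository, commit_sha):
    """Build the commit-sha and latest image references for a bare registry host."""
    host = registry_url.strip()
    # Reject schemes and paths: REGISTRY_URL is expected to be the host only.
    if not host or "://" in host or "/" in host:
        raise ValueError("REGISTRY_URL must be a bare host, e.g. registry.example.com")
    # Docker repository names must be lowercase.
    base = f"{host}/{repository.lower()}"
    return [f"{base}:{commit_sha}", f"{base}:latest"]
```

For example, image_refs("registry.example.com", "WDW_Internal_Tools/WDW-Sitemap-And-Scraper-Docker", "abc123") yields the two lowercase references ending in :abc123 and :latest, while a value such as https://registry.example.com raises ValueError.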
Notes
- Sitemap output files are written under /data in Docker.
- Sitemap Generator worker threads default to the number of CPUs visible inside the Docker container.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.
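The worker-thread default mentioned in the notes could be derived as in the sketch below. It assumes the count comes from the process's CPU affinity where the platform exposes it, with os.cpu_count() as a fallback; the crawler may compute it differently. The names default_worker_count and fetch_all are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count():
    """Return the number of CPUs this process may run on, at least 1."""
    if hasattr(os, "sched_getaffinity"):
        # Available on Linux; reflects CPU pinning such as docker run --cpuset-cpus.
        return max(1, len(os.sched_getaffinity(0)))
    return max(1, os.cpu_count() or 1)


def fetch_all(urls, fetch, workers=None):
    """Apply fetch to every URL on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers or default_worker_count()) as pool:
        return list(pool.map(fetch, urls))
```

Note that a CPU quota set with docker run --cpus does not change the affinity mask, so a container limited that way can still report the host's full CPU count. Passing workers explicitly avoids oversubscribing in that case.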