WDW Sitemap And Import Tools

This repository combines two internal tools into one web application and one Docker image:

  • Sitemap Generator
  • Page Importer

The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.

What It Does

Sitemap Generator

  • Crawls a site from a starting URL
  • Discovers URLs from page links and XML sitemaps
  • Exports a sitemap CSV
  • Saves crawl state and logs so a crawl can be resumed later
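The two discovery sources listed above (page links and XML sitemaps) can be illustrated with the standard library alone. This is a minimal sketch, not the crawler in Sitemap Builder/: the function names `links_from_html` and `urls_from_sitemap` are hypothetical, and the same-host filter is an assumption about how a crawl might be scoped.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(page_url, html):
    """Return absolute, fragment-free, same-host URLs linked from a page."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        # Assumption: only http(s) links on the starting host are kept.
        if parts.scheme in ("http", "https") and parts.netloc == host:
            found.add(absolute)
    return found


def urls_from_sitemap(xml_text):
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}
```

Because a sitemap index also uses `<loc>` elements, the values returned by `urls_from_sitemap` may themselves be sitemap URLs that need a further fetch before they yield page URLs.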

Page Importer

  • Reads a CSV of submitted URLs
  • Scrapes page content
  • Lets you review the extracted content
  • Exports a WordPress WXR XML import file
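The first and last steps above (reading the URL CSV and writing a WXR file) might look roughly like the sketch below. It is not the code in Page Importer/: the `url` column name, the shape of the `pages` dictionaries, and the `draft` status are all assumptions, and a real export file typically carries more channel and item fields than shown here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export files (WXR 1.2).
NAMESPACES = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (NAMESPACES[prefix], tag)


def read_urls(csv_text, column="url"):
    """Return the non-empty values of one column (the column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


def build_wxr(pages):
    """Build WXR text from dicts with 'title', 'url' and 'html' keys (hypothetical shape)."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        # ElementTree escapes the HTML instead of wrapping it in CDATA.
        ET.SubElement(item, q("content", "encoded")).text = page["html"]
        ET.SubElement(item, q("wp", "post_type")).text = "page"
        ET.SubElement(item, q("wp", "status")).text = "draft"
    return ET.tostring(rss, encoding="unicode")
```

Keeping the scraped pages as plain dictionaries between the scrape and the export leaves room for the review step: the content can be shown and edited before `build_wxr` is called.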

Project Layout

  • app.py: top-level Streamlit app with both tabs
  • requirements.txt: shared Python dependencies for the combined app
  • Dockerfile: single image for the combined tool
  • .gitea/workflows/docker-image.yml: Gitea Actions workflow for Docker builds
  • Sitemap Builder/: sitemap crawler logic
  • Page Importer/: WordPress import logic

Run Locally

Linux or macOS

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py

Then open:

http://localhost:8501

Docker

Build the image:

docker build -t wdw-sitemap-and-importer .

Run the container:

docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer

Then open:

http://localhost:8501

The wdw-tools-data volume mounted at /data stores sitemap CSV files, crawl state files, and crawl logs, so sitemap jobs can survive container restarts.
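To show why a persistent directory is enough for a crawl to survive a restart, here is a minimal sketch of saving and reloading crawl state as JSON. The file naming scheme, the JSON layout, and the helper names `state_path`, `save_state`, and `load_state` are hypothetical; the actual state format used by the sitemap tool is not described in this README.

```python
import hashlib
import json
import os
from pathlib import Path


def state_path(data_dir, start_url):
    """Derive a per-site state file name from the starting URL (hypothetical scheme)."""
    digest = hashlib.sha256(start_url.encode("utf-8")).hexdigest()[:16]
    return Path(data_dir) / f"crawl-state-{digest}.json"


def save_state(path, visited, queue):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"visited": sorted(visited), "queue": list(queue)}))
    os.replace(tmp, path)


def load_state(path, start_url):
    """Resume from a saved state if one exists, otherwise start a fresh crawl."""
    if path.exists():
        data = json.loads(path.read_text())
        return set(data["visited"]), list(data["queue"])
    return set(), [start_url]
```

In a container started with the command above, `data_dir` would be `/data`, so a state file written before a restart is found again by `load_state` afterwards.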

Gitea Automation

The workflow file is:

.gitea/workflows/docker-image.yml

It runs on pushes to main and on manual workflow dispatch.

The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:

  • GITEA_REGISTRY_URL
  • GITEA_REGISTRY_USERNAME
  • GITEA_REGISTRY_PASSWORD

Published tags:

  • ${REGISTRY}/wdw-sitemap-and-importer:<commit-sha>
  • ${REGISTRY}/wdw-sitemap-and-importer:latest

If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.

Notes

  • Sitemap output files are written under /data in Docker.
  • The sitemap crawler can resume previous runs when a matching crawl state file exists.
  • The importer's scraping and WordPress export behavior is unchanged; it runs inside the shared interface rather than as a separate app.