
WDW Sitemap And Import Tools

This repository combines two internal tools into one web application and one Docker image:

  • Sitemap Generator
  • Page Importer

The application is built with Streamlit and serves both tools from a single URL, as two tabs at the top of the page.

What It Does

Sitemap Generator

  • Crawls a site from a starting URL
  • Discovers URLs from page links and XML sitemaps
  • Exports a sitemap CSV
  • Saves crawl state and logs so a crawl can be resumed later
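
The resume behavior can be pictured as a small state file that is rewritten as the crawl progresses. The sketch below is illustrative only: the function names, the JSON layout, and the `pending`/`visited` keys are assumptions made for this example, not the crawler's actual file format.

```python
import json
import os
import tempfile
from collections import deque
from pathlib import Path


def save_state(path, pending, visited):
    """Persist the crawl frontier and the visited set (hypothetical layout)."""
    path = Path(path)
    payload = {"pending": list(pending), "visited": sorted(visited)}
    # Write to a temporary file in the same directory, then rename it over
    # the target, so an interrupted write never leaves a truncated state file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, path)


def load_state(path, start_url):
    """Resume from a saved state file, or start fresh from start_url."""
    path = Path(path)
    if not path.exists():
        return deque([start_url]), set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return deque(data["pending"]), set(data["visited"])
```

A crawl loop would call `load_state` once at startup and `save_state` periodically, so that a restart picks up the remaining queue instead of starting over.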

Page Importer

  • Reads a CSV of submitted URLs
  • Scrapes page content
  • Lets you review the extracted content
  • Exports a WordPress WXR XML import file
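
For orientation, a WXR file is an RSS 2.0 document extended with WordPress namespaces. The following is a minimal sketch of that shape using only the standard library. The `build_wxr` function and the `title`/`html` keys are hypothetical, and a real export carries more fields (slugs, dates, authors) than shown here, so this should not be read as the importer's actual output.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
WP_NS = "http://wordpress.org/export/1.2/"
ET.register_namespace("content", CONTENT_NS)
ET.register_namespace("wp", WP_NS)


def build_wxr(site_title, pages):
    """Return a minimal WXR-style XML string for a list of page dicts."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, f"{{{WP_NS}}}wxr_version").text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        # ElementTree escapes the HTML as text; an XML parser reads it back
        # the same way it would read a CDATA section.
        ET.SubElement(item, f"{{{CONTENT_NS}}}encoded").text = page["html"]
        ET.SubElement(item, f"{{{WP_NS}}}post_type").text = "page"
        ET.SubElement(item, f"{{{WP_NS}}}status").text = "draft"
    return ET.tostring(rss, encoding="unicode")
```

Marking items as `draft` in this sketch reflects the review step: imported pages would not go live until someone publishes them.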

Project Layout

  • app.py: top-level Streamlit app with both tabs
  • requirements.txt: shared Python dependencies for the combined app
  • Dockerfile: single image for the combined tool
  • .gitea/workflows/docker-image.yml: Gitea Actions workflow for Docker builds
  • Sitemap Builder/: sitemap crawler logic
  • Page Importer/: WordPress import logic

Run Locally

Linux or macOS

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py

Then open:

http://localhost:8501

Docker

Build the image:

docker build -t wdw-sitemap-and-importer .

Run the container:

docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer

Then open:

http://localhost:8501

The named volume wdw-tools-data is mounted at /data and stores sitemap CSV files, crawl state files, and crawl logs, so sitemap jobs survive container restarts.

Docker Compose

A ready-to-use compose file is included:

docker-compose.yml

It pulls this image:

git.websupport.work/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest

Start it with:

docker compose up -d

Then open:

http://localhost:8501

Gitea Automation

The workflow file is:

.gitea/workflows/docker-image.yml

It runs on pushes to main and on manual workflow dispatch.

The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:

  • REGISTRY_URL
  • REGISTRY_USERNAME
  • REGISTRY_PASSWORD

Published tags:

  • ${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:<commit-sha>
  • ${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest

REGISTRY_URL should be the registry host only, for example:

registry.example.com

or:

gitea.example.com

Do not include http://, https://, or the repository path in REGISTRY_URL. The workflow derives the repository path from the Gitea repository name and converts it to lowercase for Docker compatibility.
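
The tag rules above can be expressed as a short function. This is a sketch of the naming logic only, written in Python for illustration; the workflow itself may implement it differently, and `image_tags` is a hypothetical name.

```python
def image_tags(registry_url, repository, commit_sha):
    """Build the two published tags from a bare registry host."""
    # Reject values that carry a scheme or a path, as the README requires.
    if "://" in registry_url or "/" in registry_url:
        raise ValueError(
            "REGISTRY_URL must be a bare host such as registry.example.com"
        )
    # Docker image names must be lowercase, so normalize the repository path.
    image = f"{registry_url}/{repository.lower()}"
    return [f"{image}:{commit_sha}", f"{image}:latest"]
```

For example, `image_tags("registry.example.com", "WDW_Internal_Tools/wdw-sitemap-and-scraper-docker", "abc1234")` yields the `abc1234` and `latest` tags under `registry.example.com/wdw_internal_tools/wdw-sitemap-and-scraper-docker`, while passing `https://registry.example.com` raises a `ValueError`.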

If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.

Notes

  • Sitemap output files are written under /data in Docker.
  • Sitemap Generator worker threads default to the number of CPUs visible inside the Docker container.
  • The sitemap crawler can resume previous runs when a matching crawl state file exists.
  • The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.
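
As a rough illustration of the worker-thread default mentioned in the notes, the sketch below picks a thread count from the CPUs available to the process. The function names are hypothetical and the application's actual logic may differ. Note also that `os.sched_getaffinity` is Linux-only and reflects CPU pinning (for example `--cpuset-cpus`), not a CPU quota set with `--cpus`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count():
    """Return a worker count based on the CPUs this process may use."""
    # On Linux, the affinity mask reflects the CPU set assigned to the process.
    if hasattr(os, "sched_getaffinity"):
        return max(1, len(os.sched_getaffinity(0)))
    # Elsewhere, fall back to the total CPU count (which may be None).
    return os.cpu_count() or 1


def run_in_pool(func, items, workers=None):
    """Apply func to each item using a thread pool, preserving input order."""
    workers = workers or default_worker_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

Passing an explicit `workers` value overrides the default, which is useful when the container is limited by a quota rather than by pinning.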