# WDW Sitemap And Import Tools

This repository combines two internal tools into one web application and one Docker image:

- `Sitemap Generator`
- `Page Importer`

The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.

## What It Does

### Sitemap Generator

- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later

### Page Importer

- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file

## Project Layout

- `app.py`: top-level Streamlit app with both tabs
- `requirements.txt`: shared Python dependencies for the combined app
- `Dockerfile`: single image for the combined tool
- `.gitea/workflows/docker-image.yml`: Gitea Actions workflow for Docker builds
- `Sitemap Builder/`: sitemap crawler logic
- `Page Importer/`: WordPress import logic

## Run Locally

### Linux or macOS

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

## Docker

Build the image:

```bash
docker build -t wdw-sitemap-and-importer .
```

Run the container:

```bash
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
```

Then open:

```text
http://localhost:8501
```

The mounted `/data` volume stores sitemap CSV files, crawl state files, and crawl logs so sitemap jobs can survive container restarts.

## Gitea Automation

The workflow file is:

```text
.gitea/workflows/docker-image.yml
```

It runs on pushes to `main` and on manual workflow dispatch. The workflow always builds the Docker image.
If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:

- `REGISTRY_URL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD`

Published tags:

- `${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:`
- `${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest`

`REGISTRY_URL` should be the registry host only, for example:

```text
registry.example.com
```

or:

```text
gitea.example.com
```

Do not include `http://`, `https://`, or the repository path in `REGISTRY_URL`.

The workflow derives the repository path from the Gitea repository name and converts it to lowercase for Docker compatibility.

If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.

## Notes

- Sitemap output files are written under `/data` in Docker.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.
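
The resume behavior mentioned in the notes above can be pictured with a small sketch. This is not the crawler's actual code: the function names (`state_path_for`, `load_state`, `save_state`), the JSON layout, and the hash-based file naming are all assumptions made for illustration. It only shows one plausible way a "matching crawl state file" could be located and reloaded from the data directory.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def state_path_for(start_url: str, data_dir: Path) -> Path:
    """Map a starting URL to a state file path (hypothetical naming scheme)."""
    digest = hashlib.sha256(start_url.encode("utf-8")).hexdigest()[:16]
    return data_dir / f"crawl_state_{digest}.json"


def load_state(start_url: str, data_dir: Path) -> dict:
    """Return the saved state if a matching file exists, else a fresh state."""
    path = state_path_for(start_url, data_dir)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"start_url": start_url, "pending": [start_url], "visited": []}


def save_state(state: dict, data_dir: Path) -> None:
    """Write to a temporary file first, then rename it over the real one."""
    path = state_path_for(state["start_url"], data_dir)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)


if __name__ == "__main__":
    # A temporary directory stands in for the /data volume in this demo.
    with tempfile.TemporaryDirectory() as tmp_dir:
        data_dir = Path(tmp_dir)
        state = load_state("https://example.com/", data_dir)
        state["visited"].append(state["pending"].pop(0))
        save_state(state, data_dir)
        # A later run with the same starting URL picks up the saved state.
        resumed = load_state("https://example.com/", data_dir)
        print(resumed["visited"])
```

Keying the file name on the starting URL means a different starting URL begins a fresh crawl, while a repeated one resumes. Writing through a temporary file keeps an interrupted save from leaving a half-written state file behind.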