# WDW Sitemap And Import Tools

This repository combines two internal tools into one web application and one Docker image:

- `Sitemap Generator`
- `Page Importer`

The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.

## What It Does

### Sitemap Generator

- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later

### Page Importer

- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file

## Project Layout

- `app.py`: top-level Streamlit app with both tabs
- `requirements.txt`: shared Python dependencies for the combined app
- `Dockerfile`: single image for the combined tool
- `.gitea/workflows/docker-image.yml`: Gitea Actions workflow for Docker builds
- `Sitemap Builder/`: sitemap crawler logic
- `Page Importer/`: WordPress import logic

## Run Locally

### Linux or macOS

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

## Docker

Build the image:

```bash
docker build -t wdw-sitemap-and-importer .
```

Run the container:

```bash
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
```

Then open:

```text
http://localhost:8501
```

The mounted `/data` volume stores sitemap CSV files, crawl state files, and crawl logs so sitemap jobs can survive container restarts.

## Gitea Automation

The workflow file is:

```text
.gitea/workflows/docker-image.yml
```

It runs on pushes to `main` and on manual workflow dispatch. The workflow always builds the Docker image.
If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:

- `REGISTRY_URL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD`

Published tags:

- `${REGISTRY}/wdw-sitemap-and-importer:`
- `${REGISTRY}/wdw-sitemap-and-importer:latest`

If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.

## Notes

- Sitemap output files are written under `/data` in Docker.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.
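
The resume behavior mentioned in the notes above can be pictured with a minimal sketch. This is not the crawler's actual code: the JSON layout, the `save_state` and `load_state` helpers, and the `example.state.json` file name are all assumptions made for illustration. It only shows the general idea of writing progress to a file and picking it up again when that file exists.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def save_state(path: Path, visited: set[str], queue: list[str]) -> None:
    """Write crawl progress to disk (hypothetical JSON layout)."""
    payload = {"visited": sorted(visited), "queue": list(queue)}
    # Write to a temporary sibling file first, then swap it into place,
    # so an interrupted write does not leave a half-written state file.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(path)


def load_state(path: Path, start_url: str) -> tuple[set[str], list[str]]:
    """Return saved progress if a state file exists, else a fresh crawl."""
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        return set(data["visited"]), list(data["queue"])
    return set(), [start_url]


# Demo in a temporary directory; in Docker the file would sit under /data.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = Path(tmp_dir) / "example.state.json"  # hypothetical name

    # First run: no state file yet, so the crawl starts from the start URL.
    visited, queue = load_state(state_file, "https://example.com/")
    visited.add(queue.pop(0))
    queue.append("https://example.com/about")
    save_state(state_file, visited, queue)

    # Second run: the state file exists, so progress is restored.
    resumed_visited, resumed_queue = load_state(state_file, "https://example.com/")
```

With a layout like this, keeping the state file on the mounted `/data` volume is what would let a crawl continue after a container restart.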