# WDW Sitemap And Import Tools

This repository combines two internal tools into one web application and one Docker image:

- `Sitemap Generator`
- `Page Importer`

The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.

## What It Does

### Sitemap Generator

- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later

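As a rough illustration of the discovery and export steps listed above, the sketch below uses only the Python standard library. It is not the code in `Sitemap Builder/`: the function names (`links_from_html`, `urls_from_sitemap`, `sitemap_csv_text`), the same-host filter, and the single `url` CSV column are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(page_url, html):
    """Return absolute http(s) URLs on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            found.add(absolute)
    return found


def urls_from_sitemap(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    urls = set()
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.add(element.text.strip())
    return urls


def sitemap_csv_text(urls):
    """Render the URLs as CSV text with one (assumed) "url" column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url"])
    for url in sorted(urls):
        writer.writerow([url])
    return buffer.getvalue()
```

A crawler built on these helpers would fetch each page, merge the results of both discovery functions into one set, and write the CSV once the queue is empty.
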
### Page Importer

- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file

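The first and last steps above can be sketched with the standard library alone. This is a minimal illustration of the shape of a WXR document, not the exporter in `Page Importer/` and not a complete import file: the `url` column name, the `read_urls` and `build_wxr` helpers, and the choice of `page` and `draft` values are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export files (WXR 1.2).
NAMESPACES = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}


def read_urls(csv_text, column="url"):
    """Return the non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


def build_wxr(pages):
    """Build a minimal WXR string from dicts with url, title, and html keys."""
    # Registering prefixes makes the output use wp: and content: names.
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)
    wp = "{%s}" % NAMESPACES["wp"]
    content = "{%s}" % NAMESPACES["content"]

    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, wp + "wxr_version").text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        # ElementTree escapes the HTML instead of wrapping it in CDATA.
        ET.SubElement(item, content + "encoded").text = page["html"]
        ET.SubElement(item, wp + "post_type").text = "page"
        ET.SubElement(item, wp + "status").text = "draft"
    return ET.tostring(rss, encoding="unicode")
```

In the real tool the `title` and `html` values would come from the scrape and review steps; here they are supplied by the caller.
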
## Project Layout

- `app.py`: top-level Streamlit app with both tabs
- `requirements.txt`: shared Python dependencies for the combined app
- `Dockerfile`: single image for the combined tool
- `.gitea/workflows/docker-image.yml`: Gitea Actions workflow for Docker builds
- `Sitemap Builder/`: sitemap crawler logic
- `Page Importer/`: WordPress import logic

## Run Locally

### Linux or macOS

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```

On either platform, then open:

```text
http://localhost:8501
```

## Docker

Build the image:

```bash
docker build -t wdw-sitemap-and-importer .
```

Run the container:

```bash
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
```

Then open:

```text
http://localhost:8501
```

The mounted `/data` volume stores sitemap CSV files, crawl state files, and crawl logs so sitemap jobs can survive container restarts.

## Docker Compose

A ready-to-use compose file is included:

```text
docker-compose.yml
```

It pulls this image:

```text
git.websupport.work/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest
```

Start it with:

```bash
docker compose up -d
```

Then open:

```text
http://localhost:8501
```

## Gitea Automation

The workflow file is:

```text
.gitea/workflows/docker-image.yml
```

It runs on pushes to `main` and on manual workflow dispatch.

The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:

- `REGISTRY_URL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD`

Published tags:

- `${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:<commit-sha>`
- `${REGISTRY_URL}/wdw_internal_tools/wdw-sitemap-and-scraper-docker:latest`

`REGISTRY_URL` should be the registry host only, for example:

```text
registry.example.com
```

or:

```text
gitea.example.com
```

Do not include `http://`, `https://`, or the repository path in `REGISTRY_URL`. The workflow derives the repository path from the Gitea repository name and converts it to lowercase for Docker compatibility.

If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.

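The tag rules above can be expressed as a small function. This is only a sketch of the naming scheme, written in Python for readability: the workflow itself is a Gitea Actions file, `image_tags` is a hypothetical helper, and the `ValueError` check is an illustration of the "host only" rule rather than something the workflow is known to enforce.

```python
def image_tags(registry_url, repository, commit_sha):
    """Return the two image references: one per commit, one for latest."""
    host = registry_url.strip()
    # A bare host has no scheme and no path component.
    if "://" in host or "/" in host:
        raise ValueError(
            "REGISTRY_URL must be a bare host such as registry.example.com"
        )
    # Docker repository names must be lowercase.
    path = repository.lower()
    return [f"{host}/{path}:{commit_sha}", f"{host}/{path}:latest"]
```

For example, `image_tags("registry.example.com", "Owner/Repo", "abc1234")` yields `registry.example.com/owner/repo:abc1234` and `registry.example.com/owner/repo:latest`.
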
## Notes

- Sitemap output files are written under `/data` in Docker.
- Sitemap Generator worker threads default to the number of CPUs visible inside the Docker container.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.

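The worker default and the resume behavior in the notes above could look roughly like the following. None of this is taken from the crawler's source: the file naming scheme (a hash of the starting URL), the JSON layout, and every function name are assumptions chosen to show one way a "matching" state file can be found.

```python
import hashlib
import json
import os
from pathlib import Path


def default_worker_count():
    """Return the number of CPUs this process may use, with a floor of one."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: reflects CPU pinning such as docker run --cpuset-cpus.
        return max(1, len(os.sched_getaffinity(0)))
    return os.cpu_count() or 1


def state_path(data_dir, start_url):
    """Map a starting URL to a stable state file name (hypothetical scheme)."""
    digest = hashlib.sha256(start_url.encode("utf-8")).hexdigest()[:16]
    return Path(data_dir) / f"crawl-state-{digest}.json"


def load_or_start(data_dir, start_url):
    """Resume from a matching state file if one exists, else start fresh."""
    path = state_path(data_dir, start_url)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"start_url": start_url, "pending": [start_url], "visited": []}


def save_state(data_dir, state):
    """Write the crawl state next to the other files in the data directory."""
    path = state_path(data_dir, state["start_url"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state), encoding="utf-8")
    return path
```

With `/data` as `data_dir`, a state file written before a container restart is found again on the next run for the same starting URL, while a different starting URL begins a new crawl.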