# WDW Sitemap And Import Tools

This repository combines two internal tools into one web application and one Docker image:

- `Sitemap Generator`
- `Page Importer`

The application uses Streamlit and presents both tools behind a single URL with two tabs at the top of the page.

## What It Does

### Sitemap Generator

- Crawls a site from a starting URL
- Discovers URLs from page links and XML sitemaps
- Exports a sitemap CSV
- Saves crawl state and logs so a crawl can be resumed later

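The URL discovery step listed above can be sketched with the Python standard library alone. This is only an illustration under assumptions, not the code in `Sitemap Builder/`: the names `LinkCollector`, `urls_from_html`, and `urls_from_sitemap` are hypothetical, and fetching, rate limiting, and same-site filtering are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org XML format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def urls_from_html(base_url, html):
    """Return absolute, fragment-free http(s) URLs linked from a page."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in found:
            found.append(absolute)
    return found


def urls_from_sitemap(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]
```

Both functions return plain lists of URLs, so a crawler could merge them into one queue regardless of where each URL was found.
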
### Page Importer

- Reads a CSV of submitted URLs
- Scrapes page content
- Lets you review the extracted content
- Exports a WordPress WXR XML import file

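As a rough sketch of the first and last steps above, the following reads URLs from a CSV and writes a very small WXR document. Several things here are assumptions rather than the importer's actual behavior: the CSV column name `url`, the helper names `read_submitted_urls` and `build_wxr`, and the choice to emit every page as a draft. A real import file typically carries more fields than this subset.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export (WXR 1.2) files.
WP_NS = "http://wordpress.org/export/1.2/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("wp", WP_NS)
ET.register_namespace("content", CONTENT_NS)


def read_submitted_urls(csv_text, column="url"):
    """Return non-empty values from the assumed `url` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


def build_wxr(pages):
    """Build a minimal WXR string from dicts with `title`, `slug`, and `html` keys."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, f"{{{WP_NS}}}wxr_version").text = "1.2"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        # ElementTree escapes the HTML instead of wrapping it in CDATA;
        # both forms parse to the same text.
        ET.SubElement(item, f"{{{CONTENT_NS}}}encoded").text = page["html"]
        ET.SubElement(item, f"{{{WP_NS}}}post_name").text = page["slug"]
        ET.SubElement(item, f"{{{WP_NS}}}post_type").text = "page"
        ET.SubElement(item, f"{{{WP_NS}}}status").text = "draft"  # assumption: import as drafts
    return ET.tostring(rss, encoding="unicode")
```

Keeping the CSV reader separate from the XML builder mirrors the review step in between: scraped content can be inspected or corrected before anything is exported.
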
## Project Layout

- `app.py`: top-level Streamlit app with both tabs
- `requirements.txt`: shared Python dependencies for the combined app
- `Dockerfile`: single image for the combined tool
- `.gitea/workflows/docker-image.yml`: Gitea Actions workflow for Docker builds
- `Sitemap Builder/`: sitemap crawler logic
- `Page Importer/`: WordPress import logic

## Run Locally

### Linux or macOS

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

## Docker

Build the image:

```bash
docker build -t wdw-sitemap-and-importer .
```

Run the container:

```bash
docker run --rm -p 8501:8501 -v wdw-tools-data:/data wdw-sitemap-and-importer
```

Then open:

```text
http://localhost:8501
```

The mounted `/data` volume stores sitemap CSV files, crawl state files, and crawl logs so sitemap jobs can survive container restarts.

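To illustrate why a persistent `/data` directory makes resuming possible, here is one way crawl state could be saved and looked up again. This is a sketch under assumptions: the `DATA_DIR` environment variable, the file naming scheme, the JSON layout, and the function names are all hypothetical and may differ from what the crawler actually writes.

```python
import hashlib
import json
import os
from pathlib import Path

# Assumption: the data directory is configurable and defaults to /data in Docker.
DATA_DIR = Path(os.environ.get("DATA_DIR", "/data"))


def state_path(start_url, data_dir=DATA_DIR):
    """Map a starting URL to a stable state file name (hypothetical scheme)."""
    digest = hashlib.sha256(start_url.encode("utf-8")).hexdigest()[:16]
    return Path(data_dir) / f"crawl-state-{digest}.json"


def save_state(start_url, visited, queue, data_dir=DATA_DIR):
    """Write the state to a temp file, then rename it over the real one."""
    path = state_path(start_url, data_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    payload = {"start_url": start_url, "visited": sorted(visited), "queue": list(queue)}
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, path)  # a restart never sees a half-written state file
    return path


def load_state(start_url, data_dir=DATA_DIR):
    """Return the saved state for this start URL, or None if no matching file exists."""
    path = state_path(start_url, data_dir)
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

Because the file name is derived from the starting URL, a new container that mounts the same volume would find the earlier state for that URL and could continue from the saved queue.
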
## Gitea Automation

The workflow file is:

```text
.gitea/workflows/docker-image.yml
```

It runs on pushes to `main` and on manual workflow dispatch.

The workflow always builds the Docker image. If these secrets are configured in Gitea, it also logs in and pushes the image to your registry:

- `GITEA_REGISTRY_URL`
- `GITEA_REGISTRY_USERNAME`
- `GITEA_REGISTRY_PASSWORD`

Published tags:

- `${REGISTRY}/wdw-sitemap-and-importer:<commit-sha>`
- `${REGISTRY}/wdw-sitemap-and-importer:latest`

If the registry secrets are not configured, the workflow still performs the build as validation but skips the push steps.

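The build-always, push-only-with-secrets rule can be modeled in a few lines. The workflow itself is YAML; this Python sketch only mirrors the decision it describes, and the function name `plan_publish` and the assumption that `${REGISTRY}` is the value of `GITEA_REGISTRY_URL` are illustrative rather than taken from the workflow file.

```python
REQUIRED_SECRETS = (
    "GITEA_REGISTRY_URL",
    "GITEA_REGISTRY_USERNAME",
    "GITEA_REGISTRY_PASSWORD",
)
IMAGE_NAME = "wdw-sitemap-and-importer"


def plan_publish(env, commit_sha):
    """Return the tags to push, or an empty list when any secret is missing."""
    if not all(env.get(name) for name in REQUIRED_SECRETS):
        return []  # build-only run: the image is validated but not pushed
    registry = env["GITEA_REGISTRY_URL"]  # assumption: this is what ${REGISTRY} expands to
    return [
        f"{registry}/{IMAGE_NAME}:{commit_sha}",
        f"{registry}/{IMAGE_NAME}:latest",
    ]
```

An empty result corresponds to the validation-only path, so a missing or blank secret never causes a failed login attempt.
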
## Notes

- Sitemap output files are written under `/data` in Docker.
- The sitemap crawler can resume previous runs when a matching crawl state file exists.
- The importer keeps its existing scraping and WordPress export behavior, but it now runs inside the shared interface instead of as a separate app.