first commit

2026-04-09 10:42:10 -07:00
commit ead872a0a5
19 changed files with 2783 additions and 0 deletions
@@ -0,0 +1,63 @@
+# Page Importer
+
+This folder contains the WordPress import tool used by the combined application in the repository root.
+
+The importer still uses Streamlit internally, but it is now rendered as the `Page Importer` tab inside the shared app rather than being the main entrypoint for the repository.
+
+## Features
+
+- Upload a CSV of submitted URLs
+- Choose the URL column and optional title override column
+- Optionally map post type from the CSV or force a single post type
+- Scrape only the listed URLs
+- Extract title, publish date, author, body HTML, categories, and tags
+- Retry failed rows
+- Export a WordPress WXR XML file
+
+## Recommended Usage
+
+Run the root application from this folder:
+
+```bash
+streamlit run ../app.py
+```
+
+Or run the combined Docker container from the repository root.
+
+## Standalone Usage
+
+If you need to run this importer by itself:
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+streamlit run app.py
+```
+
+On Windows PowerShell:
+
+```powershell
+python -m venv .venv
+.venv\Scripts\Activate.ps1
+pip install -r requirements.txt
+streamlit run app.py
+```
+
+## CSV Input
+
+The app accepts CSV files with any columns. You choose:
+
+- the URL column to scrape
+- an optional title or name column to override the scraped title
+- an optional post type column with values like `post` or `page`
+- an optional category column whose values are appended during export
+
+You can also add manual categories in the sidebar to append them to every exported item.
+
+## Notes
+
+- Exported posts default to `draft` unless changed in the UI
+- Image and link URLs continue to point to the source site
+- Some themes need the heuristic fallback. The `Force heuristic scraping` option skips the JSON-LD-first extraction and relies on page structure instead
+- In the combined app, dependencies come from the root `requirements.txt`