Page Importer
This folder contains the WordPress import tool used by the combined application in the repository root.
The importer still uses Streamlit internally, but it is now rendered as the Page Importer tab inside the shared app rather than being the main entrypoint for the repository.
Features
- Upload a CSV of submitted URLs
- Choose the URL column and optional title override column
- Optionally map post type from the CSV or force a single post type
- Scrape only the listed URLs
- Extract title, publish date, author, body HTML, categories, and tags
- Retry failed rows
- Export a WordPress WXR XML file
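As a rough illustration of the export step, the sketch below builds a very small WXR-style document with the Python standard library. It is not the importer's actual code: the `build_wxr` function, the dictionary keys, and the slug rule for `nicename` are hypothetical, and the namespace URIs are the ones commonly seen in WXR 1.2 exports. A file that WordPress accepts will usually need more channel and item fields than are shown here.

```python
import xml.etree.ElementTree as ET

# Namespace URIs commonly seen in WXR 1.2 files (assumed, not taken from this repo).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{NS[prefix]}}}{tag}"


def build_wxr(items):
    """Build a minimal WXR-like XML string from scraped rows (hypothetical shape)."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for row in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["title"]
        ET.SubElement(item, q("dc", "creator")).text = row.get("author", "")
        # ElementTree escapes the HTML body; real exports often use CDATA instead.
        ET.SubElement(item, q("content", "encoded")).text = row["body_html"]
        ET.SubElement(item, q("wp", "post_type")).text = row.get("post_type", "post")
        # Mirrors the documented default of exporting as draft.
        ET.SubElement(item, q("wp", "status")).text = row.get("status", "draft")
        for name in row.get("categories", []):
            slug = name.lower().replace(" ", "-")  # simplified slug rule
            cat = ET.SubElement(item, "category", {"domain": "category", "nicename": slug})
            cat.text = name
    return ET.tostring(rss, encoding="unicode")
```

Calling `build_wxr([{"title": "Hello", "body_html": "<p>Hi</p>"}])` returns an XML string with one item whose status is draft and whose post type is post.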
Recommended Usage
Run the root application:
streamlit run ../app.py
Or run the combined Docker container from the repository root.
Standalone Usage
If you need to run this importer by itself:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
On Windows PowerShell:
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
CSV Input
The app accepts CSV files with any columns. You choose:
- the URL column to scrape
- an optional title or name column to override the scraped title
- an optional post type column with values like post or page
- an optional category column whose values are appended during export
You can also add manual categories in the sidebar to append them to every exported item.
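The column choices described above could be applied to each CSV row roughly as follows. This is a minimal sketch, not the importer's implementation: `rows_to_jobs`, its parameters, and the returned dictionary keys are hypothetical names. It also assumes that a blank post type falls back to post and that each category cell holds a single category.

```python
import csv
import io


def rows_to_jobs(csv_text, url_col, title_col=None, type_col=None,
                 category_col=None, forced_type=None, manual_categories=()):
    """Turn CSV text into a list of scrape jobs using the chosen columns."""
    jobs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(url_col) or "").strip()
        if not url:
            continue  # nothing to scrape for this row

        title_override = None
        if title_col:
            title_override = (row.get(title_col) or "").strip() or None

        # A forced post type wins over the per-row column.
        if forced_type:
            post_type = forced_type
        elif type_col:
            post_type = (row.get(type_col) or "").strip().lower() or "post"
        else:
            post_type = "post"  # assumed default

        categories = []
        if category_col:
            value = (row.get(category_col) or "").strip()
            if value:
                categories.append(value)
        # Manual categories are appended to every item, without duplicates.
        for name in manual_categories:
            if name not in categories:
                categories.append(name)

        jobs.append({
            "url": url,
            "title_override": title_override,
            "post_type": post_type,
            "categories": categories,
        })
    return jobs
```

For example, with columns named link, name, kind, and topic, `rows_to_jobs(text, "link", title_col="name", type_col="kind", category_col="topic", manual_categories=["Imported"])` yields one job per row that has a URL.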
Notes
- Exported posts default to draft unless changed in the UI
- Image and link URLs remain pointed at the source site
- Some themes need heuristic fallback. The Force heuristic scraping option skips JSON-LD-first extraction and relies on page structure
- In the combined app, dependencies come from the root requirements.txt
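To make the JSON-LD-first versus heuristic distinction concrete, here is a small standard-library sketch for the title field only. It is an illustration under assumptions, not the importer's code: `extract_title`, the `force_heuristic` parameter, and the choice of the first h1 and then the title element as the heuristic are hypothetical. The `headline` key is the schema.org property often found in article JSON-LD.

```python
import json
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collects JSON-LD script bodies plus the first <h1> and <title> text."""

    _END_TAG = {"json_ld": "script", "h1": "h1", "title": "title"}

    def __init__(self):
        super().__init__()
        self.json_ld = []
        self.h1 = None
        self.title = None
        self._capture = None  # which kind of text is currently being buffered
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._capture = "json_ld"
        elif tag == "h1" and self.h1 is None:
            self._capture = "h1"
        elif tag == "title" and self.title is None:
            self._capture = "title"
        else:
            return  # nested tags such as <em> keep the current capture going
        self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._END_TAG.get(self._capture):
            return
        text = "".join(self._buffer).strip()
        if self._capture == "json_ld":
            self.json_ld.append(text)
        else:
            setattr(self, self._capture, text)
        self._capture = None


def extract_title(html, force_heuristic=False):
    """Return (title, source), trying JSON-LD first unless force_heuristic is set."""
    parser = _Collector()
    parser.feed(html)
    parser.close()
    if not force_heuristic:
        for raw in parser.json_ld:
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                continue  # malformed structured data: try the next block
            for node in data if isinstance(data, list) else [data]:
                if isinstance(node, dict) and node.get("headline"):
                    return node["headline"], "json-ld"
    # Heuristic fallback based on page structure.
    return parser.h1 or parser.title, "heuristic"
```

With a page that has both a JSON-LD headline and an h1, `extract_title(page)` returns the structured headline, while `extract_title(page, force_heuristic=True)` returns the visible heading text.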