Page Importer

This folder contains the WordPress import tool used by the combined application in the repository root.

The importer still uses Streamlit internally, but it is now rendered as the Page Importer tab inside the shared app rather than being the main entrypoint for the repository.

Features

  • Upload a CSV of submitted URLs
  • Choose the URL column and optional title override column
  • Optionally map post type from the CSV or force a single post type
  • Scrape only the listed URLs
  • Extract title, publish date, author, body HTML, categories, and tags
  • Retry failed rows
  • Export a WordPress WXR XML file
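
As a rough illustration of the export step, the sketch below builds a minimal WXR-style document with the standard library. It is not the importer's actual exporter: the `build_wxr` function and the item dictionary keys (`title`, `author`, `body_html`, `post_type`, `status`, `categories`) are hypothetical names chosen for this example, and a file meant for a real WordPress import would normally carry more channel metadata.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordPress WXR 1.2 export files.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{tag}"


def build_wxr(items):
    """Build a minimal WXR string from item dicts (hypothetical shape)."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, q("dc", "creator")).text = item.get("author", "")
        # ElementTree escapes the HTML body instead of wrapping it in CDATA.
        ET.SubElement(node, q("content", "encoded")).text = item.get("body_html", "")
        ET.SubElement(node, q("wp", "post_type")).text = item.get("post_type", "post")
        # Assumed default of "draft" when no status is given.
        ET.SubElement(node, q("wp", "status")).text = item.get("status", "draft")
        for name in item.get("categories", []):
            slug = name.lower().replace(" ", "-")  # simplistic slug, for illustration only
            cat = ET.SubElement(node, "category", {"domain": "category", "nicename": slug})
            cat.text = name
    return ET.tostring(rss, encoding="unicode")
```

Calling `build_wxr([{"title": "Hello", "body_html": "<p>Hi</p>"}])` returns an XML string with one `item` whose `wp:status` is `draft` and whose `wp:post_type` is `post`.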

Run the root application from this folder:

streamlit run ../app.py

Or run the combined Docker container from the repository root.

Standalone Usage

If you need to run this importer by itself:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py

On Windows PowerShell:

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py

CSV Input

The app accepts CSV files with any columns. You choose:

  • the URL column to scrape
  • an optional title or name column to override the scraped title
  • an optional post type column with values like post or page
  • an optional category column whose values are appended during export

You can also add manual categories in the sidebar to append them to every exported item.
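
The column choices above could be applied to each CSV row roughly as follows. This is a sketch under assumptions, not the importer's code: `map_rows`, `cell`, and the job dictionary keys are hypothetical, the `post` fallback for rows without a post type is assumed, and each category cell is treated as a single category name.

```python
import csv
import io


def cell(row, col):
    """Return the stripped value of a column, or "" if it is unset or empty."""
    if col is None:
        return ""
    return (row.get(col) or "").strip()


def map_rows(csv_text, url_col, title_col=None, post_type_col=None,
             forced_post_type=None, category_col=None, manual_categories=()):
    """Turn CSV text into a list of import jobs using the chosen columns."""
    jobs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = cell(row, url_col)
        if not url:
            continue  # skip rows with no URL to scrape
        # A forced post type wins over the per-row column; "post" is an assumed default.
        post_type = forced_post_type or cell(row, post_type_col) or "post"
        categories = []
        if cell(row, category_col):
            categories.append(cell(row, category_col))
        # Manual categories are appended to every item.
        categories.extend(manual_categories)
        jobs.append({
            "url": url,
            "title_override": cell(row, title_col) or None,
            "post_type": post_type,
            "categories": categories,
        })
    return jobs
```

For example, `map_rows(text, "link", title_col="name", manual_categories=["Imported"])` scrapes the `link` column, uses `name` as the title override where it is filled in, and tags every job with `Imported`.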

Notes

  • Exported posts default to draft unless changed in the UI
  • Image and link URLs are not rewritten; they still point at the source site
  • Some themes need the heuristic fallback. The Force heuristic scraping option skips the JSON-LD-first extraction and relies on page structure instead
  • In the combined app, dependencies come from the root requirements.txt
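
To make the JSON-LD-first idea concrete, here is a minimal sketch that extracts only a title. It is an illustration, not the importer's implementation: `extract_title`, `_PageScanner`, and the `force_heuristic` parameter are hypothetical names, reading the schema.org `headline` property is an assumed choice, and the heuristic (first `<h1>`, then `<title>`) is a stand-in for whatever page-structure rules the tool actually applies.

```python
import json
from html.parser import HTMLParser


class _PageScanner(HTMLParser):
    """Collect JSON-LD script bodies and the first <h1> and <title> text."""

    def __init__(self):
        super().__init__()
        self.json_ld, self.h1, self.title = [], "", ""
        self._capture = None  # which bucket the current text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._capture = "json_ld"
            self.json_ld.append("")
        elif tag == "h1" and not self.h1:
            self._capture = "h1"
        elif tag == "title" and not self.title:
            self._capture = "title"

    def handle_endtag(self, tag):
        if tag in ("script", "h1", "title"):
            self._capture = None

    def handle_data(self, data):
        if self._capture == "json_ld":
            self.json_ld[-1] += data
        elif self._capture in ("h1", "title"):
            setattr(self, self._capture, getattr(self, self._capture) + data)


def extract_title(html, force_heuristic=False):
    """Try JSON-LD first unless forced to use page structure only."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    if not force_heuristic:
        for raw in scanner.json_ld:
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                continue  # malformed JSON-LD: try the next block
            for node in data if isinstance(data, list) else [data]:
                if isinstance(node, dict) and isinstance(node.get("headline"), str):
                    return node["headline"].strip()
    # Heuristic fallback: prefer the first <h1>, then the <title> element.
    return (scanner.h1 or scanner.title).strip()
```

With `force_heuristic=True` the JSON-LD blocks are ignored even when present, which is useful when a theme emits structured data that does not match the visible page.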