# Page Importer

This folder contains the WordPress import tool used by the combined application in the repository root.

The importer still uses Streamlit internally, but it is now rendered as the `Page Importer` tab inside the shared app rather than being the main entrypoint for the repository.

## Features

- Upload a CSV of submitted URLs
- Choose the URL column and optional title override column
- Optionally map post type from the CSV or force a single post type
- Scrape only the listed URLs
- Extract title, publish date, author, body HTML, categories, and tags
- Retry failed rows
- Export a WordPress WXR XML file

## Recommended Usage

Run the root application from this folder:

```bash
streamlit run ../app.py
```

Or run the combined Docker container from the repository root.

## Standalone Usage

If you need to run this importer by itself:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

On Windows PowerShell:

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
streamlit run app.py
```

## CSV Input

The app accepts CSV files with any columns. You choose:

- the URL column to scrape
- an optional title or name column to override the scraped title
- an optional post type column with values like `post` or `page`
- an optional category column whose values are appended during export

You can also add manual categories in the sidebar to append them to every exported item.

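
The column choices above can be sketched as a small mapping function. This is an illustration only, not the importer's real code: `map_rows`, `_cell`, and the output keys are hypothetical names, and defaulting the post type to `post` when nothing is set is an assumption.

```python
import csv
import io


def _cell(row, column):
    """Return a stripped cell value, or "" if the column is unset or empty."""
    if not column:
        return ""
    return (row.get(column) or "").strip()


def map_rows(csv_text, url_col, title_col=None, type_col=None,
             category_col=None, forced_type=None, manual_categories=()):
    """Turn CSV rows into import jobs based on the chosen columns."""
    jobs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = _cell(row, url_col)
        if not url:
            continue  # nothing to scrape for this row

        # The per-row category comes first, then the manual ones.
        categories = []
        if _cell(row, category_col):
            categories.append(_cell(row, category_col))
        categories.extend(manual_categories)

        jobs.append({
            "url": url,
            # None means "keep the scraped title".
            "title_override": _cell(row, title_col) or None,
            # A forced type wins over the CSV value; "post" is an assumed default.
            "post_type": forced_type or _cell(row, type_col) or "post",
            "categories": categories,
        })
    return jobs
```

For example, with a CSV whose header is `link,name,kind,topic`, calling `map_rows(text, url_col="link", title_col="name", type_col="kind", category_col="topic", manual_categories=["Imported"])` yields one job per row that has a URL, each carrying its own category followed by `Imported`.
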
## Notes

- Exported posts default to `draft` unless changed in the UI
- Image and link URLs remain pointed at the source site
- Some themes need heuristic fallback. The `Force heuristic scraping` option skips JSON-LD-first extraction and relies on page structure
- In the combined app, dependencies come from the root `requirements.txt`

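
To make the JSON-LD-first versus heuristic distinction concrete, here is a rough sketch of how a title lookup might fall back from structured data to page structure. It uses only the standard library and is an assumption about the general approach, not the importer's implementation: `extract_title`, `_PageScanner`, and the choice of the first `<h1>` as the heuristic are all hypothetical.

```python
import json
from html.parser import HTMLParser


class _PageScanner(HTMLParser):
    """Collect JSON-LD script bodies and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.json_ld_blocks = []
        self.h1_text = ""
        self._capture = None  # "jsonld", "h1", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._capture, self._buffer = "jsonld", []
        elif tag == "h1" and not self.h1_text:
            self._capture, self._buffer = "h1", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture == "jsonld" and tag == "script":
            self.json_ld_blocks.append("".join(self._buffer))
            self._capture = None
        elif self._capture == "h1" and tag == "h1":
            self.h1_text = "".join(self._buffer).strip()
            self._capture = None


def extract_title(html, force_heuristic=False):
    """Return (title, source); try JSON-LD first unless force_heuristic is set."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()

    if not force_heuristic:
        for block in scanner.json_ld_blocks:
            try:
                data = json.loads(block)
            except json.JSONDecodeError:
                continue  # malformed JSON-LD: try the next block
            for item in data if isinstance(data, list) else [data]:
                if isinstance(item, dict) and item.get("headline"):
                    return item["headline"], "json-ld"

    # Heuristic fallback: rely on page structure (here, the first <h1>).
    if scanner.h1_text:
        return scanner.h1_text, "heuristic"
    return None, "none"
```

Passing `force_heuristic=True` mirrors the idea behind the `Force heuristic scraping` option: structured data is ignored even when present, which can help on themes whose JSON-LD is missing or misleading.
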