Generator directory format #6

newsch merged 1 commit from generator-compat into main 2023-07-10 14:34:21 +00:00

1 commit

Author SHA1 Message Date
Evan Lloyd New-Schmidt
382d351740 Write to generator-compatible folder structure
The map generator expects the folder structure produced by the current
scraper in order to add article content to the mwm files.

- Article HTML is written to the article's QID directory under
  `wikidata`.
- Directories are created for any matched titles and symlinked to the
  article's wikidata directory.
- Articles without a QID are written directly to their article title
  directory.
- Article titles containing `/` are not escaped, so a single title can
  span multiple nested subdirectories (e.g.
  `Baltimore/Washington_International_Airport`).
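
As a rough illustration of the last point, the sketch below maps a title to its directory. It is not the code from this commit: `article_dir` is a hypothetical helper, and replacing spaces with underscores is an assumption based on the names shown in the hierarchy.

```python
from pathlib import Path


def article_dir(base: Path, lang: str, title: str) -> Path:
    """Hypothetical helper: directory for an article title on one wiki."""
    # Assumption: spaces become underscores, as in the example hierarchy.
    name = title.replace(" ", "_")
    # `/` is deliberately left unescaped, so pathlib treats it as a
    # separator and the title becomes nested directories.
    # Caveat: a title starting with `/` would be treated as an absolute
    # path and discard `base`; that case is not handled in this sketch.
    return base / f"{lang}.wikipedia.org" / "wiki" / name


if __name__ == "__main__":
    path = article_dir(Path("."), "en", "Baltimore/Washington International Airport")
    print(path.as_posix())
```

Because the separator is kept, `Baltimore` ends up as a plain directory that contains `Washington_International_Airport`.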

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...
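
A minimal sketch of how such a layout could be produced, under stated assumptions: `write_article` and its parameters are hypothetical names, not this project's API, and the use of relative symlinks is a guess (the commit only says the title directories are symlinked).

```python
import os
from pathlib import Path
from typing import Optional, Sequence, Tuple


def write_article(
    base: Path,
    lang: str,
    html: str,
    titles: Sequence[Tuple[str, str]],
    qid: Optional[str] = None,
) -> Path:
    """Write `<lang>.html` into the layout above (hypothetical sketch).

    `titles` holds (wiki language, article title) pairs matched for the article.
    """
    if qid is not None:
        # Articles with a QID live under wikidata/<QID>/.
        target = base / "wikidata" / qid
        target.mkdir(parents=True, exist_ok=True)
        for wiki_lang, title in titles:
            link = base / f"{wiki_lang}.wikipedia.org" / "wiki" / title
            # A `/` in the title yields intermediate directories.
            link.parent.mkdir(parents=True, exist_ok=True)
            if not link.exists() and not link.is_symlink():
                # Assumption: relative links, so the tree can be moved.
                rel = os.path.relpath(target, link.parent)
                link.symlink_to(rel, target_is_directory=True)
    else:
        # Without a QID, write straight into the first title's directory.
        wiki_lang, title = titles[0]
        target = base / f"{wiki_lang}.wikipedia.org" / "wiki" / title
        target.mkdir(parents=True, exist_ok=True)
    out = target / f"{lang}.html"
    out.write_text(html, encoding="utf-8")
    return out
```

With this sketch, writing several languages for the same QID fills one wikidata directory, and every matched title directory shows the same set of files through its symlink.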

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:29:49 -04:00