Generator directory format #6
README.md
@@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
I meant a shell list/array(?), separated by spaces.
One example is a glob, so using a directory and then referencing `$WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz` might be clearer?
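For illustration only, the two styles being compared might look like this in a shell (the file names and the `WIKIPEDIA_DUMP_DIRECTORY` path are made up for the example):

```shell
# A space-separated list held in one variable:
WIKIPEDIA_ENTERPRISE_DUMPS="enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz dewiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz"
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS; do echo "$dump"; done

# A directory plus a glob, which may read more clearly:
WIKIPEDIA_DUMP_DIRECTORY=~/wikipedia_dumps
for dump in $WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz; do echo "$dump"; done
```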
It doesn't need to be; the example script reads more clearly to me if it's in the context of the `intermediate_data` directory. It could also be run as `../../../wikiparser/target/release/om-wikiparser`, with `cargo run --release` from the wikiparser directory, or anything else.
Pros:
- Independent of the generator process. Can be run as soon as the planet file is updated.

Cons:
- Need to keep the OSM query in sync with the generator's own multi-step filtering and transformation process.
- Need to match the generator's multi-step processing of urls exactly.

When I did this earlier, it was with the `osmfilter` tool; I only tested it on the Yukon region, and it output _more_ entries than the generator did.
I can create an issue for this, but the rough steps to get that working are:
- Convert the `osmfilter` query to an `osmium` command so it can work on `pbf` files directly (a rough sketch follows this list).
- Dig into the generator's map processing to try to improve querying.
- Compare processing of a complete planet with the generator's output.
- Write conversion of the `osmium` output for `wikiparser` to use.
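A hedged sketch of what that first step might look like with `osmium-tool` (the tag matchers and output handling here are guesses and would need to be checked against the generator's actual filters):

```shell
# Keep only nodes/ways/relations that carry wikidata/wikipedia tags.
osmium tags-filter planet-latest.osm.pbf nwr/wikidata nwr/wikipedia -o tagged.osm.pbf

# Dump to the line-based OPL format and pull out QIDs for om-wikiparser.
osmium cat tagged.osm.pbf -f opl \
  | grep -o 'wikidata=Q[0-9]*' \
  | cut -d= -f2 | sort -u > wikidata_ids.txt
```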
Sorry, old habits die hard.
It's better to mention list item separators explicitly and provide some example for clarity.
...then why suggest installing the tool at PATH?
1. Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
2. What's wrong in outputting more URLs? I assume that the generator may filter now OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right? Do you remember how big is the percent of "unnecessary" articles?
3. osmfilter can work with o5m, osmconvert can process pbf. There is also https://docs.rs/osmpbf/latest/osmpbf/ for direct pbf processing if it makes the approach simpler. How good is the osmium tool compared to other options?

It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term. WDYT?
So that you can always reference it as `om-wikiparser` wherever you are, without worrying about where it is relative to you, or copying it into your working directory.
I meant this as an explanation of how to use it, not a step-by-step for what to run on the build server filesystem.
Maybe writing a shell script to use on the maps server instead would be helpful?
Would you prefer:
```shell
# Transform intermediate files from generator.
cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
# Begin extraction.
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
do
  tar xzOf $dump | $WIKIPARSER_DIR/target/release/om-wikiparser \
    --wikidata-ids wikidata_ids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    descriptions/
done
```
or
```shell
# Transform intermediate files from generator.
maps_build=~/maps_build/$BUILD_DATE/intermediate_data
cut -f 2 $maps_build/id_to_wikidata.csv > $maps_build/wikidata_ids.txt
tail -n +2 $maps_build/wiki_urls.txt | cut -f 3 > $maps_build/wikipedia_urls.txt
# Begin extraction.
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
do
  tar xzOf $dump | ./target/release/om-wikiparser \
    --wikidata-ids $maps_build/wikidata_ids.txt \
    --wikipedia-urls $maps_build/wikipedia_urls.txt \
    $maps_build/descriptions/
done
```
1. Can it be wrapped in a helper script that can be easily customized and run on the generator, maybe directly from the wikiparser repo? :)
2. `cargo run -r` may be even better instead of a path to binary :) But it's also ok to hard-code the path or use `$WIKIPARSER_BINARY` var.

Think about me testing your code soon on a production server. Less surprises = less stress ;-)
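A minimal sketch of what such a helper could look like, assuming it is run from a wikiparser repo checkout (`id_to_wikidata.csv`, `wiki_urls.txt`, and the `descriptions/` output directory are taken from this thread; the script name and argument handling are placeholders):

```shell
#!/usr/bin/env bash
# run_wikiparser.sh — run from the wikiparser repo checkout on the maps server.
set -euo pipefail

BUILD_DIR=${1:?usage: run_wikiparser.sh <maps_build>/intermediate_data <dump>...}
shift

# Transform intermediate files from the generator.
cut -f 2 "$BUILD_DIR/id_to_wikidata.csv" > "$BUILD_DIR/wikidata_ids.txt"
tail -n +2 "$BUILD_DIR/wiki_urls.txt" | cut -f 3 > "$BUILD_DIR/wikipedia_urls.txt"

# Extract articles from each dump passed on the command line.
for dump in "$@"; do
    tar xzOf "$dump" | cargo run --release -- \
        --wikidata-ids "$BUILD_DIR/wikidata_ids.txt" \
        --wikipedia-urls "$BUILD_DIR/wikipedia_urls.txt" \
        "$BUILD_DIR/descriptions/"
done
```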
Btw, it may make sense to also print/measure time taken to execute some commands after the first run on the whole planet, to have some reference starting values.
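For example, one possible way to capture a reference number for a single dump (`pv` is already used in the usage comment in `src/main.rs`; the file name is just an example):

```shell
# Wall-clock time for one dump, with live throughput shown by pv.
time (pv enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz | tar xzO \
  | om-wikiparser \
      --wikidata-ids wikidata_ids.txt \
      --wikipedia-urls wikipedia_urls.txt \
      descriptions/)
```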
> It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term.

Absolutely agree!

> Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?

I think so, do you mean the wikipedia/wikidata files or the mwm format in general?
By transformations: when I looked at it last, it looked like it was doing some sort of merging of ways/nodes/shapes to get a single parent object.
When I compared the OSM IDs that it output with the Wikidata ids, they didn't match up with what I got from `osmfilter`, even if the urls were the same. Not a problem for the wikiparser, as long as the QIDs/articles are all caught, but it was harder to tell if they were doing the same thing.
As we talked about before, there are also multiple layers of filtering nodes by amenity or other tags, and I only looked at the final layer when I was trying this `osmfilter` approach (based on [`ftypes_matcher.cpp`](https://github.com/organicmaps/organicmaps/blob/982c6aa92d7196a5690dcdc1564e427de7611806/indexer/ftypes_matcher.cpp#L473)).

> What's wrong in outputting more URLs? I assume that the generator may filter now OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right?

As you say, the worst case isn't a problem for the end user, but I want to do more comparisons with the whole planet file to be confident that this is really a superset of them.

> Do you remember how big is the percent of "unnecessary" articles?

That was around 25%, but in the Yukon territory, so not very many nodes, and I would guess not comparable to the planet.

> How good is the osmium tool compared to other options?

I haven't looked into `osmium` much, but my understanding is it is at least as powerful as `osmfilter`/`osmconvert`. I know we talked about using pbfs directly at some point, so that's why I mentioned it.
I will update the README to be more of an explanation, and make another issue/PR for a script that handles the build directory, timing, backtraces, saving logs, etc.
> I think so, do you mean the wikipedia/wikidata files or the mwm format in general?

I meant those files that are required for wikiparser to work. It actually may make sense to keep it in README or some other doc, not in an issue.
## Usage
## Configuring

[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
It defines article sections that are not important for users and should be removed.
It defines article sections that are not important for users and should be removed from the extracted HTML.

## Usage

First, install [the rust language tools](https://www.rust-lang.org/).

For best performance, use `--release` when building or running.

You can run the program from within this directory using `cargo run --release --`.

Alternatively, build it with `cargo build --release`, which places the binary in `./target/release/om-wikiparser`.

Run the program with the `--help` flag to see all supported arguments.

It takes as inputs:
- A wikidata enterprise JSON dump, extracted and connected to `stdin`.
- A file of Wikidata QIDs to extract, one per line (e.g. `Q12345`), passed as the CLI flag `--wikidata-ids`.
- A file of Wikipedia article titles to extract, one per line (e.g. `https://$LANG.wikipedia.org/wiki/$ARTICLE_TITLE`), passed as the CLI flag `--wikipedia-urls`.
- A directory to write the extracted articles to, as a CLI argument.

As an example of usage with the map generator:
- Assuming this program is installed to `$PATH` as `om-wikiparser`.
- Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
  Set `DUMP_DOWNLOAD_DIR` to the location they are downloaded to.
- Run the following from within the `intermediate_data` subdirectory of the maps build directory:

```shell
# Transform intermediate files from generator.
cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser \
    --wikidata-ids wikidata_ids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    descriptions/
done
```

Would a hint about om-wikiparser command line options be helpful?
src/main.rs
@@ -1,27 +1,18 @@
// Usage:
// # prep outputs from map generator
// cut -f 2 ~/Downloads/id_to_wikidata.csv > /tmp/wikidata_ids.txt
// tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 > /tmp/wikipedia_urls.txt
// # feed gzipped tarfile
// pv ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz | tar xzO \
//   | cargo run --release -- \
//     --wikidata-ids /tmp/wikidata_ids.txt \
//     --wikipedia-urls /tmp/wikipedia_urls.txt \
//     output_dir
use std::{
    fs::{create_dir, File},
    fs::{self, File},
    io::{stdin, BufRead, Write},
    os::unix,
    path::{Path, PathBuf},
};

use anyhow::bail;
use anyhow::{anyhow, bail, Context};
use clap::Parser;
#[macro_use]
extern crate log;

use om_wikiparser::{
    html::simplify,
    wm::{is_wikidata_match, is_wikipedia_match, parse_wikidata_file, parse_wikipedia_file, Page},
    wm::{parse_wikidata_file, parse_wikipedia_file, Page, WikipediaTitleNorm},
};

#[derive(Parser)]
@@ -33,33 +24,115 @@ struct Args {
    wikipedia_urls: Option<PathBuf>,
}

fn write(dir: impl AsRef<Path>, page: Page) -> anyhow::Result<()> {
    let Some(qid) = page.main_entity.map(|e| e.identifier) else {
        // TODO: handle and still write
        bail!("Page in list but without wikidata qid: {:?} ({})", page.name, page.url);
/// Determine the directory to write the article contents to, create it, and create any necessary symlinks to it.
fn create_article_dir(
    base: impl AsRef<Path>,
    page: &Page,
    redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
    let base = base.as_ref();
How are they processed in the generator?
For example?
The generator only works with complete wikipedia urls (and wikidata QIDs). The OSM tags like `en:Article_Title` are converted to urls somewhere early in the OSM ingestion process.
It [dumps the urls to a file](https://github.com/organicmaps/organicmaps/blob/acc7c0547db4285dd8841ae7f98811268e38d908/generator/wiki_url_dumper.cpp#L63) for the descriptions scraper, then when it adds them to the mwm files [it strips the protocol, appends the url to the base directory, and looks for language html files in the folder at that location](https://github.com/organicmaps/organicmaps/blob/34bbdf6a2f077b3d629b3f17e8e05bd18a4e4110/generator/descriptions_section_builder.cpp#L142).
It doesn't do any special processing for articles with a slash in the title; they are just another subdirectory down. I'll update the diagram to show that.
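To make the mapping concrete, a hypothetical sketch of the resulting layout (paths assumed from the `get_dir` implementations and doc comments in this PR, not taken from the generator; `$BASE` stands for the output directory, e.g. `descriptions/`):

```shell
# https://en.wikipedia.org/wiki/Article_Title  ->  $BASE/en.wikipedia.org/wiki/Article_Title/en.html
# https://de.wikipedia.org/wiki/Breil/Brigels  ->  $BASE/de.wikipedia.org/wiki/Breil/Brigels/de.html
# (a slash in the title is just one more directory level)
ls "$BASE/en.wikipedia.org/wiki/Article_Title/"
# en.html  de.html  ...
```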
![]() The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists. The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.
![]() Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it. Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it.
![]() I'm working on a list of changes that would be helpful I'm working on a list of changes that would be helpful
    let mut redirects = redirects.into_iter();

    let main_dir = match page.wikidata() {
        None => {
            // Write to wikipedia title directory.
            // Prefer first redirect, fall back to page title if none exist
            info!("Page without wikidata qid: {:?} ({})", page.name, page.url);
            redirects
                .next()
                .or_else(|| match page.title() {
                    Ok(title) => Some(title),
                    Err(e) => {
                        warn!("Unable to parse title for page {:?}: {:#}", page.name, e);
                        None
                    }
                })
                // hard fail when no titles can be parsed
                .ok_or_else(|| anyhow!("No available titles for page {:?}", page.name))?
                .get_dir(base.to_owned())
        }
        Some(qid) => {
            // Otherwise use wikidata as main directory and symlink from wikipedia titles.
            qid.get_dir(base.to_owned())
        }
    };

    let mut filename = dir.as_ref().to_owned();
    filename.push(qid);
    if main_dir.is_symlink() {
        fs::remove_file(&main_dir)
            .with_context(|| format!("removing old link for main directory {:?}", &main_dir))?;
    }
    fs::create_dir_all(&main_dir)
        .with_context(|| format!("creating main directory {:?}", &main_dir))?;

    // Write symlinks to main directory.
    // TODO: Only write redirects that we care about.
    for title in redirects {
        let wikipedia_dir = title.get_dir(base.to_owned());

        // Build required directory.
        //
        // Possible states from previous run:
        // - Does not exist (and is not a symlink)
        // - Exists, is a directory
        // - Exists, is a valid symlink to correct location
        // - Exists, is a valid symlink to incorrect location
        if wikipedia_dir.exists() {
            if wikipedia_dir.is_symlink() {
                // Only replace if not valid
                if fs::read_link(&wikipedia_dir)? == main_dir {
                    continue;
                }
                fs::remove_file(&wikipedia_dir)?;
            } else {
                fs::remove_dir_all(&wikipedia_dir)?;
            }
        } else {
            // titles can contain `/`, so ensure necessary subdirs exist
            let parent_dir = wikipedia_dir.parent().unwrap();
            fs::create_dir_all(parent_dir)
                .with_context(|| format!("creating wikipedia directory {:?}", parent_dir))?;
        }

        unix::fs::symlink(&main_dir, &wikipedia_dir).with_context(|| {
            format!(
                "creating symlink from {:?} to {:?}",
                wikipedia_dir, main_dir
            )
        })?;
    }

    Ok(main_dir)
}

/// Write selected article to disk.
///
/// - Write page contents to wikidata page (`wikidata.org/wiki/QXXX/lang.html`).
/// - If the page has no wikidata qid, write contents to wikipedia location (`lang.wikipedia.org/wiki/article_title/lang.html`).
/// - Create links from all wikipedia urls and redirects (`lang.wikipedia.org/wiki/a_redirect -> wikidata.org/wiki/QXXX`).
fn write(
    base: impl AsRef<Path>,
    page: &Page,
    redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<()> {
    let article_dir = create_article_dir(base, page, redirects)?;

    // Write html to determined file.
    let mut filename = article_dir;
    filename.push(&page.in_language.identifier);
    filename.set_extension("html");

    debug!("{:?}: {:?}", page.name, filename);

    if filename.exists() {
        debug!("Exists, skipping");
        return Ok(());
    }

    let subfolder = filename.parent().unwrap();
    if !subfolder.exists() {
        create_dir(subfolder)?;
        debug!("Overwriting existing file");
    }

    let html = simplify(&page.article_body.html, &page.in_language.identifier);

    let mut file = File::create(&filename)?;
    file.write_all(html.as_bytes())?;
    let mut file =
        File::create(&filename).with_context(|| format!("creating html file {:?}", filename))?;
    file.write_all(html.as_bytes())
        .with_context(|| format!("writing html file {:?}", filename))?;

    Ok(())
}
@@ -104,14 +177,28 @@ fn main() -> anyhow::Result<()> {
    for page in stream {
        let page = page?;

        if !(is_wikidata_match(&wikidata_ids, &page).is_some()
            || is_wikipedia_match(&wikipedia_titles, &page).is_some())
        {
        let is_wikidata_match = page
            .wikidata()
            .map(|qid| wikidata_ids.contains(&qid))
            .unwrap_or_default();

        let matching_titles = page
            .all_titles()
            .filter_map(|r| {
                r.map(Some).unwrap_or_else(|e| {
                    warn!("Could not parse title for {:?}: {:#}", &page.name, e);
                    None
                })
            })
            .filter(|t| wikipedia_titles.contains(t))
            .collect::<Vec<_>>();

        if !is_wikidata_match && matching_titles.is_empty() {
            continue;
        }

        if let Err(e) = write(&args.output_dir, page) {
            error!("Error writing article: {}", e);
        if let Err(e) = write(&args.output_dir, &page, matching_titles) {
            error!("Error writing article {:?}: {:#}", page.name, e);
        }
    }
src/wm/mod.rs
@@ -1,5 +1,8 @@
//! Wikimedia types
use std::{collections::HashSet, ffi::OsStr, fs, num::ParseIntError, str::FromStr};
use std::{
    collections::HashSet, ffi::OsStr, fmt::Display, fs, num::ParseIntError, path::PathBuf,
    str::FromStr,
};

use anyhow::{anyhow, bail, Context};

@@ -40,53 +43,6 @@ pub fn parse_wikipedia_file(
        .collect()
}

pub fn is_wikidata_match(ids: &HashSet<WikidataQid>, page: &Page) -> Option<WikidataQid> {
    let Some(wikidata) = &page.main_entity else { return None; };
    let wikidata_id = &wikidata.identifier;
    let wikidata_id = match WikidataQid::from_str(wikidata_id) {
        Ok(qid) => qid,
        Err(e) => {
            warn!(
                "Could not parse QID for {:?}: {:?}: {:#}",
                page.name, wikidata_id, e
            );
            return None;
        }
    };

    ids.get(&wikidata_id).map(|_| wikidata_id)
}

pub fn is_wikipedia_match(
    titles: &HashSet<WikipediaTitleNorm>,
    page: &Page,
) -> Option<WikipediaTitleNorm> {
    match WikipediaTitleNorm::from_title(&page.name, &page.in_language.identifier) {
        Err(e) => warn!("Could not parse title for {:?}: {:#}", page.name, e),
        Ok(title) => {
            if titles.get(&title).is_some() {
                return Some(title);
            }
        }
    }

    for redirect in &page.redirects {
        match WikipediaTitleNorm::from_title(&redirect.name, &page.in_language.identifier) {
            Err(e) => warn!(
                "Could not parse redirect title for {:?}: {:?}: {:#}",
                page.name, redirect.name, e
            ),
            Ok(title) => {
                if titles.get(&title).is_some() {
                    return Some(title);
                }
            }
        }
    }

    None
}

/// Wikidata QID/Q Number
///
/// See https://www.wikidata.org/wiki/Wikidata:Glossary#QID
@@ -118,6 +74,23 @@ impl FromStr for WikidataQid {
    }
}

impl Display for WikidataQid {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "Q{}", self.0)
    }
}

impl WikidataQid {
    pub fn get_dir(&self, base: PathBuf) -> PathBuf {
        let mut path = base;
        path.push("wikidata");
        // TODO: can use as_mut_os_string with 1.70.0
        path.push(self.to_string());

        path
    }
}

/// Normalized wikipedia article title that can compare:
/// - titles `Spatial Database`
/// - urls `https://en.wikipedia.org/wiki/Spatial_database#Geodatabase`
@@ -132,6 +105,11 @@ impl FromStr for WikidataQid {
///
/// assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/not_a_wiki_page").is_err());
/// assert!(WikipediaTitleNorm::from_url("https://wikidata.org/wiki/Q12345").is_err());
///
/// assert!(
///     WikipediaTitleNorm::from_url("https://de.wikipedia.org/wiki/Breil/Brigels").unwrap() !=
///     WikipediaTitleNorm::from_url("https://de.wikipedia.org/wiki/Breil").unwrap()
/// );
/// ```
#[derive(Debug, PartialOrd, Ord, PartialEq, Eq, Hash)]
pub struct WikipediaTitleNorm {
@@ -145,7 +123,7 @@ impl WikipediaTitleNorm {
        title.trim().replace(' ', "_")
    }

    // https://en.wikipedia.org/wiki/Article_Title
    // https://en.wikipedia.org/wiki/Article_Title/More_Title
    pub fn from_url(url: &str) -> anyhow::Result<Self> {
        let url = Url::parse(url.trim())?;

@@ -159,21 +137,17 @@ impl WikipediaTitleNorm {
        }
        let lang = subdomain;

        let mut paths = url
            .path_segments()
            .ok_or_else(|| anyhow!("Expected path"))?;
        let path = url.path();

        let root = paths
            .next()
            .ok_or_else(|| anyhow!("Expected first segment in path"))?;
        let (root, title) = path
            .strip_prefix('/')
            .unwrap_or(path)
            .split_once('/')
            .ok_or_else(|| anyhow!("Expected at least two segments in path"))?;

        if root != "wiki" {
            bail!("Expected 'wiki' in path")
            bail!("Expected 'wiki' as root path, got: {:?}", root)
        }

        let title = paths
            .next()
            .ok_or_else(|| anyhow!("Expected second segment in path"))?;
        let title = urlencoding::decode(title)?;

        Self::from_title(&title, lang)
@@ -202,4 +176,14 @@ impl WikipediaTitleNorm {
        let lang = lang.to_owned();
        Ok(Self { name, lang })
    }

    pub fn get_dir(&self, base: PathBuf) -> PathBuf {
        let mut path = base;
        // TODO: can use as_mut_os_string with 1.70.0
        path.push(format!("{}.wikipedia.org", self.lang));
        path.push("wiki");
        path.push(&self.name);

        path
    }
}
@@ -1,5 +1,9 @@
use std::{iter, str::FromStr};

use serde::Deserialize;

use super::{WikidataQid, WikipediaTitleNorm};

// TODO: consolidate into single struct
/// Deserialized Wikimedia Enterprise API Article
///

@@ -20,6 +24,31 @@ pub struct Page {
    pub redirects: Vec<Redirect>,
}

impl Page {
    pub fn wikidata(&self) -> Option<WikidataQid> {
        // TODO: return error
        self.main_entity
            .as_ref()
            .map(|e| WikidataQid::from_str(&e.identifier).unwrap())
    }

    /// Title of the article
    pub fn title(&self) -> anyhow::Result<WikipediaTitleNorm> {
        WikipediaTitleNorm::from_title(&self.name, &self.in_language.identifier)
    }

    /// All titles that lead to the article, the main title followed by any redirects.
    pub fn all_titles(&self) -> impl Iterator<Item = anyhow::Result<WikipediaTitleNorm>> + '_ {
        iter::once(self.title()).chain(self.redirects())
    }

    pub fn redirects(&self) -> impl Iterator<Item = anyhow::Result<WikipediaTitleNorm>> + '_ {
        self.redirects
            .iter()
            .map(|r| WikipediaTitleNorm::from_title(&r.name, &self.in_language.identifier))
    }
}

#[derive(Deserialize)]
pub struct Wikidata {
    pub identifier: String,
List? Delimited by what? Any example? Is specifying a directory with dumps better?
Why should it be at PATH? Can it be run from any directory?
Is extracting ids directly from the osm pbf planet dump better than relying on the intermediate generator files? What are pros and cons?
nit: Start sentences with a capital letter and end them with a dot.