Generator directory format #6

Merged
newsch merged 1 commit from generator-compat into main 2023-07-10 14:34:21 +00:00
4 changed files with 233 additions and 95 deletions

View file

@@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik commented 2023-07-05 18:11:13 +00:00 (Migrated from github.com)
Review

List? Delimited by what? Any example? Is specifying a directory with dumps better?

biodranik commented 2023-07-05 18:11:36 +00:00 (Migrated from github.com)
Review

Why should it be on `$PATH`? Can it be run from any directory?

biodranik commented 2023-07-05 18:13:39 +00:00 (Migrated from github.com)
Review

Is extracting IDs directly from the OSM PBF planet dump better than relying on the intermediate generator files? What are the pros and cons?

biodranik commented 2023-07-05 18:14:04 +00:00 (Migrated from github.com)
Review

nit: Start sentences with a capital letter and end them with a dot.

newsch commented 2023-07-06 14:07:21 +00:00 (Migrated from github.com)
Review

I meant a shell list/array(?), separated by spaces.

One example is a glob, so using a directory and then referencing `$WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz` might be clearer?

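For illustration, a minimal sketch of the two options being discussed (the variable names and file names here are placeholders, not a settled interface):

```shell
# Option 1: an explicit, space-separated shell list of dump files.
WIKIPEDIA_ENTERPRISE_DUMPS="enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz"
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS; do
  echo "$dump"
done

# Option 2: a dump directory plus a glob.
WIKIPEDIA_DUMP_DIRECTORY=./dumps
for dump in "$WIKIPEDIA_DUMP_DIRECTORY"/*.json.tar.gz; do
  echo "$dump"
done
```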
newsch commented 2023-07-06 14:41:06 +00:00 (Migrated from github.com)
Review

It doesn't need to be; the example script reads more clearly to me if it's run in the context of the `intermediate_data` directory. It could also be run as `../../../wikiparser/target/release/om-wikiparser`, with `cargo run --release` from the wikiparser directory, or anything else.

newsch commented 2023-07-06 14:52:10 +00:00 (Migrated from github.com)
Review

Pros:

  • Independent of the generator process. Can be run as soon as the planet file is updated.

Cons:

  • Need to keep the OSM query in sync with the generator's own multi-step filtering and transformation process.
  • Need to match the generator's multi-step processing of URLs exactly.

When I did this earlier, it was with the `osmfilter` tool; I only tested it on the Yukon region, and it output more entries than the generator did.

I can create an issue for this, but the rough steps to get that working are:

  • Convert the `osmfilter` query to an `osmium` command so it can work on `pbf` files directly (see the sketch below).
  • Dig into the generator's map processing to try to improve the querying.
  • Compare processing of a complete planet with the generator output.
  • Write conversion of the `osmium` output for the wikiparser to use.
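For reference, a rough sketch of what the `osmium`-based extraction could look like, assuming `osmium-tool` is installed; the tag filter here is a simplification and not yet equivalent to the generator's multi-step filtering:

```shell
# Keep only nodes/ways/relations carrying wikidata or wikipedia tags.
osmium tags-filter planet-latest.osm.pbf \
  nwr/wikidata nwr/wikipedia -o tagged.osm.pbf

# Dump to the OPL text format and pull out unique QIDs for the wikiparser.
osmium cat tagged.osm.pbf -f opl \
  | grep -o 'wikidata=Q[0-9]*' \
  | cut -d= -f2 | sort -u > wikidata_ids.txt
```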
newsch commented 2023-07-06 14:52:54 +00:00 (Migrated from github.com)
Review

Sorry, old habits die hard.

biodranik commented 2023-07-06 15:03:40 +00:00 (Migrated from github.com)
Review

It's better to mention list item separators explicitly and provide an example for clarity.

biodranik commented 2023-07-06 15:04:39 +00:00 (Migrated from github.com)
Review

...then why suggest installing the tool to `$PATH`?

biodranik commented 2023-07-06 15:15:48 +00:00 (Migrated from github.com)
Review
  1. Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
  2. What's wrong with outputting more URLs? I assume that the generator may now filter OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right? Do you remember how big the percentage of "unnecessary" articles is?
  3. `osmfilter` can work with `o5m`, and `osmconvert` can process `pbf`. There is also https://docs.rs/osmpbf/latest/osmpbf/ for direct `pbf` processing if it makes the approach simpler. How good is the `osmium` tool compared to the other options?

It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term. WDYT?

newsch commented 2023-07-06 15:21:47 +00:00 (Migrated from github.com)
Review

So that you can always reference it as `om-wikiparser` wherever you are, without worrying about where it is relative to you, or copying it into your working directory.

I meant this as an explanation of how to use it, not a step-by-step for what to run on the build server filesystem.

Maybe writing a shell script to use on the maps server instead would be helpful?

Would you prefer:

```shell
# Transform intermediate files from generator.
cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
# Begin extraction.
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
do
  tar xzf $dump | $WIKIPARSER_DIR/target/release/om-wikiparser \
    --wikidata-ids wikidata_ids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    descriptions/
done
```

or

```shell
# Transform intermediate files from generator.
maps_build=~/maps_build/$BUILD_DATE/intermediate_data
cut -f 2 $maps_build/id_to_wikidata.csv > $maps_build/wikidata_ids.txt
tail -n +2 $maps_build/wiki_urls.txt | cut -f 3 > $maps_build/wikipedia_urls.txt
# Begin extraction.
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
do
  tar xzf $dump | ./target/release/om-wikiparser \
    --wikidata-ids $maps_build/wikidata_ids.txt \
    --wikipedia-urls $maps_build/wikipedia_urls.txt \
    $maps_build/descriptions/
done
```
biodranik commented 2023-07-06 16:04:06 +00:00 (Migrated from github.com)
Review
  1. Can it be wrapped in a helper script that can be easily customized and run on the generator, maybe directly from the wikiparser repo? :) (See the sketch below.)
  2. `cargo run -r` may be even better than a path to the binary :) But it's also ok to hard-code the path or use a `$WIKIPARSER_BINARY` var.

Think about me testing your code soon on a production server. Fewer surprises = less stress ;-)

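A minimal sketch of what such a wrapper could look like, run from the wikiparser repo as suggested; the script name, argument handling, and the `descriptions/` output location are assumptions for the follow-up script, not a final interface:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Usage: ./run_wikiparser.sh <intermediate_data dir> <dump>...
build_dir=$1; shift

# Transform intermediate files from the generator.
cut -f 2 "$build_dir/id_to_wikidata.csv" > "$build_dir/wikidata_ids.txt"
tail -n +2 "$build_dir/wiki_urls.txt" | cut -f 3 > "$build_dir/wikipedia_urls.txt"

# Begin extraction; -O streams the archive contents to stdout for the parser.
for dump in "$@"; do
  tar xzOf "$dump" | cargo run --release -- \
    --wikidata-ids "$build_dir/wikidata_ids.txt" \
    --wikipedia-urls "$build_dir/wikipedia_urls.txt" \
    "$build_dir/descriptions/"
done
```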
biodranik commented 2023-07-06 16:05:30 +00:00 (Migrated from github.com)
Review

Btw, it may make sense to also print/measure the time taken to execute some commands after the first run on the whole planet, to have some reference starting values.

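One simple way to capture that on the first full-planet run, sketched with standard tools (`time`, plus `pv` for throughput); nothing here is part of the wikiparser itself:

```shell
# Wall-clock and CPU time for a single dump; pv reports progress/throughput on stderr.
time (pv enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz | tar xzO \
  | om-wikiparser \
      --wikidata-ids wikidata_ids.txt \
      --wikipedia-urls wikipedia_urls.txt \
      descriptions/)
```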
newsch commented 2023-07-06 16:25:01 +00:00 (Migrated from github.com)
Review

> It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term.

Absolutely agree!

> Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?

I think so; do you mean the wikipedia/wikidata files or the mwm format in general?

By transformations, when I looked at it last, it looked like it was doing some sort of merging of ways/nodes/shapes to get a single parent object.
When I compared the OSM IDs that it output with the Wikidata IDs, they didn't match up with what I got from `osmfilter`, even when the URLs were the same. Not a problem for the wikiparser, as long as the QIDs/articles are all caught, but it was harder to tell if they were doing the same thing.

As we talked about before, there are also multiple layers of filtering nodes by amenity or other tags, and I only looked at the final layer when I was trying this `osmfilter` approach (based on [`ftypes_matcher.cpp`](https://github.com/organicmaps/organicmaps/blob/982c6aa92d7196a5690dcdc1564e427de7611806/indexer/ftypes_matcher.cpp#L473)).

> What's wrong with outputting more URLs? I assume that the generator may now filter OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right?

As you say, the worst case isn't a problem for the end user, but I want to do more comparisons with the whole planet file to be confident that this is really a superset of them.

> Do you remember how big the percentage of "unnecessary" articles is?

That was around 25%, but that was in the Yukon territory, so not very many nodes, and I would guess not comparable to the planet.

> How good is the `osmium` tool compared to the other options?

I haven't looked into `osmium` much, but my understanding is that it is at least as powerful as `osmfilter`/`osmconvert`. I know we talked about using pbfs directly at some point, so that's why I mentioned it.

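One way to do that superset check once full-planet runs exist, sketched with standard tools (the file names are placeholders):

```shell
# QID lists from the generator's intermediate files and from the OSM-side query.
sort -u generator_wikidata_ids.txt > generator.sorted
sort -u osmfilter_wikidata_ids.txt > osmfilter.sorted

# QIDs the generator found but the OSM-side query missed; empty output means the
# OSM-side result really is a superset of the generator's.
comm -23 generator.sorted osmfilter.sorted
```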
newsch commented 2023-07-06 16:32:06 +00:00 (Migrated from github.com)
Review

I will update the README to be more of an explanation, and make another issue/PR for a script that handles the build directory, timing, backtraces, saving logs, etc.

biodranik commented 2023-07-06 16:38:34 +00:00 (Migrated from github.com)
Review

> I think so; do you mean the wikipedia/wikidata files or the mwm format in general?

I meant those files that are required for the wikiparser to work. It may actually make sense to keep this in the README or some other doc, not in an issue.

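For reference, a hypothetical illustration of those two intermediate files, inferred only from how they are consumed in the examples above (tab-separated, with `wiki_urls.txt` having a header row); the exact columns and header names would need to be confirmed against the generator:

```shell
$ head -n 2 id_to_wikidata.csv    # column 2 = Wikidata QID
1234567890	Q42
1234567891	Q12345
$ head -n 2 wiki_urls.txt         # header row; column 3 = article URL
id	type	url
1234567890	node	https://en.wikipedia.org/wiki/Article_Title
```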
## Usage
## Configuring
[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
It defines article sections that are not important for users and should be removed.
It defines article sections that are not important for users and should be removed from the extracted HTML.
## Usage
First, install [the Rust language tools](https://www.rust-lang.org/).
For best performance, use `--release` when building or running.
You can run the program from within this directory using `cargo run --release --`.
Alternatively, build it with `cargo build --release`, which places the binary in `./target/release/om-wikiparser`.
Run the program with the `--help` flag to see all supported arguments.
It takes as inputs:
- A Wikipedia Enterprise JSON dump, extracted and piped to `stdin`.
- A file of Wikidata QIDs to extract, one per line (e.g. `Q12345`), passed as the CLI flag `--wikidata-ids`.
- A file of Wikipedia article URLs to extract, one per line (e.g. `https://$LANG.wikipedia.org/wiki/$ARTICLE_TITLE`), passed as the CLI flag `--wikipedia-urls`.
- A directory to write the extracted articles to, as a CLI argument.
As an example of usage with the map generator:
- Assuming this program is installed somewhere on `$PATH` as `om-wikiparser`.
- Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
Set `DUMP_DOWNLOAD_DIR` to the directory they are downloaded to.
- Run the following from within the `intermediate_data` subdirectory of the maps build directory:
```shell
# Transform intermediate files from generator.
cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser \
    --wikidata-ids wikidata_ids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    descriptions/
done
biodranik commented 2023-07-05 18:17:35 +00:00 (Migrated from github.com)
Review

Would a hint about om-wikiparser command line options be helpful?

```

View file

@@ -1,27 +1,18 @@
// Usage:
// # prep outputs from map generator
// cut -f 2 ~/Downloads/id_to_wikidata.csv > /tmp/wikidata_ids.txt
// tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 > /tmp/wikipedia_urls.txt
// # feed gzipped tarfile
// pv ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz | tar xzO \
// | cargo run --release -- \
// --wikidata-ids /tmp/wikidata_ids.txt \
// --wikipedia-urls /tmp/wikipedia_urls.txt \
// output_dir
use std::{
fs::{create_dir, File},
fs::{self, File},
io::{stdin, BufRead, Write},
os::unix,
path::{Path, PathBuf},
};
use anyhow::bail;
use anyhow::{anyhow, bail, Context};
use clap::Parser;
#[macro_use]
extern crate log;
use om_wikiparser::{
html::simplify,
wm::{is_wikidata_match, is_wikipedia_match, parse_wikidata_file, parse_wikipedia_file, Page},
wm::{parse_wikidata_file, parse_wikipedia_file, Page, WikipediaTitleNorm},
};
#[derive(Parser)]
@@ -33,33 +24,115 @@ struct Args {
wikipedia_urls: Option<PathBuf>,
}
fn write(dir: impl AsRef<Path>, page: Page) -> anyhow::Result<()> {
let Some(qid) = page.main_entity.map(|e| e.identifier) else {
// TODO: handle and still write
bail!("Page in list but without wikidata qid: {:?} ({})", page.name, page.url);
/// Determine the directory to write the article contents to, create it, and create any necessary symlinks to it.
fn create_article_dir(
base: impl AsRef<Path>,
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
biodranik commented 2023-06-24 05:45:00 +00:00 (Migrated from github.com)
Review

How are they processed in the generator?

biodranik commented 2023-06-24 05:45:09 +00:00 (Migrated from github.com)
Review

For example?

newsch commented 2023-06-24 17:06:15 +00:00 (Migrated from github.com)
Review

The generator only works with complete wikipedia urls (and wikidata QIDs). The OSM tags like `en:Article_Title` are converted to urls somewhere early in the OSM ingestion process.
It [dumps the urls to a file](https://github.com/organicmaps/organicmaps/blob/acc7c0547db4285dd8841ae7f98811268e38d908/generator/wiki_url_dumper.cpp#L63) for the descriptions scraper, then when it adds them to the mwm files [it strips the protocol, appends the url to the base directory, and looks for language html files in the folder at that location](https://github.com/organicmaps/organicmaps/blob/34bbdf6a2f077b3d629b3f17e8e05bd18a4e4110/generator/descriptions_section_builder.cpp#L142).
It doesn't do any special processing for articles with a slash in the title; they are just another subdirectory down. I'll update the diagram to show that.

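In shell terms, the per-URL lookup described above amounts to roughly this (a sketch of the behavior, not the actual C++ code; the language list and `$WIKIPEDIA_DESCRIPTIONS_DIR` are placeholders):

```shell
url="https://en.wikipedia.org/wiki/Article_Title"
path="${url#*://}"          # strip the protocol: en.wikipedia.org/wiki/Article_Title
for lang in en de es fr ru; do
  file="$WIKIPEDIA_DESCRIPTIONS_DIR/$path/$lang.html"
  [ -f "$file" ] && echo "would embed $file"
done
```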
newsch commented 2023-06-24 17:15:01 +00:00 (Migrated from github.com)
Review

The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.

The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.
biodranik commented 2023-06-24 20:45:44 +00:00 (Migrated from github.com)
Review

Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it.

newsch commented 2023-06-26 13:52:47 +00:00 (Migrated from github.com)
Review

I'm working on a list of changes that would be helpful.

let mut redirects = redirects.into_iter();
let main_dir = match page.wikidata() {
None => {
// Write to wikipedia title directory.
// Prefer first redirect, fall back to page title if none exist
info!("Page without wikidata qid: {:?} ({})", page.name, page.url);
redirects
.next()
.or_else(|| match page.title() {
Ok(title) => Some(title),
Err(e) => {
warn!("Unable to parse title for page {:?}: {:#}", page.name, e);
None
}
})
// hard fail when no titles can be parsed
.ok_or_else(|| anyhow!("No available titles for page {:?}", page.name))?
.get_dir(base.to_owned())
}
Some(qid) => {
// Otherwise use wikidata as main directory and symlink from wikipedia titles.
qid.get_dir(base.to_owned())
}
};
let mut filename = dir.as_ref().to_owned();
filename.push(qid);
if main_dir.is_symlink() {
fs::remove_file(&main_dir)
.with_context(|| format!("removing old link for main directory {:?}", &main_dir))?;
}
fs::create_dir_all(&main_dir)
.with_context(|| format!("creating main directory {:?}", &main_dir))?;
// Write symlinks to main directory.
// TODO: Only write redirects that we care about.
for title in redirects {
let wikipedia_dir = title.get_dir(base.to_owned());
// Build required directory.
//
// Possible states from previous run:
// - Does not exist (and is not a symlink)
// - Exists, is a directory
// - Exists, is a valid symlink to correct location
// - Exists, is a valid symlink to incorrect location
if wikipedia_dir.exists() {
if wikipedia_dir.is_symlink() {
// Only replace if not valid
if fs::read_link(&wikipedia_dir)? == main_dir {
continue;
}
fs::remove_file(&wikipedia_dir)?;
} else {
fs::remove_dir_all(&wikipedia_dir)?;
}
} else {
// titles can contain `/`, so ensure necessary subdirs exist
let parent_dir = wikipedia_dir.parent().unwrap();
fs::create_dir_all(parent_dir)
.with_context(|| format!("creating wikipedia directory {:?}", parent_dir))?;
}
unix::fs::symlink(&main_dir, &wikipedia_dir).with_context(|| {
format!(
"creating symlink from {:?} to {:?}",
wikipedia_dir, main_dir
)
})?;
}
Ok(main_dir)
}
/// Write selected article to disk.
///
/// - Write page contents to wikidata page (`wikidata.org/wiki/QXXX/lang.html`).
/// - If the page has no wikidata qid, write contents to wikipedia location (`lang.wikipedia.org/wiki/article_title/lang.html`).
/// - Create links from all wikipedia urls and redirects (`lang.wikipedia.org/wiki/a_redirect -> wikidata.org/wiki/QXXX`).
fn write(
base: impl AsRef<Path>,
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<()> {
let article_dir = create_article_dir(base, page, redirects)?;
// Write html to determined file.
let mut filename = article_dir;
filename.push(&page.in_language.identifier);
filename.set_extension("html");
debug!("{:?}: {:?}", page.name, filename);
if filename.exists() {
debug!("Exists, skipping");
return Ok(());
}
let subfolder = filename.parent().unwrap();
if !subfolder.exists() {
create_dir(subfolder)?;
debug!("Overwriting existing file");
}
let html = simplify(&page.article_body.html, &page.in_language.identifier);
let mut file = File::create(&filename)?;
file.write_all(html.as_bytes())?;
let mut file =
File::create(&filename).with_context(|| format!("creating html file {:?}", filename))?;
file.write_all(html.as_bytes())
.with_context(|| format!("writing html file {:?}", filename))?;
Ok(())
}
@@ -104,14 +177,28 @@ fn main() -> anyhow::Result<()> {
for page in stream {
let page = page?;
if !(is_wikidata_match(&wikidata_ids, &page).is_some()
|| is_wikipedia_match(&wikipedia_titles, &page).is_some())
{
let is_wikidata_match = page
.wikidata()
.map(|qid| wikidata_ids.contains(&qid))
.unwrap_or_default();
let matching_titles = page
.all_titles()
.filter_map(|r| {
r.map(Some).unwrap_or_else(|e| {
warn!("Could not parse title for {:?}: {:#}", &page.name, e);
None
})
})
.filter(|t| wikipedia_titles.contains(t))
.collect::<Vec<_>>();
if !is_wikidata_match && matching_titles.is_empty() {
continue;
}
if let Err(e) = write(&args.output_dir, page) {
error!("Error writing article: {}", e);
if let Err(e) = write(&args.output_dir, &page, matching_titles) {
error!("Error writing article {:?}: {:#}", page.name, e);
}
}

View file

@@ -1,5 +1,8 @@
//! Wikimedia types
use std::{collections::HashSet, ffi::OsStr, fs, num::ParseIntError, str::FromStr};
use std::{
collections::HashSet, ffi::OsStr, fmt::Display, fs, num::ParseIntError, path::PathBuf,
str::FromStr,
};
use anyhow::{anyhow, bail, Context};
@@ -40,53 +43,6 @@
.collect()
}
pub fn is_wikidata_match(ids: &HashSet<WikidataQid>, page: &Page) -> Option<WikidataQid> {
let Some(wikidata) = &page.main_entity else { return None;};
let wikidata_id = &wikidata.identifier;
let wikidata_id = match WikidataQid::from_str(wikidata_id) {
Ok(qid) => qid,
Err(e) => {
warn!(
"Could not parse QID for {:?}: {:?}: {:#}",
page.name, wikidata_id, e
);
return None;
}
};
ids.get(&wikidata_id).map(|_| wikidata_id)
}
pub fn is_wikipedia_match(
titles: &HashSet<WikipediaTitleNorm>,
page: &Page,
) -> Option<WikipediaTitleNorm> {
match WikipediaTitleNorm::from_title(&page.name, &page.in_language.identifier) {
Err(e) => warn!("Could not parse title for {:?}: {:#}", page.name, e),
Ok(title) => {
if titles.get(&title).is_some() {
return Some(title);
}
}
}
for redirect in &page.redirects {
match WikipediaTitleNorm::from_title(&redirect.name, &page.in_language.identifier) {
Err(e) => warn!(
"Could not parse redirect title for {:?}: {:?}: {:#}",
page.name, redirect.name, e
),
Ok(title) => {
if titles.get(&title).is_some() {
return Some(title);
}
}
}
}
None
}
/// Wikidata QID/Q Number
///
/// See https://www.wikidata.org/wiki/Wikidata:Glossary#QID
@@ -118,6 +74,23 @@ impl FromStr for WikidataQid {
}
}
impl Display for WikidataQid {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "Q{}", self.0)
}
}
impl WikidataQid {
pub fn get_dir(&self, base: PathBuf) -> PathBuf {
let mut path = base;
path.push("wikidata");
// TODO: can use as_mut_os_string with 1.70.0
path.push(self.to_string());
path
}
}
/// Normalized wikipedia article title that can compare:
/// - titles `Spatial Database`
/// - urls `https://en.wikipedia.org/wiki/Spatial_database#Geodatabase`
@@ -132,6 +105,11 @@ impl FromStr for WikidataQid {
///
/// assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/not_a_wiki_page").is_err());
/// assert!(WikipediaTitleNorm::from_url("https://wikidata.org/wiki/Q12345").is_err());
///
/// assert!(
/// WikipediaTitleNorm::from_url("https://de.wikipedia.org/wiki/Breil/Brigels").unwrap() !=
/// WikipediaTitleNorm::from_url("https://de.wikipedia.org/wiki/Breil").unwrap()
/// );
/// ```
#[derive(Debug, PartialOrd, Ord, PartialEq, Eq, Hash)]
pub struct WikipediaTitleNorm {
@@ -145,7 +123,7 @@ impl WikipediaTitleNorm {
title.trim().replace(' ', "_")
}
// https://en.wikipedia.org/wiki/Article_Title
// https://en.wikipedia.org/wiki/Article_Title/More_Title
pub fn from_url(url: &str) -> anyhow::Result<Self> {
let url = Url::parse(url.trim())?;
@@ -159,21 +137,17 @@ impl WikipediaTitleNorm {
}
let lang = subdomain;
let mut paths = url
.path_segments()
.ok_or_else(|| anyhow!("Expected path"))?;
let path = url.path();
let root = paths
.next()
.ok_or_else(|| anyhow!("Expected first segment in path"))?;
let (root, title) = path
.strip_prefix('/')
.unwrap_or(path)
.split_once('/')
.ok_or_else(|| anyhow!("Expected at least two segments in path"))?;
if root != "wiki" {
bail!("Expected 'wiki' in path")
bail!("Expected 'wiki' as root path, got: {:?}", root)
}
let title = paths
.next()
.ok_or_else(|| anyhow!("Expected second segment in path"))?;
let title = urlencoding::decode(title)?;
Self::from_title(&title, lang)
@@ -202,4 +176,14 @@ impl WikipediaTitleNorm {
let lang = lang.to_owned();
Ok(Self { name, lang })
}
pub fn get_dir(&self, base: PathBuf) -> PathBuf {
let mut path = base;
// TODO: can use as_mut_os_string with 1.70.0
path.push(format!("{}.wikipedia.org", self.lang));
path.push("wiki");
path.push(&self.name);
path
}
}

View file

@@ -1,5 +1,9 @@
use std::{iter, str::FromStr};
use serde::Deserialize;
use super::{WikidataQid, WikipediaTitleNorm};
// TODO: consolidate into single struct
/// Deserialized Wikimedia Enterprise API Article
///
@@ -20,6 +24,31 @@ pub struct Page {
pub redirects: Vec<Redirect>,
}
impl Page {
pub fn wikidata(&self) -> Option<WikidataQid> {
// TODO: return error
self.main_entity
.as_ref()
.map(|e| WikidataQid::from_str(&e.identifier).unwrap())
}
/// Title of the article
pub fn title(&self) -> anyhow::Result<WikipediaTitleNorm> {
WikipediaTitleNorm::from_title(&self.name, &self.in_language.identifier)
}
/// All titles that lead to the article, the main title followed by any redirects.
pub fn all_titles(&self) -> impl Iterator<Item = anyhow::Result<WikipediaTitleNorm>> + '_ {
iter::once(self.title()).chain(self.redirects())
}
pub fn redirects(&self) -> impl Iterator<Item = anyhow::Result<WikipediaTitleNorm>> + '_ {
self.redirects
.iter()
.map(|r| WikipediaTitleNorm::from_title(&r.name, &self.in_language.identifier))
}
}
#[derive(Deserialize)]
pub struct Wikidata {
pub identifier: String,